Evaluations
When you're building an agent, the goal is to make it reliable—something you can trust to give the right output consistently. Evaluations help you figure out if your agent is doing a good job or if it needs improvement.
An evaluation pairs an input with an assertion, or evaluator, applied to the output. The evaluator is a defined condition or rule used to assess whether the agent's output meets the expected output.
Evaluation sets are logical groupings of evaluations and evaluators.
Evaluation results are traces of completed evaluation runs that assess the performance of an agent. During these runs, the agent's accuracy, efficiency, and decision-making ability are measured and scored.
The evaluation score reflects how well the agent performs against the assertions in a specific evaluation, on a scale from 0 to 100. If an evaluation run fails, debug it and re-run it.
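To make these terms concrete, here is a minimal Python sketch of how the pieces relate. The class and field names are illustrative assumptions, not the product's actual data model: an evaluation pairs an input with an expected output, an evaluation set groups evaluations and evaluators, and each evaluator scores an agent output from 0 to 100.
```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative only: these names and fields are assumptions for explanation,
# not the product's actual data model.

@dataclass
class Evaluation:
    name: str
    input: dict             # values for the agent's input arguments
    expected_output: dict   # the output you expect for this input

@dataclass
class EvaluationSet:
    name: str
    evaluations: list[Evaluation] = field(default_factory=list)
    # Each evaluator is an assertion on the output that returns a 0-100 score.
    evaluators: list[Callable[[dict, dict], float]] = field(default_factory=list)

def exact_match(expected: dict, actual: dict) -> float:
    """A minimal deterministic evaluator: full score only for an identical output."""
    return 100.0 if expected == actual else 0.0

# One evaluation: an input paired with an assertion on the output.
refund_check = Evaluation(
    name="refund request",
    input={"message": "I want a refund for order 1234"},
    expected_output={"intent": "refund", "order_id": "1234"},
)
default_set = EvaluationSet(
    name="default set",
    evaluations=[refund_check],
    evaluators=[exact_match],
)
```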
Before you create an evaluation, test your agent to check whether its output is correct. If the agent generates the correct output, you can create evaluations from those runs. If it does not, you can create evaluations from scratch.
To create evaluations from correct test runs:
- After you design your agent, select Test on cloud.
- In the Test configuration window, confirm the resources used in the solution and provide the input for the test run.
- Select Run. The results are displayed in the Run output panel.
- If the output is correct, select the Add to eval set button now available in the General tab.
  If the output isn't correct, you can:
  - Refine the prompt: Adjust the prompt and test the agent until the output is correct.
  - Create evaluations from incorrect outputs: Generate evaluations based on the incorrect outputs and manually edit them to align with the expected outcome.
- Test runs are listed in the Add to Evaluation Set window. Select Add to default set for any run you want to add as an evaluation. If you've already created an evaluation set, you can select it from the dropdown list.
- Next, go to the Evaluation sets panel and select View details for the evaluation set.
- Select Evaluate set to run the evaluations. You can also select specific evaluations from the set to evaluate.
- Go to the Results tab to view the evaluation score and details.
To create evaluations from scratch:
- After you design your agent, go to the Evaluation sets tab and select Create New. You can also select Import to use existing JSON data from evaluations of other agents.
- Add a relevant name for the evaluation set.
- Select Add to set to create new evaluations. For each new evaluation in the set:
  - Add a name.
  - Add values for the Input fields (inherited from the defined input arguments) and the expected Output.
  - Select Save.
- Next, select Set Evaluators to assign evaluators to the evaluation set. You can assign one or several evaluators to a set.
- Select Save changes.
- From the Evaluation sets main page, select Run evaluation set for each set you want to run.
- Go to the Results tab to view the evaluation score and details.
Use the Evaluators panel to create and manage your evaluators. By default, each agent has a predefined, LLM-based Default Evaluator.
To create your own evaluators:
- Select Create New.
- Select the evaluator type:
  - LLM-as-a-judge: Semantic Similarity – Creates your own LLM-based evaluator.
  - Exact match – Checks if the agent output matches the expected output.
  - JSON similarity – Checks if two JSON structures or values are similar.
- Select Continue.
- Configure the evaluator:
  - Add a relevant name and description.
  - Select the Target output fields:
    - Root-level targeting (* All): Evaluates the entire output.
    - Field-specific targeting: Assesses specific first-level fields. Use the dropdown menu to select a field. The listed output fields are inherited from the output arguments you defined for the system prompt.
  - Add a prompt (only for the LLM-based evaluator).
Choosing the evaluator type
If you aren't sure which evaluator type suits your needs, consider the following recommendations:
- LLM-as-a-Judge:
  - Recommended as the default approach when targeting the root output.
  - Provides flexible evaluation of complex outputs.
  - Can assess quality and correctness beyond exact matching.
  - Best used when evaluating reasoning, natural language responses, or complex structured outputs.
- Deterministic (Exact match or JSON similarity):
  - Recommended when expecting exact matches.
  - Most effective when output requirements are strictly defined.
  - Works with complex objects, but is best used with:
    - Boolean responses (true/false)
    - Specific numerical values
    - Exact string matches
    - Arrays of primitives
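To illustrate the difference between the two deterministic options, here is a short Python sketch. The scoring logic is an assumption made for illustration (a simple share of matching first-level fields stands in for JSON similarity); it is not the product's actual algorithm.
```python
def exact_match_score(expected: dict, actual: dict) -> float:
    """All-or-nothing: 100 only when the outputs are identical."""
    return 100.0 if expected == actual else 0.0

def json_similarity_score(expected: dict, actual: dict) -> float:
    """Illustrative stand-in for JSON similarity: the share of expected
    first-level fields whose values match exactly."""
    if not expected:
        return 100.0
    matching = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return 100.0 * matching / len(expected)

expected = {"intent": "refund", "order_id": "1234", "priority": "high"}
actual = {"intent": "refund", "order_id": "1234", "priority": "normal"}

print(exact_match_score(expected, actual))      # 0.0 - one field differs
print(json_similarity_score(expected, actual))  # ~66.7 - partial credit
```
In this sense, exact match is the stricter check: any deviation drops the score to zero, while JSON similarity can give partial credit for partially correct structured outputs.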
A well-structured output makes evaluations more reliable: it ensures consistency and makes comparisons easier.
Here is an example of a predefined prompt that evaluates the entire output:
As an expert evaluator, analyze the semantic similarity of these JSON contents to determine a score from 0-100. Focus on comparing the meaning and contextual equivalence of corresponding fields, accounting for alternative valid expressions, synonyms, and reasonable variations in language while maintaining high standards for accuracy and completeness. Provide your score with justification, explaining briefly and concisely why you gave that score.
Expected Output: {{ExpectedOutput}}
ActualOutput: {{ActualOutput}}
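For context, the {{ExpectedOutput}} and {{ActualOutput}} placeholders are filled with the expected and actual outputs before the LLM-based evaluator scores them. The sketch below shows one way such a substitution could work; the helper function and the abbreviated template are hypothetical, and only the placeholder names come from the prompt above.
```python
import json

# Abbreviated version of the prompt above; only the placeholders matter here.
JUDGE_PROMPT_TEMPLATE = (
    "As an expert evaluator, analyze the semantic similarity of these JSON "
    "contents to determine a score from 0-100. [...]\n"
    "Expected Output: {{ExpectedOutput}}\n"
    "ActualOutput: {{ActualOutput}}"
)

def render_judge_prompt(expected: dict, actual: dict) -> str:
    """Hypothetical helper: serialize both outputs and substitute them into
    the template before sending the prompt to the judge LLM."""
    return (
        JUDGE_PROMPT_TEMPLATE
        .replace("{{ExpectedOutput}}", json.dumps(expected, indent=2))
        .replace("{{ActualOutput}}", json.dumps(actual, indent=2))
    )

prompt = render_judge_prompt(
    expected={"intent": "refund", "order_id": "1234"},
    actual={"intent": "refund", "order_id": "1243"},
)
print(prompt)  # this rendered text is what the LLM-as-a-judge evaluator scores
```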
A good benchmark for the Agent Score is 30 or more evaluations.
For simple agents, aim for approximately 30 evaluations across 1-3 evaluation sets. For more complex agents, we recommend at least double that amount.
The number of evaluations depends on:
- Agent complexity
  - Number of input parameters
  - Output structure complexity
  - Tool usage patterns
  - Decision branches
- Input
  - Range of possible inputs: data types, value ranges, optional fields
  - Edge cases
- Usage patterns
  - Common use cases
  - Different personas
  - Error scenarios
Grouping evaluations into sets helps you organize them. For example, you can have:
- One set for full output evaluation
- Another for edge cases
- Another for handling misspellings
Coverage principles
- Logical coverage: Map out input combinations, edge cases, and boundary conditions (see the sketch after this list).
- Redundancy management: Aim for 3-5 different evaluations per logically equivalent case.
- Quality over quantity: More evaluations don’t always mean better results. Focus on meaningful tests.
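One way to apply the logical coverage principle is to enumerate the input dimensions you care about and check that every combination is backed by a few evaluations, as in the sketch below. The personas, input variants, and counts are illustrative assumptions; substitute your agent's real dimensions.
```python
from itertools import product

# Illustrative coverage dimensions; replace with your agent's real
# personas, input variations, and edge cases.
personas = ["new customer", "returning customer"]
input_variants = ["typical phrasing", "misspellings", "missing optional fields"]
TARGET_PER_CASE = 3  # aim for roughly 3-5 evaluations per equivalent case

# Hypothetical tally of how many evaluations cover each (persona, variant) pair.
coverage = {
    ("new customer", "typical phrasing"): 4,
    ("returning customer", "misspellings"): 1,
}

for case in product(personas, input_variants):
    count = coverage.get(case, 0)
    status = "ok" if count >= TARGET_PER_CASE else f"add {TARGET_PER_CASE - count} more"
    print(f"{case}: {count} evaluation(s) ({status})")
```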
Create evaluations once the arguments are stable or complete. That also means your use case is established and the prompt, tools, and contexts are finalized. If you modify the arguments, you need to adjust your evaluations accordingly. To minimize additional work, it's best to start with stable agents that have well-defined use cases.
You can export and import evaluation sets between agents within the same organization or across different organizations. As long as your agent design is complete, you can move evaluations around as needed without having to recreate them from scratch.