Evaluations
When you're building an agent, the goal is to make it reliable—something you can trust to give the right output consistently. Evaluations help you figure out if your agent is doing a good job or if it needs improvement.
Terminology
An evaluation pairs an input with an assertion, or evaluator, made on the output. The evaluator is a defined condition or rule used to assess whether the agent's output meets the expected output or expected trajectory.
Evaluation sets are logical groupings of evaluations and evaluators.
Evaluation results are traces of completed evaluation runs that assess the performance of an agent. During these runs, the agent's accuracy, efficiency, and decision-making ability are measured and scored.
The evaluation score indicates how well the agent performs against the assertions in a specific evaluation, on a scale from 0 to 100. If you have failed evaluation runs, you must diagnose the cause, debug, and re-run them.
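The terminology is easier to see as data. The following Python sketch is purely illustrative (the names and shapes are assumptions, not the product schema): it pairs an input with an expected output and applies a simple evaluator that returns a score from 0 to 100.

```python
# Conceptual sketch only -- not the UiPath API. It shows how an evaluation
# (input + expected output), an evaluator (a rule applied to the actual
# output), and a 0-100 score relate to each other.

evaluation = {
    "name": "Refund request is routed to Finance",
    "input": {"ticket_text": "I was charged twice, please refund one payment."},
    "expected_output": {"category": "Refund", "route_to": "Finance"},
}

def exact_match_evaluator(expected: dict, actual: dict) -> int:
    """A deterministic evaluator: 100 if the outputs match exactly, else 0."""
    return 100 if expected == actual else 0

actual_output = {"category": "Refund", "route_to": "Finance"}  # what the agent returned
score = exact_match_evaluator(evaluation["expected_output"], actual_output)
print(score)  # 100 -> this evaluation passes
```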
Creating evaluations
Before you create evaluations at scale, you can first test your agent on one-off scenarios to see whether it can perform its task and whether the output is correct. If your agent generates the correct output, you can create evaluations from those runs. If it does not, you can either fix the output and create an evaluation with the expected output, or create evaluations from scratch.
Creating evaluations from test runs
- After you design your agent, select Test on cloud.
- In the Test configuration window, confirm the resources used in the solution and:
  - Provide the input for the test run:
    - Provide inputs manually by typing in the content, or
    - Simulate inputs: Use an LLM to generate inputs for your agent’s arguments. You can let the LLM auto-generate inputs or provide prompts to steer it toward specific examples.
  - Configure whether you want to test with real tools, or have one, more, or all of your tools simulated.
    - Simulate tools: Use an LLM to simulate one or more agent tools. Describe how each tool should respond, and simulate partial or full toolsets your agent relies on.
- Select Save and Run.
The results are displayed in the Run output panel. Indicators are available to show when your agent is running with real or simulated data.
- If the output is correct, select the Add to eval set button now available in the General tab.
If the output isn't correct, you can:
- Refine the prompt: Adjust the prompt and test the agent until the output is correct.
- Create evaluations from incorrect outputs: Generate evaluations based on the incorrect outputs and manually edit them to align with the expected outcome.
- Test runs are listed in the Add to Evaluation Set window. Select Add to default set for any run you want to add as an evaluation.
If you've already created an evaluation set, you can select it from the available dropdown list.
- Next, go to the Evaluation sets panel. Three options are available:
- Use the pre-built evaluation set to organize your evaluations.
- Generate a new set with simulated inputs and tools.
- Add evaluations to existing sets with real and simulated data.
- Select Evaluate set to run the evaluations. You can also select specific evaluations from the set that you would like to evaluate.
- Go to the Results tab to view the evaluation score and details.
Creating evaluations from scratch
- After you design your agent, go to the Evaluation sets tab and select Create New.
You can also select Import to use existing JSON data from evaluations of other agents.
- Add a relevant name for the evaluation set.
- Select Add to set to create new evaluations. For each new evaluation in the set:
  - Add a name.
  - Add values for the Input fields (inherited from the defined input arguments) and the expected Output.
  - Select Save.
- Next, select Set Evaluators to assign evaluators to the evaluation set.
You can assign one or several evaluators to a set.
- Select Save changes.
- From the Evaluation sets main page, select Run evaluation set for each set you want to run.
- Go to the Results tab to view the evaluation score and details.
Generating evaluations
You can also create evaluation sets with simulations. Generate new evaluation sets (or add to existing ones) using simulated inputs and tools.
- Select Create.
- Select Generate new evaluation set.
You can let the LLM auto-generate the evaluation set based on your existing agent, its design runs, and its arguments, or you can provide prompts to steer it toward specific examples.
For details, refer to Configuring simulations in evaluations.
Defining evaluators
Use the Evaluators panel to create and manage your evaluators. By default, each agent has a predefined, LLM-based Default Evaluator.
To create your own evaluators:
- Select Create New.
- Select the evaluator type:
- LLM-as-a-judge: Semantic similarity – Lets you create your own LLM-based evaluator.
- Exact match – Checks if the agent output matches the expected output.
- JSON similarity – Checks if two JSON structures or values are similar.
- Trajectory evaluator – Uses AI to judge the agent based on run history and expected behavior.
- Select Continue.
- Configure the evaluator:
- Add a relevant name and description.
- Select the Target output fields:
- Root-level targeting (* All): Evaluates the entire output.
- Field-specific targeting: Assesses specific first-level fields. Use the dropdown menu to select a field. The listed output fields are inherited from the output arguments you defined for the system prompt.
- Add a prompt (only for the LLM-based evaluator).
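As a rough illustration of the difference between the two targeting options above (the example values are invented, not product code), root-level targeting hands the evaluator the entire output object, while field-specific targeting hands it a single first-level field:

```python
# Illustration of output targeting for an evaluator -- not product code.

agent_output = {
    "summary": "Duplicate charge confirmed; refund initiated.",
    "route_to": "Finance",
}
expected_output = {
    "summary": "Duplicate charge found and a refund was started.",
    "route_to": "Finance",
}

# Root-level targeting (* All): the evaluator receives the entire output.
root_expected, root_actual = expected_output, agent_output

# Field-specific targeting: the evaluator receives only the selected
# first-level field, inherited from the agent's output arguments.
field = "route_to"
field_expected, field_actual = expected_output[field], agent_output[field]

print(field_expected == field_actual)  # True -- this field matches exactly
```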
Choosing the evaluator type
If you don't know which evaluator type suits your needs, see the following recommendations:
- LLM-as-a-judge:
  - Recommended as the default approach when targeting the root output.
  - Provides flexible evaluation of complex outputs.
  - Can assess quality and correctness beyond exact matching.
  - Best used when evaluating reasoning, natural language responses, or complex structured outputs.
- Deterministic (Exact match or JSON similarity), illustrated in the sketch after this list:
  - Recommended when expecting exact matches.
  - Most effective when output requirements are strictly defined.
  - Works with complex objects, but is best used with:
    - Boolean responses (true/false)
    - Specific numerical values
    - Exact string matches
    - Arrays of primitives.
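To make the deterministic-versus-LLM trade-off concrete, here is a minimal Python sketch (my own simplification, not how the built-in evaluators are implemented): exact match is all-or-nothing, while a field-by-field JSON comparison can give partial credit. Anything requiring semantic judgment is better left to the LLM-as-a-judge evaluator.

```python
# Illustrative sketch only -- not the built-in evaluators. It contrasts an
# exact-match check with a simple field-by-field JSON similarity score.

def exact_match(expected, actual) -> int:
    """All-or-nothing: 100 only when the outputs are identical."""
    return 100 if expected == actual else 0

def json_similarity(expected: dict, actual: dict) -> int:
    """Score the fraction of expected fields whose values match exactly."""
    if not expected:
        return 100
    matches = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return round(100 * matches / len(expected))

expected = {"approved": True, "amount": 125.0, "currency": "USD"}
actual = {"approved": True, "amount": 125.0, "currency": "EUR"}

print(exact_match(expected, actual))      # 0  -- any difference fails
print(json_similarity(expected, actual))  # 67 -- two of three fields match
```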
Configuring simulations in evaluations
Simulations enhance agent evaluations by enabling safe, fast, and cost-effective testing through mocked tool and escalation behaviors instead of real endpoints. They offer granular control at the evaluation level, allowing teams to define which components to simulate and to combine real and simulated runs within the same evaluation set. This flexibility supports fixed or generated inputs and both literal output and behavior-based grading, improving test coverage, reproducibility, and the ability to assess whether agents behave as expected.
For additional information, refer to Configuring simulations for agent tools.
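As a rough illustration of what simulation buys you (a sketch with made-up tool names, not the product's mocking mechanism), the same agent logic can run against a mocked tool that follows a simulation instruction instead of calling a real endpoint:

```python
# Conceptual sketch of tool simulation -- not UiPath's implementation.
# A mocked tool returns a canned, instruction-driven response instead of
# hitting the real endpoint, keeping runs fast, safe, and reproducible.

def real_lookup_order(order_id: str) -> dict:
    raise RuntimeError("Would call the live order system -- avoided in tests.")

def simulated_lookup_order(order_id: str) -> dict:
    # Simulation instruction: "Return a shipped order with a tracking number."
    return {"order_id": order_id, "status": "shipped", "tracking": "1Z999AA10123456784"}

def run_agent(lookup_order) -> str:
    """Stand-in for the agent: it calls whichever tool implementation it is given."""
    order = lookup_order("ORD-42")
    return f"Your order {order['order_id']} is {order['status']}."

print(run_agent(simulated_lookup_order))  # simulated run: no real endpoint touched
```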
To set up new evaluation sets using simulations, follow these steps:
- From the Evaluation sets tab, select Create, then Generate new evaluation set.
- Enter a description of the evaluation cases you want to generate.
You can provide high-level context, specific scenarios, or paste in relevant content to guide the generation. If you leave this field blank, evaluation cases are still automatically generated for you.
- Select Generate evaluations.
Autopilot generates several evaluations. For each evaluation, you can view and edit the simulation instructions, input generation instructions, and the expected behavior notes.
- Select which evaluations you want to use, then select Add set.
To configure simulations for existing evaluations, follow these steps:
- Open any evaluation set and select Edit on any evaluation. The Edit evaluation panel is displayed.
- In the Arrange section, define or generate input data using manual values or runtime generation instructions.
If you define the input data manually, you can set the Testing field to True to indicate it is part of a test scenario.
- In the Act section, choose whether each tool should simulate behavior (mocked) or execute real calls, and add simulation instructions. Tool execution is the default setting.
- In the Assert section, specify whether the evaluation is based on output match or agent trajectory, and describe the expected behavior and output.
- Select Save to apply your configuration.
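Put together, the three sections above describe one evaluation case end to end. The dictionary below is a hypothetical rendering of that structure (the field names are mine for illustration, not the export format):

```python
# Hypothetical representation of one evaluation's simulation configuration,
# mirroring the Arrange / Act / Assert sections described above.

evaluation_config = {
    "arrange": {
        # Input data can be fixed values or generated at runtime from instructions.
        "inputs": {"customer_email": "Please cancel my subscription."},
        "testing_flag": True,  # corresponds to setting the Testing field to True
        "input_generation_instructions": None,
    },
    "act": {
        # Per tool: execute the real call (the default) or simulate it.
        "tools": {
            "lookup_subscription": {"mode": "simulate",
                                    "instructions": "Return an active monthly plan."},
            "send_email": {"mode": "execute"},
        },
    },
    "assert": {
        # Grade on the literal output, or on the agent's trajectory (behavior).
        "mode": "trajectory",
        "expected_behavior": "The agent looks up the subscription before confirming the cancellation.",
    },
}
```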
Structuring your evaluation prompt
A well-structured output makes evaluations more reliable: structured outputs ensure consistency and make comparisons easier.
Here is an example of a predefined prompt that evaluates the entire output:
As an expert evaluator, analyze the semantic similarity of these JSON contents to determine a score from 0-100. Focus on comparing the meaning and contextual equivalence of corresponding fields, accounting for alternative valid expressions, synonyms, and reasonable variations in language while maintaining high standards for accuracy and completeness. Provide your score with justification, explaining briefly and concisely why you gave that score.
Expected Output: {{ExpectedOutput}}
ActualOutput: {{ActualOutput}}
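At run time, a prompt template like the one above is filled in before being sent to the judge model. The sketch below shows that substitution step; only the {{ExpectedOutput}} and {{ActualOutput}} placeholders come from the example prompt, and the substitution code itself is my own illustration, not the product's implementation.

```python
# Illustration of filling an evaluation prompt template before the LLM judge
# scores it. The template text is abbreviated with "..." for brevity.

import json

PROMPT_TEMPLATE = (
    "As an expert evaluator, analyze the semantic similarity of these JSON contents "
    "to determine a score from 0-100. ...\n"
    "Expected Output: {{ExpectedOutput}}\n"
    "ActualOutput: {{ActualOutput}}"
)

expected_output = {"category": "Refund", "route_to": "Finance"}
actual_output = {"category": "Refund request", "route_to": "Finance"}

prompt = (PROMPT_TEMPLATE
          .replace("{{ExpectedOutput}}", json.dumps(expected_output))
          .replace("{{ActualOutput}}", json.dumps(actual_output)))
print(prompt)  # the filled-in prompt the LLM judge would score
```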
Number of evaluations
The Agent Score treats 30+ evaluations as a good benchmark.
For simple agents, aim for approximately 30 evaluations across 1-3 evaluation sets. For more complex agents, we recommend at least double that amount.
The number of evaluations depends on:
- Agent complexity
  - Number of input parameters
  - Output structure complexity
  - Tool usage patterns
  - Decision branches
- Input
  - Range of possible inputs: data types, value ranges, optional fields
  - Edge cases
- Usage patterns
  - Common use cases
  - Different personas
  - Error scenarios
Evaluation sets
Grouping evaluations into sets helps organize them better. For example (sketched as data after this list), you can have:
- One set for full output evaluation
- Another for edge cases
- Another for handling misspellings.
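Laid out as data, those three sets might look like the following sketch. The names, inputs, and structure are made up to show the grouping idea; this is not an actual export of evaluation sets.

```python
# Illustrative grouping of evaluations into sets by concern -- not a real export.

evaluation_sets = {
    "full-output": [
        {"input": {"ticket": "I was charged twice."},
         "expected_output": {"category": "Refund", "route_to": "Finance"}},
    ],
    "edge-cases": [
        {"input": {"ticket": ""},  # empty input
         "expected_output": {"category": "Unknown", "route_to": "Triage"}},
    ],
    "misspellings": [
        {"input": {"ticket": "I was chraged twise."},
         "expected_output": {"category": "Refund", "route_to": "Finance"}},
    ],
}

for name, evaluations in evaluation_sets.items():
    print(f"{name}: {len(evaluations)} evaluation(s)")
```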
Coverage principles
- Logical coverage: Map out input combinations, edge cases, and boundary conditions.
- Redundancy management: Aim for 3-5 different evaluations per logically equivalent case.
- Quality over quantity: More evaluations don’t always mean better results. Focus on meaningful tests.
When to create evaluations
Create evaluations once the arguments are stable or complete. That also means your use case is established, and the prompt, tools, and contexts are finalized. If you modify the arguments, you need to adjust your evaluations accordingly. To minimize additional work, it's best to start with stable agents that have well-defined use cases. You can export and import evaluation sets between agents within the same organization or across different organizations. As long as your agent design is complete, you can move evaluations around as needed without having to recreate them from scratch.