Evaluations
When you're building an agent, the goal is to make it reliable—something you can trust to give the right output consistently. Evaluations help you figure out if your agent is doing a good job or if it needs improvement.
Terminology
An evaluation pairs an input with an assertion, or evaluator, made on the output. The evaluator is a defined condition or rule used to assess whether the agent's output meets the expected output or expected trajectory.
Evaluation sets are logical groupings of evaluations and evaluators.
Evaluation results are traces of completed evaluation runs that assess the performance of an agent. During these runs, the agent's accuracy, efficiency, and decision-making ability are measured and scored.
The evaluation score indicates how well the agent performs against the assertions in a specific evaluation, on a scale from 0 to 100. If you have failed evaluation runs, you must diagnose the cause, debug, and re-run them.
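The terminology is easier to see as data. The following Python sketch is purely illustrative (the names and shapes are assumptions, not the product schema): it pairs an input with an expected output and applies a simple evaluator that returns a score from 0 to 100.

```python
# Conceptual sketch only -- not the UiPath API. It shows how an evaluation
# (input + expected output), an evaluator (a rule applied to the actual
# output), and a 0-100 score relate to each other.

evaluation = {
    "name": "Refund request is routed to Finance",
    "input": {"ticket_text": "I was charged twice, please refund one payment."},
    "expected_output": {"category": "Refund", "route_to": "Finance"},
}

def exact_match_evaluator(expected: dict, actual: dict) -> int:
    """A deterministic evaluator: 100 if the outputs match exactly, else 0."""
    return 100 if expected == actual else 0

actual_output = {"category": "Refund", "route_to": "Finance"}  # what the agent returned
score = exact_match_evaluator(evaluation["expected_output"], actual_output)
print(score)  # 100 -> this evaluation passes
```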
Creating evaluations
Before you create evaluations at scale, you can first test your agent on one-off scenarios to see whether it can perform its task and whether the output is correct. If your agent generates the correct output, you can create evaluations from those runs. If it does not, you can either fix the output and create an evaluation with the expected output, or create evaluations from scratch.
Creating evaluations from test runs
- After you design your agent, select Test on cloud.
- In the Test configuration window, confirm the resources used in the solution and:
  - Provide the input for the test run:
    - Provide inputs manually by typing in the content, or
    - Simulate inputs: Use an LLM to generate inputs for your agent’s arguments. You can let the LLM auto-generate inputs or provide prompts to steer it toward specific examples.
  - Configure whether you want to test with real tools, or have one, more, or all of your tools simulated.
    - Simulate tools: Use an LLM to simulate one or more agent tools. Describe how each tool should respond, and simulate partial or full toolsets your agent relies on.
- Select Save and Run.
The results are displayed in the Run output panel. Indicators are available to show when your agent is running with real or simulated data.
- If the output is correct, select the Add to eval set button now available in the General tab.
If the output isn't correct, you can:
- Refine the prompt: Adjust the prompt and test the agent until the output is correct.
- Create evaluations from incorrect outputs: Generate evaluations based on the incorrect outputs and manually edit them to align with the expected outcome.
- Test runs are listed in the Add to Evaluation Set window. Select Add to default set for any run you want to add as an evaluation.
If you've already created an evaluation set, you can select it from the available dropdown list.
- Next, go to the Evaluation sets panel. Three options are available:
- Use the pre-built evaluation set to organize your evaluations.
- Generate a new set with simulated inputs and tools.
- Add evaluations to existing sets with real and simulated data.
- Select Evaluate set to run the evaluations. You can also select specific evaluations from the set that you would like to evaluate.
- Go to the Results tab to view the evaluation score and details.
Creating evaluations from scratch
- After you design your agent, go to the Evaluation sets tab and select Create New.
You can also select Import to use existing JSON data from evaluations of other agents.
- Add a relevant name for the evaluation set.
- Select Add to set to create new evaluations. For each new evaluation in the set:
  - Add a name.
  - Add values for the Input fields (inherited from the defined input arguments) and the expected Output.
  - Select Save.
- Next, select Set Evaluators to assign evaluators to the evaluation set.
You can assign one or several evaluators to a set.
- Select Save changes.
- From the Evaluation sets main page, select Run evaluation set for each set you want to run.
- Go to the Results tab to view the evaluation score and details.
Generating evaluations
You can also create evaluation sets with simulations. Generate new evaluation sets (or add to existing ones) using simulated inputs and tools.
- Select Create.
- Select Generate new evaluation set.
You can let the LLM auto-generate the evaluation set based on your existing agent, its design runs, and its arguments, or you can provide prompts to steer it toward specific examples.
For details, refer to Configuring simulations in evaluations.
Defining evaluators
Use the Evaluators panel to create and manage your evaluators. By default, each agent has a predefined, LLM-based Default Evaluator.
To create your own evaluators:
- Select Create New.
- Select the evaluator type:
- LLM-as-a-judge: Semantic similarity – Lets you create your own LLM-based evaluator.
- Exact match – Checks if the agent output matches the expected output.
- JSON similarity – Checks if two JSON structures or values are similar.
- Trajectory evaluator – Uses AI to judge the agent based on run history and expected behavior.
- Select Continue.
- Configure the evaluator:
- Add a relevant name and description.
- Select the Target output fields:
- Root-level targeting (* All): Evaluates the entire output.
- Field-specific targeting: Assesses specific first-level fields. Use the dropdown menu to select a field. The listed output fields are inherited from the output arguments you defined for the system prompt.
- Add a prompt (only for the LLM-based evaluator).
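As a rough illustration of the difference between the two targeting options above (the example values are invented, not product code), root-level targeting hands the evaluator the entire output object, while field-specific targeting hands it a single first-level field:

```python
# Illustration of output targeting for an evaluator -- not product code.

agent_output = {
    "summary": "Duplicate charge confirmed; refund initiated.",
    "route_to": "Finance",
}
expected_output = {
    "summary": "Duplicate charge found and a refund was started.",
    "route_to": "Finance",
}

# Root-level targeting (* All): the evaluator receives the entire output.
root_expected, root_actual = expected_output, agent_output

# Field-specific targeting: the evaluator receives only the selected
# first-level field, inherited from the agent's output arguments.
field = "route_to"
field_expected, field_actual = expected_output[field], agent_output[field]

print(field_expected == field_actual)  # True -- this field matches exactly
```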
Choosing the evaluator type
If you don't know which evaluator type suits your needs, see the following recommendations:
- LLM-as-a-judge:
  - Recommended as the default approach when targeting the root output.
  - Provides flexible evaluation of complex outputs.
  - Can assess quality and correctness beyond exact matching.
  - Best used when evaluating reasoning, natural language responses, or complex structured outputs.
- Deterministic (Exact match or JSON similarity), illustrated in the sketch after this list:
  - Recommended when expecting exact matches.
  - Most effective when output requirements are strictly defined.
  - Works with complex objects, but is best used with:
    - Boolean responses (true/false)
    - Specific numerical values
    - Exact string matches
    - Arrays of primitives.
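To make the deterministic-versus-LLM trade-off concrete, here is a minimal Python sketch (my own simplification, not how the built-in evaluators are implemented): exact match is all-or-nothing, while a field-by-field JSON comparison can give partial credit. Anything requiring semantic judgment is better left to the LLM-as-a-judge evaluator.

```python
# Illustrative sketch only -- not the built-in evaluators. It contrasts an
# exact-match check with a simple field-by-field JSON similarity score.

def exact_match(expected, actual) -> int:
    """All-or-nothing: 100 only when the outputs are identical."""
    return 100 if expected == actual else 0

def json_similarity(expected: dict, actual: dict) -> int:
    """Score the fraction of expected fields whose values match exactly."""
    if not expected:
        return 100
    matches = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return round(100 * matches / len(expected))

expected = {"approved": True, "amount": 125.0, "currency": "USD"}
actual = {"approved": True, "amount": 125.0, "currency": "EUR"}

print(exact_match(expected, actual))      # 0  -- any difference fails
print(json_similarity(expected, actual))  # 67 -- two of three fields match
```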
Configuring simulations in evaluations
Simulations enhance agent evaluations by enabling safe, fast, and cost-effective testing through mocked tool and escalation behaviors instead of real endpoints. They offer granular control at the evaluation level, allowing teams to define which components to simulate and to combine real and simulated runs within the same evaluation set. This flexibility supports fixed or generated inputs and both literal output and behavior-based grading, improving test coverage, reproducibility, and the ability to assess whether agents behave as expected.
For additional information, refer to Configuring simulations for agent tools.
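As a rough illustration of what simulation buys you (a sketch with made-up tool names, not the product's mocking mechanism), the same agent logic can run against a mocked tool that follows a simulation instruction instead of calling a real endpoint:

```python
# Conceptual sketch of tool simulation -- not UiPath's implementation.
# A mocked tool returns a canned, instruction-driven response instead of
# hitting the real endpoint, keeping runs fast, safe, and reproducible.

def real_lookup_order(order_id: str) -> dict:
    raise RuntimeError("Would call the live order system -- avoided in tests.")

def simulated_lookup_order(order_id: str) -> dict:
    # Simulation instruction: "Return a shipped order with a tracking number."
    return {"order_id": order_id, "status": "shipped", "tracking": "1Z999AA10123456784"}

def run_agent(lookup_order) -> str:
    """Stand-in for the agent: it calls whichever tool implementation it is given."""
    order = lookup_order("ORD-42")
    return f"Your order {order['order_id']} is {order['status']}."

print(run_agent(simulated_lookup_order))  # simulated run: no real endpoint touched
```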
To set up new evaluation sets using simulations, follow these steps:
- From the Evaluation sets tab, select Create, then Generate new evaluation set.
- Enter a description of the evaluation cases you want to generate.
You can provide high-level context, specific scenarios, or paste in relevant content to guide the generation. If you leave this field blank, evaluation cases are still automatically generated for you.
- Select Generate evaluations.
Autopilot generates several evaluations. For each evaluation, you can view and edit the simulation instructions, input generation instructions, and the expected behavior notes.
- Select which evaluations you want to use, then select Add set.
To configure simulations for existing evaluations, follow these steps:
- Open any evaluation set and select Edit on any evaluation. The Edit evaluation panel is displayed.
- In the Arrange section, define or generate input data using manual values or runtime generation instructions.
If you define the input data manually, you can set the Testing field to True to indicate it is part of a test scenario.
- In the Act section, choose whether each tool should simulate behavior (mocked) or execute real calls, and add simulation instructions. Tool execution is the default setting.
- In the Assert section, specify whether the evaluation is based on output match or agent trajectory, and describe the expected behavior and output.
- Select Save to apply your configuration.
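Put together, the three sections above describe one evaluation case end to end. The dictionary below is a hypothetical rendering of that structure (the field names are mine for illustration, not the export format):

```python
# Hypothetical representation of one evaluation's simulation configuration,
# mirroring the Arrange / Act / Assert sections described above.

evaluation_config = {
    "arrange": {
        # Input data can be fixed values or generated at runtime from instructions.
        "inputs": {"customer_email": "Please cancel my subscription."},
        "testing_flag": True,  # corresponds to setting the Testing field to True
        "input_generation_instructions": None,
    },
    "act": {
        # Per tool: execute the real call (the default) or simulate it.
        "tools": {
            "lookup_subscription": {"mode": "simulate",
                                    "instructions": "Return an active monthly plan."},
            "send_email": {"mode": "execute"},
        },
    },
    "assert": {
        # Grade on the literal output, or on the agent's trajectory (behavior).
        "mode": "trajectory",
        "expected_behavior": "The agent looks up the subscription before confirming the cancellation.",
    },
}
```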
Structuring your evaluation prompt
A well-structured output makes evaluations more reliable: structured outputs ensure consistency and make comparisons easier.
Here is an example of a predefined prompt that evaluates the entire output:
As an expert evaluator, analyze the semantic similarity of these JSON contents to determine a score from 0-100. Focus on comparing the meaning and contextual equivalence of corresponding fields, accounting for alternative valid expressions, synonyms, and reasonable variations in language while maintaining high standards for accuracy and completeness. Provide your score with justification, explaining briefly and concisely why you gave that score.
Expected Output: {{ExpectedOutput}}
ActualOutput: {{ActualOutput}}
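At run time, a prompt template like the one above is filled in before being sent to the judge model. The sketch below shows that substitution step; only the {{ExpectedOutput}} and {{ActualOutput}} placeholders come from the example prompt, and the substitution code itself is my own illustration, not the product's implementation.

```python
# Illustration of filling an evaluation prompt template before the LLM judge
# scores it. The template text is abbreviated with "..." for brevity.

import json

PROMPT_TEMPLATE = (
    "As an expert evaluator, analyze the semantic similarity of these JSON contents "
    "to determine a score from 0-100. ...\n"
    "Expected Output: {{ExpectedOutput}}\n"
    "ActualOutput: {{ActualOutput}}"
)

expected_output = {"category": "Refund", "route_to": "Finance"}
actual_output = {"category": "Refund request", "route_to": "Finance"}

prompt = (PROMPT_TEMPLATE
          .replace("{{ExpectedOutput}}", json.dumps(expected_output))
          .replace("{{ActualOutput}}", json.dumps(actual_output)))
print(prompt)  # the filled-in prompt the LLM judge would score
```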
Number of evaluations
The Agent Score treats 30+ evaluations as a good benchmark.
For simple agents, aim for approximately 30 evaluations across 1-3 evaluation sets. For more complex agents, we recommend at least double that amount.
The number of evaluations depends on:
- Agent complexity
  - Number of input parameters
  - Output structure complexity
  - Tool usage patterns
  - Decision branches
- Input
  - Range of possible inputs: data types, value ranges, optional fields
  - Edge cases
- Usage patterns
  - Common use cases
  - Different personas
  - Error scenarios
Evaluation sets
Grouping evaluations into sets helps organize them better. For example (sketched as data after this list), you can have:
- One set for full output evaluation
- Another for edge cases
- Another for handling misspellings.
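Laid out as data, those three sets might look like the following sketch. The names, inputs, and structure are made up to show the grouping idea; this is not an actual export of evaluation sets.

```python
# Illustrative grouping of evaluations into sets by concern -- not a real export.

evaluation_sets = {
    "full-output": [
        {"input": {"ticket": "I was charged twice."},
         "expected_output": {"category": "Refund", "route_to": "Finance"}},
    ],
    "edge-cases": [
        {"input": {"ticket": ""},  # empty input
         "expected_output": {"category": "Unknown", "route_to": "Triage"}},
    ],
    "misspellings": [
        {"input": {"ticket": "I was chraged twise."},
         "expected_output": {"category": "Refund", "route_to": "Finance"}},
    ],
}

for name, evaluations in evaluation_sets.items():
    print(f"{name}: {len(evaluations)} evaluation(s)")
```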
Coverage principles
- Logical coverage: Map out input combinations, edge cases, and boundary conditions.
- Redundancy management: Aim for 3-5 different evaluations per logically equivalent case.
- Quality over quantity: More evaluations don’t always mean better results. Focus on meaningful tests.
When to create evaluations
Create evaluations once the arguments are stable or complete. That also means your use case is established, and the prompt, tools, and contexts are finalized. If you modify the arguments, you need to adjust your evaluations accordingly. To minimize additional work, it's best to start with stable agents that have well-defined use cases. You can export and import evaluation sets between agents within the same organization or across different organizations. As long as your agent design is complete, you can move evaluations around as needed without having to recreate them from scratch.