
Unstructured and complex documents user guide
Evaluating model performance
You can assess the performance of the model in the following locations:
- The Build tab, which shows the overall project score, as well as the error rate for each document.
- The Measure tab, which shows field group-level and field-level performance.
Evaluating model performance in Build
You can view an overall rating under the Project score in the Build tab.
- Healthy models have a Good or Excellent project score and no field performance warnings.
- The project score is calculated as the average F1 score across all fields.

In addition, you can view the error rate for each document in the Error rate column of the Documents section in Build.
Error rates are only available for annotated documents. They indicate the rate of mistakes the model made on each document, that is, the difference between the model's predictions and the user's annotations.

Evaluating model performance in Measure
The Measure page helps you evaluate how well a model performs on annotated documents before you publish a model version. The page includes:
- A field performance table that surfaces key performance metrics per field and field group.
- Support for comparing performance differences between model versions, highlighting improvements or regressions.
- Visibility into the distribution of error types for each taxonomy field.
- Data export capabilities for custom offline analysis.
The following sections describe the main components in Measure and explain how to use them effectively when you analyze model performance.
Project summary
The summary section provides a quick, high-level view of how your current model version performs across the project. You can use it to:
- Select the model version you want to evaluate.
- Get an at-a-glance read on overall performance using Project score and Avg. doc error rate.
- Quickly spot whether overall project performance is trending up or down when comparing against a previous version.
Project score
The Project score summarizes overall model performance.
Why it is useful
- Provides a single, consistent way to track overall progress as you iterate on the taxonomy, instructions, and annotations.
- Helps you quickly determine whether a model version is generally improving or regressing before drilling into specific fields.
How it is calculated
- Project score is computed as the simple average of F1 scores across all fields in the taxonomy.
- F1 score is a standard model performance metric that balances precision and recall, that is, the harmonic mean of the two.
- At a high level:
- Precision answers: How often were the predicted values of the model correct?
- Recall answers: How much of the annotated data did the model successfully find?
The Project score is an average. Specific field-level regressions or limitations can be reviewed with the Field performance table.
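The calculation above can be sketched in Python. This is an illustrative example only, not the product's internal implementation, and the field names and per-field precision/recall values are hypothetical:

```python
# Illustrative sketch: project score as the simple average of per-field
# F1 scores. Field names and values below are hypothetical examples.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def project_score(field_f1_scores: list[float]) -> float:
    """Simple (unweighted) average of F1 scores across all fields."""
    return sum(field_f1_scores) / len(field_f1_scores)

# Hypothetical (precision, recall) per field:
fields = {"invoice_number": (0.95, 0.90), "total_amount": (0.80, 0.60)}
scores = [f1(p, r) for p, r in fields.values()]
print(round(project_score(scores), 3))  # 0.805
```

Because the average is unweighted, a single weak field can pull the project score down noticeably, which is why field-level review remains important.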
Avg. doc error rate
The Avg. doc error rate is the average of the error rates for each annotated document in the project.
Why it is useful
The Avg. doc error rate provides a quick indicator of how error-prone documents are when the selected model version processes them, which helps evaluate readiness to publish.
How it is calculated
The value is computed as the simple average of the error rate of each fully annotated document in the project.
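As an illustrative sketch of this formula (not product code; the per-document counts are hypothetical):

```python
# Illustrative sketch: per-document error rate and its simple average
# across fully annotated documents. Counts below are hypothetical.

def doc_error_rate(errors: int, annotations: int) -> float:
    """Mistakes the model made on a document, relative to its annotations."""
    return errors / annotations

# (errors, annotations) for three fully annotated documents:
docs = [(0, 10), (2, 8), (1, 20)]
rates = [doc_error_rate(e, a) for e, a in docs]
print(round(sum(rates) / len(rates), 2))  # 0.1
```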
Field performance table
The Field performance table is the primary way to analyze model performance in the Measure page. It displays one row per field or field group, along with performance and error metrics calculated across the annotated documents in the project. The table does not take into account unannotated and partially annotated documents when calculating metrics.
The table helps answer questions such as:
- Which fields limit the overall model performance?
- Are errors concentrated in a few fields or spread broadly?
- Did a recent model change improve or degrade specific fields?
The Field performance table includes several categories of metrics that help you analyze model performance from different perspectives. Each category answers a specific diagnostic question about how your model behaves across fields and documents.
Validation status and partial results
To reduce waiting time:
- Field performance metrics become visible once validation reaches a minimum completion threshold.
- Warnings indicate when validation is still in progress and that the displayed results may change.
Performance metrics
The purpose of the performance metrics is to evaluate the overall quality of extraction for each field or field group.
The performance metrics are described as follows:
- F1 score — The harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). F1 score only remains high when both precision and recall are high. In practice, this makes F1 a strong overall quality indicator for extraction tasks where you care about avoiding incorrect values and avoiding missed values. Therefore, F1 is a useful first metric to review to analyze field performance changes across model versions.
- Precision — Measures how often predicted values are correct: Precision = True positives / (True positives + False positives). True positives are predictions that match the annotated value, excluding values annotated as missing.
- Recall — Measures how often the model finds a value when it exists: Recall = True positives / (True positives + False negatives). False negatives are annotated values that the model did not predict, excluding values annotated as missing.
- Error rate — Total errors / Total annotations. Values marked as missing are included in the count of errors and annotations.
- Error rate (excluding missing) — (Total errors – Extra predictions) / Annotated values. Annotated values marked as missing are excluded.
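The formulas above can be illustrated with a small, self-consistent set of hypothetical counts for a single field. This is an illustrative sketch, not product code:

```python
# Illustrative sketch of the performance metric formulas, using
# hypothetical but self-consistent counts for one field.

incorrect, missed, extra = 6, 5, 4   # the three error classes
tp = 34                              # predictions matching annotated values
fp = incorrect + extra               # wrong or extra predictions: 10
fn = missed                          # annotated values that were not found
annotated_values = tp + incorrect + missed   # 45, excluding "missing"
total_annotations = annotated_values + 5     # 50, incl. 5 marked as missing
total_errors = incorrect + missed + extra    # 15

precision = tp / (tp + fp)                   # ~0.773
recall = tp / (tp + fn)                      # ~0.872
f1 = 2 * precision * recall / (precision + recall)
error_rate = total_errors / total_annotations
error_rate_excl_missing = (total_errors - extra) / annotated_values
print(round(f1, 3), error_rate, round(error_rate_excl_missing, 3))
# 0.819 0.3 0.244
```

Note how the two error rates diverge: excluding missing values removes both the extra predictions from the numerator and the "missing" annotations from the denominator.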
Predictions and errors
The purpose of the predictions and errors metrics is to understand the volume and composition of errors that contribute to poor performance.
The metrics are described as follows:
- Total errors — Total number of errors for a field across all error classes: Total errors = Incorrect predictions + Missed predictions + Extra predictions.
- Total predictions — Total number of predicted values for a field: Total predictions = Correct values + Correct missing + Incorrect predictions.
- Incorrect predictions — Number of predictions where the extracted value does not match the annotation. Excludes predictions and annotated values marked as missing.
- Extra predictions — Number of predicted values that the model should not have extracted, that is, predictions that have no corresponding annotation or whose annotation is marked as missing.
- Missed predictions — Number of annotated values that the model failed to extract.
- Correct values — Number of predicted values that exactly match the annotation.
- Correct missing — Number of instances where the model correctly predicted that a value is missing.
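The two totals compose from these counts as follows; this is an illustrative sketch with hypothetical numbers, not product code:

```python
# Illustrative sketch of how the error and prediction totals compose,
# using hypothetical counts for one field.

incorrect = 6        # predicted value does not match the annotation
missed = 3           # annotated values the model failed to extract
extra = 2            # predictions with no (or a "missing") annotation
correct_values = 30  # predictions that exactly match annotations
correct_missing = 9  # correctly predicted that a value is missing

total_errors = incorrect + missed + extra                         # 11
total_predictions = correct_values + correct_missing + incorrect  # 45
print(total_errors, total_predictions)  # 11 45
```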
Annotations
The purpose of the annotations metrics is to provide context for how much labeled data supports each metric and how reliable performance scores are.
The metrics are described as follows:
- Total annotations — Total number of annotations, including values marked as missing: Total annotations = Annotated values + Annotated values marked as missing.
- Annotated values — Total number of annotated field values, excluding those marked as missing.
- Annotated as missing — Total number of times a field was explicitly labeled as missing.
Document-level metrics
The purpose of document-level metrics is to understand how errors are distributed across documents rather than just across predictions.
The metrics are described as follows:
- Documents with errors — Total number of documents where the field has at least one error.
- Documents annotated — Total number of documents in which the field has at least one annotated field value.
- Percentage of documents with errors — Percentage of annotated documents that contain at least one error for the field: Documents with errors / Documents annotated.
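As a sketch, these document-level metrics can be derived from per-document error counts for a field (hypothetical data, illustrative only):

```python
# Illustrative sketch: document-level metrics for one field, computed
# from a hypothetical list of per-document error counts.

field_errors_per_doc = [0, 2, 0, 1, 0]  # 5 annotated docs for this field

docs_annotated = len(field_errors_per_doc)
docs_with_errors = sum(1 for e in field_errors_per_doc if e > 0)
pct_docs_with_errors = docs_with_errors / docs_annotated
print(docs_with_errors, docs_annotated, pct_docs_with_errors)  # 2 5 0.4
```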
Example scenarios
Scenario 1: Low F1 + Low Precision, but Recall is moderate or high
What you observe
F1 is low, Precision is low, and Recall is moderate or high.
What it usually means
- The model is extracting values for the field, but many of its predictions are wrong or should not have been made at all.
- Common root causes:
- Field instruction is too broad or ambiguous. For example, the field instruction is capture the amount, but it does not specify which amount.
- The document has similar values that can be confused for one another, for example, subtotal versus total, ship-to versus bill-to.
What to do next
- Compare the incorrect and extra predictions to identify whether the issue is tied to extracting the wrong value (non-zero incorrect predictions count) or to extracting a value that should not have been extracted at all (non-zero extra predictions count).
- Tighten field instructions with disambiguators, such as labels, keywords, and formatting constraints.
Scenario 2: High Missed Predictions (Recall is low), Precision is moderate or high
What you observe
- Recall is low and Precision is moderate or high (F1 is typically low or moderate).
- Missed predictions is high, often more than incorrect or extra.
What it usually means
- The model is failing to extract values that are present.
- Common root causes:
- Field instruction is too narrow, which means over-constrained examples or too-specific label requirements.
- The value appears in multiple formats, such as dates and IDs, and the instruction does not cover variants.
What to do next
- Use Missed predictions together with Annotated values to confirm this is a recall problem, that is, the values exist but are not found. Check Annotated values to confirm the field has a reasonable number of annotated datapoints, and Missed predictions to confirm the model struggles to find values rather than predicting them incorrectly.
- Expand instructions to include acceptable variants: alternative labels or synonyms, multiple formatting patterns, location hints (for example, near applicant details or under the borrower section).
Scenario 3: High Error rate but Low Docs with errors (errors concentrated in a few documents)
What you observe
- Error rate is high or Total errors is high.
- Docs with errors is low relative to documents annotated.
- Often one field looks bad but only fails on a small subset of documents.
What it usually means
- Errors are driven by outlier documents, not systemic field behavior.
- Common root causes:
- A specific document or format behaves differently than the rest.
- OCR or quality issues in a small number of documents, such as blurry scans, skew, and handwritten overlays.
- The field is present in most documents but formatted unusually in a few, for example, multi-line versus single-line.
What to do next
- Compare Docs with errors and Docs annotated, and optionally % of Docs with errors, to confirm concentration.
- Sort documents by Error rate in the Build page and inspect documents with the highest error rate to identify if the field is performing poorly on a specific subset.
Scenario 4: Large swings in performance between versions with few annotations
What you observe
- Large differences in F1 or error rate between model versions (up or down), but Annotated values is low, Docs annotated is low, or both.
What it usually means
- The field metrics are not stable yet due to small sample size.
- Common root causes:
- Too few examples — 1–2 documents can significantly change rates.
- Field is rarely present, that is, many missing cases and few true values.
- A handful of difficult documents dominate the metric.
What to do next
- Check Annotated values, Docs annotated, and Annotated as missing to validate low coverage.
- Treat the metrics as directional, not definitive, until coverage increases.
- Add more labeled data specifically for that field: prioritize documents where the field is present, and include a diverse set of samples or variants.
- Use version comparisons only after coverage is sufficient to reduce variability-driven noise.
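The small-sample effect described in this scenario is easy to see numerically. In this illustrative sketch (hypothetical counts, not product code), the same single mistake produces very different error rates depending on how many annotations back the metric:

```python
# Illustrative sketch: with few annotations, one mistake dominates the
# metric; with many, the same mistake barely moves it.

def error_rate(errors: int, annotations: int) -> float:
    return errors / annotations

print(round(error_rate(1, 3), 3))    # 0.333 — 3 annotations: huge swing
print(round(error_rate(1, 300), 3))  # 0.003 — 300 annotations: negligible
```

This is why metrics backed by few annotated values should be treated as directional rather than definitive.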
Filtering and sorting
To filter rows in the table, select one or more of the available quick filters:
- Annotated Values <10
- Field F1 score < 50
- Field F1 score within 50–70
You can also sort the Field performance table by any metric in the table. When a sort is applied, values are sorted within their respective field group. For example, sorting by F1 score sorts the fields within each field group relative to one another.
Visibility settings
By default, Measure shows differences for performance metrics, for example, F1 score and error rate.
To view differences across all metrics, proceed as follows:
- Enable the Show differences in scores from: Version toggle.
- Select the Show differences in scores from: Version dropdown.
- Select Visibility settings.
- In the Version changes - visibility settings pop-up, select All Metrics. The available options are:
- Performance metrics only — Performance metrics are determined by model predictions being compared to annotations, such as F1 score and error rate.
- All metrics
- Show changes inside model variability — By default, changes within the current version's variability ranges are not considered significant and are hidden. Enable to display them. When selected, the following option becomes available:
- Show colors for all changes — By default, changes within the variability range appear in gray. Enable to color all changes green or red.
- Select Save.
Model versions
Model versions capture the state of the project at the time the version was created. You can publish model versions to save them and use them in an automation. In addition, you can star versions in the Measure page to save their performance statistics. You can compare current performance against previous versions to confirm continued improvement as you iterate on instructions.
Selecting a model version
Use the Version dropdown to choose which validation results of a specific model version are displayed throughout the Measure page, such as Field performance, Document performance, and associated metrics. When you switch the model version, all metrics on the page are updated to reflect the validation results of the selected version.
Comparing different model versions using score differences
When multiple model versions are available, the Measure page allows you to compare the current model against a previous version. This way, you can better understand the impact of changes to field instructions, changes in annotations, or model configuration updates.
How this works
- Measure allows you to view score differences from another model version.
- Positive or negative changes highlight improvements or regressions. By default, Measure makes comparisons against the previous model version relative to the most recently created model version.
To compare against a different model version, select it in the Show differences in scores from: Version dropdown.
Understanding model variability and impact on score differences
Some models in IXP are non-deterministic, which means that the set of predictions of a field between model versions can vary slightly even when the instructions of that field are unchanged.
The Measure page allows you to take model variability into account during performance analysis. This helps you:
- Understand whether a performance change is meaningful.
- Avoid overinterpreting small metric fluctuations.
By default:
- Score differences that fall within the variability range of a metric are hidden when comparing two model versions.
- You can select to show all score differences or only differences that are greater than or equal to the variability of a metric.
These defaults ensure attention is focused on significant changes in model performance, and not noise.
To show differences between model versions irrespective of model variability, proceed as follows:
- Enable the Show differences in scores from: Version toggle.
- Select the Show differences in scores from: Version dropdown.
- Select Visibility settings.
- In the pop-up window, select Show changes inside model variability.
- Optionally, select Show colors for all changes if you want all score differences to appear in green or red. By default, differences within the variability range are displayed in gray.
- Select Save.
Starring a model version
A new model version is created each time you change your taxonomy, including instructions, or the model settings. The latest version of the model is always available, but you can also star a specific model version, that is, pin it in place, so that its performance statistics are always shown in the dashboard.
To star a model version, proceed as follows:
- Expand the Model Version drop-down menu to view the list of all available versions.
- Select the star icon next to the model version that you want to always be displayed at the top of the list and on the dashboard.
Starring a model version does not save the model version itself, only the performance statistics. To save a model version, it must be published in the Publish tab.
Exporting Measure data
You can export data from the Measure page for:
- Offline analysis.
- Custom filtering.
- Sharing results with stakeholders.
Exports include field-level predictions, annotations, and performance metrics visible in the Measure page.
To export data, proceed as follows:
- Navigate to the Measure page.
- Select the vertical ellipsis.
- Select Export as Excel file.