
Unstructured and complex documents user guide
Evaluating model performance
You can assess the performance of the model in the following locations:
- The Build tab, which shows the overall project score, as well as the error rate for each document.
- The Measure tab, which shows field group-level and field-level performance.
Evaluating model performance in Build
You can view an overall rating under the Project score in the Build tab.
- Healthy models have a Good or Excellent project score and no field performance warnings.
- The project score is calculated as the average F1 score across all fields.

In addition, you can view the error rate for each document in the Error rate column of the Documents section in Build.
Error rates are only available for annotated documents. They indicate the rate of mistakes the model made on each document, that is, the difference between the model's predictions and the user's annotations.

Evaluating model performance in Measure
The Measure page helps you evaluate how well a model performs on annotated documents before you publish a model version. The page includes:
- A field performance table that surfaces key performance metrics per field and field group.
- Support for comparing performance differences between model versions, highlighting improvements or regressions.
- Visibility into the distribution of error types for each taxonomy field.
- Data export capabilities for custom offline analysis.
The following sections describe the main components in Measure and explain how to use them effectively when you analyze model performance.
Project summary
The summary section provides a quick, high-level view of how your current model version performs across the project. You can use it to:
- Select the model version you want to evaluate.
- Get an at-a-glance read on overall performance using Project score and Avg. doc error rate.
- Quickly spot whether overall project performance is trending up or down when comparing against a previous version.
Project score
The Project score summarizes overall model performance.
Why it is useful
- Provides a single, consistent way to track overall progress as you iterate on the taxonomy, instructions, and annotations.
- Helps you quickly determine whether a model version is generally improving or regressing before drilling into specific fields.
How it is calculated
- Project score is computed as the simple average of F1 scores across all fields in the taxonomy.
- F1 score is a standard model performance metric that balances precision and recall, that is, the harmonic mean of the two.
- At a high level:
- Precision answers: How often were the predicted values of the model correct?
- Recall answers: How much of the annotated data did the model successfully find?
The Project score is an average. Specific field-level regressions or limitations can be reviewed with the Field performance table.
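The calculation above can be sketched in Python. This is an illustrative example only, not the product's internal implementation, and the field names and per-field precision/recall values are hypothetical:

```python
# Illustrative sketch: project score as the simple average of per-field
# F1 scores. Field names and values below are hypothetical examples.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def project_score(field_f1_scores: list[float]) -> float:
    """Simple (unweighted) average of F1 scores across all fields."""
    return sum(field_f1_scores) / len(field_f1_scores)

# Hypothetical (precision, recall) per field:
fields = {"invoice_number": (0.95, 0.90), "total_amount": (0.80, 0.60)}
scores = [f1(p, r) for p, r in fields.values()]
print(round(project_score(scores), 3))  # 0.805
```

Because the average is unweighted, a single weak field can pull the project score down noticeably, which is why field-level review remains important.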
Avg. doc error rate
The Avg. doc error rate is the average of the error rates for each annotated document in the project.
Why it is useful
The Avg. doc error rate provides a quick indicator of how error-prone documents are when the selected model version processes them, which helps evaluate readiness to publish.
How it is calculated
The value is computed as the simple average of the error rate of each fully annotated document in the project.
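As an illustrative sketch of this formula (not product code; the per-document counts are hypothetical):

```python
# Illustrative sketch: per-document error rate and its simple average
# across fully annotated documents. Counts below are hypothetical.

def doc_error_rate(errors: int, annotations: int) -> float:
    """Mistakes the model made on a document, relative to its annotations."""
    return errors / annotations

# (errors, annotations) for three fully annotated documents:
docs = [(0, 10), (2, 8), (1, 20)]
rates = [doc_error_rate(e, a) for e, a in docs]
print(round(sum(rates) / len(rates), 2))  # 0.1
```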
Field performance table
The Field performance table is the primary way to analyze model performance in the Measure page. It displays one row per field or field group, along with performance and error metrics calculated across the annotated documents in the project. The table does not take into account unannotated and partially annotated documents when calculating metrics.
The table helps answer questions such as:
- Which fields limit the overall model performance?
- Are errors concentrated in a few fields or spread broadly?
- Did a recent model change improve or degrade specific fields?
The Field performance table includes several categories of metrics that help you analyze model performance from different perspectives. Each category answers a specific diagnostic question about how your model behaves across fields and documents.
Validation status and partial results
To reduce waiting time:
- Field performance metrics become visible once validation reaches a minimum completion threshold.
- Warnings indicate when validation is still in progress and that the displayed results may change.
Performance metrics
The purpose of the performance metrics is to evaluate the overall quality of extraction for each field or field group.
The performance metrics are described as follows:
- F1 score — The harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). F1 score only remains high when both precision and recall are high. In practice, this makes F1 a strong overall quality indicator for extraction tasks where you care about avoiding incorrect values and avoiding missed values. Therefore, F1 is a useful first metric to review to analyze field performance changes across model versions.
- Precision — Measures how often predicted values are correct: Precision = True positives / (True positives + False positives). True positives are predictions that match the annotated value, excluding values annotated as missing.
- Recall — Measures how often the model finds a value when it exists: Recall = True positives / (True positives + False negatives). False negatives are annotated values that the model did not predict, excluding values annotated as missing.
- Error rate — Total errors / Total annotations. Values marked as missing are included in the count of errors and annotations.
- Error rate (excluding missing) — (Total errors – Extra predictions) / Annotated values. Annotated values marked as missing are excluded.
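The formulas above can be illustrated with a small, self-consistent set of hypothetical counts for a single field. This is an illustrative sketch, not product code:

```python
# Illustrative sketch of the performance metric formulas, using
# hypothetical but self-consistent counts for one field.

incorrect, missed, extra = 6, 5, 4   # the three error classes
tp = 34                              # predictions matching annotated values
fp = incorrect + extra               # wrong or extra predictions: 10
fn = missed                          # annotated values that were not found
annotated_values = tp + incorrect + missed   # 45, excluding "missing"
total_annotations = annotated_values + 5     # 50, incl. 5 marked as missing
total_errors = incorrect + missed + extra    # 15

precision = tp / (tp + fp)                   # ~0.773
recall = tp / (tp + fn)                      # ~0.872
f1 = 2 * precision * recall / (precision + recall)
error_rate = total_errors / total_annotations
error_rate_excl_missing = (total_errors - extra) / annotated_values
print(round(f1, 3), error_rate, round(error_rate_excl_missing, 3))
# 0.819 0.3 0.244
```

Note how the two error rates diverge: excluding missing values removes both the extra predictions from the numerator and the "missing" annotations from the denominator.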
Predictions and errors
The purpose of the predictions and errors metrics is to understand the volume and composition of errors that contribute to poor performance.
The metrics are described as follows:
- Total errors — Total number of errors for a field across all error classes: Total errors = Incorrect predictions + Missed predictions + Extra predictions.
- Total predictions — Total number of predicted values for a field: Total predictions = Correct values + Correct missing + Incorrect predictions.
- Incorrect predictions — Number of predictions where the extracted value does not match the annotation. Excludes predictions and annotated values marked as missing.
- Extra predictions — Number of predicted values that the model should not have extracted, that is, predictions that have no corresponding annotation or whose annotation is marked as missing.
- Missed predictions — Number of annotated values that the model failed to extract.
- Correct values — Number of predicted values that exactly match the annotation.
- Correct missing — Number of instances where the model correctly predicted that a value is missing.
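The two totals compose from these counts as follows; this is an illustrative sketch with hypothetical numbers, not product code:

```python
# Illustrative sketch of how the error and prediction totals compose,
# using hypothetical counts for one field.

incorrect = 6        # predicted value does not match the annotation
missed = 3           # annotated values the model failed to extract
extra = 2            # predictions with no (or a "missing") annotation
correct_values = 30  # predictions that exactly match annotations
correct_missing = 9  # correctly predicted that a value is missing

total_errors = incorrect + missed + extra                         # 11
total_predictions = correct_values + correct_missing + incorrect  # 45
print(total_errors, total_predictions)  # 11 45
```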
Annotations
The purpose of the annotations metrics is to provide context for how much labeled data supports each metric and how reliable performance scores are.
The metrics are described as follows:
- Total annotations — Total number of annotations, including values marked as missing: Total annotations = Annotated values + Annotated values marked as missing.
- Annotated values — Total number of annotated field values, excluding those marked as missing.
- Annotated as missing — Total number of times a field was explicitly labeled as missing.
Document-level metrics
The purpose of document-level metrics is to understand how errors are distributed across documents rather than just across predictions.
The metrics are described as follows:
- Documents with errors — Total number of documents where the field has at least one error.
- Documents annotated — Total number of documents in which the field has at least one annotated field value.
- Percentage of documents with errors — Percentage of annotated documents that contain at least one error for the field: Documents with errors / Documents annotated.
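As a sketch, these document-level metrics can be derived from per-document error counts for a field (hypothetical data, illustrative only):

```python
# Illustrative sketch: document-level metrics for one field, computed
# from a hypothetical list of per-document error counts.

field_errors_per_doc = [0, 2, 0, 1, 0]  # 5 annotated docs for this field

docs_annotated = len(field_errors_per_doc)
docs_with_errors = sum(1 for e in field_errors_per_doc if e > 0)
pct_docs_with_errors = docs_with_errors / docs_annotated
print(docs_with_errors, docs_annotated, pct_docs_with_errors)  # 2 5 0.4
```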
Example scenarios
Scenario 1: Low F1 + Low Precision, but Recall is moderate or high
What you observe
F1 is low, Precision is low, and Recall is moderate or high.
What it usually means
- The model is extracting values for the field, but many of its predictions are wrong or should not have been made at all.
- Common root causes:
- Field instruction is too broad or ambiguous. For example, the field instruction is capture the amount, but it does not specify which amount.
- The document has similar values that can be confused for one another, for example, subtotal versus total, ship-to versus bill-to.
What to do next
- Compare the incorrect and extra predictions to identify whether the issue is tied to extracting the wrong value (non-zero incorrect predictions count) or to extracting a value that should not have been extracted at all (non-zero extra predictions count).
- Tighten field instructions with disambiguators, such as labels, keywords, and formatting constraints.
Scenario 2: High Missed Predictions (Recall is low), Precision is moderate or high
What you observe
- Recall is low and Precision is moderate or high (F1 is typically low or moderate).
- Missed predictions is high, often more than incorrect or extra.
What it usually means
- The model is failing to extract values that are present.
- Common root causes:
- Field instruction is too narrow, which means over-constrained examples or too-specific label requirements.
- The value appears in multiple formats, such as dates and IDs, and the instruction does not cover variants.
What to do next
- Use Missed predictions together with Annotated values to confirm this is a recall problem, that is, the values exist but are not found. Check Annotated values to confirm the field has a reasonable number of annotated datapoints, and Missed predictions to confirm the model struggles to find values rather than predicting them incorrectly.
- Expand instructions to include acceptable variants: alternative labels or synonyms, multiple formatting patterns, location hints (for example, near applicant details or under the borrower section).
Scenario 3: High Error rate but Low Docs with errors (errors concentrated in a few documents)
What you observe
- Error rate is high or Total errors is high.
- Docs with errors is low relative to documents annotated.
- Often one field looks bad but only fails on a small subset of documents.
What it usually means
- Errors are driven by outlier documents, not systemic field behavior.
- Common root causes:
- A specific document or format behaves differently than the rest.
- OCR or quality issues in a small number of documents, such as blurry scans, skew, and handwritten overlays.
- The field is present in most documents but formatted unusually in a few, for example, multi-line versus single-line.
What to do next
- Compare Docs with errors and Docs annotated, and optionally % of Docs with errors, to confirm concentration.
- Sort documents by Error rate in the Build page and inspect documents with the highest error rate to identify if the field is performing poorly on a specific subset.
Scenario 4: Large swings in performance between versions with few annotations
What you observe
- Large differences in F1 or error rate between model versions (up or down), but Annotated values is low, Docs annotated is low, or both.
What it usually means
- The field metrics are not stable yet due to small sample size.
- Common root causes:
- Too few examples — 1–2 documents can significantly change rates.
- Field is rarely present, that is, many missing cases and few true values.
- A handful of difficult documents dominate the metric.
What to do next
- Check Annotated values, Docs annotated, and Annotated as missing to validate low coverage.
- Treat the metrics as directional, not definitive, until coverage increases.
- Add more labeled data specifically for that field: prioritize documents where the field is present, and include a diverse set of samples or variants.
- Use version comparisons only after coverage is sufficient to reduce variability-driven noise.
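The small-sample effect described in this scenario is easy to see numerically. In this illustrative sketch (hypothetical counts, not product code), the same single mistake produces very different error rates depending on how many annotations back the metric:

```python
# Illustrative sketch: with few annotations, one mistake dominates the
# metric; with many, the same mistake barely moves it.

def error_rate(errors: int, annotations: int) -> float:
    return errors / annotations

print(round(error_rate(1, 3), 3))    # 0.333 — 3 annotations: huge swing
print(round(error_rate(1, 300), 3))  # 0.003 — 300 annotations: negligible
```

This is why metrics backed by few annotated values should be treated as directional rather than definitive.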
Filtering and sorting
To filter rows in the table, select one or more of the available quick filters:
- Annotated Values <10
- Field F1 score < 50
- Field F1 score within 50–70
You can also sort the Field performance table by any metric in the table. When a sort is applied, values are sorted within their respective field group. For example, sorting by F1 score sorts the fields within each field group relative to one another.
Visibility settings
By default, Measure shows differences for performance metrics, for example, F1 score and error rate.
To view differences across all metrics, proceed as follows:
- Enable the Show differences in scores from: Version toggle.
- Select the Show differences in scores from: Version dropdown.
- Select Visibility settings.
- In the Version changes - visibility settings pop-up, select All Metrics. The available options are:
- Performance metrics only — Performance metrics are determined by model predictions being compared to annotations, such as F1 score and error rate.
- All metrics
- Show changes inside model variability — By default, changes within the current version's variability ranges are not considered significant and are hidden. Enable to display them. When selected, the following option becomes available:
- Show colors for all changes — By default, changes within the variability range appear in gray. Enable to color all changes green or red.
- Select Save.
Model versions
Model versions capture the state of the project at the time the version was created. You can publish model versions to save them and use them in an automation. In addition, you can star versions in the Measure page to save their performance statistics. You can compare current performance against previous versions to confirm continued improvement as you iterate on instructions.
Selecting a model version
Use the Version dropdown to choose which validation results of a specific model version are displayed throughout the Measure page, such as Field performance, Document performance, and associated metrics. When you switch the model version, all metrics on the page are updated to reflect the validation results of the selected version.
Comparing different model versions using score differences
When multiple model versions are available, the Measure page allows you to compare the current model against a previous version. This way, you can better understand the impact of changes to field instructions, changes in annotations, or model configuration updates.
How this works
- Measure allows you to view score differences from another model version.
- Positive or negative changes highlight improvements or regressions. By default, Measure makes comparisons against the previous model version relative to the most recently created model version.
To compare against a different model version, select it in the Show differences in scores from: Version dropdown.
Understanding model variability and impact on score differences
Some models in IXP are non-deterministic, which means that the set of predictions of a field between model versions can vary slightly even when the instructions of that field are unchanged.
The Measure page allows you to take model variability into account during performance analysis. This helps you:
- Understand whether a performance change is meaningful.
- Avoid overinterpreting small metric fluctuations.
By default:
- Score differences that fall within the variability range of a metric are hidden when comparing two model versions.
- You can select to show all score differences or only differences that are greater than or equal to the variability of a metric.
These defaults ensure attention is focused on significant changes in model performance, and not noise.
To show differences between model versions irrespective of model variability, proceed as follows:
- Enable the Show differences in scores from: Version toggle.
- Select the Show differences in scores from: Version dropdown.
- Select Visibility settings.
- In the pop-up window, select Show changes inside model variability.
- Optionally, select Show colors for all changes if you want all score differences to appear in green or red. By default, differences within the variability range are displayed in gray.
- Select Save.
Starring a model version
A new model version is created each time you change your taxonomy, including instructions, or the model settings. The latest version of the model is always available, but you can also star a specific model version, that is, pin it in place, so that its performance statistics are always shown in the dashboard.
To star a model version, proceed as follows:
- Expand the Model Version drop-down menu to view the list of all available versions.
- Select the star icon next to the model version that you want to always be displayed at the top of the list and on the dashboard.
Starring a model version does not save the model version itself, only the performance statistics. To save a model version, it must be published in the Publish tab.
Exporting Measure data
You can export data from the Measure page for:
- Offline analysis.
- Custom filtering.
- Sharing results with stakeholders.
Exports include field-level predictions, annotations, and performance metrics visible in the Measure page.
To export data, proceed as follows:
- Navigate to the Measure page.
- Select the vertical ellipsis.
- Select Export as Excel file.