Overview

Unstructured and complex documents user guide
Last updated Aug 11, 2025
This section outlines the process of validating the performance of model versions in a project. Validating model performance is critical to ensuring the model's accuracy and reliability before it is deployed to a production environment.
- Evaluate model performance by comparing different model versions.
- Gather validation statistics.
- Refine the model until it reaches a performance level suitable for your use case:
  - Review model predictions.
  - Iterate on the extraction schema.
The dashboard from the Measure tab includes the following details:
- The performance of complete extractions for a specific field group and all fields of a field group.
- The average performance of all fields in a specific field group.
- The individual field-level performance.
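The field-group figures above can be thought of as simple aggregates of the field-level scores. The following minimal sketch illustrates the idea of averaging per-field scores into a field-group score; the field names and values are purely illustrative, not taken from a real project:

```python
# Hypothetical sketch: average per-field scores into a field-group score.
# Field names and score values below are illustrative only.
field_scores = {
    "invoice_number": 0.92,
    "invoice_date": 0.88,
    "total_amount": 0.75,
}

# Average performance of all fields in the field group.
group_score = sum(field_scores.values()) / len(field_scores)
print(f"Field-group average: {group_score:.2f}")
```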
The following list contains a description of all field performance indicators:
- Red dial - A red field performance dial indicates that not enough annotated examples have been provided.
- Amber circle - An amber performance indicator is displayed when a field’s performance is less than satisfactory.
- Red circle - A red performance indicator is displayed when a field is performing poorly.
- Recall - Of all the true extractions, the fraction that the model actually predicted.
- Precision - Of the extractions that the model predicted, the fraction that were actually correct.
- F1 score - The harmonic mean of precision and recall.
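The three metrics above can be computed directly from true-positive, false-positive, and false-negative counts. A minimal sketch, using illustrative counts rather than real project data:

```python
# Illustrative counts: correct, spurious, and missed extractions.
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)  # of the predicted extractions, fraction correct
recall = tp / (tp + fn)     # of the true extractions, fraction predicted
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```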
Understanding field-level performance, and how changing field instructions affects it, helps you determine whether the model is production-ready.
- Annotate at least 10 documents and 10 fields to get a meaningful project and field score.
- Decide when to stop training the model based on your specific business needs and use-case objectives; you may require higher precision and recall for some fields than for others.
Note: High-precision models minimize false positives, while high-recall models reduce false negatives.
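This precision/recall trade-off is often governed by a confidence threshold: keeping only high-confidence predictions raises precision at the cost of recall. A hypothetical sketch with synthetic predictions (not a real model's output):

```python
# Hypothetical sketch: a confidence threshold trades recall for precision.
# Each prediction is (confidence, was_it_correct); data is synthetic.
preds = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.50, False), (0.40, True),
]
TOTAL_TRUE = 5  # total true extractions present in the documents

def metrics(threshold):
    """Precision and recall when keeping predictions at/above threshold."""
    kept = [ok for conf, ok in preds if conf >= threshold]
    tp = sum(kept)  # True counts as 1
    return tp / len(kept), tp / TOTAL_TRUE

print(metrics(0.45))  # low threshold: higher recall, lower precision
print(metrics(0.75))  # high threshold: higher precision, lower recall
```

Raising the threshold from 0.45 to 0.75 here drops the false positives (reducing false positives, as a high-precision setting does) but also drops a correct low-confidence prediction (increasing false negatives).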