Document Understanding user guide

Last updated Apr 6, 2026


Dataset diagnostics

Training a new model from scratch can be a demanding job.

The Dataset Diagnostics feature helps you build effective datasets by providing feedback and hints about the steps needed to achieve good accuracy for the trained model.

Located in the Management Bar of the Document Manager, Dataset Diagnostics provides visual and written guidance throughout the whole process of training a new model.

Three dataset status levels are shown in the Management Bar:

Red - More labelled training data is required.
Orange - More labelled training data is recommended.
Green - The required amount of labelled training data has been reached.

If no fields are created in the session, the dataset status level is grey.
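
The four status levels can be pictured as a simple threshold check. The following is a minimal sketch only: the function name, the `required_pages` and `recommended_pages` thresholds, and the comparison rules are assumptions made for illustration. The guide does not state how the product computes the status.

```python
from enum import Enum


class DatasetStatus(Enum):
    GREY = "grey"
    RED = "red"
    ORANGE = "orange"
    GREEN = "green"


def dataset_status(field_count: int, labelled_pages: int,
                   required_pages: int, recommended_pages: int) -> DatasetStatus:
    """Map labelled-page counts to a status colour (illustrative only).

    required_pages and recommended_pages are hypothetical thresholds;
    the real values are calculated by the product.
    """
    if field_count == 0:
        # No fields created in the session: the status is grey.
        return DatasetStatus.GREY
    if labelled_pages < required_pages:
        # More labelled training data is required.
        return DatasetStatus.RED
    if labelled_pages < recommended_pages:
        # More labelled training data is recommended.
        return DatasetStatus.ORANGE
    # The needed level of labelled training data is achieved.
    return DatasetStatus.GREEN
```

For example, under these assumed thresholds, `dataset_status(5, 30, 20, 50)` falls between the required and recommended values and would be reported as orange.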

More information on each status is available in the Dataset Diagnostics popup menu. Select the Dataset Diagnostics button to open it.

Dataset tab

Provides information about the documents used for training the model, the total number of imported pages, and the total number of labelled pages.

The color status bar is divided according to the recommended number of labelled pages needed to train the model and the current state of your dataset, including labelled and unlabelled data. Hover over a color on the status bar to see a tooltip with more information about that status.

The numbers available on the Dataset tab are calculated based on the number of regular fields and item fields from the training session.

Red - The dataset requires more labelled data for training the model.
Orange - More labelled data is recommended to increase the accuracy of the trained model. You can proceed with the current data, but accuracy may be lower than desired.
Green - There is enough labelled data to train the model and obtain accurate results.

Fields tab

Provides information about each labelled field: the total number of training pages on which the field is labelled, the total number of evaluation documents containing the labelled field, and the field's status for the current training set.

Field - The name of the labelled field.
Training Pages - The number of pages in the Training+Validation set on which the field is labelled.
Evaluation Documents - The number of documents in the Evaluation set on which this field is labelled.
Status - The status of each field, which is one of Red, Orange, or Green.
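
To make the two count columns concrete, the sketch below derives Training Pages and Evaluation Documents from a small in-memory dataset. The `Page` and `Document` classes, the `subset` values, and the `field_counts` function are hypothetical names introduced for this illustration. They are not part of the product, and the real counts are computed by the Document Manager.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Page:
    # Names of the fields labelled on this page (hypothetical model).
    labelled_fields: set[str] = field(default_factory=set)


@dataclass
class Document:
    # Assumed subset names: "train", "validation", or "evaluation".
    subset: str
    pages: list[Page] = field(default_factory=list)


def field_counts(documents: list[Document]) -> dict[str, tuple[int, int]]:
    """Return {field name: (training pages, evaluation documents)}."""
    training_pages = defaultdict(int)
    evaluation_docs = defaultdict(int)
    for doc in documents:
        if doc.subset in ("train", "validation"):
            # Training Pages counts pages in the Training+Validation set.
            for page in doc.pages:
                for name in page.labelled_fields:
                    training_pages[name] += 1
        elif doc.subset == "evaluation":
            # Evaluation Documents counts each document once per field.
            seen = set().union(*(p.labelled_fields for p in doc.pages))
            for name in seen:
                evaluation_docs[name] += 1
    names = sorted(set(training_pages) | set(evaluation_docs))
    return {n: (training_pages[n], evaluation_docs[n]) for n in names}
```

Note the different units: a field labelled on two pages of one evaluation document still counts as a single evaluation document, while each labelled training page is counted separately.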

The possible Status values are:

Red - There is insufficient data about the field. More labels are required.
Orange - More pages need to be labelled for the results to be relevant.
Green - There are enough labelled pages for the results to be relevant.

The Refresh and Close buttons apply to both tabs. For example, selecting Refresh on the Dataset tab also refreshes the Fields tab.

Refresh - Use this option after changing the dataset, whether the total number of pages or the number of labelled pages. The popup menu refreshes automatically every few minutes, on both tabs simultaneously. Use Refresh when you need an update outside the automatic interval.
Close - Once you have the information you need, select Close. The entire popup menu closes, regardless of the tab from which the button is selected.

Calculator tab

The Calculator tab shows the information you entered when you created the document type.

You can use the Dataset Calculator to modify parts of the information you entered when the document type was created.

You can modify the following fields with the Dataset Calculator:

Out-of-the-box document type
Number of languages
Number of layouts

The following fields on the Calculator tab are read-only. Their values are determined by intersecting the fields of the selected out-of-the-box document type with the current schema fields:

Out-of-the-box regular fields
Out-of-the-box column fields
Out-of-the-box classification fields

Modifying any of the editable fields changes the recommended size of the dataset. The Dataset tab in the open popup is updated to a green, orange, or red status based on the new recommended size. Once the changes are saved, the overall Dataset Diagnostics indicator takes the new Dataset tab status into account.

Suppose that when you created the document type, you selected Invoices in the Out-of-the-box document type field. If you change that choice to something else, Receipts for example, the dataset assimilates the information for both document types and displays the information that the two types (Invoices and Receipts) have in common.

If a field is present in only one of the models, it shows up under Custom regular fields or Custom column fields, because these changes apply to both regular and classification fields.
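
The split between out-of-the-box and custom fields can be sketched with set operations. This is one possible reading of the behaviour described above, not the product's implementation: the `split_fields` function, the field kinds used as dictionary keys, and every field name in the usage example are hypothetical.

```python
def split_fields(out_of_the_box: dict[str, set[str]],
                 schema: dict[str, set[str]]) -> dict[str, dict[str, list[str]]]:
    """Split schema fields into out-of-the-box and custom, per field kind.

    Both arguments map a field kind (for example "regular" or "column")
    to a set of field names. All names are illustrative assumptions.
    """
    summary = {}
    for kind, schema_names in schema.items():
        oob_names = out_of_the_box.get(kind, set())
        summary[kind] = {
            # Fields shared by the out-of-the-box type and the schema.
            "out_of_the_box": sorted(schema_names & oob_names),
            # Schema fields the out-of-the-box type does not have.
            "custom": sorted(schema_names - oob_names),
        }
    return summary


# Hypothetical usage: an out-of-the-box type compared with the current schema.
oob_type = {
    "regular": {"invoice-no", "date", "total"},
    "column": {"description", "quantity"},
}
current_schema = {
    "regular": {"date", "total", "store-name"},
    "column": {"description"},
}
summary = split_fields(oob_type, current_schema)
```

In this example, `date` and `total` would count as out-of-the-box regular fields, while `store-name`, which exists only in the schema, would be treated as a custom regular field.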
