Activities - Data Extraction Scope

Data Extraction Scope

Data Extraction Scope activity, providing a scope for extractor activities configured against taxonomy-defined document types.

UiPath.IntelligentOCR.Activities.DataExtraction.DataExtractionScope

Description

Provides a scope for extractor activities, enabling you to configure them according to the document types defined in your taxonomy. The output of the activity is stored in an ExtractionResult variable, containing all automatically extracted data, and can be used as input for the Export Extraction Results activity. This activity also features a Configure Extractors wizard, which lets you specify exactly what fields from the document types defined in the taxonomy you want to extract.

Project compatibility

Windows - Legacy | Windows

Configuration

Designer panel

Input

DocumentPath - The path to the document you want to validate. This field supports only strings and String variables.
Note:
The supported file types for this property field are .png, .gif, .jpe, .jpg, .jpeg, .tiff, .tif, .bmp, and .pdf.
DocumentText - The text of the document itself, stored in a String variable. This value can be retrieved from the Digitize Document activity. Visit Digitize Document for more information on how to achieve this. This field supports only strings and String variables.
DocumentObjectModel - The Document Object Model you want to use to validate the document against. This model is stored in a Document variable and can be retrieved from the Digitize Document activity. Visit Digitize Document for more information on how to achieve this. This field supports only Document variables.
Taxonomy - The Taxonomy against which the document is to be processed, stored in a DocumentTaxonomy variable. This object can be obtained by using a Load Taxonomy activity. This field supports only DocumentTaxonomy variables.
ClassificationResults - The results of running a classifier activity on the specified document, stored in a ClassificationResult object. This field is optional if you specify a DocumentTypeID instead. This field supports only ClassificationResult variables.
DocumentTypeID - The Document Type ID, as found in the Taxonomy Manager. This field is optional if you provide a value in the ClassificationResults field. This field supports only strings and String variables.
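
The note on DocumentPath lists the supported file types. As a minimal sketch, a pre-check on the extension could look like the following. The function name and the case-insensitive comparison are assumptions made for illustration; they are not part of the activity.

```python
from pathlib import Path

# Extensions listed in the DocumentPath note.
SUPPORTED_EXTENSIONS = {
    ".png", ".gif", ".jpe", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp", ".pdf",
}


def is_supported_document(path: str) -> bool:
    """Return True if the file extension appears in the supported list.

    Assumption: the comparison is case-insensitive (".PDF" is treated like ".pdf").
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only inspects the file name; it does not confirm that the file content matches its extension.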

Output

ExtractionResults - The extraction results of the data extraction process, stored in an ExtractionResult variable.
Note:
If the page range for data extraction indicates that only a part of the original file is targeted, the Data Extraction Scope generates a file in the TEMP project folder that is then passed to the extractors. The temporary file contains only the page range that extractors should receive for document processing.

Properties panel

Authentication

The Authentication properties of this activity allow you to perform auto-validation via on-premises robots. Before configuring these properties, ensure you have fulfilled the prerequisites mentioned in the Configuring Authentication page. Once these steps are completed, you can then proceed to fill in the Authentication properties of the activity.

Runtime Credentials Asset - Use this field when you need to access Document Understanding auto-validation features while the robot is connected to a local Orchestrator, or from a different tenant. You can choose to enter a Credential Asset, for authentication purposes, in one of the following ways:
- From the dropdown list, select the desired Credential Asset from the Orchestrator to which the UiPath® Robot is connected.
- Manually enter the path to the Orchestrator Credential Asset where you store the external application credentials for accessing the auto-validation features.
  
  The format of the path should be: <OrchestratorFolderName>/<AssetName>.
Runtime Tenant Url - Use this field alongside the Runtime Credentials Asset field. Enter the URL of the tenant that the robot will connect to in order to execute the auto-validation. The URL should be in the following format: https://<baseURL>/<OrganizationName>/<TenantName>.
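
To make the two formats above concrete, here is a small sketch that assembles both strings. The helper names and all example values (folder, asset, host, organization, tenant) are hypothetical and serve only to illustrate the expected shape.

```python
def credential_asset_path(folder_name: str, asset_name: str) -> str:
    """Build a path in the form <OrchestratorFolderName>/<AssetName>."""
    return f"{folder_name.strip('/')}/{asset_name.strip('/')}"


def runtime_tenant_url(base_url: str, organization: str, tenant: str) -> str:
    """Build a URL in the form https://<baseURL>/<OrganizationName>/<TenantName>."""
    return f"https://{base_url.strip('/')}/{organization.strip('/')}/{tenant.strip('/')}"


# Hypothetical example values.
print(credential_asset_path("Finance", "DuExternalAppCredentials"))
print(runtime_tenant_url("orchestrator.example.com", "Acme", "DefaultTenant"))
```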

Common

DisplayName - The display name of the activity.

Input

ApplyAutoValidation - Adjusts confidence using generative extraction cross-checking. If values are auto-validated, the confidence of those values is set to the confidence threshold. Enabling this feature consumes additional AI units.
ClassificationResults - The results of running a classifier activity on the specified document, stored in a ClassificationResult object. This field is optional if you specify a DocumentTypeID instead. This field supports only ClassificationResult variables.
DocumentObjectModel - The Document Object Model you want to use to validate the document against. This model is stored in a Document variable and can be retrieved from the Digitize Document activity. Visit Digitize Document for more information on how to achieve this. This field supports only Document variables.
DocumentPath - The path to the document you want to validate. This field supports only strings and String variables.
Note:
The supported file types for this property field are .png, .gif, .jpe, .jpg, .jpeg, .tiff, .tif, .bmp, and .pdf.
DocumentText - The text of the document itself, stored in a String variable. This value can be retrieved from the Digitize Document activity. Visit Digitize Document for more information on how to achieve this. This field supports only strings and String variables.
DocumentTypeID - The Document Type ID, as found in the Taxonomy Manager. This field is optional if you provide a value in the ClassificationResults field. This field supports only strings and String variables.
FormatValuesIfPossible - When enabled, the Data Extraction Scope tries to compute the derived parts of any value that has none reported, and leaves values that already have derived parts unchanged. If the option is set to False, the values are not formatted.
AutoValidationConfidenceThreshold - Confidence threshold for generative validation. Only field values with confidence below this threshold will be validated. If values are confirmed, the confidence of those values will be set to this threshold.
Taxonomy - The Taxonomy against which the document is to be processed, stored in a DocumentTaxonomy variable. This object can be obtained by using a Load Taxonomy activity. This field supports only DocumentTaxonomy variables.
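
The interaction between ApplyAutoValidation and AutoValidationConfidenceThreshold can be sketched as follows. This is an illustrative model of the rule described above, not the activity's implementation: the FieldValue type and the confirms callback (standing in for the generative cross-check) are hypothetical, and confidences are assumed to use the same scale as the threshold.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class FieldValue:
    """Hypothetical stand-in for one extracted field value."""
    name: str
    value: str
    confidence: float


def apply_auto_validation(
    values: List[FieldValue],
    threshold: float,
    confirms: Callable[[FieldValue], bool],
) -> List[FieldValue]:
    """Raise confirmed low-confidence values to the threshold.

    Only values whose confidence is below the threshold are checked.
    Confirmed values get their confidence set to the threshold;
    all other values are returned unchanged.
    """
    result = []
    for fv in values:
        if fv.confidence < threshold and confirms(fv):
            fv = replace(fv, confidence=threshold)
        result.append(fv)
    return result
```

Note that a value already at or above the threshold is never passed to the callback, which mirrors the statement that only values below the threshold are validated.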

Misc

Private - If selected, the values of variables and arguments are no longer logged at Verbose level.

Output

ExtractionResults - The extraction results of the data extraction process, stored in an ExtractionResult variable.
Note:
If the page range for data extraction indicates that only a part of the original file is targeted, the Data Extraction Scope generates a file in the TEMP project folder that is then passed to the extractors. The temporary file contains only the page range that extractors should receive for document processing.

Using the Configure Extractors Wizard

The Configure Extractors Wizard can be accessed via the Data Extraction Scope and allows you to choose which extractors are applied to each document type and field.

From the body of the activity, select Configure Extractors. The wizard button becomes available after dragging at least one extractor activity into the body of the Data Extraction Scope activity. This wizard displays all the document types defined in the taxonomy and their respective fields, and enables you to choose which extractor you want to use for each.

Figure 1. Overview of the Configure Extractors wizard

Each document type can be expanded and its fields can be viewed in the wizard and selected for extraction.

Figure 2. The selection of an extractor for a document type in the Configure Extractors wizard

The Framework Alias field can be used to map an extractor to one or more trainers. For instance, you can give a Machine Learning Extractor the alias R2D2 and then use the same alias for a Machine Learning Extractor Trainer. This creates a link between the extractor and the trainer, which is used for training the extractor. Each extractor has a unique alias, while multiple trainers can share the same alias.
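
The alias rule can be illustrated with a short sketch: extractor aliases must not repeat, while several trainers may point at the same alias. The dictionaries and function names below are hypothetical and only model the relationship; the second trainer is an invented example.

```python
from collections import Counter
from typing import Dict, List


def duplicated_extractor_aliases(extractor_aliases: Dict[str, str]) -> List[str]:
    """Return aliases used by more than one extractor (expected to be empty)."""
    counts = Counter(extractor_aliases.values())
    return sorted(alias for alias, n in counts.items() if n > 1)


def trainers_linked_to(alias: str, trainer_aliases: Dict[str, str]) -> List[str]:
    """Return the names of all trainers that share the given alias."""
    return sorted(name for name, a in trainer_aliases.items() if a == alias)


# Activity display name -> Framework Alias (illustrative values).
extractors = {"Machine Learning Extractor": "R2D2"}
trainers = {
    "Machine Learning Extractor Trainer": "R2D2",
    "Second Trainer (hypothetical)": "R2D2",
}
```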

You can set the Minimum Confidence field to a confidence threshold between 0 and 100. The predicted value for a field is considered only if the prediction's confidence score is equal to or higher than the configured Minimum Confidence. If a prediction's confidence score is lower than the Minimum Confidence threshold, the predicted value is not stored in the output of the Data Extraction Scope activity.
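
As a minimal sketch of this rule, the filter below keeps a prediction only when its confidence is equal to or higher than the Minimum Confidence. The data shape (field name mapped to a value and a confidence on a 0 to 100 scale) and the field names are assumptions for illustration.

```python
from typing import Dict, Tuple

Prediction = Tuple[str, float]  # (value, confidence on a 0-100 scale)


def filter_by_minimum_confidence(
    predictions: Dict[str, Prediction],
    minimum_confidence: float,
) -> Dict[str, Prediction]:
    """Keep only predictions whose confidence is >= minimum_confidence."""
    if not 0 <= minimum_confidence <= 100:
        raise ValueError("Minimum Confidence must be between 0 and 100")
    return {
        name: (value, confidence)
        for name, (value, confidence) in predictions.items()
        if confidence >= minimum_confidence
    }
```

A prediction exactly at the threshold is kept, because the comparison is inclusive.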

Tip:

You can identify an optimal confidence level by testing various documents within your workflow, recording the results (in an Excel spreadsheet, for example), and then analyzing which threshold value gives the most accurate results.

For extractors that support this functionality, select Get or refresh extractor capabilities to easily map your taxonomy fields to the available extractor fields, or to refresh them if the extractor fields have changed.

Selecting the check box next to a field in any column causes the Data Extraction Scope to request that particular field from the extractor. If the check box is cleared, the Data Extraction Scope does not request a value for that field from the extractor.

The text inputs next to each field enable you to map fields defined in your Taxonomy to the fields defined in the extractor's internal taxonomy, if any. For regular fields, enter in the text input the identifier of the target field from the extractor's internal taxonomy. For table fields, the parent table field is mapped at the table level, and the corresponding columns are mapped individually.

Note:

When using the Machine Learning Extractor in a setup with defined Column Fields, these can be mapped to a table field from your Taxonomy. They will be displayed under a collection called items.

The number of columns in the wizard varies according to the number of extractors present in the scope activity. The name of each column is given by the display name of each extractor activity.

Figure 3. Multiple extractors present in the Configure Extractors wizard

If multiple extractors are used in the activity, the order of the extractors in the scope defines their priority. For example, consider three extractors. If Extractor 1 returns an acceptable value (one that meets the Minimum Confidence level) for a particular requested field, that field is not requested when Extractor 2 and Extractor 3 are executed. If Extractor 1 and Extractor 2 return values below the Minimum Confidence level for that field, or return nothing at all, the results from Extractor 3 are taken into account, provided they satisfy the confidence acceptability conditions.
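
The priority behavior can be sketched as a simple fallback loop. This is an illustrative model only: each extractor is represented by a hypothetical callable that returns a value and a confidence (or nothing), paired with its own Minimum Confidence, and the comparison is assumed to be inclusive.

```python
from typing import Callable, Optional, Sequence, Tuple

Prediction = Tuple[str, float]  # (value, confidence on a 0-100 scale)
Extractor = Callable[[str], Optional[Prediction]]


def extract_with_priority(
    field: str,
    extractors: Sequence[Tuple[Extractor, float]],
) -> Optional[Prediction]:
    """Ask extractors in order and return the first acceptable prediction.

    `extractors` is a list of (extractor, minimum_confidence) pairs in
    priority order. Once an acceptable value is found, later extractors
    are not asked for this field. Returns None if none is acceptable.
    """
    for extractor, minimum_confidence in extractors:
        prediction = extractor(field)
        if prediction is not None and prediction[1] >= minimum_confidence:
            return prediction
    return None
```

Because the loop returns as soon as one extractor produces an acceptable value, placing the most reliable extractor first avoids unnecessary requests to the others.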

Document Understanding Integration

The Data Extraction Scope activity is part of the Document Understanding solutions. Visit the Document Understanding Guide for more information.
