Document Understanding User Guide
Last updated Nov 22, 2024

Digitization overview

What is digitization

Digitization is the process of obtaining machine-readable text from an incoming file, so that a robot can understand its contents and act on them. It is the first step applied to files that need to be processed through the Document Understanding™ framework.

The digitization step has two outputs:

  • the text from the processed file, stored in a string variable, and
  • the Document Object Model of that file, a JSON object containing basic information such as name, content type, text length, and number of pages, as well as detailed information such as page rotation, detected language, and the content and coordinates of every word identified in the file.
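
To illustrate how these two outputs relate, the following Python sketch walks over a hand-written JSON object shaped like the description above and rebuilds the plain text from its words. The field names (`name`, `contentType`, `textLength`, `pages`, `rotation`, `language`, `words`, `box`) and the helper functions are assumptions made for illustration; they are not the actual Document Object Model schema.

```python
import json

# Hypothetical, simplified stand-in for a Document Object Model.
# Field names are illustrative assumptions, not the real schema.
SAMPLE_DOM = """
{
  "name": "invoice.pdf",
  "contentType": "application/pdf",
  "textLength": 19,
  "pages": [
    {
      "index": 0,
      "rotation": 0,
      "language": "en",
      "words": [
        {"text": "Invoice", "box": [40, 50, 60, 12]},
        {"text": "No.", "box": [105, 50, 22, 12]},
        {"text": "INV-001", "box": [132, 50, 64, 12]}
      ]
    }
  ]
}
"""


def summarize(dom: dict) -> dict:
    """Return the basic, document-level information from the sample structure."""
    return {
        "name": dom["name"],
        "content_type": dom["contentType"],
        "text_length": dom["textLength"],
        "page_count": len(dom["pages"]),
    }


def iter_words(dom: dict):
    """Yield (page index, word text, bounding box) for every word."""
    for page in dom["pages"]:
        for word in page["words"]:
            yield page["index"], word["text"], tuple(word["box"])


dom = json.loads(SAMPLE_DOM)
summary = summarize(dom)

# Rebuild the plain text by joining the words in reading order.
text = " ".join(word for _, word, _ in iter_words(dom))
```

In this sample, the rebuilt `text` plays the role of the string output, while `dom` carries the per-word coordinates that later steps can use to locate values on the page.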

In the Document Processing Framework, digitization is performed using the Digitize Document activity.

What digitization is not

Although the two are related, digitization is not the same as OCR.

In many cases, the files that need to be processed are native (not scanned) PDF files, which the robot can read programmatically without applying OCR.

When is OCR used in digitization

The Digitize Document activity requires you to select an OCR engine as part of its configuration, so that the engine is available when needed. However, the activity executes OCR only on:

  • files that are images
    • supported image formats are .png, .jpe, .jpg, .jpeg, .tiff, .tif, .bmp
    • for multi-page TIFF files, OCR is applied for each page
  • PDF pages that
    • do not expose any machine-readable content
    • contain images that cover a significant area of the page.
Note: The following digitization limitations apply:
  • The maximum file size is 160 MB.
  • The maximum number of pages per document is 500.
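
The two limits in the note can be checked before a file is sent to digitization. The Python sketch below is one possible pre-flight check, not part of the product. The function name `check_digitization_limits` is hypothetical, the page count is assumed to be known already, and 1 MB is assumed to mean 1024 * 1024 bytes, which the note does not specify.

```python
# Limits quoted in the note above.
MAX_FILE_SIZE_MB = 160
MAX_PAGES = 500

# Assumption: 1 MB is treated as 1024 * 1024 bytes; the note does not say.
BYTES_PER_MB = 1024 * 1024


def check_digitization_limits(size_bytes: int, page_count: int) -> list[str]:
    """Return a list of limit violations; an empty list means the file fits."""
    problems = []
    if size_bytes > MAX_FILE_SIZE_MB * BYTES_PER_MB:
        problems.append(f"file size exceeds {MAX_FILE_SIZE_MB} MB")
    if page_count > MAX_PAGES:
        problems.append(f"page count exceeds {MAX_PAGES} pages")
    return problems
```

Returning a list of messages rather than a single boolean lets the caller report every violated limit at once.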

OCR is always applied if the Digitize Document activity is configured with the ForceApplyOCR flag set to True. This option is usually recommended for use cases in which a significant percentage of files appear to contain native content, but the natively read content does not match what a user can see in those files.
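
The rules in this section can be summarized as a small per-page decision function. The Python sketch below is only a model of the described behavior, not the activity's implementation. `PageInfo`, `needs_ocr`, and the coverage threshold are hypothetical; in particular, the source only says "a significant area of the page", so the value 0.5 is a placeholder assumption.

```python
from dataclasses import dataclass

# Image formats listed in this section.
IMAGE_EXTENSIONS = {".png", ".jpe", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp"}

# Assumption: "significant area" is modeled as a coverage ratio.
# The value is a placeholder; the real threshold is not documented here.
SIGNIFICANT_IMAGE_COVERAGE = 0.5


@dataclass
class PageInfo:
    """Hypothetical per-page facts needed to make the decision."""
    has_native_text: bool   # the page exposes machine-readable content
    image_coverage: float   # fraction of the page area covered by images, 0..1


def needs_ocr(extension: str, page: PageInfo, force_apply_ocr: bool = False) -> bool:
    """Return True if OCR would be applied to this page under the rules above."""
    if force_apply_ocr:
        return True  # ForceApplyOCR overrides every other check
    if extension.lower() in IMAGE_EXTENSIONS:
        return True  # image files, including each page of a multi-page TIFF
    if not page.has_native_text:
        return True  # PDF page with no machine-readable content
    # PDF page with native text, but images cover a significant area
    return page.image_coverage >= SIGNIFICANT_IMAGE_COVERAGE
```

For example, `needs_ocr(".pdf", PageInfo(has_native_text=True, image_coverage=0.1))` evaluates to False, while the same call with `force_apply_ocr=True` evaluates to True.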

How to choose your OCR engine

Because each use case has its own particularities, it is strongly recommended to test all available OCR engines with different settings to determine which one works best for your project. Also pay particular attention to the OCR engine arguments, such as Profile, Scale, and Language (these may vary from one engine to another), so that you can identify the best settings for each use case.
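
One way to make such a comparison repeatable is to score each engine and settings combination against text you have verified by hand. The Python sketch below assumes the OCR outputs have already been collected as strings. The labels, sample strings, and function names are invented for illustration, and the similarity ratio from the standard `difflib` module is just one possible metric.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Return a 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, expected, actual).ratio()


def rank_configurations(expected: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Sort configuration labels by similarity to the expected text, best first."""
    scored = [(label, similarity(expected, text)) for label, text in outputs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hand-written sample data: labels and outputs are invented for illustration.
expected_text = "Total due: 1,250.00 EUR"
sample_outputs = {
    "engine_a_scale_1": "Total due: 1,250.00 EUR",
    "engine_a_scale_2": "Total due: 1,25O.OO EUR",
    "engine_b_default": "Tota1 due 1.250,00 EUR",
}

ranking = rank_configurations(expected_text, sample_outputs)
best_label, best_score = ranking[0]
```

In practice, the scores would be averaged over a representative set of documents rather than a single string, since an engine that wins on one file may lose on another.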
