Document Understanding Activities
RegEx Based Extractor
UiPath.IntelligentOCR.Activities.DataExtraction.RegexBasedExtractor
Description
Enables you to create and use custom regular expressions to extract information from a document. This activity can be used only together with the Data Extraction Scope activity.
This activity cannot work with Set or Boolean fields.
Project compatibility
Windows - Legacy | Windows
Configuration
Designer panel
Configure Expressions - Opens the Configure Regular Expressions wizard.
Properties panel
Common
- DisplayName - The display name of the activity.
Input
- Configuration - Specifies the configuration value for the extractor as a JSON-escaped string. Use the extractor wizard to generate the configuration. You can keep the configuration in the Properties panel, as a string, or you can define it by using the wizard and bind it to a variable. It is advisable to edit the Configuration field by using the wizard and not the Properties panel.
- Timeout - Specifies the timeout value for any Regex search, in milliseconds. A timeout of 0, or a negative value, is interpreted as infinite. The default value is 2000.
- UseVisualAlignment - If selected, the regular expressions are applied to a text version generated from visual word alignments (words separated by a single space character, lines separated by a single newline character, and pages separated by two newline characters). The default value is False. Use this option for complex layouts where it is easier to write regular expressions based on how words are visually arranged on lines, ignoring any sentence, paragraph, or layout group otherwise identified in the document.
Misc
- Private - If selected, the values of variables and arguments are no longer logged at Verbose level.
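The text version that UseVisualAlignment describes can be pictured with a small sketch. The Python below is an illustration only, not the activity's implementation: the nested-list input shape and the `visual_text` helper are assumptions made for this example, and the page separator is assumed to be two newline characters. It joins words with a single space and lines with a single newline, then applies an ordinary regular expression to the result.

```python
import re

def visual_text(pages):
    """Build a plain-text view from pages -> lines -> words (hypothetical input shape)."""
    page_texts = []
    for lines in pages:
        # Words on the same visual line are joined by a single space,
        # and lines on the same page by a single newline.
        page_texts.append("\n".join(" ".join(words) for words in lines))
    # Assumption: pages are separated by two newline characters.
    return "\n\n".join(page_texts)

pages = [
    [["Invoice", "No:", "INV-001"], ["Total:", "120.50"]],
    [["Page", "2"]],
]
text = visual_text(pages)

# The label and the value sit on the same visual line, so a simple pattern is enough.
match = re.search(r"Total:\s(\d+\.\d{2})", text)
invoice_total = match.group(1) if match else None
print(invoice_total)
```

Writing the pattern against visual lines avoids having to account for how the digitization step grouped words into sentences or paragraphs.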
Using the Configure Regular Expressions wizard
- Add a RegEx Based Extractor activity to your workflow, inside a Data Extraction Scope activity.
- Configure your regular expressions by selecting Configure Expressions. The wizard window opens.
Figure 1. Overview of the Configure Regular Expressions wizard

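- Expand a document type entry to view all of its defined fields and start configuring your regular expressions. The document types and their corresponding fields are read automatically from the project's taxonomy. The regular expression configuration option is available for each field in the taxonomy. You may encounter the following configuration options in the wizard:
  - A document type can contain a simple (regular) field, which is displayed when you expand the document type. For a simple field, only a single regular expression can be defined, using the Configure Regular Expressions wizard that opens when you select Edit next to that field.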
Figure 2. A document type in the Configure Regular Expressions wizard that has a regular field defined

  - A document type can contain a table field, which is displayed when you expand the document type, together with its configuration options, such as an expression for the entire table content or an expression for individual rows. The following settings are available for a table field:
    - The Table Value RegEx can be used for capturing an entire table area. If no value is added in the Table field line, the entire text content of the document is used for table processing.
    - The Rows Value RegEx can be used for capturing an entire row from a given table capture. If no value is added in the Rows field line, the table area is split by end-of-line. Each captured value is then treated as a row on which the column extraction is applied.
    - The Column Value RegEx can be used for capturing the value of a specific column from each captured row.
Figure 3. A document type in the Configure Regular Expressions wizard that has a table field defined

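Scenarios for using the Table, Rows, and Column RegEx
Check the following possible scenarios for using the available table RegEx options:
- If you leave both the Table RegEx and the Rows RegEx fields empty, all lines in the text version of the document are used for applying the column-level RegEx to identify cell values.
- If you define a RegEx to capture the table area but leave the Rows RegEx empty, all lines in the table are processed individually with each Column RegEx to capture cell values.
- If you leave the Table RegEx empty but define a Rows RegEx, all text captured by the Rows RegEx is used, and the Column RegEx is applied to capture the cell values of each row.
- If you fill in both the Table RegEx and the Rows RegEx, the activity applies the Table RegEx to identify the table string, then the Rows RegEx to identify each row, and then the column-level RegEx to capture the cell values.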
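The four scenarios above amount to a three-stage cascade with two optional stages. The following Python sketch mirrors that description; it is not the activity's code, the activity itself uses .NET regular expressions, and the `extract_table` helper, its parameters, and the sample document are all hypothetical.

```python
import re

def extract_table(text, column_patterns, table_pattern=None, rows_pattern=None):
    """Hypothetical helper: returns one dict per row, keyed by column name."""
    # Table RegEx: narrow the text to the table area, else use the whole text.
    if table_pattern:
        found = re.search(table_pattern, text, re.DOTALL)
        area = found.group(1) if found else ""
    else:
        area = text
    # Rows RegEx: capture each row, else split the area by end-of-line.
    if rows_pattern:
        rows = [m.group(1) for m in re.finditer(rows_pattern, area)]
    else:
        rows = [line for line in area.splitlines() if line.strip()]
    # Column RegEx: capture one cell value per column from every row.
    table = []
    for row in rows:
        cells = {}
        for name, pattern in column_patterns.items():
            cell = re.search(pattern, row)
            cells[name] = cell.group(1) if cell else None
        table.append(cells)
    return table

document = "Items\nWidget 2 9.99\nGadget 1 24.50\nEnd of items\nThank you"
columns = {"name": r"^(\w+)", "qty": r"\s(\d+)\s", "price": r"(\d+\.\d{2})$"}

# Scenario: Table RegEx defined, Rows RegEx left empty.
result = extract_table(document, columns, table_pattern=r"Items\n(.*?)\nEnd of items")
print(result)
```

In this sketch, leaving both optional patterns empty would also run the column expressions against lines such as "Items" and "Thank you", which is why narrowing the text with a table or row expression tends to give cleaner rows.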
- Add your regular expression in the Expression field. You can either write the whole RegEx in the Expression field or build it by using the Edit option.
Important: For any of the regular expressions you define, make sure you have at least one capture group. Only the captured parts of an expression are used for value reporting.
- Select the dropdown list from the Regex Options column. This multi-select list lets you set various regex options.
You can choose from the following options:
  - CultureInvariant - Specifies that cultural differences in language are ignored.
  - ECMAScript - Enables ECMA (European Computer Manufacturers Association) Script compliant behavior for the expression. This value can be used only in conjunction with the IgnoreCase and Multiline options.
  - ExplicitCapture - Specifies that the only valid captures are the ones of groups that are explicitly named or numbered and are defined as (?<name> subexpression). Any unnamed parentheses are ignored.
  - IgnoreCase - Specifies that the search is case-insensitive.
  - IgnorePatternWhitespace - Eliminates unescaped white space from the defined pattern and enables comments marked with # (the hash symbol). This option does not apply to character classes, numeric quantifiers, or tokens marking the beginning of an individual RegEx language element.
  - Singleline - Specifies single-line mode. The dot (.) matches every character, including \n, which it otherwise excludes.
  - Multiline - Specifies multiline mode. With this option, the special characters ^ and $ match the beginning and end of any line.
  - RightToLeft - Specifies that the search is performed from right to left.
Note: See RegexOptions Enum for more information about the regular expression options you can use.
Figure 4. The expanded Regex Options dropdown showing the available options
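The capture-group requirement and a few of the options can be tried out in a short sketch. The activity relies on .NET regular expressions; the Python below uses the `re` module as a stand-in, so the flags shown (`re.IGNORECASE`, `re.MULTILINE`, `re.DOTALL`, `re.VERBOSE`) are close analogs of IgnoreCase, Multiline, Singleline, and IgnorePatternWhitespace rather than the same options. The sample text is invented for illustration.

```python
import re

text = "ACME Corp\nInvoice Number: INV-2041\nTotal due:\n120.50 EUR"

# Only the captured part (group 1) is the value; the label is matched but not reported.
# IgnoreCase analog: the lowercase pattern still matches "Invoice Number".
number = re.search(r"invoice number:\s*(\S+)", text, re.IGNORECASE).group(1)

# Multiline analog: ^ matches at the start of every line, not only at the start of the text.
line_starts = re.findall(r"^(\w+)", text, re.MULTILINE)

# Singleline analog (re.DOTALL in Python): the dot also matches \n.
total = re.search(r"Total due:.(\d+\.\d{2})", text, re.DOTALL).group(1)

# IgnorePatternWhitespace analog (re.VERBOSE): unescaped whitespace is
# dropped from the pattern and # starts a comment.
currency = re.search(r"""
    \d+\.\d{2}   # the amount
    \s
    ([A-Z]{3})   # captured currency code
""", text, re.VERBOSE).group(1)

print(number, line_starts, total, currency)
```

Without the Singleline analog, the third pattern finds nothing, because the dot cannot cross the line break between the label and the amount.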

RegEx Builder wizard
- Select Edit to edit the options of that field and the format of the regular expression. The RegEx Builder wizard opens.
Figure 5. Overview of the RegEx Builder wizard

- Enter your text in the Test Text field. This is the text against which the RegEx is applied, based on the search criteria you choose. Then enter a value in the Value field of the RegEx; the matching text is highlighted in the Test Text field.
Figure 6. Entering text in the Test Text field and highlighting a certain value from it using the Value field

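- Select a RegEx type from the dropdown list. This sets the RegEx to match one of the following characteristics:
  - Literal - Matches the exact characters you specify. This option is case-sensitive.
  - Digit - Matches digits.
  - One of - Matches a single character that is present in the set.
  - Not one of - Matches a single character that is not present in the set.
  - Anything - Matches any character except \n.
  - Any word character - Matches any letter or digit.
  - Whitespace - Matches a whitespace character.
  - Starts with - Anchors the search at the beginning of the line.
  - Ends with - Anchors the search at the end of the line.
  - Advanced - Requires a custom expression.
  - Email - Matches an email address.
  - URL - Matches a URL.
  - US Date - Matches the US date format.
  - US Phone number - Matches the US phone number format.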
Figure 7. The dropdown list showing the available characteristics for the regular expression
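Note: See .NET regular expressions for more information about regular expressions in .NET.
- Use the Value field to write the value of the RegEx.
- Select a quantifier from the Quantifiers dropdown list. You can choose from the following options:
  - Exactly - Matches the preceding element exactly the specified number of times. By default, it is set to 1.
  - Any (0 or more) - Matches the preceding element zero or more times, but as few times as possible.
  - At least one (1 or more) - Matches the preceding element one or more times.
  - Zero or one - Matches the preceding element zero or one time, but as few times as possible.
  - Between x and y times - Matches the preceding element between x and y times, where x and y are integers, but as few times as possible.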
- To edit the fields, you can use the following options:
  - Select Add to add an extra RegEx field.
  - Select Move up and Move down to move fields up and down in the hierarchy.
  - Select Remove to delete the field.
- Select the Capture checkbox if you want that specific field to be extracted.
- The Full Expression field shows the entire expression exactly as you customized it.
- Select one or multiple options from the Regex Options dropdown list.
Figure 8. The available options in the Regex Options dropdown list

- Select Save once all your configurations are done to exit the Edit mode.
- Select Save again to close the wizard.
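To make the relationship between the builder rows and the Full Expression field more concrete, here is a rough Python sketch of how typed parts, quantifiers, and the Capture option could combine into one pattern. Everything in it is an assumption for illustration: the `TYPES` and `QUANTIFIERS` tables, the `build` helper, and the exact fragments emitted are hypothetical and are not taken from the wizard, which produces .NET expressions. Quantifiers described above as matching "as few times as possible" are rendered as lazy forms.

```python
import re

# Hypothetical mapping from builder types to pattern fragments.
TYPES = {
    "Literal": lambda value: re.escape(value),
    "Digit": lambda value: r"\d",
    "One of": lambda value: "[" + re.escape(value) + "]",
    "Not one of": lambda value: "[^" + re.escape(value) + "]",
    "Anything": lambda value: ".",
    "Any word character": lambda value: r"\w",
    "Whitespace": lambda value: r"\s",
}

# Hypothetical mapping from quantifiers to suffixes; lazy where the
# description says "as few times as possible".
QUANTIFIERS = {
    "Exactly": lambda x, y: "" if x == 1 else "{%d}" % x,
    "Any": lambda x, y: "*?",
    "At least one": lambda x, y: "+",
    "Zero or one": lambda x, y: "??",
    "Between": lambda x, y: "{%d,%d}?" % (x, y),
}

def build(parts):
    """Join (type, value, quantifier, x, y, capture) rows into one expression."""
    pieces = []
    for kind, value, quantifier, x, y, capture in parts:
        fragment = TYPES[kind](value)
        suffix = QUANTIFIERS[quantifier](x, y)
        if kind == "Literal" and len(value) > 1 and suffix:
            # Group the literal so the quantifier applies to all of it.
            fragment = "(?:" + fragment + ")"
        fragment += suffix
        # Capture: wrap the part in a capture group so its value is reported.
        pieces.append("(" + fragment + ")" if capture else fragment)
    return "".join(pieces)

full_expression = build([
    ("Literal", "PO-", "Exactly", 1, None, False),
    ("Digit", "", "Between", 4, 6, True),
])
lazy_value = re.search(full_expression, "Order PO-12345 received").group(1)

# Adding a following element forces the lazy quantifier to extend.
anchored = build([
    ("Literal", "PO-", "Exactly", 1, None, False),
    ("Digit", "", "Between", 4, 6, True),
    ("Whitespace", "", "Exactly", 1, None, False),
])
anchored_value = re.search(anchored, "Order PO-12345 received").group(1)
print(lazy_value, anchored_value)
```

The first pattern captures 1234 because a lazy quantifier stops as soon as the rest of the pattern can match; the second captures 12345 because the whitespace part must match after the digits. This is worth keeping in mind when a captured value looks truncated.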
Document Understanding integration
The RegEx Based Extractor activity is part of the Document Understanding solutions.