AI Center User Guide

Last updated: November 11, 2024

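Light Text Classification

Out of the Box Packages > UiPath Language Analysis > Light Text Classification

This is a generic, retrainable model for text classification. It supports all languages based on Latin characters, such as English, French, and Spanish. This ML Package must be trained before use: if it is deployed without training, the deployment fails with an error stating that the model is not trained. The model operates on Bag of Words and provides explainability based on n-grams.

Model details

Input type

JSON and CSV

Input description

The text to be classified, as a string: "I loved this movie".

Output description

JSON with the predicted class and a confidence value between 0 and 1.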

{
    "class": "7",
    "confidence": 0.1259827300369445,
    "ngrams": [
        [
            "like",
            1.3752658445706787
        ],
        [
            "like this",
            0.032029048484416685
        ]
    ]
}
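
The output above can be consumed as sketched below. This is a minimal illustration, assuming the ML Skill response arrives as a JSON string with the layout shown; the helper name parse_prediction is hypothetical and not part of AI Center, and treating the "ngrams" key as optional is an assumption.

```python
import json

# Sample response copied from the output description above.
RESPONSE = """
{
    "class": "7",
    "confidence": 0.1259827300369445,
    "ngrams": [["like", 1.3752658445706787], ["like this", 0.032029048484416685]]
}
"""


def parse_prediction(raw: str) -> tuple[str, float, list[tuple[str, float]]]:
    """Return (class, confidence, ngrams) parsed from a response string.

    Assumption: "ngrams" may be absent, so it defaults to an empty list.
    """
    payload = json.loads(raw)
    ngrams = [(text, float(weight)) for text, weight in payload.get("ngrams", [])]
    return payload["class"], float(payload["confidence"]), ngrams


label, confidence, ngrams = parse_prediction(RESPONSE)
print(label, round(confidence, 3))
for text, weight in ngrams:
    print(f"{text!r}: {weight:.3f}")
```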

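Training enabled

Training is enabled by default.

Pipelines

This package supports all three pipeline types (Full Training, Training, and Evaluation). The model uses hyperparameter search to find a high-performing model. Hyperparameter search (the BOW.hyperparameter_search.enable variable) is enabled by default. The parameters of the best-performing model are listed in the Evaluation Report.

Dataset format

Three options are available for structuring the dataset for this model: JSON, CSV, and AI Center JSON format. The model reads all CSV and JSON files in the specified directory. For every format, the model expects two columns or two properties, named by dataset.input_column_name and dataset.target_column_name by default. The names of these two columns and/or directories are configurable through environment variables.

CSV file format

Each CSV file can have any number of columns, but the model uses only two of them. Those columns are specified by the dataset.input_column_name and dataset.target_column_name parameters.

Check the following sample and environment variables for a CSV file format example.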

text, label
I like this movie, 7
I hated the acting, 9

The environment variables for the previous example are as follows:

dataset.input_format: auto
dataset.input_column_name: text
dataset.target_column_name: label
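
To illustrate how the two column names select data from a CSV file, here is a minimal sketch. It is not the package's own loader: the function read_examples and the constants are hypothetical, and the sample is held in memory rather than read from the dataset directory.

```python
import csv
import io

# Hypothetical settings mirroring the environment variables above.
INPUT_COLUMN = "text"
TARGET_COLUMN = "label"

SAMPLE_CSV = """text, label
I like this movie, 7
I hated the acting, 9
"""


def read_examples(handle, input_column, target_column):
    """Yield (text, label) pairs, ignoring every other column."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.DictReader(handle, skipinitialspace=True)
    for row in reader:
        yield row[input_column].strip(), row[target_column].strip()


examples = list(read_examples(io.StringIO(SAMPLE_CSV), INPUT_COLUMN, TARGET_COLUMN))
print(examples)
```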

JSON file format

Multiple data points can be part of the same JSON file.

Check the following sample and environment variables for a JSON file format example.

[
  {
    "text": "I like this movie",
    "label": "7"
  },
  {
    "text": "I hated the acting",
    "label": "9"
  }
]

The environment variables for the previous example are as follows:

dataset.input_format: auto
dataset.input_column_name: text
dataset.target_column_name: label
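
A file in this layout can be produced from labeled pairs as sketched below. The function to_json_dataset is hypothetical, and storing labels as strings simply follows the sample above; it is not a documented requirement.

```python
import json


def to_json_dataset(pairs, input_column="text", target_column="label"):
    """Serialize (text, label) pairs into the list-of-objects layout above."""
    records = [
        {input_column: text, target_column: str(label)} for text, label in pairs
    ]
    return json.dumps(records, indent=2)


dataset = to_json_dataset([("I like this movie", 7), ("I hated the acting", 9)])
print(dataset)
```

The column names passed to the function should match the values of dataset.input_column_name and dataset.target_column_name.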

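ai_center file format

This is the default value of the configurable environment variables. With this format, the model reads all files with a .json extension in the provided directory.

Check the following sample and environment variables for an ai_center file format example.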

{
    "annotations": {
        "intent": {
            "to_name": "text",
            "choices": [
                "TransactionIssue",
                "LoanIssue"
            ]
        },
        "sentiment": {
            "to_name": "text",
            "choices": [
                "Very Positive"
            ]
        },
        "ner": {
            "to_name": "text",
            "labels": [
                {
                    "start_index": 37,
                    "end_index": 47,
                    "entity": "Stakeholder",
                    "value": " Citi Bank"
                },
                {
                    "start_index": 51,
                    "end_index": 61,
                    "entity": "Date",
                    "value": "07/19/2018"
                },
                {
                    "start_index": 114,
                    "end_index": 118,
                    "entity": "Amount",
                    "value": "$500"
                },
                {
                    "start_index": 288,
                    "end_index": 293,
                    "entity": "Stakeholder",
                    "value": " Citi"
                }
            ]
        }
    },
    "data": {
        "cc": "",
        "to": "[email protected]",
        "date": "1/29/2020 12:39:01 PM",
        "from": "[email protected]",
        "text": "I opened my new checking account with Citi Bank in 07/19/2018 and met the requirements for the promotion offer of $500 . It has been more than 6 months and I have not received any bonus. I called the customer service several times in the past few months but no any response. I request the Citi honor its promotion offer as advertised."
    }
}

To use the previous sample JSON, set the environment variables as follows:

dataset.input_format: ai_center
dataset.input_column_name: data.text
dataset.target_column_name: annotations.intent.choices
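
The dotted values above read naturally as paths into the nested JSON record. The sketch below shows that interpretation; it is an assumption about how the names are resolved, not the package's actual loader, and the resolve helper is hypothetical. The record is a shortened copy of the sample.

```python
from functools import reduce


def resolve(record: dict, dotted_path: str):
    """Follow a dotted path such as "data.text" through nested dictionaries."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), record)


# Shortened copy of the sample record above (only the last sentence of the text).
record = {
    "annotations": {
        "intent": {"to_name": "text", "choices": ["TransactionIssue", "LoanIssue"]},
    },
    "data": {
        "text": "I request the Citi honor its promotion offer as advertised.",
    },
}

text = resolve(record, "data.text")
labels = resolve(record, "annotations.intent.choices")
print(text)
print(labels)
```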

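Training on GPU or CPU

A GPU is not required for training.

Environment variables

dataset.input_column_name
- The name of the input column containing the text.
- Default value is data.text.
- Make sure this variable is configured according to your input JSON or CSV file.
dataset.target_column_name
- The name of the target column containing the labels.
- Default value is annotations.intent.choices.
- Make sure this variable is configured according to your input JSON or CSV file.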
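dataset.input_format
- The input format of the training data.
- Default value is ai_center.
- Supported values are: ai_center or auto.
- If ai_center is selected, only JSON files are supported. If ai_center is selected, make sure to also change the value of dataset.target_column_name to annotations.sentiment.choices.
- If auto is selected, both CSV and JSON files are supported.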
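BOW.hyperparameter_search.enable
- The default value for this parameter is True. If left enabled, the search finds the best-performing model within the given time frame and compute resources.
- It also generates a HyperparameterSearch_report PDF file that shows the parameter variations that were tried.
BOW.hyperparameter_search.timeout
- The maximum time the hyperparameter search is allowed to run, in seconds.
- Default value is 1800.
BOW.explain_inference
- If set to True, some of the most important n-grams are returned along with the prediction when the model is served as an ML Skill.
- Default value is False.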

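Optional variables

You can add more optional variables by clicking the Add new button. However, if the BOW.hyperparameter_search.enable variable is set to True, the optimal values of these variables are searched for automatically. For the model to use the following optional parameters as you set them, set the BOW.hyperparameter_search.enable variable to False:

BOW.lr_kwargs.class_weight
- Supported values are: balanced or None.
BOW.ngram_range
- The range of lengths of consecutive word sequences that can be considered as features for the model.
- Make sure to follow this format: (1, x), where x is the maximum sequence length you want to allow.
BOW.min_df
- Sets the minimum number of occurrences of an n-gram in the dataset for it to be considered as a feature.
- Recommended values are between 0 and 10.
dataset.text_pp_remove_stop_words
- Configures whether stop words (for example, words like the, or) should be included in the search.
- Supported values are: True or False.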
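
To make the effect of BOW.ngram_range and BOW.min_df concrete, here is a small standalone sketch. It is not the package's implementation: the whitespace tokenizer is a simplification, both function names are hypothetical, and min_df is assumed to count documents containing the n-gram (the convention used by the scikit-learn parameter of the same name), which may differ from how the package counts.

```python
from collections import Counter


def extract_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams whose length falls within ngram_range."""
    tokens = text.lower().split()  # simplistic tokenizer, for illustration only
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(texts, ngram_range=(1, 2), min_df=1):
    """Keep n-grams found in at least min_df documents (assumed semantics)."""
    document_counts = Counter()
    for text in texts:
        # set() ensures each document contributes at most once per n-gram.
        document_counts.update(set(extract_ngrams(text, ngram_range)))
    return sorted(g for g, count in document_counts.items() if count >= min_df)


texts = ["I like this movie", "I like the acting", "I hated the acting"]
print(extract_ngrams("I like this movie", (1, 2)))
print(build_vocabulary(texts, (1, 2), min_df=2))
```

Raising x in (1, x) adds longer phrases as candidate features, while raising min_df prunes rare ones.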

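Artifacts

The Evaluation Report is a PDF file containing the following information in a human-readable format:

ngrams per class
Precision-recall diagram
Classification report
Confusion matrix
Best model parameters for hyperparameter search

ngrams per class

This section contains the top 10 n-grams that affect the model's prediction for each class. There is a separate table for each class the model was trained on.

Precision-recall diagram

You can use this diagram and table to check the precision-recall trade-off, along with the F1 scores of the model. A table below the diagram lists thresholds with their corresponding precision and recall values. Use this table to choose the threshold to configure in your workflow for deciding when to send data to Action Center for human in the loop. Note that the higher the chosen threshold, the more data is routed to Action Center for human in the loop.

There is a precision-recall diagram for each class.

For an example of a precision-recall diagram, see the figure below.

For an example of a precision-recall table, see the table below.

Precision	Recall	Threshold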
0.8012232415902141	0.6735218508997429	0.30539842728983285
0.8505338078291815	0.6143958868894601	0.37825683923133907
0.9005524861878453	0.4190231362467866	0.6121292357073038
0.9514563106796117	0.2519280205655527	0.7916427288647211
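
The threshold choice described above can be sketched as follows. This is an illustration only: pick_threshold and needs_review are hypothetical helpers, the rows are the sample values from the table, and routing a prediction for review when its confidence is below the threshold is an assumption about how a workflow would apply it.

```python
# (precision, recall, threshold) rows copied from the sample table above.
PR_TABLE = [
    (0.8012232415902141, 0.6735218508997429, 0.30539842728983285),
    (0.8505338078291815, 0.6143958868894601, 0.37825683923133907),
    (0.9005524861878453, 0.4190231362467866, 0.6121292357073038),
    (0.9514563106796117, 0.2519280205655527, 0.7916427288647211),
]


def pick_threshold(table, min_precision):
    """Return the lowest threshold whose precision meets min_precision, or None."""
    candidates = [
        threshold for precision, _recall, threshold in table
        if precision >= min_precision
    ]
    return min(candidates) if candidates else None


def needs_review(confidence, threshold):
    """True when a prediction should be sent to Action Center for review."""
    return confidence < threshold


threshold = pick_threshold(PR_TABLE, min_precision=0.90)
print(threshold)
print(needs_review(0.1259827300369445, threshold))
```

Picking the lowest qualifying threshold keeps recall as high as possible for the required precision, which limits how much data is sent for human review.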

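Classification report

The classification report contains the following information:

Label - the label, as it appears in the test set
Precision - the fraction of predictions for this label that are correct
Recall - the fraction of actual instances of this label that the model retrieved
F1-score - the harmonic mean of precision and recall; you can use this score to compare two models
Support - the number of times a specific label appears in the test set

For an example of a classification report, see the table below.

Label	Precision	Recall	F1-score	Support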
0.0	0.805	0.737	0.769	319
1.0	0.731	0.812	0.77	389
2.0	0.778	0.731	0.754	394
3.0	0.721	0.778	0.748	392
4.0	0.855	0.844	0.85	385
5.0	0.901	0.803	0.849	395
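
As a worked check of the F1-score column, the sketch below recomputes it from the precision and recall values in the sample table using the harmonic mean, F1 = 2PR / (P + R). The function name is illustrative; small differences from the reported values come from the table's rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (label, precision, recall, reported F1, support) rows from the sample table.
REPORT = [
    ("0.0", 0.805, 0.737, 0.769, 319),
    ("1.0", 0.731, 0.812, 0.77, 389),
    ("2.0", 0.778, 0.731, 0.754, 394),
    ("3.0", 0.721, 0.778, 0.748, 392),
    ("4.0", 0.855, 0.844, 0.85, 385),
    ("5.0", 0.901, 0.803, 0.849, 395),
]

for label, precision, recall, reported, _support in REPORT:
    computed = f1_score(precision, recall)
    print(f"{label}: computed={computed:.4f} reported={reported}")
```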

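Confusion matrix

Best model parameters for hyperparameter search

When the BOW.hyperparameter_search.enable variable is set to True, the best model parameters picked by the algorithm are displayed in this table. To retrain the model with different parameters not covered by the hyperparameter search, you can also set these parameters manually as environment variables. For more information, see the Environment variables section.

For an example of this report, see the table below.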