Custom Named Entity Recognition
This model allows you to bring your own dataset tagged with the entities you want to extract. The training and evaluation datasets need to be in either CoNLL or JSON format. The data can be exported either from the AI Center Data Labeling tool or from Label Studio. This ML Package must be retrained; if deployed without training first, the deployment fails with an error stating that the model is not trained.
For an example of how to use this model, see the Extracting chemicals from research paper by category use case.
When to use the Custom Named Entity Recognition (NER) model
Use the Custom NER model to extract:
- special information from the text; this information is called an entity.
- the names of people, places, organizations, locations, dates, numerical values, and so on. The extracted entities are mutually exclusive. Entities are identified at the single- or multi-word level, not at the sub-word level. For example, in the sentence I live in New York, New York can be an entity, but it is not one in the sentence I read the New Yorker.
You can use the extracted entities directly in information extraction processes or as inputs to downstream tasks such as classification of the source text, sentiment analysis of the source text, PHI identification, and so on.
Training dataset recommendations
- Have at least 200 samples per entity if the entities are dense in the samples, meaning that most of the samples (more than 75%) contain 3-5 of these entities.
- If the entities are sparse (every sample has fewer than three entities), meaning that only a few of all the entities appear in most of the documents, it is recommended to have at least 400 samples per entity. This helps the model better learn the discriminative features.
- If there are more than 10 entities, add samples in increments of 100 until you reach the desired performance metric. A rough helper encoding these rules of thumb is sketched after this list.
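As a rough illustration only, these rules of thumb can be expressed as a small helper. The numbers come straight from the list above and are guidance, not guarantees:

```python
def recommended_samples(num_entities: int, avg_entities_per_sample: float) -> int:
    """Rough per-dataset sample estimate from the rules of thumb above."""
    # Dense entities (most samples contain 3-5 of them): ~200 samples per entity.
    # Sparse entities (fewer than 3 per sample): ~400 samples per entity.
    per_entity = 200 if avg_entities_per_sample >= 3 else 400
    return per_entity * num_entities

# Example: 4 entity types, ~3.5 tagged entities per sample -> 800 samples.
print(recommended_samples(4, 3.5))
```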
Best practices
- Have meaningful entities; if a human cannot identify an entity, neither can a model.
- Have simple entities. Instead of a single address entity, break it down into multiple entities: street name, state name, city name, zip code, and so on.
- Create both train and test datasets, and use a full pipeline for training.
- Start with a minimum number of samples for annotation, covering all the entities.
- Make sure all the entities are represented in both the train and test split.
- Run a full pipeline and check the test metrics. If the test metric is not satisfactory, check the classification report and identify the ill-performing entities. Add more samples that cover the ill-performing entities and repeat the training process, until the desired metric is reached.
This multilingual model supports the languages listed below. These languages were chosen because they are the top 100 languages with the largest Wikipedias:
- Afrikaans
- Albanian
- Arabic
- Aragonese
- Armenian
- Asturian
- Azerbaijani
- Bashkir
- Basque
- Bavarian
- Belarusian
- Bengali
- Bishnupriya Manipuri
- Bosnian
- Breton
- Bulgarian
- Burmese
- Catalan
- Cebuano
- Chechen
- Chinese (Simplified)
- Chinese (Traditional)
- Chuvash
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Haitian
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Ido
- Indonesian
- Irish
- Italian
- Japanese
- Javanese
- Kannada
- Kazakh
- Kirghiz
- Korean
- Latin
- Latvian
- Lithuanian
- Lombard
- Low Saxon
- Luxembourgish
- Macedonian
- Malagasy
- Malay
- Malayalam
- Marathi
- Minangkabau
- Mongolian
- Nepali
- Newar
- Norwegian (Bokmal)
- Norwegian (Nynorsk)
- Occitan
- Persian (Farsi)
- Piedmontese
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Scots
- Serbian
- Serbo-Croatian
- Sicilian
- Slovak
- Slovenian
- South Azerbaijani
- Spanish
- Sundanese
- Swahili
- Swedish
- Tagalog
- Tajik
- Tamil
- Tatar
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Volapük
- Waray-Waray
- Welsh
- West Frisian
- Western Punjabi
- Yoruba
Model output
The model returns the list of named entities found in the text. Each element in the list contains the following items in the prediction:
- Text that was recognized
- Starting and ending positions of the text, character-wise
- Type of the named entity
- Confidence
{ "response" : [{ "value": "George Washington", "start_index": 0, "end_index": 17, "entity": "PER", "confidence": 0.96469810605049133 }] }
{ "response" : [{ "value": "George Washington", "start_index": 0, "end_index": 17, "entity": "PER", "confidence": 0.96469810605049133 }] }
All three types of pipelines (Full Training, Training, and Evaluation) are supported by this package. For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant configuration. In trainings subsequent to the first, the model uses incremental learning, that is, the previously trained version is used as the starting point of the Training run.
You can use the Label Studio APIs to write back the data points whose predictions have weak confidence. The data can then be re-labeled and exported in CoNLL format.
For more information on how to use Label Studio, see Getting started with Label Studio. You can also download the UiPath® Studio activity for Label Studio Integration here.
Alternatively, you can leverage the data labeling feature in AI Center.
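As a sketch of the write-back flow described above, the snippet below posts a low-confidence prediction to a Label Studio project through its task import API. The URL, project ID, and token are placeholders, and the pre-annotation payload follows Label Studio's generic JSON task format; adapt the from_name/to_name fields to your labeling configuration:

```python
import requests

LABEL_STUDIO_URL = "http://localhost:8080"  # placeholder
PROJECT_ID = 1                              # placeholder
API_TOKEN = "<your-token>"                  # placeholder

def send_for_relabeling(text: str, predictions: list) -> None:
    """Import one task with model pre-annotations into Label Studio."""
    task = {
        "data": {"text": text},
        "predictions": [{
            "result": [{
                "from_name": "label",  # must match your labeling config
                "to_name": "text",
                "type": "labels",
                "value": {
                    "start": p["start_index"],
                    "end": p["end_index"],
                    "labels": [p["entity"]],
                },
            } for p in predictions],
        }],
    }
    resp = requests.post(
        f"{LABEL_STUDIO_URL}/api/projects/{PROJECT_ID}/import",
        headers={"Authorization": f"Token {API_TOKEN}"},
        json=[task],
    )
    resp.raise_for_status()
```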
You can use either GPU or CPU for training. We recommend using GPU since it's faster.
This model supports reading all files in a given directory during all pipeline runs (training, evaluation, and full pipeline).
CoNLL file format
This model reads all files with a .txt and/or .conll extension using the CoNLL file format in the provided directory.
The CoNLL file format represents a body of text with one word per line, each word containing 10 tab-separated columns with information about the word (for example, surface and syntax).
The trainable named entity recognition supports two CoNLL formats:
- With just two columns in the text.
- With four columns in the text.
For both formats, set the dataset.input_format environment variable to conll or label_studio. The label_studio format is the same as the CoNLL format, with the separation between two data points being a new empty line. To support separation between two data points with -DOCSTART- -X- O O, add dataset.input_format as an environment variable and set its value to conll.
For more information, see the examples below.
```
Japan NNP B-NP B-LOC
began VBD B-VP O
the DT B-NP O
defence NN I-NP O
of IN B-PP O
their PRP$ B-NP O
Asian JJ I-NP B-MISC
Cup NNP I-NP I-MISC
title NN I-NP O
with IN B-PP O
a DT B-NP O
lucky JJ I-NP O
2-1 CD I-NP O
win VBP B-VP O
against IN B-PP O
Syria NNP B-NP B-LOC
in IN B-PP O
a DT B-NP O
Group NNP I-NP O
C NNP I-NP O
championship NN I-NP O
match NN I-NP O
on IN B-PP O
Friday NNP B-NP O
. . O O
```

```
Founding O
member O
Kojima B-PER
Minoru I-PER
played O
guitar O
on O
Good B-MISC
Day I-MISC
, O
and O
Wardanceis I-MISC
cover O
of O
a O
song O
by O
UK I-LOC
post O
punk O
industrial O
band O
Killing B-ORG
Joke I-ORG
. O
```
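Before uploading a dataset, it can help to sanity-check the files locally. Below is a minimal reader sketch, assuming the token is the first whitespace-separated column and the NER tag is the last, which holds for both the two- and four-column examples above; the file name is a placeholder:

```python
def read_conll(path: str):
    """Yield (tokens, tags) pairs from a 2- or 4-column CoNLL file."""
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Empty lines and -DOCSTART- markers separate data points.
            if not line or line.startswith("-DOCSTART-"):
                if tokens:
                    yield tokens, tags
                    tokens, tags = [], []
                continue
            columns = line.split()
            tokens.append(columns[0])   # surface form
            tags.append(columns[-1])    # NER tag is the last column
    if tokens:
        yield tokens, tags

for tokens, tags in read_conll("train.conll"):
    print(list(zip(tokens, tags))[:5])
```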
JSON file format
This model reads all files with a .json extension using the JSON format in the provided directory.
Check the following sample and environment variables for a JSON file format example.
```json
{
  "text": "Serotonin receptor 2A ( HTR2A ) gene polymorphism predicts treatment response to venlafaxine XR in generalized anxiety disorder . anxiety disorder ( GAD ) is a chronic psychiatric disorder with significant morbidity and mortality . Antidepressant drugs are the preferred choice for treatment ; however , treatment response is often variable . Several studies in major depression have implicated a role of the serotonin receptor gene ( HTR2A ) in treatment response to antidepressants . We tested the hypothesis that the genetic polymorphism rs7997012 in the HTR2A gene predicts treatment outcome in GAD patients treated with venlafaxine XR . Treatment response was assessed in 156 patients that participated in a 6-month open - label clinical trial of venlafaxine XR for GAD . Primary analysis included Hamilton Anxiety Scale ( HAM-A ) reduction at 6 months . Secondary outcome measure was the Clinical Global Impression of Improvement ( CGI-I ) score at 6 months . Genotype and allele frequencies were compared between groups using χ(2) contingency analysis . The frequency of the G-allele differed significantly between responders ( 70% ) and nonresponders ( 56% ) at 6 months ( P=0.05 ) using the HAM-A scale as outcome measure . Similarly , using the CGI-I as outcome , the G-allele was significantly associated with improvement ( P=0.01 ) . Assuming a dominant effect of the G-allele , improvement differed significantly between groups ( P=0.001 , odds ratio=4.72 ) . Similar trends were observed for remission although not statistically significant . We show for the first time a pharmacogenetic effect of the HTR2A rs7997012 variant in anxiety disorders , suggesting that pharmacogenetic effects cross diagnostic categories . Our data document that individuals with the HTR2A rs7997012 single nucleotide polymorphism G-allele have better treatment outcome over time . Future studies with larger sample sizes are necessary to further characterize this effect in treatment response to antidepressants in GAD .",
  "entities": [
    { "entity": "TRIVIAL", "value": "Serotonin", "start_index": 0, "end_index": 9 },
    { "entity": "TRIVIAL", "value": "venlafaxine", "start_index": 81, "end_index": 92 },
    { "entity": "TRIVIAL", "value": "serotonin", "start_index": 409, "end_index": 418 },
    { "entity": "TRIVIAL", "value": "venlafaxine", "start_index": 625, "end_index": 636 },
    { "entity": "TRIVIAL", "value": "venlafaxine", "start_index": 752, "end_index": 763 },
    { "entity": "FAMILY", "value": "nucleotide", "start_index": 1800, "end_index": 1810 }
  ]
}
```
The environment variables for the previous example would be as follows:
- dataset.input_format: json
- dataset.input_column_name: text
- dataset.output_column_name: entities
ai_center file format
This model reads all files with a .json extension using the ai_center file format.
Check the following sample and environment variables for an ai_center file format example.
```json
{
"annotations": {
"intent": {
"to_name": "text",
"choices": [
"TransactionIssue",
"LoanIssue"
]
},
"sentiment": {
"to_name": "text",
"choices": [
"Very Positive"
]
},
"ner": {
"to_name": "text",
"labels": [
{
"start_index": 37,
"end_index": 47,
"entity": "Stakeholder",
"value": " Citi Bank"
},
{
"start_index": 51,
"end_index": 61,
"entity": "Date",
"value": "07/19/2018"
},
{
"start_index": 114,
"end_index": 118,
"entity": "Amount",
"value": "$500"
},
{
"start_index": 288,
"end_index": 293,
"entity": "Stakeholder",
"value": " Citi"
}
]
}
},
"data": {
"cc": "",
"to": "[email protected]",
"date": "1/29/2020 12:39:01 PM",
"from": "[email protected]",
"text": "I opened my new checking account with Citi Bank in 07/19/2018 and met the requirements for the promotion offer of $500 . It has been more than 6 months and I have not received any bonus. I called the customer service several times in the past few months but no any response. I request the Citi honor its promotion offer as advertised."
}
}
```
To leverage the previous sample JSON, the environment variables need to be set as follows:
- dataset.input_format to ai_center
- dataset.input_column_name to data.text
- dataset.output_column_name to annotations.ner.labels
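If you need to move data between the ai_center format and the plain JSON format described earlier, the mapping is mechanical. Here is a hypothetical converter sketch; file names are placeholders:

```python
import json

def ai_center_to_json(src_path: str, dst_path: str) -> None:
    """Convert one ai_center-format sample to the plain JSON format above."""
    with open(src_path, encoding="utf-8") as f:
        sample = json.load(f)
    converted = {
        # data.text is the default input column in the ai_center format.
        "text": sample["data"]["text"],
        # annotations.ner.labels is the default label column.
        "entities": sample["annotations"]["ner"]["labels"],
    }
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(converted, f, ensure_ascii=False, indent=2)

ai_center_to_json("ai_center_sample.json", "plain_sample.json")  # placeholders
```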
- dataset.input_column_name
  - The name of the column containing text.
  - Default value is data.text.
  - This variable is only needed if the input file format is ai_center or json.
- dataset.target_column_name
  - The name of the column containing labels.
  - Default value is annotations.ner.labels.
  - This variable is only needed if the input file format is ai_center or json.
- model.epochs
  - The number of epochs.
  - Default value is 5.
- dataset.input_format
  - The input format of the training data.
  - Default value is ai_center.
  - Supported values are: ai_center, conll, label_studio, or json.
  - Note: The label_studio format is the same as the CoNLL format, with the separation between two data points being a new empty line. To support separation between two data points with -DOCSTART- -X- O O, add dataset.input_format as an environment variable and set its value to conll.
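For quick reference, the dataset variables for each input format can be summarized as a plain mapping. These values are set as environment variables on the pipeline in AI Center, not in code; the sketch below is illustrative only, and note that the format examples earlier on this page use the name dataset.output_column_name for the label column, so check which name your package version expects:

```python
# Reference mapping of dataset environment variables per input format,
# as described on this page (illustrative only; set these in AI Center).
PIPELINE_ENV = {
    "ai_center": {
        "dataset.input_format": "ai_center",                      # default
        "dataset.input_column_name": "data.text",                 # default
        "dataset.target_column_name": "annotations.ner.labels",   # default
    },
    "json": {
        "dataset.input_format": "json",
        "dataset.input_column_name": "text",
        "dataset.target_column_name": "entities",
    },
    # conll / label_studio files carry labels inline, so no column variables.
    "conll": {"dataset.input_format": "conll"},
    "label_studio": {"dataset.input_format": "label_studio"},
}
```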
- Evaluation report, containing the following files:
- Classification report
- Confusion matrix
- Precision-recall information
- JSON files: separate JSON files corresponding to each section of the Evaluation Report PDF file. These JSON files are machine-readable and you can use them to pipe the model evaluation into Insights using the workflow.
Classification report
The classification report is derived from the test dataset when running a Full or Evaluation pipeline. It contains the following information for every entity in the form of a diagram:
- Entity- The name of the entity.
- Precision - The precision metric for correctly predicting the entity over the test set.
- Recall - The recall metric of correctly predicting the entity over the test set.
- F1 score - The F1-score metric for correctly predicting the entity over the test set; you can use this score to compare the entity-based performance of two differently trained versions of this model.
Confusion matrix
A table explaining the different categories of error is provided under the confusion matrix. The error categories per entity are correct, incorrect, missed, and spurious.
Precision recall information
You can use this information to check the precision-recall trade-off of the model. The thresholds and the corresponding precision and recall values are also provided in a table above the diagram for every entity. This table allows you to choose the threshold to configure in your workflow when deciding which data to send to Action Center for human-in-the-loop review. Note that the higher the chosen threshold, the more data is routed to Action Center for human review; a sketch of this selection logic follows the examples below.
There is a precision-recall diagram and table for each entity.
For an example of a precision-recall table per entity, see the table below.
| threshold | precision | recall |
| --- | --- | --- |
| 0.5 | 0.9193 | 0.979 |
| 0.55 | 0.9224 | 0.9777 |
| 0.6 | 0.9234 | 0.9771 |
| 0.65 | 0.9256 | 0.9771 |
| 0.7 | 0.9277 | 0.9759 |
| 0.75 | 0.9319 | 0.9728 |
| 0.8 | 0.9356 | 0.9697 |
| 0.85 | 0.9412 | 0.9697 |
| 0.9 | 0.9484 | 0.9666 |
| 0.95 | 0.957 | 0.9629 |
For an example of a precision-recall diagram per entity, see the figure below.
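As a sketch of how the table can drive that choice, the snippet below picks the lowest threshold whose precision meets a target, keeping recall as high as possible. The rows are copied from the example table above:

```python
# (threshold, precision, recall) rows from the example table above.
PR_TABLE = [
    (0.5, 0.9193, 0.979), (0.55, 0.9224, 0.9777), (0.6, 0.9234, 0.9771),
    (0.65, 0.9256, 0.9771), (0.7, 0.9277, 0.9759), (0.75, 0.9319, 0.9728),
    (0.8, 0.9356, 0.9697), (0.85, 0.9412, 0.9697), (0.9, 0.9484, 0.9666),
    (0.95, 0.957, 0.9629),
]

def pick_threshold(target_precision: float) -> float:
    """Lowest threshold meeting the precision target (preserves recall)."""
    for threshold, precision, _recall in PR_TABLE:
        if precision >= target_precision:
            return threshold
    return PR_TABLE[-1][0]  # fall back to the strictest threshold

print(pick_threshold(0.93))  # 0.75 for this example table
```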
Evaluation CSV file
This is a CSV file with predictions on the test set used for evaluation. The file contains the following columns:
- Text - The text used for evaluation.
- Actual_entities - The entities that were provided as labeled data in the evaluation dataset.
- Predicted_entities - The entities that the trained model predicted.
- Error_type_counts - The difference between the actual entities and predicted entities categorized by error types.
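To find ill-performing entities across the whole test set, you can aggregate this last column. Below is a sketch with pandas, assuming each Error_type_counts cell is serialized as a Python/JSON-style dict; the exact serialization may differ, so adjust the parsing accordingly:

```python
import ast
from collections import Counter

import pandas as pd

df = pd.read_csv("evaluation.csv")  # placeholder file name

totals = Counter()
for raw in df["Error_type_counts"].dropna():
    # Assumes each cell looks like:
    # {'correct': 3, 'incorrect': 1, 'missed': 0, 'spurious': 0}
    totals.update(ast.literal_eval(raw))

print(dict(totals))  # aggregated error counts over the test set
```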