AI Center
Last updated Nov 19, 2024

Text Classification

OS Packages > Language Analysis > TextClassification

This is a generic, retrainable model for text classification. This ML Package must be trained before it is deployed; if it is deployed without training first, deployment fails with an error stating that the model is not trained.

This model is a deep learning architecture for text classification. It is based on BERT, a self-supervised method for pretraining natural language processing systems. A GPU can be used at both serving time and training time, and delivers roughly a 5-10x improvement in speed. BERT was open-sourced by Google.

Languages

The main driver of model performance is the quality of the data used for training. The data used to pretrain the underlying model may also influence performance: it was pretrained on the 100 languages with the largest Wikipedias.

Model details

Input type

JSON

Input description

Text to be classified as String: "I loved this movie."

Output description

JSON with the predicted class name and the associated confidence for that prediction (a value between 0 and 1).

Example:

{
  "class": "Positive",
  "confidence": 0.9422031841278076
}
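
A minimal sketch of how a caller might consume this output, assuming the response arrives as a JSON string with exactly the two fields shown above. The function name `parse_prediction` and the 0.8 threshold are hypothetical and chosen for illustration; they are not part of the package.

```python
import json


def parse_prediction(raw, min_confidence=0.8):
    """Parse the model output and flag low-confidence predictions.

    `min_confidence` is an illustrative threshold, not a package setting.
    Returns a (class_name, confidence, accepted) tuple.
    """
    result = json.loads(raw)
    label = result["class"]
    confidence = float(result["confidence"])
    # The documented range for confidence is 0 to 1.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return label, confidence, confidence >= min_confidence


raw_output = '{"class": "Positive", "confidence": 0.9422031841278076}'
label, confidence, accepted = parse_prediction(raw_output)
```

Predictions that fall below the threshold could, for example, be routed to a human reviewer instead of being used directly.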

Pipelines

All three types of pipelines (Full Training, Training and Evaluation) are supported by this package.

For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In training runs after the first, the model uses incremental learning (that is, the version produced at the end of the previous training run is used as the starting point).

Dataset format

There are two ways to structure your dataset for this model: option 1 (folder structure) and option 2 (one csv file). You cannot use both at the same time. By default, the model looks for a dataset.csv file in the top-level folder of the dataset. If the file is found, the model uses option 2; otherwise, it tries to use option 1.
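
The selection rule described above can be sketched as follows. This is an illustration for checking a dataset locally before upload, not the package's actual code; the function name `detect_dataset_format` and its return values are hypothetical.

```python
from pathlib import Path


def detect_dataset_format(dataset_dir):
    """Mirror the documented rule: dataset.csv wins, else class folders."""
    root = Path(dataset_dir)
    # A dataset.csv file at the top level selects the csv option.
    if (root / "dataset.csv").is_file():
        return "csv"
    # Otherwise fall back to the folder structure (one folder per class).
    if any(entry.is_dir() for entry in root.iterdir()):
        return "folders"
    raise ValueError(f"no dataset.csv and no class folders in {root}")
```

Because dataset.csv takes precedence, a stray dataset.csv left next to class folders would cause the folders to be ignored.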

Use a folder structure to separate your classes

Create one folder for each class at the top level of the dataset, and add one text file per data point to the corresponding folder (the folder name is the class, and the file contains only the input text). The dataset structure looks like this:

Dataset
-- folderNamedAsClass1 # the folder name must be the class name
---- text1Class1.txt # files can have any name
...
---- textNClass1.txt
-- folderNamedAsClass2
---- text1Class2.txt
...
---- textMClass2.txt
..
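
A minimal sketch for inspecting a dataset laid out this way, assuming every file is UTF-8 text. The function name `read_folder_dataset` is hypothetical; it only reads the layout described above and is not part of the package.

```python
from pathlib import Path


def read_folder_dataset(dataset_dir):
    """Return (text, class_name) pairs from a folder-per-class dataset."""
    rows = []
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # only top-level folders define classes
        for text_file in sorted(class_dir.iterdir()):
            if text_file.is_file():
                # The file holds only the input; the folder name is the label.
                text = text_file.read_text(encoding="utf-8").strip()
                rows.append((text, class_dir.name))
    return rows
```

Counting the rows per class with this helper is a quick way to spot empty or badly imbalanced classes before training.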

Use one csv file

Group all your data into one csv file named dataset.csv at the top level of your dataset. The file must have two columns: input (the text) and target (the class). It looks as follows:

input,target
I like this movie,positive
I hated the acting,negative
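
A minimal sketch of producing such a file with Python's standard csv module. The function name `write_dataset_csv` is hypothetical. Using the csv module means texts that contain commas or quotes are quoted automatically; that the package accepts standard CSV quoting is an assumption, not something stated in this page.

```python
import csv


def write_dataset_csv(rows, path="dataset.csv"):
    """Write (text, class_name) pairs as a two-column csv file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "target"])  # required column names
        for text, label in rows:
            writer.writerow([text, label])


examples = [
    ("I like this movie", "positive"),
    ("I hated the acting", "negative"),
]
```

Calling `write_dataset_csv(examples, "dataset.csv")` in the top-level folder of the dataset would produce the file shown above.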

Paper

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
