AI Center - Text Classification

Text Classification

Note:

Out of the box ML packages are deprecated. For more information, check the Deprecation timeline page in the Overview guide.

OS Packages > Language Analysis > TextClassification

This is a generic, retrainable model for language classification. This ML Package must be trained before it is deployed. If it is deployed without being trained first, deployment fails with an error stating that the model is not trained.

This model is a deep learning architecture for language classification. It is based on BERT, a self-supervised method for pretraining natural language processing systems. A GPU can be used at both serving time and training time, and delivers roughly a 5-10x speedup. BERT was open-sourced by Google.

Languages

The main driver of model performance is the quality of the data used for training. The data used to pretrain the model may also influence performance: it was pretrained on the 100 languages with the largest Wikipedias.

Model details

Input type

JSON

Input description

Text to be classified as String: "I loved this movie."

Output description

JSON with the predicted class name and the confidence associated with that prediction (a value between 0 and 1).

Example:

{
  "class": "Positive",
  "confidence": 0.9422031841278076
}
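
As a minimal sketch of how a caller might consume this output, the snippet below parses the example response and applies a confidence threshold. The function name parse_prediction and the min_confidence threshold are assumptions made for illustration; the package itself only returns the class and confidence fields shown above.

```python
import json

# Sample response copied from the example above.
raw = '{"class": "Positive", "confidence": 0.9422031841278076}'


def parse_prediction(payload, min_confidence=0.5):
    """Return (class_name, confidence, accepted) for one JSON response.

    min_confidence is a hypothetical, caller-chosen threshold;
    the package does not define one.
    """
    result = json.loads(payload)
    label = result["class"]
    confidence = float(result["confidence"])
    # The documentation states that confidence lies between 0 and 1.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return label, confidence, confidence >= min_confidence


label, confidence, accepted = parse_prediction(raw, min_confidence=0.8)
print(label, round(confidence, 3), accepted)  # Positive 0.942 True
```

Predictions that fall below the threshold could, for example, be routed to a human for review rather than acted on automatically.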

Pipelines

All three types of pipelines (Full Training, Training, and Evaluation) are supported by this package.

For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In training runs after the first, the model uses incremental learning (that is, training starts from the previously trained version, as it was at the end of the last Training Run).

Dataset format

There are two ways to structure your dataset for this model. You can't use both at the same time. By default, the model looks for a dataset.csv file in the top-level directory of the dataset. If it finds one, it uses the CSV option (option 2 below); otherwise, it tries to use the folder structure (option 1).
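
The lookup order described above can be sketched as a small local check, useful for confirming which option a dataset folder will trigger before uploading it. This is an illustration of the documented behavior, not the package's actual code; detect_dataset_format is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def detect_dataset_format(dataset_dir):
    """Mimic the documented lookup order (hypothetical helper)."""
    root = Path(dataset_dir)
    # A top-level dataset.csv takes precedence over any class folders.
    if (root / "dataset.csv").is_file():
        return "csv"
    # Otherwise, fall back to one folder per class.
    if any(child.is_dir() for child in root.iterdir()):
        return "folders"
    raise ValueError("no dataset.csv and no class folders found")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "positive").mkdir()
    print(detect_dataset_format(root))  # folders
    (root / "dataset.csv").write_text("input,target\n", encoding="utf-8")
    print(detect_dataset_format(root))  # csv
```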

Use a folder structure to separate your classes

Create one folder for each class at the top level of the dataset, and add one text file per data point to the corresponding folder (the folder name is the class, and each file contains only the input text). The dataset structure looks like this:

Dataset
-- folderNamedAsClass1 # the name of the folder must be name of the class
---- text1Class1.txt #file can have any name
...
---- textNClass1.txt
-- folderNamedAsClass2
---- text1Class2.txt
...
---- textMClass2.txt
..
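
A minimal sketch of generating this layout from labeled examples held in memory is shown below. The helper name write_folder_dataset and the text1.txt, text2.txt file naming are assumptions; the files can have any name, but each folder name must match its class.

```python
import tempfile
from pathlib import Path


def write_folder_dataset(examples, dataset_dir):
    """Write (text, label) pairs as one .txt file per data point.

    Each label becomes a top-level folder, so labels must be valid
    folder names. Returns the number of files written per class.
    """
    root = Path(dataset_dir)
    counts = {}
    for text, label in examples:
        class_dir = root / label
        class_dir.mkdir(parents=True, exist_ok=True)
        counts[label] = counts.get(label, 0) + 1
        # The file holds only the input text; the folder carries the class.
        file_path = class_dir / f"text{counts[label]}.txt"
        file_path.write_text(text, encoding="utf-8")
    return counts


samples = [
    ("I like this movie", "positive"),
    ("I hated the acting", "negative"),
]

with tempfile.TemporaryDirectory() as tmp:
    write_folder_dataset(samples, tmp)
    for path in sorted(Path(tmp).rglob("*.txt")):
        print(path.relative_to(tmp).as_posix())
```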

Use one csv file

Combine all your data into one csv file named dataset.csv at the top level of your dataset. The file must have two columns: input (the text) and target (the class). It looks as follows:

input,target
I like this movie,positive
I hated the acting,negative
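
A file in this format can be produced with Python's standard csv module, as in the sketch below. The helper name write_csv_dataset is hypothetical. The csv module quotes any text that contains commas or line breaks; that the package reads such quoted fields correctly is an assumption worth verifying on a small sample.

```python
import csv
import tempfile
from pathlib import Path


def write_csv_dataset(examples, dataset_dir):
    """Write (text, label) pairs to dataset.csv at the dataset's top level."""
    root = Path(dataset_dir)
    root.mkdir(parents=True, exist_ok=True)
    path = root / "dataset.csv"
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "target"])  # required column names
        writer.writerows(examples)
    return path


samples = [
    ("I like this movie", "positive"),
    ("I hated the acting", "negative"),
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_csv_dataset(samples, tmp)
    print(csv_path.read_text(encoding="utf-8"))
```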

Paper

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
