Model training FAQs
- General model training
- Label training
The objective of training a model is to create a set of training data that is as representative as possible of the dataset as a whole, so that the platform can accurately and confidently predict the relevant labels and general fields for each message. The labels and general fields within a dataset should be intrinsically linked to the overall objectives of the use case and provide significant business value.
As soon as data is uploaded to the platform, the platform begins a process called unsupervised learning, whereby it groups messages into clusters of similar semantic intent. This process can take up to a couple of hours, depending on the size of the dataset, and clusters will appear once it is complete.
To be able to train a model, you need a minimum amount of existing historical data. This is used as training data to provide the platform with the necessary information to confidently predict each of the relevant concepts for your analysis and/or automation.
The recommendation for any use case is a minimum of 12 months of historical data, in order to properly capture any seasonality or irregularity in the data (e.g. month-end processes and busy seasons).
No, you do not need to save your model after making changes. Every time you train the platform on your data (i.e. annotate any messages), a new model version is created for your dataset. Performance statistics for older model versions can be viewed in Validation.
Please check the Validation page in the platform, which reports various performance measures and provides a holistic model health rating. This page updates after every training event and it can be used to identify areas where the model may need more training examples or some label corrections in order to ensure consistency.
Please see the Validation page for full explanations of model performance and how to improve it.
Clusters are a helpful way to quickly build up your taxonomy, but users will spend most of their time training in Explore rather than Discover.
If users spend too much time annotating via clusters, there’s a risk of overfitting the model to look for messages that only fit these clusters when making predictions. The more varied examples there are for each label, the better the model will be at finding the different ways of expressing the same intent or concept. This is one of the main reasons why we only show 30 clusters at a time.
However, Discover does retrain once enough training has been completed or a significant volume of data has been added to the platform (see here). When it retrains, it takes into account the existing training to date, and will try to present new clusters that are not well covered by the current taxonomy.
For more information on Discover, see here.
There are 30 clusters in total, each containing 12 messages. In the platform, you can set the number of messages shown per page, in increments between 6 and 12. We recommend annotating 6 at a time to reduce the risk of partially annotating any messages.
Precision and recall are metrics used to measure the performance of a machine learning model. A detailed description of each can be found under the Using Validation section of our how-to guides.
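As a rough illustration of these two metrics, the sketch below computes precision and recall for a single label from counts of true positives, false positives and false negatives. It uses the standard textbook definitions. The function name and the example counts are hypothetical, and the platform's Validation page may calculate its reported measures differently.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) for one label using the standard definitions.

    Hypothetical helper for illustration only; not the platform's implementation.
    """
    predicted = true_positives + false_positives  # everything the model predicted as this label
    actual = true_positives + false_negatives     # everything that truly has this label
    # Guard against division by zero when there are no predictions or no true examples.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Made-up counts: 80 correct predictions, 20 incorrect predictions, 40 missed messages.
precision, recall = precision_recall(true_positives=80, false_positives=20, false_negatives=40)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.80, recall=0.67
```

In this example, 80% of the predictions for the label are correct (precision), while the model finds about 67% of the messages that should have the label (recall).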
You can access the validation overview of earlier models by hovering over ‘Model Version’ in the top left corner of the Validation page. This can be helpful for tracking and comparing progress as you train your model.
If you need to roll your model back to a previous pinned version, please see here for more details.
Yes, it’s really easy to do. You can go into the settings for each label and rename it at any point. You can see how to do it here.
Information about your dataset, including how many messages have been annotated, is displayed in the Datasets Settings page. To see how to access it, click here.
If you can see in the Validation page that your label is performing poorly, there are various ways to improve its performance. See here to understand more.
The little red dials next to each label/general field indicate whether more examples are needed for the platform to accurately estimate the label/general field's performance. The dials start to disappear as you provide more training examples and will disappear completely once you reach 25 examples.
After this, the platform will be able to effectively evaluate the performance of a given label/general field and may return a performance warning if the label/general field is not healthy.
The platform is able to learn from empty messages and uninformative messages as long as they are annotated correctly. However, it is worth noting that uninformative labels will likely need a significant number of training examples, and should be loosely grouped by concept, to ensure best performance.
- General model training
- What is the objective of training a model?
- Why can I not see anything in Discover if I've just uploaded data into the platform?
- How much historical data do I need to train a model?
- Do I need to save my model every time I make a change?
- How do I know what the performance of the model is?
- Why are there only 30 clusters available and can we set them individually?
- How many messages are in each cluster?
- What do precision and recall mean?
- Can I return to an earlier version of my model?
- Label training
- Can I change the name of a label later on?
- How do I find out the number of messages I have annotated?
- One of my labels is performing poorly, what can I do to improve it?
- What does the red dial next to my label or general field indicate? How do I get rid of it?
- Should I avoid annotating empty/uninformative messages?