Communications Mining user guide
Precision and recall explained
Precision and recall are fundamental metrics for measuring the performance of a machine learning model. If you are training models, make sure you understand these metrics before you try to assess the performance of your model.
- Precision is the proportion of all the predictions that were actually correct.
- Recall is the proportion of all possible true positives that the platform identified.
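Expressed in terms of true positives (TP), false positives (FP), and false negatives (FN), both metrics are simple ratios. The snippet below is a minimal illustrative sketch (it is not platform code) and assumes the counts have already been made:

```python
# Minimal sketch: precision and recall from counts of true positives (TP),
# false positives (FP), and false negatives (FN).

def precision(tp: int, fp: int) -> float:
    """Proportion of all predictions made that were actually correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Proportion of all possible true positives that were identified."""
    return tp / (tp + fn)
```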
This section contains some real-world examples that explain how precision and recall work.
Example 1 – Scenario 1
If you have an electronic passport you might be familiar with the electronic gates (e-gates) at border control when arriving in a country. They have image recognition cameras installed that are designed to analyse your face and check whether it matches the digital version on your passport. In essence, it’s a classification problem they are trying to solve – is this person who they say they are, or not?
Let’s say an airport decides it wants to implement these electronic gates. However, it wants to check how effective the cameras are at matching people’s faces to passport images before the public uses them. In this example, the aim is to use a camera that only identifies (or predicts) faces that match the image on the passport. The cameras should let as many people through as possible, but catch everyone who might be using someone else’s passport, or a fake one where the images don’t match.
Precision
Precision would measure how accurate the camera was at letting the correct people through the gates. Essentially, of all the people it let through, what proportion of them had a matching passport.
In the first test you get 100 people to use the new camera. The results show that the camera lets 70 people through and rejects 30, who then have to go to the traditional desks manned by people.
Of the 70 people it let through, it turns out that there were actually 4 that it shouldn’t have let through (we already know beforehand they had the wrong passports). To calculate the precision, we would do the following:
Precision = Number of correctly identified people / The total number of people let through (correct and incorrect) = 66/(66+4) = 94%
Recall
There is one small problem here though. Let’s say we know there are actually 95 people in total with correct passports, and only 66 of them were correctly let through (as calculated previously), meaning 29 (95-66) people were incorrectly rejected and had to join the manual queue. How can we do a better job of correctly identifying all the people that we should let through?
This is where our other measure, recall, comes into play. Of all the people the camera should have identified as being correct and let through, recall measures how many of those it picked up. In this example we know that only 66 out of the 95 people who had correct passports were let through, so recall would be calculated in the following way:
Recall = Number of correct passports identified / The total number of people with correct passports = 66/95 = 69%
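Putting the Scenario 1 numbers into these formulas reproduces both results. This is purely an illustrative calculation using the counts from the example:

```python
# Scenario 1: 66 people correctly let through (true positives),
# 4 incorrectly let through (false positives), and 29 people with
# correct passports incorrectly rejected (false negatives).
tp, fp, fn = 66, 4, 29

precision = tp / (tp + fp)   # 66 / 70 ≈ 0.94
recall = tp / (tp + fn)      # 66 / 95 ≈ 0.69

print(f"Precision: {precision:.0%}")  # Precision: 94%
print(f"Recall:    {recall:.0%}")     # Recall:    69%
```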
Example 1 - Scenario 2
Let’s take another scenario to show how precision and recall might change. We use the same setup, but this time the camera has been trained on a wider variety of images, and we want to test how much this improves the camera.
Just like Scenario 1, the same 100 people go through the passport gates again and we know that 95 of them have correct passports.
This time though, 85 are allowed through, with 15 being rejected to go to the traditional desks manned by humans. Of those 85 people let through the gates, 82 were correctly allowed through and 3 should not have been let through because they had the wrong passports.
Precision in this case = 82/(82+3) = 96%
Recall = 82/95 = 86%
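The same illustrative calculation for Scenario 2, using the counts from the example above:

```python
# Scenario 2: 82 correctly let through (true positives), 3 incorrectly
# let through (false positives), and 95 - 82 = 13 people with correct
# passports incorrectly rejected (false negatives).
tp, fp, fn = 82, 3, 13

precision = tp / (tp + fp)   # 82 / 85 ≈ 0.96
recall = tp / (tp + fn)      # 82 / 95 ≈ 0.86

print(f"Precision: {precision:.0%}")  # Precision: 96%
print(f"Recall:    {recall:.0%}")     # Recall:    86%
```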
In this scenario we have a similar precision score but quite an improvement in recall. This means that whilst our predictions were still accurate (94% vs 96%) we were able to identify more of the cases where someone should have been let through as they had the correct passport (69% vs 86%). This shows that the additional training has significantly improved the recall of the camera compared to Scenario 1.
Example 2
Another simple example shows how the same measures can differ across situations.
Fire alarms are designed to detect when a fire breaks out. In a way, they have to predict when there is a fire, but there are also occasions when they get it wrong and cause a false alarm. What matters more in this situation is making sure that when there is a fire, it is detected 100% of the time. We can accept the odd false alarm as long as every real fire is detected. In this example, having high recall is more important – making sure every fire is detected!
Let’s say that in a year the alarm goes off 10 times, and only 1 of those is a real fire. The alarm predicted a fire 10 times: 1 prediction was correct and 9 were incorrect. In this case, precision was only 10% (1/10), but recall was 100% (1/1). Of all the fires that actually occurred, the alarm detected every one. So, whilst precision was poor and there were many false alarms, recall was perfect, and we caught the one time there was a fire.
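The fire alarm numbers work the same way; again, this is only an illustration of the arithmetic:

```python
# The alarm went off 10 times: 1 real fire (true positive), 9 false alarms
# (false positives), and no real fire was missed (false negatives = 0).
tp, fp, fn = 1, 9, 0

precision = tp / (tp + fp)   # 1 / 10 = 0.10
recall = tp / (tp + fn)      # 1 / 1  = 1.00

print(f"Precision: {precision:.0%}")  # Precision: 10%
print(f"Recall:    {recall:.0%}")     # Recall:    100%
```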
When deciding which metric to prioritise, precision or recall, remember that you can optimise for both, or favour one of them, depending on your use case.
The previous examples show a trade-off between the two metrics and how each one becomes more important depending on the situation it is used for.
Taking the fire alarm example, it is more important to pick up all cases of fire, because the consequences of not doing so are dangerous. If a fire broke out and the detector didn’t work, people could die. In these scenarios we would want to optimise for high recall – to make sure all cases are identified, even at the expense of false fire alarms.
In contrast, for the passport gate example it would be more important to only let people through the gates when the image on their passport matches the face the camera detects. You don’t want to let through someone who has either a fake or the wrong passport. You want to optimise for high precision in this example, and you don’t mind if the odd person who should have been let through is sent to the desk for manual inspection. In this case recall would be lower, but precision (which matters more here) would be high.
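In practice, this trade-off is often managed through the confidence threshold used to turn a prediction score into a yes/no decision. The sketch below uses made-up scores (it does not reflect any real model or platform behaviour) to show how raising the threshold typically increases precision while lowering recall:

```python
# Illustrative only: made-up prediction confidences with ground-truth labels
# (True means the label genuinely applies). As the threshold rises, fewer
# false positives get through (precision up) but more true positives are
# missed (recall down).
predictions = [
    (0.95, True), (0.90, True), (0.85, True), (0.70, False),
    (0.60, True), (0.50, False), (0.40, True), (0.30, False),
]

for threshold in (0.25, 0.45, 0.65, 0.80):
    tp = sum(1 for score, actual in predictions if score >= threshold and actual)
    fp = sum(1 for score, actual in predictions if score >= threshold and not actual)
    fn = sum(1 for score, actual in predictions if score < threshold and actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"Threshold {threshold:.2f}: precision {precision:.0%}, recall {recall:.0%}")
```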