Precision and recall explained
Precision and recall are fundamental metrics for measuring the performance of a machine learning model, and it's important for anyone training a model to understand them before trying to assess the model's performance.
So what do these measures mean?
Precision is the proportion of the predictions made that were actually correct.
Recall is the proportion of all possible true positives (all the actual positive cases) that were correctly identified.
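For readers who prefer to see the definitions written down, here is a minimal illustrative sketch (not part of the product) expressing both measures in terms of true positives (TP), false positives (FP) and false negatives (FN):

```python
def precision(tp: int, fp: int) -> float:
    """Proportion of positive predictions that were actually correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Proportion of all actual positive cases that were identified."""
    return tp / (tp + fn)
```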
Below are some real-world examples that explain how precision and recall work.
Example 1 – Scenario 1
If you have an electronic passport you might be familiar with the electronic gates (e-gates) at border control when arriving in a country. They have image recognition cameras installed that are designed to analyse your face and check whether it matches the digital version on your passport. In essence, it’s a classification problem they are trying to solve: is this person who they say they are, or not?
Let’s say an airport decides it wants to implement these electronic gates. However, it wants to check how effective the cameras are at matching people’s faces to passport images before letting the public use them. In this example the aim is to use a camera that only identifies (or predicts) faces that match the image on the passport. The gates should let as many people through as possible, but catch everyone who might be using someone else’s passport, or a fake one where the images don’t match.
Precision
Precision would measure how accurate the camera was at letting the correct people through the gates. Essentially, of all the people it let through, what proportion of them had a matching passport.
In the first test you get 100 people to use the new camera. The results show the camera lets 70 people through and rejects 30, who then have to go to the traditional desks staffed by people.
Of the 70 people it let through, it turns out that there were actually 4 it shouldn’t have let through (we know beforehand that they had the wrong passports). To calculate the precision, we would do the following:
Precision = Number of correctly identified people / The total number of people let through (correct and incorrect) = 66/(66+4) = 94%
Recall
There is one small problem here though. Let’s say we know there are actually 95 people in total with correct passports, and only 66 of them were correctly let through (as per above), meaning 29 (95-66) people were incorrectly rejected and had to join the manual queue. How can we do a better job of correctly identifying all the people that we should let through?
This is where our other measure, recall, comes into play. Of all the people the camera should have identified as being correct and let through, recall measures how many of those it picked up. In this example we know that only 66 out of the 95 people who had correct passports were let through, so recall would be calculated in the following way:
Recall = Number of correct passports identified / The total number of people with correct passports = 66/95 = 69%
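To make the arithmetic concrete, here is a quick Python check of the Scenario 1 numbers (an illustrative sketch using the counts from the example above, not product code):

```python
# Scenario 1 counts, taken from the example above
tp = 66   # people with matching passports correctly let through
fp = 4    # people let through who should have been rejected
fn = 29   # people with correct passports who were wrongly rejected (95 - 66)

precision = tp / (tp + fp)   # 66 / 70 ≈ 0.94
recall = tp / (tp + fn)      # 66 / 95 ≈ 0.69
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```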
Example 1 – Scenario 2
Let’s take another scenario to show how precision and recall might change. We use the same setup, but this time the camera has been trained on a wider variety of images, and we want to test how much this improves the camera.
Just like Scenario 1, the same 100 people go through the passport gates again and we know that 95 of them have correct passports.
This time though, 85 are allowed through, with 15 being rejected to go to the traditional desks staffed by humans. Of those 85 people let through the gates, 82 were correctly allowed through and 3 shouldn’t have been let through, as they had the wrong passports.
Precision here = 82/(82+3) = 96%
Now let’s see how recall was affected:
Recall = 82/95 = 86%
In this scenario we have a similar precision score but quite an improvement in recall. This means that whilst our predictions were still accurate (94% vs 96%) we were able to identify more of the cases where someone should have been let through as they had the correct passport (69% vs 86%). This shows that the additional training has significantly improved the recall of the camera compared to Scenario 1.
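The same kind of check for Scenario 2, again using only the counts from the example (an illustrative sketch):

```python
# Scenario 2 counts, taken from the example above
tp = 82   # people with matching passports correctly let through
fp = 3    # people let through who should have been rejected
fn = 13   # people with correct passports who were wrongly rejected (95 - 82)

precision = tp / (tp + fp)   # 82 / 85 ≈ 0.96
recall = tp / (tp + fn)      # 82 / 95 ≈ 0.86
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```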
Example 2
Another simple example shows how the same measures can differ across situations.
Fire alarms are designed to detect when a fire breaks out. In a way, they have to predict when there is a fire, but there are also occasions they will get it wrong and cause a false alarm. What is more important in this situation is making sure that when there is a fire it is detected 100% of the time. We can accept the odd false alarm as long as when there is a fire, it is detected. In this example having high recall is more important – making sure every fire is detected!
Let’s say in a year the alarm goes off 10 times, and only 1 of those times is a real fire. The alarm/detector predicted a fire 10 times: 1 prediction was correct, 9 were incorrect. In this case, precision was only 10% (1/10), but recall was 100% (1/1). Of all the real fires that occurred, the alarm detected every one. So, whilst precision was poor and there were many false alarms, recall was perfect, and we caught the one time there was a fire.
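The same check for the fire alarm numbers (again, a sketch using the counts stated above):

```python
# Fire alarm counts, taken from the example above
tp = 1    # the one real fire, which was detected
fp = 9    # false alarms
fn = 0    # no real fires were missed

precision = tp / (tp + fp)   # 1 / 10 = 0.10
recall = tp / (tp + fn)      # 1 / 1  = 1.00
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```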
So which of the two measures is more important? There are two correct answers to that question:
- Both
- It depends
The above examples show a trade-off between the two metrics and how each one becomes more important depending on the situation it is used for.
Taking the fire alarm example, it is more important to pick up all cases of fire, because the consequences of not doing so are dangerous. If a fire broke out and the detector didn’t work, people could die. In these scenarios we would want to optimise for high recall – to make sure all cases were identified, even at the expense of false fire alarms.
In contrast, for the passport gate example it would be more important to only let people through the gates whose faces matched the image on their passport. You don’t want to let through someone who had either a fake or the wrong passport. You want to optimise for high precision in this example, and you don’t mind if the odd person who should have been let through is sent to the desk for manual inspection. In this case recall would be lower, but precision (which matters more here) would be high.