Measuring Model Goodness - Part 2

Measurability is an important aspect of the Team Data Science Process (TDSP) as it quantifies how good the machine learning model is for the business and helps gain acceptance from the key stakeholders. In part 1 of this series, we defined a template for measuring model goodness, specifically looking at classification-type problems. This is summarised below.

Classification

Quantify Key Business Metrics
Once the business context is understood, quantify the key business metrics as follows:
- The value and cost of predicting true positives/negatives and false positives/negatives, respectively. For a binary classification problem, this can be represented as a 2D value-cost matrix.
- The percentage of cases that can be automated by the model without requiring human intervention.

Define Reasonable Baselines
By doing some exploratory data analysis, define reasonable baselines as follows:
- Random classifier: A model that determines the classes randomly.
- Majority-class classifier: A model that always picks the majority class. This is relevant when the classes are highly imbalanced, as in fraud/spam detection.
- Currently deployed classifier: This can be a model/human that the customer has currently deployed to perform the classification task.

Build and Evaluate Model
Using historical data, train a classifier and tune hyperparameters using cross-validation. Then evaluate the model on a hold-out test set using the following metrics:
- True Positives
- True Negatives
- False Positives
- False Negatives
- Precision
- Recall

The first 4 metrics can be visualised using a confusion matrix and ROC curve. The last 2 metrics can be visualised using a PR curve.

Translate Model Performance to Business Metrics
Finally, translate the model performance metrics to the key business metrics as follows:
- Expected value: Can be computed using the confusion matrix and the value-cost matrix.
- % cases to manually review: Can be computed using the precision, recall and actual positive class rate.

In this post, we will look at regression which is another broad class of machine learning problems.

Regression

Regression is the process of estimating a quantitative response. It is a supervised machine learning technique that, given a new observation, predicts a real-valued output. This is done based on training data that contains observations whose real-valued outputs are known. In this post, I will look at a specific example of regression, which is to predict the dollar value of a home. I will be using the open Boston house prices dataset, which consists of 506 examples, 13 numeric/categorical features such as per capita crime rate, average number of rooms per dwelling and distance to employment centres, and 1 real-valued target representing the median value of a home in 1000s of dollars. More details on the dataset can be found in the Jupyter notebook that I created for this topic on Github.
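As a quick sketch, the dataset can be loaded directly from scikit-learn. Note that load_boston was removed in scikit-learn 1.2, so this snippet assumes an older version of the library.

import pandas as pd
from sklearn.datasets import load_boston  # available in scikit-learn versions before 1.2

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)  # 506 rows, 13 features
y = boston.target                                            # median home value in $1000s
print(X.shape, y.shape)                                      # (506, 13) (506,)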

The business need is to predict the home value (or a real-valued output in the general case) with as low an error as possible and to also be able to see how confident we are of our predictions. These needs can be quantified using the following metrics:

  • Error metrics:
    • Mean/Median Absolute Error (MAE)
    • Mean/Median Absolute Percentage Error (MAPE)
    • Root Mean Squared Error (RMSE)
  • x% Confidence Interval (CI) or Coefficient of Variation (CoV) for each of our estimates

The 3 metrics that are typically used to quantify the estimation error are MAE, MAPE and RMSE. It is important to understand the differences between them so that the appropriate metric(s) can be used for the problem at hand. Both MAE and RMSE are in the same units as the target variable and are easy to understand in that regard. For the housing price problem, these errors are expressed in dollars and can easily be interpreted. The magnitude of the error relative to the actual value, however, cannot be easily understood using these two metrics. For example, an error of $1000 may seem small at first glance, but if the actual house price is $2000 then the error is not small in relation to it. This is why MAPE is useful for understanding these relative differences, as the error is expressed as a % error. It is also worth highlighting a difference between MAE and RMSE: RMSE gives a higher weight to larger errors, so if larger errors need to be expressed more prominently, RMSE may be the more desirable metric. The M prefix in MAE and MAPE stands for either mean or median; both are useful in understanding the skew in the error distribution. The mean error is affected more by outliers than the median.
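As a rough illustration, with made-up numbers in units of $1000s, the effect of a single large relative error on these metrics can be seen with a few lines of NumPy:

import numpy as np

y_true = np.array([200, 300, 250, 220, 2])   # actual values in $1000s; note the $2,000 house
y_pred = np.array([210, 290, 260, 215, 1])   # predicted values; the $2,000 house is off by $1,000

abs_err = np.abs(y_true - y_pred)
ape = abs_err / y_true * 100                 # absolute % error

print("Mean AE:    ", abs_err.mean())        # 7.2
print("Median AE:  ", np.median(abs_err))    # 10.0
print("Mean APE:   ", ape.mean())            # ~12.9%, dominated by the $2,000 house
print("Median APE: ", np.median(ape))        # 4.0%
print("RMSE:       ", np.sqrt(np.mean((y_true - y_pred) ** 2)))  # ~8.1, larger than the mean AE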

The CI and CoV metrics can be used to quantify how confident we are in our estimates. By choosing an appropriate threshold for the CoV, we can determine the proportion of cases that we can predict with reasonably high certainty and therefore can potentially automate those. There are of course other ways of quantifying uncertainty [123] but in this post, we will only cover CI and CoV.

We now need to define reasonable baselines to compare our models to.

  • Baseline 1: Overall mean (or median)
    • For the housing price example, this is the overall mean value of all houses in Boston.
  • Baseline 2: Mean (or median) of a particular group
    • For the housing price example, we can define the group as houses with a certain number of rooms. The predicted value will then be the mean value of houses with that number of rooms. We need to make sure that we have enough samples of data in each of the groups. If there aren't sufficient samples, then we should fall back to the overall mean.
  • Baseline 3: Current deployed model
    • Since we do not have access to this model for our example, we will drop this baseline.

The next step is to train a regression model. For this example, I used a pipeline consisting of a standard scaler, PCA to potentially reduce the dimensionality of the feature vector and a gradient boosting regressor to finally predict the value of the house. The quantile loss function was used so that we can predict the median value (alpha = 0.5) and the lower and upper values of the confidence interval (defined by alpha of 0.05 and 0.95 respectively). The code used to build the model can be found here.
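A minimal sketch of that pipeline is shown below. The hyperparameter values are illustrative and X_train/y_train are assumed to hold the training split; the exact settings are in the linked notebook.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor

def quantile_pipeline(alpha):
    # One pipeline per quantile: alpha=0.5 gives the median prediction,
    # alpha=0.05 and 0.95 give the lower and upper bounds of the interval.
    return Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.95)),  # keep 95% of the variance (illustrative)
        ("gbr", GradientBoostingRegressor(loss="quantile", alpha=alpha, n_estimators=200)),
    ])

models = {alpha: quantile_pipeline(alpha).fit(X_train, y_train)
          for alpha in (0.05, 0.5, 0.95)}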

Once the models are trained, we then need to evaluate them. We consider the following evaluation metrics:

  • Absolute Error (AE): This is defined as | y - ŷ |, where y is the actual value and ŷ is the estimated value
  • Absolute Percentage Error (APE): This is defined as | y - ŷ | / y * 100
    • Squared Error (SE): This is defined as (y - ŷ)²

We can now translate the above metrics into the final business metrics as follows:

  • Mean/Median AE: Mean/median of the absolute errors over all examples in the test set
  • Mean/Median APE: Mean/median of the absolute % errors over all examples in the test set
  • RMSE: Square root of the mean squared error over all examples in the test set

The above errors are visualised in the figure below.

reg_key_perf_metrics.png

Observations:

  • The Quantile Regression (Q-Regression) model does much better than the baselines, achieving the lowest values for all error metrics. The mean and median APEs for the model are 10% and 7% respectively.
  • The mean and median error metrics show that the distribution of the errors is positively skewed. This is also reflected in the RMSE, which is greater than the MAE, showing the effect of the larger errors.

In order to get a better understanding of the distribution of the errors, it is useful to plot the cumulative distribution of the APE, shown below. It can be seen that for the Q-Regression model, 90% of the cases achieve an error of less than 20%. For baselines 1 and 2, only roughly 60% and 50% of the cases, respectively, have less than 20% error.

ape_cdf.png

Finally, in order to determine which cases we can accurately estimate and in turn automate, we need to look at the confidence interval (CI) and the coefficient of variation (CoV). The 95% CI is estimated using the Q-Regression model by setting the quantile parameter alpha to 0.05 and 0.95. This gives us a rough measure of the spread of the estimated value. We can then compute a rough measure of the CoV by dividing this spread by the median predicted value. We expect to see a higher CoV for an estimate that we are highly uncertain of. Ideally, the CoV will also be correlated with the error metric, but that need not always be the case. By setting a threshold on the CoV, we can determine how many cases we can predict with reasonably high certainty and thereby automate.
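A rough sketch of that computation, assuming the three quantile pipelines from the earlier sketch and a hold-out set X_test:

y_lo = models[0.05].predict(X_test)    # lower bound of the interval
y_med = models[0.5].predict(X_test)    # median prediction
y_hi = models[0.95].predict(X_test)    # upper bound of the interval

spread = y_hi - y_lo                   # rough measure of the CI width
cov = spread / y_med                   # rough coefficient of variation

def pct_automatable(cov, threshold):
    # Fraction of cases whose CoV falls below the chosen threshold
    return (cov <= threshold).mean()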

In this example, a CoV threshold of 0.625 is chosen. This results in roughly 50% of the cases being automated with reasonably high certainty in the estimated value. The figure below shows the box plots of the APE metric for all cases and for the 'automatable' cases.

reg_automation.png

Observations:

  • The 50% cases identified as 'automatable' have much lower errors than all the cases.
    • Mean APE drops from 10% to 6%
    • Median APE drops from 7% to 5%
    • Maximum APE drops from 92% to 17%

This concludes the two-part series on measuring model goodness. We've looked at measurability through the lens of the business and defined a template for the entire process for two broad classes of machine learning problems: classification and regression. This is summarised below. For the sake of completeness, the classification entries are repeated from the summary above.

Quantify Key Business Metrics

Classification:
- The value and cost of predicting true positives/negatives and false positives/negatives, respectively. For a binary classification problem, this can be represented as a 2D value-cost matrix.
- The percentage of cases that can be automated by the model without requiring human intervention.

Regression:
- Error metrics, where a lower value is preferred:
   - Mean/median absolute error (MAE)
   - Mean/median absolute % error (MAPE)
   - Root mean squared error (RMSE)
- Measure of confidence or certainty for the predictions. In this post, we looked at the confidence interval (CI) and coefficient of variation (CoV).

Define Reasonable Baselines

Classification:
- Random classifier: A model that determines the classes randomly.
- Majority-class classifier: A model that always picks the majority class. This is relevant when the classes are highly imbalanced, as in fraud/spam detection.
- Currently deployed classifier: This can be a model/human that the customer has currently deployed to perform the classification task.

Regression:
- Overall mean: This is the overall mean value of the target variable.
- Mean value by group: This is the mean value of the target variable for a particular group. We need to make sure that there are enough samples for the groups of interest; if not, fall back to the overall mean.
- Currently deployed model: This can be a model/human that the customer has currently deployed to perform the predictions.

Build and Evaluate Model

Classification: Using historical data, train a classifier and tune hyperparameters using cross-validation. Then evaluate the model on a hold-out test set using the following metrics:
- True Positives
- True Negatives
- False Positives
- False Negatives
- Precision
- Recall

The first 4 metrics can be visualised using a confusion matrix and ROC curve. The last 2 metrics can be visualised using a PR curve.

Regression: Using historical data, train a regression model and tune hyperparameters using cross-validation. Then evaluate the model on a hold-out test set using the following metrics:
- Absolute Error (AE)
- Absolute % Error (APE)
- Squared Error (SE)

The distributions of these metrics can be visualised using a PDF or CDF.

Translate Model Performance to Business Metrics

Classification:
- Expected value: Can be computed using the confusion matrix and the value-cost matrix.
- % cases to manually review: Can be computed using the precision, recall and actual positive class rate.

Regression:
- Error metrics: Can be easily computed by taking the mean and median of the above metrics.
- Confidence measure: Can estimate the CI for each estimate using quantile regression. A rough measure of CoV can be computed using the CI and the median predicted value. By setting a CoV threshold, we can determine the % of cases that can be 'automatable'.

Hope you enjoyed this series. Thanks for reading!

This article has been cross-posted from the Microsoft Data Insights blog here.

Measuring Model Goodness - Part 1

Data and AI are transforming businesses worldwide from finance, manufacturing and retail to healthcare, telecommunications and education. At the core of this transformation is the ability to convert raw data into information and useful, actionable insights. This is where data science and machine learning come in.

Machine Learning [source]

The above method, although facetious, is certainly one way of building a machine learning system. If it however needs to be good and reliable, we should be a bit more methodical about it by:

  • understanding the business needs
  • acquiring and processing the relevant data
  • accurately formulating the problem
  • building the model using the right machine learning algorithm
  • evaluating the model, and
  • validating the performance in the real world before finally deploying it

This entire process has been documented as the Team Data Science Process (TDSP) at Microsoft, captured in the figure below.

TDSP [source]

This post, as part of a two-part series, is focused on measuring model goodness, specifically looking at quantifying business value and converting typical machine learning performance metrics (like precision, recall, RMSE, etc.) to business metrics. This is typically how models are validated and accepted in the real world. The relevant stages of the process are highlighted in red in the figure above. Measuring model goodness also involves comparing the model performance to reasonable baselines, which will also be covered in this post. To illustrate all of this, two broad classes of machine learning problems will be looked at:

  • Classification: Covered in Part 1
  • Regression: To be covered in Part 2

Classification

Classification is the process of predicting qualitative or categorical responses [source]. It is a supervised machine learning technique that classifies a new observation to a set of discrete categories. This is done based on training data that contains observations whose categories are known. Classification problems can be binary in which there are two target categories or multi-class in which there are more than two mutually exclusive target categories.

A Binary Classification Problem

In this post, I will look at the binary classification problem taking breast cancer detection as a concrete example. I will be using the open breast cancer Wisconsin dataset for model building and evaluation. The techniques discussed below can easily be extended to other binary and multi-class classification problems. All the source code used to build and evaluate the models can be found on Github here.
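The dataset itself ships with scikit-learn, so loading it is a one-liner (a minimal sketch):

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target       # 569 samples, 30 features
# Note: scikit-learn encodes 0 = malignant and 1 = benign, so the labels may need
# to be flipped if malignant is to be treated as the positive class.
print(X.shape, cancer.target_names)     # (569, 30) ['malignant' 'benign']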

Let’s first understand the business context and the data. Breast cancer is the most common cancer in women and is the main cause of death from cancer among women [source]. Early detection of the cancer type — whether it is benign or malignant — can help save lives by picking the appropriate treatment strategy. In this dataset, there are 569 cases in total, out of which 357 are benign (62.6%) and 212 are malignant (37.4%). There are 30 features in total, computed from a digitised image of a biopsy of a breast mass. These features describe the characteristics of the nuclei, such as radius, texture, perimeter and smoothness. The detection of the type of cancer is currently done by a radiologist, which is time-consuming, and the business need is to speed up the diagnosis process so that the appropriate treatment can be started early. Before diving too deep into the modelling stage, we need to understand the value/cost of making the right/wrong prediction. This information can be represented using a value-cost matrix as shown below:

vc_matrix_250_170.png

The rows in the matrix represent the actual class and the columns represent the predicted class. The first row/column represents the negative class (benign in this case) and the second row/column represents the positive class (malignant in this case). Here’s what each of the elements in the matrix represents:

  • Row 0, Column 0: Value of predicting the negative class correctly
  • Row 0, Column 1: Cost of predicting the positive class wrongly
  • Row 1, Column 0: Cost of predicting the negative class wrongly
  • Row 1, Column 1: Value of predicting the positive class correctly

For breast cancer detection, we can define the value-cost matrix as follows:

value_cost_matrix_400_360.png

  • Detecting benign and malignant cases correctly is given equal and positive value. Although the malignant type is the worse situation for the patient, the goal is to diagnose early, start treatment and cure both types of cancer. So, from a treatment perspective, detecting both cases accurately has equal value.
  • The cost of flagging a malignant case as benign (false negative) is much worse than flagging a benign case as malignant (false positive). Therefore, the false negative is given a cost of -4 and the false positive is given a cost of -2.

Another business need is to automate as much of the process as possible. One way of doing this is to use the machine learning model as a filter, to automate the detection of simpler benign cases and only flag the possible malignant cases for the radiologist to review. This is shown in the diagram below. We can quantify this business need by looking at the % of cases to review manually (denoted as x in the figure). Ideally, we want to automate everything without requiring the human in the loop. This means that the model should have 100% accuracy and x should be equal to the actual proportion of positive classes.

automation.PNG

After the business needs are properly quantified, the next step is to define reasonable baselines to compare our models to.

  • Baseline 1: Random classifier that decides randomly if the cancer type is benign / malignant.
  • Baseline 2: Majority-class classifier that picks the majority class all the time. In this case, the majority class is benign. This strategy would make much more sense if the classes were highly imbalanced.

We could add a third baseline which is the performance of the radiologist or any other model that the business has currently deployed. In this example though, this information is not known and so this baseline is dropped.

Now comes the fun part of building the model. I will gloss over a lot of the detail here, as an entire post may be necessary to go through the exploratory analysis, modelling techniques and best practices. I used a pipeline consisting of a feature scaler, PCA for feature reduction and finally a random forest (RF) classifier. 5-fold cross validation and grid search were done to determine the optimum hyperparameters. The entire source code can be found here.
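A skeleton of that pipeline might look as follows. The parameter grid is illustrative rather than the exact one from the notebook, and X_train/y_train are assumed to hold the training split.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("rf", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "pca__n_components": [5, 10, 20, 30],
    "rf__n_estimators": [100, 200],
    "rf__max_depth": [None, 5, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")  # 5-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_)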

Once the models are trained, the next step is to evaluate them. There are various performance metrics that can be used but, in this post, I will go over the following 6 metrics as they can very easily be explained, visualised and translated into business metrics.

  • True Positives (TP) / True Positive Rate (TPR): Number of correct positive predictions / Probability of predicting positive given that the actual class is positive
  • False Negatives (FN) / False Negative Rate (FNR): Number of wrong negative predictions / Probability of predicting negative given that the actual class is positive
  • True Negatives (TN) / True Negative Rate (TNR): Number of correct negative predictions / Probability of predicting negative given that the actual class is negative
  • False Positives (FP) / False Positive Rate (FPR): Number of wrong positive predictions / Probability of predicting positive given that the actual class is negative
  • Precision (P): Proportion of predicted positives that are correct
  • Recall (R): Proportion of actual positives captured

All these metrics are inter-related. The first four can be easily visualised using a ROC curve and a confusion matrix. The last two can be visualised using a precision-recall (PR) curve. Let’s first visualise the performance of the model and the baselines using the ROC and PR curves.
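In code, these quantities and curves come from a handful of scikit-learn calls. A sketch, assuming y_test holds the test labels and y_score the model's predicted probabilities for the positive class:

from sklearn.metrics import confusion_matrix, roc_curve, precision_recall_curve

y_pred = (y_score >= 0.5).astype(int)                  # example threshold of 0.5
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr, tnr = tp / (tp + fn), tn / (tn + fp)
precision, recall = tp / (tp + fp), tp / (tp + fn)

# Curves across all thresholds, for the ROC and PR plots
fpr_pts, tpr_pts, _ = roc_curve(y_test, y_score)
prec_pts, rec_pts, _ = precision_recall_curve(y_test, y_score)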

Both the ROC and PR plots show the performance of the models using various thresholds to determine the positive and negative classes. As can be seen, the RF classifier outperforms the baselines in every respect. Once the appropriate threshold is picked, we can plot the normalised confusion matrices as shown below. These matrices show the conditional probabilities of the predicted values given the actual value. We can see that the baselines do quite poorly, especially in predicting the positive class — the false negative rates are quite high. The RF classifier, on the other hand, seems to get much more of the positive and negative class predictions right, achieving a TPR of 93% and a TNR of 97%.

Now that we’ve established that the new RF model outperforms the baselines, we still need to make sure that it meets our business objectives. i.e. high positive business value and fewer cases to review manually. We therefore need to translate the above performance metrics into the following:

  • Total expected value/cost of the model
  • % of cases that need to be reviewed manually

The first business metric can be computed by taking the dot product of the flattened confusion matrix (normalised by the total number of samples, S) with the flattened value-cost matrix. This is shown below.

expected_value_600_170.png

Note that S is a scalar equal to TN + FP + FN + TP. The expected value is essentially the weighted average of the values/costs, where the weights are the probabilities of predicting the positive and negative classes [source].
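In code, this boils down to a single dot product. The confusion-matrix counts below are placeholders, and the correct predictions are assumed to carry a value of +1 as per the value-cost matrix above:

import numpy as np

conf_mat = np.array([[70, 2],     # [[TN, FP],
                     [3, 39]])    #  [FN, TP]], placeholder counts
value_cost = np.array([[1, -2],   # value of TN, cost of FP
                       [-4, 1]])  # cost of FN, value of TP

S = conf_mat.sum()
expected_value = np.dot(conf_mat.ravel() / S, value_cost.ravel())
print(f"Expected value per case: {expected_value:.2f}")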

The second business metric can be computed using precision and recall as follows.

automation_formula_600_60.png

The positive class rate is known from the data used to evaluate the model, i.e. the test set. Based on the business needs, we then need to decide how much of those positive classes we must accurately determine, i.e. the recall. If we want to detect all the positive cases, the target recall should be 100%. We can then find the corresponding precision (and threshold) from the PR curve. For the ideal model with 100% precision and recall, the proportion of positive cases to review would be equal to the actual positive class rate. This means we could, in theory, achieve 100% automation.
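Put into code, the calculation is only a couple of lines. The precision value below is illustrative of the operating point at 100% recall:

def pct_cases_to_review(positive_class_rate, recall, precision):
    # Cases flagged for manual review = TP + FP, expressed as a fraction of all cases
    return positive_class_rate * recall / precision

# ~37% malignant cases, 100% target recall, ~75% precision at that threshold
print(pct_cases_to_review(0.374, 1.0, 0.75))   # ~0.5, i.e. about half of all cases reviewed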

The following two plots show the final evaluation of the model goodness in terms of the two key business metrics.

Observations:

  • Both baseline models have quite large costs because of the high false negatives. The random forest classifier has a good positive value by getting a lot of the positive and negative cases right.
  • For the case of breast cancer detection, we want to capture all the positive/malignant cases, i.e. 100% recall. Looking at 50% or 75% recall does not make sense in this example because of the high cost of not treating the malignant cases. For other binary classification problems like fraud detection for instance, a lesser recall may be acceptable so long as the model can flag the high cost-saving fraudulent cases. Regardless, we can see that the random forest classifier outperforms the baselines on the automation front as well. To capture 100% of the malignant cases, the radiologist would only need to review about 50% of all cases, i.e. 13% additional false positives, whereas he/she would have to review almost all the cases using the baseline models.

In conclusion, the first part of this series has looked at measuring model goodness through the lens of the business, specifically looking at classification-type problems. For further details on this subject, the books below are great references. In the second and final part of this series, I will cover regression which requires looking at slightly different metrics.

Further Reading

This article has been cross-posted from the Microsoft Data Insights blog here.

Reverse Geocoding Custom Data Sources

According to Wikipedia,

Reverse geocoding is the process of back (reverse) coding of a point location (latitude, longitude) to a readable address or place name. This permits the identification of nearby street addresses, places, and/or areal subdivisions such as neighbourhoods, county, state, or country. 

Reverse geocoding is crucial to the work that I do at OpenSignal, where I have to churn through terabytes of crowdsourced data to compare operator performance (in terms of coverage and data speeds) and user numbers in various geographical areas across the globe. To be able to do these analyses easily, I built a geocoder library in Python that is offline and fast, and that improves on an existing one built by Richard Penman. Being offline means you do not have to deal with slow web APIs (such as Nominatim and Google) and query limits. The library was built with speed in mind and it can geocode 10 million GPS coordinates in under 30 seconds on a machine with 8 cores.

Since its release over a year ago, the library has been pretty well received by the community. It's been downloaded over 10,000 times and has received 1230 stars on Github. Being featured as the #1 post on Hacker News certainly helped drive a lot of traffic to it. I'm really grateful to the community who have helped report and squash quite a few bugs along the way.

Under the Hood

Under the hood, the library comes packaged with a database of places with a population greater than 1000, which was obtained from GeoNames. This entire database is loaded into a k-d tree and the nearest neighbour algorithm is used to find the city/town closest to the input point location. There's a nice explanation of k-d trees in Data Skeptic, one of my favourite podcasts. The scipy implementation of k-d trees is, unfortunately, single-threaded and does not exploit the multiple CPUs available on your machine. Thus, to improve performance, I implemented a parallelised k-d tree that comes into its own for really large inputs (in the order of millions) as seen in the graph below.
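The idea behind the parallelisation is simple: split the query points into chunks and let each worker query the k-d tree independently. A stripped-down sketch of the approach (not the library's actual implementation) could look like this:

import multiprocessing as mp
import numpy as np
from scipy.spatial import cKDTree

# Stand-in for the GeoNames (lat, lon) points; the real library loads these from its database
city_coords = np.random.uniform(-90, 90, size=(100000, 2))
tree = cKDTree(city_coords)

def query_chunk(chunk):
    # Each worker finds the nearest city for its chunk of input coordinates
    _, indices = tree.query(chunk, k=1)
    return indices

def parallel_query(points, n_workers=8):
    chunks = np.array_split(points, n_workers)
    with mp.Pool(n_workers) as pool:   # assumes fork-style workers that share the tree
        results = pool.map(query_chunk, chunks)
    return np.concatenate(results)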

Running time (in seconds) for various input sizes

For the nearest neighbour algorithm, I use the Euclidean distance formula to compare distances between any two GPS coordinates. This isn't the most accurate, especially at latitudes further away from the equator where projection distortions are much more noticeable. I had to make a tradeoff between accuracy and speed, and I chose to focus on speed. The primary use case for this library, as I envisioned it, was to be able to geocode to larger administrative regions (such as cities and counties) and at this scale, inaccuracies due to latitude distortions are not so apparent. If you're interested in smaller regions, you could implement the Haversine formula as detailed in this post.
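For reference, a drop-in Haversine distance (not part of the library) looks something like this:

import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle distance between two (lat, lon) points in kilometres
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

print(haversine(1.3521, 103.8198, 1.4927, 103.7414))  # Singapore to Johor Bahru, ~18 km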

Usage at OpenSignal

Within OpenSignal, the library is used in various ways. We use it for ad-hoc analysis of our coverage and speed data and also in our coverage maps where you can compare the performance of operators in a country/region.

OpenSignal Coverage Map for Singapore (App available on the App Store and Google Play)

If you pan across the map above, the reverse geocoder library is used to determine the main country of interest. We, however, noticed a problem when you zoom into a border region between two or more countries.

The Problem

Taking Singapore as an example, the library fails when you query for locations near the border. 

An example of the library failing at border regions

The locations circled in blue are the ones in the database obtained from GeoNames. The locations circled in black are the ones given as input to the geocoder library. The lines show the geocoded locations returned by the library for those inputs. As you can see, the lines in red show the wrong country being returned. For instance, if you search for a location in the north of Singapore such as Woodlands or Yishun, the library returns Johor Bahru in Malaysia. Singapore, being a city-state, has only one administrative region in the GeoNames database and the location is unfortunately set to the downtown area in the south.

The Solution

I've now modified the library so that it accepts custom data sources. To fix the Singapore problem, you can customise the GeoNames database by adding the following three locations:

lat lon name admin1 admin2 cc
1.38511 103.98685 Singapore East SG
1.39367 103.66711 Singapore West SG
1.45106 103.80779 Singapore North SG

You can load this custom data source and pass it to the library as follows:

import io
import reverse_geocoder as rg

# mode=2 selects the parallelised k-d tree (mode=1 is the single-threaded version);
# the stream argument replaces the bundled GeoNames data with the custom source
geo = rg.RGeocoder(mode=2, verbose=True,
                   stream=io.StringIO(open('custom_source.csv', encoding='utf-8').read()))

# (lat, lon) tuples to reverse geocode
coordinates = (51.5214588, -0.1729636), (9.936033, 76.259952), (37.38605, -122.08385)
results = geo.query(coordinates)

The results of this are shown in the image below.

Geocoding fixed with custom data source

The three custom locations are circled in magenta and you can see that the library now returns the right location (as shown by the green lines) within the country of interest.

Special thanks to Grégoire Charvet for reviewing my code for this enhancement.