SHAP (SHapley Additive exPlanations)

Code Implementation here - Classification, Regression

Introduction

In this section, we explore what SHAP is and how it helps us interpret a model. Given a sample and its prediction, SHAP decomposes the prediction additively among the features using a game-theoretic approach.

Game theory was developed as a mathematical theory by John von Neumann and Oskar Morgenstern, whose 1944 book "Theory of Games and Economic Behavior" is considered one of the main foundational texts of the field. Later, the economists John Nash, John Harsanyi, and Reinhard Selten received the Nobel Prize in Economics in 1994 for further developing game theory in relation to economics.

A great way to increase the transparency of a model is by using SHAP values. SHAP (SHapley Additive exPlanations), proposed by Lundberg and Lee (2017), is a unified, game-theoretic approach to explaining the output of any machine learning model by measuring each feature's contribution to the prediction. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.

Game theory

Game theory models the strategic interaction between two or more people in a situation with set rules and outcomes, in which each person's payoff is affected by the decisions made by the others. Game theory is used by economists, political scientists, the military, and others.

Non-Cooperative Game theory

Non-cooperative game theory describes a competitive social interaction in which there will be some winners and some losers. This is where the Nash equilibrium comes into play. This book does not cover game theory in detail; to read more about the Nash equilibrium, refer to the article from Investopedia. Non-cooperative game theory is best understood through the example of the Prisoner's Dilemma.

Cooperative Game theory

Cooperative game theory is where every player has agreed to work together towards a common goal. Just as non-cooperative game theory has the Nash equilibrium, cooperative game theory has Shapley values. In game theory, a coalition is what you call a group of players in a cooperative game.

Shapley values

A method of dividing up the gains or costs among players, according to the value of their individual contributions.

It rests on three important pillars:

1. Marginal contribution

The contribution of each player is determined by what is gained or lost by removing them from the game. This is called their marginal contribution. To make this clear, let us take an example.

Say every day you and your friends bake cookies. One day you get sick and cannot join them. That day, your friends produce 50 fewer cookies than they would have if you had been there with them. So your marginal contribution to the coalition is 50 cookies per day.

2. Interchangeable players have equal value

If two players bring the same contribution to every coalition, they are interchangeable and should be rewarded equally for their contributions.

3. Dummy players have zero value

If a member of the coalition contributes nothing, then they should receive nothing. This might not seem fair in every situation, so let us take an example to make the concept clearer.

If you go to a restaurant with friends and you order nothing and eat nothing, then there is no need for you to chip in on the bill. But in another scenario, not being paid while on maternity leave simply because you are not working does not seem fair.

Mathematically, Shapley values are represented as follows.

In a coalitional game, we have a set N of n players. We also have a function v that gives the value (or payout) of any subset of those players: if S is a subset of N, then v(S) gives the value of that subset. For a coalitional game (N, v), we can use the equation below to calculate the payout for player i, i.e. the Shapley value for any feature/column.
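In this notation, the Shapley value of player i is given by the standard formula:

$$
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)
$$

In words, we take player i's marginal contribution v(S ∪ {i}) − v(S) to every coalition S that does not contain i, and average those contributions, weighting each coalition by the number of join orders in which it can occur.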

Let us take an example and break it down to understand it better. Again, we will take the cookie example. Yes, you guessed it right, we love cookies too much.

Let us say David and Lisa are making cookies separately. David makes 10 cookies and Lisa makes 20 cookies.

When David and Lisa work together, they streamline the process and are able to bake 40 cookies.

If we value each cookie at $1, they make $30 when they bake separately (30 cookies: David makes 10 and Lisa makes 20), but $40 when they bake together (40 cookies).

Let us now calculate the marginal contributions.

Case 1

If David makes 10 cookies alone, then 40 - 10 = 30.

This is Lisa's marginal contribution to the coalition.

Case 2

If Lisa makes 20 cookies alone, then 40 - 20 = 20.

This is the marginal contribution of David to the coalition.

So in the first case, David's value to the coalition is 10 cookies, and in the second case, David's value to the coalition is 20 cookies. According to the Shapley equation, to find David's Shapley value we average these two values, i.e. (10 + 20)/2 = 15. This is the Shapley value for David.

Similarly, for Lisa, in the first case her value to the coalition is 30 cookies, and in the second case her value to the coalition is 20 cookies. So the Shapley value for Lisa is (20 + 30)/2 = 25.
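For readers who prefer code, here is a minimal, self-contained sketch (not taken from the linked notebooks) that reproduces the numbers above by averaging each player's marginal contribution over every order in which the players could join the coalition:

```python
from itertools import permutations
from math import factorial

# Coalition values for the cookie example: v(S) = cookies a coalition S can bake
v = {
    frozenset(): 0,
    frozenset({"David"}): 10,
    frozenset({"Lisa"}): 20,
    frozenset({"David", "Lisa"}): 40,
}

players = ["David", "Lisa"]
shapley = {p: 0.0 for p in players}

# Sum each player's marginal contribution over all join orders...
for order in permutations(players):
    coalition = frozenset()
    for player in order:
        shapley[player] += v[coalition | {player}] - v[coalition]
        coalition = coalition | {player}

# ...then average over the number of join orders
shapley = {p: total / factorial(len(players)) for p, total in shapley.items()}
print(shapley)  # {'David': 15.0, 'Lisa': 25.0}
```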

We hope the above example has given some clarity on what Shapley values are.

SHAP assigns an importance value to every feature, corresponding to that feature's contribution to the prediction.

SHAP can be applied to tabular data and image data. SHAP is a model-agnostic method, which means it can be applied to any model.

Visualizations

The various interpretability plots provided by SHAP are:

1. Global Bar Plots

Global bar plots show the global feature importance of a model, with each bar giving the mean absolute SHAP value of a feature.

Classification:

In the below plot, you can see a global bar plot for our XGBClassifier, where features are displayed in descending order of their mean absolute SHAP value. From this plot, it is safe to say that our XGBoost model considers ‘Tenure’ to be the most important feature in predicting whether a customer will ‘Churn’ or not. Intuitively, we would expect contract tenure to be important too, and the plot aligns with our intuition.

Note how the ‘max_display’ parameter allows users to customize plots. In this case, we have set the value to 15, but you can adjust the number to view the top n features.
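The linked code implementation contains the full workflow; the snippet below is only a minimal sketch of how such a plot can be produced with the shap library, using a small synthetic stand-in dataframe rather than the actual Telco churn data:

```python
import numpy as np
import pandas as pd
import shap
import xgboost

# Synthetic stand-in: replace with the preprocessed Telco churn data from the linked notebooks
rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "Tenure": rng.integers(0, 72, n),
    "Monthly Charges": rng.uniform(20, 120, n),
    "Total Charges": rng.uniform(20, 8000, n),
    "Internet Service_Fiber optic": rng.integers(0, 2, n),
})
y = (X["Tenure"] < 12).astype(int)  # toy stand-in for the 'Churn' label

model = xgboost.XGBClassifier().fit(X, y)

# shap.Explainer dispatches to the fast tree explainer for XGBoost models
explainer = shap.Explainer(model)
shap_values = explainer(X)

# Global bar plot: features ranked by mean |SHAP value|, showing at most the top 15
shap.plots.bar(shap_values, max_display=15)
```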

Regression:

We have used the same data for the regression scenario, but the continuous target is ‘Monthly Charges’. The plot is like the classification bar plot, with all features listed in descending order of importance on the y-axis and the mean absolute SHAP value on the x-axis. As we can infer, ‘Internet Service_Fiber optic’ has a high impact in predicting ‘Monthly Charges’.
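A corresponding sketch for the regression scenario, continuing from the classification snippet above and swapping in ‘Monthly Charges’ as the target (again an illustration, not the notebooks' exact code):

```python
import shap
import xgboost

# Reuse the stand-in X from the classification sketch, with 'Monthly Charges' as the target
X_reg = X.drop(columns=["Monthly Charges"])
y_reg = X["Monthly Charges"]

reg_model = xgboost.XGBRegressor().fit(X_reg, y_reg)
reg_explainer = shap.Explainer(reg_model)
reg_shap_values = reg_explainer(X_reg)

shap.plots.bar(reg_shap_values, max_display=15)
```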

2. Local Bar Plot

Local bar plots are like global bar plots in some ways, but the key difference is that they show the feature importance as per the model for a single instance/row of data rather than the entire dataset.

Classification:

Below, you can see the local bar plot corresponding to row number 1000. The result is somewhat similar to our Global bar plot as ‘Tenure’ is once again the most important feature.

If we make the local bar plot for row number 3000, we see different results. Naturally, not all instances would have similar explanations for their predictions. In this case, ‘Tenure’ - which was the most important feature as per the Global bar plot, and which we saw to be important for the 1000th instance, is missing!

Instead, we see that Fiber Optic Internet Service has impacted the prediction the most.

Note how the x-axis of the local bar plots does not read mean absolute SHAP value. Since the plot is for a single instance, the actual SHAP values calculated for that instance are plotted.
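Continuing from the classification sketch above, a local bar plot only requires indexing the Explanation object with the row of interest (the row numbers below are the ones mentioned in the text):

```python
import shap

# Local bar plots for single rows: index into the Explanation from the classification sketch
shap.plots.bar(shap_values[1000])
shap.plots.bar(shap_values[3000])
```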

Regression:

In the case of the regression plot, we picked another random row, the 2500th in this case. From the plot, we can infer that ‘Internet Service_Fiber optic’ is an important feature, which we could also confirm from the global summary plot. In the local bar plot, however, we can additionally see the positive or negative impact of each feature: the bar in blue shows a negative SHAP value, indicating that ‘Internet Service_Fiber optic’ matters in the negative direction and is pulling the final predicted value lower.

The below plot is for the 500th row, where all the features have a positive impact and are listed as per their importance.
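The regression counterparts are the same calls on the regression Explanation from the earlier sketch:

```python
import shap

# Local bar plots for the rows discussed above, using the regression Explanation
shap.plots.bar(reg_shap_values[2500])
shap.plots.bar(reg_shap_values[500])
```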

3. Cohort Plots

In cohort plots, we split the instances into groups based on a feature's value and compare feature importance across the groups. In the below plot we grouped by the target feature, ‘Churn’, so there are two SHAP importance bars for every feature: one for ‘Yes’ and one for ‘No’. The value shown on the side is the count of ‘Yes’ or ‘No’ instances in each cohort.

‘Tenure’ has a bar with a SHAP value of 0.1 for “Yes” and 0.08 for “No”. This shows how we can group on one feature's value to understand its impact.
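A sketch of how such a cohort plot can be produced, assuming the cohorts helper of the shap Explanation API and continuing from the classification sketch above:

```python
import shap

# Split rows into 'Yes'/'No' churn cohorts and plot the mean |SHAP value| per cohort
churn_labels = ["Yes" if label == 1 else "No" for label in y]
shap.plots.bar(shap_values.cohorts(churn_labels).abs.mean(0))
```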


4. Feature Clustered Bar Plots

With Feature Clustered plots, we can find redundant features - where either of the two features could be used to predict the target value. We can also change the threshold value to find more clusters.

Practitioners often compute correlations or apply other clustering techniques to identify redundant features. With SHAP, we can use a more direct approach based on model loss comparisons.

We can set a threshold value for the clustering between 0 and 1 – where 0 implies the features are perfectly redundant and 1 means they are fully independent.

Classification:

In the below plot, we observe that with a threshold value of 0.5, ‘Tenure’ and ‘Total Charges’ have more than 50% redundancy. Also, ‘Monthly Charges’ is redundant with 6 other columns, including ‘Internet Service_No’ and ‘Online Security_No internet service’.
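A sketch of how such a plot can be produced with shap's hierarchical feature clustering utility, continuing from the classification sketch above; the 0.5 cutoff mirrors the threshold discussed here:

```python
import shap

# Cluster features by how redundant they are for predicting y, then pass the clustering to the bar plot
clustering = shap.utils.hclust(X, y)
shap.plots.bar(shap_values, clustering=clustering, clustering_cutoff=0.5)
```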

Regression:

Similarly, in the case of regression with the same threshold, the redundant features are the same.

Note that SHAP values computed by some Explainer configurations already include a feature clustering, in which case the global bar plot does not need users to explicitly pass the clustering parameter.

5. Beeswarm Plot

The beeswarm plot explains along the same lines as a global bar plot. The enhancement is that the beeswarm plot emphasizes the positive and negative impact of a feature across individual instances.

Classification:

The below plot is for the classification scenario, where each feature's impact is highlighted with a color scheme representing the feature value. We can conclude that the bulk of the instances have a high Tenure value, and that low Tenure values tend to have high SHAP values in our prediction. Hence we can infer that a lower value of Tenure drives the target value towards 1 – Churn_Yes.

Intuitively, we would say the same thing – if a customer has a short Tenure, he/she is more likely to Churn.
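Producing the beeswarm plot is a single call on the Explanation from the classification sketch above:

```python
import shap

# Beeswarm: one dot per row per feature; x-axis = SHAP value, color = feature value
shap.plots.beeswarm(shap_values)
```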

Regression:

We can apply the same logic in the case of regression. Internet Service_Fiber optic has the highest importance, and if a customer has fiber-optic internet service, the SHAP value tends to be high, so the customer is likely to have higher Monthly Charges. Customers without fiber-optic internet service, however, have negative/low SHAP values and hence lower Monthly Charges.
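And likewise for the regression Explanation from the earlier sketch:

```python
import shap

shap.plots.beeswarm(reg_shap_values)
```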

Note that each dot indicates a particular row, and a high SHAP value implies a higher predicted value for our target. The color scheme only represents the corresponding feature value.

This is also a summary plot, but here the exact SHAP values are plotted instead of the mean absolute SHAP values.

Pros

  • It is better than standard feature importance since it considers the coalition effect when calculating feature importance

  • SHAP values give both positive and negative explanations for a prediction

  • SHAP connects LIME and Shapley values. This is very useful to better understand both methods. It also helps to unify the field of interpretable machine learning.

  • Global interpretations are consistent with the Local explanations

Cons

  • SHAP value calculation is computationally expensive, as it considers all possible feature coalitions

  • KernelSHAP explainers ignore feature dependence
