The SHAP with More Elegant Charts

2021-09-21 23:15

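I hope that explaining your model with SHAP values has helped your work a great deal. In this article, I will introduce more novel features of the SHAP plots. If you have not read the previous article, I suggest you read it first and then come back to this one.
A great real estate agent once inspired me with his professional house showing. He first showed me the exterior of the house, the lawn, and its location on the street. Then he walked me through every floor and every room. He encouraged me to open the drawers and closets, and I was amazed to see the recessed lights. As data scientists, we seem to present our machine learning models in a similar way. We explain to model users that the model as a whole makes sense: the positive or negative relationships between the predictors and the target variable are consistent with business practice. This is called global interpretability. Not only should the model as a whole make sense, the model's prediction for each case should make sense too. We explain why each case receives its prediction given the specific values of its predictors. This is called local interpretability. SHAP values can show both.
The SHAP API [1] has made great progress since its creation. In this article, I will introduce the new charts from simple to complex. For global interpretability, I will show you the novelties of (a) the bar plot, (b) the cohort plot, and (c) the heatmap plot. For local interpretability, I will cover (d) the waterfall plot, (e) the bar plot, (f) the force plot, and (g) the decision plot. You may also need to customize the appearance of SHAP plots. Many SHAP plots work with Matplotlib for customization, so I will show you how to customize the legend, font size, and so on, and how to use subplots to combine multiple SHAP plots horizontally or vertically. Finally, in some cases you may need to build a multiclass model whose target variable has several classes. I will show you how to use SHAP values to explain a multiclass model. The Jupyter notebook is available on GitHub (https://github.com/dataman-git/codes_for_articles/blob/master/The SHAP values with More Charts for article.ipynb).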

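0x Build an XGBoost Model
To make the results easy to compare, I use the same dataset as in "Explain Your Model with the SHAP Values" [2]. The target is a quality rating from low to high (0-10). The input variables are the measured contents of each wine sample: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. There are 1,599 wine samples, split into 1,279 training samples and 320 test samples.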
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import xgboost as xgb

df = pd.read_csv('winequality-red.csv')
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
Y = df['quality']
X = df[features]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1234)
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_model.fit(X_train, Y_train)

# The SHAP Values
import shap
explainer = shap.Explainer(xgb_model)
shap_values = explainer(X_test)
1x Global Interpretability
1.1 The Bar Plot for Feature Importance
shap.plots.bar(shap_values, max_display=10) # max_display=10 is also the default
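If you have too many predictors, a variable importance bar chart becomes long and ugly. It does not resonate with your audience and loses persuasive power. Cut off the tail of the chart? Then the audience will not know the collective contribution of the tail predictors, which can sometimes exceed that of the top predictors. The SHAP bar plot lets you specify how many predictors to display and sums up the contributions of the remaining ones. This is a nice selling point, because you can tell your audience the collective contribution of the tail predictors.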

Figure (1.1): The bar plot
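The tail summation can be reproduced by hand. The sketch below is a minimal illustration, not SHAP's internal code: it assumes a small made-up matrix of SHAP values (rows are observations, columns are features; the numbers and the helper names are hypothetical), ranks the features by mean absolute SHAP value, keeps the top entries, and folds the rest into a single bar labeled in the style of the real plot.

```python
# Toy SHAP matrix: 3 observations x 4 features (hypothetical numbers).
feature_names = ["alcohol", "sulphates", "pH", "density"]
shap_matrix = [
    [0.50, -0.20, 0.05, -0.01],
    [-0.30, 0.10, -0.03, 0.02],
    [0.40, 0.30, 0.01, -0.03],
]

def mean_abs_importance(matrix):
    """Mean absolute SHAP value per feature (column-wise)."""
    n_rows = len(matrix)
    return [sum(abs(row[j]) for row in matrix) / n_rows
            for j in range(len(matrix[0]))]

def bars_with_tail(names, importances, max_display):
    """Keep the top (max_display - 1) features and sum the rest into one bar."""
    ranked = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
    if len(ranked) <= max_display:
        return ranked
    head, tail = ranked[:max_display - 1], ranked[max_display - 1:]
    tail_label = f"Sum of {len(tail)} other features"
    return head + [(tail_label, sum(v for _, v in tail))]

importances = mean_abs_importance(shap_matrix)
bars = bars_with_tail(feature_names, importances, max_display=3)
for label, value in bars:
    print(f"{label:28s} {value:.3f}")
```

With max_display=3, two individual bars remain and the last bar carries the combined importance of the two weakest features, so nothing is silently dropped.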
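1.2 The Cohort Plot
A population can be divided into two or more groups according to a predictor, which gives more insight into the heterogeneity of the population. You can use .cohorts(N) to split the population into N cohorts (this uses sklearn's DecisionTreeRegressor). Below I split the population, that is, the 320 test samples, into two cohorts. It returns 237 samples for one cohort and 83 samples for the other, as shown in the legend.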
shap.plots.bar(shap_values.cohorts(2).abs.mean(0))

Figure (1.2): The cohort plot
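The threshold of this optimal split is alcohol = 11.15. The bar plot tells us that what sets apart the cohort with alcohol ≥ 11.15 is high alcohol content (SHAP = 0.5), high sulphates (SHAP = 0.2), high volatile acidity (SHAP = 0.18), and so on. This suggests a market segmentation strategy: we could label this cohort the best wines and the other cohort the high-value wines.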
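To see how a single threshold such as alcohol = 11.15 can arise, here is a toy sketch of the split search that a one-split regression tree performs: try each midpoint between sorted feature values and keep the one that minimizes the total squared error of the two groups. The data are invented (chosen so that the split lands on 11.15), the quantity being separated is an assumption, and the helper name best_split is hypothetical; .cohorts() delegates to scikit-learn rather than using this code.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(feature, target):
    """Return (threshold, left_count, right_count) minimizing total SSE."""
    pairs = sorted(zip(feature, target))
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical feature values
        threshold = (lo + hi) / 2
        left = [t for f, t in pairs if f < threshold]
        right = [t for f, t in pairs if f >= threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, threshold, len(left), len(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]

# Hypothetical alcohol levels and a per-sample quantity to separate.
alcohol = [9.4, 9.8, 10.0, 10.5, 11.0, 11.3, 12.0, 12.5]
target = [-0.3, -0.2, -0.25, -0.1, -0.15, 0.4, 0.5, 0.45]

threshold, n_low, n_high = best_split(alcohol, target)
print(f"split at alcohol = {threshold:.2f}: {n_low} vs {n_high} samples")
```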

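1.3 The Heatmap Plot
A heatmap visualizes the magnitude of data in two dimensions. The variation in color gives visual cues about how the data cluster. You can use a hot-to-cold color scheme to show the relationships. Variable importance is shown in descending order on the Y-axis, with bars resembling those of a bar plot. The f(x) curve at the top is the model prediction for each instance. SHAP first runs hierarchical clustering on the instances and then orders them on the X-axis (using shap.order.hclust). The center of the 2D heatmap is the base value (available via .base_values), which is the average prediction over all instances.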
shap.plots.heatmap(shap_values[1:100])
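The heatmap below shows that high predictions (the high values of f(x) on the right) are associated with high alcohol content and high sulphates (in red).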

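Figure (1.3): Heatmap (I)
I pick another set of instances below to show you that the clustering is done implicitly while the interpretation stays the same. High predictions (the high values of f(x) on the left) are associated with high alcohol content and high sulphates (in red).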
shap.plots.heatmap(shap_values[200:300])

Figure (1.3): Heatmap (II)
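2x Local Interpretability
I am going to show various plots for the same observations. Please compare Figures (2.1) through (2.4).
2.1 The Waterfall Plot for an Individual Case
The waterfall plot powerfully shows why a case receives its prediction given its variable values. You start at the bottom of the waterfall plot and add (red) or subtract (blue) values to arrive at the final prediction. The plot below shows the prediction for the first observation in X_test. It starts from the base value 5.637 at the bottom, which is the average prediction over all observations. This observation gets the final prediction of 4.139 (at the top) because 5.637-0.04-0.04-0.09+0.09+0.11-0.13-0.27-0.3-0.34-0.5 ≈ 4.139 (note the small rounding error). The value next to each variable name is that variable's value; for example, total sulfur dioxide is 9.0 for the first observation.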
shap.plots.waterfall(shap_values[0]) # For the first observation

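Figure (2.1): The waterfall plot for the first observation
Below I show the prediction for the second observation in X_test. The final prediction is 5.582 because 5.637+0.01+0.03-0.03+0.04+0.1+0.12-0.12-0.15+0.24-0.29 ≈ 5.582, again up to a small rounding error.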
shap.plots.waterfall(shap_values[1]) # For the second observation

Figure (2.1): The waterfall plot for the second observation
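The additive bookkeeping behind both waterfall plots can be checked with a few lines of Python. The sketch below simply re-adds the rounded contributions quoted in the text; because those are rounded to two decimals, the sums differ slightly from the plotted f(x). The helper name reconstruct is only illustrative.

```python
base_value = 5.637

# Rounded SHAP contributions as quoted in the text, in the order listed there.
first_obs = [-0.04, -0.04, -0.09, 0.09, 0.11, -0.13, -0.27, -0.30, -0.34, -0.50]
second_obs = [0.01, 0.03, -0.03, 0.04, 0.10, 0.12, -0.12, -0.15, 0.24, -0.29]

def reconstruct(base, contributions):
    """A waterfall plot ends at base value + sum of all SHAP contributions."""
    return base + sum(contributions)

for name, contribs, reported in [("first", first_obs, 4.139),
                                 ("second", second_obs, 5.582)]:
    total = reconstruct(base_value, contribs)
    print(f"{name}: reconstructed {total:.3f}, reported {reported}, "
          f"rounding gap {abs(total - reported):.3f}")
```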
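If you would like to create waterfall plots yourself, see "The Waterfall Plots for the SHAP Values of All Models" [3].
2.2 The Bar Plot for an Individual Case
Compared with the waterfall plot, the bar plot is centered at zero and shows the contribution of each variable. See Figure (2.2).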
shap.plots.bar(shap_values[0]) # For the first observation

Figure (2.2): The bar plot for the first observation
shap.plots.bar(shap_values[1]) # For the second observation

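Figure (2.2): The bar plot for the second observation
2.3 The Force Plot for an Individual Case
I explained the force plot in my previous article "Explain Your Model with the SHAP Values" [4]. The force plot below starts from the base value 5.647. The blue forces push the prediction to the left and the red forces push it to the right. The final prediction for the first observation is 5.25.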
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)
shap.initjs()

def p(j):
    return(shap.force_plot(explainer.expected_value, shap_values[j,:], X_test.iloc[j,:]))

p(0)

Figure (2.3): The force plot for the first observation
p(1)

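Figure (2.3): The force plot for the second observation
2.4 The Decision Plot for an Individual Case
When there are many predictors, a force plot can get too busy to present them well. A decision plot is a good alternative. The numbers in parentheses are the values of the predictors. The bottom of the decision plot is the base value. The line moves left or right according to the force of each predictor.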
expected_value = explainer.expected_value
print("The expected value is ", expected_value)
shap_values = explainer.shap_values(X_test)[0]
# Pass the matching row so the feature values shown belong to this observation
shap.decision_plot(expected_value, shap_values, X_test.iloc[[0]])

Figure (2.4): The decision plot for the first observation
shap_values = explainer.shap_values(X_test)[1]
# Pass the matching row so the feature values shown belong to this observation
shap.decision_plot(expected_value, shap_values, X_test.iloc[[1]])

Figure (2.4): The decision plot for the second observation
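A decision plot is essentially a running total: starting from the base value at the bottom, each predictor's SHAP value shifts the line left or right until it reaches the prediction at the top. The following sketch traces such a path with itertools.accumulate. It reuses the base value 5.637 and the rounded first-observation contributions from Section 2.1 purely as illustrative numbers; the order in which the real plot stacks the predictors is chosen by SHAP and is not reproduced here.

```python
from itertools import accumulate

base_value = 5.637
contributions = [-0.04, -0.04, -0.09, 0.09, 0.11, -0.13, -0.27, -0.30, -0.34, -0.50]

# accumulate(..., initial=base_value) yields the base value followed by each running total
path = list(accumulate(contributions, initial=base_value))

for step, (previous, current) in enumerate(zip(path, path[1:]), start=1):
    direction = "right" if current > previous else "left"
    print(f"step {step:2d}: {previous:.3f} -> {current:.3f} ({direction})")
```

The last element of path is the value the line reaches at the top of the plot, which equals the base value plus the sum of all contributions.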
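3x Binary Target
A model with a binary target can be built easily by specifying reg:logistic in xgb.XGBRegressor(). However, for such a model the waterfall plot shows log odds rather than predicted probabilities. There has been a long discussion in this GitHub issue [5] about the need to present waterfall plots in predicted probabilities, and Mr. Lundberg has made further modifications since then. In this section, I want to walk through the waterfall plot in log odds and how it can be presented in predicted probabilities. I will first show you the force plot so that you can compare it with the waterfall plot.
3.1 The Force Plot
I create a binary target variable and specify reg:logistic to build the XGBoost model. The force plot code is the same as in Section (2.3), except that link='logit' displays the output as a probability.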
from sklearn.model_selection import train_test_split
import xgboost as xgb

features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
Y = np.where(df['quality']>5,1,0)
X = df[features]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1234)
xgb_binary_model = xgb.XGBRegressor(objective='reg:logistic',random_state=42)
xgb_binary_model.fit(X_train, Y_train)

shap.initjs()

def p(j):
    explainer = shap.TreeExplainer(xgb_binary_model)
    xgb_binary_shap_values = explainer.shap_values(X_train)
    return(shap.force_plot(explainer.expected_value, xgb_binary_shap_values[j,:], X_train.iloc[j,:], link='logit'))

p(0)

3.2 The Waterfall Plot for XGBoost
import shap
explainer = shap.Explainer(xgb_binary_model)
xgb_binary_shap_values = explainer(X_train)
shap.plots.waterfall(xgb_binary_shap_values[0])
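We expect the output of this binary model to be a probability between 0 and 1. However, the final prediction in the waterfall plot below is f(x) = 4.894, far greater than 1. Why? By default, SHAP explains XGBoost classifier models in terms of their margin output, before the logistic link function, as stated in its documentation [6]. The units on the x-axis of the waterfall plot are therefore log-odds units, not probabilities.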

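You can convert log odds to a probability in [0,1] with the logistic sigmoid function expit(x) = 1/(1+exp(-x)), which is the inverse of the logit function. In other words, 1/(1+exp(-4.894)) ≈ 0.992, which is the predicted probability in the force plot of Section (3.1) above. I joined the discussion in this GitHub issue [7] and shared the following utility function, which performs the conversion from log odds to probabilities.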
def xgb_shap_transform_scale(original_shap_values, Y_pred, which):
    # expit is the logistic sigmoid (the inverse of logit), used to transform the base value
    from scipy.special import expit
    untransformed_base_value = original_shap_values.base_values[-1]
    # Computing the original_explanation_distance to construct the distance_coefficient later on
    original_explanation_distance = np.sum(original_shap_values.values, axis=1)[which]
    base_value = expit(untransformed_base_value) # = 1 / (1 + np.exp(-untransformed_base_value))
    # Computing the distance between the model_prediction and the transformed base_value
    distance_to_explain = Y_pred[which] - base_value
    # The distance_coefficient is the ratio between both distances which will be used later on
    distance_coefficient = original_explanation_distance / distance_to_explain
    # Transforming the original shapley values to the new scale
    shap_values_transformed = original_shap_values / distance_coefficient
    # Finally resetting the base_value to its probability-scale value
    shap_values_transformed.base_values = base_value
    shap_values_transformed.data = original_shap_values.data
    # Now returning the transformed array
    return shap_values_transformed
So let us apply the transformation and then plot the waterfall for one observation:
obs = 0
Y_pred = xgb_binary_model.predict(X_train)
print("The prediction is ", Y_pred[obs])
shap_values_transformed = xgb_shap_transform_scale(xgb_binary_shap_values, Y_pred, obs)
shap.plots.waterfall(shap_values_transformed[obs])
The waterfall plot is now displayed in predicted probabilities. For more details, see the Jupyter notebook available on GitHub (https://github.com/dataman-git/codes_for_articles/blob/master/The SHAP values with More Charts for article.ipynb).

The SHAP waterfall plot
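The rescaling idea behind the utility function can be illustrated without SHAP or XGBoost. The sketch below is a simplified stand-in that assumes made-up log-odds contributions: it converts the base value and the final margin to probabilities with the logistic sigmoid, then scales every contribution by the same factor so that they add up to the probability gap. The function name rescale_to_probability and all numbers except the margin 4.894 are hypothetical, and this proportional rescaling is a presentation device rather than an exact decomposition on the probability scale.

```python
import math

def expit(x):
    """Logistic sigmoid: maps log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def rescale_to_probability(base_log_odds, contributions):
    """Proportionally rescale log-odds contributions onto the probability scale."""
    total = sum(contributions)
    margin = base_log_odds + total            # f(x) in log-odds units
    base_prob = expit(base_log_odds)
    pred_prob = expit(margin)
    if total == 0:
        return base_prob, [0.0 for _ in contributions], pred_prob
    factor = (pred_prob - base_prob) / total  # the same factor for every feature
    return base_prob, [c * factor for c in contributions], pred_prob

# The margin quoted in the text maps to a probability close to 1.
print(f"expit(4.894) = {expit(4.894):.4f}")

# Hypothetical base value and contributions in log-odds units (they sum to 4.894).
base_prob, scaled, pred_prob = rescale_to_probability(0.2, [2.0, 1.5, -0.5, 1.694])
print(f"base probability      = {base_prob:.4f}")
print(f"scaled contributions  = {[round(s, 4) for s in scaled]}")
print(f"base + contributions  = {base_prob + sum(scaled):.4f}")
print(f"predicted probability = {pred_prob:.4f}")
```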
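In my post "The Waterfall Plots for the SHAP Values of All Models" [8], I open-sourced waterfall plot code that does not require the above transformation. It lets you draw static or interactive waterfall plots.
4x How to Customize SHAP Plots
As mentioned earlier, many SHAP plots can be customized with Matplotlib. Remember to turn off the automatic display of the SHAP function by passing show=False. Below I show an example in which the legend covers part of the chart, so we want to move it to a better position. The example splits the population into three cohorts; see Section (1.2) on how to draw a cohort plot.
4.1 Legend, Font Size, etc.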
explainer = shap.Explainer(xgb_model)
shap_values = explainer(X_test)
shap.plots.bar(shap_values.cohorts(3).abs.mean(0))

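Figure (3.1.1)
The legend position can be specified with the keyword argument bbox_to_anchor, which gives a great degree of control for manual legend placement. See Matplotlib's legend guide [9].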
shap.plots.bar(shap_values.cohorts(3).abs.mean(0), show=False)
fig = plt.gcf() # gcf means "get current figure"
fig.set_figheight(11)
fig.set_figwidth(9)
#plt.rcParams['font.size'] = '12'
ax = plt.gca() # gca means "get current axes"
leg = ax.legend(bbox_to_anchor=(0., 1.02, 1., .102))
for l in leg.get_texts():
    l.set_text(l.get_text().replace('Class', 'Klasse'))
plt.show()

Figure (3.1.2)
4.2 Show SHAP Plots in Subplots
You may want to present multiple SHAP plots aligned horizontally or vertically. This is easily done with Matplotlib's subplot function [10].
fig = plt.figure(figsize=(10,5))

ax1 = fig.add_subplot(121)
shap_values = explainer.shap_values(X_test)[0]
# Pass the matching row so the feature values shown belong to this observation
shap.decision_plot(expected_value, shap_values, X_test.iloc[[0]], show=False)
ax1.title.set_text('The First Observation')

ax2 = fig.add_subplot(122)
shap_values = explainer.shap_values(X_test)[1]
shap.decision_plot(expected_value, shap_values, X_test.iloc[[1]], show=False)
ax2.title.set_text('The Second Observation')

plt.tight_layout()
plt.show()

Figure (3.2)
5x The SHAP Plots for a Multiclass Model
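You may have built a multiclass model that assigns instances to several classes. How can SHAP help present a multiclass model? Below I create a new target variable Multiclass with three classes: Best (2), Premium (1), and Value (0). The model is built by specifying the multi:softprob objective of XGBClassifier.
# Multiclass: 2 = 'Best', 1 = 'Premium', 0 = 'Value'
# Integer codes keep the class indices in line with the labels used below;
# recent XGBoost versions also expect integer class labels starting at 0.
df['Multiclass'] = np.where(df['quality']>6, 2,
                   np.where(df['quality']>5, 1, 0))
Y = df['Multiclass']
X = df[features]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1234)
xgb_model = xgb.XGBClassifier(objective="multi:softprob", random_state=42)
xgb_model.fit(X_train, Y_train)
The output of a multiclass model is a matrix of class probabilities. We have three classes, so the output holds three probabilities that sum to 1.0. In scikit-learn, the function .predict_proba() returns the probabilities, shown in the columns 0 - Value, 1 - Premium, and 2 - Best. The function .predict() returns the predicted class, shown in the Pred column below.
multiclass_actual_pred = pd.DataFrame(xgb_model.predict_proba(X_test)).round(2)
multiclass_actual_pred['Actual'] = Y_test.values
multiclass_actual_pred['Pred'] = xgb_model.predict(X_test)
# Probability columns follow the class order 0, 1, 2; 'Actual' was added before 'Pred'
multiclass_actual_pred.columns = ['0 - Value','1 - Premium','2 - Best','Actual','Pred']
multiclass_actual_pred.head()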
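For intuition, the probability matrix can be mimicked in plain Python. The sketch below assumes hypothetical raw class scores for two wines and applies a softmax, which is how a multi:softprob objective turns per-class scores into probabilities. Each row sums to 1.0, and the predicted class is the index of the largest probability.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predicted_class(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical raw scores for two wines; columns are classes 0, 1, 2.
raw_scores = [
    [2.0, 0.5, -1.0],
    [-0.5, 0.3, 1.2],
]

for scores in raw_scores:
    probs = softmax(scores)
    print([round(p, 2) for p in probs], "-> predicted class", predicted_class(probs))
```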

We can display the confusion matrix like this:
pd.crosstab(multiclass_actual_pred['Actual'],multiclass_actual_pred['Pred'])
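What pd.crosstab computes here is a count of (actual, predicted) pairs. A minimal sketch with collections.Counter on invented labels shows the same tabulation; the two label lists are hypothetical and are not taken from the wine data.

```python
from collections import Counter

# Hypothetical actual and predicted class labels for eight wines.
actual    = [0, 0, 1, 1, 1, 2, 2, 0]
predicted = [0, 1, 1, 1, 0, 2, 1, 0]

counts = Counter(zip(actual, predicted))
classes = sorted(set(actual) | set(predicted))

# Rows are actual classes, columns are predicted classes.
print("actual\\pred", *classes)
for a in classes:
    print(f"{a:11d}", *[counts[(a, p)] for p in classes])

# The diagonal holds the correctly classified cases.
accuracy = sum(counts[(c, c)] for c in classes) / len(actual)
print(f"accuracy = {accuracy:.3f}")
```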

You can use the summary plot to show variable importance by class. Below are two ways to display the results.
import shap
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test,approximate=True)
plt.title('The Summary Plot for the Multiclass Model'+'\n'+'Class 2 - Best, Class 1 - Premium, Class 0 - Value')
shap.summary_plot(shap_values, X_test, plot_type="bar")

Figure (4.1.1)
np.shape(shap_values) # Three classes, 320 observations, 11 variables
fig = plt.figure(figsize=(20,10))

ax0 = fig.add_subplot(131)
ax0.title.set_text('Class 2 - Best ')
shap.summary_plot(shap_values[2], X_test, plot_type="bar", show=False)
ax0.set_xlabel(r'SHAP values', fontsize=11)
plt.subplots_adjust(wspace = 5)

ax1 = fig.add_subplot(132)
ax1.title.set_text('Class 1 - Premium')
shap.summary_plot(shap_values[1], X_test, plot_type="bar", show=False)
plt.subplots_adjust(wspace = 5)
ax1.set_xlabel(r'SHAP values', fontsize=11)

ax2 = fig.add_subplot(133)
ax2.title.set_text('Class 0 - Value')
shap.summary_plot(shap_values[0], X_test, plot_type="bar", show=False)
ax2.set_xlabel(r'SHAP values', fontsize=11)

# plt.tight_layout(pad=3) # You can also use plt.tight_layout() instead of using plt.subplots_adjust() to add space between plots
plt.show()

Figure (4.1.2)
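The code above treats shap_values as a list with one matrix per class. The sketch below mirrors that layout with tiny invented matrices (3 classes, 2 observations, 3 features) and computes the mean absolute SHAP value per feature for each class, which is the quantity a bar-type summary plot displays. All numbers and the helper name are illustrative assumptions.

```python
feature_names = ["alcohol", "sulphates", "pH"]

# shap_by_class[k][i][j]: SHAP value of feature j for observation i and class k (made-up numbers).
shap_by_class = [
    [[-0.40, -0.10, 0.02], [-0.20, -0.30, 0.04]],   # class 0
    [[0.10, 0.05, -0.01], [-0.10, 0.15, -0.03]],    # class 1
    [[0.30, 0.05, -0.01], [0.30, 0.15, -0.01]],     # class 2
]

def mean_abs_by_feature(matrix):
    """Average the absolute SHAP values over observations, feature by feature."""
    n_obs = len(matrix)
    return [sum(abs(row[j]) for row in matrix) / n_obs
            for j in range(len(matrix[0]))]

# Layout check: (classes, observations, features)
layout = (len(shap_by_class), len(shap_by_class[0]), len(shap_by_class[0][0]))
print("layout:", layout)

importance = [mean_abs_by_feature(m) for m in shap_by_class]
for k, values in enumerate(importance):
    ranked = sorted(zip(feature_names, values), key=lambda p: p[1], reverse=True)
    print(f"class {k}:", ", ".join(f"{name}={v:.3f}" for name, v in ranked))
```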
Conclusion
Thank you for reading. I hope this gives you a better understanding of the topic.
Original article: The SHAP with More Elegant Charts
Link: https://medium.com/dataman-in-ai/the-shap-with-more-elegant-charts-bc3e73fa1c0c
Author: Dr. Dataman
References
[1] SHAP API: https://shap.readthedocs.io/en/latest/index.html
[2] Explain Your Model with the SHAP Values: https://mp.weixin.qq.com/s?__biz=MzU1NTg2ODQ5Nw==&mid=2247486984&idx=1&sn=0247ed21a4d864b93c3722d31d51acd4&chksm=fbcc8636ccbb0f2021408c37f7db2f0a5a53f540048b9301f12f2e72e8e450446fe596d934e6&token=1200126831&lang=zh_CN#rd
[3] The Waterfall Plots for the SHAP Values of All Models: https://medium.com/dataman-in-ai/the-waterfall-plots-for-the-shap-values-of-all-models-245afc0aa8ec
[4] Explain Your Model with the SHAP Values: https://mp.weixin.qq.com/s?__biz=MzU1NTg2ODQ5Nw==&mid=2247486984&idx=1&sn=0247ed21a4d864b93c3722d31d51acd4&chksm=fbcc8636ccbb0f2021408c37f7db2f0a5a53f540048b9301f12f2e72e8e450446fe596d934e6&token=1200126831&lang=zh_CN#rd
[5] This GitHub issue: https://github.com/slundberg/shap/issues/29
[6] SHAP documentation (waterfall plot): https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/plots/waterfall.html
[7] This GitHub issue: https://github.com/slundberg/shap/issues/29
[8] The Waterfall Plots for the SHAP Values of All Models: https://medium.com/dataman-in-ai/the-waterfall-plots-for-the-shap-values-of-all-models-245afc0aa8ec
[9] Matplotlib legend guide: https://matplotlib.org/stable/tutorials/intermediate/legend_guide.html
[10] Matplotlib's subplot function: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html