Are ML models interpretable?

Gowtam Singulur
7 min read · May 29, 2021

Interpretability is an essential aspect of any ML model because it answers the question, “Why is a particular model behaving this specific way?”

So, why is knowing how a model came to a particular decision vital to us? For example, let’s look at a cancer detection problem for which we have developed a state-of-the-art binary classification model, a neural network with 0.98 test AUC. (We use AUC because cancer detection problems tend to have highly imbalanced data.) The only caveat is that the model is highly complex, which lowers its interpretability. Most medical professionals need more information than just the model’s prediction to confirm the result and convey it to the patient. Hence, interpretability can increase trust in the model.

What does interpreting a model mean?

To be brief, we want to understand what affects a model’s prediction.
For example, let’s take a simple linear regression (LR) model. If there are n features x_1, x_2, x_3,…, x_n, the output of the model is

f(x_1, x_2, x_3,…, x_n) = ɸ_1*x_1+ɸ_2*x_2+ɸ_3*x_3+…+ɸ_n*x_n+b

In the LR model above, we assign a coefficient ɸ_i to each feature x_i and calculate f(X) to get the output. In this case, it’s trivial to figure out feature importances: if the coefficient ɸ_i of a feature x_i has a large absolute value, that feature significantly impacts the final output (provided the features are on comparable scales). Since we are using a linear combination of inputs to calculate the result, we can only uncover linear relationships.
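As a small illustration of the formula above, here is a minimal sketch. The coefficients in phi, the intercept b, and the sample x are made up for the example and are not fitted to any data. It shows that bumping a feature by one unit shifts the output by exactly its coefficient, which is why the absolute coefficient works as an importance score.

```python
# Hypothetical coefficients for a 3-feature linear model (not fitted to real data).
phi = [2.0, -0.5, 0.1]
b = 1.0

def predict(x):
    """Linear model: f(x) = sum(phi_i * x_i) + b."""
    return sum(p * xi for p, xi in zip(phi, x)) + b

def rank_features_by_abs_coefficient(coefs):
    """Return feature indices sorted by |phi_i|, largest first."""
    return sorted(range(len(coefs)), key=lambda i: abs(coefs[i]), reverse=True)

x = [1.0, 2.0, 3.0]  # hypothetical sample
baseline = predict(x)

# Increasing feature i by one unit changes the output by phi_i.
effects = []
for i in range(len(x)):
    bumped = list(x)
    bumped[i] += 1.0
    effects.append(predict(bumped) - baseline)

ranking = rank_features_by_abs_coefficient(phi)
print("Effect of a one-unit increase per feature:", effects)
print("Features ranked by |coefficient|:", ranking)
```

Note that the ranking only makes sense when the features share a comparable scale, which is one reason to standardize the data before reading coefficients this way.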

For instance, suppose that in our cancer prediction example we are predicting how likely a given person is to develop cancer on a scale of 0 to 100 (100 being most likely). If a person is between 35 and 50 years old and has a smoking habit, they are likely to develop cancer. Since this is a non-linear relationship (the effect depends on an age range combined with smoking), LR can’t capture it. We need a more complex model to do this.

For this blog, let’s use this Kaggle problem. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. Ten real-valued features were computed for each cell nucleus present in the image. The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were then computed for each image, resulting in 30 features.

To get an idea of the data that we are going to use in this blog, you can visualize each feature using the code below.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the density and box plots below

df = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")
df.head()
df.describe()

def analysis_on_Numerical_features(data, Numerical_features):
    # Print summary statistics and draw a density plot and a box plot for each feature.
    for feature in Numerical_features:
        print("Feature name: "+feature)
        print("Max value: "+str(data[feature].max()))
        print("Min value: "+str(data[feature].min()))
        print("Mean value: "+str(data[feature].mean()))
        print("Median value: "+str(data[feature].median()))
        print("Mode value: "+str(data[feature].mode()))
        print("Number of null values: "+str(data[feature].isna().sum()))
        print("##"*10 + " Density and box plots "+ "##"*10 )
        # histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
        sns.histplot(data[feature], kde=True)
        plt.show()
        sns.boxplot(x='diagnosis', y=feature, data = data)
        plt.show()
        print("*"*150)

num_cols = list(df.columns)
#removing "diagnosis" column as it's a categorical feature
num_cols.remove('diagnosis')
analysis_on_Numerical_features(df,num_cols )

Interpreting generic ML classification models

So, given this data, we want to predict whether the cancer is malignant (M) or benign (B) and to know how each feature impacts the model’s prediction. Let’s train a Decision Tree Classifier (DTC) and a Logistic Regression (LOR) model and see how each one needs its own technique to get feature importances.

Please go through the generalized code below to experiment with classic ML models. You can add more models by adding their objects to the models dictionary.

import seaborn as sns
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression as lr
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from scipy.stats import uniform

# Separate the target from the features.
# Assumption: besides "diagnosis", the CSV carries an "id" column and an empty
# "Unnamed: 32" column that are not features; errors='ignore' skips them if absent.
y = df['diagnosis']
X = df.drop(columns=['id', 'diagnosis', 'Unnamed: 32'], errors='ignore')
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

# Standardize the data: fit the scaler on the training split only and reuse it on the test split.
# Wrapping the result in a DataFrame keeps the column names for the later plots.
scaler = StandardScaler()
x_train = pd.DataFrame(scaler.fit_transform(x_train), columns=x_train.columns, index=x_train.index)
x_test = pd.DataFrame(scaler.transform(x_test), columns=x_test.columns, index=x_test.index)

models = {
    'lr': lr(),
    'svc': SVC(),
    'dtc': DecisionTreeClassifier(),
    'rfc': RandomForestClassifier(),
    'xgbc': XGBClassifier(objective = 'binary:logistic', verbosity = 0)
}

def train_and_test_model(model_key, distributions, model_name):
    # Tune the model with a randomized search, then report test metrics.
    print("Model training", model_name)
    model = models[model_key]
    clf = RandomizedSearchCV(model, distributions, random_state=0, n_iter = 100, verbose=0)
    clf.fit(x_train, y_train)
    print(clf.best_estimator_)
    final_model = clf.best_estimator_
    final_model.fit(x_train, y_train)
    y_pred = final_model.predict(x_test)
    print("Accuracy is",accuracy_score(y_test,y_pred))
    print("F1 score is", f1_score(y_test, y_pred, pos_label='B'))
    if model_key!='svc':
        # SVC() has no predict_proba by default, so AUC is skipped for it.
        print("AUC score is", roc_auc_score(y_test, final_model.predict_proba(x_test)[:,1]))
    # Pass labels explicitly so the matrix rows/columns match the tick labels.
    confusion_matrix_test = confusion_matrix(y_test, y_pred, labels=['M', 'B'])
    sns.heatmap(confusion_matrix_test,xticklabels=['M','B'], yticklabels=['M','B'],annot=True).set_title("Confusion Matrix")
    plt.show()
    return final_model

distributions = dict(C=uniform(loc=0.00001, scale=10000), class_weight=['balanced', None])
model = train_and_test_model('lr', distributions, "logistic regression")

The code below trains a hyperparameter-tuned DTC and plots its tree graph.

from scipy.stats import randint
from sklearn import tree

# 'auto' is left out of max_features: recent scikit-learn releases no longer accept it for decision trees.
distributions = dict(criterion=['gini', 'entropy'], splitter=['best', 'random'], max_depth = randint(1, 100), min_samples_split= uniform(loc=0, scale=1), max_features=['sqrt', 'log2'], class_weight=['balanced', None])
model = train_and_test_model('dtc', distributions, "Decision Tree Classifier")

fig = plt.figure(figsize=(25,20))
# feature_names assumes x_train is a DataFrame that still has its column names.
# class_names must follow the sorted order of the labels, so 'B' comes before 'M'.
_ = tree.plot_tree(model,
                   feature_names=list(x_train.columns),
                   class_names=['B', 'M'],
                   filled=True)
fig.savefig("decision_tree.png", dpi=fig.dpi)

Tree plot of the Decision tree model trained on this data

For a DTC, the importance of a feature is calculated as the “total reduction of the criterion (Gini impurity/entropy) brought by that feature” while building the tree. So, let’s calculate the Gini impurity (GI) reduction caused by the “radius_worst” feature while splitting the whole data at the root node.

This process is repeated at every node split while building the tree. After creating the tree, we get the cumulative reduction of the criterion caused by each feature; these totals (normalized by scikit-learn so that they sum to one) are the corresponding feature importances.
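To make the single-split calculation concrete, here is a minimal sketch with made-up class counts. The counts are hypothetical and are not the ones from the tree above. It assumes the weighting that scikit-learn documents for impurity decrease: the drop in Gini impurity from parent to children, scaled by the share of all samples that reach the node.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini_reduction(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching the node."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    children = (n_left / n_parent) * gini(left) + (n_right / n_parent) * gini(right)
    return (n_parent / n_total) * (gini(parent) - children)

# Hypothetical root node: 100 samples, 60 benign and 40 malignant.
parent = [60, 40]
left = [55, 5]    # samples at or below the split threshold
right = [5, 35]   # samples above the split threshold

reduction = weighted_gini_reduction(parent, left, right, n_total=100)
print("Gini impurity at the root:", round(gini(parent), 4))
print("Weighted Gini reduction of the split:", round(reduction, 4))
```

Summing this quantity over every node that splits on a given feature gives that feature’s unnormalized importance.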

Let’s train a Logistic Regression (LOR) model as well. The LOR model gives us a hyperplane that tries to separate “malignant” patients from “benign” patients while minimizing binary log-loss.

f(x_1, x_2, x_3,…, x_n) = ɸ_1*x_1+ɸ_2*x_2+ɸ_3*x_3+…+ɸ_n*x_n+b; if f(X) > 0: predict X is “malignant”; if f(X) < 0: predict X is “benign” (f(X) = 0 corresponds to a predicted probability of 0.5)

In the LOR model, we can’t directly use the coefficients associated with each feature as feature importances since they are not linearly related to the predicted probability. We will get the feature importances by calculating the change in the odds of the patient having “malignant” cancer when we modify each input. Please refer to this blog for an explanation of the odds.

distributions = dict(C=uniform(loc=0.00001, scale=10000), class_weight=['balanced', None])
model = train_and_test_model('lr', distributions, "logistic regression")

Odds of patient having malignant cancer = P(y=1)/P(y=0) = P(y=1)/(1-P(y=1))

In logistic regression, we use the sigmoid function to calculate the output probabilities.

Probability of the patient “X” having malignant cancer: P(y=1) = 1/(1+e^(-f(X)))
Odds of the patient “X” having malignant cancer: P(y=1)/(1-P(y=1)) = e^(f(X))

Let’s calculate how the odds will change if we change a particular feature. For example, let’s take feature x_i and add one to it. So, x_i becomes x_i+1, and f(X) becomes f(X)+ɸ_i.

Since the odds equal e^(f(X)), the new odds are e^(f(X)+ɸ_i) = e^ɸ_i * e^(f(X)). So, if we increase x_i by one, the odds are multiplied by e^ɸ_i. We can use this for all the features to calculate their feature importances: the feature importance of a feature x_i while using logistic regression is e^ɸ_i.
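We can check this odds relationship numerically. The sketch below is only an illustration: the coefficients, intercept, and sample are made-up values and are not taken from the trained model. It compares the odds before and after adding one to a single feature.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

# Hypothetical logistic regression parameters and sample (not from the trained model).
phi = [0.8, -1.2, 0.3]
b = -0.5
x = [1.0, 0.5, 2.0]

def linear_score(sample):
    """f(X) = sum(phi_i * x_i) + b."""
    return sum(p * xi for p, xi in zip(phi, sample)) + b

i = 0  # feature to increase by one
x_plus = list(x)
x_plus[i] += 1.0

odds_before = odds(sigmoid(linear_score(x)))
odds_after = odds(sigmoid(linear_score(x_plus)))
ratio = odds_after / odds_before

print("Ratio of odds after/before:", ratio)
print("e^phi_i:", math.exp(phi[i]))
```

The two printed numbers agree up to floating-point error, whatever sample we start from.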

The code below takes the hyperparameter-tuned LOR trained above and plots a bar graph of its feature importances.

import numpy as np

# e^ɸ_i for every feature; model.coef_[0] holds the coefficients of the tuned LOR model.
# This assumes x_train is a DataFrame that still has its column names.
feature_importance = pd.DataFrame(list(x_train.columns), columns = ["feature"])
feature_importance["importance"] = np.exp(model.coef_[0])
feature_importance = feature_importance.sort_values(by = ["importance"], ascending=False)
print(feature_importance.head())
ax = feature_importance.plot.barh(x='feature', y='importance')
plt.show()

By calculating feature importances for the Decision Tree Classifier and Logistic Regression, we learned that each model needs its own technique, and that getting importances from complex (non-linear) models is non-trivial.

So, we need a generalized technique to calculate feature importance. For this, we will discuss the SHAP approach to get feature importances for any model.

What is the SHAP approach, and how do we implement it?

SHAP generates feature importances by comparing what the model predicts with and without the respective feature. However, there is a caveat to this approach: a feature’s contribution depends on which other features are already present, so the order in which features are added affects the values. For example, in a tree-based model, different features are used at various levels to split the data, so adding the features in a different order gives different contributions. To mitigate this issue, SHAP considers every possible combination of features and averages the contributions to produce the final feature importances. Computing this exactly grows exponentially with the number of features, so the SHAP library relies on model-specific algorithms (such as TreeExplainer for tree models) and approximations to keep the run time manageable. It is still worth measuring the run time before using it in production solutions.

We can visualize the SHAP values for our DTC using the code snippet below.
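To see the averaging idea in action, here is a minimal brute-force sketch of exact Shapley values. It is not how the SHAP library computes them. The toy model, the sample x, and the all-zero baseline used for “absent” features are hypothetical choices made for this illustration.

```python
from itertools import permutations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values: average each feature's marginal contribution over all orderings."""
    totals = [0.0] * n_features
    for order in permutations(range(n_features)):
        present = set()
        for i in order:
            before = value_fn(frozenset(present))
            present.add(i)
            totals[i] += value_fn(frozenset(present)) - before
    n_orderings = factorial(n_features)
    return [t / n_orderings for t in totals]

# Hypothetical toy model with an interaction between features 0 and 1.
def model(v):
    return 2.0 * v[0] * v[1] + v[2]

x = [1.0, 2.0, 3.0]          # sample being explained
baseline = [0.0, 0.0, 0.0]   # values used when a feature is "absent" (a simplification)

def value_fn(present):
    """Model output with present features taken from x and absent ones from the baseline."""
    mixed = [x[i] if i in present else baseline[i] for i in range(len(x))]
    return model(mixed)

phi = shapley_values(value_fn, len(x))
print("Shapley values:", phi)
print("Sum of values:", sum(phi), "vs. model(x) - model(baseline):", model(x) - model(baseline))
```

The interaction between features 0 and 1 is shared equally between them, and the values add up to the difference between the prediction for x and the baseline prediction. The loop visits n! orderings, which is why this approach only works for a handful of features.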

import shap
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Refit a DTC with fixed hyperparameters.
clf = DecisionTreeClassifier(criterion='gini',max_depth=5, max_features='sqrt',min_samples_split=0.2184028548119228)
clf = clf.fit(x_train, y_train)
# TreeExplainer is SHAP's explainer for tree-based models.
explainer = shap.TreeExplainer(clf)
# For a classifier, shap_values holds one set of values per class.
shap_values = explainer.shap_values(x_train)
shap.summary_plot(shap_values, x_train)

For Logistic Regression, the only change we need to make is to use LinearExplainer instead of TreeExplainer.
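For a linear model there is a handy closed form to sanity-check the output against, assuming the features are treated as independent: the SHAP value of feature i in log-odds space is ɸ_i*(x_i - mean(x_i)). The sketch below uses made-up coefficients and background data, not the trained model, and does not call the SHAP library.

```python
# Hypothetical logistic regression parameters (log-odds space), not from the trained model.
phi = [0.8, -1.2, 0.3]
b = -0.5

# Hypothetical background data used to estimate the feature means.
background = [
    [1.0, 0.0, 2.0],
    [3.0, 1.0, 0.0],
    [2.0, 2.0, 1.0],
]
x = [4.0, 1.0, 3.0]  # sample being explained

def score(sample):
    """Linear score (log-odds): sum(phi_i * x_i) + b."""
    return sum(p * v for p, v in zip(phi, sample)) + b

n_rows = len(background)
means = [sum(row[i] for row in background) / n_rows for i in range(len(phi))]

# Contribution of each feature relative to the average background sample.
contributions = [phi[i] * (x[i] - means[i]) for i in range(len(phi))]
expected_score = sum(score(row) for row in background) / n_rows

print("Per-feature contributions:", contributions)
print("Sum of contributions:", sum(contributions))
print("score(x) - average background score:", score(x) - expected_score)
```

The contributions add up to the gap between the sample’s score and the average background score, which is the same additivity property the SHAP plots rely on.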

Ipynb created for this blog: https://www.kaggle.com/gowtamsingulur/modeling-with-feature-importance

In the next blog, we will discuss “Are neural networks interpretable?”

Drink coffee and keep on learning
