This article attempts to define interpretability in the context of machine learning. We then highlight why machine learning interpretability matters and explain why it has gained popularity recently. Next, we provide a taxonomy of machine learning interpretability techniques and present a scheme for evaluating interpretability. Finally, we discuss the properties of the explanations humans tend to prefer.

. . .

Definition(s) of interpretability

Although the literature on machine learning interpretability has grown over the past few years, no consensus has yet been reached on its definition. Miller (2017) defines interpretability as the degree to which a human can understand the cause of a decision. Kim et al. (2016), on the other hand, define interpretability as the degree to which a human can consistently predict a model’s output.

In this article, we will adopt the following definition: a model is interpretable if its decisions can be easily understood by humans. In other words, a model M1 is more interpretable than a model M2 if the decisions taken by M1 are easier for humans to understand than those taken by M2.


Importance of interpretability

High performance alone is not enough to trust the decisions of a machine learning model. As Doshi-Velez and Kim (2017) point out, “the problem is that a single metric, such as classification accuracy, is an incomplete description of most real-world tasks.” This is especially true when the model is applied in a critical domain (i.e., a domain where an error committed by the model could lead to severe consequences). If wrong decisions have severe impacts, then interpretability is a must.

Moreover, the need for interpretable machine learning solutions is now reflected in several regulations related to Artificial Intelligence passed by countries around the world. The European Union’s General Data Protection Regulation (GDPR), for example, requires, among other things, the transparency of all algorithmic decisions (Principle of Transparency).

Nonetheless, there are still cases where interpretability is not a requirement: either the model is used in a low-risk environment, where an error will not have serious consequences (e.g., a movie recommender system), or the technique has already been extensively studied and examined (e.g., optical character recognition). According to Doshi-Velez and Kim (2017), the need for interpretability stems from an incompleteness in problem formalization: for certain problems, it is not enough to get the prediction (the what). The model must also explain how it came to the prediction (the why), because a correct prediction only partially solves the problem.

Finally, and according to the same authors, the interpretability of machine learning models makes the evaluation of the following properties easier:

  • Fairness: Ensuring that predictions are unbiased and do not implicitly or explicitly discriminate against protected groups. An interpretable model can tell you why it has decided that a certain person should not get a loan, and it becomes easier for a human to judge whether the decision is based on a learned demographic (e.g. racial) bias.
  • Privacy: Ensuring that sensitive information in the data is protected.
  • Reliability or Robustness: Ensuring that small changes in the input do not lead to large changes in the prediction.
  • Causality: Ensuring that only causal relationships are picked up.
  • Trust: It is easier for humans to trust a system that explains its decisions compared to a black box.
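
As an illustration of the reliability property above, the following minimal sketch measures how far a model’s output moves when its inputs are nudged slightly. Everything in it is an assumption made for illustration: `toy_model` is a hypothetical stand-in for a trained model, the two features and their weights are invented, and random uniform noise is only one of many possible ways to define a “small change”.

```python
import random


def toy_model(features):
    """Hypothetical stand-in for a trained model: returns a score in [0, 1]."""
    income, debt = features
    score = 0.5 + 0.004 * income - 0.006 * debt
    return max(0.0, min(1.0, score))


def max_prediction_shift(model, features, epsilon, n_trials=200, seed=0):
    """Largest observed change in the model output when every feature is
    perturbed by uniform noise in [-epsilon, +epsilon]."""
    rng = random.Random(seed)
    baseline = model(features)
    worst = 0.0
    for _ in range(n_trials):
        perturbed = [x + rng.uniform(-epsilon, epsilon) for x in features]
        worst = max(worst, abs(model(perturbed) - baseline))
    return worst


shift = max_prediction_shift(toy_model, [50.0, 20.0], epsilon=1.0)
print(f"largest output shift under small perturbations: {shift:.4f}")
```

A small shift relative to the size of the perturbation suggests the model is locally stable around that input; a large shift would be a reason to inspect the model more closely.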


Taxonomy of interpretability methods

Interpretability methods and techniques can be grouped into four categories: model-agnostic, model-specific, global interpretability, and local interpretability. Model-agnostic techniques can be applied to any black-box machine learning model, whereas model-specific techniques can only be applied to a single type of model. Global interpretability methods are concerned with the overall behavior of the model, whereas local interpretability methods are concerned with justifying a single prediction of the model.
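
To make the global versus local distinction concrete, here is a minimal, model-agnostic sketch. It assumes a hypothetical black box (`score` and `black_box`, with invented weights) that is only ever called, never inspected. The global view uses permutation importance (the drop in accuracy when one feature column is shuffled); the local view uses a simple occlusion-style attribution (how much the score of one instance changes when a feature is replaced by a baseline value). The latter is a deliberately simple illustration, not LIME or SHAP.

```python
import random


def score(row):
    """Hypothetical black-box score; the third feature is ignored on purpose."""
    x0, x1, x2 = row
    return 2.0 * x0 - 1.0 * x1 + 0.0 * x2


def black_box(row):
    """Hypothetical black-box classifier built on top of the score."""
    return 1 if score(row) > 0 else 0


def accuracy(model, X, y):
    return sum(model(r) == t for r, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, seed=0):
    """Global and model-agnostic: accuracy drop when one column is shuffled."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [r[j] for r in X]
        rng.shuffle(column)
        X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, column)]
        importances.append(base - accuracy(model, X_perm, y))
    return importances


def local_attribution(score_fn, row, baseline):
    """Local and model-agnostic: score change for ONE instance when each
    feature is replaced by its baseline value."""
    full = score_fn(row)
    return [
        full - score_fn(row[:j] + [baseline[j]] + row[j + 1:])
        for j in range(len(row))
    ]


# Synthetic data: labels come from the black box itself, so base accuracy is 1.0.
rng = random.Random(42)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [black_box(r) for r in X]

global_importance = permutation_importance(black_box, X, y)
local_contrib = local_attribution(score, [0.5, 0.2, 0.9], baseline=[0.0, 0.0, 0.0])
print("global importance per feature:", global_importance)
print("local contribution per feature:", local_contrib)
```

Because neither function looks inside the model, both would work unchanged with any other predictor exposing the same call signature, which is what makes them model-agnostic. A model-specific method, by contrast, would read internal quantities such as the weights of a linear model.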

A more detailed overview of the categories of machine learning interpretability is provided in the following taxonomy taken from the paper Explainable AI: A Review of Machine Learning Interpretability Methods:

Figure 1: Taxonomy of interpretability methods


Evaluation of interpretability

No consensus exists on how to evaluate the interpretability of a machine learning model. Nonetheless, researchers have made initial attempts to formulate some approaches for evaluation. Doshi-Velez and Kim (2017) propose three main levels for the evaluation of interpretability:

  • Application level evaluation (real task): The interpretability method is tested by end users. Consider, for instance, a fracture detection system with a machine learning component that locates and marks fractures in X-rays. At the application level, radiologists would test the fracture detection software directly to evaluate the model. A good baseline for this is always how good a human would be at explaining the same decision.
  • Human level evaluation (simple task): This is a simplified application level evaluation. The difference is that the experiments are carried out with laypersons rather than domain experts. This makes experiments cheaper (especially if the domain experts are radiologists) and makes it easier to find more testers. An example would be to show a user different explanations and ask the user to choose the best one.
  • Function level evaluation (proxy task): This level does not require humans. It works best when the class of model used has already been evaluated by someone else in a human level evaluation. For example, it might be known that the end users understand decision trees. In this case, a proxy for explanation quality may be the depth of the tree: shorter trees would get a better explainability score. It would make sense to add the constraint that the predictive performance of the tree remains good and does not decrease too much compared to a larger tree.
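
The function level example above (tree depth as a proxy, subject to a constraint on predictive performance) can be sketched as a small selection rule. The candidate depths, the validation accuracies, and the tolerance `max_accuracy_drop` are all hypothetical values chosen for illustration; in practice they would come from trees actually trained and validated on the task.

```python
def pick_most_interpretable(candidates, max_accuracy_drop=0.02):
    """Return the shallowest tree whose validation accuracy is within
    max_accuracy_drop of the best candidate.

    candidates: list of (depth, validation_accuracy) pairs.
    """
    best_accuracy = max(acc for _, acc in candidates)
    acceptable = [
        (depth, acc)
        for depth, acc in candidates
        if best_accuracy - acc <= max_accuracy_drop
    ]
    # Depth is the interpretability proxy: smaller is better.
    return min(acceptable, key=lambda c: c[0])


# Hypothetical (depth, validation accuracy) pairs for trees of growing size.
candidates = [(2, 0.81), (3, 0.87), (5, 0.875), (8, 0.88), (12, 0.88)]
chosen = pick_most_interpretable(candidates)
print("chosen (depth, accuracy):", chosen)
```

With these made-up numbers, the depth-2 tree is rejected because it loses too much accuracy, and the rule settles on the shallowest tree that stays within the tolerance of the best one.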


Human-friendly explanations

Psychology and cognitive science can be of great help when it comes to discovering what humans consider to constitute a “good explanation”. Indeed, Miller (2017) has carried out a huge survey of publications on explanations, and this subsection builds on his summary.


Good explanations are:

  • Contrastive: The importance of contrastive explanations is one of the most important findings for machine learning interpretability. Instead of asking why this prediction was made, we tend to ask why this prediction was made instead of another prediction. Humans tend to ask the question: what would the prediction be if input X had been different? If a client’s loan application is rejected, they would not care about the reasons that generally lead to a rejection. They would be more interested in the factors in their application that would need to change to get the loan, that is, in the contrast between their application and the would-be-accepted application (Lipton, 1990).
  • Selected: Explanations do not have to cover the whole list of causes of an event. They only need to select one or two causes from a variety of possible causes as an explanation.
  • Social: An explanation is part of an interaction between the explainer and the receiver of the explanation (explainee). The social context should determine the content and nature of the explanations. Psychologists and sociologists can help come up with explanations that fit with the audience targeted by the application.
  • Focus on the abnormal: Humans focus more on abnormal causes to explain events (Kahneman and Tversky, 1981). These are causes that have a small probability of occurring. Humans consider these kinds of “abnormal” causes to be good explanations.
  • Consistent with prior beliefs of the explainee: Humans tend to ignore information that is inconsistent with their prior beliefs (confirmation bias, Nickerson 1998). Explanations are not exempt from this kind of bias. Humans tend to devalue or ignore explanations that do not agree with their prior beliefs.
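
The contrastive loan example in the list above can be sketched as a search for the smallest single-feature change that turns a rejection into an approval. The decision rule `approve`, its weights and threshold, the applicant’s values, and the step sizes are all hypothetical; a real system would query the trained model instead, and dedicated counterfactual methods handle several features at once.

```python
def approve(applicant):
    """Hypothetical decision rule standing in for a trained loan model."""
    points = (
        applicant["income"]
        - 2 * applicant["debt"]
        + 4 * applicant["years_employed"]
    )
    return points >= 60


def single_feature_counterfactuals(model, applicant, steps):
    """For each feature, find the closest value (moving in fixed steps) that
    flips the decision to an approval, if one exists within max_steps.

    steps: {feature: (step_size, max_steps)}
    """
    results = {}
    for feature, (step, max_steps) in steps.items():
        for k in range(1, max_steps + 1):
            candidate = dict(applicant)
            candidate[feature] = applicant[feature] + k * step
            if model(candidate):
                results[feature] = candidate[feature]
                break
    return results


applicant = {"income": 40, "debt": 10, "years_employed": 2}
steps = {"income": (1, 50), "debt": (-1, 10), "years_employed": (1, 10)}

counterfactuals = single_feature_counterfactuals(approve, applicant, steps)
for feature, value in counterfactuals.items():
    print(f"approved if {feature} were {value} instead of {applicant[feature]}")
```

Features missing from the result (here, reducing debt alone) cannot flip the decision within the allowed range, which is itself useful information for the applicant. Reporting only the one or two closest changes also matches the “selected” property described above.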


This article broadly introduced the concept of machine learning interpretability. We saw its definitions, importance, taxonomy, and evaluation methods. In the next article, we will attempt to do the same with fuzzy logic. And in the third article, we will see how fuzzy logic can benefit machine learning interpretability.



Christoph Molnar, Interpretable Machine Learning

Miller, Tim. “Explanation in artificial intelligence: Insights from the social sciences.” arXiv Preprint arXiv:1706.07269. (2017)

Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. “Examples are not enough, learn to criticize! Criticism for interpretability.” Advances in Neural Information Processing Systems (2016)

Doshi-Velez, Finale, and Been Kim. “Towards a rigorous science of interpretable machine learning.” arXiv Preprint arXiv:1702.08608. (2017)

General Data Protection Regulation (GDPR), Article 12: “Transparent information, communication and modalities for the exercise of the rights of the data subject”

A. B. Arrieta et al., “Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI,” Information Fusion, vol. 58, pp. 82–115, Oct. 2019

D. V. Carvalho, E. M. Pereira, and J. S. Cardoso, “Machine Learning Interpretability: A Survey on Methods and Metrics,” Electronics, vol. 8, no. 8, p. 832, Jul. 2019, doi: 10.3390/ELECTRONICS8080832

Lipton, Peter. “Contrastive explanation.” Royal Institute of Philosophy Supplements 27 (1990): 247–266

Nickerson, Raymond S. “Confirmation Bias: A ubiquitous phenomenon in many guises.” Review of General Psychology 2 (2). Educational Publishing Foundation: 175. (1998)
