For the past decade, the medical and computer science communities have worked together to develop efficient computer-aided decision tools that draw on technological advances in both fields to improve research and diagnosis. The release of large public datasets has allowed the development and comparison of AI and deep learning tools that can outperform the human eye and assist medical researchers. We have reached a point where “machine” performance powered by deep learning appears to match, and in some cases surpass, human performance.
Despite this performance, integrating AI-driven tools into routine use requires being able to explain how the algorithm learned, what it learned, and how it makes a specific decision. This is called interpretability. Interpretability is critical for medical decisions: it should assist practitioners efficiently by giving relevant and consistent medical expertise, and provide new insights on medical images.
We propose here a new method to improve the interpretability of AI and deep learning tools on Whole Slide Images (WSI), which are commonly used in digital pathology from translational research to diagnosis.
WSI Classification model and Interpretability heatmaps
We based our study on the Camelyon-16 dataset, which consists of 345 hematoxylin and eosin (H&E) stained WSI of lymph node sections. Each image in the dataset is labeled either “Normal” or, when it contains metastases and cancerous lesions, “Tumor” (Fig. 1).
The most common approach to training AI tools is based on cell-level annotations performed by medical experts. Such a process is highly dependent on human expertise and is time consuming. To decrease this dependency, workflows that train using only global slide-level labels have been proposed [3,4]. These deep learning workflows have been developed to mimic the pathologist's expertise. The WSI is divided into patches called tiles. A deep learning feature-extractor module computes, for each tile, a vector that describes its morphological content, called a tile descriptor. A tile scoring module then turns each tile descriptor into a single score per tile. The set of tile scores is aggregated into a vector that describes the slide, the slide descriptor, which a decision module uses to predict the class of the slide. In addition to requiring very little supervision to be trained, these methods have the great advantage of being interpretable by design. Indeed, tile scores can be used to compute an explanation heat-map over the slide that highlights the regions critical to the decision (see Fig. 2).
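The workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the architecture used in the study: tile descriptors are given directly as lists (standing in for the output of a feature extractor), the tile scoring module is assumed to be linear, the aggregation keeps the highest and lowest tile scores, and the decision module is assumed to be logistic. All function names, weights and values are hypothetical.

```python
import math

def score_tiles(descriptors, weights, bias=0.0):
    """Tile scoring module (assumed linear): one score per tile descriptor."""
    return [sum(w * x for w, x in zip(weights, d)) + bias for d in descriptors]

def slide_descriptor(tile_scores, k=2):
    """Aggregate tile scores into a slide descriptor: k highest and k lowest scores."""
    ranked = sorted(tile_scores, reverse=True)
    return ranked[:k] + ranked[-k:]

def predict_tumor_probability(descriptor, weights, bias=0.0):
    """Decision module (assumed logistic): probability of the "Tumor" class."""
    z = sum(w * x for w, x in zip(weights, descriptor)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy slide: 5 tiles, each described by 3 features (hypothetical values).
tile_descriptors = [
    [0.1, 0.0, 0.2],
    [0.9, 0.8, 0.1],
    [0.2, 0.1, 0.0],
    [0.7, 0.9, 0.3],
    [0.0, 0.1, 0.1],
]

tile_scores = score_tiles(tile_descriptors, weights=[1.0, 1.0, -0.5])
descriptor = slide_descriptor(tile_scores, k=2)
probability = predict_tumor_probability(
    descriptor, weights=[1.0, 1.0, 0.5, 0.5], bias=-2.0
)

print("tile scores:", tile_scores)
print("slide descriptor:", descriptor)
print("tumor probability:", round(probability, 3))
```

In this sketch, `tile_scores` is also what a score-based heat-map would display: one value per tile, placed back at the tile's position on the slide.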
Although these heat-maps have proven to be very effective, they rely on a single score (the tile score) computed during the prediction process, whereas tissues are complex, organized structures. This points to a possible limitation in the interpretability of these trained algorithms.
Our solution: Improving interpretability using descriptor features
We propose using an attribution method to identify the set of features in tile descriptors that the trained model uses most for a given diagnosis (e.g. here the “tumor” class).
To give the most meaningful explanations and try to understand what these features are, we rely on feature visualizations. We propose that medical experts look at two images for each feature: the tile that most strongly activates it, and the max activation image, an image whose pixels were tuned to activate the feature (see Fig. 3).
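As an illustration of what such an attribution step can look like, the sketch below uses gradient-times-input, one common attribution technique, on an assumed linear tile scorer. The choice of technique, the weights and the descriptors are assumptions made for the example and are not taken from the study.

```python
def feature_attributions(descriptors, weights):
    """Gradient-times-input attribution for a linear tile scorer.

    For score = sum_j w_j * x_j, the gradient with respect to x_j is w_j,
    so the attribution of feature j on one tile is w_j * x_j.
    Attributions are averaged over the given tiles.
    """
    n_tiles = len(descriptors)
    return [
        sum(w * d[j] for d in descriptors) / n_tiles
        for j, w in enumerate(weights)
    ]

def top_features(attributions, k):
    """Indices of the k features with the largest attribution."""
    order = sorted(
        range(len(attributions)), key=lambda j: attributions[j], reverse=True
    )
    return order[:k]

# Hypothetical descriptors of tiles taken from "tumor" slides (4 features each).
tumor_tiles = [
    [0.9, 0.1, 0.8, 0.0],
    [0.7, 0.2, 0.9, 0.1],
    [0.8, 0.0, 0.6, 0.2],
]
weights = [1.0, 0.2, 1.5, -1.0]  # assumed weights of a trained linear scorer

attributions = feature_attributions(tumor_tiles, weights)
selected = top_features(attributions, k=2)

print("attributions:", attributions)
print("selected features:", selected)
```

With a deep model, the gradient would come from backpropagation rather than from the weights directly, but the selection step stays the same: rank features by attribution and keep the top ones.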
Fig. 3: Tile level interpretable visualization
For example, in Fig. 3, for “Feature B”, pathologists recognized a striped texture, confirmed by the max activation image, and explained that this feature most probably activates in regions where spindle-shaped cells appear, a specific cell organization that is known to be metastatic.
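The two visualizations can be illustrated with a toy example. The sketch below is only a schematic: the “feature” is assumed to be a simple linear filter on a 4x4 patch, whereas in practice the gradient would be obtained by backpropagation through the feature extractor. The kernel, descriptors and function names are hypothetical.

```python
def most_activating_tile(descriptors, feature_index):
    """Index of the tile whose descriptor has the highest value for one feature."""
    return max(range(len(descriptors)), key=lambda i: descriptors[i][feature_index])

def feature_activation(image, kernel):
    """Toy feature: correlation between an image patch and a fixed kernel."""
    return sum(
        p * k
        for row_p, row_k in zip(image, kernel)
        for p, k in zip(row_p, row_k)
    )

def max_activation_image(kernel, steps=50, learning_rate=0.1):
    """Gradient ascent on the pixels, starting from mid-gray, clipped to [0, 1].

    For this linear feature, the gradient with respect to each pixel is
    simply the matching kernel value.
    """
    image = [[0.5 for _ in row] for row in kernel]
    for _ in range(steps):
        image = [
            [min(1.0, max(0.0, p + learning_rate * k)) for p, k in zip(row_p, row_k)]
            for row_p, row_k in zip(image, kernel)
        ]
    return image

# Hypothetical tile descriptors (2 features each) and a stripe-sensitive kernel.
tile_descriptors = [[0.1, 0.3], [0.2, 0.9], [0.8, 0.4]]
best_tile = most_activating_tile(tile_descriptors, feature_index=1)

striped_kernel = [[1.0, -1.0, 1.0, -1.0] for _ in range(4)]
image = max_activation_image(striped_kernel)

print("most activating tile:", best_tile)
for row in image:
    print(row)
```

The optimized image converges to alternating bright and dark columns, which is the kind of pattern an expert can then compare with the real tile that most activates the same feature.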
We also propose a new way to compute heat-maps, based on the activation of this set of features in each tile (see Fig. 4). We measure the relevance of these explanations using tile-level AUC (also called localization AUC) with regard to the lesion annotations provided with the Camelyon-16 “tumor” slides.
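A feature-based heat-map of this kind can be sketched as follows. The way activations are combined here (the mean of min-max normalized activations over the selected features) is an assumption made for the example, as are the descriptors, tile positions and function names.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def feature_heatmap(descriptors, selected_features):
    """One heat value per tile: mean normalized activation of the selected features."""
    columns = [
        min_max_normalize([d[j] for d in descriptors]) for j in selected_features
    ]
    return [
        sum(col[i] for col in columns) / len(columns)
        for i in range(len(descriptors))
    ]

def to_grid(values, positions, n_rows, n_cols, background=0.0):
    """Place tile values at their (row, col) positions on the slide grid."""
    grid = [[background] * n_cols for _ in range(n_rows)]
    for value, (row, col) in zip(values, positions):
        grid[row][col] = value
    return grid

# Hypothetical slide: 4 tiles on a 2x2 grid, 3 features per descriptor.
tile_descriptors = [
    [0.9, 0.1, 0.8],
    [0.1, 0.5, 0.2],
    [0.5, 0.9, 0.5],
    [0.1, 0.3, 0.2],
]
positions = [(0, 0), (0, 1), (1, 0), (1, 1)]

heat = feature_heatmap(tile_descriptors, selected_features=[0, 2])
grid = to_grid(heat, positions, n_rows=2, n_cols=2)

for row in grid:
    print(row)
```

Unlike a score-based heat-map, the values here come only from the features selected by attribution, so the map can be recomputed for any subset of features an expert wants to inspect.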
For this evaluation, we trained two different WSI classification architectures that reach about the same slide-level classification performance, and we computed the localization AUC using heat-maps generated from tile scores and using our feature-based heat-maps. This measure shows that we improved interpretability by over 29% and up to 75% (see Table 1).
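The localization AUC can be read as the probability that a tile inside an annotated lesion receives a higher heat value than a tile outside it. The sketch below computes it by direct pairwise comparison and then expresses the difference between two heat-maps as a relative gain. Reading the reported percentages as relative gains in localization AUC is an assumption of this example, and the heat values below are made up for illustration; they are not results from the study.

```python
def localization_auc(heat_values, tile_is_tumor):
    """Tile-level AUC: probability that a tumor tile gets a higher heat value
    than a normal tile (ties count for one half)."""
    positives = [h for h, t in zip(heat_values, tile_is_tumor) if t]
    negatives = [h for h, t in zip(heat_values, tile_is_tumor) if not t]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one tumor tile and one normal tile")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline` (0.5 means +50%)."""
    return (improved - baseline) / baseline

# Hypothetical tiles: lesion annotation and two competing heat-maps.
tile_is_tumor = [True, True, False, False, False]
score_heat = [0.9, 0.2, 0.4, 0.1, 0.3]    # heat-map built from tile scores
feature_heat = [0.9, 0.7, 0.4, 0.1, 0.3]  # heat-map built from selected features

auc_scores = localization_auc(score_heat, tile_is_tumor)
auc_features = localization_auc(feature_heat, tile_is_tumor)
gain = relative_improvement(auc_scores, auc_features)

print("AUC (tile scores):", round(auc_scores, 3))
print("AUC (features):", round(auc_features, 3))
print("relative gain:", round(gain, 3))
```

The pairwise comparison is quadratic in the number of tiles, which is fine for a sketch; on full slides a rank-based computation would be the usual choice.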
Fig. 4: Slide level interpretable heat-maps
Table 1: Slide-level classification measure and interpretability measures
Thus, we proposed a method that can be applied to a whole set of WSI classification methods and that improved the interpretability of trained models. It includes the automatic identification of features, tile-level visualizations for feature interpretability, and feature-based heat-maps for slide-level decisions.
For more technical and detailed information about WSI classification pipelines, attribution methods and heat-map computation, see our paper, presented at the workshop on Interpretability of Machine Intelligence in Medical Image Computing at MICCAI 2020 and published in “Interpretable and Annotation-Efficient Learning for Medical Image Computing” (Springer). The preprint is available on arXiv: https://arxiv.org/abs/2009.14001
Antoine Pirovano, Data Scientist at Keen Eye
David Guet, PhD, Digital Pathology Application Specialist at Keen Eye
[1] B. Ehteshami Bejnordi, M. Veta, P. Johannes van Diest, B. van Ginneken, N. Karssemeijer, G. Litjens, J. A. W. M. van der Laak, and the CAMELYON16 Consortium, “Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer,” in Journal of the American Medical Association, vol. 318(22), pp. 2199–2210, Dec. 2017.
[2] B. Lee and K. Paeng, “A Robust and Effective Approach Towards Accurate Metastasis Detection and pN-stage Classification in Breast Cancer,” in Lecture Notes in Computer Science, pp. 841–850, 2018.
[3] P. Courtiol, E. W. Tramel, M. Sanselme, and G. Wainrib, “Classification and Disease Localization in Histopathology Using Only Global Labels: A Weakly-Supervised Approach,” in Computing Research Repository (CoRR), arXiv, 2018. [Online]. Available: https://arxiv.org/abs/1802.02212
[4] M. Ilse, J. M. Tomczak, and M. Welling, “Attention-based deep multiple instance learning,” in Proceedings of the International Conference on Machine Learning (ICML), 2018.
[5] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. W. K. Silva, K. J. Busam et al., “Clinical-grade computational pathology using weakly supervised deep learning on whole slide images,” in Nature Medicine, vol. 25, pp. 1, Aug. 2019. DOI: 10.1038/s41591-019-0508-1
[6] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” in Computing Research Repository (CoRR), arXiv, Dec. 2013.
[7] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev, “The building blocks of interpretability,” in Distill, 2018.