Top image classification models work well on the major datasets, but how do they perform on CT scans?
A common benchmark for image classification is ImageNet. The dataset tests a model's ability to generalize its predictions across many image categories. Although this benchmark is useful in providing a common objective for all image classification models, it does not determine how effective a model is in more specific and specialized domains. In this project, we will look to understand the performance of top models in the medical imaging domain. The dataset we use comes from a current Kaggle competition and comprises cervical spine CT scans collected by the Radiological Society of North America from 12 sites around the world. The objective is to detect fractures in the spine scans. These images present a different kind of problem: the classification task is difficult even for highly trained humans, let alone machine learning models.
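For context, here is a minimal sketch of how a single CT slice might be turned into a model-ready image. It assumes the scans are distributed as DICOM files; the file path and bone-window values are placeholders we chose for illustration, not the competition's prescribed preprocessing.

```python
import numpy as np
import pydicom
from PIL import Image

def dicom_slice_to_image(path, window_center=400, window_width=1800):
    """Read one CT slice and map its Hounsfield units to an 8-bit RGB image.

    The window center/width are illustrative bone-window values, not settings
    prescribed by the competition.
    """
    ds = pydicom.dcmread(path)
    hu = ds.pixel_array.astype(np.float32)
    # Rescale raw pixel values to Hounsfield units when the tags are present.
    hu = hu * float(getattr(ds, "RescaleSlope", 1)) + float(getattr(ds, "RescaleIntercept", 0))
    lo = window_center - window_width / 2
    hi = window_center + window_width / 2
    hu = np.clip(hu, lo, hi)
    scaled = ((hu - lo) / (hi - lo) * 255).astype(np.uint8)
    # Repeat the single grayscale channel so the slice matches RGB model inputs.
    return Image.fromarray(np.stack([scaled] * 3, axis=-1))

# Hypothetical path; the real layout comes from the Kaggle download.
# img = dicom_slice_to_image("train_images/1.dcm")
```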
We will be comparing the performance of two image classification models that use different architectures to understand how well they handle the CT scan classification task. The first model we will evaluate is the Vision Transformer, which uses the Transformer architecture for classification rather than the widely accepted CNN architecture. The Transformer architecture's success in NLP made it a strong candidate for computer vision. The second model we will evaluate is ResNet[2], which represents one of the best-performing CNN architectures. Our hope is to understand whether these two models, which are among the best currently available, are suited to specialized tasks in medicine.
The paper we primarily cite, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[1], argues that the reliance on current state-of-the-art CNNs is not necessary and that a pure Transformer applied directly to a sequence of image patches can perform very well on image classification tasks. The authors conclude that Vision Transformers can achieve "excellent" results while requiring "substantially" fewer computational resources to train than convolutional networks.
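To make the idea of treating an image as a sequence of 16x16 patches concrete, here is a minimal sketch of the patch-embedding step. The dimensions are the ViT-Base defaults from the paper; the linear projection stands in for the full model (class token, position embeddings, and the Transformer encoder are omitted).

```python
import torch
import torch.nn as nn

# ViT-Base style settings: 224x224 input, 16x16 patches, 768-dim embeddings.
image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2  # 196 patches per image

# Split a batch of images into non-overlapping patches and flatten each one.
images = torch.randn(1, 3, image_size, image_size)  # (B, C, H, W)
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)  # (B, 196, 768)

# Each flattened patch is linearly projected before being fed, together with a
# class token and position embeddings, into a standard Transformer encoder.
projection = nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = projection(patches)
print(tokens.shape)  # torch.Size([1, 196, 768])
```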
As stated in the paper, self-attention-based architectures, and specifically the Transformer, were proposed by Vaswani et al. in their paper Attention Is All You Need[3] in 2017. Their focus was on machine translation tasks, and their model achieved 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. However, Dosovitskiy et al. acknowledge that convolutional networks remain dominant in computer vision. Other research has applied self-attention in local neighborhoods around each query pixel rather than globally, as done by Parmar et al. in their paper Image Transformer[4].
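For reference, the scaled dot-product attention at the core of the Transformer can be written in a few lines. This is the standard formulation from Vaswani et al.[3], not code taken from either paper, and the tensor shapes below are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al.[3]."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # pairwise query-key similarity
    weights = torch.softmax(scores, dim=-1)            # attention weights over the sequence
    return weights @ v

# Toy example: a sequence of 196 patch tokens with 64-dimensional heads.
q = k = v = torch.randn(1, 196, 64)
out = scaled_dot_product_attention(q, k, v)  # (1, 196, 64)
```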
In this study, the authors aim to scale the self-attention architecture through Transformers on large datasets that had previously only been explored with CNNs, as done by Djolonga et al. in their paper On Robustness and Transferability of Convolutional Neural Networks[5] in 2020. ResNets have been one of the most popular methods of image classification, and the paper Deep Residual Learning for Image Recognition[2] shows how effective 18-layer and 34-layer ResNets can be relative to plain networks. ResNets differ in that a shortcut connection is added around each pair of 3x3 convolutional layers, adding no extra parameters compared to the plain counterparts. The ViT paper expands on that research by continuing to focus on the aforementioned datasets, but training Transformers instead of ResNet-based models.
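A minimal sketch of the shortcut connection described above, following the basic block of He et al.[2]: the identity is added to the output of two 3x3 convolutions, so no extra parameters are introduced. The stride and projection variants used in the full 18/34-layer networks are omitted.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut, as in the 18/34-layer ResNets."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # parameter-free identity shortcut

block = BasicBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```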
We will be using the Hugging Face implementations of the models: Vision Transformer (ViT) and ResNet.
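A rough sketch of how we expect to load the two Hugging Face implementations is shown below. The specific checkpoints and the two-class (fracture / no fracture) head are assumptions on our part, not requirements of the library or the competition.

```python
from transformers import (
    AutoImageProcessor,
    ResNetForImageClassification,
    ViTForImageClassification,
)

# Pretrained checkpoints we assume as starting points for fine-tuning.
vit_name = "google/vit-base-patch16-224-in21k"
resnet_name = "microsoft/resnet-50"

vit_processor = AutoImageProcessor.from_pretrained(vit_name)
vit = ViTForImageClassification.from_pretrained(
    vit_name, num_labels=2, ignore_mismatched_sizes=True  # new 2-class head
)

resnet_processor = AutoImageProcessor.from_pretrained(resnet_name)
resnet = ResNetForImageClassification.from_pretrained(
    resnet_name, num_labels=2, ignore_mismatched_sizes=True
)

# Both models expose the same interface: preprocess an image, then read the logits.
# inputs = vit_processor(images=some_pil_image, return_tensors="pt")
# logits = vit(**inputs).logits
```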
[1] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., and Houlsby, N. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv preprint arXiv:2010.11929, 2020.
[2] He, K., Zhang, X., Ren, S., and Sun, J. "Deep Residual Learning for Image Recognition". arXiv preprint arXiv:1512.03385, 2015.
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. "Attention Is All You Need". In NIPS, 2017.
[4] Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. "Image Transformer". In ICML, 2018.
[5] Djolonga, J., Yung, J., Tschannen, M., Romijnders, R., Beyer, L., Kolesnikov, A., Puigcerver, J., Minderer, M., D'Amour, A., Moldovan, D., Gelly, S., Houlsby, N., Zhai, X., and Lucic, M. "On Robustness and Transferability of Convolutional Neural Networks". arXiv preprint arXiv:2007.08558, 2020.
Aniket Lachyankar (lachyankar.a@northeastern.edu) and Satwik Kamarthi (kamarthi.s@northeastern.edu)