Augmentation and Adjustment - Final Report
Improving performance for neural image caption generation

Our Architecture

The diagram below presents a high-level overview of our implementation:

Block Diagram

Introduction

The intersection of vision and natural language processing has been an active topic in machine learning over the past seven years. With the Neural Image Caption (NIC) technique introduced in 2015, Vinyals and his team achieved state-of-the-art performance on the image captioning task [1]. Although the development of this neural network architecture was an extraordinary feat, the paper still showed and discussed many incorrectly captioned images. For the visually impaired, who depend on this type of technology, an incorrect caption can misrepresent the actual scene. The question then becomes whether the performance of NIC can be improved.

Our goal is to improve NIC's performance in order to better serve those who need it. We approached this task by drawing on lessons from the Deep Learning course taught by Professor David Bau, recommendations from the paper's authors, and our own intuition. Over the course of the project, we built our own NIC from scratch. In doing so, we made modifications and upgrades to the existing architecture, the training data, and the evaluation criteria (Figure 1). Our results show that NIC can be improved significantly, and we identified additional methods that could improve it further.

Background

Since Yann LeCun introduced convolutional neural networks (CNNs) in 1989 [2], many have adopted deep learning models for processing images. Previously proposed solutions for image captioning tended to stitch together multiple frameworks [3, 4]. The significance of the Show and Tell paper lies in its end-to-end design, which combines a deep CNN for image processing with a Recurrent Neural Network (RNN) for sequence modeling into a single CNN-RNN network that generates descriptions of images. The authors were heavily inspired by Cho et al. (2014), who pioneered the RNN-RNN encoder-decoder structure that achieved state-of-the-art performance in machine translation [5].

This approach allows image processing and text generation to occur within the same network, in contrast to methods such as the one developed by Li et al., which started from object detections and pieced together a final description using phrases containing the detected objects and relationships [3]. Given all of these endeavors to solve a task that is simple for humans but extremely complex for machines, we were motivated to dissect the NIC model and discover how we could make additional improvements without completely changing its overall structure.

Method

For our project we chose to re-implement NIC from scratch and to add a deeper 152-layer Residual Network (ResNet-152), Layer Normalization between the Long Short-Term Memory (LSTM) cells, several different image augmentations, and nucleus sampling [12]. We wanted to understand the architecture so that we could determine the best way to improve its performance. In total, we compared results for the architecture in the original NIC paper, the PyTorch-based NIC architecture by Shwetank Panwar [6], and our own re-implementation of NIC (Table 1).

Table 1

Table 1. Architecture comparison of the original NIC model, Shwetank Panwar's PyTorch implementation of NIC, and our own PyTorch implementation of NIC.
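To make the comparison concrete, the sketch below shows a minimal PyTorch encoder-decoder in the spirit of our implementation: a ResNet-152 feature extractor feeding an LSTM decoder, with LayerNorm used in place of dropout. The layer sizes (embed_size, hidden_size, vocab_size) and the exact placement of the LayerNorm are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """ResNet-152 feature extractor; the classification head is replaced so the
    image is projected into the caption embedding space."""

    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet152(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the FC head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)
        self.norm = nn.LayerNorm(embed_size)  # LayerNorm in place of dropout

    def forward(self, images):
        with torch.no_grad():  # illustrative choice: keep the CNN frozen
            features = self.backbone(images).flatten(1)
        return self.norm(self.fc(features))


class DecoderRNN(nn.Module):
    """LSTM decoder that consumes the image feature as the first input step."""

    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        embeddings = self.embed(captions[:, :-1])  # teacher forcing on shifted captions
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(self.norm(hiddens))  # per-step vocabulary logits
```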

For augmentations, we chose three approaches that used simple and complex transformations, which are summarized below in Figure 2. We used the Albumentations Python package [7], which boasts fast and flexible image augmentations, to perform these transformations. We selected these functions because we wanted to create diversity within the dataset so that the model could better learn image captioning.

Figure 2
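As a rough illustration of how such a pipeline is assembled with Albumentations, the snippet below composes a few common transforms; the specific transforms and probabilities shown are illustrative assumptions and do not reproduce Methods 1-3 exactly.

```python
import numpy as np
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Illustrative pipeline only: the exact transforms and probabilities used in
# Methods 1-3 differ; this just shows how Albumentations composes augmentations.
train_transform = A.Compose([
    A.Resize(256, 256),
    A.RandomCrop(224, 224),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])

# Albumentations operates on numpy arrays in (H, W, C), RGB order:
# augmented = train_transform(image=np.asarray(pil_image))["image"]
```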
We also adopted a newer sampling method called nucleus sampling, which places a probability-mass threshold on the cumulative distribution of the word probabilities and samples only from the smallest set of words whose total probability exceeds that threshold; this allows the model to choose words that may be more surprising than those produced by beam search or top-k sampling [8]. To implement nucleus sampling, we used temperature sampling to rescale the word probabilities before sampling. Temperature sampling is inspired by statistical thermodynamics, where the temperature controls how likely the system is to occupy higher-energy (less probable) states [9].
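A minimal sketch of this decoding step is shown below, combining temperature scaling with a top-p (nucleus) cutoff over one step of decoder logits; the threshold p = 0.9 and the temperature default are illustrative assumptions, not our tuned settings.

```python
import torch
import torch.nn.functional as F


def nucleus_sample(logits, p=0.9, temperature=1.0):
    """Sample one token id from the smallest set of words whose cumulative
    probability exceeds p, after temperature scaling (illustrative defaults)."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # Keep the smallest prefix whose mass reaches p (the top word always stays).
    keep = (cumulative - sorted_probs) < p
    keep[..., 0] = True
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids.gather(-1, choice)
```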


Results

The image augmentations in Methods 2 and 3 resulted in suboptimal captions, while Method 1 produced text descriptive of the input images (Figure 4). As shown in Figure 4 (A), the captions on both images fail to describe the scene at all. We suspect this is due to the more complex image augmentations applied to the images in the second and third methods.

Figure 4
By probing the model, we discovered the underlying reason these methods were underperforming. The sample augmented image displayed in Figure 5 confirmed our belief that image augmentation can be detrimental to the model when applied in excess. Having identified this issue with Methods 2 and 3, we decided to move forward with Method 1 alone to train the model and then test the effect of nucleus sampling.

Figure 5
We evaluated the model by generating a caption for every image in the 5,000-image validation set, computing the BLEU-4 score against the five human-written annotations as references, and taking the mean over all images. The results show that our model outperforms both the original NIC architecture and the aforementioned Shwetank model (Figure 6).
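For reference, this evaluation loop amounts to something like the following sketch, here using NLTK's sentence-level BLEU with smoothing; the helper name mean_bleu4 and the choice of smoothing method are illustrative assumptions about our pipeline.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def mean_bleu4(generated, references):
    """Average sentence-level BLEU-4 over the validation set.

    generated:  list of candidate captions, each a list of tokens
    references: list of reference sets, each a list of five tokenized captions
    """
    smooth = SmoothingFunction().method1  # avoid zero scores on short captions
    scores = [
        sentence_bleu(refs, cand, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
        for cand, refs in zip(generated, references)
    ]
    return 100 * sum(scores) / len(scores)  # reported on the usual 0-100 scale
```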

Figure 6
Notably, nucleus sampling significantly increased performance in both of the tested models, yielding BLEU-4 scores of 31.1 and 27.5 for our model and Shwetank's model, respectively. Generating captions with beam search, by contrast, lowered the scores by approximately 3 points for both models. Nucleus sampling is a relatively new technique, published five years after the Show and Tell paper by Vinyals et al. [1], and our results demonstrate that it is more effective than beam search in this setting.

Regardless of the sampling technique used for caption generation, our architecture performs better than the Shwetank model in terms of BLEU-4 score. When employing nucleus sampling, for example, our score exceeds Shwetank's by about 4 points. This could be due to our choices of data augmentation, the increased ResNet hidden size, the use of LayerNorm instead of dropout, or the number of calls to the LSTM decoder. Due to limited computing resources, we did not have the chance to train separate models and identify the most effective architecture adjustment. This would, however, be an interesting direction to explore in the future: changing one element of the architecture at a time, then training and evaluating the model.

Table 2
Table 2 displays the BLEU-4 scores for our method alongside several others that have been successful in recent years. The key developments in those methods include semantic representation of the input image and the addition of attention [10]. While our method could not outperform these models, which incorporate newer and more advanced techniques, we show that simple and subtle changes can improve performance significantly. In our case, we observed a 3.9-point BLEU-4 increase over the original NIC model.

Slides

Source Code

Link to Source Code

References

[1] Vinyals, O., et al. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[2] LeCun, Y., et al. Backpropagation applied to handwritten zip code recognition. Neural Computation, vol. 1, no. 4, pp. 541-551, Dec. 1989.

[3] Li, S., Kulkarni, G., Berg, T. L., Berg, A. C., and Choi, Y. Composing simple image descriptions using web-scale n-grams. In Conference on Computational Natural Language Learning (CoNLL), 2011.

[4] Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. Every picture tells a story: Generating sentences from images. In ECCV, 2010.

[5] Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.

[6] Panwar, Shwetank. NIC-2015-Pytorch. GitHub repository, 2019.

[7] Buslaev, A., Iglovikov, V. I., Khvedchenya, E., Parinov, A., Druzhinin, M., and Kalinin, A. A. Albumentations: Fast and flexible image augmentations. Information, vol. 11, no. 2, article 125, 2020.

[8] Holtzman, A., et al. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

[9] Mann, Ben. How to sample from language models. Online article, 2019.

[10] Zhang, Z., Wu, Q., Wang, Y., and Chen, F. Exploring region relationships implicitly: Image captioning with visual relationship attention. Image and Vision Computing, vol. 109, 104146, 2021.

[11] Su, J., Tang, J., Lu, Z., Han, X., and Zhang, H. A neural image captioning model with caption-to-images semantic constructor. Neurocomputing, vol. 367, 2019.

[12] LabML AI. Annotated deep learning paper implementations. GitHub repository, 2022.

Team Members