Deep Learning Final Project For CS 7150

Augmentation and Adjustment: Can we improve Neural Image Caption model performance?

Neural Image Caption (NIC) generates image captions using a vision-based deep convolutional network paired with an LSTM

Related Work

In 2015, Oriol Vinyals and his team [1] showed that a caption can be generated automatically for an input image by combining advanced computer vision and natural language processing techniques. Their model, dubbed the Neural Image Caption (NIC), uses a Convolutional Neural Network (CNN) together with a Recurrent Neural Network (RNN) to capture not only the objects within an image but also the relationships between them. Previous attempts at the problem stitched sentences together from detected objects or existing image descriptions to describe an image's content in words [2], [3]. The NIC paper instead presented a novel technique, inspired by 2014 breakthroughs in machine translation, that achieved state-of-the-art performance. Cho et al. [4] showed that text translation can be accomplished simply by using RNNs to encode source sentences and then decode them into target sentences. Vinyals et al. adapted this idea into a more robust model by using a CNN as the encoder, which produces a rich representation of the input image by embedding it into a fixed-length vector.
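
As a rough illustration, the encoder-decoder idea can be sketched in PyTorch as below. This is our own simplification, not the authors' code: the `ImageCaptioner` class, the ResNet-18 backbone (the paper used a GoogLeNet-era CNN), and the layer sizes are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageCaptioner(nn.Module):
    """Minimal NIC-style encoder-decoder: a CNN embeds the image into a
    fixed-length vector, which seeds an LSTM that emits the caption."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)  # stand-in encoder, not the paper's CNN
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings [5]
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        img_vec = self.encoder(images).unsqueeze(1)  # (B, 1, embed_dim)
        words = self.embed(captions)                 # (B, T, embed_dim)
        # Feed the image vector once, as the first input step, then the tokens
        inputs = torch.cat([img_vec, words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.head(hidden)                     # per-step vocabulary logits
```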

The NIC model directly maximizes the probability of a correct description given an image. To model the probability of a sentence given an image, the authors chose a particular RNN, the Long Short-Term Memory (LSTM), because it handles sequence tasks well and mitigates vanishing and exploding gradients. As input to the LSTM, they used word embedding vectors [5] and speculated that distances between word embeddings may provide extra information for rarer words. We aim to explore this hypothesis by adjusting the word embedding vectors so that each word lies closest to other words with similar definitions, which may in turn improve the model's performance in certain edge cases. The paper also notes that performance improves as dataset size increases. This is true of many machine learning models, and we would like to discover whether the dependence on dataset size can be mitigated simply through data augmentation approaches such as image rotation and positional translation. With these changes, we hope to improve the model's performance so that it can better serve those who need it.
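
Concretely, the training objective from [1] maximizes the log-probability of the correct caption S given the image I, with the sentence probability factorized word by word via the chain rule, which is exactly what the LSTM models step by step:

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta)

\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})
```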

Project Proposal

Vision and natural language have been active topics in the machine learning world for the past seven years. Although Vinyals and his team achieved state-of-the-art performance, many images were still described incorrectly at the time. For the visually impaired, an incorrect caption can misrepresent the actual situation, which is critical because the visually impaired stand to benefit the most from this type of technology. The question, then, is whether NIC's performance can be improved. One idea is to enlarge the training dataset, but the biggest obstacle is finding or creating a large enough dataset to use, which is why we propose data augmentation techniques. By leveraging data augmentation, the training dataset can be artificially enlarged, potentially yielding a performance gain.
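
A minimal sketch of such a pipeline with torchvision follows; the transform choices and parameter values are illustrative assumptions on our part, not a prescription from [1].

```python
import torchvision.transforms as T

# Hypothetical augmentation pipeline for the training images: small random
# rotations and positional translations artificially enlarge the dataset,
# while the paired caption stays the same. Parameter values are illustrative.
augment = T.Compose([
    T.RandomRotation(degrees=10),                     # image rotation
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # positional translation
    T.Resize((224, 224)),
    T.ToTensor(),
])
# Each epoch then sees a slightly different version of every (image, caption) pair.
```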

Another way to potentially improve performance is to adjust the word embeddings. Vinyals et al. [1] note that the word embedding vector model was chosen over one-hot encodings and bag-of-words representations because it is independent of dictionary size and captures more structure. We propose adjusting the word embeddings so that words with similar definitions are closest to each other. This should allow the model to select better sentences to describe an image, improving the NIC model's performance. By pairing this embedding adjustment with the data augmentation techniques above, we can test whether performance can approach or match human performance.
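
One hypothetical way to implement this adjustment, sketched below, is to nudge each embedding toward the centroid of its WordNet synonyms. The function name `pull_synonyms_together`, the use of NLTK's WordNet (which requires `nltk.download('wordnet')`), and the update rule are our assumptions, not part of [1].

```python
import torch
from nltk.corpus import wordnet as wn  # assumes nltk with WordNet data installed

def pull_synonyms_together(embeddings, word_to_idx, lr=0.1):
    """One adjustment pass: nudge each word's vector toward the mean of its
    WordNet synonyms, so words with similar definitions end up closer together.
    `embeddings` is a (vocab_size, dim) tensor; `word_to_idx` maps word -> row."""
    adjusted = embeddings.clone()
    for word, i in word_to_idx.items():
        syn_ids = [word_to_idx[l.name()]
                   for s in wn.synsets(word)
                   for l in s.lemmas()
                   if l.name() in word_to_idx and l.name() != word]
        if syn_ids:
            target = embeddings[syn_ids].mean(dim=0)
            # Interpolate rather than overwrite, so each vector stays anchored
            # near the position the captioning objective originally learned.
            adjusted[i] = (1 - lr) * embeddings[i] + lr * target
    return adjusted
```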

References

[1] O. Vinyals, et al. Show and tell: A neural image caption generator. In CVPR, 2015.

[2] A. Farhadi, et al. Every picture tells a story: Generating sentences from images. In ECCV, 2010.

[3] G. Kulkarni, et al. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.

[4] K. Cho, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.

[5] T. Mikolov, et al. Efficient estimation of word representations in vector space. In ICLR, 2013.

Team Members

Click here for project progress