Paper Reading - Convolutional Image Captioning (CVPR 2018)

2023-01-25 22:05:50

Link of the Paper: https://arxiv.org/abs/1711.09151

Motivation:

LSTM units are complex and inherently sequential across time.
Convolutional networks have shown advantages on machine translation and conditional image generation.

Innovation:

The authors develop a convolutional (CNN-based) image captioning method whose performance on standard metrics is comparable to that of an LSTM-based method.
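This note does not spell out how a convolutional decoder avoids looking at future words. One common device is a causal (masked) convolution, in which the output at position t depends only on inputs up to t. The following is a minimal sketch under that assumption, not the authors' implementation; the function name `causal_conv1d` and the scalar "word" values are hypothetical and chosen only for illustration.

```python
def causal_conv1d(seq, kernel):
    """Causal 1D convolution over a list of floats.

    The input is left-padded with zeros, so the output at position t
    is a weighted sum of seq[t - k + 1 .. t] only (no future inputs).
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(seq)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(seq))
    ]


# Illustrative inputs: changing the last element must not affect
# the outputs at earlier positions.
original = causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
modified = causal_conv1d([1.0, 2.0, 3.0, 100.0], [1.0, 1.0, 1.0])
print(original[:3] == modified[:3])
```

Unlike a recurrent update, every output position here can be computed independently of the others, which is what allows such a layer to process all time steps in parallel during training.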


The authors analyze the characteristics of CNN and LSTM nets and provide useful insights: CNNs produce more entropy in the output distribution (useful for diverse predictions), achieve better classification accuracy, and do not suffer from vanishing gradients.
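To make the entropy claim concrete, the sketch below computes the Shannon entropy of two softmax distributions over a toy four-word vocabulary. The logits are made-up values, not results from the paper; the point is only that a flatter distribution has higher entropy and therefore leaves more room for diverse word choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits: one peaked distribution, one uniform.
peaked = softmax([8.0, 0.0, 0.0, 0.0])
flat = softmax([1.0, 1.0, 1.0, 1.0])

print(entropy(peaked) < entropy(flat))
```

For a uniform distribution over four words the entropy equals log 4, the maximum possible for that vocabulary size.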

Improvement:

Performance improves further with a CNN model that uses an attention mechanism to leverage spatial image features.
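As a rough illustration of attention over spatial image features, the sketch below scores each spatial location against a query vector, normalizes the scores with a softmax, and returns the weighted sum of the features. The dot-product scoring function, the name `attend`, and the toy feature values are assumptions made for this example; the paper's exact formulation may differ.

```python
import math


def attend(features, query):
    """Dot-product attention over a list of feature vectors.

    features: one vector per spatial location of the image.
    query:    a vector of the same dimension (e.g. a decoder state).
    Returns (weights, context), where the weights sum to 1 and
    context is the attention-weighted average of the features.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


# Three hypothetical spatial locations with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(features, [0.0, 5.0])
print(weights)
print(context)
```

Locations whose features align with the query receive larger weights, so the context vector emphasizes the image regions most relevant to the word being predicted.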

General Points:

Image captioning is applicable to virtual assistants, editing tools, image indexing, and support for people with disabilities.
Image captioning is a basic ingredient for more complex operations such as storytelling and visual summarization.
An illustration of a classical RNN architecture for image captioning is provided below.