Deep Visual-Semantic Alignments for Generating Image Descriptions – [slides]

Here are some slides I put together to explain and present this great paper:

Karpathy, A., & Fei-Fei, L. (2015). Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR.

Let’s summarize it in two lines:
The authors propose a model that aligns regions of an image with fragments of its corresponding text caption in a shared embedding space. They then use a Recurrent Neural Network (RNN), conditioned on the image, to generate text captions that describe it.
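
To make the image-and-caption matching idea a bit more concrete, here is a minimal sketch of how a score between one image and one sentence can be computed from embeddings. It follows the simplified score as I read it in the paper: each word is matched to its best-scoring image region by dot product, and those best scores are summed. The function name `image_sentence_score` and the toy 2-D vectors are my own assumptions for illustration; in the paper the region vectors come from a CNN and the word vectors from a bidirectional RNN.

```python
def image_sentence_score(regions, words):
    """Sum, over words, of the best dot product with any image region.

    regions: list of region embedding vectors (lists of floats)
    words:   list of word embedding vectors (same dimension)
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Each word picks the single region it agrees with most.
    return sum(max(dot(r, w) for r in regions) for w in words)


# Toy embeddings, made up for illustration only.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[2.0, 0.5], [0.1, 3.0]]
score = image_sentence_score(regions, words)
```

During training, a score like this would be pushed higher for matching image-sentence pairs than for mismatched ones, which is what lets the model learn the alignment without region-level labels.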

Pretty interesting stuff. This is the first paper where I really took a close look at RNNs as well.