Here are some slides I put together to explain and present this great paper:
Karpathy, A., & Fei-Fei, L. (2015). Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR.
Let’s summarize it in two lines:
The authors proposed a way to combine information from an image and a corresponding text caption. They then use a Recurrent Neural Network (RNN) to generate text captions that describe the image.
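To make the generation side concrete, here is a very rough sketch (not the paper's exact model) of an RNN decoder conditioned on a CNN image-feature vector: the image feature initializes the hidden state, and the network then greedily emits one word token per step. All the names, sizes, and random weights below are illustrative assumptions, standing in for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 10      # toy vocabulary size
EMBED = 8       # word-embedding size
HIDDEN = 16     # RNN hidden-state size
FEAT = 32       # CNN image-feature size

# Random parameters standing in for trained weights.
W_img = rng.normal(0, 0.1, (HIDDEN, FEAT))   # image feature -> initial hidden
W_emb = rng.normal(0, 0.1, (VOCAB, EMBED))   # word embeddings
W_xh = rng.normal(0, 0.1, (HIDDEN, EMBED))   # input -> hidden
W_hh = rng.normal(0, 0.1, (HIDDEN, HIDDEN))  # hidden -> hidden
W_hy = rng.normal(0, 0.1, (VOCAB, HIDDEN))   # hidden -> vocab scores

START = 0  # special start-of-sentence token

def generate(image_feat, max_len=5):
    """Greedily decode a token sequence from an image feature vector."""
    h = np.tanh(W_img @ image_feat)           # condition the RNN on the image
    token = START
    out = []
    for _ in range(max_len):
        x = W_emb[token]                      # embed the previous word
        h = np.tanh(W_xh @ x + W_hh @ h)      # vanilla RNN step
        scores = W_hy @ h                     # unnormalized word scores
        token = int(np.argmax(scores))        # greedy choice of the next word
        out.append(token)
    return out

caption = generate(rng.normal(size=FEAT))
print(caption)
```

With trained weights, each token id would map to an actual word, and decoding would stop at an end-of-sentence token rather than after a fixed number of steps.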
Pretty interesting stuff. This was also the first paper that got me to take a close look at RNNs.