Conditional Image Generation with PixelCNN Decoders – slides

A while ago I presented this work to our reading group and attempted to explain it:

van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., & Kavukcuoglu, K. (2016). Conditional Image Generation with PixelCNN Decoders. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), NIPS (pp. 4790–4798).

I also dug a bit into their previous work:
van den Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel Recurrent Neural Networks. arXiv preprint.

While I usually post slides to the web shortly after presenting, this time I've been scared to do so. There are a few critical points in this paper that I still don't understand. And while I told myself I would spend some time figuring them out, it is now months later, and I've taken no action. So, since now is always the time to carry on in spite of fear, I'll let you, dear Internet, have these slides in all their erroneous ways.

Rather than focusing on the confusion, let's focus on what we do understand. This paper proposes a way to generate images. While we have seen this before using generative adversarial nets (GANs), this paper differs in important ways. First, this model generates images conditionally: given class information for an image (e.g., "dog"), the PixelCNN will generate images that look like a dog. In contrast, a plain GAN just generates an image that looks like a real image, without caring about the content.
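The paper folds the conditioning information into a gated activation unit: the class label is projected into the feature maps before the tanh/sigmoid gate. Here is a minimal numpy sketch of that gating idea; the names (`x_f`, `x_g`, `V_f`, `V_g`) and all shapes are my own illustrative assumptions, not the paper's exact layer layout.

```python
import numpy as np

def gated_activation(x_f, x_g, h, V_f, V_g):
    """Gated activation with conditioning, sketched in plain numpy.

    x_f, x_g : feature maps from two convolutions (flattened here for brevity)
    h        : conditioning vector (e.g., a one-hot class label)
    V_f, V_g : assumed learned projections of h into the feature space
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # tanh branch (features) gated by sigmoid branch (gate), both shifted by
    # a label-dependent bias -- this is how the class steers generation.
    return np.tanh(x_f + V_f @ h) * sigmoid(x_g + V_g @ h)

# Toy example: 4-dim feature maps conditioned on a 10-class one-hot label.
rng = np.random.default_rng(0)
h = np.zeros(10)
h[3] = 1.0                                   # pretend class 3 is "dog"
x_f, x_g = rng.standard_normal(4), rng.standard_normal(4)
V_f, V_g = rng.standard_normal((4, 10)), rng.standard_normal((4, 10))
y = gated_activation(x_f, x_g, h, V_f, V_g)
```

Changing `h` changes the gate values, so the same input features produce class-dependent activations.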

The second important way PixelCNN distinguishes itself from the GAN approach is that PixelCNNs generate images a single pixel at a time. That is, a PixelCNN will generate (i.e., sample from a learned distribution) a single pixel in the top-left corner of the image. Given this pixel, it will generate the next pixel. Then, given these two pixels, it will generate the third pixel, and so on. Thus PixelCNN conditions on both the given class information (i.e., what type of image to generate) and the pixels it has generated so far.
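That raster-order sampling loop can be sketched in a few lines. The `predict_pixel` function below is a hypothetical stand-in for the trained network; everything here is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def sample_image(predict_pixel, height, width, label, seed=0):
    """Sketch of PixelCNN-style sampling: one pixel at a time, raster order.

    predict_pixel(img, r, c, label) stands in for the trained network; it
    returns a probability distribution over the 256 intensity values for
    pixel (r, c), given the pixels generated so far and the class label.
    """
    rng = np.random.default_rng(seed)
    img = np.zeros((height, width), dtype=np.uint8)
    for r in range(height):
        for c in range(width):
            probs = predict_pixel(img, r, c, label)  # sees only prior pixels
            img[r, c] = rng.choice(256, p=probs)     # sample next intensity
    return img

# Dummy "network": a uniform distribution over intensities, ignoring context.
uniform = lambda img, r, c, label: np.full(256, 1.0 / 256)
out = sample_image(uniform, 4, 4, label="dog")
```

The key point is visible in the loop structure: every pixel is sampled conditioned on all pixels that came before it in raster order, plus the label.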

This important point is where part of my confusion lies. The paper talks about the "blind spot" problem. While I think I see what the blind spot problem is, I don't quite see how they fix it in this work. I've highlighted my areas of confusion in the slides. If you can enlighten me, please feel free to leave a comment below.

You can view the slides here, and download them in PPT or PDF format if you like.

Questions/comments? If you just want to say thanks, consider sharing this article or following me on Twitter!