Over the past couple of months, I've been trying to learn more about neural networks and deep learning by implementing various algorithms myself. Here's what I've been up to:
I loosely followed the TensorFlow CIFAR-10 tutorial, using a simple convolutional neural network to classify 32x32 images into one of 10 categories (cats, deer, airplanes, etc.). Running on a GPU for less than an hour, it reached decent accuracy.
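To make that concrete, here's a minimal sketch of a network in that spirit, written with the tf.keras API rather than the tutorial's original code; the layer sizes here are illustrative, not the tutorial's exact architecture:

```python
import tensorflow as tf

# A small CNN for 32x32 RGB images in 10 classes, in the spirit of the
# CIFAR-10 tutorial (illustrative layer sizes, not the tutorial's).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),  # one logit per CIFAR-10 class
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```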
Using a network pre-trained to classify images from the ImageNet dataset, I implemented the paper A Neural Algorithm of Artistic Style (which is similar to, but not the same as, what powers the photo-filter app Prisma). The algorithm's underlying assumption is that the activations of the earlier layers of a convolutional neural network capture the "stylistic" aspects of an image (the details of strokes, the colors used, etc.), while the later, higher-level layers capture the high-level "content" of an image. I used gradient descent to generate an image whose low-level feature activations were similar to those of the "style" image, while its higher-level feature activations stayed similar to those of the "content" image.
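Concretely, the paper compares style not on the raw activations but on their Gram matrices, i.e. the correlations between feature channels. Here's a rough numpy sketch of the loss being descended, assuming the activations have already been extracted from the pre-trained network into dicts; the relative weights are illustrative:

```python
import numpy as np

def gram_matrix(activations):
    """Correlations between feature channels; the paper's proxy for "style".
    activations: (height, width, channels) array from one conv layer."""
    h, w, c = activations.shape
    flat = activations.reshape(h * w, c)
    return flat.T @ flat / (h * w)

def style_content_loss(gen_acts, style_acts, content_acts,
                       style_weight=1e-2, content_weight=1.0):
    """Each argument is a dict mapping layer name -> activation array.
    The layer choices and weights are illustrative, not the paper's values."""
    # early layers: match the style image's channel-correlation statistics
    style_loss = sum(
        np.mean((gram_matrix(gen_acts[l]) - gram_matrix(style_acts[l])) ** 2)
        for l in style_acts)
    # later layers: match the content image's activations directly
    content_loss = sum(
        np.mean((gen_acts[l] - content_acts[l]) ** 2)
        for l in content_acts)
    return style_weight * style_loss + content_weight * content_loss
```

Gradient descent then runs on the pixels of the generated image itself, not on the network's weights.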
I'm interested in generative models that use neural networks to produce new content. I took the MNIST dataset and trained a variational autoencoder: a neural network that tries to compress its input image into a vector of numbers and then reconstruct it. You can do arithmetic on these latent vectors; for example, you can "blend" two digits together by averaging their vectors. You can also sample latent vectors at random to generate new digit-like images.
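Here's roughly what those latent-space tricks look like, as a sketch that assumes hypothetical encode/decode functions for an already-trained VAE (not any particular library's API):

```python
import numpy as np

def latent_tricks(encode, decode, image_a, image_b):
    """Latent-space arithmetic, given a trained VAE's encode/decode
    functions (hypothetical interfaces, passed in by the caller)."""
    z_a = encode(image_a)              # compress each digit to a latent vector
    z_b = encode(image_b)
    blended = decode((z_a + z_b) / 2)  # average the latents to "blend" digits

    z_random = np.random.standard_normal(z_a.shape)  # sample the N(0, I) prior
    sampled = decode(z_random)         # decodes to a random digit-like image
    return blended, sampled
```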
I classified handwritten digits 0–9 from the MNIST dataset as part of the CS2950k course at Brown University. We started by writing a fully-connected multi-layer neural network without using a deep learning framework, implementing backpropagation manually with numpy. Then, we built a fully-connected network in TensorFlow. Finally, following the Deep MNIST tutorial on the TensorFlow website, we built a convolutional network that achieved 99% accuracy on the test set.
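For the numpy stage, the heart of the exercise was writing the forward and backward passes by hand. Here's a sketch of one training step for a one-hidden-layer network; this is my own reconstruction, not the course's starter code:

```python
import numpy as np

def train_step(x, y, W1, b1, W2, b2, lr=0.01):
    """One step of manual backprop for a ReLU hidden layer + softmax output.
    x: (batch, in_dim) inputs; y: (batch,) integer labels. Updates in place."""
    # forward pass
    h = np.maximum(0, x @ W1 + b1)                        # ReLU hidden layer
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)          # softmax
    loss = -np.log(probs[np.arange(len(y)), y]).mean()    # cross-entropy

    # backward pass: d(loss)/d(logits) for softmax + cross-entropy
    # is (probs - onehot(y)) / batch_size
    d_logits = probs.copy()
    d_logits[np.arange(len(y)), y] -= 1
    d_logits /= len(y)
    dW2 = h.T @ d_logits
    db2 = d_logits.sum(axis=0)
    d_h = d_logits @ W2.T
    d_h[h <= 0] = 0                                       # backprop through ReLU
    dW1 = x.T @ d_h
    db1 = d_h.sum(axis=0)

    # gradient-descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss
```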
Deep convolutional generative adversarial networks (DCGANs) are a really interesting approach to the problem of generating "plausible fakes": images that look like something that could have come from a real-world dataset, but are actually randomly generated. A DCGAN is essentially two networks: a "generator" that generates images, and a "discriminator" that is trained to guess whether an image comes from the real dataset or was generated by the generator. The two are trained in alternation; when the generator is being trained, error is backpropagated through the discriminator first, pushing the generator toward outputs that make the discriminator think they're real.
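One alternating training step looks roughly like this. This is a sketch using tf.keras's GradientTape, assuming `generator` and `discriminator` are already-built Keras models; my actual code differed:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(2e-4)
d_opt = tf.keras.optimizers.Adam(2e-4)

def gan_train_step(real_images, generator, discriminator, noise_dim=100):
    noise = tf.random.normal([tf.shape(real_images)[0], noise_dim])

    # discriminator step: real images labeled 1, generated images labeled 0
    with tf.GradientTape() as tape:
        fake_images = generator(noise, training=True)
        d_real = discriminator(real_images, training=True)
        d_fake = discriminator(fake_images, training=True)
        d_loss = (bce(tf.ones_like(d_real), d_real)
                  + bce(tf.zeros_like(d_fake), d_fake))
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # generator step: error flows back through the discriminator, but only
    # the generator's weights are updated, pushing it toward outputs the
    # discriminator labels "real"
    with tf.GradientTape() as tape:
        fake_images = generator(noise, training=True)
        d_fake = discriminator(fake_images, training=True)
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
```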
The DCGAN proved difficult to train: the learning rate needed to be fine-tuned to prevent the generator from collapsing to a single output image. The discriminator also seemed to learn much faster than the generator; it would reach 90% accuracy quickly, while the generator would take 10x more steps to get good enough at "fooling" it to bring its accuracy back down to 50%.
I'd like to try using generative adversarial models on other image datasets, like ImageNet or faces, or on audio or textual data. I'd also like to experiment with different architectures to make GANs easier to train; training multiple discriminators at once might help by making it less likely for the generator to overfit its output to one particular discriminator's quirks.
DeepMind stunned the machine-learning community in 2013 by announcing that it had built a single neural-network architecture that could learn to play any of seven Atari games, taking as input only the raw pixels on the screen and a "reward" value that told it when it scored a point. The computer "plays" the game many thousands of times, and is trained to take a game-screen image and output the action that's most likely to lead to rewards down the line.
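The training target in Deep Q-Learning is the Bellman backup: the network's value estimate for the action it took is regressed toward the observed reward plus the discounted value of the best action in the next state. A tiny sketch, with `q_network` standing in as a hypothetical function mapping a state to per-action values:

```python
import numpy as np

def q_target(reward, next_state, done, q_network, gamma=0.99):
    """Bellman target for one transition; gamma is the discount factor."""
    if done:
        return reward                   # terminal step: no future reward
    next_q = q_network(next_state)      # predicted value of each next action
    return reward + gamma * np.max(next_q)
```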
Rather than an Atari game, I wrote a very simple game for the computer to play: "rocks" fall from random positions at the top of the screen, and the player moves their "paddle" horizontally to catch them. Teaching a network to play the game was more difficult than I expected. Using Deep Q-Learning like DeepMind did, I was able to train a pretty good AI to play the 8x8-square version of the game, but I had trouble training one to succeed on the 16x16 version. I switched to using policy gradients, a slightly simpler approach, and was able to train a successful player on the 16x16 version. Unlike the networks in the DeepMind paper, neither of these networks was convolutional; convolutional versions didn't seem to converge quickly, and I'm not sure the game was complex enough for them to have had any benefit.
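The policy-gradient update I used works roughly like this (a REINFORCE-style numpy sketch, with a single linear layer standing in for my actual network, and illustrative hyperparameters):

```python
import numpy as np

def policy_gradient_episode(W, states, actions, rewards, lr=0.01, gamma=0.99):
    """Update policy weights W (state_dim x num_actions) from one episode.
    states: list of state vectors; actions: chosen action indices;
    rewards: per-step rewards (e.g. +1 when a rock is caught)."""
    # discounted returns: actions that led to later catches get more credit
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    returns -= returns.mean()              # simple baseline to reduce variance

    for s, a, G in zip(states, actions, returns):
        logits = s @ W
        exp = np.exp(logits - logits.max())
        probs = exp / exp.sum()            # softmax action distribution
        # gradient of log pi(a|s) w.r.t. the logits is (onehot(a) - probs)
        d_logits = -probs
        d_logits[a] += 1
        W += lr * G * np.outer(s, d_logits)  # ascend the expected return
    return W
```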