Monday, January 2, 2017
Summary of NIPS 2016
Basic outline:
- State of AI Research and trends in 2016
- Talks and papers that I liked
- Predictions for 2017
- Miscellaneous notes
State of Artificial Intelligence Research: 2016 Trends
GANs
For those of you who are unfamiliar with GANs, here's a simple writeup explaining them. GANs were initially developed for generating images, but researchers are exploring their applications in reinforcement learning, domain adaptation, security ML, compression, and more. A more appropriate terminology would be “adversarial methods”, because the network we are trying to train need not be "generative" - it could be a deterministic classifier or policy instead.
The research community cares about this because some loss functions are often difficult to specify directly. In many over-parameterized problems, the optimization process will “short-circuit” design flaws in the objective and yield a result that might minimize the objective, but isn't quite what we really wanted.
Ian Goodfellow gave a very nice tutorial on GANs. The main challenge for GANs right now is preventing degenerate equilibria between the discriminator and generator. These issues often manifest themselves as mode collapse (the generator learns a single tight mode over a single sample) or the discriminator "overpowering" the generator. I'm not worried about this - I think with time, intuitive tricks and recipes from experienced researchers will make them easier to use. People used to think that only Geoff Hinton could train DNNs, and that ended up not being the case.
With no way to define intelligence (except just pointing to ourselves), we don't even know if it's a quantity that can be maximized. For all we know, human-level intelligence could be a tradeoff. Maybe any entity significantly smarter than a human being would be crippled by existential despair, or spend all its time in Buddha-like contemplation.
Does intelligent life arise from optimization? Or is it an equilibrium? Or to quote Yann, “a bunch of hacks that just happen to work?” AI research is not ready to answer these questions yet, but I am optimistic that research on GANs will tease out what kinds of problems are better-suited for expressing as an equilibrium-finding vs. minimization problem.
Deep RL
People usually regard Reinforcement Learning (RL) as pertaining to playing video games and controlling robots, but RL methods are far more general than that - any ML problem with dependencies across space or time can benefit from RL techniques (i.e. sequential pixels). RL allows us to think about non-stationary data distributions as "environments" that respond to an agent's behavior. RL is applied creatively to a lot of ML problems nowadays, such as translation.
Pieter Abbeel and John Schulman gave a very nice tutorial on policy gradient methods, and John gave a talk on the “nuts and bolts” of Deep RL research. The nuts and bolts talk is jam-packed with great advice that would have taken me years to figure out on my own, and I highly recommend it for anyone who is working on RL projects.
Someone asked the Deep RL panelists what some of the most challenging problems were in RL. Rich Sutton and Raia Hadsell mentioned learning of subtasks/subgoals, i.e. hierarchical representations of the task at hand. John Schulman mentioned an interesting comment about how the Bellman discount factor in the discrete MDP setting makes it difficult to plan more than >100 time steps into the future, even when r=0.99. Finally, Chelsea Finn discussed sample complexity: it’s annoying that we have to learn every RL task from scratch, and some kind of inter-task transfer learning is needed. After all, humans can quickly learn to play new games because we've played other games before.
Bayesian Deep Learning
I think one of the most promising areas of research right now is the marriage of deep neural nets with graphical models, referred to some as “Bayesian deep learning.” It’s probably the closest thing to a theoretical understanding of deep learning that generalizes to how neural nets are used in practice.
- Symmetrized Variational Inference - a way to avoid mode collapse with Gaussian VAEs
- Rejection Sampling Variational inference - reparameterization trick for a rejection sampler, allowing us to train gamma, beta, Dirichlet and many other variational distributions! This is huge.
- Robust Inference with Variational Bayes - improving the robustness of a variational posterior to choice of prior and likelihood.
- ELBO surgery - ELBO is the most common variational lower bound used to train VAEs, and this paper proposes an alternative way to write the ELBO: split up the KL(q(z|x)|p(z)) term into an index-code mutual information term and marginal KL term KL(q(z)|p(z)). Experiments show that the latter term is actually quite large, which suggests that we should focus on improving priors. This work is quite important because it provides a easy-to-implement debugging tool for anyone training VAEs.
- Operator Variational Inference, Normalizing Flows, and Real-NVP enable richer families of posterior approximations.
- Variational lossy autoencoders
Open Research
Talks and Papers that I Liked
I had nowhere enough mental capacity to see all the papers and talks at NIPs, but I did write down some talks, papers, and posters that I came away really inspired by. Here’s a non-exhaustive list.
- Andrew Ng gave a practical tutorial on how to debug ML research projects. This is a must-watch for beginners in the field. Here is a video of a similar talk he gave at the Deep Learning summer school.
- Fixed Points of Belief Propagation - the authors formulate the convergence condition of belief propagation (BP) as the solution of polynomial equations. This makes me wonder if these methods could be used to bridge nonlinear dynamical systems with statistical inference.
- Deep Networks with Stochastic Depth - layer-wise dropout allows for training ResNets with >1000 layers.
- Layer Norm - like batch norm, layer norm is used to prevent covariance shift between successive deep layers. The advantage is that layer norm does not introduce gradient variance from other minibatch elements.
- Deep Successor Reinforcement Learning - I really liked this paper because it gets at model-based + model-free ideas. The motivation is that model-free methods like DQN take a long time to learn adequate low-level features because we are often trying to train a CNN to “see” using a single value scalar that is not directly related to image features. The idea is to learn “successor features” in an unsupervised manner (using rich supervision provided by model prediction error), and then separately learn a simple linear value function on top of these features. The paper is short and well-written.
- The Intelligent Biosphere by Drew Purves (couldn't find the talk slides). This talk was a bit of an oddball among the other mathy ones at NIPS (I thought the drawings were distracting and failed to capture the nuance of what Drew was saying), but I think the message is crucial for AI researchers who want to take a step back and think about what it means for a system to be “intelligent”. The basic idea is that “intelligence” extends beyond brains; something as simple as self-replicating RNA exhibit intelligent behavior at the evolutionary scale. The natural world is fractal, cyclic, and fuzzy, as opposed to the simplicity of our current simulated environments for benchmarking AI. The implication is that there may be challenges learning in these environments that don’t carry over from Atari games. One interesting comment Drew made was that in a biosphere, every organism is a resource to another organism. That is, learning and adaptation of each organism is not independent of other organisms (reminds me of GANs)! If you’re interested in these environment + intelligence ideas, similar arguments have been made in ethology and neuroethology texts.
- Input Convex Neural Networks - the authors presented a very interesting follow-up workshop poster at the Deep RL workshop (I can't find a paper), where they were able to solve continuous Q functions by parameterizing the Q-function to be convex with respect to the action vector of the agent (and non-convex with respect to the input state). This allows for efficient DQN policies in continuous action domains.
- Using Fast Weights to Attend to the Recent Past - basically an alternative to LSTM that seems to work a lot better.
- Phased LSTM - a way to learn an adaptive time scale for LSTM architectures, where the hidden state is updated at a learned frequency. This is ideal for learning from data with arbitrary sampling rates, such as from event-based vision sensors (which naturally solve some hard computer vision problems like aliasing and bandwidth constraints). It also happens that this Zurich lab happens to make a really awesome event-based camera and I’m really excited to see how they apply ML to it.
- Residual Networks Behave Like Ensembles of Relatively Shallow Networks - the title says it all. I thought the experiments could have been a lot better, but regardless, this work could potentially have a huge impact in new ways to train very deep ResNets quickly.
- Graph Convolutional Networks - generalization of convolutions to allow DNN processing on graph data structures (the kind with edges and vertices). This work is important for getting DL methods to learn approximate solutions to NP-hard graph problems.
- Minimizing Quadratic Functions in Constant Time - I did a double-take when I saw the title of this poster. I don’t understand the math very well, but the algorithm is very simple: subsample a n×n matrix to a k×k matrix and solve it. A simple scaling of the resulting minimized value yields an approximation to the minimized objective of the original system! In problems where we need to branch-and-bound a large search space (an optimal portfolio), we might find the value of a solution more informative than the solution itself.
Predictions for 2017
- SGD + backprop will still be the best method for training deep neural nets in 2017.
- Really good transfer learning, domain adaptation, and/or multitask learning will be one of the major accomplishments in 2017 (this was suggested by Andrew Ng during his tutorial).
- Ensemble methods (e.g. training 10 versions of a network and averaging the predictions) will squeeze out 2-5% performance on all kinds of ML tasks.
- Widespread use of model-based + model-free RL, e.g. Reinforcement Learning with Unsupervised Auxiliary Tasks as a way to alleviate sample-complexity of model-free methods.
Miscellaneous Notes
- Boston Dynamics displayed a series of very entertaining and impressive robotics demos. Their basic approach appears to be high precision hardware that behaves faithfully to a simulated planning model, plus a bit of robust control to handle the imperfections. Machine Learning doesn’t seem to be an important priority for their approach to robotic control, but I think it's good to have uncorrelated research approaches. It's hard to say which one will end up working best. God, I hope Alphabet doesn’t sell them off. Their stuff is really cool.
- Apple has publicly announced that they will open up their research culture. I think a product-driven research culture has the potential to align research towards important problems (rather than maximizing citations, as is done in academia).
- Certain prop trading firms are using neural nets to do price forecasting. I had previously thought that all trading firms were too concerned about model interpretability to use neural nets.
- Speaking of hedge funds, many top hedge funds and trading shops came to NIPS to run career booths, but there was a surprising lack of interest from attendees (attendees were more interested in the likes of Apple, Facebook, Deepmind, Google, etc). At a regular college career fair in the East Coast, these roles are reversed. The talent pool at NIPS seems to be far more interested in tech and open-ended research than making money.
- Deers, despite being “herbivores”, will sometimes eat the heads of baby birds for extra calcium (this was from Drew Purves talk). Life rejects your reductionist labels, human!
- Engineering Principles From Stable and Developing Brains (Saket Navlakha) - interestingly enough, pruning unused neuron units seems to work better for learning than growing additional neurons.
- Yoshua Bengio - much better generalization (train vs. test error) happens when telling generative models what their latent codes should be.
- There was a great question asked during the Brains and Bits panel about why we are focusing so much intelligence research on humans, instead of simpler animals like octopi and birds. Yoshua says that focusing on simple models might cause researchers to overlook more general principles of intelligence. I agree with the argument, but not it’s premise - I think it is our AI algorithms that are simple, not the octopi and birds. I believe that we should build an octopi AI before a human one (see the "whole iguana" approach).
- Yann Lecun thinks model interpretability is overrated: we have lots of interpretable algorithms, but we still lack the basic principles with which to build a general-purpose learning system that can navigate the complexities of our natural world. I agree wholeheartedly with this - requiring interpretability unnecessarily confines us to look for solutions in models that we understand bottom-up. If history of scientific discovery is any clue, that is not always the case. In fact, serendipitous discovery sometimes invalidates our bottom-up assumptions! (e.g. "neural nets don't work").
- Yann is not a fan of the log-likelihood metric. Log-likelihood is evaluated under a particular model (e.g. PixelCNN decoder), and if the model is not good to begin with, then log-likelihood is somewhat meaningless.
- Andrew Saxe asked a great question at the Brain and Bits workshop: “what are fringe topics that might be important to solving intelligence?” Demis Hassabis replied that people don't regard the study of imagination, dreaming, consciousness as very scientific, but they may become serious practical questions within 5 years. Terry Sejnowski replied that understanding sleep is crucial, as it is very much a computational phenomenon rather than a torpid one. Yoshua wonders about what it means to "understand something", i.e. neural nets and other models can learn computations and statistical correlations, but how should semantic meaning be extracted?