Illustrated: Efficient Neural Architecture Search
--- Guide on macro and micro search strategies in ENAS
2019-03-27 09:41:07
This blog is copied from: https://towardsdatascience.com/illustrated-efficient-neural-architecture-search-5f7387f9fb6
Designing neural networks for various tasks like image classification and natural language understanding often requires significant architecture engineering and expertise. Enter Neural Architecture Search (NAS), a task to automate the manual process of designing neural networks. NAS owes its growing research interest to the increasing prominence of deep learning models of late.
There are many ways to search for or discover neural architectures. Over the past couple of years, the community has seen different search methods proposed including:
- Reinforcement learning
  - Neural Architecture Search with Reinforcement Learning (Zoph and Le, 2016)
  - NASNet (Zoph et al., 2017)
  - ENAS (Pham et al., 2018)
- Evolutionary algorithm
  - Hierarchical Evo (Liu et al., 2017)
  - AmoebaNet (Real et al., 2018)
- Sequential model-based optimisation (SMBO)
  - PNAS (Liu et al., 2017)
- Bayesian optimisation
  - Auto-Keras (Jin et al., 2018)
  - NASBOT (Kandasamy et al., 2018)
- Gradient-based optimisation
  - SNAS (Xie et al., 2018)
  - DARTS (Liu et al., 2018)
In this post, we will look at Efficient Neural Architecture Search (ENAS), which employs reinforcement learning to build convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The authors (Hieu Pham, Melody Guan, Barret Zoph, Quoc V. Le, and Jeff Dean) proposed a predefined neural network to generate new neural networks guided by a reinforcement learning framework using macro and micro search. That’s right: a neural network building another neural network.
The purpose of this article is to give readers a tutorial on how the macro and micro search strategies lead to generating neural networks. While the illustrations and animations serve to guide the readers, the sequence of animations does not necessarily reflect the flow of operations (due to vectorisation etc.).
We shall narrow the scope of this tutorial to neural architecture search for CNNs in an image classification task. This article assumes that the reader is familiar with the basics of RNNs, CNNs, and reinforcement learning. Familiarity with deep learning concepts like transfer learning and skip/residual connections will greatly help as they are heavily used in the architecture search. It is not required to have read the paper, but it would speed up your understanding.
Contents
0. Overview
1. Search Strategy
1.1. Macro Search
1.2. Micro Search
2. Notes
3. Summary
4. Implementations
5. References
0. Overview
In ENAS, there are 2 types of neural networks involved:
- Controller – a predefined RNN, which is a long short-term memory (LSTM) cell
- Child model – the desired CNN for image classification
Like most other NAS algorithms, ENAS involves 3 concepts:
- Search space — all the different possible architectures or child models that can possibly be generated
- Search strategy — a method to generate these architectures or child models
- Performance evaluation — a method to measure the effectiveness of the generated child models
Let’s see how these five ideas form the ENAS story.
The controller controls or directs the building of the child model’s architecture by “generating a set of instructions” (or, more rigorously, making decisions or sampling decisions) using a certain search strategy. These decisions are things like what types of operations (convolutions, pooling etc.) to perform at a particular layer of the child model. Using these decisions, a child model is built. A generated child model is one of the many possible child models that can be built in the search space.
This particular child model is then trained to convergence (~95% training accuracy) using stochastic gradient descent to minimise the expected loss function between the predicted class and ground truth class (for an image classification task). This is done for a specified number of epochs, what I’d like to call child epochs, say 100. Then, a validation accuracy is obtained from this trained model.
Then, we update the controller’s parameters using REINFORCE, a policy-based reinforcement learning algorithm, to maximise the expected reward function which is the validation accuracy. This parameter update hopes to improve the controller in generating better decisions that give higher validation accuracies.
This entire process (from 3 paragraphs before this) is just one epoch — let’s call it controller epoch. We then repeat this for a specified number of controller epochs, say 2000.
Of all the 2000 child models generated, the one with the highest validation accuracy gets the honour to be the neural network for your image classification task. However, this child model must go through just one more round of training (again specified by the number of child epochs), before it can be used for deployment.
A pseudo algorithm for the entire training is written below:
CONTROLLER_EPOCHS = 2000
CHILD_EPOCHS = 100
Build controller network
for i in range(CONTROLLER_EPOCHS):
    1. Generate a child model
    2. Train this child model for CHILD_EPOCHS
    3. Obtain val_acc
    4. Update controller parameters
Get child model with the highest val_acc
Train this child model for CHILD_EPOCHS
This entire problem is essentially a reinforcement learning framework with the archetypal elements:
- Agent — Controller
- Action — The decisions taken to build the child network
- Reward — Validation accuracy from the child network
The aim of this reinforcement learning task is to maximise the reward (validation accuracy) from the actions taken (decisions taken to build child model architecture) by the agent (controller).
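To make the agent, action and reward loop concrete, here is a minimal toy sketch in plain Python. It is not the ENAS implementation: the "controller" is a single softmax over six operation names, `fake_val_acc` is a hypothetical stand-in for training a child model and measuring its validation accuracy, and the moving-average baseline is a common variance-reduction trick that this article does not cover.

```python
import math
import random

OPS = ["conv3x3", "conv5x5", "sep3x3", "sep5x5", "max3x3", "avg3x3"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fake_val_acc(op):
    # Hypothetical reward: stands in for "train the child, then validate".
    # We simply pretend that sep5x5 gives the best validation accuracy.
    return 0.9 if op == "sep5x5" else 0.5

def train_controller(controller_epochs=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(OPS)  # the toy controller's only parameters
    baseline = 0.0             # moving average of past rewards (assumption)
    for _ in range(controller_epochs):
        probs = softmax(logits)
        action = rng.choices(range(len(OPS)), weights=probs)[0]
        reward = fake_val_acc(OPS[action])
        advantage = reward - baseline
        # REINFORCE: d log pi(action) / d logit_k = onehot(action)_k - probs_k
        for k in range(len(OPS)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
        baseline = 0.95 * baseline + 0.05 * reward
    return softmax(logits)

probs = train_controller()
best = OPS[probs.index(max(probs))]
print(best, round(max(probs), 3))
```

The real controller makes many decisions per child model rather than one, but the update has the same shape: decisions that led to a higher-than-usual reward become more likely.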
1. Search Strategy
Recall in the previous section that the controller generates the child model’s architecture using a certain search strategy. There are two questions that you should ask in this statement — (1) how does the controller make decisions and (2) what search strategy?
How does the controller make decisions?
This brings us to the model of the controller, which is an LSTM. This LSTM samples decisions via softmax classifiers, in an auto-regressive fashion: the decision in the previous step is fed as input embedding into the next step.
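The auto-regressive sampling described above can be sketched as follows. This is a toy under loud assumptions: a tiny tanh recurrent cell with random, untrained weights replaces the LSTM, and every name here (`sample_decisions`, `hidden`, the start token) is made up for illustration. The point is only the data flow: each sampled decision is looked up in an embedding table and fed in as the next step's input.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_decisions(num_steps, num_choices, hidden=8, seed=0):
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    embed = rand_matrix(num_choices + 1, hidden)  # last row: start token
    w_h = rand_matrix(hidden, hidden)             # recurrent weights
    w_out = rand_matrix(num_choices, hidden)      # softmax classifier weights

    h = [0.0] * hidden
    prev = num_choices  # index of the start token
    decisions = []
    for _ in range(num_steps):
        x = embed[prev]  # embedding of the previous decision
        # Simplified recurrent update (a real controller would use an LSTM).
        h = [math.tanh(sum(w * hj for w, hj in zip(row, h)) + xi)
             for row, xi in zip(w_h, x)]
        logits = [sum(w * hj for w, hj in zip(row, h)) for row in w_out]
        prev = rng.choices(range(num_choices), weights=softmax(logits))[0]
        decisions.append(prev)
    return decisions

decisions = sample_decisions(num_steps=4, num_choices=6)
print(decisions)
```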
What are the search strategies?
The authors of ENAS proposed 2 strategies for searching for or generating an architecture.
- Macro search
- Micro search
Macro search is an approach where the controller designs the entire network. Examples of publications that use this include NAS by Zoph and Le, FractalNet and SMASH. On the other hand, micro search is an approach where the controller designs modules or building blocks, which are combined to build the final network. Some papers that implement this approach are Hierarchical NAS, Progressive NAS and NASNet.
In the following 2 sub-sections we will see how ENAS implements these 2 strategies.
1.1 Macro Search
In macro search, the controller makes 2 decisions for every layer in the child model:
- the operation to perform on the previous layer (see Notes for the list of operations)
- the previous layer to connect to for skip connections
In this macro search example, we will see how the controller generates a 4-layer child model. Each layer in this child model is colour-coded with red, green, blue and purple respectively.
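To get a feel for how large the macro search space is, here is a rough count. It assumes the 6 operations listed in the Notes section and that each earlier layer is independently either connected or not connected as a skip connection; the function name is made up for this sketch.

```python
def macro_search_space_size(num_layers, num_ops=6):
    # One operation choice per layer.
    op_choices = num_ops ** num_layers
    # Layer k has k - 1 earlier layers, each either connected or not.
    skip_slots = num_layers * (num_layers - 1) // 2
    return op_choices * 2 ** skip_slots

size_4 = macro_search_space_size(4)
print(size_4)  # 6**4 * 2**6 = 82944
```

Even the 4-layer toy example has tens of thousands of candidate child models, which is why the controller samples architectures instead of enumerating them.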
Convolutional Layer 1 (Red)
We’ll start with running the first time step of the controller. The output of this time step is softmaxed to get a vector, which translates to a conv3×3 operation.
What this means for the child model is that we perform a convolution with a 3×3 filter on the input image.
The output from the first time step (conv3×3) of the controller corresponds to building the first layer (red) in the child model. This means the child model will first perform a 3×3 convolution on the input image.
I know I mentioned that the controller needs to make 2 decisions but there’s only 1 here. Since this is the first layer, we can only sample one decision, which is the operation to perform, because there’s nothing else to connect to except for the input image itself.
Convolutional Layer 2 (Green)
To build the subsequent convolutional layers, the controller makes 2 decisions (no more lies): (i) operation and (ii) layer(s) to connect to. Here, we see that it generated 1 and sep5×5.
What this means for the child model is that we first perform a sep5×5 operation on the output of the previous layer. Then, this output is concatenated along the depth together with the output of Layer 1, i.e. the output from the red layer.
Convolutional Layer 3 (Blue)
We repeat the previous step to generate the 3rd convolutional layer. Again, we see here that the controller generates 2 things: (i) operation and (ii) layer(s) to connect to. Below, the controller generated 1 and 2, and the operation max3×3.
So, the child model performs the operation max3×3 on the output of the previous layer (Layer 2, green). Then, the result of this operation is concatenated along the depth dimension with Layers 1 and 2.
Convolutional Layer 4 (Purple)
We repeat the previous step to generate the 4th convolutional layer. This time the controller generated 1 and 3, and the operation conv5×5.
The child model performs the operation conv5×5 on the output of the previous layer (Layer 3, blue). Then, the result of this operation is concatenated along the depth dimension with Layers 1 and 3.
End
And there you have it — a child model generated using the macro search! Now on to micro search. Heads up: micro search isn’t as straightforward as macro search.
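The four-layer macro search example above can be written down as plain data. The sketch below, with a made-up `describe_child` helper, turns the controller's sampled decisions into a readable description of the child model; it only describes the wiring and does not build a real network.

```python
# (operation, skip connections) per layer, as sampled in the walkthrough.
DECISIONS = [
    ("conv3x3", []),       # Layer 1 (red): nothing to skip-connect to
    ("sep5x5", [1]),       # Layer 2 (green)
    ("max3x3", [1, 2]),    # Layer 3 (blue)
    ("conv5x5", [1, 3]),   # Layer 4 (purple)
]

def describe_child(decisions):
    lines = []
    for layer, (op, skips) in enumerate(decisions, start=1):
        # A layer may only connect to layers built before it.
        if any(not 1 <= s < layer for s in skips):
            raise ValueError(f"layer {layer} has an invalid skip connection")
        source = "input image" if layer == 1 else f"layer {layer - 1}"
        line = f"layer {layer}: {op}({source})"
        if skips:
            line += " concat " + ", ".join(f"layer {s}" for s in skips)
        lines.append(line)
    return lines

description = describe_child(DECISIONS)
for line in description:
    print(line)
```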
1.2 Micro Search
As mentioned earlier, micro search designs modules or building blocks which are then connected together to form the final architecture. ENAS calls these building blocks convolutional cells and reduction cells. Simply put, a convolutional cell or reduction cell is just a block of operations. Both are similar — the only thing different about reduction cells is that the operations are applied with a stride of 2, thus reducing the spatial dimensions.
How to connect these cells to form the final network, you may ask?
The final network
Below is an image that gives you a quick overview of the final generated child model.
Fig. 1.2.1: Overview of the final neural network generated. Image source.
Let’s come back to this in a bit.
Building units for networks derived for micro search
There’s sort of a hierarchy of the ‘building units’ of child networks derived from micro search. From biggest to smallest:
- block
- convolutional cell / reduction cell
- node
A child model consists of several blocks. Each block consists of N convolutional cells and 1 reduction cell, in that order. Each convolutional/reduction cell comprises B nodes. And each node consists of standard convolutional operations (we’ll see this later). (N and B are hyperparameters that can be tuned by the architect.)
Below is a child model with 3 blocks. Each block consists of N=3 convolutional cells and 1 reduction cell. The operations within each cell are not shown here.
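The block and cell hierarchy can be sketched as a short function that lists the cells in order. The function and the cell labels are made up for illustration.

```python
def micro_layout(num_blocks, n_conv):
    """List the cells of a child model: N convolutional cells, then 1 reduction cell, per block."""
    layout = []
    for block in range(1, num_blocks + 1):
        layout += [f"block{block}/conv_cell{i}" for i in range(1, n_conv + 1)]
        layout.append(f"block{block}/reduction_cell")
    return layout

three_blocks = micro_layout(num_blocks=3, n_conv=3)
print(len(three_blocks))   # 3 blocks * (3 + 1) cells = 12
print(three_blocks[:4])
```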
Fig. 1.2.2: Overview of the final neural network generated. Image source.
So how to generate this child model from micro search, you may ask? Continue reading!
Generate a child model from micro search
For this micro search tutorial, let’s build a child model that has 1 block, for simplicity’s sake. Each block (there’s only one though) comprises N=3 convolutional cells and 1 reduction cell. Each cell comprises B=4 nodes. This means our generated child model should look like this:
Fig. 1.2.3: A neural network generated from micro search which has 1 block, consisting of 3 convolutional cells and 1 reduction cell. The individual operations are not shown here.
Let’s now build a convolutional cell!
Fast forward
To explain how to build a convolutional cell, let me take you to a stage where we have already built the first 2 convolutional cells for you. Notice that the last operations from each of these 2 cells are add operations. Let’s just take it for granted for now.
With 2 convolutional cells built for us, let’s move on to the third.
Convolutional Cell #3
Now, let’s ‘prepare’ the third convolutional cell — the cell that you and I will be building together.
Fig. 1.2.5: ‘Preparing’ the third convolutional cell in micro search.
Recall that every convolutional cell consists of 4 nodes. Now you might say: Okay sure, so where are these nodes?
The first two nodes (read this very carefully and slowly) are the two previous cells from the current cell. Yes, cells. What about the other 2 nodes? These 2 nodes fall in this very convolutional cell that we are building right now. Let’s make known where these nodes are:
Fig. 1.2.6: Identifying the 4 nodes while building Convolutional Cell #3.
From this section onwards, you can safely disregard the ‘Convolutional cell’ labels you see on the image above and concentrate on the ‘Nodes’ labels:
Node 1 — red (Convolutional Cell #1)
Node 2 — blue (Convolutional Cell #2)
Node 3 — green
Node 4 — purple
If you’re wondering if these nodes will change for every convolutional cell we’re building, the answer is yes! Every cell will ‘assign’ the nodes in this manner.
You might also wonder — since we’ve already built the operations in Node 1 and Node 2 (which are Convolutional Cells #1 and #2), what’s there left to build in these nodes? You asked the right question.
Convolutional Cell #3: Node 1 (red) and Node 2 (blue)
For any cell that we’re building, the first 2 nodes do not have to be built but instead become the inputs to the other nodes. In our example, since we are building 4 nodes, Nodes 1 and 2 can be inputs to Node 3 and Node 4. So, yay! We don’t have to do anything for Node 1 and Node 2 and we can now move on to building Node 3 and Node 4. Phew!
Convolutional Cell #3: Node 3 (Green)
Node 3 is where the building starts. Unlike in macro search, where the controller samples 2 decisions for every layer, here in micro search the controller samples 4 decisions for us (or rather 2 sets of decisions):
- 2 nodes to connect to
- the respective 2 operations to perform on the nodes to connect to
With 4 decisions to make, the controller runs 4 time steps. Have a look below:
Fig. 1.2.7: The outputs of the first four controller time steps (2, 1, avg5×5, sep5×5), which will be used to build Node 3.
From the above we see that the controller sampled 2, 1, avg5×5, and sep5×5, one from each of the four time steps. How does this translate to the architecture of the child model? Let’s see:
Fig. 1.2.8: How the outputs of the first four controller time steps (2, 1, avg5×5, sep5×5) are translated to build Node 3.
From the above, there are three things that just happened:
- The output from Node 2 (blue) undergoes the avg5×5 operation.
- The output from Node 1 (red) undergoes a sep5×5 operation.
- Both the results from these two operations undergo an add operation.
The output of this node is the tensor produced by the add operation. This explains why Nodes 1 and 2 end with add operations.
Convolutional Cell #3: Node 4 (Purple)
Now for Node 4. We repeat the same steps, just that the controller now has 3 nodes to choose from (Nodes 1, 2 and 3). Below, the controller generated 3, 1, id and avg3×3.
Fig. 1.2.9: The outputs of the next four controller time steps (3, 1, id, avg3×3), which will be used to build Node 4.
This translates to building the following:
Fig. 1.2.10: How the outputs of these four controller time steps (3, 1, id, avg3×3) are translated to build Node 4.
What just happened?
- The output from Node 3 (green) undergoes an id operation.
- The output from Node 1 (red) undergoes an avg3×3 operation.
- Both the results from these two operations undergo an add operation.
And that’s it: we’re done with Convolutional Cell #3.
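The two nodes we just built can be decoded from the controller's samples with a few lines of Python. This is a sketch with made-up names (`decode_cell`, the string form of each node), not the ENAS code. It also reports the ‘loose ends’ mentioned in the Notes section: built nodes whose output no other node consumes.

```python
def decode_cell(samples):
    """samples: one (input_a, input_b, op_a, op_b) tuple per built node, starting at Node 3."""
    nodes = {}
    used = set()
    for offset, (in_a, in_b, op_a, op_b) in enumerate(samples):
        node = offset + 3  # Nodes 1 and 2 are the two previous cells
        for src in (in_a, in_b):
            # A node may only take earlier nodes as inputs.
            if not 1 <= src < node:
                raise ValueError(f"node {node} cannot take node {src} as input")
        nodes[node] = f"add({op_a}(node{in_a}), {op_b}(node{in_b}))"
        used.update((in_a, in_b))
    loose_ends = [n for n in nodes if n not in used]
    return nodes, loose_ends

nodes, loose_ends = decode_cell([
    (2, 1, "avg5x5", "sep5x5"),  # Node 3 (green)
    (3, 1, "id", "avg3x3"),      # Node 4 (purple)
])
print(nodes[3])
print(nodes[4])
print(loose_ends)
```

Here Node 3 feeds Node 4, so Node 4 is the only loose end of the cell.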
Reduction Cell
Recall that for every N convolutional cells, we need to have a reduction cell. Since N=3 in this tutorial, and we’ve just finished with Convolutional Cell #3, it’s time to build a reduction cell. As mentioned earlier, the design of the reduction cell is similar to Convolutional Cell #3, except that the operations that are sampled have a stride of 2.
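To see what a stride of 2 does to the spatial dimensions, here is a small sketch. It assumes ‘same’ padding (output size is the input size divided by the stride, rounded up) and a hypothetical 32×32 input image; neither detail is stated in this article.

```python
import math

def same_padding_output_size(size, stride):
    # With 'same' padding, only the stride changes the spatial size.
    return math.ceil(size / stride)

def after_reduction_cells(size, num_reduction_cells):
    for _ in range(num_reduction_cells):
        size = same_padding_output_size(size, stride=2)
    return size

print(after_reduction_cells(32, 1))  # one block:    32 -> 16
print(after_reduction_cells(32, 3))  # three blocks: 32 -> 16 -> 8 -> 4
```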
End
And so that wraps up generating a child model out of the micro search strategy. Phew! I hope that wasn’t too much for you, because it was for me when I first read the paper.
2. Notes
Because this post mainly shows the macro and micro search strategies, I’ve left out many small details (especially on the concept of transfer learning). Let me briefly cover them:
- What’s so ‘efficient’ in ENAS? Answer: transfer learning, or parameter sharing as the title of the paper calls it. If a computation between two nodes has been done (trained) before, the weights from the convolutional filters and 1×1 convolutions (to maintain number of channel outputs; not mentioned in the previous sections) are reused instead of being trained from scratch. This is what makes ENAS faster than its predecessors!
- It is possible that the controller samples a decision where no skip connection is needed.
- There are 6 operations available for the controller: convolutions with filter sizes 3×3 and 5×5, depthwise-separable convolutions with filter sizes 3×3 and 5×5, max pooling and average pooling of kernel size 3×3.
- Do read up on the concatenate operation at the end of each cell which ties up ‘loose ends’ of any nodes.
- Do read up briefly on REINFORCE, the policy gradient algorithm from reinforcement learning used to update the controller.
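The weight reuse in the first note can be pictured as a lookup table shared by all child models. The sketch below is a loose illustration only: the `SharedWeights` class, its `(layer, op)` key and the tiny weight vectors are assumptions made for this example, not how the ENAS code stores its parameters.

```python
import random

class SharedWeights:
    """Create weights on first use, then hand the same object to every child model."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._store = {}

    def get(self, layer, op, size=4):
        key = (layer, op)
        if key not in self._store:
            # First time this (layer, op) pair is sampled: initialise weights.
            self._store[key] = [self._rng.gauss(0.0, 0.1) for _ in range(size)]
        return self._store[key]

    def __len__(self):
        return len(self._store)

shared = SharedWeights()
child_a = [shared.get(1, "conv3x3"), shared.get(2, "sep5x5")]
child_b = [shared.get(1, "conv3x3"), shared.get(2, "conv5x5")]

print(child_a[0] is child_b[0])  # both children reuse the layer-1 conv3x3 weights
print(len(shared))               # 3 distinct weight sets for 2 child models
```

Because both children point at the same list for layer 1, any training done on one child's layer-1 weights is immediately visible to the other.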
3. Summary
Macro search (for an entire network)
The final child model is as shown below.
Fig. 3.1: Generating a convolutional neural network with macro search.
Micro search (for a convolutional cell)
Note that only part of the final child model is shown here.
Fig. 3.2: Generating a convolutional neural network with micro search. Only part of the full architecture is shown.
4. Implementations
5. References
Efficient Neural Architecture Search via Parameter Sharing
Neural Architecture Search with Reinforcement Learning
Learning Transferable Architectures for Scalable Image Recognition
That’s it! Remember to read the ENAS paper Efficient Neural Architecture Search via Parameter Sharing. If you have any questions, please highlight and leave a comment.
Special thanks to Ren Jie Tan, Derek, and Yu Xuan Tay for ideas, suggestions and corrections to this article.