Evolution of Deep Convolutional Neural
Networks Using Cartesian Genetic
Programming

Masanori Suganuma
RIKEN Center for AIP, Tokyo, Japan
Tohoku University, Miyagi, Japan

suganuma@vision.is.tohoku.ac.jp

Masayuki Kobayashi
Shinichi Shirakawa
Tomoharu Nagao
Yokohama National University, Kanagawa, Japan

kobayashi-masayuki-xc@ynu.jp
shirakawa-shinichi-bg@ynu.ac.jp
nagao@ynu.ac.jp

https://doi.org/10.1162/evco_a_00253

Abstract

The convolutional neural network (CNN), one of the deep learning models, has demon-
strated outstanding performance in a variety of computer vision tasks. However, as the
network architectures become deeper and more complex, designing CNN architectures
requires more expert knowledge and trial and error. In this article, we attempt to auto-
matically construct high-performing CNN architectures for a given task. Our method
uses Cartesian genetic programming (CGP) to encode the CNN architectures, adopt-
ing highly functional modules such as a convolutional block and tensor concatenation,
as the node functions in CGP. The CNN structure and connectivity, represented by the
CGP, are optimized to maximize accuracy using the evolutionary algorithm. We also in-
troduce simple techniques to accelerate the architecture search: rich initialization and
early network training termination. We evaluated our method on the CIFAR-10 and
CIFAR-100 datasets, achieving competitive performance with state-of-the-art models.
Remarkably, our method can find competitive architectures with a reasonable compu-
tational cost compared to other automatic design methods that require considerably
more computational time and machine resources.

Keywords

Genetic programming, convolutional neural network, deep learning.

1 Introduction

Deep learning, a machine learning approach using deep neural networks, is becom-
ing popular for solving artificial intelligence tasks. Deep neural networks (DNNs)
have been successful in various tasks such as image recognition (LeCun et al., 1998;
Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), and reinforcement
learning tasks (Mnih et al., 2013, 2015). Particularly, convolutional neural networks
(CNNs) (LeCun et al., 1998) have demonstrated outstanding performance on image
recognition tasks in the last few years and have been applied to a variety of computer vi-
sion applications (Vinyals et al., 2015; Zhang et al., 2016). A commonly used CNN archi-
tecture consists of a series of convolution and pooling layers followed by fully connected
layers. Several recent studies have demonstrated significant progress by developing

Manuscript received: 14 October 2018; revised: 15 February 2019; accepted: 22 February 2019.
© 2019 Massachusetts Institute of Technology

Evolutionary Computation 28(1): 141–163

CNN architectures. Powerful CNN models (e.g., GoogLeNet (Szegedy et al., 2015),
ResNet (He et al., 2016), and DenseNet (Huang et al., 2017)) have continued to show
considerable improvement over the years, achieving state-of-the-art results on various
benchmark datasets.

Despite their success, designing CNN architectures is still a difficult task because
many design parameters exist, such as the depth of a network, the type and parame-
ters of each layer, and the connectivity of the layers. As the successful networks have
become deeper and more complex, they require a greater number of structural hyper-
parameters to be tuned to achieve the best performance on a specific dataset. Therefore,
considerable trial and error or expert knowledge is required to construct suitable ar-
chitectures for a target task. Considering this situation, automatic design methods for
CNN architectures are highly beneficial.

Neural network architecture design can be viewed as a model selection problem in
machine learning. The straightforward approach is to deal with the architecture design
as a hyperparameter optimization problem. Namely, the hyperparameters regarding
network structure such as the number of layers and neurons are optimized using tech-
niques based on Bayesian optimization or evolutionary computation (Snoek et al., 2012;
Loshchilov and Hutter, 2016) to improve the performance on the validation dataset.

Evolutionary computation has been traditionally applied to optimize both the net-
work topology and the connection weights (Schaffer et al., 1992; Stanley and Miikku-
lainen, 2002). There are two types of encoding schemes for network representations:
direct and indirect encoding (Yao, 1999). Direct encoding represents the number and
connectivity of neurons directly as the genotype, whereas indirect encoding represents
a generation rule for network architectures. Although almost all traditional approaches
optimize the number and connectivity of low-level neurons, modern neural network
architectures for deep learning have many units and various types of units to be op-
timized. Optimizing so many structural parameters in a reasonable amount of com-
putational time may be difficult for traditional evolutionary neural network methods.
A promising approach is the use of highly functional modules as a minimum unit to
reduce the search space of deep architectures.

In this article, we attempt to design CNN architectures based on genetic program-
ming (GP). We use the Cartesian genetic programming (CGP) (Miller and Thomson,
2000; Harding, 2008; Miller and Smith, 2006) encoding scheme, which is a direct encod-
ing scheme, to represent the CNN structure and connectivity. As we aim to search the
CNN architectures, the phenotype of GP should be the network structure. Therefore, we adopt a graph-based GP rather than a tree-based GP. The advantage of the CGP-based method is its simplicity and flexibility; it can represent variable-length network structures, including skip connections, and encode the network as a fixed-length string. To reduce the search space, we also adopt relatively highly functional modules operating on three-dimensional tensors (e.g., convolutional blocks and tensor concatenation) as the node functions in CGP. For example, one of our node functions consists of a convolution layer, batch normalization, and a rectified linear unit (ReLU), rather than a single function or a low-level neuron. To evaluate the architecture represented by the CGP,
we train the network using a training dataset by a usual stochastic gradient descent
method, and then the performance on another training dataset is assigned as the fitness
of the architecture. We call the former dataset the model training dataset and the latter
dataset the architecture evaluation dataset. Based on this fitness function, an evolutionary
algorithm optimizes the CNN architectures. We evaluate our method on the CIFAR-10
and CIFAR-100 datasets (Krizhevsky and Hinton, 2009). The experimental results show

that our method can find competitive CNN architectures with a reasonable compu-
tational cost compared with state-of-the-art models.

This article expands on the work in Suganuma et al. (2017). In this article, we ad-
ditionally propose simple speed-up techniques for an architecture search and add the
result for the CIFAR-100 dataset.

The rest of this article is organized as follows. The next section presents related work on neural network architecture design as well as its limitations. In Section 3, we describe our genetic programming approach to designing CNN architectures. We test the performance of our method in Section 4. Finally, in Section 5, we present our conclusions and future work.

2 Related Work

This section provides a review of related works on automatic neural network architec-
ture design. We roughly divide the previous studies into two categories: optimization
of learning and structure parameters and optimization of neural network architectures.
The former is the traditional hyperparameter optimization approach, whereas the latter aims to search flexible architectures and is more relevant to this article.

2.1 Optimization of Learning and Model Parameters

The hyperparameters in machine learning methods, such as learning rate, regulariza-
tion coefficients, and the number of neurons in neural networks, should be tuned to
improve predictive performance. In general, we cannot obtain gradients of such hyper-
parameters. The naive methods for hyperparameter optimization are the grid search
and the random search (Bergstra and Bengio, 2012). A more sophisticated approach is
to use the sequential model-based global optimization methods such as Bayesian op-
timization (Snoek et al., 2012). Bayesian optimization is a global optimization method
of black-box and noisy objective functions, and it maintains a surrogate model learned
by using previously evaluated solutions. A Gaussian process is usually adopted as the
surrogate model, which can easily handle the uncertainty and noise of the objective
function.

Snoek et al. (2012) optimized nine hyperparameters in CNN, such as the num-
ber of epochs and the learning rate, and showed that an automatically tuned network
outperforms networks tuned by a human expert. Snoek et al. (2015) succeeded in im-
proving the scalability of hyperparameter search by using a deep neural network in-
stead of the Gaussian process to reduce the computational cost for surrogate model
building, and they optimized the learning hyperparameters of the fixed CNN architecture. Bergstra et al. (2011) proposed the tree-structured Parzen estimator (TPE) and
showed better results than manual search and random search. They also proposed a
meta-modeling approach (Bergstra et al., 2013) based on the TPE for supporting auto-
matic hyperparameter optimization. Hutter et al. (2011) proposed an algorithm called
sequential model-based algorithm configuration (SMAC) for general algorithm config-
uration. SMAC adopts a random forest as the surrogate model instead of a Gaussian
process. Several studies optimized the hyperparameters of DNNs using SMAC-based methods (e.g., Domhan et al., 2015).

The hyperparameter optimization approach often tunes the learning parameters
(e.g., learning rate, mini-batch size, and regularization parameters) and the predefined structural parameters (e.g., the numbers of layers and neurons, and the type of activation functions). In general, this approach requires predefined architectures, which means it is hard to design flexible architectures from scratch.

2.2 Optimization of Neural Network Architectures

2.2.1 Reinforcement Learning Approach
Interesting work on the automatic design of deep neural network architectures using reinforcement learning has been attempted recently (Zoph and Le, 2017; Baker et al.,
2017). In Zoph and Le (2017), a recurrent neural network (RNN) was used to generate
neural network architectures, and the RNN was trained with reinforcement learning to
maximize the expected accuracy on a learning task. This method uses distributed train-
ing and asynchronous parameter updates with 800 graphics processing units (GPUs)
to accelerate the reinforcement learning process. Baker et al. (2017) proposed a meta-
modeling approach based on reinforcement learning to produce CNN architectures. A Q-learning agent explores and exploits a space of model architectures with an ε-greedy strategy and experience replay. In addition, this method optimizes the number of lay-
ers, which is a useful feature because it allows a system to design suitable architectures
for a target dataset.

We can see that these methods adopt the indirect encoding scheme for network rep-
resentation from the evolutionary computation perspective because they train the gen-
erative rules for network architectures. These methods have succeeded in constructing
competitive CNN architectures for image classification tasks. Unlike these methods, our approach uses direct encoding based on Cartesian genetic programming to design the CNN architectures, as described in Section 3. Furthermore, we introduce relatively highly
functional modules such as convolutional block and tensor concatenation to efficiently
find better CNN architectures.

2.2.2 Evolutionary Computation Approach

Evolutionary algorithms have long been applied to optimizing neural network architectures (Schaffer et al., 1992; Stanley and Miikkulainen, 2002). The methods for evolution-
ary neural networks optimize the connection weights and/or network structure of low-
level neurons by the evolutionary algorithm. Morse and Stanley (2016) compared the
evolutionary algorithm, stochastic gradient descent, and RMSProp on the optimization
of the weights in neural networks. The evolutionary algorithm showed competitive per-
formance, but the number of neural network weights used in the experiment was 1,500,
which is quite a small number compared to the modern deep neural networks used for
computer vision tasks. Sun et al. (2018) proposed a neural network training method for
unsupervised DNNs such as the auto-encoder and the restricted Boltzmann machine,
which combines the genetic algorithm and stochastic gradient descent. The method,
however, was not applied to CNNs. Verbancsics and Harguess (2013, 2015) optimized
the weights of artificial neural networks and CNNs by using the hypercube-based neu-
roevolution of augmenting topologies (HyperNEAT) (Stanley et al., 2009). However, to the best of our knowledge, the methods with HyperNEAT have not achieved competi-
tive performance compared with the state-of-the-art methods. These methods seem to
be difficult to scale for recent deep neural networks having a large number of neurons
and connections.

To deal with large-scale architectures, an approach combining back-propagation
and evolution is promising. In this approach, neural network architectures are designed
by the evolutionary algorithm, and the network weights are optimized by a stochastic
gradient descent method through back-propagation. Compared to the hyperparame-
ter optimization approach, this approach can design more flexible architectures from
scratch by using the evolutionary algorithm.

Real et al. (2017) optimized large-scale neural networks by using the evolutionary
algorithm and achieved better performance than modern CNNs in image classification
tasks. In this method, they represent the CNN architecture as a graph structure and
optimize it by the evolutionary algorithm. The connection weights of the reproduced
architecture are optimized by stochastic gradient descent as well as the usual neural
network training, and the accuracy for the architecture evaluation dataset is assigned
as the fitness. The individuals are initialized by the small networks and become larger
through evolution. However, this method was run on 250 computers and required ap-
proximately 10 days for optimization.

Miikkulainen et al. (2017) proposed a method called CoDeepNEAT, which is an
extended version of NEAT, and showed good performance in image classification and
image captioning tasks. This method designs the network architectures using blueprints
and modules. The blueprint chromosome is a graph where each node has a pointer to a
particular module species. Each module chromosome is a graph that represents a small
DNN. Specifically, each node in the blueprint is replaced with a module selected from a
particular species to which that node points. During the evaluation phase, the modules
and blueprints are combined to generate assembled networks, and the networks are
evaluated.

Xie and Yuille (2017) designed CNN architectures using the genetic algorithm with a
binary string representation. They proposed a method for encoding a network structure,
where the connectivity of each layer is defined by a binary string representation. The type of each layer, the number of channels, and the size of the receptive field are not
evolved in this method.

These studies focus on designing large and flexible network architectures. In gen-
eral, these methods require considerable computational cost to optimize the neural net-
work architectures because they need to train the connection weights to calculate the fit-
ness of the architectures. In this work, we propose using Cartesian genetic programming
(CGP) to represent the deep neural network architectures and to use highly functional
modules as the node functions to reduce the search space. Furthermore, we introduce
simple techniques to reduce the computational cost of the architecture search.

3 Designing CNN Architectures Using Cartesian Genetic Programming

In our method, we directly encode the CNN architectures based on CGP (Miller and
Thomson, 2000; Miller and Smith, 2006; Harding, 2008) and use highly functional mod-
ules as node functions. The CNN architecture defined by CGP is trained by stochastic gradient descent using a model training dataset and is assigned a fitness value based on its accuracy on another training dataset (i.e., the architecture evaluation dataset). Then, the architecture is optimized to maximize the accuracy on the architecture evaluation dataset by using the evolutionary algorithm. Figure 1 illustrates an overview of the proposed method. In this section, we describe the network representation and the evolutionary algorithm used in the proposed method. In addition, we explain the simple speed-up techniques for the architecture search.

3.1 Representation of CNN Architectures

For the CNN architecture representation, we use the CGP encoding scheme, which represents the architecture of a CNN as a directed acyclic graph on a two-dimensional grid. CGP was proposed as a general form of genetic programming in Miller and

Figure 1: Overview of the proposed method. The method represents the CNN architectures based on CGP. The CNN architecture is trained on a learning task and assigned a fitness based on the accuracies of the trained model for the architecture evaluation dataset. The evolutionary algorithm searches for better architectures.

Figure 2: Example of a genotype and a phenotype. The genotype (left) defines the CNN architecture (right). Node No. 5 on the left is inactive and does not appear in the path from the inputs to the outputs. The summation node applies max pooling to downsample the first input to the same size as the second input.

Thomson (2000), and it is called Cartesian because CGP represents a program using a
two-dimensional grid of nodes. The graph corresponding to a phenotype is encoded to
a string called a genotype and optimized by the evolutionary algorithm.

Let us assume that the grid has Nr rows by Nc columns; then the number of interme-
diate nodes is Nr × Nc, and the numbers of inputs and outputs depend on the task. The genotype consists of a string of integers with a fixed length, and each gene determines the function type of a node and the connections between nodes. A node in the c-th column is only allowed to be connected to nodes in the (c − 1)-th to (c − l)-th columns, where l is called the levels-back parameter. Figure 2 shows an example of the genotype, the phenotype, and the corresponding CNN architecture. As seen in Figure 2, the CGP encoding

Figure 3: The ResBlock architecture.

scheme has a possibility that not all of the nodes are connected to the output nodes (e.g.,
node No. 5 in Figure 2). We call these nodes inactive nodes. Whereas the genotype in
CGP is a fixed-length representation, the number of nodes in the phenotypic network
varies because of the inactive nodes, which is a desirable feature because the number
of layers can be determined by the evolutionary algorithm.
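To make the handling of active and inactive nodes concrete, the following minimal Python sketch (ours, not the authors' released code) decodes a toy genotype and collects the nodes that are actually reachable from the output; the gene layout and function identifiers are illustrative assumptions:

    # A minimal sketch (not the authors' code) of how a CGP genotype can be decoded.
    # Each intermediate node is a gene (function_id, input_1, input_2); nodes that do
    # not lie on a path from the inputs to the output are "inactive".

    def active_nodes(genotype, num_inputs, output_source):
        """Collect the node indices reachable backwards from the output."""
        active, stack = set(), [output_source]
        while stack:
            idx = stack.pop()
            if idx < num_inputs or idx in active:   # program inputs are always available
                continue
            active.add(idx)
            _, in1, in2 = genotype[idx - num_inputs]
            stack.extend([in1, in2])
        return active

    # Toy example: one program input (the image) and three intermediate nodes,
    # with the output connected to node 3. Function ids are arbitrary here.
    genotype = [(0, 0, 0),   # node 1: e.g., a ConvBlock applied to the input
                (2, 1, 1),   # node 2: e.g., max pooling of node 1
                (4, 1, 2)]   # node 3: e.g., summation of nodes 1 and 2
    print(active_nodes(genotype, num_inputs=1, output_source=3))   # {1, 2, 3}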

Referring to the modern CNN architectures, we select the highly functional mod-
ules as the node function. The frequently used processing in the CNN is convolution
and pooling; the convolution processing uses local connectivity and spatially shares
the learnable weights, and the pooling is nonlinear downsampling. We prepare the six
types of node functions called ConvBlock, ResBlock, max pooling, average pooling, con-
catenation, and summation. These nodes operate on the three-dimensional (3-D) tensor
(also known as the feature map) defined by the dimensions of the row, column, and
channel.

The ConvBlock consists of a convolutional layer with a stride of one followed by batch normalization (Ioffe and Szegedy, 2015) and a rectified linear unit (ReLU) (Nair and Hinton, 2010). To maintain the size of the input, we pad the input with zero values around the border before the convolutional operation. Therefore, the ConvBlock takes an M × N × C tensor as input and produces an M × N × C′ tensor, where M, N, C, and C′ are the numbers of rows, columns, input channels, and output channels, respectively. We prepare several ConvBlocks with different numbers of output channels and receptive field sizes (kernel sizes) in the function set of CGP.

As shown in Figure 3, the ResBlock is composed of the ConvBlock, batch normalization, the ReLU, and the tensor summation. The ResBlock is a building block of modern successful CNN architectures, for example, He et al. (2016), Zagoruyko and Komodakis (2016), and Kupyn et al. (2018). Following this recent trend of human architecture design, we decided to use the ResBlock as a building block in our method. The ResBlock performs identity mapping by the shortcut connection, as described in He et al. (2016). The row and column sizes of the input are preserved in the same way as in the ConvBlock after convolution. As shown in Figure 3, the output feature maps of the ResBlock are calculated by the ReLU activation and the summation with the input. The ResBlock takes an M × N × C tensor as input and produces an M × N × C′ tensor. We prepare several ResBlocks with different numbers of output channels and receptive field sizes (kernel sizes) in the function set of CGP.
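As an illustration of these two modules, the following PyTorch-style sketch implements a ConvBlock and one plausible reading of the ResBlock in Figure 3; the paper's implementation uses Chainer, and the exact layer ordering inside the ResBlock (a second convolution and batch normalization before the summation) is our assumption, not a confirmed detail:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvBlock(nn.Module):
        """Convolution (stride 1, zero padding keeps H x W) -> BatchNorm -> ReLU."""
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=1, padding=k // 2)
            self.bn = nn.BatchNorm2d(out_ch)

        def forward(self, x):
            return F.relu(self.bn(self.conv(x)))

    class ResBlock(nn.Module):
        """ConvBlock -> Conv -> BatchNorm, summed with the input, then ReLU (one plausible reading)."""
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.block = ConvBlock(in_ch, out_ch, k)
            self.conv = nn.Conv2d(out_ch, out_ch, kernel_size=k, stride=1, padding=k // 2)
            self.bn = nn.BatchNorm2d(out_ch)

        def forward(self, x):
            h = self.bn(self.conv(self.block(x)))
            if h.size(1) != x.size(1):   # pad input channels with zeros (assumes out_ch >= in_ch)
                x = F.pad(x, (0, 0, 0, 0, 0, h.size(1) - x.size(1)))
            return F.relu(h + x)

    x = torch.randn(1, 3, 32, 32)        # a CIFAR-sized input
    print(ResBlock(3, 32)(x).shape)      # torch.Size([1, 32, 32, 32])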

Table 1: The node functions and abbreviated symbols used in the experiments.

Node type          Symbol      Variation
ConvBlock          CB(C′, k)   C′ ∈ {32, 64, 128}; k ∈ {3 × 3, 5 × 5}
ResBlock           RB(C′, k)   C′ ∈ {32, 64, 128}; k ∈ {3 × 3, 5 × 5}
Max pooling        MP          -
Average pooling    AP          -
Concatenation      Concat      -
Summation          Sum         -

C′: number of output channels
k: receptive field size (kernel size)

The max and average pooling nodes perform a max and an average operation, respectively, over the local neighborhood of the feature maps. We use pooling with a 2 × 2 receptive field size and a stride of two. The pooling layer takes an M × N × C tensor and produces an M′ × N′ × C tensor, where M′ = ⌈M/2⌉ and N′ = ⌈N/2⌉.

The concatenation function takes two feature maps and concatenates them in the channel dimension. When concatenating feature maps with different numbers of rows and columns, we downsample the larger feature map by max pooling so that the two inputs have the same size. Let us assume that we have two inputs of size M1 × N1 × C1 and M2 × N2 × C2; then the size of the output feature map is min(M1, M2) × min(N1, N2) × (C1 + C2).

The summation performs the element-wise summation of two feature maps, channel by channel. Similar to the concatenation, when summing two feature maps with different numbers of rows and columns, we downsample the larger feature map by max pooling. In addition, if the inputs have different numbers of channels, we expand the channels of the feature map with the smaller channel size by filling with zeros. Let us assume that we have two inputs of size M1 × N1 × C1 and M2 × N2 × C2; then the size of the output feature map is min(M1, M2) × min(N1, N2) × max(C1, C2). In Figure 2, the summation node applies max pooling to downsample the first input to the same size as the second input. By using the summation and concatenation operations, our method can express shortcut connections or branching layers, such as those used in GoogLeNet (Szegedy et al., 2015) and the residual network (ResNet) (He et al., 2016).
The output node represents the softmax function to produce a distribution over the
target classes. The outputs fully connect to all elements of the input. The node functions
used in the experiments are displayed in Table 1.

3.2 Evolutionary Algorithm

Following the standard CGP, we use a point mutation as the genetic operator. The func-
tion and the connection of each node randomly change to valid values according to the
mutation rate. The fitness evaluation of the CNN architecture involves the CNN train-
ing and requires approximately 0.5 à 1 hours in our setting. Donc, we need to ef-
ficiently evaluate some candidate solutions in parallel at each generation. To efficiently
use the computational resource, we repeatedly apply the mutation operator while an
active node does not change, and obtain the candidate solutions to be evaluated. Nous
call this mutation the forced mutation. De plus, to maintain a neutral drift, which is

effective for the CGP evolution (Miller and Smith, 2006; Miller and Thomson, 2000), we modify the parent by the neutral mutation if the fitnesses of the offspring do not improve.
The neutral mutation operates on only the genes of inactive nodes without modifica-
tion of the phenotype. We use the modified (1 + λ) evolution strategy (with λ = 2 in our
experiments) using the above artifice. The procedure of our evolutionary algorithm is
listed in Algorithm 1.

Le (1 + λ) evolution strategy, the default evolutionary algorithm in CGP, is an al-
gorithm with fewer strategy parameters: the mutation rate and the offspring size, mean-
ing that we do not need to expend considerable effort to tune such strategy parameters.
This is a reason we use the (1 + λ) evolution strategy in our method.
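The following self-contained Python sketch illustrates the control flow of the modified (1 + λ) evolution strategy with forced and neutral mutation on a toy genotype; the dummy fitness stands in for the expensive CNN training, and the row, column, and levels-back constraints of CGP are omitted, so it is an illustration rather than the authors' implementation:

    import random

    N_FUNCS, N_NODES, N_INPUTS = 5, 10, 1

    def random_gene(i):
        # node i may take its inputs from the program input(s) or any earlier node
        return (random.randrange(N_FUNCS),
                random.randrange(i + N_INPUTS),
                random.randrange(i + N_INPUTS))

    def active_set(geno):
        active, stack = set(), [N_NODES - 1 + N_INPUTS]   # the output reads the last node
        while stack:
            idx = stack.pop()
            if idx >= N_INPUTS and idx not in active:
                active.add(idx)
                stack.extend(geno[idx - N_INPUTS][1:])
        return active

    def phenotype(geno):
        return sorted((i, geno[i - N_INPUTS]) for i in active_set(geno))

    def point_mutation(geno, rate, inactive_only=False):
        child, active = list(geno), active_set(geno)
        for i in range(N_NODES):
            if inactive_only and (i + N_INPUTS) in active:
                continue
            if random.random() < rate:
                child[i] = random_gene(i)
        return child

    def dummy_fitness(geno):
        return random.random()    # stand-in for "train the CNN, report validation accuracy"

    parent = [random_gene(i) for i in range(N_NODES)]
    parent_fit = dummy_fitness(parent)
    for generation in range(20):
        offspring = []
        while len(offspring) < 2:                      # lambda = 2, one candidate per GPU
            child = point_mutation(parent, rate=0.1)   # forced mutation: retry until the
            if phenotype(child) != phenotype(parent):  # active part of the genotype changes
                offspring.append(child)
        fits = [dummy_fitness(c) for c in offspring]   # evaluated in parallel in the paper
        if max(fits) > parent_fit:
            parent, parent_fit = offspring[fits.index(max(fits))], max(fits)
        else:                                          # neutral mutation: inactive genes only
            parent = point_mutation(parent, rate=0.1, inactive_only=True)
    print(parent_fit)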

3.3 Speed-Up Techniques

The proposed CNN architecture optimization is time-consuming because it requires
training the candidate CNN architectures in the usual way to assign the fitness value.
We introduce two speed-up techniques for the architecture search: the rich initialization
and the early termination of training. In the rich initialization, we initialize the individ-
ual by ResNet (He et al., 2016) or DenseNet (Huang et al., 2017) such as structure and
start the evolution with a good architecture. The early termination technique stops the
neural network training runs that are unlikely to reach better accuracy.

3.3.1 Rich Initialization
Using a good, hand-designed initial network architecture can make the architecture search more efficient than starting from a randomly initialized network architecture. By initializing the individual with a sophisticated architecture, we expect to find a better architecture in an early generation. In the experiment, we consider two well-known CNN architectures, the residual network (ResNet) (He et al., 2016) and the densely connected convolutional network (DenseNet) (Huang et al., 2017), as the initial individuals. Because we use the direct encoding of network architectures based on CGP, it is easy to edit the genotype string to represent the architectures we want. Specifically, to leverage the original node functions, the CGP-CNN starts with the modified ResNet and
DenseNet architectures. The architectures of the modified ResNet and DenseNet are
displayed in Table 2. The ResNet and DenseNet architectures can be represented by

Table 2: The modified ResNet and DenseNet architectures for the rich initialization. The symbols of the node functions are defined in Table 1.

(a) The modified ResNet.

Layers                  Node functions
ResBlock (1)            RB(32, 3) × 3
Pooling Layer           MP
ResBlock (2)            RB(64, 3) × 3
Pooling Layer           MP
ResBlock (3)            RB(128, 3) × 3
Pooling Layer           MP
Classification Layer    Fully-connected

(b) The modified DenseNet.

Layers                  Node functions
Convolution             CB(32, 3)
Dense Block (1)         [CB(32, 3), Concat] × 2
Transition Layer (1)    CB(32, 3), AP
Dense Block (2)         [CB(32, 3), Concat] × 4
Transition Layer (2)    CB(64, 3), AP
Dense Block (3)         [CB(32, 3), Concat] × 4
Classification Layer    AP, Fully-connected

using ResBlock and ConvBlock, respectively. Although these architectures are slightly
different from those in He et al. (2016) and Huang et al. (2017), considering the pooling
layers in ResNet and the transition layers in the DenseNet, they are more promising
initial individuals than the randomly initialized ones. We denote this rich initialization
technique as RichInit.

3.3.2 Early Termination of Network Training
In the training procedure for the CNNs used in Suganuma et al. (2017), the CNNs are trained for a fixed number of epochs. However, one can stop the training early to save computational costs if the architecture seems to have no chance of reaching good performance. Here, we introduce a simple early termination technique based on a reference curve.

The reference curve is constructed by using the accuracy curves of previously trained networks and is used to decide whether the current architecture is promising or not. Let f^t = (f^t_1, . . . , f^t_Nepoch) denote the reference curve at the t-th generation, where Nepoch indicates the maximum number of epochs, and f^1 is initialized with the initial parent's accuracy curve for the architecture evaluation dataset. We terminate the CNN training if the accuracies of the current architecture for the architecture evaluation dataset are worse than the values of the reference curve N consecutive times. If the training is terminated early, the fitness of the architecture is assigned zero. Figure 4 illustrates the concept of the early termination.

When the best fitness among the λ offspring exceeds the parent’s fitness, the values
of the reference curve f^t are updated by taking the average of the reference curve and

Figure 4: Concept image of our early termination criterion. When an architecture records low accuracies for the architecture evaluation dataset in several consecutive epochs, we terminate the training of that architecture.

the accuracy curve of the best offspring: f^(t+1) = (f^t + f^c)/2, where f^c is the accuracy curve of the best offspring, and the values consist of the accuracies for the architecture evaluation dataset at each epoch. The procedures for the early termination and the reference curve update are listed in Algorithms 2 and 3, respectively. We expect the evaluation process of the architectures to speed up without performance deterioration by using the early termination.
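The following sketch makes the early-termination rule and the reference-curve update concrete; the per-epoch accuracies are faked to keep the example self-contained, whereas in the actual method each value comes from one epoch of CNN training evaluated on the architecture evaluation dataset:

    N_EPOCH = 50

    def train_with_early_termination(epoch_accuracy, reference, n_consecutive=3):
        """Return the accuracy curve, or None if training is cut off (fitness is set to 0)."""
        curve, violations = [], 0
        for epoch in range(N_EPOCH):
            acc = epoch_accuracy(epoch)          # stand-in for: train one epoch, then evaluate
            curve.append(acc)
            if acc < reference[epoch]:           # worse than the reference curve at this epoch
                violations += 1
                if violations >= n_consecutive:
                    return None
            else:
                violations = 0
        return curve

    def update_reference(reference, best_offspring_curve):
        # f^(t+1) = (f^t + f^c) / 2, applied only when the best offspring beats the parent
        return [(r + c) / 2.0 for r, c in zip(reference, best_offspring_curve)]

    reference = [0.5 + 0.3 * e / N_EPOCH for e in range(N_EPOCH)]   # toy initial parent curve
    candidate = lambda e: min(0.9, 0.5 + 0.35 * e / N_EPOCH)        # toy candidate accuracies
    curve = train_with_early_termination(candidate, reference)
    if curve is not None:   # in the real loop, update only when the best offspring beats the parent
        reference = update_reference(reference, curve)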

4 Experiments and Results

4.1 Dataset
We apply our method to the CIFAR-10 and CIFAR-100 datasets consisting of 60,000
color images (32 × 32 pixels) in 10 and 100 classes, respectively. Each dataset is split
into a training set of 50,000 images and a test set of 10,000 images. We randomly

sample 45,000 examples from the training set to train the CNN, and the remaining 5,000
examples are used for the architecture evaluation (i.e., the fitness evaluation of CGP).

On the CIFAR-10 dataset, we also consider a small-data scenario. The small-data scenario is a situation in which we have only a small amount of data for training. We use only one-tenth of the dataset in this experiment. In general, the performance of CNN architectures highly depends on the dataset, and it is difficult to manually design an appropriate network architecture for a new dataset. The hand-designed CNN architectures seem to be tuned for benchmark datasets such as CIFAR-10. The purpose of the small-data scenario is to simulate such a new dataset. In the small-data scenario, we randomly sample 4,500 images for the model training and 500 images for the architecture evaluation.

4.2 Experimental Setting

To assign the fitness value to a candidate CNN architecture, we train the CNN by stochastic gradient descent (SGD) with a mini-batch size of 128. The softmax cross-entropy loss is used as the loss function. We initialize the weights by the method described in He et al. (2015) and use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate α = 0.01 and momentum β1 = 0.9, β2 = 0.999. We train each CNN for 50 epochs and use the maximum accuracy of the last ten epochs as the fitness value. We reduce the learning rate by a factor of 10 at the 30th epoch.

We preprocess the data with pixel-mean subtraction. To prevent overfitting, we use a weight decay with a coefficient of 1.0 × 10−4. We also use data augmentation based on He et al. (2016): we pad 4 pixels on each side and randomly crop a 32 × 32 patch from the padded image or its horizontally flipped image.
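For illustration, the preprocessing, augmentation, and optimizer settings above can be written as follows in PyTorch/torchvision; the paper's implementation uses Chainer 1.16.0, and the placeholder model and the approximate CIFAR-10 channel means are our additions:

    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as transforms

    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),          # pad 4 pixels per side, crop 32 x 32
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        # pixel-mean subtraction; these CIFAR-10 channel means are approximate
        transforms.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(1.0, 1.0, 1.0)),
    ])
    train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True,
                                             transform=transform_train)
    loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

    # Stand-in for a candidate CNN decoded from a CGP genotype.
    model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999),
                                 weight_decay=1e-4)
    # the learning rate is divided by 10 at the 30th of the 50 fitness-evaluation epochs
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)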

The parameter setting for CGP is shown in Table 3. We use a relatively large num-
ber of columns to generate deep architectures. The number of active nodes in the indi-
vidual of CGP is restricted. Therefore, we apply the mutation operator until a CNN architecture that satisfies the restriction on the number of active nodes is generated. The offspring size λ is two, which equals the number of GPUs in our experimental machines. We test two node function sets, called ConvSet and ResSet, for our method. The ConvSet contains ConvBlock, max pooling, average pooling, summation, and concatenation in Table 1, and the ResSet contains ResBlock, max pooling, average pooling, summation, and concatenation. The difference between these two function sets is whether the set contains ConvBlock or ResBlock. The numbers of generations are 500 for ConvSet, 300 for ResSet, and 500 for the small-data scenario.

The best CNN architecture from the CGP process is retrained using all 50,000 im-
ages in the training set, and then we compute the test accuracy. We optimize the weights

Table 3: Parameter setting for the CGP.

Parameters                         Values
Mutation rate                      0.05
# Offspring (λ)                    2
# Rows (Nr)                        5
# Columns (Nc)                     30
Minimum number of active nodes     10
Maximum number of active nodes     50
Levels-back (l)                    10

of the obtained architecture for 500 epochs with a different training procedure; we use SGD with a momentum of 0.9, a mini-batch size of 128, and a weight decay of 5.0 × 10−4. Following the learning rate schedule in He et al. (2016), we start with a learning rate of 0.01, set it to 0.1 at the 5th epoch, and reduce it by a factor of 10 at the 250th and 370th epochs. We report the test accuracy at the 500th epoch as the final performance.
We implement the proposed method using the Chainer framework (Tokui et al.,
2015) (version 1.16.0) and run it on a machine with two NVIDIA GeForce GTX 1080
or two GTX 1080 Ti GPUs. We use the GTX 1080 and the GTX 1080 Ti for the experiments on the CIFAR-10 and CIFAR-100 datasets, respectively. Due to the memory limitation, a candidate CNN occasionally exhausts the GPU memory, and the network training process fails due to an out-of-memory error. In that case, we assign a zero fitness to the candidate
architecture.

4.3 Result on the CIFAR-10 and 100 Datasets

We run the proposed method ten times on each dataset and report the classification er-
rors. Here, the result of the proposed method without the speed-up techniques (rich initial-
ization and early termination described in Subsection 3.3) is discussed. We compare the
classification performance with the state-of-the-art CNN models, including the hand-
designed CNNs and automatically designed CNNs by the architecture search meth-
ods, on the CIFAR-10 and 100 datasets. A summary of the classification performances
is provided in Tables 4 and 5. We refer to the architectures constructed by the proposed method as CGP-CNN. For example, CGP-CNN (ConvSet) means the proposed method with the ConvSet node function set. The models Maxout, Network in Network,
VGG, ResNet, FractalNet, and Wide ResNet, are the hand-designed CNN architectures,
whereas MetaQNN, Neural Architecture Search, Large-Scale Evolution, Genetic CNN,
and CoDeepNEAT are the models obtained by the architecture search methods. Here, "hand-designed" means that the CNN architectures were designed by
human experts. The values of other models, except for VGG and ResNet on CIFAR-
100, are referenced from the literature. We implemented the VGG net and ResNet for
CIFAR-100 because they were not applied to these datasets in Simonyan and Zisserman
(2015) and He et al. (2016). The architecture of the VGG is identical with configuration
D in Simonyan and Zisserman (2015). We denote this model as VGG in this article. In Tables 4 and 5, the numbers of learnable weight parameters in the models are also listed.
In CGP-CNN, the numbers of learnable weight parameters of the best architecture are
reported.

Table 4: Comparison of the error rates (%), the numbers of learnable weight parameters, and the search costs on the CIFAR-10 dataset. We run the proposed method ten times for each dataset and report the classification errors in the format of "best (mean ± std)." We refer to the architectures constructed by the proposed method as CGP-CNN. In CGP-CNN, the numbers of learnable weight parameters of the best architecture are reported. The values of the other models are referenced from the literature.

Model                                             # params   Test error           GPU days
Maxout (Goodfellow et al., 2013)                  -          9.38                 -
Network in Network (Lin et al., 2014)             -          8.81                 -
VGG (Simonyan and Zisserman, 2015)                15.2M      7.94                 -
ResNet (He et al., 2016)                          1.7M       6.61                 -
FractalNet (Larsson et al., 2017)                 38.6M      5.22                 -
Wide ResNet (Zagoruyko and Komodakis, 2016)       36.5M      4.00                 -
CoDeepNEAT (Miikkulainen et al., 2017)            -          7.30                 -
Genetic CNN (Xie and Yuille, 2017)                -          7.10                 17
MetaQNN (Baker et al., 2017)                      3.7M       6.92                 80–100
Large-Scale Evolution (Real et al., 2017)         5.4M       5.40                 2750
Neural Architecture Search (Zoph and Le, 2017)    37.4M      3.65                 16800–22400
CGP-CNN (ConvSet)                                 1.50M      5.92 (6.48 ± 0.48)   31
CGP-CNN (ResSet)                                  2.01M      5.01 (6.10 ± 0.89)   30

Table 5: Comparison of the error rates (%) and the numbers of learnable weight parameters on the CIFAR-100 dataset. We run the proposed method ten times for each dataset and report the classification errors in the format of "best (mean ± std)." We refer to the architectures constructed by the proposed method as CGP-CNN. In CGP-CNN, the numbers of learnable weight parameters of the best architecture are reported. The values of the other models, except for VGG and ResNet, are referenced from the literature.

Model                                             # params   Test error
Maxout (Goodfellow et al., 2013)                  -          38.57
Network in Network (Lin et al., 2014)             -          35.68
VGG (Simonyan and Zisserman, 2015)                15.2M      33.45
ResNet (He et al., 2016)                          1.7M       32.40
FractalNet (Larsson et al., 2017)                 38.6M      23.30
Wide ResNet (Zagoruyko and Komodakis, 2016)       36.5M      19.25
CoDeepNEAT (Miikkulainen et al., 2017)            -          -
Neural Architecture Search (Zoph and Le, 2017)    37.4M      -
Genetic CNN (Xie and Yuille, 2017)                -          29.03
MetaQNN (Baker et al., 2017)                      3.7M       27.14
Large-Scale Evolution (Real et al., 2017)         40.4M      23.0
CGP-CNN (ConvSet)                                 2.01M      26.7 (28.1 ± 0.83)
CGP-CNN (ResSet)                                  4.60M      25.1 (26.8 ± 1.21)

On the CIFAR-10 dataset, CGP-CNNs outperform most of the hand-designed mod-
els and have a good balance between the classification errors and the number of
parameters. CGP-CNN (ResSet) shows better performance compared to CGP-CNN
(ConvSet). Compared with other architecture search methods, CGP-CNN (ConvSet and
ResSet) outperforms MetaQNN (Baker et al., 2017), Genetic CNN (Xie and Yuille, 2017),
and CoDeepNEAT (Miikkulainen et al., 2017), and the best architecture of CGP-CNN
(ResSet) outperforms Large-Scale Evolution (Real et al., 2017). The Neural Architecture
Search (Zoph and Le, 2017) achieved the best error rate, but this method used 800 GPUs and required considerable computational costs to search for the best architecture. Table 4
also shows the number of GPU days (the computational time multiplied by the number
of GPUs used in the experiments) for the architecture search. As seen in this table, our method can find a good architecture at a reasonable computational cost. We conjecture that the highly functional modules reduce the search space and allow our method to find better architectures in early iterations. The CIFAR-100 dataset is a very challenging task because it has many classes. CGP-CNN finds competitive network
architectures in a reasonable computational time. Even though our model is not at the
same level as the state-of-the-art architectures, our model has a good balance between
the classification errors and the number of parameters.

The error rates of the architecture search methods (not only ours) do not reach that of the Wide ResNet, which is a human-designed architecture. However, these human-designed architectures were developed with the expenditure of tremendous human effort. An advantage of architecture search methods is that they can automatically find a good architecture for a new dataset. Another advantage of CGP-CNN is that the numbers of weight parameters in the discovered architectures are smaller than those in the human-designed architectures, which is beneficial when we want to implement a CNN on a mobile device. Note that we did not introduce any criteria for the architecture complexity in the fitness function. It might be possible to find more compact architectures by introducing a penalty term into the fitness function, which is important future work.

Figure 5 illustrates examples of the CNN architectures obtained by the proposed method, CGP-CNN (ConvSet and ResSet). As seen in Figure 5, we observe complex architectures that are hard to design by hand. Specifically, CGP-CNN (ConvSet) uses the summation and concatenation nodes, leading to a wide network and allowing the formation of skip connections. Therefore, the CGP-CNN (ConvSet) architecture is wider than that of CGP-CNN (ResSet). In addition, we also observe that CGP-CNN (ResSet) has a structure similar to ResNet (He et al., 2016). ResNet consists of a series of two types of modules: a module with several convolutions and shortcut connections without downsampling, and a downsampling convolution with a stride of 2. Although our method cannot downsample in the ConvBlock or the ResBlock, we see that CGP-CNN (ResSet) uses pooling layers as an alternative to the downsampling convolution. We can say that our method can also find an architecture similar to one designed by human experts.

Furthermore, we conducted the Wilcoxon rank sum test to statistically compare ResSet and ConvSet. We set the significance level to 5% (i.e., 5.0 × 10−2). On CIFAR-10 and CIFAR-100, the p-values for ResSet versus ConvSet were 3.19 × 10−2 and 1.15 × 10−2, respectively, indicating that the architectures using ResSet outperform those using ConvSet. We cannot conduct a statistical comparison test between CGP-CNN and the other architecture search methods because we cannot obtain the detailed experimental results of the other methods.
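The rank-based tests used in this and the following subsections can be reproduced with SciPy as sketched below; the per-run error rates are made-up placeholders, since the raw per-run results are not listed in this article:

    from scipy import stats

    resset  = [5.0, 5.6, 5.9, 6.1, 6.2, 6.4, 6.6, 6.8, 7.0, 7.4]   # 10 runs (illustrative)
    convset = [5.9, 6.1, 6.3, 6.4, 6.5, 6.6, 6.8, 7.0, 7.2, 7.6]

    # Wilcoxon rank sum test for two independent samples (significance level 5%)
    stat, p = stats.ranksums(resset, convset)
    print('rank sum test p-value:', p)

    # Kruskal-Wallis test across several groups, as used for the early-termination study
    n3  = [x + 0.1 for x in resset]
    n5  = [x + 0.2 for x in resset]
    n10 = [x - 0.1 for x in resset]
    stat, p = stats.kruskal(resset, n3, n5, n10)
    print('Kruskal-Wallis p-value:', p)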

Figure 5: The CNN architectures obtained by the proposed method on the CIFAR-10 dataset.

4.4 The Effect of the Rich Initialization

To investigate the effect of the rich initialization described in Subsection 3.3.1, we compare the CGP-CNN (ConvSet) and CGP-CNN (ResSet) models using the rich initialization to those models without it. Since DenseNet-like and ResNet-like architectures can be constructed using our ConvSet and ResSet, respectively, we initialize the CGP-CNN (ConvSet) with the DenseNet-like architecture and the CGP-CNN (ResSet) with the ResNet-like architecture. A summary of the error rates and computational times is provided in Table 6. We observe that the rich initialization contributes to improving the performance on both the CIFAR-10 and CIFAR-100 datasets. In addition, the ResNet-like initialization yields better performance than the DenseNet-like initialization.

We additionally conducted the Wilcoxon rank sum tests for two independent sam-
ples to analyze the effects of rich initialization. Based on a statistical test with the

Table 6: Comparison of the error rates (%), the computational times, and the numbers of learnable weight parameters on the CIFAR-10 and CIFAR-100 datasets. CGP-CNNs with/without the rich initialization (RichInit) are compared. We run each method ten times and report the classification errors in the format of "best (mean ± std)." The computational times are the mean over the ten runs. Note that we used different GPUs; a GTX 1080 and a GTX 1080 Ti were used for CIFAR-10 and CIFAR-100, respectively. The numbers of learnable weight parameters of the best architecture are reported.

Model                                        # params   Time (days)   Error rate (CIFAR-10)   Error rate (CIFAR-100)
CGP-CNN (ConvSet)                            1.50M      15.6          5.92 (6.48 ± 0.48)      -
CGP-CNN (ConvSet) with RichInit (DenseNet)   2.01M      14.5          5.01 (5.80 ± 0.52)      -
CGP-CNN (ConvSet)                            2.04M      13.0          -                       26.7 (28.1 ± 0.83)
CGP-CNN (ConvSet) with RichInit (DenseNet)   2.95M      15.7          -                       24.6 (26.4 ± 1.32)
CGP-CNN (ResSet)                             3.52M      14.7          5.01 (6.10 ± 0.89)      -
CGP-CNN (ResSet) with RichInit (ResNet)      2.72M      12.4          4.90 (5.60 ± 0.52)      -
CGP-CNN (ResSet)                             3.43M      10.9          -                       25.1 (26.8 ± 1.21)
CGP-CNN (ResSet) with RichInit (ResNet)      4.34M      11.7          -                       23.8 (25.8 ± 1.22)

significance level of 5%, the p-values for ResSet with/without the rich initialization on CIFAR-10 and CIFAR-100 were 5.5 × 10−3 and 2.2 × 10−2, respectively, and the values for ConvSet with/without the rich initialization on CIFAR-10 and CIFAR-100 were 1.4 × 10−2 and 1.9 × 10−2, respectively. The p-values for both ResSet and ConvSet were less than 5.0 × 10−2, and thus the rich initialization significantly improves the performance over that of the original models. We, however, note that the rich initialization of the architecture has a possibility of becoming stuck in a locally optimal solution. Therefore, we may need to take into account the diversity of the population when the rich initialization is used.

4.5 The Effect of the Early Termination

We conducted an experiment introducing the early termination of network training on the CIFAR-10 dataset to check its effect. The parameter for determining the termination, N, is varied as N = 3, 5, and 10. The parameter setting is the same as that described in Subsection 4.2. Table 7 shows the error rates and the computational times of the proposed method with/without the early termination. The average numbers of epochs for network training are also listed in Table 7. We also report the case in which we use both speed-up techniques, the rich initialization and the early termination, in ResSet.

From Table 7, the early termination reduces the optimization time without significant performance deterioration, and the optimization time decreases as N decreases. The average number of epochs is half that of the original method when N = 10 and decreases approximately linearly as N decreases. The computational time also decreases with the average number of epochs. In CGP-CNN (ResSet) with N = 3, the computational time becomes 2.39 days (i.e., 16.3% of that of the original CGP-CNN (ResSet)). This is reasonable for execution by most users. In addition, we observe that the combination of the early termination and the rich initialization is more effective considering both computational time and performance; that is, the performances are better than those of the architectures without the rich initialization.

Table 7: Comparison of the error rates (%), the computational times, and the average numbers of epochs for network training on the CIFAR-10 dataset with different values of the parameter N. We run each method ten times and report the errors in the format of "best (mean ± std)." The values of the computational time and the average number of epochs are the mean over the ten runs.

Method                                     Error rate           Time (days)   Average # epochs
CGP-CNN (ConvSet)                          5.92 (6.48 ± 0.48)   15.6          50.0
  with early termination (N = 3)           5.61 (6.52 ± 0.59)   3.32          8.97
  with early termination (N = 5)           5.81 (6.34 ± 0.39)   6.73          18.8
  with early termination (N = 10)          6.40 (6.93 ± 0.50)   11.4          30.6
CGP-CNN (ResSet)                           5.01 (6.10 ± 0.89)   14.7          50.0
  with early termination (N = 3)           5.30 (6.22 ± 0.46)   2.39          7.96
  with early termination (N = 5)           5.42 (6.34 ± 0.61)   4.93          15.1
  with early termination (N = 10)          5.27 (6.12 ± 0.50)   6.05          23.9
CGP-CNN (ResSet) with RichInit             4.90 (5.60 ± 0.52)   11.7          50.0
  with early termination (N = 3)           5.06 (5.83 ± 0.47)   1.29          6.88
  with early termination (N = 5)           5.07 (5.76 ± 0.57)   2.86          12.4
  with early termination (N = 10)          4.90 (5.54 ± 0.60)   6.36          22.2

Furthermore, we conducted Kruskal-Wallis rank sum tests for four independent samples (the CGP-CNNs with early termination of N = 3, 5, and 10, and the one without early termination) to analyze the effect of the early termination technique. For ConvSet, ResSet, and ResSet with RichInit, the p-values were 1.28 × 10−1, 4.77 × 10−1, and 4.86 × 10−1, respectively. The p-values for all cases were large (> 5.0 × 10−2), indicating that each model achieved similar classification performance. Consequently, our early termination technique can reduce the computational time without performance deterioration.

4.6 Result on the Small-Data Scenario

In the small-data scenario, we compare our method with VGG and ResNet. We trained
the VGG and ResNet models with the same settings used in the retraining process of
the proposed method. Table 8 shows the comparison of error rates in the small-data scenario. We observe that our methods, CGP-CNN (ConvSet) and CGP-CNN (ResSet),
can find better architectures than VGG and ResNet.

VGG and ResNet are designed and tuned for a relatively large amount of data and
have millions of parameters to train. Therefore, the number of samples in the small-data scenario seems to be too small to prevent overfitting. Meanwhile, our method can automatically tune the architecture depending on the dataset and achieve better performance even on small datasets. The numbers of learnable weight parameters are relatively small, suggesting that the proposed method finds compact architectures to prevent overfitting. Figure 6 illustrates the architectures constructed by the proposed method. From this figure, we see that our method found relatively wide structures compared with the architectures that appeared in Figure 5. The computational time of the proposed method in the small-data scenario is a few days using an NVIDIA GeForce GTX 1080 Ti.

Table 8: Comparison of the error rates on the CIFAR-10 dataset (small-data scenario). We run the experiment ten times and report the classification errors in the format of "best (mean ± std)." The numbers of learnable weight parameters of the best architecture are also reported.

Model                                   Error rate             # params
VGG (Simonyan and Zisserman, 2015)      23.05 (24.07 ± 0.43)   15.2M
ResNet (He et al., 2016)                24.01 (24.83 ± 0.51)   1.70M
CGP-CNN (ConvSet)                       19.78 (22.14 ± 1.80)   1.94M
CGP-CNN (ResSet)                        19.33 (20.52 ± 1.36)   0.92M

Figure 6: The CNN architectures obtained by the proposed method in the small-data scenario.

We additionally conducted a Wilcoxon rank sum test to compare our models with the other models. Based on the statistical test at the 5% significance level, the p-value between ResSet and VGG (Simonyan and Zisserman, 2015) was 7.25 × 10^−4, and the p-value between ResSet and ResNet (He et al., 2016) was 2.1 × 10^−5. The p-values for ConvSet compared to VGG and ResNet were 1.05 × 10^−1 and 1.1 × 10^−2, respectively.
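As with the Kruskal-Wallis test above, this comparison can be reproduced with SciPy's rank sum test. The sketch below uses synthetic samples in place of the actual per-run error rates.

# A minimal sketch (not the authors' code) of the Wilcoxon rank sum test used
# above, with SciPy. The two samples are synthetic stand-ins for the ten test
# error rates of CGP-CNN (ResSet) and ResNet in the small-data scenario.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)
resset_errors = rng.normal(loc=20.52, scale=1.36, size=10)  # synthetic
resnet_errors = rng.normal(loc=24.83, scale=0.51, size=10)  # synthetic

# ranksums tests whether the two samples come from the same distribution; a
# p-value below 0.05 indicates a significant difference in error rates.
stat, p_value = ranksums(resset_errors, resnet_errors)
print(f"W = {stat:.3f}, p = {p_value:.4g}")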

5 Conclusion

In this article, we proposed a CGP-based approach for designing deep CNN architectures and verified its potential. The proposed method generates CNN architectures based on the CGP encoding scheme with highly functional modules and uses a modified evolutionary algorithm to efficiently find good architectures. Moreover, we introduced simple speed-up techniques, rich initialization and early termination of network training, to reduce the computational time. We constructed CNN architectures for the image classification task on the CIFAR-10 and CIFAR-100 datasets and considered two different data-size settings. The experimental results showed that the proposed method can find competitive CNN architectures compared with state-of-the-art models. Regarding the speed-up techniques, rich initialization improves the performance of the discovered architectures, and early termination succeeded in reducing the computational time without significant performance deterioration. By combining early termination with rich initialization, the computational time can be reduced and the performance improved. In the small-data scenario, the proposed method also finds better and more compact architectures than the existing hand-designed architectures.

The main bottleneck of DNN architecture search is the computational cost. Another possible speed-up technique is to start with a small amount of training data and to increase it as the generations progress (see the sketch below). Moreover, to make the CNN architectures simpler and more compact, we may introduce regularization techniques into the architecture search process. Alternatively, we may be able to manually simplify the obtained CNN architectures by removing redundant or less effective layers.
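The following sketch illustrates the data-growth idea under assumed names and a simple linear schedule; it is not part of the proposed method.

# A minimal sketch of the speed-up idea mentioned above: evaluate candidate
# architectures on a small training subset in early generations and enlarge the
# subset as the search proceeds. The function names, the linear schedule, and
# the minimum subset size are illustrative assumptions, not the authors' method.
def subset_size(generation, max_generation, full_size, min_size=5000):
    """Linearly grow the training-subset size with the generation index."""
    frac = min(generation / max_generation, 1.0)
    return int(min_size + frac * (full_size - min_size))

def evaluate_candidates(candidates, train_data, generation, max_generation, train_fn):
    """Train each candidate CNN on the current subset and return its fitness."""
    n = subset_size(generation, max_generation, len(train_data))
    subset = train_data[:n]  # assumes train_data is already shuffled
    return [train_fn(architecture, subset) for architecture in candidates]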

Another possible future topic would be to apply other evolutionary algorithms, such as the standard genetic algorithm used by Real et al. (2017), to our proposed method. Because we employed only a simple evolution strategy in this article, we did not examine how other evolutionary algorithms affect the performance of the architecture search. Thus, we would like to investigate this point in the future.

References

Baker, B., Gupta, O., Naik, N., and Raskar, R. (2017). Designing neural network architectures using reinforcement learning. In Proceedings of the 5th International Conference on Learning Representations. Retrieved from arXiv:1611.02167.

Bergstra, J., and Bengio, Oui. (2012). Random search for hyper-parameter optimization. Journal de

Machine Learning Research, 13:281–305.

Bergstra, J., Yamins, D., and Cox, D. (2013). Making a science of model search: Hyperparameter
optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th
International Conference on Machine Learning, pp. 115–123.

Bergstra, J.. S., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter opti-

mization. In Advances in Neural Information Processing Systems, 24, pp. 2546–2554.

Domhan, T., Springenberg, J.. T., and Hutter, F. (2015). Speeding up automatic hyperparameter
optimization of deep neural networks by extrapolation of learning curves. In Proceedings of
the 24th International Conference on Artificial Intelligence, pp. 3460–3468.


Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, pp. 1319–1327.

Harding, S. (2008). Evolution of image filters on graphics processor units using Cartesian ge-
netic programming. In Proceedings of the IEEE Congress on Evolutionary Computation, pp. 1921–
1928.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.

Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convo-
lutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pp. 4700–4708.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2011). Sequential model-based optimization for
general algorithm configuration. In Proceedings of the 5th International Conference on Learning
and Intelligent Optimization, pp. 507–523.

Ioffe, S., and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456.

Kingma, D. P., and Ba, J.. (2015). Adam: A method for stochastic optimization. In Proceedings of the
3rd International Conference on Learning Representations. Retrieved from arXiv:1412.6980.

Krizhevsky, A., and Hinton, G. E. (2009). Learning multiple layers of features from tiny images. Technical Report.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 25, pp. 1097–1105.

Kupyn, O., Budzan, V., Mykhailych, M., Mishkin, D., and Matas, J.. (2018). DeblurGAN: Blind mo-
tion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 8183–8192.

Larsson, G., Maire, M., and Shakhnarovich, G. (2017). FractalNet: Ultra-deep neural networks
without residuals. In Proceedings of the 5th International Conference on Learning Representations.
Retrieved from arXiv:1605.07648.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P.. (1998). Gradient-based learning applied to doc-

ument recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lin, M., Chen, Q., and Yan, S. (2014). Network in network. In Proceedings of the 2nd International

Conference on Learning Representations. Retrieved from arXiv:1312.4400.

Loshchilov, I., and Hutter, F. (2016). CMA-ES for hyperparameter optimization of deep neural networks. In Proceedings of the 4th International Conference on Learning Representations Workshop. Retrieved from arXiv:1604.07269.

Miikkulainen, R., Liang, J. Z., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Shahrzad, H., Navruzyan, A., Duffy, N., and Hodjat, B. (2017). Evolving deep neural networks. Retrieved from arXiv:1703.00548.


Miller, J.. F., and Smith, S. L. (2006). Redundancy and computational efficiency in Cartesian genetic

programming. IEEE Transactions on Evolutionary Computation, 10(2):167–174.

Miller, J.. F., and Thomson, P.. (2000). Cartesian genetic programming. In Proceedings of the European

Conference on Genetic Programming, pp. 121–132.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. In Proceedings of Workshop on Deep Learning in Neural Information Processing Systems. Retrieved from arXiv:1312.5602v1.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Morse, G., and Stanley, K. O. (2016). Simple evolutionary optimization can rival stochastic gradient descent in neural networks. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO), pp. 477–484.

Nair, V., and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pp. 807–814.

Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Le, Q. V., and Kurakin, A. (2017). Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, pp. 2902–2911.

Schaffer, J.. D., Whitley, D., and Eshelman, L. J.. (1992). Combinations of genetic algorithms and
neural networks: A survey of the state of the art. In Proceedings of International Workshop on
Combinations of Genetic Algorithms and Neural Networks, pp. 1–37.

Simonyan, K., and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations. Retrieved from arXiv:1409.1556v6.

Snoek, J., Larochelle, H., and Adams, R.. P.. (2012). Practical Bayesian optimization of machine
learning algorithms. In Advances in Neural Information Processing Systems, 25, pp. 2951–2959.

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M. M. A., Prabhat, P., and Adams, R. P. (2015). Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2171–2180.

Stanley, K. O., D’Ambrosio, D. B., and Gauci, J.. (2009). A hypercube-based encoding for evolving

large-scale neural networks. Artificial Life, 15(2):185–212.

Stanley, K. O., and Miikkulainen, R.. (2002). Evolving neural networks through augmenting

topologies. Evolutionary Computation, 10(2):99–127.

Suganuma, M., Shirakawa, S., and Nagao, T. (2017). A genetic programming approach to design-
ing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary
Computation Conference 2017 (GECCO), pp. 497–504.

Sun, Y., Yen, G. G., and Yi, Z. (2018). Evolving unsupervised deep neural networks for learning meaningful representations. IEEE Transactions on Evolutionary Computation. Retrieved from doi:10.1109/TEVC.2018.2808699.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.

Tokui, S., Oono, K., Hido, S., and Clayton, J.. (2015). Chainer: A next-generation open source frame-
work for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys)
in The Twenty-Ninth Annual Conference on Neural Information Processing Systems.


Verbancsics, P., and Harguess, J.. (2013). Generative neuroevolution for deep learning. Retrieved

from arXiv:1312.5355.

Verbancsics, P., and Harguess, J.. (2015). Image classification using generative neuro evolution for
deep learning. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision,
pp. 488–493.

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.

Xie, L., and Yuille, A. (2017). Genetic CNN. In Proceedings of the International Conference on Computer Vision, pp. 1388–1397.

Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447.

Zagoruyko, S., and Komodakis, N. (2016). Wide residual networks. In Proceedings of the British

Machine Vision Conference, pp. 87.1–87.12.

Zhang, R., Isola, P., and Efros, A. A. (2016). Colorful image colorization. In Proceedings of the 14th European Conference on Computer Vision, pp. 649–666.

Zoph, B., and Le, Q. V. (2017). Neural architecture search with reinforcement learning. Dans
Proceedings of the 5th International Conference on Learning Representations. Retrieved from
arXiv:1611.01578.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

/

e
d
toi
e
v
c
o
un
r
t
je
c
e

p
d

je

F
/

/

/

/

2
8
1
1
4
1
2
0
2
0
3
6
2
e
v
c
o
_
un
_
0
0
2
5
3
p
d

.

/

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Evolutionary Computation Volume 28, Nombre 1

163Evolution of Deep Convolutional Neural image
Evolution of Deep Convolutional Neural image

Télécharger le PDF