CARTA

CARTA

Communicated by Litian Liu

Understanding Dynamics of Nonlinear Representation
Learning and Its Application

Kenji Kawaguchi
kkawaguchi@fas.harvard.edu
Harvard University, Cambridge, MAMÁ 02138, U.S.A.

Linjun Zhang
linjun.zhang@rutgers.edu
Rutgers University, New Brunswick, Nueva Jersey 08901

Zhun Deng
zhundeng@g.harvard.edu
Harvard University Cambridge, MAMÁ 02138, U.S.A.

Representations of the world environment play a crucial role in artifi-
cial intelligence. It is often inefficient to conduct reasoning and infer-
ence directly in the space of raw sensory representations, such as pixel
values of images. Representation learning allows us to automatically dis-
cover suitable representations from raw sensory data. Por ejemplo, given
raw sensory data, a deep neural network learns nonlinear representations
at its hidden layers, which are subsequently used for classification (o
regression) at its output layer. This happens implicitly during training
through minimizing a supervised or unsupervised loss. In this letter, nosotros
study the dynamics of such implicit nonlinear representation learning.
We identify a pair of a new assumption and a novel condition, called
the on-model structure assumption and the data architecture alignment
condición. Under the on-model structure assumption, the data architec-
ture alignment condition is shown to be sufficient for the global conver-
gence and necessary for global optimality. Además, our theory explains
how and when increasing network size does and does not improve the
training behaviors in the practical regime. Our results provide practi-
cal guidance for designing a model structure; Por ejemplo, the on-model
structure assumption can be used as a justification for using a particu-
lar model structure instead of others. As an application, we then derive a
new training framework, which satisfies the data architecture alignment
condition without assuming it by automatically modifying any given
training algorithm dependent on data and architecture. Given a stan-
dard training algorithm, the framework running its modified version is
empirically shown to maintain competitive (practical) test performances
while providing global convergence guarantees for deep residual neural

Computación neuronal 34, 991–1018 (2022)
https://doi.org/10.1162/neco_a_01483

© 2022 Instituto de Tecnología de Massachusetts.
Publicado bajo Creative Commons
Atribución 4.0 Internacional (CC POR 4.0) licencia.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

992

k. Kawaguchi, l. zhang, and Z. Deng

networks with convolutions, skip connections, and batch normalization
with standard benchmark data sets, including MNIST, CIFAR-10, CIFAR-
100, Semeion, KMNIST, and SVHN.

1 Introducción

LeCun, bengio, and Hinton (2015) described deep learning as one of hier-
archical nonlinear representation learning approaches:

Deep-learning methods are representation-learning methods with multi-
ple levels of representation, obtained by composing simple but non-linear
modules that each transform the representation at one level (starting with
the raw input) into a representation at a higher, slightly more abstract
nivel (pag. 436).

In applications such as computer vision and natural language process-
En g, the success of deep learning can be attributed to its ability to learn hi-
erarchical nonlinear representations by automatically changing nonlinear
features and kernels during training based on the given data. This is in con-
trast to classical machine-learning methods where representations or equiv-
alently nonlinear features and kernels are fixed during training.

Deep learning in practical regimes, which has the ability to learn nonlin-
ear representation (bengio, Courville, & Vincent, 2013), has had a profound
impact in many areas, including object recognition in computer vision (Ri-
fai et al., 2011; Hinton, Osindero, & Teh, 2006; bengio, Lamblin, Popovici,
& Larochelle, 2007; Ciregan, Meier, & Schmidhuber, 2012; Krizhevsky,
Sutskever, & Hinton, 2012), style transfer (Gatys, Ecker, & Bethge, 2016;
Luan, París, Shechtman, & Bala, 2017), image super-resolution (Dong, Loy,
Él, & Espiga, 2014), speech recognition (Dahl, Ranzato, mohamed, & Hinton,
2010; Deng et al., 2010; Seide, li, & Yu, 2011; mohamed, Dahl, & Hinton,
2011; Dahl, Yu, Deng, & Acero, 2011; Hinton et al., 2012), traducción automática-
ción (Schwenk, Rousseau, & Attik, 2012; Le, Oparin, Allauzen, Gauvain, &
Yvon, 2012), paraphrase detection (Socher, Huang, Pennington, Ng, & Man-
y, 2011), desambiguación del sentido de la palabra (Bordes, Glorot, Weston, & bengio,
2012), and sentiment analysis (Glorot, Bordes, & bengio, 2011; Socher, Pen-
nington, Huang, Ng, & Manning, 2011). Sin embargo, we do not yet know the
precise condition that makes deep learning tractable in the practical regime
of representation learning.

To initiate a study toward such a condition, we consider the following
problem setup that covers deep learning in the practical regime and other
nonlinear representation learning methods in general. We are given a train-
∈ Y ⊆ Rmy
ing data set ((xi
are the ith input and the ith target, respectivamente. We would like to learn a pre-
dictor (or hypothesis) from a parametric family H = { F (·, i ) : Rmx → Rmy |
θ ∈ Rd} by minimizing the objective L,

i=1 of n samples where xi

∈ X ⊆ Rmx and yi

, yi))norte

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

l(i ) = 1
norte

norte(cid:2)

yo=1

(cid:3)( F (xi

, i ), yi),

993

(1.1)

dónde (cid:3) : Rmy × Y → R≥0 is the loss function that measures the discrepancy
, i ) and the target yi for each sample. En esto
between the prediction f (xi
= xi (for i = 1, . . . , norte) to include the setting of
letter, it is allowed to let yi
unsupervised learning. The function f is also allowed to represent a wide
, i )
range of machine learning models. Para (deep) neural network,
represents the preactivation output of the last layer of the network, y
the parameter vector θ ∈ Rd contains all the trainable parameters, incluido
weights and bias terms of all layers.

F (xi

Por ejemplo, one of the simplest models is the linear model in the form
(cid:6)i , where φ : X → Rd is a fixed function and φ(X) is a non-
of f (X, i ) = φ(X)
linear representation of input data x. This is a classical machine learning
model where much of the effort goes into the design of the handcrafted
feature map φ via feature engineering (Tornero, Fuggetta, Lavazza, & Lobo,
1999; Zheng & Casari, 2018). In this linear model, we do not learn the rep-
resentation φ(X) because the feature map φ is fixed without dependence on
the model parameter θ that is optimized with the data set ((xi

, yi))norte

yo=1.

Similar to many definitions in mathematics, where an intuitive notion in
a special case is formalized to a definition for a more general case, we now
abstract and generalize this intuitive notion of the representation φ(X) del
linear model to that of all differentiable models as follows:

Definición 1. Given any x ∈ X and differentiable function f , we define
be the gradient representation of the data x under the model f at θ .

∂ f (X,i )
∂θ

a

(cid:6)i

∂ f (X,i )
∂θ

= ∂φ(X)
∂θ

This definition recovers the standard representation φ(X) in the linear
= φ(X) and is applicable to all differentiable non-
model as
linear models in representation learning. Además, this definition captures
the key challenge of understanding the dynamics of nonlinear representa-
tion learning well, as illustrated below. Using the notation of dθ t
= (cid:6)t, el
dt
dynamics of the model f (X, θ t ) over the time t can be written by

d
dt

F (X, θ t ) =

∂ f (X, θ t )
∂θ t

dθ t
dt

=

∂ f (X, θ t )
∂θ t

(cid:6)t.

(1.2)

Aquí, we can see that the dynamics are linear in (cid:6)t if there is no gradient
≈ ∂ f (X,i 0 )
. Sin embargo, with representation
representation learning as
∂θ 0
∂ f (X,θ t )
∂θ t

changes depending on t (y
aprendiendo, the gradient representation
(cid:6)t), resulting in the dynamics that are nonlinear in (cid:6)t. Por lo tanto, the defini-
tion of the gradient representation can distinguish fundamentally different
dynamics in machine learning.

∂ f (X,θ t )
∂θ t

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

994

k. Kawaguchi, l. zhang, and Z. Deng

In this letter, we initiate the study of the dynamics of learning gradi-
ent representation that are nonlinear in (cid:6)t. Eso es, we focus on the regime
∂ f (X,θ t )
at the end of training time t dif-
where the gradient representation
∂θ t

∂ f (X,i 0 )
∂θ 0

fers greatly from the initial representation
. This regime was stud-
ied in the past for the case where the function φ(X) (cid:8)→ f (X, i ) is affine for
some fixed feature map φ (sajonia, McClelland, & Ganguli, 2014; Kawaguchi,
2016, 2021; Laurent & Brecht, 2018; Bartlett, Helmbold, & Largo, 2019; Zou,
Largo, & Gu, 2020; Xu et al., 2021). Unlike any previous studies, we focus
on the problem setting where the function φ(X) (cid:8)→ f (X, i ) is nonlinear and
nonaffine, with the effect of nonlinear (gradient) representation learning.
The results of this letter avoid the curse of dimensionality by studying the
global convergence of the gradient-based dynamics instead of the dynamics
of global optimization (Kawaguchi et al., 2016) and Bayesian optimization
(Kawaguchi, Kaelbling, & Lozano-Pérez, 2015). En tono rimbombante, we do not re-
quire any wide layer or large input dimension throughout this letter. Nuestro
main contributions are summarized as follows:

1. En la sección 2, we identify a pair of a novel assumption and a new
condición, called the common model structure assumption and the data-
architecture alignment condition. Under the common model structure
assumption, the data-architecture alignment condition is shown to
be a necessary condition for the globally optimal model and a suffi-
cient condition for the global convergence. The condition is depen-
dent on both data and architecture. Además, we empirically verify
and deepen this new understanding. When we apply representation
learning in practice, we often have overwhelming options regarding
which model structure to be used. Our results provide a practical
guidance for choosing or designing model structure via the common
model structure assumption, which is indeed satisfied by many rep-
resentation learning models used in practice.

2. En la sección 3, we discard the assumption of the data-architecture
alignment condition. En cambio, we derive a novel training framework,
called the exploration-exploitation wrapper (EE wrapper), which sat-
isfies the data-architecture alignment condition time-independently
a priori. The EE wrapper is then proved to have global convergence
guarantees under the safe-exploration condition. The safe-exploration
condition is what allows us to explore various gradient represen-
tations safely without getting stuck in the states where we cannot
provably satisfy the data-architecture alignment condition. The safe-
exploration condition is shown to hold true for ResNet-18 with stan-
dard benchmark data sets, including MNIST, CIFAR-10, CIFAR-100,
Semeion, KMNIST, and SVHN time-independently.

3. En la sección 3.4, the EE wrapper is shown to not degrade practical per-
formances of ResNet-18 for the standard data sets, MNIST, CIFAR-10,
CIFAR-100, Semeion, KMNIST, and SVHN. To our knowledge, este

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

995

letter provides the first practical algorithm with global convergence
guarantees without degrading practical performances of ResNet-18
on these standard data sets, using convolutions, skip connections,
and batch normalization without any extremely wide layer of the
width larger than the number of data points. To the best of our knowl-
borde, we are not aware of any similar algorithms with global conver-
gence guarantees in the regime of learning nonlinear representations
without degrading practical performances.

It is empirically known that increasing network size tends to improve
training behaviors. En efecto, the size of networks correlates well with the
training error in many cases in our experiments (p.ej., see Figure 1b). Cómo-
alguna vez, the size and the training error do not correlate well in some ex-
perimentos (p.ej., see Figure 1c). Our new theoretical results explain that
the training behaviors correlate more directly with the data-architecture
alignment condition instead. The seeming correlation with the network
size is caused by another correlation between the network size and the
data-architecture alignment condition. This is explained in more detail in
sección 2.3.

2 Understanding Dynamics via Common Model Structure

and Data-Architecture Alignment

En esta sección, we identify the common model structure assumption and
study the data-architecture alignment condition for the global convergence
in nonlinear representation learning. We begin by presenting an overview
of our results in section 2.1, deepen our understandings with experiments
in section 2.2, discuss implications of our results in section 2.3, and establish
mathematical theories in section 2.4.

2.1 Overview. We introduce the common model structure assumption
in section 2.1.1 and define the data-architecture alignment condition in sec-
ción 2.1.2. Using the assumption and the condition, we present the global
convergence result in section 2.1.3.

2.1.1 Common Model Structure Assumption. Through examinations of
representation learning models used in applications, we identified and for-
malized one of their common properties as follows:

Assumption 1 (Common Model Structure Assumption). There exists a sub-
set S ⊆ {1, 2, . . . , d} such that f (xi
for any i ∈
, i ) =
{1, . . . , norte} and θ ∈ Rd.

k=1 1{k ∈ S}i

(cid:4) ∂ f (xi

,i )
k

(cid:3)

∂θ

(cid:5)

d

k

Assumption 1 is satisfied by common machine learning models, semejante
as kernel models and multilayer neural networks, with or without convo-
lutions, batch normalization, pooling, and skip connections. Por ejemplo,

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

996

k. Kawaguchi, l. zhang, and Z. Deng

d

k(

k=1 1{k ∈ S}i

consider a multilayer neural network of the form f (X, i ) = Wh(X, tu) + b,
where h(X, tu) is an output of its last hidden layer and the parameter
vector θ consists of the parameters (W., b) at the last layer and the pa-
rameters u in all other layers as θ = vec([W., b, tu]). Aquí, for any matrix
M ∈ Rm× ¯m, we let vec(METRO) ∈ Rm ¯m be the standard vectorization of the ma-
trix M by stacking columns. Entonces, assumption 1 holds because f (X, i ) =
(cid:3)
∂ f (xi
,i )
k : k ∈ S} = {vec([W., b])k :
), where S is defined by {i
∂θ
k
k = 1, 2, . . . , ξ } with vec([W., b]) ∈ Rξ
. Since h is arbitrary in this example,
the common model structure assumption holds, Por ejemplo, for any mul-
tilayer neural networks with a fully connected last layer. En general, ser-
cause the nonlinearity at the output layer can be treated as a part of the
loss function (cid:3) while preserving convexity of q (cid:8)→ (cid:3)(q, y) (p.ej., cross-entropy
loss with softmax), this assumption is satisfied by many machine learning
modelos, including ResNet-18 and all models used in the experiments in this
letter (as well as all linear models). Además, assumption 1 is automatically
satisfied in the next section by using the EE wrapper.

, . . . , y(cid:3)
norte)

2.1.2 Data-Architecture Alignment Condition. Given a target matrix Y =
(cid:6) ∈ Rn×my and a loss function (cid:3), we define the modified tar-
, y2
, . . . , en)
(y1
, y(cid:3)
(cid:6) ∈ Rn×my by Y(cid:3) = Y for the squared loss (cid:3),
get matrix Y(cid:3) = (y(cid:3)
2
1
− 1 para el (binary and multiclass) cross-entropy losses
= 2Yi j
and by (Y(cid:3))i j
(cid:6) ∈ Rn×mx , el
(cid:3) with Yi j
∈ {0, 1}. Given input matrix X = (x1
output matrix fX (i ) ∈ Rn×my is defined by fX (i )i j
∈ R. For any
matrix M ∈ Rm× ¯m, we let Col(METRO) ⊆ Rm be its column space. With these
notations, we are now ready to introduce the data-architecture alignment
condición:

, . . . , xn)
, i ) j
= f (xi

, x2

Definición 2 (Data-Architecture Alignment Condition). Given any data set
(X, Y), differentiable function f , and loss function (cid:3), the data-architecture align-
ment condition is said to be satisfied at θ if vec(Y(cid:3)) ∈ Col(

∂vec( fX (i ))
∂θ

).

The data-architecture alignment condition depends on both data
(through the target Y and the input X) y arquitectura (through the model
F ). It is satisfied only when the data and architecture align well to each
(cid:6)θ ∈ R, el
otro. Por ejemplo, in the case of linear model f (X, i ) = φ(X)
condition can be written as vec(Y(cid:3)) ∈ Col((cid:8)(X )) dónde (cid:8)(X ) ∈ Rn×d and
(cid:8)(X )i j
= φ(xi) j. In definition 2, fX (i ) is a matrix of the preactivation out-
puts of the last layer. De este modo, in the case of classification tasks with a nonlinear
activation at the output layer, fX (i ) and Y are not in the same space, cual
is the reason we use Y(cid:3) here instead of Y.

En tono rimbombante, the data-architecture alignment condition does not make
∈ Rnmy×d:
any requirements on the the rank of the Jacobian matrix
is allowed to be smaller than nmy and d. De este modo, for ex-
the rank of
amplio, the data-architecture alignment condition can be satisfied depend-
ing on the given data and architecture even if the minimum eigenvalue of

∂vec( fX (i ))
∂θ

∂vec( fX (i ))
∂θ

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

997

(

∂vec( fX (i ))
∂θ

∂vec( fX (i ))
∂θ

)(cid:6) is zero, in both cases of overparameteriza-
the matrix
ción (p.ej., d (cid:9) norte) and underparameterization (p.ej., d (cid:10) norte). This is further
illustrated in section 2.2 and discussed in section 2.3. We note that we fur-
ther discard the assumption of the data-architecture alignment condition in
sección 3 as it is automatically satisfied by using the EE wrapper.

2.1.3 Global Convergence. Under the common model structure assump-
ción, the data-architecture alignment condition is shown to be what lets us
avoid the failure of the global convergence and suboptimal local minima.
More concretely, we prove a global convergence guarantee under the data-
architecture alignment condition as well as the necessity of the condition
for the global optimality:

Teorema 1 (Informal Version). Let assumption 1 hold. Then the following two
statements hold for gradient-based dynamics:

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

(i) The global optimality gap bound decreases per iteration toward zero at the
|t |) for any T such that the data-architecture alignment con-

rate of O(1/
dition is satisfied at θ t for t ∈ T .

(ii) For any θ ∈ Rd, the data-architecture alignment condition at θ is necessary
to have the globally optimal model fX (i ) = ηY(cid:3) at θ for any η ∈ R.

Theorem 1i guarantees the global convergence without the need to sat-
isfy the data-architecture alignment condition at every iteration or at the
limit point. En cambio, it shows that the bound on the global optimality gap
decreases toward zero per iteration whenever the data-architecture align-
ment condition holds. Theorem 1ii shows that the data-architecture align-
ment condition is necessary for the global optimality. Intuitivamente, this is
because the expressivity of a model class satisfying the common model
structure assumption is restricted such that it is required to align the archi-
tecture to the data in order to contain the globally optimal model fX (i ) =
ηY(cid:3) (for any η ∈ R).

To better understand the statement of theorem 1i, consider a counterex-
ample with a data set consisting of the single point (X, y) = (1, 0), el modelo
F (X, i ) = θ 4 − 10θ 2 + 6i + 100, and the squared loss (cid:3)(q, y) = (q − y)2. En
this example, we have L(i ) = f (X, i )2, which has multiple suboptimal
local minima of different values. Entonces, via gradient descent, el modelo
converges to the closest local minimum and, En particular, does not neces-
sarily converge to a global minimum. En efecto, this example violates the com-
mon model structure assumption (assumption 1) (although it satisfies the
data-architecture alignment condition), showing the importance of the
common model structure assumption along with the data-architecture
alignment. This also illustrates the nontriviality of theorem 1i in that the
data-architecture alignment is not sufficient, and we needed to understand
what types of model structures are commonly used in practice and formal-
ize the understanding as the common model structure assumption.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

998

k. Kawaguchi, l. zhang, and Z. Deng

To further understand the importance of the common model structure
assumption in theorem 1, we now consider the case where we do not re-
quire the assumption. En efecto, we can guarantee the global convergence
without the common model structure assumption if we ensure that the min-
imum eigenvalue of the matrix
is nonzero. This can be
proved by the following derivation. Let θ be an arbitrary stationary point
of L. Then we have 0 = ∂L(i )
norte
, which implies
yo=1(
eso

∂vec( fX (i ))
∂θ

∂vec( fX (i ))
∂θ

(cid:3)(q,yi )
∂q

= 1
norte

∂ f (xi
∂θ

q= f (xi

,i ))

(cid:6)
)

(cid:3)

,i )

∂θ

(

|

∂vec( fX (i ))
∂θ

v = 0,

(2.1)

|

,i ))

q= f (x1

(cid:3)(q,u1 )
∂q

(cid:6), . . . , (

where v = vec((

(cid:3)(q,uN )
∂q
is nonzero,
if the minimum eigenvalue of the matrix
= 0 for all i ∈ {1, 2, . . . , norte}. Using the con-
then we have v = 0:
vexity of the map q (cid:8)→ (cid:3)(q, y) (which is satisfied by the squared loss and
cross-entropy loss), this implies that for any q1

) ∈ Rnmy . Por lo tanto,

, . . . , qn ∈ Rmy ,

(cid:6)
q= f (xn,i ))

∂vec( fX (i ))
∂θ

∂vec( fX (i ))
∂θ

(cid:3)(q,ui )
∂q

q= f (xi

, q2

(cid:6)
)

,i )

(

|

|

norte(cid:2)

(cid:3)( F (xi

, i ), yi)

(cid:3)(qi

, yi) −

(cid:6)

(cid:3)(q, ui)
∂q

(cid:8)

(cid:7)
(cid:7)
(cid:7)
(cid:7)
q= f (xi

,i )

(2.2)

(cid:8)

(qi

− f (xi

, i ))

(2.3)

(cid:3)(qi

, yi).

(2.4)

l(i ) = 1
norte

≤ 1
norte

≤ 1
norte

yo=1
norte(cid:2)

(cid:6)

yo=1
norte(cid:2)

yo=1

(

(cid:6)
)

the minimum eigenvalue of

Since f (x1), F (x2), . . . , F (xn) ∈ Rmy , this implies that any stationary point
θ is a global minimum if
the matrix
∂vec( fX (i ))
∂vec( fX (i ))
is nonzero, without the common model structure as-
∂θ
∂θ
sumption (see assumption 1). En efecto, in the above example with the model
F (X, i ) = θ 4 − 10θ 2 + 6i + 100, the common model structure assumption is
violated, but we still have the global convergence if the minimum eigen-
F (X, i ) = y = 0 at any stationary point θ
value is nonzero—for example,
such that the minimum eigenvalue of the matrix
es
nonzero. A diferencia de, theorem 1 allows the global convergence even when
the minimum eigenvalue of the matrix
is zero by uti-
lizing the common model structure assumption.

∂vec( fX (i ))
∂θ

∂vec( fX (i ))
∂θ

∂vec( fX (i ))
∂θ

∂vec( fX (i ))
∂θ

(cid:6)
)

(cid:6)
)

(

(

The formal version of theorem 1 is presented in section 2.4 and is proved
in appendix A in the supplementary information that relies on the addi-
tional previous works of Clanuwat et al. (2019), Krizhevsky and Hinton
(2009), Mityagin (2015), Netzer et al. (2011), Paszke et al. (2019a, 2019b),
and Poggio et al. (2017). Before proving the statement, we first examine the
meaning and implications of our results through illustrative examples in
secciones 2.2 y 2.3.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

999

2.2 Illustrative Examples in Experiments. Teorema 1 suggests that
data-architecture alignment condition vec(Y(cid:3)) ∈ Col(
) has the abil-
ity to distinguish the success and failure cases, even when the minimum
is zero for all t ≥ 0. En esto
eigenvalue of the matrix
sección, we conduct experiments to further verify and deepen this theoret-
ical understanding.

∂vec( fX (θ t ))
∂θ t

∂vec( fX (θ t ))
∂θ t

∂vec( fX (θ t ))
∂θ t

(cid:6)
)

(

We employ a fully connected network having four layers with 300 neu-
rons per hidden layer, and a convolutional network, LeNet (LeCun et al.,
1998), with five layers. For the fully connected network, we use the two-
moons data set (Pedregosa et al., 2011) and a sine wave data set. To create
the sine wave data set, we randomly generated the input xi from the uni-
form distribution on the interval [−1, 1] and set yi
= 1{sin(20xi) < 0} ∈ R for all i ∈ [n] with n = 100. For the convolutional network, we use the Se- meion data set (Srl & Brescia, 1994) and a random data set. The random data set was created by randomly generating each pixel of the input im- ∈ R16×16×1 from the standard normal distribution and by sampling age xi yi uniformly from {0, 1} for all i ∈ [n] with n = 1000. We set the activation functions of all layers to be softplus ¯σ (z) = ln(1 + exp(ςz))/ς with ς = 100, which approximately behaves as the ReLU activation as shown in appendix C in the supplementary information. See appendix B in the supplementary information for more details of the experimental settings. The results of the experiments are presented in Figure 1. In each panel of the figure, the training errors and the values of QT are plotted over time T. Here, QT counts the number of Y(cid:3) not satisfying the condition vec(Y(cid:3)) ∈ Col( ) during t ∈ {0, 1, . . . , T} and is defined by ∂vec( fX (θ t )) ∂θ t QT = T(cid:2) t=0 (cid:9) 1 vec(Y(cid:3)) /∈ Col (cid:10) ∂vec( fX (θ t )) ∂θ t (cid:11)(cid:12) . (2.5) Figure 1a shows the results for the fully connected network. For the two- moons data set, the network achieved the zero training error with QT = 0 for all T (i.e., vec(Y(cid:3)) ∈ Col( ) for all T). For the sine wave data set, it obtained high training errors with QT = T for all T (i.e., vec(Y(cid:3)) /∈ ) for all T). This is consistent with our theory. Our theory ex- Col( plains that what makes a data set easy to be fitted or not is whether the condition vec(Y(cid:3)) ∈ Col( ) is satisfied or not. ∂vec( fX (θ T )) ∂θ T ∂vec( fX (θ T )) ∂θ T ∂vec( fX (θ t )) ∂θ t Figures 1b and 1c show the results for the convolutional networks with two random initial points using two different random seeds. In the fig- ure panels, we report the training behaviors with different network sizes mc = 1, 2, and 4; the number of convolutional filters per convolutional layer is 8 × mc and the number of neurons per fully connected hidden layer is 128 × mc. As can be seen, with the Semeion data set, the networks of all sizes achieved zero error with QT = 0 for all T. With the random data set, l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u n e c o a r t i c e - p d / l f / / / / 3 4 4 9 9 1 2 0 0 3 0 8 5 n e c o _ a _ 0 1 4 8 3 p d . / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 1000 K. Kawaguchi, L. Zhang, and Z. Deng l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u n e c o a r t i c e - p d / l f / Figure 1: Training error and the value of QT over time steps T. The legend of panel b is shown in panel c. The value of QT measures the number of Y(cid:3) not satisfying the condition of vec(Y(cid:3)) ∈ Col( ). This figure validates our theoretical understanding that the bound on the global optimality gap de- creases at any iteration when QT does not increase at the iteration—that is, when the QT value is flat. The minimum eigenvalue of the matrix M(θ t ) = ∂vec( fX (θ )) )(cid:6) is zero at all iterations in Figures 1b and 1c for all cases ∂θ with mc ∂vec( fX (θ )) ( ∂θ = 1. ∂vec( fX (θt )) ∂θt / / / 3 4 4 9 9 1 2 0 0 3 0 8 5 n e c o _ a _ 0 1 4 8 3 p d . / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 the deep networks yielded the zero training error whenever QT is not lin- early increasing over the time or, equivalently, whenever the condition of vec(Y(cid:3)) ∈ Col( ) holds sufficiently many steps T. This is consistent with our theory. ∂vec( fX (θ T )) ∂θ T Finally, we also confirmed that gradient representation significantly from the initial one ∂ f (x,θ 0 ) ∂θ 0 in our experiments. That is, the ∂ f (x,θ t ) ∂θ t changed Understanding Nonlinear Representation Learning 1001 Table 1: The Change of the Gradient Representation during Training, (cid:13)M(θ ˆT ) − ∂vec( fX (θ )) M(θ 0)(cid:13)2 ∂θ )(cid:6) and ˆT Is the Last Time Step. F, Where M(θ ) := ∂vec( fX (θ )) ∂θ ( a. Fully-Connected Network Data Set Two moons Sine wave 2.09 × 1011 3.95 × 109 b. Convolutional Network mc = 4 mc = 2 mc = 1 seed#1 Data Set Semeion 8.09 × 1012 Random 3.73 × 1012 seed#2 5.19 × 1012 1.64 × 1012 seed#1 9.82 × 1012 3.43 × 107 seed#2 3.97 × 1012 4.86 × 1012 seed#1 2.97 × 1012 1.40 × 107 seed#2 5.41 × 1012 8.57 × 1011 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . values of (cid:13)M(θ T ) − M(θ 0)(cid:13)2 F were significantly large and tended to increase as T increases, where the matrix M(θ ) ∈ Rnmy×nmy is defined by M(θ ) = ∂vec( fX (θ )) )(cid:6). Table 1 summarizes the values of (cid:13)M(θ T ) − M(θ 0)(cid:13)2 ∂θ F ∂vec( fX (θ )) ∂θ ( at the end of the training. 2.3 Implications. In section 2.1.3, we showed that an uncommon model structure f (x, θ ) = θ 4 − 10θ 2 + 6θ + 100 does not satisfy assumption 1, and assumption 1 is not required for global convergence if the minimum eigen- value is nonzero. However, in practice, we typically use machine learning models that satisfy assumption 1 instead of the model f (x, θ ) = θ 4 − 10θ 2 + 6θ + 100, and the minimum eigenvalue is zero in many cases. In this con- text, theorem 1 provides the justification for common practice in nonlinear representation learning. Furthermore, theorem 1i contributes to the litera- ture by identifying the common model structure assumption (assumption 1) and the data-architecture alignment condition (definition 1) as the novel and practical conditions to ensure the global convergence even when the minimum eigenvalue becomes zero. Moreover, theorem 1ii shows that this condition is not arbitrary in the sense that it is also necessary to obtain the globally optimal models. Furthermore, the data-architecture alignment con- dition is strictly more general than the condition of the minimum eigen- value being nonzero, in the sense that the latter implies the former but not vice versa. Our new theoretical understanding based on the data-architecture align- ment condition can explain and deepen the previously known empirical observation that increasing network size tends to improve training be- haviors. Indeed, the size of networks seems to correlate well with the training error to a certain degree in Figure 1b. However, the size and the training error do not correlate well in Figure 1c. Our new theoretical under- standing explains that the training behaviors correlate more directly with / e d u n e c o a r t i c e - p d / l f / / / / 3 4 4 9 9 1 2 0 0 3 0 8 5 n e c o _ a _ 0 1 4 8 3 p d . / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 1002 K. Kawaguchi, L. Zhang, and Z. Deng ∂vec( fX (θ t )) ∂θ t the data-architecture alignment condition of vec(Y(cid:3)) ∈ Col( ) in- stead. The seeming correlation with the network size is indirect and caused by another correlation between the network size and the condition of vec(Y(cid:3)) ∈ Col( ) more likely tends to hold when the network size is larger because the matrix ∂vec( fX (θ t )) is of size nmy × d where d is the number of parameters: that is, by ∂θ t ) to increase ). That is, the condition of vec(Y(cid:3)) ∈ Col( increasing d, we can increase the column space Col( the chance of satisfying the condition of vec(Y(cid:3)) ∈ Col( ∂vec( fX (θ t )) ∂θ t ∂vec( fX (θ t )) ∂θ t ∂vec( fX (θ t )) ∂θ t ∂vec( fX (θ t )) ∂θ t that ∂vec( fX (θ )) ∂θ the minimum eigenvalue of the matrix M(θ t ) = Note ∂vec( fX (θ )) )(cid:6) is zero at all iterations in Figures 1b and 1c for ∂θ all cases of mc = 1. Thus, Figures 1b and 1c also illustrate the fact that while having the zero minimum eigenvalue of the matrix M(θ t ), the dynamics can achieve the global convergence under the data-architecture alignment condition. Moreover, because the multilayer neural network in the lazy training regime (Kawaguchi & Sun, 2021) achieves zero training errors for all data sets, Figure 1 additionally illustrates that our theoretical and empirical results apply to the models outside of the lazy training regime and can distinguish “good” data sets from “bad” data sets given a learning algorithm. ). ( In sum, our new theoretical understanding has the ability to explain and distinguish the successful case and failure case based on the data- architecture alignment condition for the common machine learning mod- els. Because the data-architecture alignment condition is dependent on data and architecture, theorem 1, along with our experimental results, shows why and when the global convergence in nonlinear representation learning is achieved based on the relationship between the data (X, Y) and architec- ture f . This new understanding is used in section 3 to derive a practical algorithm and is expected to be a basis for many future algorithms. 2.4 Details and Formalization of Theorem 1. This section presents the precise mathematical statements that formalize the informal description of theorem 1. In the following sections, we devide the formal version of theo- rem 1 into theorem 2, theorem 3, and proposition 1. 2.4.1 Preliminaries. Let (θ t )∞ t=0 be the sequence defined by θ t+1 = θ t − αt ¯gt with an initial parameter vector θ 0, a learning rate αt, and an update vector ¯gt. The analysis in this section relies on the following assumption on the update vector ¯gt: Assumption 2. There exist ¯c, c > 0 such that c(cid:13)∇L(θ t )(cid:13)2 ≤ ∇L(θ t )
(cid:13) ¯gt(cid:13)2 ≤ ¯c(cid:13)∇L(θ t )(cid:13)2 for any t ≥ 0.

(cid:6) ¯gt and

Assumption 2 is satisfied by using ¯gt = Dt∇L(θ t ), where Dt
positive-definite symmetric matrix with eigenvalues in the interval [C,

is any

¯c].

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

1003

If we set Dt = I, we have gradient descent, and assumption 2 is satisfied
with c = ¯c = 1. This section also uses the standard assumption of differen-
tiability and Lipschitz continuity:
Assumption 3. For every i ∈ [norte], the function (cid:3)
tiable and convex, the map fi : i (cid:8)→ f (xi
)(cid:13) ≤ L(cid:13)θ − θ (cid:16)(cid:13) for all θ , i (cid:16)
∇L(i (cid:16)

i : q (cid:8)→ (cid:3)(q, yi) is differen-
, i ) is differentiable, y (cid:13)∇L(i ) −

in the domain of L for some L ≥ 0.

The assumptions on the loss function in assumption 3 are satisfied by
using standard loss functions, including the squared loss, logistic loss, y
cross-entropy loss. Although the objective function L is nonconvex and
non-invex, the function q (cid:8)→ (cid:3)(q, yi) is typically convex.
, y∗
For any matrix Y∗ = (y∗
2
1

(cid:6) ∈ Rn×my , we define

, . . . , y∗
norte)

L∗

(Y

) = 1
norte

norte(cid:2)

yo=1

(cid:3) (y


i

, yi) .

(2.6)

Por ejemplo, for the squared loss (cid:3), the value of L∗
minimum value of L as

(Y(cid:3)) is at most the global

(2.7)

L∗

(Y(cid:3)) ≤ L(i ), ∀θ ∈ Rd,
(cid:3)

(cid:13)2
2

norte
yo=1

− yi

(Y(cid:3)) = 1
norte

∂vec( fX (i ))
∂θ

since L∗
(cid:13)yi
= 0 ≤ L(i ) ∀θ ∈ Rd. This letter also uses
the notation of [k] = {1, 2, . . . , k} for any k ∈ N+
y (cid:13) · (cid:13) = (cid:13) · (cid:13)
2 (Euclidean
norm). Finalmente, we note that for any η ∈ R, the condition of vec(Y(cid:3)) ∈
) is necessary to learn a near-global optimal model fX (i ) =
Columna(
ηY(cid:3):
Proposition 1. Suppose assumption 1 sostiene. If vec(Y(cid:3)) /∈ Col(
fX (i ) (cid:19)= ηY(cid:3) for any η ∈ R.
Prueba. All proofs of this letter are presented in appendix A in the supple-
(cid:2)
mentary information.

∂vec( fX (i ))
∂θ

), entonces

(ηY∗

) for any Y∗

such that vec(Y∗

2.4.2 Global Optimality at the Limit Point. The following theorem shows
that every limit point ˆθ of the sequence (θ t )t achieves a loss value L( ˆθ ) No
worse than infη∈R L∗
) para
all t ∈ [t, ∞) with some τ ≥ 0:
Teorema 2. Suppose assumptions 1 a 3 hold. Assume that the learning rate se-
quence (αt )t satisfies either (i) (cid:14) ≤ αt ≤ c(2−(cid:14))
for some (cid:14) > 0 o (ii) limt→∞ αt =
αt = ∞. Then for any Y∗ ∈ Rn×my , if there exists τ ≥ 0 such that
0 y
vec(Y∗
) for all t ∈ [t, ∞), every limit point ˆθ of the sequence
(θ t )t satisfies

(cid:3)∞
t=0
) ∈ Col(

∂vec( fX (θ t ))
∂θ t

∂vec( fX (θ t ))
∂θ t

) ∈ Col(

L ¯c

l( ˆθ ) ≤ L∗

(ηY

), ∀η ∈ R.

(2.8)

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1004

k. Kawaguchi, l. zhang, and Z. Deng

Por ejemplo, for the squared loss (cid:3)(q, y) = (cid:13)q − y(cid:13)2, theorem 2 implies

that every limit point ˆθ of the sequence (θ t )t is a global minimum as

l( ˆθ ) ≤ L(i ), ∀θ ∈ Rd,

(2.9)

if vec(Y(cid:3)) ∈ Col(
l( ˆθ ) ≤ L∗

∂vec( fX (θ t ))
∂θ t

) for t ∈ [t, ∞) with some τ ≥ 0. This is because

(Y(cid:3)) ≤ L(i ) ∀θ ∈ Rd from theorem 2 and equation 2.7.

En la práctica, one can easily satisfy all the assumptions in theorem 2 ex-
cept for the condition that vec(Y∗
) for all t ∈ [t, ∞). C.A-
) ∈ Col(
cordingly, we now weaken this condition by analyzing optimality at each
iteration so that the condition is verifiable in experiments.

∂vec( fX (θ t ))
∂θ t

2.4.3 Global Optimality Gap at Each Iteration. The following theorem states
that under standard settings, the sequence (θ t )t∈T converges to a loss value
no worse than infη∈R L∗
) at the rate of O(1/
semejante
that vec(Y∗) ∈ Col(
) for t ∈ T :
Teorema 3. Suppose assumptions 1 y 3 hold. Dejar (αt, ¯gt ) = ( 2a
l
an arbitrary α ∈ (0, 1). Then for any T ⊆ N
) for all t ∈ T , it holds that
Columna(

, ∇L(θ t )) con
0 and Y∗ ∈ Rn×my such that vec(Y∗) ∈

|t |) for any T and Y∗

(ηY∗
∂vec( fX (θ t ))
∂θ t

∂vec( fX (θ t ))
∂θ t

l(θ t ) ≤ L∗

(ηY

mín.
t∈T

) + 1√

|t |

(cid:13)

LζηL(θ t0 )
2a(1 − α)

,

(2.10)

for any η ∈ R, where t0
η)(cid:13)2),
ˆβ(i , η) := η((
i
k for all k ∈ S and ν(i )k
ción 1.

= min{t : t ∈ T }, ζη := 4 maxt∈T max((cid:13)norte(θ t )(cid:13)2, (cid:13) ˆβ(θ t,
∂vec( fX (i ))
=
∂θ
= 0 for all k /∈ S with the set S being defined in assump-

(cid:6) ∂vec( fX (i ))
)
∂θ

), and ν(i )k

∂vec( fX (i ))
∂θ

vec(Y∗

)†(

(cid:6)
)

For the squared loss (cid:3), theorem 3 implies the following for any T ≥ 1: para

any T ⊆ [t] such that vec(Y(cid:3)) ∈ Col(

∂vec( fX (θ t ))
∂θ t

) for all t ∈ T , tenemos

mín.
t∈[t]

l(θ t ) ≤ inf
θ ∈Rd

(cid:14)

l(i ) + oh(1/

|t |).

This is because L∗
≤ mint∈T L(θ t ) from T ⊆ [t].

(Y(cid:3)) ≤ infθ ∈Rd L(i ) from equations 2.7 and mint∈[t]

(2.11)

l(θ t )

Similarmente, for the binary and multiclass cross-entropy losses, theorem 3
implies the following for any T ≥ 1: for any T ⊆ [t] such that vec(Y(cid:3)) ∈
Columna(

) for all t ∈ T , we have that for any η ∈ R,

∂vec( fX (θ t ))
∂θ t

l(θ t ) ≤ L∗

mín.
t∈[t]

(cid:14)

(ηY(cid:3)) + oh(η2/

|t |).

(2.12)

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

1005

Given any desired (cid:14) > 0, since L∗
(ηY(cid:3)) → 0 as η → ∞, setting η to be suf-
ficiently large obtains the desired (cid:14) value as mint∈T L(θ t ) ≤ (cid:14) in equation
2.12 como

|t | → ∞.

3 Application to the Design of Training Framework

The results in the previous section show that the bound on the global
optimality gap decreases per iteration whenever the data-architecture
alignment condition holds. Using this theoretical understanding, en esto
sección, we propose a new training framework with prior guarantees while
learning hierarchical nonlinear representations without assuming the data-
architecture alignment condition. Como resultado, we made significant improve-
ments over the most closely related study on global convergence guarantees
(Kawaguchi & Sol, 2021). En particular, whereas the related study requires a
wide layer with a width larger than n, our results reduce the requirement to
norte. Por ejemplo, the MNIST data set has
a layer with a width larger than
norte = 60,000 and hence previous studies require 60,000 neurons at a layer,
60,000 ≈ 245 neurons at a layer. Our require-
whereas we only require
ment is consistent and satisfied by the models used in practice that typically
have from 256 a 1024 neurons for some layers.

We begin in section 3.1 with additional notations and then present the
training framework in section 3.2 and convergence analysis in section 3.3.
We conclude in section 3.4 by providing empirical evidence to support our
theory.

(yo)

3.1 Additional Notations. We denote by θ

∈ Rdl the vector of all the
trainable parameters at the lth layer for l = 1, . . . , H where H is the depth or
the number of trainable layers (es decir., one plus the number of hidden layers).
Eso es, the Hth layer is the last layer containing the trainable parameter
(h) at the last layer. For any pair (yo, yo(cid:16)) such that 1 ≤ l ≤ l(cid:16) ≤ H, nosotros
vector θ
define θ
(yo:yo(cid:16) )
(yo)
if l = l(cid:16)
. We consider a family of training algorithms that update the param-
eter vector θ as follows: for each l = 1, . . . , h,

(cid:16) —for example, i

(yo(cid:16) )](cid:6) ∈ Rdl:yo

= θ and θ

= [i (cid:6)
(yo)

, . . . , i (cid:6)

= θ

(yo:yo(cid:16) )

(1:h)

θ t+1
(yo)

= θ t
(yo)

− gt
(yo)

,

gt
(yo)

∼ G

(yo)(θ t, t),

(3.1)

(yo) outputs a distribution over the vector gt

where the function G
(yo) and dif-
fers for different training algorithms. Por ejemplo, para (minibatch) stochas-
tic gradient descent (SGD), gt
(yo) represents a product of a learning rate
(yo) at the time t. We define G =
and a stochastic gradient with respect to θ
(GRAMO

(h)) to represent a training algorithm.

, . . . , GRAMO
For an arbitrary matrix M ∈ Rm×m(cid:16)
tor in Rm, Mi∗ be its ith row vector in Rm(cid:16)
We define M ◦ M(cid:16)

, we let M∗ j be its jth column vec-
, and rank(METRO) be its matrix rank.
to be the Hadamard product of any matrices M and M(cid:16)
.

(1)

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1006

k. Kawaguchi, l. zhang, and Z. Deng

For any vector v ∈ Rm, we let diag(v ) ∈ Rm×m be the diagonal matrix with
diag(v )ii

i for i ∈ [metro]. We denote by Im the m × m identity matrix.

= v

3.2 Exploration-Exploitation Wrapper. En esta sección, we introduce the
exploration-exploitation (EE) wrapper A. The EE wrapper A is not a stand-
alone training algorithm. En cambio, it takes any training algorithm G as its
input and runs the algorithm G in a particular way to guarantee global con-
vergence. We note that the exploitation phase in the EE wrapper does not
optimize the last layer; en cambio, it optimizes hidden layers, whereas the ex-
ploration phase optimizes all layers. The EE wrapper allows us to learn the
representación
that differs significantly from the initial representa-
∂ f (X,i 0 )

∂θ 0 without making assumptions on the minimum eigenvalue of
the matrix
by leveraging the data-architecture align-
ment condition. The data-architecture alignment condition is ensured by
the safe-exploration condition (defined in section 3.3.1), which is time inde-
pendent and holds in practical common architectures (as demonstrated in
sección 3.4).

∂vec( fX (i ))
∂θ

∂vec( fX (i ))
∂θ

∂ f (X,θ t )
∂θ t

ción

(cid:6)
)

(

3.2.1 Main Mechanisms. Algoritmo 1 outlines the EE wrapper A. During
the exploration phase in lines 3 a 7 of algorithm 1, the EE wrapper A freely
explores hierarchical nonlinear representations to be learned without any
restricciones. Entonces, during the exploitation phase in lines 8 a 12, it starts
exploiting the current knowledge to ensure vec(Y(cid:3)) ∈ Col(
) for all
t to guarantee global convergence. The value of τ is the hyperparameter
that controls the time when it transitions from the exploration phase to the
exploitation phase.

∂vec( fX (θ t ))
∂θ t

In the exploitation phase, the wrapper A only optimizes the parame-
ter vector θ t
(H−1) en el (H − 1)th hidden layer, instead of the parameter
vector θ t
(h) at the last layer or the Hth layer. Despite this, the EE wrap-
per A is proved to converge to global minima of all layers in Rd. el ex-
ploitation phase still allows us to significantly change the representations
as M(θ t ) (cid:19)≈ M(θ τ
) for t > τ . This is because we optimize the hidden layers
instead of the last layer without any significant overparameterizations.

The exploitation phase uses an arbitrary optimizer ˜G with the update
(H−1)(θ t, t) with ˜gt = αt ˆgt ∈ RdH−1 . During the two phases, nosotros
vector ˜gt ∼ ˜G
can use the same optimizers (p.ej., SGD for both G and ˜G) or different opti-
mizers (p.ej., SGD for G and L-BFGS for ˜G).

3.2.2 Model Modification. This section defines the details of the model
modification at line 2 of algorithm 1. Given any base network ¯f , el
wrapper A first checks whether the last two layers of the given network
¯f are fully connected. If not, one or two fully connected last layers are
added such that the output of the network ¯f can be written by ¯f (X, ¯θ ) =

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

1007

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

(1:H−2))). Aquí, z(X, i

¯W (h) ¯σ ( ¯W (H−1)z(X, i
(1:H−2)) is the output of the (H − 2)th
capa, and the function z is arbitrary and can represent various deep net-
obras. Además, ¯σ is a nonlinear activation function, and ¯W (H−1) and ¯W (h)
are the weight matrices of the last two layers. The wrapper A then modi-
fies these last two layers as follows. In the case of my = 1, the model ¯f is
modified to

F (X, i ) = W (h)pag (W. (H−1), z(X, i

(1:H−2))),

(3.2)

where W (H−1) ∈ RmH ×mH−1 and W (h) ∈ Rmy×mH are the weight matrices of the
last two layers. The nonlinear activation σ is defined by σ (q, q(cid:16)
)
(qq(cid:16)
), where ˜σ is some nonlinear function. Por ejemplo, we can set ˜σ (q) =
q (sigmoid) with any hyperparameter ς (cid:16) > 0, for which it holds that as
1
1+e−ς (cid:16)
ς (cid:16) → ∞,

) = ˜σ (qq(cid:16)

F (X, i ) → W (h)relu(W. (H−1)z(X, i

(1:H−2))).

(3.3)

1008

k. Kawaguchi, l. zhang, and Z. Deng

We generalize equation 3.2 to the case of my ≥ 2 como

F (X, i ) j

= W (h)
j∗ σ

j(W. (H−1, j), z(X, i

(1:H−2))),

(3.4)

j

= σ until line 10. At line 10, the wrapper A replaces σ
R( j) (q, q(cid:16)

for j ∈ [mi] where W (H−1, j) ∈ RmH ×mH−1 is the weight matrix at the (H − 1)th
layer and σ
j by
pag
=
R( j) where σ
(W. (H−1, j))(cid:6). To consider the bias term, we include the constant neuron to the
(1:H−2))(cid:6), 1](cid:6) ∈ RmH−1 ,
output of the (H − 1)th layer as z(X, i
where ¯z(X, i

(1:H−2)) is the output without the constant neuron.

) con R( j) = θ τ

(1:H−2)) = [ ¯z(X, i

(H−1, j) and θ

) = ˜σ (R( j)q(cid:16)

) (qq(cid:16)

(H−1, j)

3.3 Convergence Analysis. En esta sección, we establish global conver-
gence of the EE wrapper A without using assumptions from the previous
sección. Let τ be an arbitrary positive integer and ε be an arbitrary posi-
tive real number. Dejar (θ t )
t=0 be a sequence generated by the EE wrapper A.
(h)) and B ¯(cid:14) = minθ

We define ˆL(i
(1:H−2)
θ τ
(H−1)), ¯(cid:14)) for any ¯(cid:14) ≥ 0.
(H−1)

(H−1)) = L(θ τ
¯(cid:14) = argminθ

, θ τ
, i
(H−1)
máximo( ˆL(i

(cid:13) dónde (cid:19)

(H−1)

∈(cid:19) ¯(cid:14)

(cid:13)i

(H−1)

(H−1)

3.3.1 Safe-Exploration Condition. The mathematical analysis in this sec-
tion relies on the safe-exploration condition, which allows us to safely
explore deep nonlinear representations in the exploration phase with-
out getting stuck in the states of vec(Y(cid:3)) /∈ Col(
). The safe-
exploration condition is verifiable, time-independent, data-dependent, y
architecture-dependent. The verifiability and time independence make the
assumption strong enough to provide prior guarantees before training. El
data dependence and architecture dependence make the assumption weak
enough to be applicable for a wide range of practical settings.

∂vec( fX (θ t ))
∂θ t

For any q ∈ RmH−1

×mH , we define the matrix-valued function φ(q, i

(1:H−2))

∈ Rn×mH mH−1 by



Fi(q, i

(1:H−2)) =

˜σ (z(cid:6)

˜σ (z(cid:6)

1

1 q∗1)z(cid:6)

n q∗1)z(cid:6)

norte

· · ·
. . .
· · ·



,

˜σ (z(cid:6)

˜σ (z(cid:6)

1

1 q∗mH )z(cid:6)

n q∗mH )z(cid:6)

norte

, i

= z(xi

(1:H−2)) ∈ RmH−1 and ˜σ (z(cid:6)

∈ R1×mH−1 for all i ∈ [norte] y
where zi
k ∈ [mH]. Using this function, the safe exploration condition is formally
stated as:
Assumption 4 (Safe-Exploration Condition). There exist a q ∈ RmH−1
and a θ

∈ Rd1:H−2 such that rank(Fi(q, i

i q∗k)z(cid:6)

(1:H−2))) = norte.

(1:H−2)

×mH

i

The safe-exploration condition asks for only the existence of one param-
(1:H−2))) = norte. Él

eter vector in the network architecture such that rank(Fi(q, i

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

1009

is not about the training trajectory (θ t )t. Since the matrix φ(q, i
(1:H−2)) es
of size n × mHmH−1, the safe-exploration condition does not require any
wide layer of size mH ≥ n or mH−1
≥ n. En cambio, it requires a layer of size
≥ n. This is a significant improvement over the most closely re-
mHmH−1
lated study (Kawaguchi & Sol, 2021) where the wide layer of size mH ≥
≥ n does not imply the safe-
n was required. Note that having mHmH−1
≥ n is a necessary condition to sat-
exploration condition. En cambio, mHmH−1
isfy the safe-exploration condition, whereas mH ≥ n or mH−1
≥ n was a nec-
essary condition to satisfy assumptions in previous papers, including the
most closely related study (Kawaguchi & Sol, 2021). The safe-exploration
condition is verified in experiments in section 3.4.

3.3.2 Additional Assumptions. We also use the following assumptions:

i(q) − ∇(cid:3)

Assumption 5. For any i ∈ [norte], the function (cid:3)
y (cid:13)(cid:3)
Assumption 6. For each i ∈ [norte], the functions θ
q (cid:8)→ ˜σ (q) are real analytic.

)(cid:13) ≤ L(cid:3)(cid:13)q − q(cid:16)(cid:13) for all q, q(cid:16) ∈ R.

i(q(cid:16)

i : q (cid:8)→ (cid:3)(q, yi) is differentiable,

(1:H−2)

(cid:8)→ z(xi

, i

(1:H−2)) y

dy

(cid:3)

(cid:3)

k=1 yk log exp(qk )
(cid:16) exp.(qk

Assumption 5 is satisfied by using standard loss functions such
as the squared loss (cid:3)(q, y) = (cid:13)q − y(cid:13)2 and cross-entropy loss (cid:3)(q, y) =

(cid:16) ) . The assumptions of the invexity and convex-
ity of the function q (cid:8)→ (cid:3)(q, yi) en secciones 3.3.3 y 3.3.4 also hold
for these standard loss functions. Using L(cid:3) in assumption 5, we de-
fine ˆL = L(cid:3)
(h, j))
norte
ImH−1 ](Fi(θ τ

(cid:13)z(cid:13)2, where Z ∈ Rn is defined by Zi
(cid:6)
= (W. (h)
j∗ )

= max j∈[mi]

(cid:6)(cid:13) with θ

(cid:13)[diag(θ τ

(1:H−2))i∗)

(H−1, j)

(h, j)

, θ τ

.

k

Assumption 6 is satisfied by using any analytic activation function
such as sigmoid, hyperbolic tangents, and softplus activations q (cid:8)→ ln(1 +
exp.(ςq))/ς with any hyperparameter ς > 0. This is because a composition
of real analytic functions is real analytic, and the following are all real ana-
lytic functions in θ
(1:H−2): the convolution, affine map, average pooling, skip
connection, and batch normalization. Por lo tanto, the assumptions can be sat-
isfied by using a wide range of machine learning models, including deep
neural networks with convolution, skip connection, and batch normaliza-
ción. Además, the softplus activation can approximate the ReLU activation
for any desired accuracy, eso es, ln(1 + exp.(ςq))/ς → relu(q) as ς → ∞,
where ReLU represents the ReLU activation.

3.3.3 Global Optimality at the Limit Point. The following theorem proves
the global optimality at limit points of the EE wrapper with a wide range
of optimizers, including gradient descent and modified Newton methods:
Teorema 4. Suppose assumptions 4 a 6 hold and that the function (cid:3)
i : q (cid:8)→
(cid:3)(q, yi) is invex for any i ∈ [norte]. Assume that there exist ¯c, c > 0 such that
(H−1))(cid:13)2 for any t ≥ τ .
C(cid:13)∇ ˆL(θ t

(cid:6) ˆgt and (cid:13) ˆgt(cid:13)2 ≤ ¯c(cid:13)∇ ˆL(θ t

(H−1))(cid:13)2 ≤ ∇ ˆL(θ t

(H−1))

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1010

k. Kawaguchi, l. zhang, and Z. Deng

Assume that the learning rate sequence (αt )t≥τ satisfies either (i) (cid:14) ≤ αt ≤ c(2−(cid:14))
ˆL ¯c
for some (cid:14) > 0 o (ii) limt→∞ αt = 0 y
t=τ αt = ∞. Then with probabil-
ity one, every limit point ˆθ of the sequence (θ t )t is a global minimum of L as
l( ˆθ ) ≤ L(i ) for all θ ∈ Rd.

(cid:3)∞

3.3.4 Global Optimality Gap at Each Iteration. We now present global con-
vergence guarantees of the EE wrapper A with gradient decent and SGD:
Teorema 5. Suppose assumptions 4 a 6 hold and that the function (cid:3)
i : q (cid:8)→
(cid:3)(q, yi) is convex for any i ∈ [norte]. Entonces, with probability one, the following two
statements hold:

(i) (Gradient descent) if ˆgt = ∇ ˆL(θ t

(H−1)) and αt = 1
ˆL

for t ≥ τ , then for any

¯(cid:14) ≥ 0 and t > τ ,

l(θ t ) ≤ inf
θ ∈Rd

¯(cid:14) ˆL
máximo(l(i ), ¯(cid:14)) + B2
2(t − τ )

.

(ii) (SGD) if E[ ˆgt|θ t] = ∇ ˆL(θ t
(cid:3)∞

t=τ (αt )2 < ∞ and (H−1)) (almost surely) with E[(cid:13) ˆgt(cid:13)2] ≤ G2, and t=τ αt = ∞ for t ≥ τ , then for any ¯(cid:14) ≥ 0 (cid:3)∞ if αt ≥ 0, and t > t ,

mi[l(θ t∗

)] ≤ inf
θ ∈Rd

máximo(l(i ), ¯(cid:14)) + B2

¯(cid:14) + G2
(cid:3)
t
2

(cid:3)
t

k=τ α2

k

k=τ α

k

(3.5)

where t∗ ∈ argmink∈{t,τ +1,…,t}l(θ k).

In theorem 5ii, with αt ∼ O(1/

t), the optimality gap becomes

mi[l(θ t

)] − inf
θ ∈Rd

máximo(l(i ), ¯(cid:14)) = ˜O(1/

t).

3.4 experimentos. This section presents empirical evidence to support
our theory and what is predicted by a well-known hypothesis. We note
that there is no related work or algorithm that can guarantee global con-
vergence in the setting of our experiments where the model has convolu-
ciones, skip connections, and batch normalizations without any wide layer
(of the width larger than n). Además, unlike any previous studies that pro-
pose new methods, our training framework works by modifying any given
método.

3.4.1 Sine Wave Data Set. We have seen in section 2.2 that gradient de-
scent gets stuck at suboptimal points for the sine wave data set. Using the
same setting as that in section 2.2 with ε = 0.01, τ = 2000, and ˜G = G, nosotros
confirm in Figure 2 that the EE wrapper A can modify gradient descent to
avoid suboptimal points and converge to global minima as predicted by our

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

1011

Cifra 2: Training errors for three random trials.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

Cifra 3: Changes of the representations where M(i ) := ∂vec( fX (i ))

∂θ

∂vec( fX (i ))
∂θ

(

)(cid:6).

theory. Cifra 3 shows the value of the change of the gradient representa-
ción, (cid:13)METRO(θ T ) − M(i 0)(cid:13)2
F, for each time step T. As can be seen, the values of
(cid:13)METRO(θ T ) − M(i 0)(cid:13)2
F are large for both methods. Notablemente, the EE wrapper A
of the base case significantly increases the value of (cid:13)METRO(θ T ) − M(i 0)(cid:13)2
F even
in the exploitation phase after τ = 2000 as we are optimizing the hidden
capa. See appendix B in the supplementary information for more details of
the experiments for the sine wave data set.

3.4.2 Image Data Sets. The standard convolutional ResNet with 18 capas
(Él, zhang, Ren, & Sol, 2016) is used as the base model ¯f . We use ResNet-
18 for the illustration of our theory because it is used in practice and it has
convolution, skip connections, and batch normalization without any width
larger than the number of data points. This setting is not covered by any of
the previous theories for global convergence. We set the activation to be the
softplus function q (cid:8)→ ln(1 + exp.(ςq))/ς with ς = 100 for all layers of the
base ResNet. This approximates the ReLU activation well, as shown in ap-
pendix C in the supplementary information. We employ the cross-entropy
loss and ˜σ (q) = 1
1+e−q . We use a standard algorithm, SGD, with its standard
hyperparameter setting for the training algorithm G with ˜G = G—that is,
−5, the mo-
we let the minibatch size be 64, the weight decay rate be 10
mentum coefficient be 0.9, the learning rate be αt = 0.1, and the last epoch

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1012

k. Kawaguchi, l. zhang, and Z. Deng

Mesa 2: Verification of the Safe-Exploration Condition (Assumption 4).

Data Set

norte

mH−1

MNIST
CIFAR-10
CIFAR-100
Semeion
KMNIST
SVHN

60,000
50,000
50,000
1,000
60,000
73,257

513
513
513
513
513
513

mH

234
195
195
4
234
286

Assumption 4

Verified
Verified
Verified
Verified
Verified
Verified

= (cid:23)2(n/mH−1 )(cid:24) where n is the number of
Nota: mH
training data, mH is the width of the last hidden layer,
and mH−1 is the width of the penultimate hidden layer.

0 ˆT were selected from ε ∈ {10

ˆT be 200 (with data augmentation) y 100 (without data augmentation).
−5} y
The hyperparameters ε and τ = τ
t
∈ {0.4, 0.6, 0.8} by only using training data. Eso es, we randomly divided
0
each training data set (100%) into a smaller training data set (80%), y un
validation data set (20%) for a grid search over the hyperparameters. Ver
appendix B in the supplementary information for the results of the grid
search and details of the experimental setting. This standard setting satis-
fies assumptions 5 y 6, leaving assumption 4 to be verified.

−3, 10

Verification of Assumption 4. Mesa 2 summarizes the verification results
of the safe-exploration condition. Because the condition only requires an
existence of a pair (i , q) satisfying the condition, we verified it by using
a randomly sampled q from the standard normal distribution and a θ re-
= 513
turned by a common initialization scheme (He et al., 2015). As mH−1
(512 + the constant neuron for the bias term) for the standard ResNet, nosotros
set mH = (cid:23)2(n/mH−1)(cid:24) throughout all the experiments with the ResNet. Para
each data set, the rank condition was verified twice by the two standard
methods: one from Press, Teukolsky, Vetterling, and Flannery (2007) y
another from Golub and Van Loan (1996).

Test Performance. One well-known hypothesis is that the success of deep-
learning methods partially comes from its ability to automatically learn
deep nonlinear representations suitable for making accurate predictions
from data (LeCun et al., 2015). As the EE wrapper A keeps this ability of
representation learning, the hypothesis suggests that the test performance
of the EE wrapper A of a standard method is approximately comparable to
that of the standard method. Unlike typical experimental studies, our ob-
jective here is to confirm this prediction instead of showing improvements
over a previous method. We empirically confirmed the prediction in Tables
3 y 4 where the numbers indicate the mean test errors (and standard de-
viations are in parentheses) over five random trials. As expected, the values

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

1013

Mesa 3: Test Error (%): Data Augmentation.

Data Set

Estándar

A(Estándar)

MNIST
CIFAR-10
CIFAR-100
Semeion
KMNIST
SVHN

0.40 (0.05)
7.80 (0.50)
32.26 (0.15)
2.59 (0.57)
1.48 (0.07)
4.67 (0.05)

0.30 (0.05)
7.14 (0.12)
28.38 (0.42)
2.56 (0.55)
1.36 (0.11)
4.43 (0.11)

Mesa 4: Test Error (%): No Data Augmentation.

Data Set

Estándar

A(Estándar)

MNIST
CIFAR-10
CIFAR-100

0.52 (0.16)
15.15 (0.87)
54.99 (2.29)

0.49 (0.02)
14.56 (0.38)
46.13 (1.80)

Cifra 4: Training Loss with Data Augmentation.

de (cid:13)METRO(θ ˆT ) − M(i 0)(cid:13)2
2 were also large—for example, 4.64 × 1012 for the stan-
dard method and 3.43 × 1012 for wrapper A of the method with the Semeion
data set.

Training Behavior. Cifra 4 shows that the EE wrapper A can improve
training loss values of the standard SGD algorithm in the exploitation phase
without changing its hyperparameters because ˜G = G in these experiments.
In the figure, the plotted lines indicate the mean values over five random
ensayos, and the shaded regions show error bars with 1 standard deviation.

Computational Time. The EE wrapper A runs the standard SGD G in the
exploration phase and the SGD ˜G = G only on the subset of the weights
i
(H−1) in the exploitation phase. De este modo, the computational time of the EE
wrapper A is similar to that of the SGD in the exploration phase, y eso

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1014

k. Kawaguchi, l. zhang, and Z. Deng

Mesa 5: Total Wall-Clock Time in a Local GPU Workstation.

Data Set

Estándar

A(Estándar)

Semeion
CIFAR-10

364.60 (0.94)
3616.92 (10.57)

356.82 (0.67)
3604.5 (6.80)

Mesa 6: Test Errors (%) of A(Estándar) with ˜G = L-BFGS.

(a) with data augmentation

(b) without data augmentation

t
0

ε

−3
−5

10
10

0.4

0.6

0.8

0.26
0.37

0.38
0.32

0.37
0.37

t
0

ε

−3
−5

10
10

0.4

0.6

0.36
0.42

0.43
0.35

0.8

0.42
0.35

tends to be faster than the SGD in the exploitation phase. To confirm this,
we measure computational time with the Semeion and CIFAR-10 data sets
under the same computational resources (p.ej., without running other jobs
in parallel) in a local workstation for each method. The mean wall-clock
tiempo (in seconds) over five random trials is summarized in Table 5, dónde
the numbers in parentheses are standard deviations. It shows that the EE
wrapper A is slightly faster than the standard method, as expected.

Effect of Learning Rate and Optimizer. We also conducted experiments on
the effects of learning rates and optimizers using the MNIST data set with
data augmentation. Using the best learning rate from {0.2, 0.1, 0.01, 0.001}
for each method (with ˜G = G = SGD), the mean test errors (%) over five ran-
dom trials were 0.33 (0.03) for the standard base method and 0.27 (0.03) para
the A wrapper of the standard base method (the numbers in parentheses
are standard deviations). Además, Mesa 6 reports the preliminary results
on the effect of optimizers with ˜G being set to the limited-memory Broyden–
Fletcher–Goldfarb–Shanno algorithm (L-BFGS) (with G = the standard
SGD). By comparing Tables 3 y 6, we can see that using a different op-
timizer in the exploitation phase can potentially lead to performance im-
provements. A comprehensive study of this phenomenon is left to future
trabajar.

4 Conclusión

Despite the nonlinearity of the dynamics and the noninvexity of the objec-
tivo, we have rigorously proved convergence of training dynamics to global
minima for nonlinear representation learning. Our results apply to a wide
range of machine learning models, allowing both underparameterization

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

1015

and overparameterization. Por ejemplo, our results are applicable to the
)(cid:6) es
case where the minimum eigenvalue of the matrix
zero for all t ≥ 0. Under the common model structure assumption, modelos
that cannot achieve zero error for all data sets (except some “good” data
conjuntos) are shown to achieve global optimality with zero error exactly when
the dynamics satisfy the data-architecture alignment condition. Nuestros resultados
provide guidance for choosing and designing model structure and algo-
rithms via the common model structure assumption and data-architecture
alignment condition.

∂vec( fX (θ t ))
∂θ t

∂vec( fX (θ t ))
∂θ t

(

The key limitation in our analysis is the differentiability of the func-
tion f . For multilayer neural networks, this is satisfied by using stan-
dard activation functions, such as softplus, sigmoid, and hyperbolic tan-
gents. Whereas softplus can approximate ReLU arbitrarily well, the direct
treatment of ReLU in nonlinear representation learning is left to future
trabajar.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

Our theoretical results and numerical observations uncover novel math-
ematical properties and provide a basis for future work. Por ejemplo,
we have shown global convergence under the data-architecture alignment
. The EE wrapper A is only one way
condition vec(Y(cid:3)) ∈ Col
to ensure this condition. There are many other ways to ensure the data-
architecture alignment condition, and each way can result in a new algo-
rithm with guarantees.

(cid:4) ∂vec( fX (θ t ))

∂θ t

(cid:5)

Referencias

Bartlett, PAG. l., Helmbold, D. PAG., & Largo, PAG. METRO. (2019). Gradient descent with identity
initialization efficiently learns positive-definite linear transformations by deep
residual networks. Computación neuronal, 31(3), 477–502. 10.1162/neco_a_01164,
PubMed: 30645179

bengio, y., Courville, A., & Vincent, PAG. (2013). Representation learning: A review and
new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence,
35(8), 1798–1828. 10.1109/TPAMI.2013.50, PubMed: 23787338

bengio, y., Lamblin, PAG., Popovici, D., & Larochelle, h. (2007). Greedy layerwise train-
ing of deep networks. In B. Schölkopf, j. Platón, & t. Hoffman (Editores.), Avances en
sistemas de procesamiento de información neuronal, 19 (pag. 153). Red Hook. Nueva York: Curran.

Bordes, A., Glorot, X., Weston, J., & bengio, Y. (2012). Joint learning of words and
meaning representations for open-text semantic parsing. En Actas de la
15th International Conference on Artificial Intelligence and Statistics, 22 (páginas. 127–
135).

Ciregan, D., Meier, Ud., & Schmidhuber, j. (2012). Multi-column deep neural networks
for image classification. En Actas de la 2012 IEEE Conference on Computer Vi-
sion and Pattern Recognition (páginas. 3642–3649). Piscataway, Nueva Jersey: IEEE.

Clanuwat, T., Bober-Irizar, METRO., Kitamoto, A., Lamb, A., Yamamoto, K., & Ha, D.
(2019). Deep learning for classical Japanese literature. In NeurIPS Creativity Work-
shop 2019.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1016

k. Kawaguchi, l. zhang, and Z. Deng

Dahl, GRAMO., Ranzato, METRO., mohamed, A.-r., & Hinton, GRAMO. mi. (2010). Phone recogni-
tion with the mean-covariance restricted Boltzmann machine. In J. Lafferty, C.
williams, j. Shawe-Taylor, R. Zemel, & A. Culotta (Editores.), Advances in neural infor-
mation processing systems, 23 (páginas. 469–477). Red Hook, Nueva York: Curran.

Dahl, GRAMO. MI., Yu, D., Deng, l., & Acero, A. (2011). Context-dependent pre-trained deep
neural networks for large-vocabulary speech recognition. IEEE Transactions on
Audio, Discurso, and Language Processing, 20(1), 30–42. 10.1109/TASL.2011.2134090
Deng, l., Seltzer, METRO. l., Yu, D., Acero, A., mohamed, A.-r., & Hinton, GRAMO. (2010). Binary
coding of speech spectrograms using a deep auto-encoder. En Actas de la
Eleventh Annual Conference of the International Speech Communication Association.
Red Hook, Nueva York: Curran.

Dong, C., Loy, C. C., Él, K., & Espiga, X. (2014). Learning a deep convolutional net-
work for image super-resolution. In Proceedings of the European Conference on Com-
puter Vision (páginas. 184–199). Berlina: Saltador.

Gatys, l. A., Ecker, A. S., & Bethge, METRO. (2016). Image style transfer using convolu-
tional neural networks. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (páginas. 2414–2423). Piscataway, Nueva Jersey: IEEE.

Glorot, X., Bordes, A., & bengio, Y. (2011). Domain adaptation for large-scale senti-
ment classification: A deep learning approach. In Proceedings of the International
Conference on Machine Learning. Nueva York: ACM.

Golub, GRAMO. h., & Van Loan, C. F. (1996). Matrix computations. baltimore, Maryland: Johns

Hopkins University Press.

Él, K., zhang, X., Ren, S., & Sol, j. (2015). Delving deep into rectifiers: Surpassing
human-level performance on ImageNet classification. In Proceedings of the IEEE
International Conference on Computer Vision (páginas. 1026–1034). Piscataway, Nueva Jersey: IEEE.
Él, K., zhang, X., Ren, S., & Sol, j. (2016). Identity mappings in deep residual net-
obras. In Proceedings of the European Conference on Computer Vision (páginas. 630–645).
Berlina: Saltador.

Hinton, GRAMO., Deng, l., Yu, D., Dahl, GRAMO. MI., mohamed, A.-r., Jaitly, NORTE., . . . Kingsbury,
B. (2012). Deep neural networks for acoustic modeling in speech recognition: El
shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–
97. 10.1109/MSP.2012.2205597

Hinton, GRAMO. MI., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep
belief nets. Computación neuronal, 18(7), 1527–1554. 10.1162/neco.2006.18.7.1527,
PubMed: 16764513

Kawaguchi, k. (2016). Deep learning without poor local minima. In D. Sotavento, METRO.
Sugiyama, Ud.. Luxburg, I. Guyon, & R. Garnett (Editores.), Advances in neural infor-
mation processing systems, 29 (páginas. 586–594). Red Hook, Nueva York: Curran.

Kawaguchi, k. (2021). On the theory of implicit deep learning: Global conver-
gence with implicit layers. In Proceedings of the International Conference on Learning
Representaciones.

Kawaguchi, K., Kaelbling, l. PAG., & Lozano-Pérez, t. (2015). Bayesian optimization
with exponential convergence. In C. Cortes, norte. lorenzo, D. Sotavento, METRO. Sugiyama,
& R. Garnett (Editores.), Advances in neural information processing systems, 28 (páginas. 2809–
2817). Red Hook. Nueva York: Curran.

Kawaguchi, K., Maruyama, y., & Zheng, X. (2016). Global continuous optimization
with error bound and fast convergence. Journal of Artificial Intelligence Research,
56, 153–195. 10.1613/jair.4742

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Understanding Nonlinear Representation Learning

1017

Kawaguchi, K., & Sol, q. (2021). A recipe for global convergence guarantee in deep
neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 35
(páginas. 8074–8082). Palo Alto, California: AAAI.

Krizhevsky, A., & Hinton, GRAMO. (2009). Learning multiple layers of features from tiny images

(Technical report). Citeseer.

Krizhevsky, A., Sutskever, I., & Hinton, GRAMO. mi. (2012). ImageNet classification with
deep convolutional neural networks. In F. Pereira, C. j. C. Burges, l. Bottou, &
k. q. Weinberger (Editores.), Advances in neural information processing systems, 25 (páginas.
1097–1105). Red Hook. Nueva York: Curran.

Laurent, T., & Brecht, j. (2018). Deep linear networks with arbitrary loss: All local
minima are global. In Proceedings of the International Conference on Machine Learning
(páginas. 2902–2907).

Le, H.-S., Oparin, I., Allauzen, A., Gauvain, J.-L., & Yvon, F. (2012). Structured output
layer neural network language models for speech recognition. IEEE Transactions
on Audio, Discurso, and Language Processing, 21(1), 197–206.

LeCun, y., bengio, y., & Hinton, GRAMO. (2015). Deep learning. Naturaleza, 521(7553), 436–444.

10.1038/nature14539, PubMed: 26017442

LeCun, y., Bottou, l., bengio, y., & Haffner, PAG. (1998). Gradient-based learning
applied to document recognition. In Proceedings of the IEEE, 86(11), 2278–2324.
10.1109/5.726791

Luan, F., París, S., Shechtman, MI., & Bala, k. (2017). Deep photo style transfer. En
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (páginas.
4990–4998). Piscataway, Nueva Jersey: IEEE.

Mityagin, B. (2015). The zero set of a real analytic function. arXiv:1512.07276.
mohamed, A.-r., Dahl, GRAMO. MI., & Hinton, GRAMO. (2011). Acoustic modeling using deep
belief networks. IEEE Transactions on Audio, Discurso, and Language Processing, 20(1),
14–22. 10.1109/TASL.2011.2109382

Netzer, y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading
digits in natural images with unsupervised feature learning. In NIPS Workshop
on Deep Learning and Unsupervised Feature Learning.

Paszke, A., Bruto, S., Massa, F., Lerer, A., Bradbury, J., Chanan, GRAMO., . . . Chintala, S.
(2019a). Pytorch: An imperative style, high-performance deep learning library. En
h. Wallach, h. Larochelle, A. Beygelzimer, F. d’Alché-Buc, mi. Fox, & R. Garnett
(Editores.), Advances in neural information processing systems, 32 (páginas. 8026–8037). Red
Hook. Nueva York: Curran.

Paszke, A., Bruto, S., Massa, F., Lerer, A., Bradbury, J., Chanan, GRAMO., . . . Chintala, S.
(2019b). Pytorch: An imperative style, high-performance deep learning library. En
h. Wallach, h. Larochelle, A. Beygelzimer, F. d’Alché-Buc, mi. Fox, & R. Garnett
(Editores.), Advances in neural information processing systems, 32 (páginas. 8024–8035). Red
Hook. Nueva York: Curran.

Pedregosa, F., Varoquaux, GRAMO., Gramfort, A., Michel, v., Thirion, B., Grisel, . . . Duches-
nay, MI., (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learn-
ing Research, 12, 2825–2830.

Pogio, T., Kawaguchi, K., Liao, P., Miranda, B., Rosasco, l., Boix, X., Hidary, J., &
Mhaskar, h. (2017). Theory of deep learning III: Explaining the non-overfitting puzzle.
arXiv:1801.00173.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1018

k. Kawaguchi, l. zhang, and Z. Deng

Prensa, W.. h., Teukolsky, S. A., Vetterling, W.. T., & Flannery, B. PAG. (2007). Numerical
recipes: The art of scientific computing. (3tercera ed.). Cambridge: Cambridge University
Prensa.

Rifai, S., Dauphin, Y. NORTE., Vincent, PAG., bengio, y., & Muller, X. (2011). The manifold tan-
gent classifier. In J. Shawe-Taylor, R. Zemel, PAG. Bartlett, F. Pereira, & k. q. Wein-
berger (Editores.), Advances in neural information processing systems, 24 (páginas. 2294–2302).
Red Hook. Nueva York: Curran.

sajonia, A. METRO., McClelland, j. l., & Ganguli, S. (2014). Exact solutions to the nonlinear
dynamics of learning in deep linear neural networks. In Proceedings of the Inter-
national Conference on Learning Representations.

Schwenk, h., Rousseau, A., & Attik, METRO. (2012). Large, pruned or continuous space
language models on a GPU for statistical machine translation. En procedimientos de
the NAACL-HLT 2012 Taller: Will We Ever Really Replace the N-gram Model? On
the Future of Language Modeling for HLT (páginas. 11–19). Stroudsburg, Pensilvania: Asociación
para Lingüística Computacional.

Seide, F., li, GRAMO., & Yu, D. (2011). Conversational speech transcription using context-
dependent deep neural networks. In Proceedings of the Twelfth Annual Conference
of the International Speech Communication Association. Nueva York: ACM.

Socher, r., Huang, mi. h., Pennington, J., Ng, A. y., & Manning, C. D. (2011). Dy-
namic pooling and unfolding recursive autoencoders for paraphrase detection.
In J. Shawe-Taylor, R. Zemel, PAG. Bartlett, F. Pereira, & k. q. Weinberger (Editores.),
Advances in neural information processing systems, 24 (páginas. 801–809). Red Hook, Nueva York:
Curran.

Socher, r., Pennington, J., Huang, mi. h., Ng, A. y., & Manning, C. D. (2011). Semi-
supervised recursive autoencoders for predicting sentiment distributions. En profesional-
cesiones de la 2011 Conference on Empirical Methods in Natural Language Processing
(páginas. 151–161). Stroudsburg, Pensilvania: Asociación de Lingüística Computacional.
Srl, B. T., & Brescia, I. (1994). Semeion handwritten digit data set. Roma, Italia: Semeion

Research Center of Sciences of Communication.

Tornero, C. r., Fuggetta, A., Lavazza, l., & Lobo, A. l. (1999). A conceptual ba-
sis for feature engineering. Journal of Systems and Software, 49(1), 3-15. 10.1016/
S0164-1212(99)00062-X

Xu, K., zhang, METRO., Jegelka, S., & Kawaguchi, k. (2021). Optimization of graph neural
redes: Implicit acceleration by skip connections and more depth. En curso-
ings of the International Conference on Machine Learning.

Zheng, A., & Casari, A. (2018). Feature engineering for machine learning: Principles and

techniques for data scientists. Sebastopol, California: O’Reilly Media.

Zou, D., Largo, PAG. METRO., & Gu, q. (2020). On the global convergence of training
deep linear ResNets. In Proceedings of the International Conference on Learning
Representaciones.

Received July 16, 2021; accepted November 24, 2021.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi

pag
d

/

yo

F
/

/

/

/

3
4
4
9
9
1
2
0
0
3
0
8
5
norte
mi
C
oh
_
a
_
0
1
4
8
3
pag
d

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3LETTER image

Descargar PDF