Communicated by Alexander Schwing

Deep Restricted Kernel Machines Using Conjugate
Feature Duality

Johan A. K. Suykens
johan.suykens@esat.kuleuven.be
KU Leuven ESAT-STADIUS, B-3001 Leuven, Belgium

Neural Computation 29, 2123–2163 (2017)  doi:10.1162/NECO_a_00984
© 2017 Massachusetts Institute of Technology. Published under a Creative Commons Attribution 3.0 Unported (CC BY 3.0) license.

The aim of this letter is to propose a theory of deep restricted kernel
machines offering new foundations for deep learning with kernel machines.
From the viewpoint of deep learning, it is partially related to
restricted Boltzmann machines, which are characterized by visible and
hidden units in a bipartite graph without hidden-to-hidden connections,
and to deep learning extensions such as deep belief networks and deep
Boltzmann machines. From the viewpoint of kernel machines, it includes
least squares support vector machines for classification and regression,
kernel principal component analysis (PCA), matrix singular value
decomposition, and Parzen-type models. A key element is to first
characterize these kernel machines in terms of so-called conjugate
feature duality, yielding a representation with visible and hidden units.
It is shown how this is related to the energy form in restricted
Boltzmann machines, with continuous variables in a nonprobabilistic
setting. In this new framework of so-called restricted kernel machine
(RKM) representations, the dual variables correspond to hidden features.
Deep RKMs are obtained by coupling the RKMs. The method is illustrated
for a deep RKM consisting of three levels, with a least squares support
vector machine regression level and two kernel PCA levels. In its primal
form, deep feedforward neural networks can also be trained within this
framework.

1 Introduction

Deep learning has become an important method of choice in several
research areas including computer vision, speech recognition, and
language processing (LeCun, Bengio, & Hinton, 2015). Among the existing
techniques in deep learning are deep belief networks, deep Boltzmann
machines, convolutional neural networks, stacked autoencoders with
pretraining and fine-tuning, and others (Bengio, 2009; Goodfellow,
Bengio, & Courville, 2016; Hinton, 2005; Hinton, Osindero, & Teh, 2006;
LeCun et al., 2015; Lee, Grosse, Ranganath, & Ng, 2009; Salakhutdinov,
2015; Schmidhuber, 2015; Srivastava & Salakhutdinov, 2014; Chen, Schwing,
Yuille, & Urtasun, 2015; Jaderberg, Simonyan, Vedaldi, & Zisserman, 2014;
Schwing & Urtasun, 2015; Zheng et al., 2015). Support vector machines (SVMs) and
kernel-based methods have made a large impact on a wide range of
application fields, together with finding strong foundations in
optimization and learning theory (Boser, Guyon, & Vapnik, 1992; Cortes &
Vapnik, 1995; Rasmussen & Williams, 2006; Schölkopf & Smola, 2002;
Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2002; Vapnik,
1998; Wahba, 1990). Therefore, one can pose the question: Which synergies
or common foundations could be developed between these different
directions? There has already been exploration of such synergies, for
example, in kernel methods for deep learning (Cho & Saul, 2009), deep
gaussian processes (Damianou & Lawrence, 2013; Salakhutdinov & Hinton,
2007), convolutional kernel networks (Mairal, Koniusz, Harchaoui, &
Schmid, 2014), multilayer support vector machines (Wiering & Schomaker,
2014), and mathematics of the neural response (Smale, Rosasco, Bouvrie,
Caponnetto, & Poggio, 2010), among others.

In this letter, we present a new theory of deep restricted kernel
machines (deep RKM), offering foundations for deep learning with kernel
machines. It partially relates to restricted Boltzmann machines (RBMs),
which are used within deep belief networks (Hinton, 2005; Hinton et al.,
2006). In RBMs, one considers a specific type of Markov random field,
characterized by a bipartite graph consisting of a layer of visible units
and another layer of hidden units (Bengio, 2009; Fischer & Igel, 2014;
Hinton et al., 2006; Salakhutdinov, 2015). In RBMs, which are related to
harmoniums (Smolensky, 1986; Welling, Rosen-Zvi, & Hinton, 2004), there
are no connections between the hidden units (Hinton, 2005), and often
also no visible-to-visible connections. In deep belief networks, the
hidden units of a layer are mapped to a next layer in order to create a
deep architecture. In RBMs, one considers stochastic binary variables
(Ackley, Hinton, & Sejnowski, 1985; Hertz, Krogh, & Palmer, 1991), and
extensions have been made to gaussian-Bernoulli variants (Salakhutdinov,
2015). Hopfield networks (Hopfield, 1982) take continuous values, and a
class of Hamiltonian neural networks has been studied in DeWilde (1993).
Also, discriminative RBMs have been studied where the class labels are
considered at the level of visible units (Fischer & Igel, 2014;
Larochelle & Bengio, 2008). In all of these methods the energy function
plays an important role, as it also does in energy-based learning methods
(LeCun, Chopra, Hadsell, Ranzato, & Huang, 2006).

Representation learning issues are considered to be important in deep
learning (Bengio, Courville, & Vincent, 2013). The method proposed in
this letter makes a link to restricted Boltzmann machines by
characterizing several kernel machines by means of so-called conjugate
feature duality. Duality is important in the context of support vector
machines (Boser et al., 1992; Cortes & Vapnik, 1995; Vapnik, 1998;
Suykens et al., 2002; Suykens, Alzate, & Pelckmans, 2010), optimization
(Boyd & Vandenberghe, 2004; Rockafellar, 1987), and in mathematics and
physics in general. Here we consider hidden features conjugated to part
of the unknown variables. This part of
the formulation is linked to a restricted Boltzmann machine energy
expression, though with continuous variables in a nonprobabilistic
setting. In this way, a model can be expressed in both its primal
representation and its dual representation and be given an interpretation
in terms of visible and hidden units, in analogy with RBMs. The primal
representation contains the feature map, while the dual model
representation is expressed in terms of the kernel function and the
conjugated features.

The class of kernel machines discussed in this letter includes least
squares support vector machines (LS-SVM) for classification and
regression, kernel principal component analysis (kernel PCA), matrix
singular value decomposition (matrix SVD), and Parzen-type models. These
have been previously conceived within a primal and Lagrange dual setting
in Suykens and Vandewalle (1999b), Suykens et al. (2002), Suykens, Van
Gestel, Vandewalle, and De Moor (2003), and Suykens (2013, 2016). Other
examples are kernel spectral clustering (Alzate & Suykens, 2010; Mall,
Langone, & Suykens, 2014), kernel canonical correlation analysis (Suykens
et al., 2002), and several others, which will not be addressed in this
letter but can be the subject of future work. In this letter, we give a
different characterization for these models, based on a property of
quadratic forms, which can be verified through the Schur complement form.
The property relates to a specific case of Legendre-Fenchel duality
(Rockafellar, 1987). Also note that in classical mechanics, converting a
Lagrangian into a Hamiltonian formulation is done by a Legendre
transformation (Goldstein, Poole, & Safko, 2002).

The kernel machines with conjugate feature representations are then used
as building blocks to obtain the deep RKM by coupling the RKMs. The deep
RKM becomes unrestricted after coupling the RKMs. The approach is
explained for a model with three levels, consisting of two kernel PCA
levels and a level with LS-SVM classification or regression. The
conjugate features of level 1 are taken as input of level 2 and,
subsequently, the features of level 2 as input for level 3. The objective
of the deep RKM is the sum of the objectives of the RKMs in the different
levels. The characterization of the stationary points leads to solving a
set of nonlinear equations in the unknowns, which is computationally
expensive. However, for the case of linear kernels in part of the levels,
it reveals how kernel fusion takes place over the different levels. For
this case, a heuristic algorithm with level-wise solving is obtained. For
the general nonlinear case, a reduced-set algorithm with estimation in
the primal is proposed.

In this letter, we make a distinction between levels and layers. We use
the terminology of levels to indicate the depth of the model. The
terminology of layers is used here in connection with the feature map.
Suykens and Vandewalle (1999a) showed how a multilayer perceptron can be
trained by a support vector machine method. It is done by defining the
hidden layer to be equal to the feature map. In this way, the hidden
layer is treated at the level of the feature map and the kernel
parameters. Suykens et al. (2002) explained that in SVM and LS-SVM
models, one can have a neural networks
interpretation in both the primal and the dual. The number of hidden
units in the primal equals the dimension of the feature space, while in
the dual representation it equals the number of support vectors. In this
way, it provides a setting to work with parametric models in the primal
and kernel-based models in the dual. Therefore, we also illustrate in
this letter how deep multilayer feedforward neural networks can be
trained within the deep RKM framework. While in classical
backpropagation (Rumelhart, Hinton, & Williams, 1986) one typically
learns the model by specifying a single objective (e.g., unless imposing
additional stability constraints to obtain stable multilayer recurrent
networks with dynamic backpropagation; Suykens, Vandewalle, & De Moor,
1995), in the deep RKM the objective function consists of the different
objectives related to the different levels.

In summary, we aim at contributing to the following challenging questions
in this letter:

• Can we find new synergies and foundations between SVM and kernel
  methods and deep learning architectures?
• Can we extend primal and dual model representations, as occurring
  in SVM and LS-SVM models, from shallow to deep architectures?
• Can we handle deep feedforward neural networks and deep kernel
  machines within a common setting?

In order to address these questions, this letter is organized as follows.
Section 2 outlines the context of this letter with a brief introductory
part on restricted Boltzmann machines, SVMs, LS-SVMs, kernel PCA, and
SVD. In section 3 we explain how these kernel machines can be
characterized by conjugate feature duality with visible and hidden units.
In section 4, deep restricted kernel machines are explained for three
levels: an LS-SVM regression level and two additional kernel PCA levels.
In section 5, different algorithms are proposed for solving in either the
primal or the dual, where the former will be related to deep feedforward
neural networks and the latter to kernel-based models. Illustrations with
numerical examples are given in section 6. Section 7 concludes the letter.

2 Preliminaries and Context

In this section, we explain basic principles of restricted Boltzmann
machines, SVMs, LS-SVMs, and related formulations for kernel PCA and SVD.
These are basic ingredients needed before introducing restricted kernel
machines in section 3.

2.1 Restricted Boltzmann Machines. An RBM is a specific type of Markov
random field, characterized by a bipartite graph consisting of a layer of
visible units and another layer of hidden units (Bengio, 2009; Fischer &
Igel, 2014; Hinton et al., 2006; Salakhutdinov, 2015), without
hidden-to-hidden connections.

Figure 1: Restricted Boltzmann machine consisting of a layer of visible
units v and a layer of hidden units h. They are interconnected through
the interaction matrix W, depicted in blue.

Both the visible and hidden variables, denoted by v and h, respectively,
have stochastic binary units with value 0 or 1. A joint state {v, h} is
defined for these visible and hidden variables with energy (see Figure 1)

E(v, h; θ) = −v^T W h − c^T v − a^T h,   (2.1)

where θ = {W, c, a} are the model parameters, W is an interaction weight
matrix, and c, a contain bias terms.

One then obtains the joint distribution over the visible and hidden
units as

P(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ)),   (2.2)

with the partition function

Z(θ) = Σ_v Σ_h exp(−E(v, h; θ))

for normalization.

Thanks to the specific bipartite structure, one can obtain an explicit
expression for the marginalization P(v; θ) = (1/Z(θ)) Σ_h exp(−E(v, h; θ)).
The conditional distributions are obtained as

P(h|v; θ) = Π_j p(h_(j)|v),    P(v|h; θ) = Π_i p(v_(i)|h),   (2.3)

where p(h_(j) = 1|v) = σ(Σ_i W_ij v_(i) + a_j) and
p(v_(i) = 1|h) = σ(Σ_j W_ij h_(j) + c_i), with σ(x) = 1/(1 + exp(−x)) the
logistic function. Here v_(i) and h_(j) denote the ith visible unit and
the jth hidden unit, respectively.

Because exact maximum likelihood for this model is intractable, a
contrastive divergence algorithm is used with the following update
equation for the weights,

ΔW = α ( E_{P_data}(v h^T) − E_{P_T}(v h^T) ),   (2.4)

with learning rate α and E_{P_data} the expectation with regard to the
data distribution P_data(h, v; θ) = P(h|v; θ) P_data(v), where P_data(v)
denotes the empirical distribution. Furthermore, E_{P_T} is the
distribution defined by running a Gibbs chain for T steps initialized at
the data. Often one takes T = 1, while T → ∞ recovers the maximum
likelihood approach (Salakhutdinov, 2015).
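
To make the update rule 2.4 concrete, the following minimal NumPy sketch performs one contrastive divergence step with T = 1 for a small Bernoulli RBM. It is not part of the original letter: the toy data, learning rate, and the bias updates (which follow the same contrastive pattern as ΔW) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, c, alpha=0.1):
    """One CD-1 step for a Bernoulli RBM on a batch of visible vectors V (N x d).
    W: d x m interaction matrix, a: hidden bias (m,), c: visible bias (d,)."""
    # positive phase: p(h | v) evaluated on the data
    ph_data = sigmoid(V @ W + a)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # one Gibbs step: reconstruct the visibles, then the hidden probabilities again
    pv_recon = sigmoid(h_sample @ W.T + c)
    ph_recon = sigmoid(pv_recon @ W + a)
    # equation 2.4 with T = 1: difference of v h^T expectations under data and model
    N = V.shape[0]
    dW = alpha * (V.T @ ph_data - pv_recon.T @ ph_recon) / N
    da = alpha * (ph_data - ph_recon).mean(axis=0)   # assumed bias update, same pattern
    dc = alpha * (V - pv_recon).mean(axis=0)         # assumed bias update, same pattern
    return W + dW, a + da, c + dc

# toy usage with random binary data
d, m, N = 6, 4, 20
V = (rng.random((N, d)) < 0.5).astype(float)
W, a, c = 0.01 * rng.standard_normal((d, m)), np.zeros(m), np.zeros(d)
W, a, c = cd1_update(V, W, a, c)
```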

In Boltzmann machines there are, in addition to visible-to-hidden, also
visible-to-visible and hidden-to-hidden interaction terms, with

E(v, h; θ) = −v^T W h − (1/2) v^T L v − (1/2) h^T G h   (2.5)

and θ = {W, L, G}, as explained in Salakhutdinov and Hinton (2009).

In section 3 we make a connection between the energy expression,
equation 2.1, and a new representation of least squares support vector
machines and related kernel machines, which will be made in terms of
visible and hidden units. We now briefly review basics of SVMs, LS-SVMs,
PCA, and SVD.

2.2 Least Squares Support Vector Machines and Related Kernel Machines.

2.2.1 SVM and LS-SVM. Assume a binary classification problem with
training data {(x_i, y_i)}_{i=1}^N, with input data x_i ∈ R^d and
corresponding class labels y_i ∈ {−1, 1}. An SVM classifier takes the form

ŷ = sign[w^T ϕ(x) + b],

where the feature map ϕ(·) : R^d → R^{n_f} maps the data from the input
space to a high-dimensional feature space and ŷ is the estimated class
label for a given input point x ∈ R^d. The training problem for this SVM
classifier (Boser et al., 1992; Cortes & Vapnik, 1995; Vapnik, 1998) is

min_{w,b,ξ}  (1/2) w^T w + c Σ_{i=1}^N ξ_i
subject to  y_i [w^T ϕ(x_i) + b] ≥ 1 − ξ_i,  i = 1, …, N   (2.6)
            ξ_i ≥ 0,  i = 1, …, N,

where the objective function makes a trade-off between minimization of
the regularization term (corresponding to maximization of the margin
2/‖w‖_2) and the amount of misclassifications, controlled by the
regularization constant c > 0. The slack variables ξ_i are needed to
tolerate misclassifications on the training data in order to avoid
overfitting the data. The following dual problem in the Lagrange
multipliers α_i is obtained, related to the first set of constraints:

max_α  −(1/2) Σ_{i,j=1}^N y_i y_j K(x_i, x_j) α_i α_j + Σ_{j=1}^N α_j
subject to  Σ_{i=1}^N α_i y_i = 0   (2.7)
            0 ≤ α_i ≤ c,  i = 1, …, N.

Here a positive-definite kernel K is used with
K(x, z) = ϕ(x)^T ϕ(z) = Σ_{j=1}^{n_f} ϕ_j(x) ϕ_j(z). The SVM classifier
is expressed in the dual as

ŷ = sign( Σ_{i ∈ S_SV} α_i y_i K(x_i, x) + b ),   (2.8)

where S_SV denotes the set of support vectors, corresponding to the
nonzero α_i values. Common choices are, for example, to take a linear
kernel K(x_i, x_j) = x_i^T x_j, a polynomial kernel
K(x_i, x_j) = (ν + x_i^T x_j)^d with ν ≥ 0, or a gaussian RBF kernel
K(x_i, x_j) = exp(−‖x_i − x_j‖_2^2 / σ^2).

The LS-SVM classifier (Suykens & Vandewalle, 1999b) is a modification
to it,

min_{w,b,e}  (1/2) w^T w + γ (1/2) Σ_{i=1}^N e_i^2
subject to  y_i [w^T ϕ(x_i) + b] = 1 − e_i,  i = 1, …, N,   (2.9)

where the value 1 in the constraints is taken as a target value instead
of a threshold value. This implicitly corresponds to a regression on the

class labels ±1. From the Lagrangian
L(w, b, e; α) = (1/2) w^T w + γ (1/2) Σ_{i=1}^N e_i^2
− Σ_{i=1}^N α_i { y_i [w^T ϕ(x_i) + b] − 1 + e_i }, one takes the
conditions for optimality ∂L/∂w = 0, ∂L/∂b = 0, ∂L/∂e_i = 0,
∂L/∂α_i = 0. Writing the solution in α, b gives the square linear system

[ Ω + I_N/γ   y_{1:N} ] [ α ]   [ 1_N ]
[ y_{1:N}^T      0    ] [ b ] = [  0  ],   (2.10)

where Ω_ij = y_i y_j ϕ(x_i)^T ϕ(x_j) = y_i y_j K(x_i, x_j),
y_{1:N} = [y_1; …; y_N], 1_N = [1; …; 1], and, as classifier in the dual,

ŷ = sign( Σ_{i=1}^N α_i y_i K(x_i, x) + b ).   (2.11)

This formulation has also been extended to multiclass problems in
Suykens et al. (2002).
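
As a small illustration of equations 2.10 and 2.11 (not part of the original letter), the NumPy sketch below assembles and solves the LS-SVM classifier linear system for a toy two-class problem with an RBF kernel; the data, γ, σ, and function names are assumptions made only for this example.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_classifier_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve equation 2.10: [[Omega + I/gamma, y], [y^T, 0]] [alpha; b] = [1_N; 0]."""
    N = X.shape[0]
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = Omega + np.eye(N) / gamma
    A[:N, N] = y
    A[N, :N] = y
    rhs = np.concatenate([np.ones(N), [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:N], sol[N]                      # Lagrange multipliers alpha and bias b

def lssvm_classifier_predict(X, y, alpha, b, Xtest, sigma=1.0):
    # equation 2.11: sign( sum_i alpha_i y_i K(x_i, x) + b )
    return np.sign(rbf_kernel(Xtest, X, sigma) @ (alpha * y) + b)

# toy usage: two gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.3, (20, 2)), rng.normal(1, 0.3, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
alpha, b = lssvm_classifier_fit(X, y)
print(lssvm_classifier_predict(X, y, alpha, b, X)[:5])
```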

In the LS-SVM regression formulation (Suykens et al., 2002) one per-
forms ridge regression in the feature space with an additional bias term b,

min_{w,b,e}  (1/2) w^T w + γ (1/2) Σ_{i=1}^N e_i^2
subject to  y_i = w^T ϕ(x_i) + b + e_i,  i = 1, …, N,   (2.12)

which gives

[ K + I_N/γ   1_N ] [ α ]   [ y_{1:N} ]
[ 1_N^T        0  ] [ b ] = [    0    ],   (2.13)

with the predicted output

ŷ = Σ_{i=1}^N α_i K(x_i, x) + b,   (2.14)

where K_ij = K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j). The classifier formulation
can also be transformed into the regression formulation by multiplying
the constraints in equation 2.9 by the class labels and considering new
error variables (Suykens et al., 2002). In the zero bias term case, this
corresponds to kernel ridge regression (Saunders, Gammerman, & Vovk,
1998), which is also
related to function estimation in reproducing kernel Hilbert spaces,
regularization networks, and gaussian processes, within a different
setting (Poggio & Girosi, 1990; Wahba, 1990; Rasmussen & Williams, 2006;
Suykens et al., 2002).

2.2.2 Kernel PCA and Matrix SVD. Within the setting of using equality
constraints and the L2 loss function, typical for LS-SVMs, one can
characterize the kernel PCA problem (Schölkopf, Smola, & Müller, 1998) as
follows, as shown in Suykens et al. (2002, 2003):

min_{w,b,e}  (1/2) w^T w − γ (1/2) Σ_{i=1}^N e_i^2
subject to  e_i = w^T ϕ(x_i) + b,  i = 1, …, N.   (2.15)

From the KKT conditions, one obtains the following in the Lagrange
multipliers α_i,

K^(c) α = λ α  with  λ = 1/γ,   (2.16)

where K^(c)_ij = (ϕ(x_i) − μ̂_ϕ)^T (ϕ(x_j) − μ̂_ϕ) are the elements of the
centered kernel matrix K^(c), μ̂_ϕ = (1/N) Σ_{i=1}^N ϕ(x_i), and
α = [α_1; …; α_N]. In equation 2.15, maximizing instead of minimizing
also leads to equation 2.16. The centering of the kernel matrix is
obtained as a result of taking a bias term b in the model. The γ value is
treated at a selection level and is chosen so as to correspond to
λ = 1/γ, where λ are eigenvalues of K^(c). In the zero bias term case,
K^(c) becomes the kernel matrix K = [ϕ(x_i)^T ϕ(x_j)]. Also, kernel
spectral clustering (Alzate & Suykens, 2010) was obtained in this setting
by considering a weighted version of the L2 loss part, weighted by the
inverse of the degree matrix of the graph in the clustering problem.
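
A minimal sketch of equation 2.16 (added here for illustration, not from the original letter): kernel PCA components follow from the eigendecomposition of the centered kernel matrix. The RBF bandwidth and toy data are arbitrary assumptions.

```python
import numpy as np

def kernel_pca_scores(X, sigma=1.0, n_components=2):
    """Eigendecompose the centered kernel matrix K^(c), cf. K^(c) alpha = lambda alpha."""
    N = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / sigma ** 2)
    C = np.eye(N) - np.ones((N, N)) / N       # centering matrix
    Kc = C @ K @ C                            # centered kernel matrix
    eigval, eigvec = np.linalg.eigh(Kc)       # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:n_components]
    lam, alpha = eigval[idx], eigvec[:, idx]
    # score of training point i on each component: sum_j alpha_j K^(c)(x_j, x_i) = lambda * alpha_i
    return lam, alpha, Kc @ alpha

rng = np.random.default_rng(2)
lam, alpha, scores = kernel_pca_scores(rng.standard_normal((50, 3)))
print(lam)
```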

Suykens (2016) showed recently that matrix SVD can be obtained from the
following primal problem:

min_{w,v,e,r}  −w^T v + γ (1/2) Σ_{i=1}^N e_i^2 + γ (1/2) Σ_{j=1}^M r_j^2
subject to  e_i = w^T ϕ(x_i),  i = 1, …, N   (2.17)
            r_j = v^T ψ(z_j),  j = 1, …, M,

where {x_i}_{i=1}^N and {z_j}_{j=1}^M are data sets related to two data
sources, which in the matrix SVD (Golub & Van Loan, 1989; Stewart, 1993)
case correspond to the sets of rows and columns of the given matrix. Here
one has two feature maps ϕ(·) and ψ(·). After taking the Lagrangian and
the necessary
conditions for optimality, the dual problem in the Lagrange multipliers
α_i and β_j, related to the first and second set of constraints, results in

[ 0    A ] [ α ]       [ α ]
[ A^T  0 ] [ β ]  = λ  [ β ],   (2.18)

where A = [ϕ(x_i)^T ψ(z_j)] denotes the matrix with ijth entry
ϕ(x_i)^T ψ(z_j), λ = 1/γ corresponds to the nonzero eigenvalues, and
α = [α_1; …; α_N], β = [β_1; …; β_M]. For a given matrix A, by choosing
the linear feature maps ϕ(x_i) = C^T x_i, ψ(z_j) = z_j with a
compatibility matrix C that satisfies ACA = A, this eigenvalue problem
corresponds to the SVD of matrix A (Suykens, 2016) in connection with
Lanczos's decomposition theorem. One can also see that for a symmetric
matrix the two data sources coincide, and the objective of equation 2.17
reduces to the kernel PCA objective, equation 2.15 (Suykens, 2016),
involving only one feature map instead of two feature maps.
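
The block structure of equation 2.18 can be checked numerically in the plain matrix case: the eigenvectors of the symmetric matrix [0, A; A^T, 0] with positive eigenvalues recover the singular values of A. The small random matrix below is only an illustration and is not taken from the letter.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
N, M = A.shape

# block matrix of equation 2.18 for plain (linear) feature maps
B = np.zeros((N + M, N + M))
B[:N, N:] = A
B[N:, :N] = A.T
eigval, eigvec = np.linalg.eigh(B)

# positive eigenvalues coincide with the singular values of A
pos = eigval > 1e-10
print(np.sort(eigval[pos])[::-1])                 # should match the line below
print(np.linalg.svd(A, compute_uv=False))
```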

3 Restricted Kernel Machines and Conjugate Feature Duality

3.1 LS-SVM Regression as a Restricted Kernel Machine: Linear Case.
A training data set D = {(x_i, y_i)}_{i=1}^N is assumed to be given, with
input data x_i ∈ R^d and output data y_i ∈ R^p (now with p outputs),
where the data are assumed to be identically and independently
distributed and drawn from an unknown but fixed underlying distribution
P(x, y), a common assumption made in statistical learning theory (Vapnik,
1998).

We will now explain how LS-SVM regression can be linked to the energy
form expression of an RBM with an interpretation in terms of hidden and
visible units. In view of these connections with RBMs and the fact that
there will be no hidden-to-hidden connections, we will call it a
restricted kernel machine (RKM) representation when this particular
interpretation of the model is made. For LS-SVM regression, the part in
the RKM interpretation that takes a form similar to the RBM energy
function is

R_RKM(v, h) = −v^T W̃ h
            = −(x^T W h + b^T h − y^T h)   (3.1)
            = e^T h,

with a vector of hidden units h ∈ R^p and a vector of visible units
v ∈ R^{n_v} with n_v = d + 1 + p equal to

v = [ x ; 1 ; −y ]   and   W̃ = [ W ; b^T ; I_p ],   (3.2)

and e = y − ŷ with ŷ = W^T x + b the estimated output vector for a given
input vector x, where e, y, ŷ ∈ R^p, W ∈ R^{d×p}, b ∈ R^p. Note that b is
treated as part of the interconnection matrix by adding a constant 1
within the vector v, which is also frequently done in the area of neural
networks (Suykens et al., 1995). While in RBM the units are binary
valued, in the RKM they are continuous valued. The notation R in
R_RKM(v, h) refers to the fact that the expression is restricted; there
are no hidden-to-hidden connections.

For the training problem, the sum is taken over the training data
{(x_i, y_i)}_{i=1}^N with

R^train_RKM = Σ_{i=1}^N R_RKM(v_i, h_i)
            = − Σ_{i=1}^N (x_i^T W h_i + b^T h_i − y_i^T h_i)   (3.3)
            = Σ_{i=1}^N e_i^T h_i.

Note that we adopt the notation h_(j),i to denote the value of the jth
unit for the ith data point, and e_i, h_i ∈ R^p for i = 1, …, N.

We start now from the LS-SVM regression training problem, equation 2.12,
but for the multiple outputs case. We express the objective in terms of
R^train_RKM and show how the hidden units can be introduced. Defining
λ = 1/γ > 0, we obtain

J = (η/2) Tr(W^T W) + (1/(2λ)) Σ_{i=1}^N e_i^T e_i
      s.t.  e_i = y_i − W^T x_i − b, ∀i
  ≥ Σ_{i=1}^N e_i^T h_i − (λ/2) Σ_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)
      s.t.  e_i = y_i − W^T x_i − b, ∀i
  = Σ_{i=1}^N (y_i^T − x_i^T W − b^T) h_i − (λ/2) Σ_{i=1}^N h_i^T h_i
      + (η/2) Tr(W^T W) ≜ J̲,   (3.4)

where λ, η are positive regularization constants and the first term
corresponds to R^train_RKM. J̲ denotes the lower bound on J.¹ This is
based on the property that for two arbitrary vectors e, h, one has

(1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h,  ∀e, h ∈ R^p.   (3.5)

The maximal value of the right-hand side in equation 3.5 is obtained for
h = e/λ, which follows from ∂(e^T h − (λ/2) h^T h)/∂h = 0 and
∂²(e^T h − (λ/2) h^T h)/∂h² = −λ I < 0. The maximal value that can be
obtained for the right-hand side equals the left-hand side,
(1/(2λ)) e^T e. The property 3.5 can also be verified by writing it in
quadratic form,

(1/2) [ e^T  h^T ] [ (1/λ) I    I  ] [ e ]
                   [    I      λ I ] [ h ]  ≥ 0,  ∀e, h ∈ R^p,   (3.6)

which holds. This follows immediately from the Schur complement form,²
which results in the condition (1/2)(λ I − I (λ I) I) ≥ 0, which holds.
Writing equation 3.5 as

(1/(2λ)) e^T e + (λ/2) h^T h ≥ e^T h   (3.7)

gives a property that is also known in Legendre-Fenchel duality for the
case of a quadratic function (Rockafellar, 1987). Furthermore, it also
follows from equation 3.5 that

(1/(2λ)) e^T e = max_h ( e^T h − (λ/2) h^T h ).   (3.8)

We will call the method of introducing the hidden features h_i into
equation 3.4 conjugate feature duality, where the hidden features h_i are
conjugated to the e_i. Here, R^train_RKM = Σ_i e_i^T h_i will be called
an inner pairing between the e_i and the hidden features h_i (see
Figure 2).

¹ Note that also the term −(λ/2) Σ_{i=1}^N h_i^T h_i appears. This would
in a Boltzmann machine energy correspond to matrix G equal to the
identity matrix. The term (η/2) Tr(W^T W) is an additional regularization
term.

² This states that for a matrix Q = [ A  B ; B^T  C ], one has Q ≥ 0 if
and only if A > 0 and the Schur complement C − B^T A^{-1} B ≥ 0 (Boyd &
Vandenberghe, 2004).
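
The conjugation property in equations 3.5 to 3.8 is easy to verify numerically: for any e, the maximum of e^T h − (λ/2) h^T h over h equals (1/(2λ)) e^T e and is attained at h = e/λ. The short check below (with an arbitrary λ and a random e) is only an added sanity check, not part of the original letter.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 0.7
e = rng.standard_normal(5)

h_star = e / lam                                   # maximizer of e^T h - (lam/2) h^T h
lhs = e @ e / (2 * lam)                            # (1/(2 lambda)) e^T e
rhs = e @ h_star - 0.5 * lam * h_star @ h_star     # value at the maximizer
print(np.isclose(lhs, rhs))                        # True: equation 3.8

# inequality 3.5 holds for any other h
h = rng.standard_normal(5)
print(lhs >= e @ h - 0.5 * lam * h @ h)            # True
```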


Figure 2: Restricted kernel machine (RKM) representation for regression.
The feature map ϕ(x) maps the input vector x to a feature space (possibly
by multilayers, depicted in yellow), and the hidden features are obtained
through an inner pairing e^T h, where e = y − ŷ compares the given output
vector y with the predictive model output vector ŷ = W^T ϕ(x) + b, and
the interconnection matrix W is depicted in blue.

We proceed now by looking at the stationary points of J̲(h_i, W, b):³

∂J̲/∂h_i = 0 ⇒ y_i = W^T x_i + b + λ h_i,  ∀i
∂J̲/∂W  = 0 ⇒ W = (1/η) Σ_i x_i h_i^T   (3.9)
∂J̲/∂b  = 0 ⇒ Σ_i h_i = 0.

The first condition yields h_i = e_i/λ, which means that the maximal
value of J̲ is reached. Therefore, y_i = ŷ_i + e_i = ŷ_i + λ h_i. Also
note the similarity between the condition W = (1/η) Σ_i x_i h_i^T and
equation 2.4 in the contrastive divergence algorithm. Elimination of h_i
from this set of conditions gives the solution in W, b:

[ Σ_i x_i x_i^T + λη I_d   Σ_i x_i ] [ W   ]   [ Σ_i x_i y_i^T ]
[ Σ_i x_i^T                  N     ] [ b^T ] = [ Σ_i y_i^T     ].   (3.10)

Elimination of W from the set of conditions gives the solution in h_i, b:

[ (1/η) [x_i^T x_j] + λ I_N   1_N ] [ H^T ]   [ Y^T ]
[ 1_N^T                        0  ] [ b^T ] = [  0  ],   (3.11)

with [x_i^T x_j] denoting the matrix with ij-entry x_i^T x_j,
H = [h_1 … h_N] ∈ R^{p×N}, and Y = [y_1 … y_N] ∈ R^{p×N}. From this
square linear system, one can solve {h_i} and b. 1_N denotes a vector of
all ones of size N and I_N the identity matrix of size N × N.

It is remarkable to see here that the hidden features h_i take the same
role as the Lagrange dual variables α_i in the LS-SVM formulation based
on Lagrange duality, equation 2.13, when taking η = 1 and p = 1. For the
estimated values ŷ_i on the training data, one can express the model in
terms of W, b or in terms of h_i, b. In the restricted kernel machine
interpretation of the LS-SVM regression, one has the following primal and
dual model representations,

M :  (P)_RKM :  ŷ = W^T x + b
     (D)_RKM :  ŷ = (1/η) Σ_j h_j x_j^T x + b,   (3.12)

evaluated at a point x, where the primal representation is in terms of
W, b and the dual representation is in the hidden features h_i. The
primal representation is suitable for handling the "large N, small d"
case, while the dual representation is suitable for the "small N,
large d" case (Suykens et al., 2002).

³ The following properties are used throughout this letter:
∂(a^T X b)/∂X = a b^T, ∂(a^T X^T b)/∂X = b a^T, ∂Tr(XA)/∂X = A^T,
∂Tr(X^T A)/∂X = A, ∂Tr(A X^T)/∂X = A, ∂Tr(X^T B X)/∂X = B X + B^T X,
∂(x^T a)/∂x = ∂(a^T x)/∂x = a, and a^T a = Tr(a a^T), for matrices
A, B, X and vectors a, b (Petersen & Pedersen, 2012).
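
The linear system 3.11 and the dual representation 3.12 fit in a few lines of NumPy. The sketch below (added for illustration, not from the letter) fits the hidden features h_i and bias b for a toy linear-kernel regression problem with p = 1; the values of η, λ and the toy data are arbitrary assumptions.

```python
import numpy as np

def rkm_linear_fit(X, Y, eta=1.0, lam=0.1):
    """Solve equation 3.11 for p = 1: [[(1/eta) K + lam I, 1], [1^T, 0]] [h; b] = [Y; 0]."""
    N = X.shape[0]
    K = X @ X.T                                   # linear kernel matrix [x_i^T x_j]
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = K / eta + lam * np.eye(N)
    A[:N, N] = 1.0
    A[N, :N] = 1.0
    rhs = np.concatenate([Y, [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:N], sol[N]                        # hidden features h_i and bias b

def rkm_linear_predict(X, h, b, Xtest, eta=1.0):
    # dual representation 3.12: yhat = (1/eta) sum_j h_j x_j^T x + b
    return (Xtest @ X.T) @ h / eta + b

rng = np.random.default_rng(5)
X = rng.standard_normal((40, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.05 * rng.standard_normal(40)
h, b = rkm_linear_fit(X, Y)
print(np.round(rkm_linear_predict(X, h, b, X[:3]), 2), np.round(Y[:3], 2))
```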

3.2 Nonlinear Case. The extension to the general nonlinear case goes by
replacing x_i by ϕ(x_i), where ϕ(·) : R^d → R^{n_f} denotes the feature
map, with n_f the dimension of the feature space. Therefore, the
objective function for the RKM interpretation becomes

J̲ = Σ_{i=1}^N (y_i^T − ϕ(x_i)^T W − b^T) h_i
    − (λ/2) Σ_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W),   (3.13)

with the vector of visible units v ∈ R^{n_v}, with n_v = n_f + 1 + p,
equal to

v = [ ϕ(x) ; 1 ; −y ].   (3.14)

Following the same approach as in the linear case, one then obtains as a
solution in the primal

[ Σ_j ϕ(x_j) ϕ(x_j)^T + λη I_{n_f}   Σ_j ϕ(x_j) ] [ W   ]   [ Σ_j ϕ(x_j) y_j^T ]
[ Σ_j ϕ(x_j)^T                          N       ] [ b^T ] = [ Σ_j y_j^T        ].   (3.15)

In the conjugate feature dual, one obtains the same linear system as
equation 3.11, but with the positive-definite kernel
K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) instead of the linear kernel x_i^T x_j:

[ (1/η) K + λ I_N   1_N ] [ H^T ]   [ Y^T ]
[ 1_N^T              0  ] [ b^T ] = [  0  ].   (3.16)

We also employ the notation [K(x_i, x_j)] to denote the kernel matrix K
with the ijth entry equal to K(x_i, x_j).

The primal and dual model representations are expressed in terms of the
feature map and the kernel function, respectively:

M :  (P)_RKM :  ŷ = W^T ϕ(x) + b
     (D)_RKM :  ŷ = (1/η) Σ_j h_j K(x_j, x) + b.   (3.17)

One can define the feature map in either an implicit or an explicit way.
When employing a positive-definite kernel function K(·, ·), according to
the Mercer theorem, there exists a feature map ϕ such that
K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) holds. On the other hand, one could also
explicitly define an expression for ϕ and construct the kernel function
according to K(x_i, x_j) := ϕ(x_i)^T ϕ(x_j). For multilayer perceptrons,
Suykens and Vandewalle (1999a) showed that the hidden layer can be chosen
as the feature map. We can let it correspond to

ϕ_FF(x) = σ(U_q σ(… σ(U_2 σ(U_1 x + β_1) + β_2) …) + β_q),   (3.18)

related to a feedforward (FF) neural network with multiple layers, with
hidden layer matrices U_1, U_2, …, U_q and bias term vectors β_1, …, β_q.
By construction, one obtains K_FF(x_i, x_j) := ϕ_FF(x_i)^T ϕ_FF(x_j).
Note that the activation function σ might also be different for each of
the hidden layers. A common choice is a sigmoid or hyperbolic tangent
function. Within the context of this letter, U_1, …, U_q, β_1, …, β_q are
treated at the feature map and the kernel parameter levels.

As Suykens et al. (2002) explained, one can also give a neural network
interpretation to both the primal and the dual representation, with a
number of hidden units equal to the dimension of the feature space for
the primal representation and to the number of support vectors in the
dual representation, respectively. For the case of a gaussian RBF kernel,
one has a one-hidden-layer interpretation with an infinite number of
hidden units in the primal, while in the dual the number of hidden units
equals the number of support vectors.

3.3 Classifier Formulation. In the multiclass case, the LS-SVM classifier

constraints are

Dyi (W T ϕ(希) + 乙) = 1p − ei

, i = 1, . . . , 氮,

(3.19)

where yi
nal matrix Dyi

{−1, 1}p, 不

∈ Rp with p outputs encoding the classes and diago-

}.
在这种情况下, starting from the LS-SVM classifier objective, one obtains

= diag{ y(1),我

, . . . , y(p),我

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

J =

2

Tr(W TW ) + 1
2λ

氮(西德:2)

我=1

i ei s.t. 不
eT

= 1p − Dyi (W T ϕ(希) + 乙), ∀i

氮(西德:2)

我=1

eT
i hi

-

λ

2

氮(西德:2)

我=1

hT
i hi

+

2

− Dyi (W T ϕ(希) + 乙), ∀i

Tr(W TW ) s.t. 不

= 1p

=

(西德:23)
1时间
p

氮(西德:2)

我=1

- (ϕ(希)TW + bT )Dyi

-

你好

(西德:24)

λ

2

氮(西德:2)

我=1

hT
i hi

+

2

Tr(W TW ) (西德:2) J.

(3.20)

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

The stationary points of J(你好

, 瓦, 乙) are given by

Deep Restricted Kernel Machines Using Conjugate Feature Duality

2139

⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎩

∂J
∂hi
∂J
∂W
∂J
∂b

= 0 ⇒ 1p = Dyi (W T ϕ(希) + 乙) + λhi
(西德:2)

ϕ(希)hT

i Dyi

= 0 ⇒ W = 1

(西德:2)

= 0

Dyi hi


= 0.

, ∀i

(3.21)

The solution in the conjugate features follows then from the linear system:

η K + λIN 1N
1

时间

1

0

⎦ =

YT

0

HT
D

bT

(3.22)

with HD = [Dy1 h1

, . . . , DyN hN].

The primal and dual model representations are expressed in terms of the

feature map and the kernel function, 分别:

(磷)RKM : ˆy = sign[W T ϕ(X) + 乙]

中号

(西德:10)

(西德:11)

(D)RKM : ˆy = sign


1

(西德:2)

j

(3.23)

Dy j h jK(x j

, X) + 乙

.

3.4 Kernel PCA. In the kernel PCA case we start from the objective in

equation 2.15 and introduce the conjugate hidden features:

i ei s.t. 不
eT

= W T ϕ(希), ∀i

J =

2

Tr(W TW ) - 1
2λ
氮(西德:2)

氮(西德:2)

λ

eT
i hi

+

2

我=1

氮(西德:2)

我=1

hT
i hi

+

2

≤ −

我=1
氮(西德:2)

= −

我=1

= −Rtrain
RKM

+

Tr(W TW ),

λ

2

氮(西德:2)

我=1

hT
i hi

我=1

+

2

Tr(W TW ) s.t. 不

= W T ϕ(希), ∀i

ϕ(希)TWhi

+

λ

2

氮(西德:2)

hT
i hi

+

2

Tr(W TW ) (西德:2) J

(3.24)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2140

J. Suykens

where the upper bound J is introduced now by relying on the same property
2 hT h ≥ eT h, but in a
as used in the regression/classification case, 1
different way. 注意

2λ eT e + λ

- 1
2λ eT e = min

H

(−eT h +

λ

2

hT h).

(3.25)

The minimal value for the right-hand side is obtained for h = e/λ, which
equals the left-hand side in that case.

We then proceed by characterizing the stationary points of J(你好

⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎩

∂J
∂hi
∂J
∂W

= 0 ⇒ W T ϕ(希) = λhi

, ∀i

= 0 ⇒ W = 1

(西德:2)

ϕ(希)hT

.

, 瓦 ):

(3.26)

/λ. 所以, the minimum value
= ei
Note that the first condition yields hi
of J is reached. Elimination of W gives the following solution in the conju-
gated features,

1
η KHT = HT (西德:16),

(3.27)

hN] ∈ Rs×N and (西德:16) = diag{λ

, …, λs} with s ≤ N the number
where H = [h1
of selected components. One can verify that the solutions corresponding to
the different eigenvectors hi and their corresponding eigenvalues λ
i all lead
to the value J = 0.

1

The primal and dual model representations are

(磷)RKM : ˆe = W T ϕ(X)

中号

(西德:10)

(西德:11)

(D)RKM : ˆe = 1

(西德:2)

j

h jK(x j

, X).

(3.28)

Here the number of hidden units equals s with h ∈ Rs and the visible

units v ∈ Rn f with v = ϕ(X), and RRKM(v, H) = −v TWh.

3.5 Singular Value Decomposition. For the SVD case, we start from
the objective in equation 2.17 and introduce the conjugated hidden features.

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Deep Restricted Kernel Machines Using Conjugate Feature Duality

2141

The model is characterized now by matrices W, V:

J = −

2

Tr(V TW ) + 1
2λ

氮(西德:2)

我=1

eT
i ei

+ 1
2λ

中号(西德:2)

j=1

rT
j r j

s.t. 不

= W T ϕ(希), ∀i & r j

= V T ψ (z j ), ∀ j

氮(西德:2)

我=1

eT
i hei

-

λ

2

氮(西德:2)

我=1

hT
ei hei

+

中号(西德:2)

j=1

rT
j hr j

-

λ

2

中号(西德:2)

j=1

hT
r j hr j

-

2

Tr(V TW )

s.t. 不

= W T ϕ(希), ∀i & r j

= V T ψ (z j ), ∀ j

氮(西德:2)

=

ϕ(希)TWhei

-

λ

2

氮(西德:2)

我=1

hT
ei hei

+

中号(西德:2)

j=1

ψ (z j )TVhr j

我=1

-

λ

2

中号(西德:2)

j=1

hT
r j hr j

-

2

Tr(V TW ) (西德:2) J.

(3.29)

在这种情况下, Rtrain
RKM
, 瓦, hr j
points of J(hei

(西德:3)


我=1

=
ϕ(希)TWhei
, V ) are given by

(西德:3)

+

中号
j=1

ψ (z j )TVhr j . The stationary

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

∂J
∂hei
∂J
∂W
∂J
∂hr j
∂J
∂V

= 0 ⇒ W T ϕ(希) = λhei

, ∀i

= 0 ⇒ V = 1

(西德:2)

ϕ(希)hT

= 0 ⇒ V T ψ (z j ) = λhr j

, ∀ j

= 0 ⇒ W = 1

(西德:2)

j

ψ (z j )hT
r j

.

(3.30)

, hr j :

Elimination of W, V gives the solution in the conjugated dual features
hei

0

这 [ϕ(希)T ψ (z j )]
1

这 [ψ (z j )T ϕ(希)]
1

0

HT
e

时间
H
r

⎦ =

(西德:16),

HT
e

时间
H
r

(3.31)

with He = [he1
. . . , λs} with s ≤ N + M a specified number of nonzero eigenvalues.

. . . heN ] ∈ Rs×N, Hr = [hr1

. . . hrM ] ∈ Rs×M and (西德:16) = diag{λ

1

,

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2142

J. Suykens

The primal and dual model representations are

(磷)RKM : ˆe = W T ϕ(X)
ˆr = V T ψ (z)

中号

(西德:10)

(西德:11)

(D)RKM : ˆe = 1

ˆr = 1

(西德:2)

j
(西德:2)

hr j

ψ (z j )T ϕ(X)

hei

ϕ(希)T ψ (z),

(3.32)

which corresponds to matrix SVD in the case of linear compatible fea-
ture maps and if an additional compatibility condition holds (Suykens,
2016).

3.6 Kernel pmf. For the case of kernel probability mass function (kernel

pmf) estimation (Suykens, 2013), we start from the objective

J =

氮(西德:2)

我=1

(pi

− ϕ(希)T w)你好

-

氮(西德:2)

我=1

+

pi

2

wT w

(3.33)

in the unknowns w ∈ Rn f , pi
ε R. Suykens (2013) explained how
a similar formulation is related to the probability rule in quantum measure-
ment for a complex valued model.

ε R, and hi

The stationary points are characterized by

⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎩

∂J
∂hi
∂J
∂w
∂J
∂ pi

= 0 ⇒ pi

= wT ϕ(希), ∀i
(西德:2)

ϕ(希)你好

= 0 ⇒ w = 1

= 0 ⇒ hi

= 1, ∀i.

(3.34)

0
The regularization constant η can be chosen to normalize
is achieved by the choice of an appropriate kernel function), which gives
then the kernel pmf obtained in Suykens (2013). This results in the repre-
句子

= 1 (pi

i pi

(西德:3)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Deep Restricted Kernel Machines Using Conjugate Feature Duality

2143

(磷)RKM : pi

= wT ϕ(希)

中号

(西德:10)

(西德:11)

(D)RKM : pi

= 1

(西德:2)

j

K(x j

, 希).

4 Deep Restricted Kernel Machines

(3.35)

In this section we couple different restricted kernel machines within a
deep architecture. Several coupling configurations are possible at this
point. We
illustrate deep restricted kernel machines here for an architecture consisting
of three levels. We discuss two configurations:

1. Two kernel PCA levels followed by an LS-SVM regression level
2. LS-SVM regression level followed by two kernel PCA levels

In the first architecture, the first two levels extract features that are used
within the last level for classification or regression. Related types of archi-
tectures are stacked autoencoders (本吉奥, 2009), where a pretraining phase
provides a good initialization for training the deep neural network in the
fine-tuning phase. The deep RKM will consider an objective function jointly
related to the kernel PCA feature extractions and the classification or regres-
锡安. We explain how the insights of the RKM kernel PCA representations
can be employed for combined supervised training and feature selection.
A difference with other methods is also that conjugated features are used
within the layered architecture.

In the second architecture, one starts with regression and then lets two
kernel PCA levels further act on the residuals. In this case connections will
be shown with deep Boltzmann machines (Salakhutdinov, 2015; Salakhut-
dinov & 欣顿, 2009) when considering the special case of linear feature
地图, though for the RKMs in a nonprobabilistic setting.

4.1 Two Kernel PCA Levels Followed by Regression Level. We focus

here on a deep RKM architecture consisting of three levels:

• Level 1 consists of kernel PCA with given input data xi and is char-

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

acterized by conjugated features h(1)

.

• Level 2 consists of kernel PCA by taking h(1)

as input and is charac-

terized by conjugated features h(2)

.

• Level 3 consists of LS-SVM regression on h(2)
i with output data yi and
is characterized by conjugated features h(3)
.

2144

J. Suykens

As predictive model is taken,

ˆe(1) = W T
1

1(X),
ϕ

ˆe(2) = W T
2

2((西德:16)−1
ϕ

1

ˆe(1)),

ˆy = W T
3

3((西德:16)−1
ϕ

2

ˆe(2)) + 乙,

(4.1)

∈ Rn f3

∈ Rn f2

; the level 2 part ϕ

2 : Rn f1 → Rn f2 , W2

×s(1)
3 : Rn f2 → Rn f3 , W3

evaluated at point x ∈ Rd. The level 1 part has feature map ϕ
1 : Rd → Rn f1 ,
×s(2)
∈ Rn f1
W1
; and the level
ˆe(1) 和 (西德:16)−1
3 part ϕ
ˆe(2) (和
2
ˆe(1) ∈ Rs(1) , ˆe(2) ∈ Rs(2)
) are taken as input for levels 2 和 3, 分别,
在哪里 (西德:16)
, (西德:16)
2 denote the diagonal matrices with the corresponding eigen-
1
价值观. The latter is inspired by the property that for the uncoupled kernel
/λ holds on the training data according to
PCA levels, the property hi
方程 3.26, which is then further extended to the out-of-sample case in
方程 4.1.

×p. 注意 (西德:16)−1

= ei

1

The objective function in the primal is

Jdeep,磷

= J1

+ J2

+ J3

J1

= − 1
2λ
1

J2

= − 1
2λ
2

氮(西德:2)

我=1

氮(西德:2)

我=1

时间

e(1)

+

e(1)

时间

e(2)

+

e(2)

1
2

2
2

Tr(W T

1 W1)

Tr(W T

2 W2)

J3

= 1
2λ
3

氮(西德:2)

我=1

时间

e(3)

+

e(3)

3
2

Tr(W T

3 W3)

(4.2)

(4.3)

ϕ

1(希), ˆei

(2) = W T
2

(1) = W T
1

2((西德:16)−1
(2)) + 乙. 如何-
ϕ
with ˆei
1
曾经, this objective function is not directly usable for minimization due to
时间
the minus sign terms − 1
e(1)
. For direct
2λ

minimization of an objective in the primal, we will use the following stabi-
lized version,

and − 1
2λ

3((西德:16)−1
ϕ

i=1 e(2)

i=1 e(1)

= W T
3

(1)), ˆyi

e(2)

(西德:3)

(西德:3)

ˆei

ˆei

时间

2

1

2

Jdeep,Pstab

= J1

+ J2

+ J3

+ 1
2

cstab(J2
1

+ J2

2 ),

(4.4)

with cstab a positive constant. The role of this stabilization term for the kernel
PCA levels is explained in the appendix. While in stacked autoencoders

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Deep Restricted Kernel Machines Using Conjugate Feature Duality

2145

Figure 3: Example of a deep restricted kernel machine consisting of three
levels, with kernel PCA in levels 1 and 2 and LS-SVM regression in level 3.

one has an unsupervised pretraining and a supervised fine-tuning phase
(Bengio, 2009), here we train the whole network at once.

For a characterization of the deep RKM in terms of the conjugated
(见图 3), we will study the stationary points

, H(2)

, H(3)

features h(1)

Jdeep

= J1

+ J2

+ J

,

3

(4.5)

, W1
where the objective Jdeep(H(1)

sists of the sum of the objectives of levels 1,2,3 given by J1, J2, J

, 乙) for the deep RKM con-
, 分别.

, H(3)

, H(2)

, W2

, W3

3

This becomes

Jdeep

= −

氮(西德:2)

我=1

ϕ
1(希)TW1h(1)

+

λ
1
2

氮(西德:2)

我=1

时间

H(1)

H(1)

+

1
2

Tr(W T

1 W1)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

氮(西德:2)

-

我=1

ϕ
2(H(1)

)TW2h(2)

+

λ
2
2

氮(西德:2)

我=1

时间

H(2)

H(2)

+

2
2

Tr(W T

2 W2)

+

氮(西德:2)

我=1

(yT

− ϕ

3(H(2)

)TW3

− bT )H(3)

-

λ
3
2

氮(西德:2)

我=1

时间

H(3)

H(3)

+

3
2

Tr(W T

3 W3),

with the following inner pairings at the three levels:

等级 1 :

氮(西德:2)

我=1

时间

e(1)

H(1)

=

氮(西德:2)

我=1

ϕ
1(希)TW1h(1)

,

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

(4.6)

2146

等级 2 :

等级 3 :

氮(西德:2)

我=1

氮(西德:2)

我=1

时间

e(2)

H(2)

=

时间

e(3)

H(3)

=

氮(西德:2)

我=1

氮(西德:2)

我=1

ϕ
2(H(1)

)TW2h(2)

,

(yT

− ϕ

3(H(2)

)TW3

− bT )H(3)

.

J. Suykens

(4.7)

The stationary points of Jdeep(H(1)

, W1

, H(2)

, W2

, H(3)

, W3

, 乙) are given by

= 0 ⇒ W T
1

1(希) = λ
ϕ

1H(1)

-

[ϕ

2(H(1)

)TW2h(2)

], ∀i,

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

∂Jdeep
∂h(1)

∂Jdeep
∂W1
∂Jdeep
∂h(2)

∂Jdeep
∂W2
∂Jdeep
∂h(3)

∂Jdeep
∂W3
∂Jdeep
∂b


∂h(1)

时间 ,

= 0 ⇒ W1

= 1

1

(西德:2)

ϕ
1(希)H(1)

= 0 ⇒ W T
2

ϕ
2(H(1)

) = λ

2H(2)

-

= 0 ⇒ W2

= 1

2

(西德:2)

ϕ
2(H(1)

)H(2)

= 0 ⇒ yi

− W T
3

ϕ
3(H(2)

) − b = λ

3H(3)

, ∀i,

ϕ
3(H(2)

)H(3)

时间 ,

= 1

3

(西德:2)

H(3)

= 0.

= 0 ⇒ W3

= 0

(西德:2)


∂h(2)

时间 ,

[ϕ

3(H(2)

)TW3h(3)

], ∀i,

(4.8)

The primal and dual model representations for the deep RKM are then

中号

(西德:10)

(西德:11)

ϕ

ˆe(1) = W T
1
(磷)DeepRKM : ˆe(2) = W T
2
3((西德:16)−1
ϕ

1(X)
2((西德:16)−1
ϕ

ˆy = W T
3

2

ˆe(1))

1
ˆe(2)) + 乙

ˆe(1) = 1

1

(D)DeepRKM : ˆe(2) = 1

2
(西德:3)

ˆy = 1

3

, X)

(西德:3)

(西德:3)

j K1(x j
j K2(H(1)

j h(1)
j h(2)
j K3(H(2)

j

j h(3)

ˆe(1))

, (西德:16)−1
1
j
, (西德:16)−1
ˆe(2)) + 乙.
2

(4.9)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Deep Restricted Kernel Machines Using Conjugate Feature Duality

2147

By elimination of W1
equations in the conjugated features h(1)

, W3, one obtains the following set of nonlinear
i and b:

, H(2)

, H(3)

, W2

H(1)
j K1(x j

, 希) = λ

1H(1)

- 1

2

(西德:2)

j

j K2(H(1)
H(2)

j

, H(1)

) = λ

2H(2)

- 1

3

, H(1)
j )

∂K2(H(1)

∂h(1)

∂K3(H(2)

∂h(2)

j

(西德:2)

, H(2)
j )

时间

H(2)
j

H(2)

, ∀i

j K3(H(2)
H(3)

j

, H(2)

) + 乙 + λ

3H(3)

, ∀i

(西德:2)

j

(西德:2)

j

1

1

1

2

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(西德:2)

j
= 0.

= 1

3

H(3)

(西德:2)

时间

H(3)
j

H(3)

, ∀i,

(4.10)

Solving this set of nonlinear equations is computationally expensive. 如何-
, K3,lin
曾经, for the case of taking linear kernels K2 and K3 (and Klin
denoting linear kernels) 方程 4.10 simplifies to

, K2,lin

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

等级 1 :

等级 2 :

等级 3 :

(西德:15)

(西德:15)

[K1(x j

, 希)] + 1

2

(西德:16)

[Klin(H(2)
j

, H(2)

)]

HT
1

= HT
1

(西德:16)
1

[K2,lin(H(1)
j

, H(1)

)] + 1

3

(西德:16)

[Klin(H(3)
j

, H(3)

)]

HT
2

= HT
2

(西德:16)
2

1

1

1

2

1

3

[K3,lin(H(2)
j

, H(2)

)] + λ

3IN 1N

时间

1

0

⎦ =

.

YT

0

HT
3

bT

(4.11)

= [H(1)
1

…H(1)

氮 ], H2

Here we denote H1
氮 ]. One sees
that at levels 1 和 2, a data fusion is taking place between K1 and Klin and
, 1
between K2,lin and Klin, 在哪里 1
are specifying the relative weight


3
given to each of these kernels. 这样, one can choose for emphasizing
or deemphasizing the levels with respect to each other.

氮 ], H3

, 1

2

1

= [H(2)
1

…H(2)

= [H(3)
1

…H(3)

4.2 Regression Level Followed by Two Kernel PCA Levels. In this case, we
consider a deep RKM architecture with the following three levels:

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2148

J. Suykens

• Level 1 consists of LS-SVM regression with given input data xi and

output data yi and is characterized by conjugated features h(1)

.

• Level 2 consists of kernel PCA by taking h(1)

terized by conjugated features h(2)

.

• Level 3 consists of kernel PCA by taking h(2)

as input and is charac-

as input and is charac-

terized by conjugated features h(3)

.

We look then for the stationary points of

Jdeep

= J

1

+ J2

+ J3

(4.12)

, W1
where the objective Jdeep(H(1)

sists of the sum of the objectives of levels 1, 2, 3 given by J
主动地. Deep RKM consists of coupling the RKMs.

, W3) for the deep RKM con-
, J2, J3, 重新指定-

, 乙, H(2)

, H(3)

, W2

1

This becomes

Jdeep

=

氮(西德:2)

我=1

(yT

− ϕ

1(希)TW1

− bT )H(1)

-

λ
1
2

氮(西德:2)

我=1

时间

H(1)

+

H(1)

1
2

Tr(W T

1 W1)

氮(西德:2)

-

我=1

氮(西德:2)

-

我=1

ϕ
2(H(1)

)TW2h(2)

+

ϕ
3(H(2)

)TW3h(3)

+

λ
2
2

λ
3
2

氮(西德:2)

我=1

氮(西德:2)

我=1

时间

H(2)

H(2)

+

时间

H(3)

H(3)

+

2
2

3
2

Tr(W T

2 W2)

Tr(W T

3 W3),

(4.13)

×p, the level 2 part ϕ
3 : Rs(2) → Rn f3 , W3

1 : Rd → Rn f1 , W1
∈ Rn f1
, and the level 3 part ϕ

with ϕ
×s(2)
Rn f2
. Note that in
Jdeep, the sum of the three inner pairing terms is similar to the energy in
deep Boltzmann machines (Salakhutdinov, 2015; Salakhutdinov & 欣顿,
2009) for the particular case of linear feature maps ϕ
3 and symmetric
1
interaction terms. For the special case of linear feature maps, one has

2 : Rp → Rn f2 , W2

∈ Rn f3

×s(3)

, ϕ

, ϕ

2

Udeep

= −v T ˜W1h(1) − h(1)时间

W2h(2) − h(2)时间

W3h(3),

(4.14)

which takes the same form as equation 29 in Salakhutdinov (2015), with ˜W1
defined in the sense of equation 3.1 in this letter. The “U” in Udeep refers
to the fact that the deep RKM is unrestricted after coupling because of the
hidden-to-hidden connections between layers 1 和 2 and between layers
2 和 3, while the uncoupled RKMs are restricted.

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Deep Restricted Kernel Machines Using Conjugate Feature Duality

2149

The stationary points of Jdeep(H(1)

, W1

, 乙, H(2)

= 0 ⇒ yi

− W T
1

1(希) − b = λ
ϕ

1H(1)

+

ϕ
2(H(1)

)TW2h(2)

, H(3)

(西德:25)

, W2

∂h(1)

, W3) are given by

(西德:26)
, ∀i

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

∂Jdeep
∂h(1)

∂Jdeep
∂W1
∂Jdeep
∂b
∂Jdeep
∂h(2)

∂Jdeep
∂W2
∂Jdeep
∂h(3)

∂Jdeep
∂W3

时间

ϕ
1(希)H(1)

= 1

1

(西德:2)

H(1)

= 0

= 0 ⇒ W1

= 0

(西德:2)

= 0 ⇒ W T
2

ϕ
2(H(1)

) = λ

2H(2)

-

= 0 ⇒ W2

= 1

2

(西德:2)

ϕ
2(H(1)

)H(2)

时间

= 0 ⇒ W T
3

ϕ
3(H(2)

) = λ

3H(3)

, ∀i

= 0 ⇒ W3

= 1

3

(西德:2)

ϕ
3(H(2)

)H(3)

时间 .

(西德:25)
ϕ
3(H(2)

)TW3h(3)

(西德:26)
, ∀i

(4.15)


∂h(2)

As predictive model for this deep RKM case, we have

中号

(西德:10)

(西德:11)

(磷)DeepRKM : ˆy = W T
1

1(X) + 乙
ϕ

(D)DeepRKM : ˆy = 1

1

(西德:2)

j

H(1)
j K1(x j

, X) + 乙.

(4.16)

, W2

, W3, one obtains the following set of nonlinear

By elimination of W1
equations in the conjugated features h(1)

, H(2)

, H(3)

, and b:

H(1)
j K1(x j

, 希) + 乙 + λ

1H(1)

+ 1

2

(西德:2)

j

, H(1)
j )

∂K2(H(1)

∂h(1)

时间

H(2)
j

H(2)

, ∀i


(西德:2)

= 1

1
H(1)

(西德:2)

j
= 0


1

2
1

3

j K2(H(1)
H(2)

j

, H(1)

) = λ

2H(2)

- 1

3

(西德:2)

j

, H(2)
j )

∂K3(H(2)

∂h(2)

时间

H(3)
j

H(3)

, ∀i

j K3(H(2)
H(3)

j

, H(2)

) = λ

3H(3)

, ∀i.

(西德:2)

j
(西德:2)

j

(4.17)

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2150

J. Suykens

When taking linear kernels K2 and K3, the set of nonlinear equations sim-
plifies to

[K1(x j

, 希)] + 1

2

[Klin(H(2)
j

, H(2)

)] + λ

1IN 1N

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

等级 1 :

等级 2 :

等级 3 :

=

(西德:15)

(西德:15)

1

1

(西德:9)

1

2

1

3

时间

1

(西德:10)

YT

0

[K2,lin(H(1)
j

, H(1)

)] + 1

(西德:16)

3

(西德:9)

(西德:10)

HT
1
bT

0

(西德:16)

[Klin(H(3)
j

, H(3)

)]

HT
2

= HT
2

(西德:16)
2

[K3,lin(H(2)
j

, H(2)

)]

HT
3

= HT
3

(西德:16)
3

(4.18)

with a similar data fusion interpretation as explained in the previous sub-
section.

5 Algorithms for Deep RKM

The characterization of the stationary points for the objective functions in
the different deep RKM models typically leads to solving large sets of non-
linear equations in the unknown variables, especially for large given data
套. 所以, in this section, we outline a number of approaches and al-
gorithms for working with the kernel-based models (in either the primal or
the dual). We also outline algorithms for training deep feedforward neural
networks in a parametric way in the primal within the deep RKM setting.
The algorithms proposed in sections 5.2 和 5.3 are applicable also to large
data sets.

5.1 Levelwise Solving for Kernel-Based Models. For the case of linear
kernels in levels 2 和 3 in equation 4.11 和 4.18, we propose a heuristic
algorithm that consists of level-wise solving linear systems and eigenvalue
decompositions by alternating fixing different unknown variables.

xN] ∈ Rd×N, Y = [y1

For equation 4.18, in order to solve level 1 as a linear system, one needs
yN] ∈ Rp×N, 但是也
the input/output data X = [x1
the knowledge of h(2)
. 所以, an initialization phase is required. 一
can initialize h(2)
as zero or at random at level 1, obtain H1, and propagate
it to level 2. At level 2, after initializing H3, one finds H2, which is then
propagated to level 3, where one computes H3. After this forward phase,
one can go backward from level 3 to level 1 in a backward phase.

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
C

_
A
_
0
0
9
8
4
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Deep Restricted Kernel Machines Using Conjugate Feature Duality

2151

Schematically this gives the following heuristic algorithm:

Forward phase (等级 1 → level 3)

, H3 initialization

H2
等级 1 : H1 := f1(X, 是, H2) (for equation 4.18) 或者

H1 := f1(X, H2) (for equation 4.11)

等级 2 : H2 := f2(H1
, H3)
等级 3 : H3 := f3(H2) (for equation 4.18) 或者
H3 := f1(是, H2) (for equation 4.11)

Backward phase (等级 3 → level 1)

等级 2 : H2 := f2(H1
, H3)
等级 1 : H1 := f1(X, 是, H2) (for equation 4.18) 或者

H1 := f1(X, H2) (for equation 4.11).

One can repeat the forward and backward phases a number of times,
without the initialization step. 或者, one could also apply an algo-
rithm with forward-only phases, which can then be applied a number of
times after each other.

5.2 Deep Reduced Set Kernel-Based Models with Estimation in Primal. In the following approach, approximations ˜W1, ˜W2, ˜W3 are made to W1, W2, W3:

    W1 = (1/η1) Σ_{i=1}^N φ1(x_i) h_i^(1)T  ≈  ˜W1 = (1/η1) Σ_{j=1}^M φ1(˜x_j) ˜h_j^(1)T,

    ˜W2 = (1/η2) Σ_{j=1}^M φ2(˜h_j^(1)) ˜h_j^(2)T,

    ˜W3 = (1/η3) Σ_{j=1}^M φ3(˜h_j^(2)) ˜h_j^(3)T,        (5.2)

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

}氮
where a subset of the training data set { ˜x j
i=1 is considered with
中号 (西德:14) 氮. This approximation corresponds to a reduced-set technique in ker-
nel methods (Schölkopf et al., 1999). In order to have a good representation
of the data distribution, one can take a fixed-size algorithm with subset se-
lection according to quadratic Renyi entropy (Suykens et al., 2002), or a ran-
dom subset as a simpler scheme.

{希

}中号
j=1

(5.1)
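A simple way to pick such a subset is sketched below: candidate swaps are accepted when they increase an estimate of the quadratic Renyi entropy of the working set, in the spirit of the fixed-size selection mentioned above; the RBF kernel, its bandwidth, and the swap loop are illustrative assumptions, not the exact procedure of Suykens et al. (2002).

    import numpy as np

    def renyi_entropy(Xs, sigma):
        # Quadratic Renyi entropy estimate (up to constants) with an RBF kernel.
        d2 = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1)
        return -np.log(np.exp(-d2 / sigma ** 2).mean())

    def select_subset(X, M, sigma, n_trials=2000, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.choice(X.shape[0], M, replace=False)
        best = renyi_entropy(X[idx], sigma)
        for _ in range(n_trials):
            j = rng.integers(X.shape[0])       # candidate replacement point
            if j in idx:
                continue
            trial = idx.copy()
            trial[rng.integers(M)] = j         # propose a swap
            h = renyi_entropy(X[trial], sigma)
            if h > best:                       # keep better-spread subsets
                idx, best = trial, h
        return idx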

We proceed then with a primal estimation scheme by taking stabiliza-
tion terms for the kernel PCA levels. In the case of two kernel PCA levels
followed by LS-SVM regression, we minimize the following objective:

中号(西德:2)

j=1

中号(西德:2)

j=1

min
,˜h(3)
j

,乙,(西德:16)

,(西德:16)

2

1

˜h(1)
j

,˜h(2)
j

Jdeep,Pstab

= − 1
2

e(1)
j

时间 (西德:16)−1

1 e(1)

j

+

+

1
2

2
2

Tr( ˜W T
1

˜W1) - 1
2

e(2)
j

时间 (西德:16)−1

2 e(2)

j

Tr( ˜W T
2

˜W2) + 1
2λ
3

中号(西德:2)

j=1

时间

e(3)
j

+

e(3)
j

3
2

Tr( ˜W T
3

˜W3)

+ 1
2

cstab

+ 1
2

cstab


⎝− 1
2


⎝− 1
2

中号(西德:2)

j=1

中号(西德:2)

j=1

e(1)
j

e(2)
j

The predictive model then becomes:

2

Tr( ˜W T
1

˜W1)

2

Tr( ˜W T
2

˜W2)

.

(5.3)

+

+

1
2

2
2

时间 (西德:16)−1

1 e(1)

j

时间 (西德:16)−1

2 e(2)

j

    ê(1) = (1/η1) Σ_{j=1}^M ˜h_j^(1) K1( ˜x_j, x ),

    ê(2) = (1/η2) Σ_{j=1}^M ˜h_j^(2) K2( ˜h_j^(1), Λ1^{−1} ê(1) ),

    ŷ = (1/η3) Σ_{j=1}^M ˜h_j^(3) K3( ˜h_j^(2), Λ2^{−1} ê(2) ) + b.        (5.4)

The number of unknowns in this case is M × (s(1) + s(2) + p) + s(1) + s(2) + 1.
Alternatively, instead of the regularization terms Tr( ˜W_l^T ˜W_l ), one could also
take Tr( ˜H(l) ˜H(l)T ) where ˜H(l) = [˜h_1^(l) ... ˜h_M^(l)] for l = 1, 2, 3.
One can also maximize Tr(Λ1) + Tr(Λ2) by adding a term −c0 (Tr(Λ1) +
Tr(Λ2)) to the objective, equation 5.3, with c0 a positive constant. Note that
the components of ˜H(1), ˜H(2) in levels 1 and 2 do not possess an orthogonal-
ity property unless this is imposed as additional constraints to the objective
function.
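A direct transcription of the predictive model in equation 5.4 is sketched below; the kernels are passed in as callables and the diagonal matrices Λ1, Λ2 as vectors, and all names and the η defaults are illustrative assumptions.

    import numpy as np

    def predict(x, Xt, H1t, H2t, H3t, K1, K2, K3, Lam1, Lam2, b, eta=(1.0, 1.0, 1.0)):
        # Xt: (M, d) reduced-set points ~x_j; H1t, H2t, H3t: conjugate features
        # ~h_j^(1), ~h_j^(2), ~h_j^(3) stored as columns; Lam1, Lam2: diagonals of Lambda_1, Lambda_2.
        M = Xt.shape[0]
        e1 = sum(H1t[:, j] * K1(Xt[j], x) for j in range(M)) / eta[0]
        z1 = e1 / Lam1                                   # Lambda_1^{-1} e^(1)
        e2 = sum(H2t[:, j] * K2(H1t[:, j], z1) for j in range(M)) / eta[1]
        z2 = e2 / Lam2                                   # Lambda_2^{-1} e^(2)
        y = sum(H3t[:, j] * K3(H2t[:, j], z2) for j in range(M)) / eta[2] + b
        return y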

5.3 Training Deep Feedforward Neural Networks within the Deep
RKM Framework. For training of deep feedforward neural networks
within this deep RKM setting, one minimizes Jdeep,Pstab in the unknown in-
terconnection matrices of the different levels. In case one takes one hidden
layer per level, the following objective is minimized

    min_{W1,2,3, β1,2,3, U1,2,3, b, Λ1, Λ2}  Jdeep,Pstab =
        − (1/2) Σ_{j=1}^M e_j^(1)T Λ1^{−1} e_j^(1) + (η1/2) Tr( W1^T W1 )
        − (1/2) Σ_{j=1}^M e_j^(2)T Λ2^{−1} e_j^(2) + (η2/2) Tr( W2^T W2 )
        + (1/(2 λ3)) Σ_{j=1}^M e_j^(3)T e_j^(3) + (η3/2) Tr( W3^T W3 )
        + (cstab/2) [ − (1/2) Σ_{j=1}^M e_j^(1)T Λ1^{−1} e_j^(1) + (η1/2) Tr( W1^T W1 ) ]^2
        + (cstab/2) [ − (1/2) Σ_{j=1}^M e_j^(2)T Λ2^{−1} e_j^(2) + (η2/2) Tr( W2^T W2 ) ]^2        (5.5)

for the model

    ê(1) = W1^T σ( U1 x + β1 ),

    ê(2) = W2^T σ( U2 Λ1^{−1} ê(1) + β2 ),

    ŷ = W3^T σ( U3 Λ2^{−1} ê(2) + β3 ) + b.        (5.6)

Alternatively, one can take additional nonlinearities on Λ1^{−1} ê(1), Λ2^{−1} ê(2), which results
in the model ê(1) = W1^T σ(U1 x + β1), ê(2) = W2^T σ(U2 σ(Λ1^{−1} ê(1)) + β2),
ŷ = W3^T σ(U3 σ(Λ2^{−1} ê(2)) + β3) + b. The number of unknowns
is nh1 × (s(1) + d + 1) + nh2 × (s(2) + s(1) + 1) + nh3 × (p +
s(2) + 1) + s(1) + s(2) + 1, where nh1,2,3 denote the number of hidden units.
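For concreteness, the forward pass of the model in equation 5.6 and the corresponding count of unknowns can be written as in the following sketch; the choice σ = tanh and the dictionary layout are assumptions (the text only specifies a generic nonlinearity σ).

    import numpy as np

    def forward(x, params, Lam1, Lam2):
        # params holds U1, beta1, W1, U2, beta2, W2, U3, beta3, W3, b with
        # shapes U1:(nh1,d), W1:(nh1,s1), U2:(nh2,s1), W2:(nh2,s2), U3:(nh3,s2), W3:(nh3,p).
        sig = np.tanh
        e1 = params["W1"].T @ sig(params["U1"] @ x + params["beta1"])
        e2 = params["W2"].T @ sig(params["U2"] @ (e1 / Lam1) + params["beta2"])
        y  = params["W3"].T @ sig(params["U3"] @ (e2 / Lam2) + params["beta3"]) + params["b"]
        return y

    def n_unknowns(d, p, s1, s2, nh1, nh2, nh3):
        # Mirrors the count given in the text above.
        return (nh1 * (s1 + d + 1) + nh2 * (s2 + s1 + 1)
                + nh3 * (p + s2 + 1) + s1 + s2 + 1)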

In order to further reduce the number of unknowns, and partially in-
spired by convolutional operations in convolutional neural networks (Le-
Cun, Bottou, Bengio, & Haffner, 1998), we also consider the case where U1
and U2 are Toeplitz matrices. For a matrix U ∈ R^{n1×n2}, the number of un-
knowns is reduced then from n1 n2 to n1 + n2 − 1.

Table 1: Comparison of Test Error (%) of Models M1 and M2 on UCI Data Sets.

              pid                   bld                   ion                   adu
  M1,a        19.53 [20.02(1.53)]   26.09 [30.96(3.34)]   0 [0.68(1.60)]        16.99 [17.46(0.65)]
  M1,b        18.75 [19.39(0.89)]   25.22 [31.48(4.11)]   0 [5.38(12.0)]        17.08 [17.48(0.56)]
  M1,c        21.88 [24.73(5.91)]   28.69 [32.39(3.48)]   0 [8.21(6.07)]        17.83 [21.21(4.78)]
  M2,a        21.09 [20.20(1.51)]   27.83 [28.86(2.83)]   1.71 [5.68(2.22)]     15.07 [15.15(0.15)]
  M2,b        18.75 [20.33(2.75)]   28.69 [28.38(2.80)]   10.23 [6.92(3.69)]    14.91 [15.08(0.15)]
  M2,b,T      19.03 [19.16(1.10)]   26.08 [27.74(9.40)]   6.83 [6.50(8.31)]     15.71 [15.97(0.07)]
  M2,c        24.61 [22.34(1.95)]   32.17 [27.61(3.69)]   3.42 [9.66(6.74)]     15.21 [15.19(0.08)]
  bestbmark   22.7(2.2)             29.6(3.7)             4.0(2.1)              14.4(0.3)

Notes: Shown first is the test error corresponding to the selected model with minimal
validation error from the different random initializations. Between brackets, the mean and
standard deviation of the test errors related to all initializations are shown. The lowest test
error is in bold.

6 Numerical Examples

6.1 Two Kernel PCA Levels Followed by Regression Level: Examples.

We define the following models and methods for comparison:

• [M1]: Deep reduced set kernel-based models (with RBF kernel) with es-
  timation in the primal according to equation 5.3 with the following
  choices:
  [M1,a]: with additional term −c0 (Tr(Λ1) + Tr(Λ2)) and
  Tr( ˜H(l) ˜H(l)T ) (l = 1, 2, 3) regularization terms
  [M1,b]: without additional term −c0 (Tr(Λ1) + Tr(Λ2))
  [M1,c]: with objective function (1/(2 λ3)) Σ_{j=1}^M e_j^(3)T e_j^(3) + (η3/2) Tr(W3^T W3),
  that is, only the level 3 regression objective.
• [M2]: Deep feedforward neural networks with estimation in the primal
  according to equation 5.5 with the same choices in [M2,a], [M2,b],
  [M2,c] as above in [M1]. In the model [M2,b,T] Toeplitz matrices are
  taken for the U matrices in all levels, except for the last level.

We test and compare the proposed algorithms on a number of UCI data
sets: Pima indians diabetes (pid) (d = 8, p = 1, N = 400, Nval = 112, Ntest =
256), Bupa liver disorder (bld) (d = 6, p = 1, N = 170, Nval = 60, Ntest =
115), Johns Hopkins University ionosphere (ion) (d = 34, p = 1, N =
170, Nval = 64, Ntest = 117), and adult (adu) (d = 14, p = 1, N = 22000, Nval =
11000, Ntest = 12222) data sets, where the number of inputs (d), outputs
(p), training (N), validation (Nval), and test data (Ntest) are indicated. These
numbers correspond to previous benchmarking studies in Van Gestel et al.
(2004). In Table 1, bestbmark indicates the best result obtained in the bench-
marking study of Van Gestel et al. (2004) from different classifiers, includ-
ing SVM and LS-SVM classifiers with linear, polynomial, and RBF kernel;

linear and quadratic discriminant analysis; decision tree algorithm C4.5;
logistic regression; one-rule classifier; instance-based learners; and Naive
Bayes.

The tuning parameters, selected at the validation level, are

• pid: For M1: s(1),(2),(3) = 2, 2, p; M = 20; λ3 = 10−2; cstab = 10 (M1,a,b); c0 = 0.1 (M1,a). For M2: s(1),(2),(3) = 4, 2, p; nh1,2,3 = 3, 3, 3; λ3 = 10−2; cstab = 10 (M2,a,b); c0 = 0.1 (M2,a). For M2,b,T: s(1),(2),(3) = 4, 4, p; nh1,2,3 = 3, 3, 3; λ3 = 10−2; cstab = 1.

• bld: For M1: s(1),(2),(3) = 3, 2, p; M = 20; λ3 = 10−3; cstab = 100 (M1,a,b); c0 = 0.1 (M1,a). For M2: s(1),(2),(3) = 4, 2, p; nh1,2,3 = 3, 3, 5; λ3 = 10−3; cstab = 1000 (M2,a,b); c0 = 0.1 (M2,a). For M2,b,T: s(1),(2),(3) = 4, 2, p; nh1,2,3 = 3, 3, 5; λ3 = 10−3; cstab = 1000.

• ion: For M1: s(1),(2),(3) = 3, 2, p; M = 30; λ3 = 10−3; cstab = 100 (M1,a,b); c0 = 10−3 (M1,a). For M2: s(1),(2),(3) = 3, 3, p; nh1,2,3 = 3, 3, 3; λ3 = 10−3; cstab = 1000 (M2,a,b); c0 = 0.1 (M2,a). For M2,b,T: s(1),(2),(3) = 3, 3, p; nh1,2,3 = 3, 3, 3; λ3 = 10−3; cstab = 1000.

• adu: For M1: s(1),(2),(3) = 20, 10, p; M = 15; λ3 = 10−3; cstab = 0.1 (M1,a,b); c0 = 10−7 (M1,a). For M2: s(1),(2),(3) = 5, 2, p; nh1,2,3 = 10, 5, 3; λ3 = 10−4; cstab = 0.1 (M2,a,b); c0 = 10−7 (M2,a). For M2,b,T: s(1),(2),(3) = 5, 2, p; nh1,2,3 = 10, 5, 3; λ3 = 10−4; cstab = 0.1.

The other tuning parameters were selected as η1,2,3 = 1, 1, 1 for pid, bld,
ion and η1,2 = 10^3, η3 = 10−3 for adu, unless specified differently above. In
the M1 and M2 models, the ˜H(1),(2),(3) matrices and the interconnection ma-
trices were initialized at random according to a normal distribution with
zero mean and standard deviation 0.1 (100, 20, 10, and 3 initializations for
pid, bld, ion, adu, respectively), the diagonal matrices Λ1,2 by the identity
matrix, and σ1,2,3 = 1 for the RBF kernel models in M1. For the training, a
quasi-Newton method was used with fminunc in Matlab.

The following general observations from the experiments are shown in

桌子 1:

• Having the additional terms with kernel PCA objectives in levels 1
和 2, as opposed to the level 3 objective only, gives improved results
on all tried data sets.

• The best selected value for cstab varies among the data sets. In case
this value is large, the value of the objective function terms related to
the kernel PCA parts is close to zero.

• The use of Toeplitz matrices for the U matrices in the deep feedfor-
ward neural networks leads to competitive performance results and
greatly reduces the number of unknowns.

Figure 4 illustrates the evolution of the objective function (in logarithmic
scale) during training on the ion data set, for different values of cstab and in
comparison with a level 3 objective function only.

Figure 4: Illustration of the evolution of the objective function (logarithmic
scale) during training on the ion data set. Shown are training curves for the
model M2,a for different choices of cstab (equal to 1, 10, 100 in blue, red, and
magenta, respectively) in comparison with M2,c (level 3 objective only, in black),
for the same initialization.

6.2 Regression Level Followed by Two Kernel PCA Levels: Examples.

6.2.1 Regression Example on Synthetic Data Set. In this example, we com-
pare a basic LS-SVM regression with deep RKM consisting of three levels
with LS-SVM + KPCA + KPCA, where a gaussian RBF kernel K(x_i, x_j) =
exp(−‖x_i − x_j‖_2^2 / σ^2) is used in the LS-SVM level and linear kernels in the
KPCA levels. Training, validation, and test data sets are generated from the
following true underlying function,

    f(x) = sin(0.3 x) + sinc(0.5 x) + sin(2 x),        (6.1)

where zero mean gaussian noise with standard deviation 0.1, 0.5, 1, 和
2 is added to the function values for the different data sets. In this ex-
ample, we have a single input and single output d = p = 1. Training data
(with noise) are generated in the interval [−10, 10] with steps 0.1, valida-
tion data (with noise) in [−9.77, 9.87] with steps 0.11, and test data (noise-
less) in [−9.99, 9.99] with steps 0.07. In the experiments, 100 realizations for
the noise are made, for which the mean and standard deviation of the re-
sults are shown in Table 2. The tuning parameters are selected based on the

Table 2: Comparison between Basic LS-SVM Regression and Deep RKM on the
Synthetic Data Set, for Different Noise Levels.

  Noise   Basic               Deep (1+1)          Deep (7+2)
  0.1     0.0019 ± 4.3 10−4   0.0018 ± 4.2 10−4   0.0019 ± 4.4 10−4
  0.5     0.0403 ± 0.0098     0.0374 ± 0.0403     0.0397 ± 0.0089
  1       0.1037 ± 0.0289     0.0934 ± 0.0269     0.0994 ± 0.0301
  2       0.3368 ± 0.0992     0.2902 ± 0.0875     0.3080 ± 0.0954

validation set, which are σ, γ for the RBF kernel in the basic LS-SVM model
and σ, λ1, η2, η3 (η1 = 1 has been chosen) for the complete deep RKM. The
number of forward-backward passes in the deep RKM is chosen equal to
10. For deep RKM, we take the following two choices for the number of
components s(2), s(3) in the KPCA levels: 1 and 1, 7 and 2 for level 2 and
level 3, respectively. For deep RKM, the optimal values for 1/η2, 1/η3 are 10^5,
which means that the level 2 and 3 kernel PCA levels receive higher weight
in the kernel fusion terms. As seen in Table 2, deep RKM improves over the
basic LS-SVM regression in this example. The optimal values for (σ, λ1) are
(1, 0.001) for noise level 0.1 and (1, 0.01) for noise level 0.5, (1, 0.4) for noise
level 1, and (1, 1) for noise level 2.
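The data sets of this experiment can be generated as in the following sketch; the underlying function f of equation 6.1 is passed in as a callable, and the seed handling and exact grid endpoints are illustrative assumptions.

    import numpy as np

    def make_data(f, noise_std, seed=0):
        # Training / validation / test grids as described in section 6.2.1.
        rng = np.random.default_rng(seed)
        x_tr = np.arange(-10.0, 10.0 + 1e-9, 0.1)      # training inputs, steps 0.1
        x_val = np.arange(-9.77, 9.87 + 1e-9, 0.11)    # validation inputs, steps 0.11
        x_te = np.arange(-9.99, 9.99 + 1e-9, 0.07)     # test inputs, steps 0.07
        y_tr = f(x_tr) + noise_std * rng.standard_normal(x_tr.shape)
        y_val = f(x_val) + noise_std * rng.standard_normal(x_val.shape)
        y_te = f(x_te)                                 # test targets are noiseless
        return (x_tr, y_tr), (x_val, y_val), (x_te, y_te)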

6.2.2 Multiclass Example: USPS. In this example, the USPS handwritten
digits data set is taken from http://www.cs.nyu.edu/∼roweis/data.html.
It contains 8-bit grayscale images of digits 0 through 9 with 1100 examples
of each class. These data are used without additional scaling or prepro-
cessing. We compare a basic LS-SVM model (with primal representation
ŷ = W^T φ(x) + b and W ∈ R^{nf×p}, b ∈ R^p with p = 10, that is, one output per
class, and RBF kernel) with deep RKM consisting of LS-SVM + KPCA +
KPCA with RBF kernel in level 1 and linear kernels in levels 2 and 3 (with
number of selected components s(2), s(3) in levels 2 and 3). In level 1 of deep
RKM, the same type of model is taken as in the basic LS-SVM model. In
this way, we intend to study the effect of the two additional KPCA layers.
The dimensionality of the input data is d = 256. Two training set sizes were
taken (N = 2000 and N = 4000 data points, that is, 200 and 400 examples per
class), 2000 data points (200 per class) for validation, and 5000 data (500 per
class) for testing. The tuning parameters are selected based on the valida-
tion set: σ, γ for the RBF kernel in the basic LS-SVM model and σ, λ1, η2, η3
(η1 = 1 has been chosen) for deep RKM. The number of forward-backward
passes in the deep RKM is chosen equal to 2. The results are shown for
the case of 2000 training data in Figure 5, showing the results on training,
validation, and test data with the predicted class labels and the predicted
output values for the different classes. For the case N = 2000, the se-
lected values were σ^2 = 45, s(2) = 10, s(3) = 1, λ1 = 10−6, 1/η2 = 1/η3 = 10^6.

Figure 5: Deep RKM on USPS handwritten digits data set. Left top: Training
data results (2000 data). Left bottom: Validation data results (2000 data). Right
top: Test data results (5000 data). Right bottom: Output values for the 10 differ-
ent classes on the validation set.

The misclassification error on the test data set is 3.18% for the deep RKM and
3.26% for the basic LS-SVM (with σ^2 = 45 and γ = 1/λ1). For the case
N = 4000, the selected values were σ^2 = 45, s(2) = 1, s(3) = 1, λ1 = 10−6,
1/η2 = 1/η3 = 10^6. The misclassification error on the test data set is 2.12% for the
deep RKM and 2.14% for the basic LS-SVM (with σ^2 = 45 and γ = 1/λ1).
This illustrates that for deep RKM, levels 2 and 3 are given high relative
importance through the selection of large 1/η2, 1/η3 values.

6.2.3 Multiclass Example: MNIST. The data set, which is used without
additional scaling or preprocessing, is taken from http://www.cs.nyu.
edu/∼roweis/data.html. The dimensionality of the input data is d = 784
(images of size 28 × 28 for each of the 10 classes). In this case, we take
an ensemble approach where the training set (N = 50,000 with 10 classes)
has been partitioned into small nonoverlapping subsets of size 50 (5 data
points per class). The choice for this subset size resulted from taking the
last 10,000 points of this data set as validation data with the use of 40,000
data for training in that case. Other tuning parameters were selected in a
similar way. The 1000 resulting submodels have been linearly combined
after applying the tanh function to their outputs. The linear combination is
determined by solving an overdetermined linear system with ridge regres-
sion, following a similar approach as discussed in section 6.4 of Suykens
et al. (2002). For the submodels, deep RKMs consisting of LS-SVM + KPCA
+ KPCA with RBF kernel in level 1 and linear kernels in levels 2 and
3, are taken. The selected tuning parameters are σ^2 = 49, s(2) = 1, s(3) = 1,
λ1 = 10−6, η2 = η3 = 10−6. The number of forward-backward passes
in the deep RKM is chosen equal to 2. The training data set has been ex-
tended with another 50,000 training data consisting of the same data points
but corrupted with noise (random perturbations with zero mean and stan-
dard deviation 0.5, truncated to the range [0,1]), which is related to the
method with random perturbations in Kurakin, Goodfellow, and Bengio
(2016). The misclassification error on the test data set (10,000 data points) 是
1.28%, which is comparable in performance to deep belief networks (1.2%)
and in between the reported test performances of deep Boltzmann machines
(0.95, 1.01%) and SVM with gaussian kernel (1.4%) (Salakhutdinov, 2015)
(see http://yann.lecun.com/exdb/mnist/ for an overview and compari-
son of performances obtained by different methods).
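The ensemble step described above can be sketched as follows: the tanh of the submodel outputs forms the columns of an overdetermined system whose combination weights are obtained by ridge regression (the ridge constant and array layout are illustrative assumptions).

    import numpy as np

    def combine_submodels(F_train, Y_train, F_test, ridge=1e-3):
        # F_train: (N, n_models) raw submodel outputs on training data (per class),
        # Y_train: (N,) targets for that class, F_test: (Ntest, n_models).
        A = np.tanh(F_train)
        # Overdetermined system solved with ridge regression: (A^T A + ridge I) w = A^T y.
        n = A.shape[1]
        w = np.linalg.solve(A.T @ A + ridge * np.eye(n), A.T @ Y_train)
        return np.tanh(F_test) @ w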

7 Conclusion

In this letter, a theory of deep restricted kernel machines has been proposed.
It is obtained by introducing a notion of conjugate feature duality where the
conjugate features correspond to hidden features. Existing kernel machines
such as least squares support vector machines for classification and regres-
sion, kernel PCA, matrix SVD, and Parzen-type models are considered as
building blocks within a deep RKM and are characterized through the con-
jugate feature duality. By means of the inner pairing, one achieves a link
with the energy expression of restricted Boltzmann machines, though with
continuous variables in a nonprobabilistic setting. It also provides an inter-
pretation of visible and hidden units. Therefore, this letter connects, on
the one hand, to deep learning methods and, on the other hand, to least squares
support vector machines and kernel methods. In this way, the insights and
foundations achieved in these different research areas could possibly mu-
tually reinforce each other in the future. Much future work is possible in
different directions, including efficient methods and implementations for
big data, the extension to other loss functions and regularization schemes,
treating multimodal data, different coupling schemes, and models for clus-
tering and semisupervised learning.

Appendix: Stabilization Term for Kernel PCA

We explain here the role of the stabilization term in kernel PCA as a mod-
ification to equation 2.15. In this case, the objective function in the primal

    min_{w, e_i}  (1/2) w^T w − (γ/2) Σ_i e_i^2 + (cstab/2) [ (1/2) w^T w − (γ/2) Σ_i e_i^2 ]^2        (A.1)

    subject to  e_i = w^T φ(x_i), i = 1, ..., N.

Denoting J0 = (1/2) w^T w − (γ/2) Σ_i e_i^2, the Lagrangian is L = J0 + (cstab/2) J0^2 +
Σ_i α_i (e_i − w^T φ(x_i)), from which it follows that

    ∂L/∂w = 0  ⇒  (1 + cstab J0) w = Σ_i α_i φ(x_i)

    ∂L/∂e_i = 0  ⇒  (1 + cstab J0) γ e_i = α_i

    ∂L/∂α_i = 0  ⇒  e_i = w^T φ(x_i).

Assuming that 1 + cstab J0 ≠ 0, elimination of w and e_i yields Σ_j α_j K_{ji} =
(1/γ) α_i with K_{ji} = φ(x_j)^T φ(x_i), which is the solution that is also obtained
for the original formulation (corresponding to cstab = 0).
(西德:3)

Acknowledgments

The research leading to these results has received funding from the Eu-
ropean Research Council (FP7/2007-2013) / ERC AdG A-DATADRIVE-B
(290923) under the European Union’s Seventh Framework Programme.
This letter reflects only my views; the EU is not liable for any use that may
be made of the contained information; Research Council KUL: GOA/10/09
MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/Postdoc grants; Flem-
ish government: FWO: PhD/postdoc grants, projects: G0A4917N (Deep
restricted kernel machines), G.0377.12 (Structured systems), G.088114N
(tensor-based data similarity); IWT: PhD/postdoc grants, projects: SBO
POM (100031); iMinds Medical Information Technologies SBO 2014; Bel-
gian Federal Science Policy Office: IUAP P7/19 (DYSCO, dynamical sys-
tems, control and optimization, 2012–2017).

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.

Alzate, C., & Suykens, J. A. K. (2010). Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(2), 335–347.

Bengio, Y. (2009). Learning deep architectures for AI. Boston: Now.

Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1798–1828.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the COLT Fifth Annual Workshop on Computational Learning Theory (pp. 144–152). New York: ACM.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

Chen, L.-C., Schwing, A. G., Yuille, A. L., & Urtasun, R. (2015). Learning deep structured models. In Proceedings of the 32nd International Conference on Machine Learning.

Cho, Y., & Saul, L. K. (2009). Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 22. Red Hook, NY: Curran.

Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.

Damianou, A. C., & Lawrence, N. D. (2013). Deep gaussian processes. PMLR, 31, 207–215.

De Wilde, Ph. (1993). Class of Hamiltonian neural networks. Phys. Rev. E, 47, 1392–1396.

Fischer, A., & Igel, C. (2014). Training restricted Boltzmann machines: An introduction. Pattern Recognition, 47, 25–39.

Goldstein, H., Poole, C., & Safko, J. (2002). Classical mechanics. Reading, MA: Addison-Wesley.

Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore: Johns Hopkins University Press.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Hinton, G. E. (2005). What kind of graphical model is the brain? In Proc. 19th International Joint Conference on Artificial Intelligence (pp. 1765–1775). San Francisco: Morgan Kaufmann.

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554–2558.

Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2015). Deep structured output learning for unconstrained text recognition. In Proceedings of the International Conference on Learning Representations.

Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial machine learning at scale. arXiv:1611.01236

Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning. New York: ACM.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F.-J. (2006). A tutorial on energy-based learning. In G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, & B. Taskar (Eds.), Predicting structured data. Cambridge, MA: MIT Press.

Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 609–616). New York: ACM.

Mairal, J., Koniusz, P., Harchaoui, Z., & Schmid, C. (2014). Convolutional kernel networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (NIPS).

Mall, R., Langone, R., & Suykens, J. A. K. (2014). Multilevel hierarchical kernel spectral clustering for real-life large scale complex networks. PLOS One, 9(6), e99966.

Petersen, K. B., & Pedersen, M. S. (2012). The matrix cookbook. Lyngby: Technical University of Denmark.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1481–1497.

Rasmussen, C. E., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.

Rockafellar, R. T. (1987). Conjugate duality and optimization. Philadelphia: SIAM.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Salakhutdinov, R. (2015). Learning deep generative models. Annu. Rev. Stat. Appl., 2, 361–385.

Salakhutdinov, R., & Hinton, G. E. (2007). Using deep belief nets to learn covariance kernels for gaussian processes. In J. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems, 20. Red Hook, NY: Curran.

Salakhutdinov, R., & Hinton, G. E. (2009). Deep Boltzmann machines. PMLR, 5, 448–455.

Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algorithm in dual variables. In Proc. of the 15th Int. Conf. on Machine Learning (pp. 515–521). San Francisco: Morgan Kaufmann.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.

Schölkopf, B., Mika, S., Burges, C. C., Knirsch, P., Müller, K. R., Rätsch, G., & Smola, A. J. (1999). Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5), 1000–1017.

Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Schwing, A. G., & Urtasun, R. (2015). Fully connected deep structured networks. arXiv:1503.02351

Smale, S., Rosasco, L., Bouvrie, J., Caponnetto, A., & Poggio, T. (2010). Mathematics of the neural response. Foundations of Computational Mathematics, 10(1), 67–91.

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations. New York: McGraw-Hill.

Srivastava, N., & Salakhutdinov, R. (2014). Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15, 2949–2980.

Stewart, G. W. (1993). On the early history of the singular value decomposition. SIAM Review, 35(4), 551–566.

Suykens, J. A. K. (2013). Generating quantum-measurement probabilities from an optimality principle. Physical Review A, 87(5), 052134.

Suykens, J. A. K. (2016). SVD revisited: A new variational principle, compatible feature maps and nonlinear extensions. Applied and Computational Harmonic Analysis, 40(3), 600–609.

Suykens, J. A. K., Alzate, C., & Pelckmans, K. (2010). Primal and dual model representations in kernel-based learning. Statistics Surveys, 4, 148–183.

Suykens, J. A. K., & Vandewalle, J. (1999a). Training multilayer perceptron classifiers based on a modified support vector method. IEEE Transactions on Neural Networks, 10(4), 907–911.

Suykens, J. A. K., & Vandewalle, J. (1999b). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.

Suykens, J. A. K., Vandewalle, J., & De Moor, B. (1995). Artificial neural networks for modeling and control of non-linear systems. New York: Springer.

Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2002). Least squares support vector machines. Singapore: World Scientific.

Suykens, J. A. K., Van Gestel, T., Vandewalle, J., & De Moor, B. (2003). A support vector machine formulation to PCA analysis and its kernel version. IEEE Transactions on Neural Networks, 14(2), 447–450.

Van Gestel, T., Suykens, J. A. K., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G., De Moor, B., & Vandewalle, J. (2004). Benchmarking least squares support vector machine classifiers. Machine Learning, 54(1), 5–32.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.

Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2004). Exponential family harmoniums with an application to information retrieval. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.

Wiering, M. A., & Schomaker, L. R. B. (2014). Multi-layer support vector machines. In J. A. K. Suykens, M. Signoretto, & A. Argyriou (Eds.), Regularization, optimization, kernels, and support vector machines (pp. 457–476). Boca Raton, FL: Chapman & Hall/CRC.

Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., & Torr, P. H. S. (2015). Conditional random fields as recurrent neural networks. In Proceedings of the International Conference on Computer Vision. Piscataway, NJ: IEEE.

Received March 31, 2016; accepted March 15, 2017.
