LETTER
Communicated by Alexander Schwing
Deep Restricted Kernel Machines Using Conjugate
Feature Duality
Johan A. K. Suykens
johan.suykens@esat.kuleuven.be
KU Leuven ESAT-STADIUS, B-3001 Leuven, Belgium
The aim of this letter is to propose a theory of deep restricted kernel
machines offering new foundations for deep learning with kernel ma-
chines. From the viewpoint of deep learning, it is partially related to
restricted Boltzmann machines, which are characterized by visible and
hidden units in a bipartite graph without hidden-to-hidden connections
and deep learning extensions as deep belief networks and deep Boltz-
mann machines. From the viewpoint of kernel machines, it includes least
squares support vector machines for classification and regression, kernel
principal component analysis (PCA), matrix singular value decomposi-
tion, and Parzen-type models. A key element is to first characterize these
kernel machines in terms of so-called conjugate feature duality, yielding
a representation with visible and hidden units. It is shown how this is
related to the energy form in restricted Boltzmann machines, with con-
tinuous variables in a nonprobabilistic setting. In this new framework
of so-called restricted kernel machine (RKM) representations, the dual
variables correspond to hidden features. Deep RKM are obtained by cou-
pling the RKMs. The method is illustrated for deep RKM, consisting of
three levels with a least squares support vector machine regression level
and two kernel PCA levels. In its primal form also deep feedforward neu-
ral networks can be trained within this framework.
1 Introduction
Deep learning has become an important method of choice in several
research areas including computer vision, speech recognition, and lan-
guage processing (LeCun, Bengio, & Hinton, 2015). Among the existing
techniques in deep learning are deep belief networks, deep Boltzmann
machines, convolutional neural networks, stacked autoencoders with pre-
training and fine-tuning, and others (Bengio, 2009; Goodfellow, Bengio, &
Courville, 2016; Hinton, 2005; Hinton, Osindero, & Teh, 2006; LeCun et al.,
2015; Lee, Grosse, Ranganath, & Ng, 2009; Salakhutdinov, 2015; Schmid-
huber, 2015; Srivastava & Salakhutdinov, 2014; Chen, Schwing, Yuille, &
Urtasun, 2015; Jaderberg, Simonyan, Vedaldi, & Zisserman, 2014; Schwing
& Urtasun, 2015; Zheng et al., 2015). Support vector machines (SVM) and
Neural Computation 29, 2123–2163 (2017) © 2017 Massachusetts Institute of Technology.
Published under a Creative Commons
doi:10.1162/NECO_a_00984
Attribution 3.0 Unported (CC BY 3.0) license.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2124
J. Suykens
kernel-based methods have made a large impact on a wide range of appli-
cation fields, together with finding strong foundations in optimization and
learning theory (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995; Ras-
mussen & Williams, 2006; Schölkopf & Smola, 2002; Suykens, Van Gestel,
De Brabanter, De Moor, & Vandewalle, 2002; Vapnik, 1998; Wahba, 1990).
Therefore, one can pose the question: Which synergies or common founda-
tions could be developed between these different directions? There has al-
ready been exploration of such synergies— for example, in kernel methods
for deep learning (Cho & Saul, 2009), deep gaussian processes (Damianou &
Lawrence, 2013; Salakhutdinov & Hinton, 2007), convolutional kernel net-
works (Mairal, Koniusz, Harchaoui, & Schmid, 2014), multilayer support
vector machines (Wiering & Schomaker, 2014), and mathematics of the neu-
ral response (Smale, Rosasco, Bouvrie, Caponnetto, & Poggio, 2010), among
others.
In this letter, we present a new theory of deep restricted kernel ma-
chines (deep RKM), offering foundations for deep learning with kernel
machines. It partially relates to restricted Boltzmann machines (RBMs),
which are used within deep belief networks (Hinton, 2005; Hinton et al.,
2006). In RBMs, one considers a specific type of Markov random field, char-
acterized by a bipartite graph consisting of a layer of visible units and
another layer of hidden units (Bengio, 2009; Fisher & Igel, 2014; Hinton
et al., 2006; Salakhutdinov, 2015). In RBMs, which are related to harmoni-
ums (Smolensky, 1986; Welling, Rosen-Zvi, & Hinton, 2004), there are no
connections between the hidden units (Hinton, 2005), and often also no
visible-to-visible connections. In deep belief networks, the hidden units of
a layer are mapped to a next layer in order to create a deep architecture.
In RBM, one considers stochastic binary variables (Ackley, Hinton, & Se-
jnowski, 1985; Hertz, Krogh, & Palmer, 1991), and extensions have been
made to gaussian-Bernoulli variants (Salakhutdinov, 2015). Hopfield net-
works (Hopfield, 1982) take continuous values, and a class of Hamiltonian
neural networks has been studied in DeWilde (1993). Also, discriminative
RBMs have been studied where the class labels are considered at the level
of visible units (Fisher & Igel, 2014; Larochelle & Bengio, 2008). In all of
these methods the energy function plays an important role, as it also does
in energy-based learning methods (LeCun, Chopra, Hadsell, Ranzato, &
Huang, 2006).
Representation learning issues are considered to be important in deep
learning (Bengio, Courville, & Vincent, 2013). The method proposed in this
letter makes a link to restricted Boltzmann machines by characterizing sev-
eral kernel machines by means of so-called conjugate feature duality. Dual-
ity is important in the context of support vector machines (Boser et al., 1992;
Cortes & Vapnik, 1995; Vapnik, 1998; Suykens et al., 2002; Suykens, Alzate,
& Pelckmans, 2010), optimization (Boyd & Vandenberghe, 2004; Rockafel-
lar, 1987), and in mathematics and physics in general. Here we consider
hidden features conjugated to part of the unknown variables. This part of
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2125
the formulation is linked to a restricted Boltzmann machine energy expres-
sion, though with continuous variables in a nonprobabilistic setting. In this
way, a model can be expressed in both its primal representation and its dual
representation and give an interpretation in terms of visible and hidden
units, in analogy with RBM. The primal representation contains the feature
map, while the dual model representation is expressed in terms of the ker-
nel function and the conjugated features.
The class of kernel machines discussed in this letter includes least
squares support vector machines (LS-SVM) for classification and regres-
sion, kernel principal component analysis (kernel PCA), matrix singular
value decomposition (matrix SVD), and Parzen-type models. These have
been previously conceived within a primal and Lagrange dual setting in
Suykens and Vandewalle (1999b), Suykens et al. (2002), Suykens, Van Ges-
tel, Vandewalle, and De Moor (2003), and Suykens (2013, 2016). Other exam-
ples are kernel spectral clustering (Alzate & Suykens, 2010; Mall, Langone,
& Suykens, 2014), kernel canonical correlation analysis (Suykens et al.,
2002), and several others, which will not be addressed in this letter, but can
be the subject of future work. In this letter, we give a different characteriza-
tion for these models, based on a property of quadratic forms, which can be
verified through the Schur complement form. The property relates to a spe-
cific case of Legendre-Fenchel duality (Rockafellar, 1987). Also note that in
classical mechanics, converting a Lagrangian into Hamiltonian formulation
is by Legendre transformation (Goldstein, Poole, & Safko, 2002).
The kernel machines with conjugate feature representations are used
then as building blocks to obtain the deep RKM by coupling the RKMs. The
deep RKM becomes unrestricted after coupling the RKMs. The approach is
explained for a model with three levels, consisting of two kernel PCA lev-
els and a level with LS-SVM classification or regression. The conjugate fea-
tures of level 1 are taken as input of level 2 and, subsequently, the features
of level 2 as input for level 3. The objective of the deep RKM is the sum of
the objectives of the RKMs in the different levels. The characterization of
the stationary points leads to solving a set of nonlinear equations in the un-
knowns, which is computationally expensive. However, for the case of lin-
ear kernels, in part of the levels it reveals how kernel fusion is taking place
over the different levels. For this case, a heuristic algorithm is obtained with
level-wise solving. For the general nonlinear case, a reduced-set algorithm
with estimation in the primal is proposed.
In this letter, we make a distinction between levels and layers. We use
the terminology of levels to indicate the depth of the model. The terminol-
ogy of layers is used here in connection to the feature map. Suykens and
Vandewalle (1999a) showed how a multilayer perceptron can be trained by
a support vector machine method. It is done by defining the hidden layer
to be equal to the feature map. In this way, the hidden layer is treated at
the feature map and the kernel parameters level. Suykens et al. (2002) ex-
plained that in SVM and LS-SVM models, one can have a neural networks
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2126
J. Suykens
interpretation in both the primal and the dual. The number of hidden
units in the primal equals the dimension of the feature space, while in the
dual representation, it equals the number of support vectors. In this way,
it provides a setting to work with parametric models in the primal and
kernel-based models in the dual. Therefore, we also illustrate in this letter
how deep multilayer feedforward neural networks can be trained within
the deep RKM framework. While in classical backpropagation (Rumelhart,
Hinton, & Williams, 1986), one typically learns the model by specifying a
single objective (e.g., unless imposing additional stability constraints to ob-
tain stable multilayer recurrent networks with dynamic backpropagation;
(Suykens, Vandewalle, & De Moor, 1995), in the deep RKM the objective
function consists of the different objectives related to the different levels.
In summary, we aim at contributing to the following challenging ques-
tions in this letter:
• Can we find new synergies and foundations between SVM and kernel
methods and deep learning architectures?
• Can we extend primal and dual model representations, as occurring
in SVM and LS-SVM models, from shallow to deep architectures?
• Can we handle deep feedforward neural networks and deep kernel
machines within a common setting?
In order to address these questions, this letter is organized as follows.
Section 2 outlines the context of this letter with a brief introductory part on
restricted Boltzmann machines, SVMs, LS-SVMs, kernel PCA, and SVD. In
section 3 we explain how these kernel machines can be characterized by
conjugate feature duality with visible and hidden units. In section 4 deep
restricted kernel machines are explained for three levels: an LS-SVM regres-
sion level and two additional kernel PCA levels. In section 5, different algo-
rithms are proposed for solving in either the primal or the dual, where the
former will be related to deep feedfoward neural networks and the latter
to kernel-based models. Illustrations with numerical examples are given in
section 6. Section 7 concludes the letter.
2 Preliminaries and Context
In this section, we explain basic principles of restricted Boltzmann ma-
chines, SVMs, LS-SVMs, and related formulations for kernel PCA, and SVD.
These are basic ingredients needed before introducing restricted kernel ma-
chines in section 3.
2.1 Restricted Boltzmann Machines. An RBM is a specific type of
Markov random field, characterized by a bipartite graph consisting of a
layer of visible units and another layer of hidden units (Bengio, 2009; Fisher
& Igel, 2014; Hinton et al., 2006; Salakhutdinov, 2015), without hidden-to-
hidden connections. Both the visible and hidden variables, denoted by v
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2127
Figure 1: Restricted Boltzmann machine consisting of a layer of visible units v
and a layer of hidden units h. They are interconnected through the interaction
matrix W, depicted in blue.
and h, respectively, have stochastic binary units with value 0 or 1. A joint
state {v, h} is defined for these visible and hidden variables with energy (see
Figure 1),
E(v, h; θ ) = −v TWh − cT v − aT h,
(2.1)
where θ = {W, c, a} are the model parameters, W is an interaction weight
matrix, and c, a contain bias terms.
One then obtains the joint distribution over the visible and hidden units
as
P(v, h; θ ) = 1
Z(θ )
exp(−E(v, h; θ ))
(2.2)
with the partition function
Z(θ ) =
(cid:2)
(cid:2)
v
h
exp(−E(v, h; θ ))
for normalization.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Thanks to the specific bipartite structure, one can obtain an explicit
h exp(−E(v, h; θ )). The
(cid:3)
expression for the marginalization P(v; θ ) = 1
Z(θ )
conditional distributions are obtained as
(cid:4)
P(h|v; θ ) =
p(h( j)
|v ),
P(v|h; θ ) =
(cid:4)
j
i
p(v
(i)
|h),
(2.3)
2128
J. Suykens
(cid:3)
= 1|v ) = σ (
v
where p(h( j)
i Wi j
di) with σ (x) = 1/(1 + exp(−x)) the logistic function. Here v
note the ith visible unit and the jth hidden unit, respectively.
+ a j ) and p(v
= 1|h) = σ (
(i)
(i)
(cid:3)
+
j Wi jh( j)
(i) and h( j) de-
Because exact maximum likelihood for this model is intractable, a con-
trastive divergence algorithm is used with the following update equation
for the weights,
(cid:4)W = α(EPdata (vhT ) − EPT (vhT )),
(2.4)
with learning rate α and EPdata the expectation with regard to the data distri-
bution Pdata(h, v; θ ) = P(h|v; θ )Pdata(v ), where Pdata(v ) denotes the empirical
distribution. Furthermore, EPT is a distribution defined by running a Gibbs
chain for T steps initialized at the data. Often one takes T = 1, while T → ∞
recovers the maximum likelihood approach (Salakhutdinov, 2015).
In Boltzmann machines there are, in addition to visible-to-hidden, also
visible-to-visible and hidden-to-hidden interaction terms with
E(v, h; θ ) = −v TWh − 1
2
v T Lv − 1
2
hT Gh
(2.5)
and θ = {W, L, G} as explained in Salakhutdinov and Hinton (2009).
In section 3 we make a connection between the energy expression, equa-
tion 2.1, and a new representation of least squares support vector machines
and related kernel machines, which will be made in terms of visible and
hidden units. We now briefly review basics of SVMs, LS-SVMs, PCA, and
SVD.
2.2 Least Squares Support Vector Machines and Related Kernel
Machines.
2.2.1 SVM and LS-SVM. Assume a binary classification problem with
∈ Rd and corresponding class
i=1 with input data xi
, yi)}N
∈ {−1, 1}. An SVM classifier takes the form
training data {(xi
labels yi
ˆy = sign[wT ϕ(x) + b],
where the feature map ϕ(·) : Rd → Rn f maps the data from the input space
to a high-dimensional feature space and ˆy is the estimated class label for a
given input point x ∈ Rd. The training problem for this SVM classifier (Boser
et al., 1992; Cortes & Vapnik, 1995; Vapnik, 1998) is
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2129
min
w,b,ξ
wT w + c
1
2
N(cid:2)
i=1
ξ
i
subject to yi[wT ϕ(xi) + b] ≥ 1 − ξ
i
,
i = 1, . . . , N
(2.6)
ξ
i
≥ 0,
i = 1, . . . , N,
where the objective function makes a trade-off between minimization of the
regularization term (corresponding to maximization of the margin 2/(cid:6)w(cid:6)
2)
and the amount of misclassifications, controlled by the regularization con-
stant c > 0. The slack variables ξ
i are needed to tolerate misclassifications
on the training data in order to avoid overfitting the data. The following
dual problem in the Lagrange multipliers α
i is obtained, related to the first
set of constraints:
− 1
2
N(cid:2)
maxα
subject to
N(cid:2)
i, j=1
yiy j K(xi
, x j ) α
i
α
j
+
N(cid:2)
j=1
α
j
α
iyi
= 0
i=1
0≤ α
i
≤ c, i = 1, . . . , N.
(2.7)
Here a positive-definite kernel K is used with K(x, z) = ϕ(x)T ϕ(z) =
(cid:3)n f
j(z). The SVM classifier is expressed in the dual as
j=1
⎤
j(x)ϕ
⎡
ϕ
ˆy = sign
⎣
(cid:2)
i∈S
SV
α
i yi K(xi
, x) + b
⎦ ,
(2.8)
SV denotes the set of support vectors, corresponding to the nonzero
, x j ) =
i x j )d with ν ≥ 0, or gaussian RBF kernel
where S
α
i values. Common choices are, for example, to take a linear K(xi
xT
i x j, polynomial K(xi
, x j ) = exp(−(cid:6)xi
K(xi
The LS-SVM classifier (Suykens & Vandewalle, 1999b) is a modification
, x j ) = (ν + xT
/σ 2).
− x j
(cid:6)2
2
to it,
min
w,b,ei
1
2
wT w + γ 1
2
N(cid:2)
i=1
e2
i
subject to yi[wT ϕ(xi) + b] = 1 − ei
,
i = 1, . . . , N,
(2.9)
where the value 1 in the constraints is taken as a target value instead
of a threshold value. This implicitly corresponds to a regression on the
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2130
J. Suykens
α
i
N
i=1
−
class labels ±1. From the Lagrangian L(w, b, e; α) = 1
(cid:3)
2
}, one takes the conditions for optimality
= 0. Writing the solution in α, b
= 0, ∂L/∂α
i
{yi[wT ϕ(xi) +b] − 1 + ei
∂L/∂w = 0, ∂L/∂b = 0, ∂L/∂ei
gives the square linear system
wT w + γ 1
2
N
i=1 e2
i
(cid:3)
⎡
⎣
(cid:11) + I/γ y1:N
T
1:N
y
0
(cid:9)
⎤
⎦
α
b
(cid:10)
(cid:9)
(cid:10)
=
1N
0
,
where (cid:11)
i j
[1; …; 1] with, as classifier in the dual,
ϕ(xi)T ϕ(x j ) = yiy j K(xi
= yiy j
(2.10)
, x j ) and y1:N = [y1
; . . . ; yN], 1N =
ˆy = sign
(cid:9)
N(cid:2)
i=1
(cid:10)
α
iyiK(xi
, x) + b
.
(2.11)
This formulation has also been extended to multiclass problems in Suykens
et al. (2002).
In the LS-SVM regression formulation (Suykens et al., 2002) one per-
forms ridge regression in the feature space with an additional bias term b,
min
w,b,ei
1
2
wT w + γ 1
2
N(cid:2)
i=1
e2
i
subject to yi
= wT ϕ(xi) + b + ei
,
i = 1, . . . , N,
which gives
⎡
⎣
K + I/γ 1N
T
N
1
0
(cid:9)
⎤
⎦
α
b
(cid:10)
(cid:9)
(cid:10)
=
y1:N
0
with the predicted output
ˆy =
N(cid:2)
i=1
α
iK(xi
, x) + b,
(2.12)
(2.13)
(2.14)
= K(xi
, x j ) = ϕ(xi)T ϕ(x j ). The classifier formulation can also be
where Ki j
transformed into the regression formulation by multiplying the constraints
in equation 2.9 by the class labels and considering new error variables
(Suykens et al., 2002). In the zero bias term case, this corresponds to ker-
nel ridge regression (Saunders, Gammerman, & Vovk, 1998), which is also
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2131
related to function estimation in reproducing kernel Hilbert spaces, regular-
ization networks, and gaussian processes, within a different setting (Poggio
& Girosi, 1990; Wahba, 1990; Rasmussen & Williams, 2006; Suykens et al.,
2002).
2.2.2 Kernel PCA and Matrix SVD. Within the setting of using equality
constraints and the L2 loss function, typical for LS-SVMs, one can character-
ize the kernel PCA problem (Schölkopf, Smola, & Müller, 1998) as follows,
as shown in Suykens et al. (2002, 2003):
1
min
w,b,ei
2
subject to ei
N(cid:2)
e2
i
wT w − γ 1
2
i=1
= wT ϕ(xi) + b,
i = 1, . . . , N.
(2.15)
From the KKT conditions, one obtains the following in the Lagrange multi-
pliers α
i,
K(c)α = λα with λ = 1/γ ,
(2.16)
(cid:3)
N
i=1
ϕ(xi) and α = [α
1
where K(c)
= (ϕ(xi) − ˆμϕ )T (ϕ(x j ) − ˆμϕ ) are the elements of the centered ker-
i j
nel matrix K(c), ˆμϕ = (1/N)
; ….; αN]. In equation 2.15,
maximizing instead of minimizing also leads to equation 2.16. The center-
ing of the kernel matrix is obtained as a result of taking a bias term b in the
model. The γ value is treated at a selection level and is chosen so as to cor-
respond to λ = 1/γ , where λ are eigenvalues of K(c). In the zero bias term
case, K(c) becomes the kernel matrix K = [ϕ(xi)T ϕ(x j )]. Also, kernel spectral
clustering (Alzate & Suykens, 2010) was obtained in this setting by consid-
ering a weighted version of the L2 loss part, weighted by the inverse of the
degree matrix of the graph in the clustering problem.
Suykens (2016) showed recently that matrix SVD can be obtained from
the following primal problem:
minw,v,e,r
subject to ei
−wT v + γ 1
2
N(cid:2)
e2
i
+ γ 1
2
M(cid:2)
r2
j
i=1
j=1
= wT ϕ(xi), i = 1, . . . , N
(2.17)
r j
= v T ψ (z j ), j = 1, . . . , M,
i=1 and {z j
}N
}M
where {xi
j=1 are data sets related to two data sources, which in
the matrix SVD (Golub & Van Loan, 1989; Stewart, 1993) case correspond to
the sets of rows and columns of the given matrix. Here one has two fea-
ture maps ϕ(·) and ψ (·). After taking the Lagrangian and the necessary
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2132
J. Suykens
conditions for optimality, the dual problem in the Lagrange multipliers α
i
and β
j, related to the first and second set of constraints, results in
(cid:9)
(cid:10) (cid:9)
0 A
AT 0
(cid:10)
(cid:9)
= λ
(cid:10)
,
α
β
α
β
(2.18)
where A = [ϕ(xi)T ψ (z j )] denotes the matrix with i jth entry ϕ(xi)T ψ (z j ),
λ = 1/γ corresponding to nonzero eigenvalues, and α = [α
; ….; αN], β =
1
[β
; ….; βM]. For a given matrix A, by choosing the linear feature maps
1
ϕ(xi) = CT xi, ψ (z j ) = z j with a compatibility matrix C that satisfies ACA =
A, this eigenvalue problem corresponds to the SVD of matrix A (Suykens,
2016) in connection with Lanczos’s decomposition theorem. One can also
see that for a symmetric matrix, the two data sources coincide, and the
objective of equation 2.17 reduces to the kernel PCA objective, equation
2.15 (Suykens, 2016), involving only one feature map instead of two feature
maps.
3 Restricted Kernel Machines and Conjugate Feature Duality
3.1 LS-SVM Regression as a Restricted Kernel Machine: Linear Case.
A training data set D = {(xi
, yi)}N
i=1 is assumed to be given with input data
∈ Rp (now with p outputs), where the data are
∈ Rd and output data yi
xi
assumed to be identical and independently distributed and drawn from an
unknown but fixed underlying distribution P(x,y), a common assumption
made in statistical learning theory (Vapnik, 1998).
We will explain now how LS-SVM regression can be linked to the en-
ergy form expression of an RBM with an interpretation in terms of hid-
den and visible units. In view of these connections with RBMs and the
fact that there will be no hidden-to-hidden connections, we will call it a
restricted kernel machine (RKM) representation, when this particular in-
terpretation of the model is made. For LS-SVM regression, the part in the
RKM interpretation that will take a similar form as the RBM energy function
is
RRKM(v, h) = −v T ˜Wh
= −(xTWh + bT h − yT h)
= eT h,
(3.1)
with a vector of hidden units h ∈ Rp and a vector of visible units v ∈ Rnv
with nv = d + 1 + p equal to
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2133
⎤
⎥
⎦ and ˜W =
v =
⎡
⎢
⎣
x
1
−y
⎤
⎥
⎦ ,
⎡
⎢
⎣
W
bT
Ip
(3.2)
and e = y − ˆy with ˆy = W T x + b the estimated output vector for a given in-
put vector x where e, y, ˆy ∈ Rp, W ∈ Rd×p, b ∈ Rp. Note that b is treated as
part of the interconnection matrix by adding a constant 1 within the vector
v, which is also frequently done in the area of neural networks (Suykens
et al., 1995). While in RBM the units are binary valued, in the RKM, they are
continuous valued. The notation R in RRKM(v, h) refers to the fact that the
expression is restricted; there are no hidden-to-hidden connections.
For the training problem, the sum is taken over the training data
, yi)}N
{(xi
i=1 with
N(cid:2)
Rtrain
RKM
=
RRKM(v
i
, hi)
i=1
N(cid:2)
= −
(xT
i Whi
+ bT hi
− yT
i hi)
(3.3)
i=1
N(cid:2)
.
eT
i hi
=
i=1
Note that we will adopt the following notation h( j),i to denote the value of
the jth unit for the ith data point, and ei
∈ Rp for i = 1, . . . , N.
, hi
We start now from the LS-SVM regression training problem, equation
2.12, but for the multiple outputs case. We express the objective in terms
RKM and show how the hidden units can be introduced. Defining λ =
of Rtrain
1/γ > 0, we obtain
J =
η
2
Tr(W TW ) + 1
2λ
N(cid:2)
i=1
i ei s.t. ei
eT
= yi
− W T xi
− b, ∀i
≥
=
N(cid:2)
i=1
N(cid:2)
i=1
eT
i hi
−
λ
2
N(cid:2)
i=1
hT
i hi
+
η
2
Tr(W TW ) s.t. ei
= yi
− W T xi
− b, ∀i
(yT
i
− xT
i W − bT )hi
−
λ
2
N(cid:2)
i=1
hT
i hi
+
η
2
Tr(W TW ) (cid:2) J
= Rtrain
RKM
−
λ
2
N(cid:2)
i=1
hT
i hi
+
η
2
Tr(W TW ),
(3.4)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2134
J. Suykens
where λ, η are positive regularization constants and the first term corre-
RKM. J denotes the lower bound on J.1 This is based on the prop-
sponds to Rtrain
erty that for two arbitrary vectors e, h, one has
1
2λ eT e ≥ eT h −
λ
2
hT h, ∀e, h ∈ Rp.
(3.5)
The maximal value of the right-hand side in equation 3.5 is obtained for h =
e/λ, which follows from ∂ (eT h − λ
2 hT h)/∂h2 =
−λI < 0. The maximal value that can be obtained for the right-hand side
equals the left-hand side, 1
2λ eT e. The property 3.5 can also be verified by
writing it in quadratic form,
2 hT h)/∂h = 0 and ∂ 2(eT h − λ
(cid:14)
(cid:13)
eT hT
1
2
(cid:9)
(cid:10) (cid:9)
(cid:10)
1
λ I I
I λI
e
h
≥ 0, ∀e, h ∈ Rp,
(3.6)
which holds. This follows immediately from the Schur complement form,2
2 (λI − I(λI)I) ≥ 0, which holds. Writing equa-
which results in the condition 1
tion 3.5 as
1
2λ eT e +
λ
2
hT h ≥ eT h
(3.7)
gives a property that is also known in Legendre-Fenchel duality for the case
of a quadratic function (Rockafellar, 1987). Furthermore, it also follows from
equation 3.5 that
1
2λ eT e = max
h
(cid:15)
eT h −
(cid:16)
hT h
.
λ
2
(3.8)
We will call the method of introducing the hidden features hi into equation
3.4 conjugate feature duality, where the hidden features hi are conjugated to
=
the ei. Here, Rtrain
i hi will be called an inner pairing between the ei
RKM
and the hidden features hi (see Figure 2).
i eT
(cid:3)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1Note that also the term − λ
2
N
i=1 hT
(cid:3)
energy correspond to matrix G equal to the identity matrix. The term
additional regularization term.
(cid:17)
i hi appears. This would in a Boltzmann machine
η
2 Tr(W TW ) is an
(cid:18)
, one has Q ≥ 0 if and only if A > 0 and the
A B
BT C
2This states that for a matrix Q =
Schur complement C − BT A
−1B ≥ 0 (Boyd & Vandenberghe, 2004).
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2135
Figure 2: Restricted kernel machine (RKM) representation for regression. The
feature map ϕ(x) maps the input vector x to a feature space (possibly by mul-
tilayers, depicted in yellow), and the hidden features are obtained through an
inner pairing eT h where e = y − ˆy compares the given output vector y with the
predictive model output vector ˆy = W T ϕ(x) + b, where the interconnection ma-
trix W is depicted in blue.
We proceed now by looking at the stationary points of J(hi
, W, b):3
⎧
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
∂J
∂hi
∂J
∂W
∂J
∂b
= 0 ⇒ yi
= W T xi
+ b + λhi
, ∀i
= 0 ⇒ W = 1
η
(cid:2)
i
xihT
i
(3.9)
= 0 ⇒
(cid:2)
i
= 0.
hi
/λ, which means that the maximal value of
The first condition yields hi
+ ei
+ λhi. Also note the similarity be-
= ˆyi
J is reached. Therefore, yi
tween the condition W = 1
i xihT
i and equation 2.4 in the contrastive di-
η
vergence algorithm. Elimination of hi from this set of conditions gives the
solution in W, b:
= ei
= ˆyi
(cid:3)
3The following properties are used throughout this letter:
= A, ∂Tr(XA)
∂aT Xb
∂X
aT a = Tr(aaT ) for matrices A, B, X and vectors a, b (Petersen & Pedersen, 2012).
= baT , ∂Tr(XT A)
= A, ∂Tr(AXT )
= abT , ∂aT XT b
∂Tr(XT BX )
∂X
= AT , ∂xT a
∂x
∂X
∂X
∂X
∂X
= BX + BT X,
= ∂aT x
= a,
∂x
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2136
⎡
(cid:3)
⎣
i xixT
i
(cid:3)
+ ληId
T
i
i x
(cid:3)
⎤
⎡
⎤
⎡
i xi
⎦
⎣
N
W
bT
⎦ =
⎣
(cid:3)
i xiyT
i
(cid:3)
T
i
i y
⎤
⎦ .
J. Suykens
(3.10)
Elimination of W from the set of conditions gives the solution in hi
, b:
⎡
⎣
1
η [xT
i x j] + λIN 1N
T
N
1
0
⎤
⎡
⎦
⎣
⎤
⎡
⎦ =
⎣
⎤
⎦ ,
YT
0
HT
bT
(3.11)
…hN] ∈ Rp×N,
i x j] denoting the matrix with i j-entry xT
with [xT
…yN] ∈ Rp×N. From this square linear system, one can solve {hi
Y = [y1
} and
b. 1N denotes a vector of all ones of size N and IN the identity matrix of size
N × N.
i x j, H = [h1
It is remarkable to see here that the hidden features hi take the same role
as the Lagrange dual variables α
i in the LS-SVM formulation based on La-
grange duality, equation 2.13, when taking η = 1 and p = 1. For the esti-
mated values ˆyi on the training data, one can express the model in terms
of W, b or in terms of hi
, b. In the restricted kernel machine interpretation
of the LS-SVM regression, one has the following primal and dual model
representations:
(P)RKM : ˆy = W T x + b
M
(cid:10)
(cid:11)
(D)RKM : ˆy = 1
η
(cid:2)
j
h jxT
j x + b
(3.12)
evaluated at a point x where the primal representation is in terms of W, b
and the dual representation is in the hidden features hi. The primal repre-
sentation is suitable for handling the “large N, small d” case, while the dual
representation for “small N, large d” (Suykens et al., 2002).
3.2 Nonlinear Case. The extension to the general nonlinear case goes
by replacing xi by ϕ(xi) where ϕ(xi) : Rd → Rn f denotes the feature map,
with nf the dimension of the feature space. Therefore, the objective function
for, the RKM interpretation becomes
J =
N(cid:2)
i=1
(yT
i
− ϕ(xi)TW − bT )hi
−
λ
2
N(cid:2)
i=1
hT
i hi
+
η
2
Tr(W TW ),
(3.13)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2137
with the vector of visible units v ∈ Rnv with nv = n f
+ 1 + p equal to
v =
⎤
⎥
⎦ .
⎡
⎢
⎣
ϕ(x)
1
−y
(3.14)
Following the same approach as in the linear case, one then obtains as a
solution in the primal
⎡
(cid:3)
⎣
j
ϕ(x j )ϕ(x j )T + ληIn f
(cid:3)
ϕ(x j )T
j
(cid:3)
ϕ(x j )
j
N
⎤
⎡
⎤
⎡
(cid:3)
⎦
⎣
W
bT
⎦ =
⎣
ϕ(x j )yT
j
j
(cid:3)
T
j
j y
⎤
⎦ .
(3.15)
In the conjugate feature dual, one obtains the same linear system as equa-
, x j ) = ϕ(xi)T ϕ(x j ) in-
tion 3.11, but with the positive-definite kernel K(xi
stead of the linear kernel xT
⎡
⎣
η K + λIN 1N
1
T
N
1
0
⎤
⎡
⎦
⎣
⎦ =
⎣
HT
bT
⎡
⎤
⎦ .
YT
0
(3.16)
i x j:
⎤
We also employ the notation [K(xi
the i jth entry equal to K(xi
, x j ).
, x j )] to denote the kernel matrix K with
The primal and dual model representations are expressed in terms of the
feature map and kernel function, respectively:
(P)RKM : ˆy = W T ϕ(x) + b
M
(cid:10)
(cid:11)
(D)RKM : ˆy = 1
η
(cid:2)
j
h jK(x j
, x) + b.
(3.17)
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
One can define the feature map in either an implicit or an explicit way.
When employing a positive-definite kernel function K(·, ·), according to
the Mercer theorem, there exists a feature map ϕ such that K(xi
, x j ) =
ϕ(xi)T ϕ(x j ) holds. On the other hand, one could also explicitly define an
expression for ϕ and construct the kernel function according to K(xi
, x j ) :=
ϕ(xi)T ϕ(x j ). For multilayer perceptrons, Suykens and Vandewalle (1999a)
showed that the hidden layer can be chosen as the feature map. We can let it
correspond to
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
2138
FF(x) = σ (Uqσ (…σ (U2
ϕ
σ (U1x + β
1) + β
2)…) + βq)
J. Suykens
(3.18)
, U2
related to a feedforward (FF) neural network with multilayers, with hidden
, . . . , βq. By con-
, . . . , Uq and bias term vectors β
layer matrices U1
1
struction, one obtains KFF(xi
FF(x j ). Note that the activation
function σ might be different also for each of the hidden layers. A common
choice is a sigmoid or hyperbolic tangent function. Within the context of
, . . . , βq are treated at the feature map and the kernel
this letter, U1
parameter levels.
, …, Uq, β
1
, x j ) := ϕ
FF(xi)T ϕ
, β
2
As Suykens et al. (2002) explained, one can also give a neural network
interpretation to both the primal and the dual representation, with a num-
ber of hidden units equal to the dimension of the feature space for the
primal representation and the number of support vectors in the dual rep-
resentation, respectively. For the case of a gaussian RBF kernel, one has a
one-hidden-layer interpretation with an infinite number of hidden units in
the primal, while in the dual, the number of hidden units equals the number
of support vectors.
3.3 Classifier Formulation. In the multiclass case, the LS-SVM classifier
constraints are
Dyi (W T ϕ(xi) + b) = 1p − ei
, i = 1, . . . , N,
(3.19)
where yi
nal matrix Dyi
∈ {−1, 1}p, ei
∈ Rp with p outputs encoding the classes and diago-
}.
In this case, starting from the LS-SVM classifier objective, one obtains
= diag{ y(1),i
, . . . , y(p),i
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
J =
η
2
Tr(W TW ) + 1
2λ
N(cid:2)
i=1
i ei s.t. ei
eT
= 1p − Dyi (W T ϕ(xi) + b), ∀i
≥
N(cid:2)
i=1
eT
i hi
−
λ
2
N(cid:2)
i=1
hT
i hi
+
η
2
− Dyi (W T ϕ(xi) + b), ∀i
Tr(W TW ) s.t. ei
= 1p
=
(cid:23)
1T
p
N(cid:2)
i=1
− (ϕ(xi)TW + bT )Dyi
−
hi
(cid:24)
λ
2
N(cid:2)
i=1
hT
i hi
+
η
2
Tr(W TW ) (cid:2) J.
(3.20)
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
The stationary points of J(hi
, W, b) are given by
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2139
⎧
⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎩
∂J
∂hi
∂J
∂W
∂J
∂b
= 0 ⇒ 1p = Dyi (W T ϕ(xi) + b) + λhi
(cid:2)
ϕ(xi)hT
i Dyi
= 0 ⇒ W = 1
η
(cid:2)
= 0 ⇒
Dyi hi
i
i
= 0.
, ∀i
(3.21)
The solution in the conjugate features follows then from the linear system:
⎡
⎣
η K + λIN 1N
1
T
N
1
0
⎤
⎡
⎦
⎣
⎤
⎡
⎦ =
⎣
⎤
⎦
YT
0
HT
D
bT
(3.22)
with HD = [Dy1 h1
, . . . , DyN hN].
The primal and dual model representations are expressed in terms of the
feature map and the kernel function, respectively:
(P)RKM : ˆy = sign[W T ϕ(x) + b]
M
(cid:10)
(cid:11)
(D)RKM : ˆy = sign
⎡
⎣ 1
η
(cid:2)
j
⎤
(3.23)
Dy j h jK(x j
, x) + b
⎦ .
3.4 Kernel PCA. In the kernel PCA case we start from the objective in
equation 2.15 and introduce the conjugate hidden features:
i ei s.t. ei
eT
= W T ϕ(xi), ∀i
J =
η
2
Tr(W TW ) − 1
2λ
N(cid:2)
N(cid:2)
λ
eT
i hi
+
2
i=1
N(cid:2)
i=1
hT
i hi
+
η
2
≤ −
i=1
N(cid:2)
= −
i=1
= −Rtrain
RKM
+
Tr(W TW ),
λ
2
N(cid:2)
i=1
hT
i hi
i=1
η
+
2
Tr(W TW ) s.t. ei
= W T ϕ(xi), ∀i
ϕ(xi)TWhi
+
λ
2
N(cid:2)
hT
i hi
+
η
2
Tr(W TW ) (cid:2) J
(3.24)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2140
J. Suykens
where the upper bound J is introduced now by relying on the same property
2 hT h ≥ eT h, but in a
as used in the regression/classification case, 1
different way. Note that
2λ eT e + λ
− 1
2λ eT e = min
h
(−eT h +
λ
2
hT h).
(3.25)
The minimal value for the right-hand side is obtained for h = e/λ, which
equals the left-hand side in that case.
We then proceed by characterizing the stationary points of J(hi
⎧
⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎩
∂J
∂hi
∂J
∂W
= 0 ⇒ W T ϕ(xi) = λhi
, ∀i
= 0 ⇒ W = 1
η
(cid:2)
i
ϕ(xi)hT
i
.
, W ):
(3.26)
/λ. Therefore, the minimum value
= ei
Note that the first condition yields hi
of J is reached. Elimination of W gives the following solution in the conju-
gated features,
1
η KHT = HT (cid:16),
(3.27)
…hN] ∈ Rs×N and (cid:16) = diag{λ
, …, λs} with s ≤ N the number
where H = [h1
of selected components. One can verify that the solutions corresponding to
the different eigenvectors hi and their corresponding eigenvalues λ
i all lead
to the value J = 0.
1
The primal and dual model representations are
(P)RKM : ˆe = W T ϕ(x)
M
(cid:10)
(cid:11)
(D)RKM : ˆe = 1
η
(cid:2)
j
h jK(x j
, x).
(3.28)
Here the number of hidden units equals s with h ∈ Rs and the visible
units v ∈ Rn f with v = ϕ(x), and RRKM(v, h) = −v TWh.
3.5 Singular Value Decomposition. For the SVD case, we start from
the objective in equation 2.17 and introduce the conjugated hidden features.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2141
The model is characterized now by matrices W, V:
J = −
η
2
Tr(V TW ) + 1
2λ
N(cid:2)
i=1
eT
i ei
+ 1
2λ
M(cid:2)
j=1
rT
j r j
s.t. ei
= W T ϕ(xi), ∀i & r j
= V T ψ (z j ), ∀ j
≥
N(cid:2)
i=1
eT
i hei
−
λ
2
N(cid:2)
i=1
hT
ei hei
+
M(cid:2)
j=1
rT
j hr j
−
λ
2
M(cid:2)
j=1
hT
r j hr j
−
η
2
Tr(V TW )
s.t. ei
= W T ϕ(xi), ∀i & r j
= V T ψ (z j ), ∀ j
N(cid:2)
=
ϕ(xi)TWhei
−
λ
2
N(cid:2)
i=1
hT
ei hei
+
M(cid:2)
j=1
ψ (z j )TVhr j
i=1
−
λ
2
M(cid:2)
j=1
hT
r j hr j
−
η
2
Tr(V TW ) (cid:2) J.
(3.29)
In this case, Rtrain
RKM
, W, hr j
points of J(hei
(cid:3)
N
i=1
=
ϕ(xi)TWhei
, V ) are given by
(cid:3)
+
M
j=1
ψ (z j )TVhr j . The stationary
⎧
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
∂J
∂hei
∂J
∂W
∂J
∂hr j
∂J
∂V
= 0 ⇒ W T ϕ(xi) = λhei
, ∀i
= 0 ⇒ V = 1
η
(cid:2)
i
ϕ(xi)hT
ei
= 0 ⇒ V T ψ (z j ) = λhr j
, ∀ j
= 0 ⇒ W = 1
η
(cid:2)
j
ψ (z j )hT
r j
.
(3.30)
, hr j :
⎡
⎣
Elimination of W, V gives the solution in the conjugated dual features
hei
0
η [ϕ(xi)T ψ (z j )]
1
η [ψ (z j )T ϕ(xi)]
1
0
⎤
⎡
⎦
⎣
HT
e
T
H
r
⎤
⎡
⎦ =
⎣
⎤
⎦ (cid:16),
HT
e
T
H
r
(3.31)
with He = [he1
. . . , λs} with s ≤ N + M a specified number of nonzero eigenvalues.
. . . heN ] ∈ Rs×N, Hr = [hr1
. . . hrM ] ∈ Rs×M and (cid:16) = diag{λ
1
,
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2142
J. Suykens
The primal and dual model representations are
(P)RKM : ˆe = W T ϕ(x)
ˆr = V T ψ (z)
M
(cid:10)
(cid:11)
(D)RKM : ˆe = 1
η
ˆr = 1
η
(cid:2)
j
(cid:2)
i
hr j
ψ (z j )T ϕ(x)
hei
ϕ(xi)T ψ (z),
(3.32)
which corresponds to matrix SVD in the case of linear compatible fea-
ture maps and if an additional compatibility condition holds (Suykens,
2016).
3.6 Kernel pmf. For the case of kernel probability mass function (kernel
pmf) estimation (Suykens, 2013), we start from the objective
J =
N(cid:2)
i=1
(pi
− ϕ(xi)T w)hi
−
N(cid:2)
i=1
+
pi
η
2
wT w
(3.33)
in the unknowns w ∈ Rn f , pi
∈ R. Suykens (2013) explained how
a similar formulation is related to the probability rule in quantum measure-
ment for a complex valued model.
∈ R, and hi
The stationary points are characterized by
⎧
⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎩
∂J
∂hi
∂J
∂w
∂J
∂ pi
= 0 ⇒ pi
= wT ϕ(xi), ∀i
(cid:2)
ϕ(xi)hi
= 0 ⇒ w = 1
η
i
= 0 ⇒ hi
= 1, ∀i.
(3.34)
≥ 0
The regularization constant η can be chosen to normalize
is achieved by the choice of an appropriate kernel function), which gives
then the kernel pmf obtained in Suykens (2013). This results in the repre-
sentations
= 1 (pi
i pi
(cid:3)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2143
(P)RKM : pi
= wT ϕ(xi)
M
(cid:10)
(cid:11)
(D)RKM : pi
= 1
η
(cid:2)
j
K(x j
, xi).
4 Deep Restricted Kernel Machines
(3.35)
In this section we couple different restricted kernel machines within a deep
architecture. Several coupling configurations are possible at this point. We
illustrate deep restricted kernel machines here for an architecture consisting
of three levels. We discuss two configurations:
1. Two kernel PCA levels followed by an LS-SVM regression level
2. LS-SVM regression level followed by two kernel PCA levels
In the first architecture, the first two levels extract features that are used
within the last level for classification or regression. Related types of archi-
tectures are stacked autoencoders (Bengio, 2009), where a pretraining phase
provides a good initialization for training the deep neural network in the
fine-tuning phase. The deep RKM will consider an objective function jointly
related to the kernel PCA feature extractions and the classification or regres-
sion. We explain how the insights of the RKM kernel PCA representations
can be employed for combined supervised training and feature selection.
A difference with other methods is also that conjugated features are used
within the layered architecture.
In the second architecture, one starts with regression and then lets two
kernel PCA levels further act on the residuals. In this case connections will
be shown with deep Boltzmann machines (Salakhutdinov, 2015; Salakhut-
dinov & Hinton, 2009) when considering the special case of linear feature
maps, though for the RKMs in a nonprobabilistic setting.
4.1 Two Kernel PCA Levels Followed by Regression Level. We focus
here on a deep RKM architecture consisting of three levels:
• Level 1 consists of kernel PCA with given input data xi and is char-
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
acterized by conjugated features h(1)
.
i
• Level 2 consists of kernel PCA by taking h(1)
as input and is charac-
terized by conjugated features h(2)
.
i
• Level 3 consists of LS-SVM regression on h(2)
i with output data yi and
is characterized by conjugated features h(3)
.
i
i
2144
J. Suykens
As predictive model is taken,
ˆe(1) = W T
1
1(x),
ϕ
ˆe(2) = W T
2
2((cid:16)−1
ϕ
1
ˆe(1)),
ˆy = W T
3
3((cid:16)−1
ϕ
2
ˆe(2)) + b,
(4.1)
∈ Rn f3
∈ Rn f2
; the level 2 part ϕ
2 : Rn f1 → Rn f2 , W2
×s(1)
3 : Rn f2 → Rn f3 , W3
evaluated at point x ∈ Rd. The level 1 part has feature map ϕ
1 : Rd → Rn f1 ,
×s(2)
∈ Rn f1
W1
; and the level
ˆe(1) and (cid:16)−1
3 part ϕ
ˆe(2) (with
2
ˆe(1) ∈ Rs(1) , ˆe(2) ∈ Rs(2)
) are taken as input for levels 2 and 3, respectively,
where (cid:16)
, (cid:16)
2 denote the diagonal matrices with the corresponding eigen-
1
values. The latter is inspired by the property that for the uncoupled kernel
/λ holds on the training data according to
PCA levels, the property hi
equation 3.26, which is then further extended to the out-of-sample case in
equation 4.1.
×p. Note that (cid:16)−1
= ei
1
The objective function in the primal is
Jdeep,P
= J1
+ J2
+ J3
with
J1
= − 1
2λ
1
J2
= − 1
2λ
2
N(cid:2)
i=1
N(cid:2)
i=1
T
e(1)
i
+
e(1)
i
T
e(2)
i
+
e(2)
i
η
1
2
η
2
2
Tr(W T
1 W1)
Tr(W T
2 W2)
J3
= 1
2λ
3
N(cid:2)
i=1
T
e(3)
i
+
e(3)
i
η
3
2
Tr(W T
3 W3)
(4.2)
(4.3)
ϕ
1(xi), ˆei
(2) = W T
2
(1) = W T
1
2((cid:16)−1
(2)) + b. How-
ϕ
with ˆei
1
ever, this objective function is not directly usable for minimization due to
T
the minus sign terms − 1
e(1)
. For direct
2λ
i
minimization of an objective in the primal, we will use the following stabi-
lized version,
and − 1
2λ
3((cid:16)−1
ϕ
i=1 e(2)
i=1 e(1)
= W T
3
(1)), ˆyi
e(2)
i
(cid:3)
(cid:3)
ˆei
ˆei
N
N
T
2
i
i
1
2
Jdeep,Pstab
= J1
+ J2
+ J3
+ 1
2
cstab(J2
1
+ J2
2 ),
(4.4)
with cstab a positive constant. The role of this stabilization term for the kernel
PCA levels is explained in the appendix. While in stacked autoencoders
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2145
Figure 3: Example of a deep restricted kernel machine consisting of three levels
with kernel PCA in levels 1 and 2 and LS-SVM regression in level 3.
one has an unsupervised pretraining and a supervised fine-tuning phase
(Bengio, 2009), here we train the whole network at once.
For a characterization of the deep RKM in terms of the conjugated
(see Figure 3), we will study the stationary points
, h(2)
i
, h(3)
i
features h(1)
of
i
Jdeep
= J1
+ J2
+ J
,
3
(4.5)
, W1
where the objective Jdeep(h(1)
i
sists of the sum of the objectives of levels 1,2,3 given by J1, J2, J
, b) for the deep RKM con-
, respectively.
, h(3)
i
, h(2)
i
, W2
, W3
3
This becomes
Jdeep
= −
N(cid:2)
i=1
ϕ
1(xi)TW1h(1)
i
+
λ
1
2
N(cid:2)
i=1
T
h(1)
i
h(1)
i
+
η
1
2
Tr(W T
1 W1)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
N(cid:2)
−
i=1
ϕ
2(h(1)
i
)TW2h(2)
i
+
λ
2
2
N(cid:2)
i=1
T
h(2)
i
h(2)
i
+
η
2
2
Tr(W T
2 W2)
+
N(cid:2)
i=1
(yT
i
− ϕ
3(h(2)
i
)TW3
− bT )h(3)
i
−
λ
3
2
N(cid:2)
i=1
T
h(3)
i
h(3)
i
+
η
3
2
Tr(W T
3 W3),
with the following inner pairings at the three levels:
Level 1 :
N(cid:2)
i=1
T
e(1)
i
h(1)
i
=
N(cid:2)
i=1
ϕ
1(xi)TW1h(1)
i
,
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
(4.6)
2146
Level 2 :
Level 3 :
N(cid:2)
i=1
N(cid:2)
i=1
T
e(2)
i
h(2)
i
=
T
e(3)
i
h(3)
i
=
N(cid:2)
i=1
N(cid:2)
i=1
ϕ
2(h(1)
i
)TW2h(2)
i
,
(yT
i
− ϕ
3(h(2)
i
)TW3
− bT )h(3)
i
.
J. Suykens
(4.7)
The stationary points of Jdeep(h(1)
i
, W1
, h(2)
i
, W2
, h(3)
i
, W3
, b) are given by
= 0 ⇒ W T
1
1(xi) = λ
ϕ
1h(1)
i
−
[ϕ
2(h(1)
i
)TW2h(2)
i
], ∀i,
⎧
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
∂Jdeep
∂h(1)
i
∂Jdeep
∂W1
∂Jdeep
∂h(2)
i
∂Jdeep
∂W2
∂Jdeep
∂h(3)
i
∂Jdeep
∂W3
∂Jdeep
∂b
∂
∂h(1)
i
T ,
= 0 ⇒ W1
= 1
η
1
(cid:2)
i
ϕ
1(xi)h(1)
i
= 0 ⇒ W T
2
ϕ
2(h(1)
i
) = λ
2h(2)
i
−
= 0 ⇒ W2
= 1
η
2
(cid:2)
i
ϕ
2(h(1)
i
)h(2)
i
= 0 ⇒ yi
− W T
3
ϕ
3(h(2)
i
) − b = λ
3h(3)
i
, ∀i,
ϕ
3(h(2)
i
)h(3)
i
T ,
= 1
η
3
(cid:2)
i
h(3)
i
= 0.
= 0 ⇒ W3
= 0 ⇒
(cid:2)
i
∂
∂h(2)
i
T ,
[ϕ
3(h(2)
i
)TW3h(3)
i
], ∀i,
(4.8)
The primal and dual model representations for the deep RKM are then
M
(cid:10)
(cid:11)
ϕ
ˆe(1) = W T
1
(P)DeepRKM : ˆe(2) = W T
2
3((cid:16)−1
ϕ
1(x)
2((cid:16)−1
ϕ
ˆy = W T
3
2
ˆe(1))
1
ˆe(2)) + b
ˆe(1) = 1
η
1
(D)DeepRKM : ˆe(2) = 1
η
2
(cid:3)
ˆy = 1
η
3
, x)
(cid:3)
(cid:3)
j K1(x j
j K2(h(1)
j h(1)
j h(2)
j K3(h(2)
j
j h(3)
ˆe(1))
, (cid:16)−1
1
j
, (cid:16)−1
ˆe(2)) + b.
2
(4.9)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2147
By elimination of W1
equations in the conjugated features h(1)
, W3, one obtains the following set of nonlinear
i and b:
, h(2)
i
, h(3)
, W2
i
h(1)
j K1(x j
, xi) = λ
1h(1)
i
− 1
η
2
(cid:2)
j
j K2(h(1)
h(2)
j
, h(1)
i
) = λ
2h(2)
i
− 1
η
3
, h(1)
j )
∂K2(h(1)
i
∂h(1)
i
∂K3(h(2)
i
∂h(2)
i
j
(cid:2)
, h(2)
j )
T
h(2)
j
h(2)
i
, ∀i
j K3(h(2)
h(3)
j
, h(2)
i
) + b + λ
3h(3)
i
, ∀i
(cid:2)
j
(cid:2)
j
1
η
1
1
η
2
yi
⎧
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
(cid:2)
j
= 0.
= 1
η
3
h(3)
i
(cid:2)
i
T
h(3)
j
h(3)
i
, ∀i,
(4.10)
Solving this set of nonlinear equations is computationally expensive. How-
, K3,lin
ever, for the case of taking linear kernels K2 and K3 (and Klin
denoting linear kernels) equation 4.10 simplifies to
, K2,lin
⎧
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
Level 1 :
Level 2 :
Level 3 :
(cid:15)
(cid:15)
⎡
⎣
[K1(x j
, xi)] + 1
η
2
(cid:16)
[Klin(h(2)
j
, h(2)
i
)]
HT
1
= HT
1
(cid:16)
1
[K2,lin(h(1)
j
, h(1)
i
)] + 1
η
3
(cid:16)
[Klin(h(3)
j
, h(3)
i
)]
HT
2
= HT
2
(cid:16)
2
1
η
1
1
η
2
1
η
3
[K3,lin(h(2)
j
, h(2)
i
)] + λ
3IN 1N
T
N
1
0
⎤
⎡
⎦
⎣
⎤
⎡
⎦ =
⎣
⎤
⎦ .
YT
0
HT
3
bT
(4.11)
= [h(1)
1
…h(1)
N ], H2
Here we denote H1
N ]. One sees
that at levels 1 and 2, a data fusion is taking place between K1 and Klin and
, 1
between K2,lin and Klin, where 1
are specifying the relative weight
η
η
3
given to each of these kernels. In this way, one can choose for emphasizing
or deemphasizing the levels with respect to each other.
N ], H3
, 1
η
2
1
= [h(2)
1
…h(2)
= [h(3)
1
…h(3)
4.2 Regression Level Followed by Two Kernel PCA Levels. In this
case, we consider a deep RKM architecture with the following three
levels:
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2148
J. Suykens
• Level 1 consists of LS-SVM regression with given input data xi and
output data yi and is characterized by conjugated features h(1)
.
i
• Level 2 consists of kernel PCA by taking h(1)
terized by conjugated features h(2)
.
i
• Level 3 consists of kernel PCA by taking h(2)
as input and is charac-
as input and is charac-
i
i
terized by conjugated features h(3)
i
.
We look then for the stationary points of
Jdeep
= J
1
+ J2
+ J3
(4.12)
, W1
where the objective Jdeep(h(1)
i
sists of the sum of the objectives of levels 1, 2, 3 given by J
tively. Deep RKM consists of coupling the RKMs.
, W3) for the deep RKM con-
, J2, J3, respec-
, b, h(2)
i
, h(3)
i
, W2
1
This becomes
Jdeep
=
N(cid:2)
i=1
(yT
i
− ϕ
1(xi)TW1
− bT )h(1)
i
−
λ
1
2
N(cid:2)
i=1
T
h(1)
i
+
h(1)
i
η
1
2
Tr(W T
1 W1)
N(cid:2)
−
i=1
N(cid:2)
−
i=1
ϕ
2(h(1)
i
)TW2h(2)
i
+
ϕ
3(h(2)
i
)TW3h(3)
i
+
λ
2
2
λ
3
2
N(cid:2)
i=1
N(cid:2)
i=1
T
h(2)
i
h(2)
i
+
T
h(3)
i
h(3)
i
+
η
2
2
η
3
2
Tr(W T
2 W2)
Tr(W T
3 W3),
(4.13)
×p, the level 2 part ϕ
3 : Rs(2) → Rn f3 , W3
1 : Rd → Rn f1 , W1
∈ Rn f1
, and the level 3 part ϕ
with ϕ
×s(2)
Rn f2
. Note that in
Jdeep, the sum of the three inner pairing terms is similar to the energy in
deep Boltzmann machines (Salakhutdinov, 2015; Salakhutdinov & Hinton,
2009) for the particular case of linear feature maps ϕ
3 and symmetric
1
interaction terms. For the special case of linear feature maps, one has
2 : Rp → Rn f2 , W2
∈ Rn f3
×s(3)
, ϕ
, ϕ
∈
2
Udeep
= −v T ˜W1h(1) − h(1)T
W2h(2) − h(2)T
W3h(3),
(4.14)
which takes the same form as equation 29 in Salakhutdinov (2015), with ˜W1
defined in the sense of equation 3.1 in this letter. The “U” in Udeep refers
to the fact that the deep RKM is unrestricted after coupling because of the
hidden-to-hidden connections between layers 1 and 2 and between layers
2 and 3, while the uncoupled RKMs are restricted.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2149
The stationary points of Jdeep(h(1)
i
⎧
, W1
, b, h(2)
i
= 0 ⇒ yi
− W T
1
1(xi) − b = λ
ϕ
1h(1)
i
+
ϕ
2(h(1)
i
)TW2h(2)
i
, h(3)
i
(cid:25)
, W2
∂
∂h(1)
i
, W3) are given by
(cid:26)
, ∀i
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
∂Jdeep
∂h(1)
i
∂Jdeep
∂W1
∂Jdeep
∂b
∂Jdeep
∂h(2)
i
∂Jdeep
∂W2
∂Jdeep
∂h(3)
i
∂Jdeep
∂W3
T
ϕ
1(xi)h(1)
i
= 1
η
1
(cid:2)
i
h(1)
i
= 0
= 0 ⇒ W1
= 0 ⇒
(cid:2)
i
= 0 ⇒ W T
2
ϕ
2(h(1)
i
) = λ
2h(2)
i
−
= 0 ⇒ W2
= 1
η
2
(cid:2)
i
ϕ
2(h(1)
i
)h(2)
i
T
= 0 ⇒ W T
3
ϕ
3(h(2)
i
) = λ
3h(3)
i
, ∀i
= 0 ⇒ W3
= 1
η
3
(cid:2)
i
ϕ
3(h(2)
i
)h(3)
i
T .
(cid:25)
ϕ
3(h(2)
i
)TW3h(3)
i
(cid:26)
, ∀i
(4.15)
∂
∂h(2)
i
As predictive model for this deep RKM case, we have
M
(cid:10)
(cid:11)
(P)DeepRKM : ˆy = W T
1
1(x) + b
ϕ
(D)DeepRKM : ˆy = 1
η
1
(cid:2)
j
h(1)
j K1(x j
, x) + b.
(4.16)
, W2
, W3, one obtains the following set of nonlinear
By elimination of W1
equations in the conjugated features h(1)
⎧
i
, h(2)
i
, h(3)
i
, and b:
h(1)
j K1(x j
, xi) + b + λ
1h(1)
i
+ 1
η
2
(cid:2)
j
, h(1)
j )
∂K2(h(1)
i
∂h(1)
i
T
h(2)
j
h(2)
i
, ∀i
yi
(cid:2)
= 1
η
1
h(1)
i
(cid:2)
j
= 0
i
1
η
2
1
η
3
j K2(h(1)
h(2)
j
, h(1)
i
) = λ
2h(2)
i
− 1
η
3
(cid:2)
j
, h(2)
j )
∂K3(h(2)
i
∂h(2)
i
T
h(3)
j
h(3)
i
, ∀i
j K3(h(2)
h(3)
j
, h(2)
i
) = λ
3h(3)
i
, ∀i.
(cid:2)
j
(cid:2)
j
(4.17)
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2150
J. Suykens
When taking linear kernels K2 and K3, the set of nonlinear equations sim-
plifies to
[K1(x j
, xi)] + 1
η
2
[Klin(h(2)
j
, h(2)
i
)] + λ
1IN 1N
⎧
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
Level 1 :
Level 2 :
Level 3 :
⎡
⎣
=
(cid:15)
(cid:15)
1
η
1
(cid:9)
1
η
2
1
η
3
T
N
1
(cid:10)
YT
0
[K2,lin(h(1)
j
, h(1)
i
)] + 1
η
(cid:16)
3
(cid:9)
⎤
⎦
(cid:10)
HT
1
bT
0
(cid:16)
[Klin(h(3)
j
, h(3)
i
)]
HT
2
= HT
2
(cid:16)
2
[K3,lin(h(2)
j
, h(2)
i
)]
HT
3
= HT
3
(cid:16)
3
(4.18)
with a similar data fusion interpretation as explained in the previous sub-
section.
5 Algorithms for Deep RKM
The characterization of the stationary points for the objective functions in
the different deep RKM models typically leads to solving large sets of non-
linear equations in the unknown variables, especially for large given data
sets. Therefore, in this section, we outline a number of approaches and al-
gorithms for working with the kernel-based models (in either the primal or
the dual). We also outline algorithms for training deep feedforward neural
networks in a parametric way in the primal within the deep RKM setting.
The algorithms proposed in sections 5.2 and 5.3 are applicable also to large
data sets.
5.1 Levelwise Solving for Kernel-Based Models. For the case of linear
kernels in levels 2 and 3 in equation 4.11 and 4.18, we propose a heuristic
algorithm that consists of level-wise solving linear systems and eigenvalue
decompositions by alternating fixing different unknown variables.
…xN] ∈ Rd×N, Y = [y1
For equation 4.18, in order to solve level 1 as a linear system, one needs
…yN] ∈ Rp×N, but also
the input/output data X = [x1
the knowledge of h(2)
. Therefore, an initialization phase is required. One
can initialize h(2)
as zero or at random at level 1, obtain H1, and propagate
it to level 2. At level 2, after initializing H3, one finds H2, which is then
propagated to level 3, where one computes H3. After this forward phase,
one can go backward from level 3 to level 1 in a backward phase.
i
i
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2151
Schematically this gives the following heuristic algorithm:
Forward phase (level 1 → level 3)
, H3 initialization
H2
Level 1 : H1 := f1(X, Y, H2) (for equation 4.18) or
H1 := f1(X, H2) (for equation 4.11)
Level 2 : H2 := f2(H1
, H3)
Level 3 : H3 := f3(H2) (for equation 4.18) or
H3 := f1(Y, H2) (for equation 4.11)
Backward phase (level 3 → level 1)
Level 2 : H2 := f2(H1
, H3)
Level 1 : H1 := f1(X, Y, H2) (for equation 4.18) or
H1 := f1(X, H2) (for equation 4.11).
One can repeat the forward and backward phases a number of times,
without the initialization step. Alternatively, one could also apply an algo-
rithm with forward-only phases, which can then be applied a number of
times after each other.
5.2 Deep Reduced Set Kernel-Based Models with Estimation in Pri-
, ˜W3 are made to
mal. In the following approach, approximations ˜W1
W1
, W3:
, ˜W2
, W2
W1
= 1
η
1
˜W2
= 1
η
2
˜W3
= 1
η
3
N(cid:2)
i=1
M(cid:2)
j=1
M(cid:2)
j=1
ϕ
1(xi)h(1)
i
T (cid:12) ˜W1
= 1
η
1
M(cid:2)
j=1
ϕ
1( ˜x j )˜h(1)
j
T ,
ϕ
2(˜h(1)
j )˜h(2)
j
ϕ
2(˜h(2)
j )˜h(3)
j
T ,
T ,
(5.2)
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
}N
where a subset of the training data set { ˜x j
i=1 is considered with
M (cid:14) N. This approximation corresponds to a reduced-set technique in ker-
nel methods (Schölkopf et al., 1999). In order to have a good representation
of the data distribution, one can take a fixed-size algorithm with subset se-
lection according to quadratic Renyi entropy (Suykens et al., 2002), or a ran-
dom subset as a simpler scheme.
⊂ {xi
}M
j=1
(5.1)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
2152
J. Suykens
We proceed then with a primal estimation scheme by taking stabiliza-
tion terms for the kernel PCA levels. In the case of two kernel PCA levels
followed by LS-SVM regression, we minimize the following objective:
M(cid:2)
j=1
M(cid:2)
j=1
min
,˜h(3)
j
,b,(cid:16)
,(cid:16)
2
1
˜h(1)
j
,˜h(2)
j
Jdeep,Pstab
= − 1
2
e(1)
j
T (cid:16)−1
1 e(1)
j
+
+
η
1
2
η
2
2
Tr( ˜W T
1
˜W1) − 1
2
e(2)
j
T (cid:16)−1
2 e(2)
j
Tr( ˜W T
2
˜W2) + 1
2λ
3
M(cid:2)
j=1
T
e(3)
j
+
e(3)
j
η
3
2
Tr( ˜W T
3
˜W3)
+ 1
2
cstab
+ 1
2
cstab
⎛
⎝− 1
2
⎛
⎝− 1
2
M(cid:2)
j=1
M(cid:2)
j=1
e(1)
j
e(2)
j
The predictive model then becomes:
⎞
2
Tr( ˜W T
1
⎠
˜W1)
⎞
2
Tr( ˜W T
2
⎠
˜W2)
.
(5.3)
+
+
η
1
2
η
2
2
T (cid:16)−1
1 e(1)
j
T (cid:16)−1
2 e(2)
j
ˆe(1) = 1
η
1
ˆe(2) = 1
η
2
ˆy = 1
η
3
M(cid:2)
j=1
M(cid:2)
j=1
M(cid:2)
j=1
˜h(1)
j K1( ˜x j
, x),
j K2(˜h(1)
˜h(2)
j
, (cid:16)−1
1
ˆe(1)),
j K3(˜h(2)
˜h(3)
j
, (cid:16)−1
2
ˆe(2)) + b.
(5.4)
The number of unknowns in this case is M × (s(1) + s(2) + p) + s(1) + s(2) + 1.
Alternatively, instead of the regularization terms Tr( ˜W T
˜Wl ), one could also
l
take Tr( ˜H(l) ˜H(l)T ) where ˜H(l) = [˜h(l)
…˜h(l)
M ] for l = 1, 2, 3.
1
One can also maximize Tr((cid:16)
1) + Tr((cid:16)
1) +
Tr((cid:16)
2)) to the objective, equation 5.3, with c0 a positive constant. Note that
the components of ˜H(1), ˜H(2) in levels 1 and 2 do not possess an orthogonal-
ity property unless this is imposed as additional constraints to the objective
function.
2) by adding a term −c0(Tr((cid:16)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2153
5.3 Training Deep Feedforward Neural Networks within the Deep
RKM Framework. For training of deep feedforward neural networks
within this deep RKM setting, one minimizes Jdeep,Pstab in the unknown in-
terconnection matrices of the different levels. In case one takes one hidden
layer per level, the following objective is minimized
min
,β
1,2,3
,U1,2,3
W1,2,3
,b,(cid:16)
1
,(cid:16)
2
Jdeep,Pstab
= − 1
2
M(cid:2)
j=1
e(1)
j
T (cid:16)−1
1 e(1)
j
+
+
η
1
2
η
2
2
Tr(W T
1 W1) − 1
2
M(cid:2)
j=1
e(2)
j
T (cid:16)−1
2 e(2)
j
Tr(W T
2 W2) + 1
2λ
3
M(cid:2)
j=1
T
e(3)
j
+
e(3)
j
η
3
2
Tr(W T
3 W3)
+ 1
2
cstab
+ 1
2
cstab
⎛
⎝− 1
2
⎛
⎝− 1
2
M(cid:2)
j=1
M(cid:2)
j=1
e(1)
j
e(2)
j
T (cid:16)−1
1 e(1)
j
T (cid:16)−1
2 e(2)
j
for the model
ˆe(1) = W T
1
ˆe(2) = W T
2
ˆy = W T
3
σ (U1x + β
(cid:16)−1
σ (U2
1
(cid:16)−1
σ (U3
2
1),
ˆe(1) + β
ˆe(2) + β
2),
3) + b.
(cid:23)
W T
1 W1
Tr
(cid:23)
W T
2 W2
Tr
+
+
η
1
2
η
2
2
(cid:24)
⎞
2
⎠
(cid:24)
⎞
2
⎠
(5.5)
(5.6)
ˆe(2), which results
σ (U2
2),
Alternatively, one
(cid:16)−1
2
σ ((cid:16)−1
W T
2
1
of unknowns
s(2) + 1) + s(1) + s(2) + 1, where nh1,2,3 denote the number of hidden units.
can take additional nonlinearities on (cid:16)−1
ˆe(1),
1
ˆe(1) = W T
σ (U1x + β
ˆe(2) =
1
3) + b. The number
ˆe(2)) + β
× (p +
× (s(2) + s(1) + 1) + nh3
σ ((cid:16)−1
2
× (s(1) + d + 1) + nh2
in the model
ˆy = W T
3
ˆe(1)) + β
is nh1
σ (U3
1),
In order to further reduce the number of unknowns, and partially in-
spired by convolutional operations in convolutional neural networks (Le-
Cun, Bottou, Bengio, & Haffner, 1998), we also consider the case where U1
×n2 , the number of un-
and U2 are Toeplitz matrices. For a matrix U ∈ Rn1
knowns is reduced then from n1n2 to n1
+ n2
− 1.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2154
J. Suykens
Table 1: Comparison of Test Error (%) of Models M
1 and M
2 on UCI Data Sets.
pid
bld
ion
adu
M
M
M
M
M
M
M
1,a
1,b
1,c
2,a
2,b
2,b,T
2,c
bestbmark
16.99 [17.46(0.65)]
19.53 [20.02(1.53)] 26.09 [30.96(3.34)]
18.75 [19.39(0.89)] 25.22 [31.48(4.11)]
17.08 [17.48(0.56)]
17.83 [21.21(4.78)]
21.88 [24.73(5.91)] 28.69 [32.39(3.48)]
21.09 [20.20(1.51)] 27.83 [28.86(2.83)]
15.07 [15.15(0.15)]
18.75 [20.33(2.75)] 28.69 [28.38(2.80)] 10.23 [6.92(3.69)] 14.91 [15.08 (0.15)]
15.71 [15.97(0.07)]
19.03 [19.16(1.10)] 26.08 [27.74(9.40)]
15.21 [15.19(0.08)]
24.61 [22.34(1.95)] 32.17 [27.61(3.69)]
14.4(0.3)
0 [0.68(1.60)]
0 [5.38(12.0)]
0 [8.21(6.07)]
1.71 [5.68(2.22)]
6.83 [6.50(8.31)]
3.42 [9.66(6.74)]
4.0(2.1)
22.7(2.2)
29.6(3.7)
Notes: Shown first is the test error corresponding to the selected model with minimal
validation error from the different random initializations. Between brackets, the mean and
standard deviation of the test errors related to all initializations are shown. The lowest test
error is in bold.
6 Numerical Examples
6.1 Two Kernel PCA Levels Followed by Regression Level: Examples.
We define the following models and methods for comparison:
• [M
1,a]: with additional
1]: Deep reduced set kernel-based models (with RBF kernel) with es-
timation in the primal according to equation 5.3 with the following
choices:
[M
term −c0(Tr((cid:16)
Tr( ˜H(l) ˜H(l)T ) (l = 1, 2, 3) regularization terms
[M
1,b]: without additional term −c0(Tr((cid:16)
[M
j=1 e(3)
1,c]: with objective function 1
2λ
j
that is, only the level 3 regression objective.
2]: Deep feedforward neural networks with estimation in the primal
2,b],
2,b,T ] Toeplitz matrices are
according to equation 5.5 with the same choices in [M
[M
taken for the U matrices in all levels, except for the last level.
2))
+ η
2 Tr(W T
1) + Tr((cid:16)
T
e(3)
j
1]. In the model [M
2,c] as above in [M
1) + Tr((cid:16)
2,a], [M
3 W3),
and
2))
(cid:3)
M
3
3
• [M
= 64, Ntest
We test and compare the proposed algorithms on a number of UCI data
sets: Pima indians diabetes (pid) (d = 8, p = 1, N = 400, Nval
=
256), Bupa liver disorder (bld) (d = 6, p = 1, N = 170, Nval
=
(d = 34, p = 1, N =
Johns Hopkins University ionosphere (ion)
115),
170, Nval
=
11000, Ntest
= 12222) data sets, where the number of inputs (d), outputs
(p), training (N), validation (Nval), and test data (Ntest) are indicated. These
numbers correspond to previous benchmarking studies in Van Gestel et al.
(2004). In Table 1, bestbmark indicates the best result obtained in the bench-
marking study of Van Gestel et al. (2004) from different classifiers, includ-
ing SVM and LS-SVM classifiers with linear, polynomial, and RBF kernel;
= 117), adult (adu) (d = 14, p = 1, N = 22000, Nval
= 112, Ntest
= 60, Ntest
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2155
linear and quadratic discriminant analysis; decision tree algorithm C4.5;
logistic regression; one-rule classifier; instance-based learners; and Naive
Bayes.
= 10 (M
= 3, 3, 3; λ
3
1,a,b);
−2;
2,b,T : s(1),(2),(3) = 4, 4, p;
= 10
3
c0
cstab
nh1,2,3
The tuning parameters, selected at the validation level, are
−2; cstab
1: s(1),(2),(3) = 2, 2, p; M = 20; λ
• pid: For M
1,a). For M
= 0.1 (M
2,a,b); c0
= 10
= 10
2: s(1),(2),(3) = 4, 2, p; nh1,2,3
= 0.1 (M
−2; cstab
= 1.
1: s(1),(2),(3) = 3, 2, p; M = 20; λ
1,a). For M
= 1000 (M
2,a,b); c0
= 10
= 3, 3, 5; λ
3
2: s(1),(2),(3) = 4, 2, p; nh1,2,3
= 0.1 (M
• bld: For M
= 0.1 (M
= 10 (M
= 3, 3, 3; λ
3
2,a). For M
= 10−3; cstab
2,a) For M
c0
cstab
nh1,2,3
3
• ion: For M
= 0.1 (M
= 1000.
−3; cstab
1: s(1),(2),(3) = 3, 2, p; M = 30; λ
1,a). For M
= 1000 (M
2,a,b); c0
= 10−3; cstab
= 3, 3, 3; λ
3
2: s(1),(2),(3) = 3, 3, p; nh1,2,3
= 0.1 (M
2,a) For M
= 1000.
= 10
3
c0
cstab
nh1,2,3
= 100 (M
= 3, 3, 5; λ
3
1,a,b);
= 10−3;
2,b,T : s(1),(2),(3) = 4, 2, p;
−3; cstab
= 100 (M
= 3, 3, 3; λ
3
1,a,b);
−3;
2,b,T : s(1),(2),(3) = 3, 3, p;
= 10
= 10
• adu: For M
−3; cstab
−4
= 10
= 10, 5, 3;
2,b,T : s(1),(2),(3) =
1: s(1),(2),(3) = 20, 10, p; M = 15; λ
= 0.1 (M
3
(M
1,a,b); c0
= 10
λ
3
5, 2, p; nh1,2,3
−7; cstab
1,a). For M
2,a,b); c0
= 10
= 0.1 (M
= 10, 5, 3; λ
3
2: s(1),(2),(3) = 5, 2, p; nh1,2,3
= 0.1 (M
2,a). For M
1,2,3
3
−7; cstab
= 103, η
1,2
1 and M
= 10, η
The other tuning parameters were selected as η
= 1, 1, 1.
= 1, 1, 1 for pid, bld,
= 10−3 for adu, unless specified differently above. In
ion and η
2 models, the ˜H(1),(2),(3) matrices and the interconnection ma-
the M
trices were initialized at random according to a normal distribution with
zero mean and standard deviation 0.1 (100, 20, 10, and 3 initializations for
pid, bld, ion, adu, respectively), the diagonal matrices (cid:16)
1,2 by the identity
matrix, and σ
1. For the training, a
quasi-Newton method was used with fminunc in Matlab.
= 1 for the RBF kernel models in M
1,2,3
1,2,3
The following general observations from the experiments are shown in
Table 1:
• Having the additional terms with kernel PCA objectives in levels 1
and 2, as opposed to the level 3 objective only, gives improved results
on all tried data sets.
• The best selected value for cstab varies among the data sets. In case
this value is large, the value of the objective function terms related to
the kernel PCA parts is close to zero.
• The use of Toeplitz matrices for the U matrices in the deep feedfor-
ward neural networks leads to competitive performance results and
greatly reduces the number of unknowns.
Figure 4 illustrates the evolution of the objective function (in logarithmic
scale) during training on the ion data set, for different values of cstab and in
comparison with a level 3 objective function only.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2156
J. Suykens
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Figure 4: Illustration of the evolution of the objective function (logarithmic
scale) during training on the ion data set. Shown are training curves for the
model M
2,a for different choices of cstab (equal to 1, 10, 100 in blue, red, and ma-
genta, respectively) in comparison with M
2,c (level 3 objective only, in black),
for the same initialization.
6.2 Regression Level Followed by Two Kernel PCA Levels: Examples.
6.2.1 Regression Example on Synthetic Data Set. In this example, we com-
pare a basic LS-SVM regression with deep RKM consisting of three levels
, x j ) =
with LS-SVM + KPCA + KPCA, where a gaussian RBF kernel K(xi
exp(−(cid:6)xi
/σ 2) is used in the LS-SVM level and linear kernels in the
KPCA levels. Training, validation, and test data sets are generated from the
following true underlying function,
− x j
(cid:6)2
2
f (x) = sin(0.3x) + cos(0.5x) + sin(2x),
(6.1)
where zero mean gaussian noise with standard deviation 0.1, 0.5, 1, and
2 is added to the function values for the different data sets. In this ex-
ample, we have a single input and single output d = p = 1. Training data
(with noise) are generated in the interval [−10, 10] with steps 0.1, valida-
tion data (with noise) in [−9.77, 9.87] with steps 0.11, and test data (noise-
less) in [−9.99, 9.99] with steps 0.07. In the experiments, 100 realizations for
the noise are made, for which the mean and standard deviation of the re-
sults are shown in Table 2. The tuning parameters are selected based on the
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2157
Table 2: Comparison between Basic LS-SVM Regression and Deep RKM on the
Synthetic Data Set, for Different Noise Levels.
Noise
Basic
Deep (1+1)
Deep (7+2)
0.1
0.5
1
2
−4
0.0019 ± 4.3 10
0.0403 ± 0.0098
0.1037 ± 0.0289
0.3368 ± 0.0992
−4
0.0018 ± 4.2 10
0.0374 ± 0.0403
0.0934 ± 0.0269
0.2902 ± 0.0875
−4
0.0019 ± 4.4 10
0.0397 ± 0.0089
0.0994 ± 0.0301
0.3080 ± 0.0954
1
2, η
1, η
3 (η
validation set, which are σ , γ for the RBF kernel in the basic LS-SVM model
and σ , λ
= 1 has been chosen) for the complete deep RKM. The
number of forward-backward passes in the deep RKM is chosen equal to
10. For deep RKM, we take the following two choices for the number of
components s(2), s(3) in the KPCA levels: 1 and 1, 7 and 2 for level 2 and
level 3, respectively. For deep RKM, the optimal values for 1
are 105,
η
which means that the level 2 and 3 kernel PCA levels receive higher weight
in the kernel fusion terms. As seen in Table 2, deep RKM improves over the
basic LS-SVM regression in this example. The optimal values for (σ, λ
1) are
(1, 0.001) for noise level 0.1 and (1, 0.01) for noise level 0.5, (1, 0.4) for noise
level 1, and (1, 1) for noise level 2.
, 1
η
3
2
6.2.2 Multiclass Example: USPS. In this example, the USPS handwritten
digits data set is taken from http://www.cs.nyu.edu/∼roweis/data.html.
It contains 8-bit grayscale images of digits 0 through 9 with 1100 examples
of each class. These data are used without additional scaling or prepro-
cessing. We compare a basic LS-SVM model (with primal representation
×p, b ∈ Rp with p = 10, that is, one output per
ˆy = W T ϕ(x) + b and W ∈ Rn f
class, and RBF kernel) with deep RKM consisting of LS-SVM + KPCA +
KPCA with RBF kernel in levels 1 and linear kernels in levels 2 and 3 (with
number of selected components s(2), s(3) in levels 2 and 3). In level 1 of deep
RKM, the same type of model is taken as in the basic LS-SVM model. In
this way, we intend to study the effect of the two additional KPCA layers.
The dimensionality of the input data is d = 256. Two training set sizes were
taken (N = 2000 and N = 4000 data points, that is, 200 and 400 examples per
class), 2000 data points (200 per class) for validation, and 5000 data (500 per
class) for testing. The tuning parameters are selected based on the valida-
tion set: σ , γ for the RBF kernel in the basic LS-SVM model and σ , λ
2, η
3
(η
= 1 has been chosen) for deep RKM. The number of forward-backward
passes in the deep RKM is chosen equal to 2. The results are shown for
the case of 2000 training data in Figure 5, showing the results on training,
validation, and test data with the predicted class labels and the predicted
output values for the different classes. For the case N = 2000, the se-
lected values were σ 2 = 45, s(2) = 10, s(3) = 1, λ
= 106. The
1
−6, 1
η
= 10
1, η
1
= 1
η
3
2
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2158
J. Suykens
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Figure 5: Deep RKM on USPS handwritten digits data set. Left top: Training
data results (2000 data). Left Bottom: Validation data results (2000 data). Right
top: Test data results (5000 data). Right bottom: Output values for the 10 differ-
ent classes on the validation set.
misclassification error on the test data set is 3.18% for the deep RKM and
3.26% for the basic LS-SVM (with σ 2 = 45 and γ = 1/λ
1). For the case
−6,
N = 4000, the selected values were σ 2 = 45, s(2) = 1, s(3) = 1, λ
1
= 106. The misclassification error on the test data set is 2.12% for the
1
η
2
deep RKM and 2.14% for the basic LS-SVM (with σ 2 = 45 and γ = 1/λ
1).
This illustrates that for deep RKM, levels 2 and 3 are given high relative
importance through the selection of large 1
η
values.
= 1
η
3
= 10
, 1
η
2
3
6.2.3 Multiclass Example: MNIST. The data set, which is used without
additional scaling or preprocessing, is taken from http://www.cs.nyu.
edu/∼roweis/data.html. The dimensionality of the input data is d = 784
(images of size 28 × 28 for each of the 10 classes). In this case, we take
an ensemble approach where the training set (N = 50,000 with 10 classes)
has been partitioned into small nonoverlapping subsets of size 50 (5 data
points per class). The choice for this subset size resulted from taking the
last 10,000 points of this data set as validation data with the use of 40,000
data for training in that case. Other tuning parameters were selected in a
similar way. The 1000 resulting submodels have been linearly combined
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2159
2
1
−6, η
= 10
= 10
= 1
η
3
= 1, 1
η
after applying the tanh function to their outputs. The linear combination is
determined by solving an overdetermined linear system with ridge regres-
sion, following a similar approach as discussed in section 6.4 of Suykens
et al. (2002). For the submodels, deep RKMs consisting of LSSVM + KPCA
+ KPCA with RBF kernel in levels 1 and linear kernels in levels 2 and
3, are taken. The selected tuning parameters are σ 2 = 49, s(2) = 1, s(3) = 1,
−6. The number of forward-backward passes
λ
1
in the deep RKM is chosen equal to 2. The training data set has been ex-
tended with another 50,000 training data consisting of the same data points
but corrupted with noise (random perturbations with zero mean and stan-
dard deviation 0.5, truncated to the range [0,1]), which is related to the
method with random perturbations in Kurakin, Goodfellow, and Bengio
(2016). The misclassification error on the test data set (10,000 data points) is
1.28%, which is comparable in performance to deep belief networks (1.2%)
and in between the reported test performances of deep Boltzmann machines
(0.95, 1.01%) and SVM with gaussian kernel (1.4%) (Salakhutdinov, 2015)
(see http://yann.lecun.com/exdb/mnist/ for an overview and compari-
son of performances obtained by different methods).
7 Conclusion
In this letter, a theory of deep restricted kernel machines has been proposed.
It is obtained by introducing a notion of conjugate feature duality where the
conjugate features correspond to hidden features. Existing kernel machines
such as least squares support vector machines for classification and regres-
sion, kernel PCA, matrix SVD, and Parzen-type models are considered as
building blocks within a deep RKM and are characterized through the con-
jugate feature duality. By means of the inner pairing, one achieves a link
with the energy expression of restricted Boltzmann machines, though with
continuous variables in a nonprobabilistic setting. It also provides an inter-
pretation of visible and hidden units. Therefore, this letter connects, on the
one hand, to deep learning methods and, on the other hand, to least squares
support vector machines and kernel methods. In this way, the insights and
foundations achieved in these different research areas could possibly mu-
tually reinforce each other in the future. Much future work is possible in
different directions, including efficient methods and implementations for
big data, the extension to other loss functions and regularization schemes,
treating multimodal data, different coupling schemes, and models for clus-
tering and semisupervised learning.
Appendix: Stabilization Term for Kernel PCA
We explain here the role of the stabilization term in kernel PCA as a mod-
ification to equation 2.15. In this case, the objective function in the primal
is
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2160
J. Suykens
minw,ei
1
2
subject to ei
(cid:2)
γ
wT w −
e2
i
+ cstab
2
wT w −
1
2
γ
2
(cid:2)
2
e2
i
i
(A.1)
i
= wT ϕ(xi), i = 1, . . . , N.
(cid:31)
2
(cid:3)
Denoting J0
(cid:3)
i(ei
wT w − γ
2
− wT ϕ(xi)), from which it follows that
= 1
2
i e2
i
α
i
the Lagrangian is L = J0
+ cstab
2 J2
0
+
⎧
⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎩
∂L
∂w
∂L
∂ei
∂L
∂α
i
= 0 ⇒ (1 + cstabJ0)w =
(cid:3)
α
i
ϕ(xi)
i
= 0 ⇒ (1 + cstabJ0)γ ei
= α
i
= 0 ⇒ ei
= wT ϕ(xi).
Assuming that 1 + cstabJ0
(1/γ )α
i with K ji
for the original formulation (corresponding to cstab
=
(cid:16)= 0, elimination of w and ei yields
= ϕ(x j )T ϕ(xi), which is the solution that is also obtained
= 0).
jK ji
α
j
(cid:3)
Acknowledgments
The research leading to these results has received funding from the Eu-
ropean Research Council (FP7/2007-2013) / ERC AdG A-DATADRIVE-B
(290923) under the European Union’s Seventh Framework Programme.
This letter reflects only my views; the EU is not liable for any use that may
be made of the contained information; Research Council KUL: GOA/10/09
MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/Postdoc grants; Flem-
ish government: FWO: PhD/postdoc grants, projects: G0A4917N (deep
restricted kernel machines), G.0377.12 (Structured systems), G.088114N
(tensor-based data similarity); IWT: PhD/postdoc grants, projects: SBO
POM (100031); iMinds Medical Information Technologies SBO 2014; Bel-
gian Federal Science Policy Office: IUAP P7/19 (DYSCO, dynamical sys-
tems, control and optimization, 2012–2017).
References
Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltz-
mann machines. Cognitive Science, 9, 147–169.
Alzate, C., & Suykens, J. A. K. (2010). Multiway spectral clustering with out-of-
sample extensions through weighted kernel PCA. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 32(2), 335–347.
Bengio, Y. (2009). Learning deep architectures for AI. Boston: Now.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and
new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence,
35, 1798–1828.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2161
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal
margin classifiers. In Proceedings of the COLT Fifth Annual Workshop on Computa-
tional Learning Theory (pp. 144–152). New York: ACM.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge
University Press.
Chen, L.-C., Schwing, A. G., Yuille, A. L., & Urtasun, R. (2015). Learning deep
structured models. In Proceedings of the 32nd International Conference on Machine
Learning.
Cho, Y., & Saul, L. K. (2009). Kernel methods for deep learning. In Y. Bengio, D.
Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in
neural information processing systems, 22. Red Hook, NY: Curran.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–
297.
Damianou, A. C., & Lawrence, N. D. (2013). Deep gaussian processes. PMLR, 31,
207–215.
De Wilde, Ph. (1993). Class of Hamiltonian neural networks. Phys. Rev. E, 47, 1392–
1396.
Fischer, A., & Igel, C. (2014). Training restricted Boltzmann machines: An introduc-
tion. Pattern Recognition, 47, 25–39.
Goldstein, H., Poole, C., & Safko, J. (2002). Classical mechanics. Reading, MA:
Addison-Wesley.
Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore: Johns Hopkins
University Press.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT
Press.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural compu-
tation. Reading, MA: Addison-Wesley.
Hinton, G. E. (2005). What kind of graphical model is the brain? In Proc. 19th In-
ternational Joint Conference on Artificial Intelligence (pp. 1765–1775). San Francisco:
Morgan Kaufmann.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep
belief nets. Neural Computation, 18, 1527–1554.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective
computational abilities. Proceedings of the National Academy of Sciences USA, 79,
2554–2558.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2015). Deep structured
output learning for unconstrained text recognition. In Proceedings of the Interna-
tional Conference on Learning Representations.
Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial machine learning at scale.
arXiv:1611.01236
Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted
Boltzmann machines. In Proceedings of the 25th International Conference on Machine
Learning. New York: ACM.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learn-
the IEEE, 86, 2278–
ing applied to document recognition. Proceedings of
2324.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2162
J. Suykens
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F.-J. (2006). A tutorial on
energy-based learning. In G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, & B.
Taskar (Eds.), Predicting structured data. Cambridge, MA: MIT Press.
Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief net-
works for scalable unsupervised learning of hierarchical representations. In Pro-
ceedings of the 26th Annual International Conference on Machine Learning (pp. 609–
616). New York: ACM.
Mairal, J., Koniusz, P., Harchaoui, Z., & Schmid, C. (2014). Convolutional kernel net-
works. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Wein-
berger (Eds.), Advances in neural information processing systems (NIPS).
Mall, R., Langone, R., & Suykens, J. A. K. (2014). Multilevel hierarchical kernel spec-
tral clustering for real-life large scale complex networks. PLOS One, 9(6), e99966.
Petersen, K. B., & Pedersen, M. S. (2012). The matrix cookbook. Lyngby: Technical Uni-
versity of Denmark.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings
of the IEEE, 78(9), 1481–1497.
Rasmussen, C. E., & Williams, C. (2006). Gaussian processes for machine learning. Cam-
bridge, MA: MIT Press.
Rockafellar, R. T. (1987). Conjugate duality and optimization. Philadelphia: SIAM.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by
back-propagating errors. Nature, 323, 533–536.
Salakhutdinov, R. (2015). Learning deep generative models. Annu. Rev. Stat. Appl.,
2, 361–385.
Salakhutdinov, R., & Hinton, G. E. (2007). Using deep belief nets to learn covariance
kernels for gaussian processes. In J. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.),
Advances in neural information processing systems, 20. Red Hook, NY: Curran.
Salakhutdinov, R., & Hinton, G. E. (2009). Deep Boltzmann machines. PMLR, 5, 448–
455.
Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algo-
rithm in dual variables. In Proc. of the 15th Int. Conf. on Machine Learning (pp. 515–
521). San Francisco: Morgan Kaufmann.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Net-
works, 61, 85–117.
Schölkopf, B., Mika, S., Burges, C. C., Knirsch, P., Müller, K. R., Rätsch, G., & Smola,
A. J. (1999). Input space versus feature space in kernel-based methods. IEEE Trans-
actions on Neural Networks, 10(5), 1000–1017.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Schwing, A. G., & Urtasun, R. (2015). Fully connected deep structured networks.
arXiv:1503.02351
Smale, S., Rosasco, L., Bouvrie, J., Caponnetto, A., & Poggio, T. (2010). Mathematics
of the neural response. Foundations of Computational Mathematics, 10(1), 67–91.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of
harmony theory. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed
processing: Explorations in the microstructure of cognition, vol. 1: Foundations.
New York: McGraw-Hill.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Deep Restricted Kernel Machines Using Conjugate Feature Duality
2163
Srivastava, N., & Salakhutdinov, R. (2014). Multimodal learning with deep Boltz-
mann machines. Journal of Machine Learning Research, 15, 2949–2980.
Stewart, G. W. (1993). On the early history of the singular value decomposition.
SIAM Review, 35(4), 551–566.
Suykens, J. A. K. (2013). Generating quantum-measurement probabilities from an
optimality principle. Physical Review A, 87(5), 052134.
Suykens, J. A. K. (2016). SVD revisited: A new variational principle, compatible fea-
ture maps and nonlinear extensions. Applied and Computational Harmonic Analysis,
40(3), 600–609.
Suykens, J. A. K., Alzate, C., & Pelckmans, K. (2010). Primal and dual model repre-
sentations in kernel-based learning. Statistics Surveys, 4, 148–183.
Suykens, J. A. K., & Vandewalle, J. (1999a). Training multilayer perceptron classifiers
based on a modified support vector method. IEEE Transactions on Neural Networks,
10(4), 907–911.
Suykens, J. A. K., & Vandewalle, J. (1999b). Least squares support vector machine
classifiers. Neural Processing Letters, 9(3), 293–300.
Suykens, J. A. K., Vandewalle, J., & De Moor, B. (1995). Artificial neural networks for
modeling and control of non-linear systems. New York: Springer.
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2002).
Least squares support vector machines. Singapore: World Scientific.
Suykens, J. A. K., Van Gestel, T., Vandewalle, J., & De Moor, B. (2003). A support vec-
tor machine formulation to PCA analysis and its kernel version. IEEE Transactions
on Neural Networks, 14(2), 447–450.
Van Gestel, T., Suykens, J. A. K., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G.,
De Moor, B., & Vandewalle, J. (2004). Benchmarking least squares support vector
machine classifiers. Machine Learning, 54(1), 5–32.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.
Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2004). Exponential family harmoniums
with an application to information retrieval. In L. K. Saul, Y. Weiss, & L. Bottou
(Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT
Press.
Wiering, M. A., & Schomaker, L. R. B. (2014). Multi-layer support vector machines. In
J. A. K. Suykens, M. Signoretto, & A. Argyriou (Eds.), Regularization, optimization,
kernels, and support vector machines (pp. 457–476). Boca Raton, FL: Chapman &
Hall/CRC.
Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C.,
& Torr, P. H. S. (2015). Conditional random fields as recurrent neural networks.
In Proceedings of the International Conference on Computer Vision. Piscataway, NJ:
IEEE.
Received March 31, 2016; accepted March 15, 2017.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
2
9
8
2
1
2
3
1
9
8
1
6
5
8
n
e
c
o
_
a
_
0
0
9
8
4
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Download pdf