LETTER
Communicated by Nakul Verma
Safe Triplet Screening for Distance Metric Learning
Tomoki Yoshida
yoshida.t.mllab.nit@gmail.com
Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya,
Aichi, 466-8555, Japan
Ichiro Takeuchi
takeuchi.ichiro@nitech.ac.jp
Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, Aichi, 466-8555,
Japan; National Institute for Materials Science, Sengen, Tsukuba, Ibaraki, 305-0047,
Japan; and RIKEN Center for Advanced Intelligence Project, Nihonbashi,
Chuo-ku, Tokyo, 103-0012, Japan
Masayuki Karasuyama
karasuyama@nitech.ac.jp
Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, Aichi, 466-8555,
Japan; National Institute for Materials Science, Sengen, Tsukuba, Ibaraki, 305-0047,
Japan; and Japan Science and Technology Agency, Honcho, Kawaguchi-shi,
Saitama, 332-0012, Japan
Distance metric learning has been widely used to obtain the optimal dis-
tance function based on the given training data. We focus on a triplet-
based loss function, which imposes a penalty such that a pair of instances
in the same class is closer than a pair in different classes. However, the
number of possible triplets can be quite large even for a small data set,
and this considerably increases the computational cost for metric opti-
mization. In this letter, we propose safe triplet screening that identifies
triplets that can be safely removed from the optimization problem with-
out losing the optimality. In comparison with existing safe screening
studies, triplet screening is particularly significant because of the huge
number of possible triplets and the semidefinite constraint in the opti-
mization problem. We demonstrate and verify the effectiveness of our
screening rules by using several benchmark data sets.
1 Introduction
Using an appropriate distance function is essential for various machine
learning tasks. For example, the performance of a k-nearest neighbor
(k-NN) classifier, one of the most standard classification methods, depends
crucially on the distance between different input instances. The simple Eu-
clidean distance is usually employed, but it is not necessarily optimal for
a given data set and task. Thus, the adaptive optimization of the distance
metric based on supervised information is expected to improve the perfor-
mance of machine learning methods including k-NN.
Distance metric learning (Weinberger & Saul, 2009; Schultz & Joachims,
2004; Davis, Kulis, Jain, Sra, & Dhillon, 2007; Kulis, 2013) is a widely ac-
cepted technique for acquiring the optimal metric from observed data. The
standard problem setting is to learn the following parameterized Maha-
lanobis distance,
$$d_M(x_i, x_j) := \sqrt{(x_i - x_j)^\top M (x_i - x_j)},$$
where xi and x j are d-dimensional feature vectors and M ∈ Rd×d is a pos-
itive semidefinite matrix. This approach has been applied to tasks such as
classification (Weinberger & Saul, 2009), clustering (Xing, Jordan, Russell,
& Ng, 2003), and ranking (McFee & Lanckriet, 2010). These studies show
that the optimized distance metric improves the prediction performance of
each task. Metric optimization has also attracted wide interest, even from
researchers engaged in recent deep network studies (Schroff, Kalenichenko,
& Philbin, 2015; Hoffer & Ailon, 2015).
The seminal work of distance metric learning (Weinberger & Saul, 2009)
presents a triplet-based formulation. A triplet (i, j, l) is defined by the pair xi
and x j, which have the same label (same class), and xl, which has a different
label (different class). For a triplet (i, j, l), the desirable metric would satisfy d_M(x_i, x_j) < d_M(x_i, x_l), meaning that the pair in the same class is closer than the pair in different classes. For each of the triplets, Weinberger and Saul (2009) define a loss function that penalizes violations of this constraint,
$$\ell\left(d_M^2(x_i, x_l) - d_M^2(x_i, x_j)\right), \quad \text{for } (i, j, l) \in \mathcal{T},$$
where T is a set of triplets and ℓ : R → R is some loss function (e.g., the standard hinge loss function). In addition to the triplet loss, other approaches, such as pairwise- and virtual point-based loss functions, have been studied. In the pairwise approach, the number of pairs can be much smaller than the number of triplets; Davis et al. (2007) used only 20c² pairs, where c is the number of classes. The virtual point approach (Perrot & Habrard, 2015) converts the metric learning problem into a least squares problem, which minimizes a loss function for n virtual points. We particularly focus on the triplet approach because the relative evaluation d_M(x_i, x_j) < d_M(x_i, x_l) would be more appropriate for many metric learning applications such as nearest-neighbor classification (Weinberger & Saul, 2009) and similarity search (Jain, Kulis, Dhillon, & Grauman, 2009), in which relative comparison among objects plays an essential role. In fact, a recent comprehensive
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
1
1
2
2
4
3
2
1
8
6
5
2
4
8
n
e
c
o
_
a
_
0
1
2
4
0
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
2434
T. Yoshida, I. Takeuchi, and M. Karasuyama
survey (Li & Tian, 2018) showed that many current state-of-the-art meth-
ods are based on triplet loss, which they referred to as relative loss. Note
that although the quadruplets approach (Law, Thome, & Cord, 2013) can
also incorporate higher-order relations, we mainly focus on the triplet ap-
proach because it is much more popular in the community, although our
framework can also accommodate the quadruplet case, as we explain in
section 6.3.
However, the set of triplets T is quite large even for a small data set.
For example, in a two-class problem with 100 instances in each class, the
number of possible triplets is 1,980,000. Because processing a huge num-
ber of triplets is computationally prohibitive, a small subset is often used
in practice (Weinberger & Saul, 2009; Shi, Bellet, & Sha, 2014; Capitaine,
2016). Typically, a subset of triplets is selected by using the neighbors of
each training instance. For n training instances, Shi et al. (2014) selected
only 30n triplets, and Weinberger and Saul (2009) selected at most O(kn2)
triplets, where k is a prespecified constant. However, the effect on the final
accuracy of these heuristic selections is difficult to know beforehand. Jain,
Mason, and Nowak (2017) theoretically analyzed a probabilistic general-
ization error bound for a random subsampling strategy of triplets. Their
analysis revealed the sample complexity of metric learning, but the tight-
ness of the bound is not clear and they did not demonstrate the practical
use of determining the required number of triplets. For ordinal data em-
bedding, Jamieson and Nowak (2011) showed a lower bound of required
triplets Ω(dn log n) to determine the embedding, but the tightness of this
bound is also not known. Further, the applicability of the analysis to metric
learning was not clarified.
Our safe triplet screening enables the identification of triplets that can
be safely removed from the optimization problem without losing the opti-
mality of the resulting metric. This means that our approach can accelerate
the optimization of time-consuming metric learning with the guarantee of
optimality. Figure 1 shows a schematic illustration of safe triplet screening.
Our approach is inspired by the safe feature screening of Lasso (Ghaoui,
Viallon, & Rabbani, 2010), in which unnecessary features are identified by
the following procedure:
Step 1: Construct a bounded region in which the optimal dual solution
is guaranteed to exist.
Step 2: Given the bound created by step 1, remove features that cannot
be selected by Lasso.
This procedure is useful to mitigate the optimization difficulty of Lasso
for high-dimensional problems; thus, many papers propose a variety of
approaches to create bounded regions for obtaining a tighter bound that
increases screening performance (Wang, Zhou, Wonka, & Ye, 2013; Liu,
Figure 1: Metric learning with safe triplet screening. The naive optimization needs to minimize the sum of the loss function values for a huge number of triplets (i, j, l). Safe triplet screening identifies a subset of L* (blue points in the illustration on the right) and R* (green points in the illustration on the right), corresponding to the location on the loss function at which each triplet lies at the optimal M*. This enables the number of triplets in the optimization problem to be reduced.
Zhao, Wang, & Ye, 2014; Fercoq, Gramfort, & Salmon, 2015; Xiang, Wang, &
Ramadge, 2017). As another direction of research, the screening idea was
applied to other learning methods, including support vector machine non-
support vector screening (Ogawa, Suzuki, & Takeuchi, 2013), nuclear norm
regularization subspace screening (Zhou & Zhao, 2015), and group Lasso
group screening (Ndiaye, Fercoq, Gramfort, & Salmon, 2016).
Based on the safe feature screening techniques, we build the procedure
of our safe triplet screening as follows:
Step 1: Construct a bounded region in which the optimal solution M* is
guaranteed to exist.
Step 2: For each triplet (i, j, l) ∈ T , verify the possible loss function value
under the condition created by step 1.
We show that as a result of step 2, we can reduce the size of the met-
ric learning optimization problem, by which the computational cost of the
optimization can be drastically reduced. Although a variety of extensions
of safe screening have been studied in the machine learning community
(Lee & Xing, 2014; Wang, Wonka, & Ye, 2014; Zimmert, de Witt, Kerg, &
Kloft, 2015; Zhang et al., 2016; Ogawa et al., 2013; Okumura, Suzuki, &
Takeuchi, 2015; Shibagaki, Karasuyama, Hatano, & Takeuchi, 2016; Shiba-
gaki, Suzuki, Karasuyama, & Takeuchi, 2015; Nakagawa, Suzumura, Kara-
suyama, Tsuda, & Takeuchi, 2016; Takada, Hanada, Yamada, Sakuma, &
Takeuchi, 2016; Hanada, Shibagaki, Sakuma, & Takeuchi, 2018), to the best
of our knowledge, no studies have considered screening for metric learning.
Compared with existing studies, our safe triplet screening is particularly
significant due to the huge number of possible triplets and the semidefinite
constraint. Our technical contributions are summarized as follows:
• We derive six spherical regions in which the optimal M* must lie and
analyze their relationships.
• We derive three types of screening rules, each of which employs a
different approach to the semidefinite constraint.
• We derive efficient rule evaluation for a special case when M is a di-
agonal matrix.
• We build an extension for the regularization path calculation.
We further demonstrate the effectiveness of our approach based on several
benchmark data sets with a huge number of triplets.
This letter is organized as follows. In section 2, we define the optimiza-
tion problem of large-margin metric learning. In section 3, we first derive
six bounds containing the optimal M* for the subsequent screening procedure.
Section 4 derives the rules and constructs our safe triplet screening. The
computational cost for the rule evaluation is analyzed in section 5. Exten-
sions are discussed in section 6, in which an algorithm specifically designed
for the regularization path calculation, and a special case, in which M is a
diagonal matrix, are considered. In section 7, we present the evaluation of
our approach through numerical experiments. Section 8 concludes.
1.1 Notation. We denote by [n] the set {1, 2, . . . , n} for any integer n ∈ N. The inner product of matrices is denoted by ⟨A, B⟩ := Σ_ij A_ij B_ij = tr(A^⊤ B). The squared Frobenius norm is represented by ‖A‖²_F := ⟨A, A⟩. A positive semidefinite matrix M ∈ R^{d×d} is denoted by M ⪰ O or M ∈ R^{d×d}_+. By using the eigenvalue decomposition of a matrix M = V Λ V^⊤, the matrices M₊ and M₋ are defined as follows,
$$M = V(\Lambda_+ + \Lambda_-)V^\top = \underbrace{V \Lambda_+ V^\top}_{:= M_+} + \underbrace{V \Lambda_- V^\top}_{:= M_-},$$
where Λ₊ and Λ₋ are constructed only from the positive and negative components of the diagonal matrix Λ, respectively. Note that ⟨M₊, M₋⟩ = tr(V Λ₊ V^⊤ V Λ₋ V^⊤) = tr(V O V^⊤) = 0, and M₊ is the projection of M onto the semidefinite cone: M₊ = argmin_{A ⪰ O} ‖A − M‖²_F.
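To make this notation concrete, the following minimal NumPy sketch (an illustration added here, not code from the letter; the function name psd_split is hypothetical) computes M₊ and M₋ from a symmetric matrix via its eigenvalue decomposition and checks the stated properties.

```python
import numpy as np

def psd_split(M):
    """Split a symmetric matrix M into M_plus + M_minus using its
    eigenvalue decomposition; M_plus is the projection of M onto the
    positive semidefinite cone."""
    eigval, V = np.linalg.eigh(M)                     # M = V diag(eigval) V^T
    M_plus = (V * np.maximum(eigval, 0.0)) @ V.T      # keep positive eigenvalues
    M_minus = (V * np.minimum(eigval, 0.0)) @ V.T     # keep negative eigenvalues
    return M_plus, M_minus

# quick check on a random symmetric matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = (A + A.T) / 2
M_plus, M_minus = psd_split(M)
assert np.allclose(M_plus + M_minus, M)
assert abs(np.sum(M_plus * M_minus)) < 1e-8          # <M_plus, M_minus> = 0
```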
2 Preliminary
Let {(x_i, y_i) | i ∈ [n]} be n pairs of a d-dimensional feature vector x_i ∈ R^d and a label y_i ∈ Y, where Y is a discrete label space. We consider learning the following Mahalanobis distance,
$$d_M(x_i, x_j) := \sqrt{(x_i - x_j)^\top M (x_i - x_j)}, \tag{2.1}$$
where M ∈ R^{d×d}_+ is a positive semidefinite matrix that parameterizes the distance. As a general form of the metric learning problem, we consider
a regularized triplet loss minimization (RTLM) problem. Our formula-
tion is mainly based on a model originally proposed by Weinberger and
Saul (2009), which is reduced to a convex optimization problem with the
semidefinite constraint. For later analysis, we derive primal and dual for-
mulations, and to discuss the optimality of the learned metric, we focus on
the convex formulation of RTLM in this letter.
2.1 Triplet-Based Loss Function. We define a triplet of instances as
$$\mathcal{T} = \left\{(i, j, l) \mid (i, j) \in \mathcal{S}, \; y_i \neq y_l, \; l \in [n]\right\},$$
where S = {(i, j) | y_i = y_j, i ≠ j, (i, j) ∈ [n] × [n]}. The set S contains index pairs from the same class, and T represents a triplet of indices consisting of (i, j) ∈ S, and l, which is in a class that differs from that of i and j. We refer to the following loss as the triplet loss:
$$\ell\left(d_M^2(x_i, x_l) - d_M^2(x_i, x_j)\right), \quad \text{for } (i, j, l) \in \mathcal{T},$$
where ℓ : R → R is some loss function. By substituting equation 2.1 into the triplet loss, this can be written as
$$\ell\left(\langle M, H_{ijl} \rangle\right),$$
where H_ijl := (x_i − x_l)(x_i − x_l)^⊤ − (x_i − x_j)(x_i − x_j)^⊤. For the triplet loss, we consider the hinge function,
$$\ell(x) = \max\{0, 1 - x\}, \tag{2.2}$$
or the smoothed hinge function,
$$\ell(x) = \begin{cases} 0, & x > 1, \\ \frac{1}{2\gamma}(1 - x)^2, & 1 - \gamma \le x \le 1, \\ 1 - x - \frac{\gamma}{2}, & x < 1 - \gamma, \end{cases} \tag{2.3}$$
where γ > 0 is a parameter. Note that the smoothed hinge includes the
hinge function as a special case (γ → 0). The triplet loss imposes a penalty
if a pair (i, j) ∈ S is more distant than the threshold compared with a pair i
and l, which are in different classes. Both of the two loss functions contain
a region in which no penalty is imposed. We refer to this as the zero region.
The two loss functions also contain a region in which the penalty increases
linearly, which we refer to as the linear region.
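For concreteness, here is a short NumPy sketch of the smoothed hinge loss of equation 2.3 and its derivative (the hinge loss of equation 2.2 is recovered as γ → 0); this is an added illustration with hypothetical function names, not code from the letter.

```python
import numpy as np

def smoothed_hinge(x, gamma):
    """Smoothed hinge loss of equation 2.3, applied elementwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 1.0, 0.0,
           np.where(x >= 1.0 - gamma,
                    (1.0 - x) ** 2 / (2.0 * gamma),   # quadratic (smoothed) region
                    1.0 - x - gamma / 2.0))            # linear region

def smoothed_hinge_grad(x, gamma):
    """Derivative of the smoothed hinge loss (used later in equation 2.6)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 1.0, 0.0,
           np.where(x >= 1.0 - gamma, -(1.0 - x) / gamma, -1.0))

margins = np.array([1.5, 0.95, -0.2])
print(smoothed_hinge(margins, gamma=0.1), smoothed_hinge_grad(margins, gamma=0.1))
```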
2.2 Primal and Dual Formulation of Triplet-Based Distance Metric
Learning. Using the standard squared regularization, we consider the fol-
lowing RTLM as a general form of metric learning:
$$\min_{M \succeq O} \; P_\lambda(M) := \sum_{ijl} \ell\left(\langle M, H_{ijl} \rangle\right) + \frac{\lambda}{2}\|M\|_F^2, \tag{Primal}$$
where Σ_ijl denotes Σ_{(i,j,l)∈T}, and λ > 0 is a regularization parameter. In section 6.3, we discuss the relation of RTLM to existing metric learning methods.
The dual problem is written as
$$\max_{0 \le \alpha \le 1, \; \Gamma \succeq O} \; D_\lambda(\alpha, \Gamma) := -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\left\|M_\lambda(\alpha, \Gamma)\right\|_F^2, \tag{Dual1}$$
where α ∈ R^{|T|}, which contains α_ijl for (i, j, l) ∈ T, and Γ ∈ R^{d×d} are dual variables, and
$$M_\lambda(\alpha, \Gamma) := \frac{1}{\lambda}\left[\sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma\right]. \tag{2.4}$$
A derivation of this dual problem is presented in appendix A. Because maximizing the last term $-\frac{\lambda}{2}\|M_\lambda(\alpha, \Gamma)\|_F^2$ over Γ ⪰ O is equivalent to the projection onto a semidefinite cone (Boyd & Xiao, 2005; Malick, 2004), the above problem, Dual1, can be simplified as
$$\max_{0 \le \alpha \le 1} \; D_\lambda(\alpha) := -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\left\|M_\lambda(\alpha)\right\|_F^2, \tag{Dual2}$$
where
$$M_\lambda(\alpha) := \frac{1}{\lambda}\left[\sum_{ijl} \alpha_{ijl} H_{ijl}\right]_+.$$
For the optimal M*, each of the triplets in T can be categorized into three groups:
$$\begin{aligned}
\mathcal{R}^\star &:= \{(i, j, l) \in \mathcal{T} \mid \langle H_{ijl}, M^\star \rangle > 1\},\\
\mathcal{C}^\star &:= \{(i, j, l) \in \mathcal{T} \mid 1 - \gamma \le \langle H_{ijl}, M^\star \rangle \le 1\},\\
\mathcal{L}^\star &:= \{(i, j, l) \in \mathcal{T} \mid \langle H_{ijl}, M^\star \rangle < 1 - \gamma\}.
\end{aligned} \tag{2.5}$$
This indicates that the triplets in R* and those in L* lie in the zero region and the linear region of the loss function, respectively. The well-known KKT conditions provide the following relation between the optimal dual variable and the derivative of the loss function (see appendix A for details):
$$\alpha^\star_{ijl} = -\nabla \ell\left(\langle M^\star, H_{ijl} \rangle\right). \tag{2.6}$$
In the case of hinge loss, the derivative is written as
$$\nabla \ell\left(\langle M^\star, H_{ijl} \rangle\right) = \begin{cases} 0, & \langle M^\star, H_{ijl} \rangle > 1, \\ -c, & \langle M^\star, H_{ijl} \rangle = 1, \\ -1, & \langle M^\star, H_{ijl} \rangle < 1, \end{cases}$$
where c ∈ [0, 1] is arbitrary. In the case of smoothed hinge loss, the derivative is
$$\nabla \ell\left(\langle M^\star, H_{ijl} \rangle\right) = \begin{cases} 0, & \langle M^\star, H_{ijl} \rangle > 1, \\ -\frac{1}{\gamma}\left(1 - \langle M^\star, H_{ijl} \rangle\right), & 1 - \gamma \le \langle M^\star, H_{ijl} \rangle \le 1, \\ -1, & \langle M^\star, H_{ijl} \rangle < 1 - \gamma. \end{cases}$$
Both cases can be represented as
$$\nabla \ell\left(\langle M^\star, H_{ijl} \rangle\right) \begin{cases} = 0, & (i, j, l) \in \mathcal{R}^\star, \\ \in [-1, 0], & (i, j, l) \in \mathcal{C}^\star, \\ = -1, & (i, j, l) \in \mathcal{L}^\star. \end{cases} \tag{2.7}$$
From equations 2.7 and 2.6, we obtain the following rules for the optimal
dual variable:
$$\begin{aligned}
(i, j, l) \in \mathcal{R}^\star &\Rightarrow \alpha^\star_{ijl} = 0,\\
(i, j, l) \in \mathcal{C}^\star &\Rightarrow \alpha^\star_{ijl} \in [0, 1],\\
(i, j, l) \in \mathcal{L}^\star &\Rightarrow \alpha^\star_{ijl} = 1.
\end{aligned} \tag{2.8}$$
The nonlinear semidefinite programming problem of RTLM can be
solved by gradient methods including the primal-based (Weinberger &
Saul, 2009) and dual-based approaches (Shen, Kim, Liu, Wang, & Van Den
Hengel, 2014). However, the amount of computation may be prohibitive be-
cause of the large number of triplets. The naive calculation of the objective
function requires O(d²|T|) computations for both the primal and the dual
cases.
2.3 Reduced-Size Optimization Problem. Assume that we have a subset of triplets (i, j, l) ∈ L* ∪ R* before solving the optimization problem. Let L̂ ⊆ L* and R̂ ⊆ R* be the subsets of L* and R* we identify. Then, based
on this prior knowledge, the optimization problem, Primal, can be trans-
formed into the following reduced-size problem:
$$\tilde{P}_\lambda(M) = \sum_{(i,j,l) \in \tilde{\mathcal{T}}} \ell\left(\langle M, H_{ijl} \rangle\right) + \frac{\lambda}{2}\|M\|_F^2 + \left(1 - \frac{\gamma}{2}\right)|\hat{\mathcal{L}}| - \left\langle M, \sum_{(i,j,l) \in \hat{\mathcal{L}}} H_{ijl} \right\rangle, \tag{2.9}$$
where T̃ := T − L̂ − R̂. This problem differs from the original, Primal, as follows:
• The loss term for ˆR is removed because it does not produce any
penalty at the optimal solution.
• The loss term for ˆL is fixed at the linear part of the loss function by
which the sum over triplets can be calculated beforehand (the last two
terms).
The dual problem of this reduced-size problem can be written as
$$\max_{0 \le \alpha \le 1} \; \tilde{D}_\lambda(\alpha) := -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\left\|M_\lambda(\alpha)\right\|_F^2, \quad \text{s.t. } \alpha_{\hat{\mathcal{L}}} = \mathbf{1}, \; \alpha_{\hat{\mathcal{R}}} = \mathbf{0}, \tag{2.10}$$
which is the same optimization problem as Dual2 except that α ˆL and α ˆR are
fixed. Because of this constraint, the number of free variables in this dual
problem is | ˜T |. An important property of a reduced-size problem is that it
retains the same optimal solution as the original problem:
Lemma 1. The primal-dual problem pair, equations 2.9 and 2.10, and the original
problem pair, Primal and Dual2, have the same optimal primal and dual solutions.
The proof of this lemma is shown in appendix B, along with the derivation
of the reduced-size dual, equation 2.10. Therefore, if a large number of ˆL
and R̂ could be detected beforehand (i.e., |T̃| ≪ |T|), the metric learning
optimization would be accelerated dramatically.
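The reduction can be implemented by precomputing the two L̂-dependent terms of equation 2.9 once. The following sketch (an added illustration; the function names are hypothetical and the H_ijl matrices are assumed to be precomputed) evaluates the reduced primal objective over the remaining triplets only.

```python
import numpy as np

def reduced_primal_terms(H_hat_L, gamma):
    """Precompute the constant pieces of the reduced primal (equation 2.9):
    the scalar (1 - gamma/2) * |L_hat| and the matrix sum of H_ijl over L_hat.
    H_hat_L: list of d x d matrices H_ijl for triplets screened into L_hat."""
    const = (1.0 - gamma / 2.0) * len(H_hat_L)
    H_sum = sum(H_hat_L) if H_hat_L else 0.0
    return const, H_sum

def reduced_primal(M, H_tilde, H_hat_L, lam, gamma, loss):
    """Reduced-size primal objective: the loss is summed only over the
    remaining triplets T_tilde, while the L_hat triplets contribute the
    precomputed constant and linear terms."""
    const, H_sum = reduced_primal_terms(H_hat_L, gamma)
    loss_term = sum(loss(np.sum(M * H)) for H in H_tilde)   # <M, H_ijl> over T_tilde
    reg = 0.5 * lam * np.sum(M * M)
    linear = np.sum(M * H_sum) if len(H_hat_L) else 0.0     # <M, sum_{L_hat} H_ijl>
    return loss_term + reg + const - linear
```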
3 Spherical Bound
As we will see, our safe triplet screening is derived by using a spherical re-
gion that contains the optimal M*. In this section, we show that six variants
of the regions are created by three types of different approaches. Note that
the proofs for all the theorems appear in the appendixes.
3.1 Gradient Bound. We first introduce a hypersphere, which we name
gradient bound (GB), because the center and radius of the hypersphere are
represented by the subgradient of the objective function:
Theorem 1 (GB). Given any feasible solution M ⪰ O, the optimal solution M* for λ exists in the following hypersphere:
$$\left\|M^\star - Q_{GB}(M)\right\|_F^2 \le \left(\frac{1}{2\lambda}\left\|\nabla P_\lambda(M)\right\|_F\right)^2,$$
where $Q_{GB}(M) := M - \frac{1}{2\lambda}\nabla P_\lambda(M)$.
The proof is in appendix C. This theorem is an extension of the sphere for
SVM (Shibagaki et al., 2015), which can be treated as a simple unconstrained
problem.
3.2 Projected Gradient Bound. Even when we substitute the optimal
M* into the reference solution M, the radius of the GB is not guaranteed to be 0. By projecting the center of GB onto the feasible region (i.e., a semidefinite cone), another GB-based hypersphere can be derived, which has a radius converging to 0 at the optimum. We refer to this extension as the projected gradient bound (PGB); a schematic illustration is shown in Figure 2a. In Figure 2a, the center of the GB, Q_GB (the abbreviation of Q_GB(M)), is projected onto the semidefinite cone, which becomes the center of the PGB, [Q_GB]₊. The sphere of PGB can be written as
Theorem 2 (PGB). Given any feasible solution M ⪰ O, the optimal solution M* for λ exists in the following hypersphere:
$$\left\|M^\star - \left[Q_{GB}(M)\right]_+\right\|_F^2 \le \left(\frac{1}{2\lambda}\left\|\nabla P_\lambda(M)\right\|_F\right)^2 - \left\|\left[Q_{GB}(M)\right]_-\right\|_F^2.$$
The proof is in appendix D. PGB contains the projections onto the positive
and the negative semidefinite cone in the center and the radius, respectively.
These projections require the eigenvalue decomposition of $M - \frac{1}{2\lambda}\nabla P_\lambda(M)$.
This decomposition, however, only needs to be performed once to evaluate
the screening rules of all the triplets. In the standard optimization proce-
dures of RTLM, including Weinberger and Saul (2009), the eigenvalue de-
composition of the d × d matrix is calculated in every iterative cycle, and
thus, the computational complexity is not increased by PGB.
The following theorem shows a superior convergence property of PGB
compared to GB:
Figure 2: Illustrations of spherical bounds.
Theorem 3. There exists a subgradient ∇P_λ(M*) such that the radius of PGB
is 0.
For the hinge loss, which is not differentiable at the kink, the optimal dual
variables provide subgradients that set the radius equal to 0. This theorem
is an immediate consequence of the proof in appendix I, which is the proof
for the relation between PGB and the other bound derived in section 3.4.
From Figure 2a, we see that the half space ⟨−[Q_GB]₋, X⟩ ≥ 0, where [Q_GB]₋ = Q_GB − [Q_GB]₊, can be used as a linear relaxation of the semidefinite constraint for the linear constraint rule in section 4.3. Interestingly, the GB with this linear constraint is tighter than the PGB. This is proved in appendix D, which gives the proof of the PGB.
3.3 Duality Gap Bound. In this section, we describe the duality gap
bound (DGB) in which the radius is represented by the duality gap:
Theorem 4 (DGB). Let M be a feasible solution of the primal problem and α and
Γ be feasible solutions of the dual problem. Then the optimal solution of the primal problem M* exists in the following hypersphere:
$$\left\|M^\star - M\right\|_F^2 \le 2\left(P_\lambda(M) - D_\lambda(\alpha, \Gamma)\right)/\lambda.$$
The proof is in appendix E. Because the radius is proportional to the square
root of the duality gap, DGB obviously converges to 0 at the optimal solu-
tion (see Figure 2b). The DGB, unlike the previous bounds, requires a dual
feasible solution. This means that when a primal-based optimization algo-
rithm is employed, we need to create a dual feasible solution from the pri-
mal feasible solution. A simple way to create a dual feasible solution is to
substitute the current M into M* of equation 2.6. When a dual-based opti-
mization algorithm is employed, a primal feasible solution can be created
by equation 2.4.
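As an added illustration of this construction (not code from the letter; it assumes the smoothed hinge loss and explicitly formed H_ijl matrices, which is only practical for small problems), the sketch below builds a dual feasible α from a primal feasible M via equation 2.6, evaluates the Primal and Dual2 objectives, and returns the DGB radius of theorem 4.

```python
import numpy as np

def dgb_radius(M, H_list, lam, gamma):
    """Sketch of theorem 4 (DGB): build a dual feasible alpha from a primal
    feasible M via equation 2.6, evaluate the primal/dual objectives of
    (Primal)/(Dual2), and return the spherical-bound radius."""
    margins = np.array([np.sum(M * H) for H in H_list])       # <M, H_ijl>

    # smoothed hinge loss values and dual feasible alpha = -grad of the loss
    loss = np.where(margins > 1, 0.0,
           np.where(margins >= 1 - gamma, (1 - margins) ** 2 / (2 * gamma),
                    1 - margins - gamma / 2))
    alpha = np.clip(np.where(margins > 1, 0.0,
                    np.where(margins >= 1 - gamma, (1 - margins) / gamma, 1.0)),
                    0.0, 1.0)

    primal = loss.sum() + 0.5 * lam * np.sum(M * M)

    # M_lambda(alpha) = (1/lam) [sum_ijl alpha_ijl H_ijl]_+ (PSD projection)
    S = sum(a * H for a, H in zip(alpha, H_list))
    w, V = np.linalg.eigh(S)
    M_alpha = ((V * np.maximum(w, 0.0)) @ V.T) / lam
    dual = -0.5 * gamma * np.sum(alpha ** 2) + alpha.sum() \
           - 0.5 * lam * np.sum(M_alpha ** 2)

    gap = max(primal - dual, 0.0)
    return np.sqrt(2.0 * gap / lam)
```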
For the DGB, we can derive a tighter bound, the constrained duality gap
bound (CDGB), with an additional constraint. However, except for a special
case (dynamic screening with a dual solver), additional transformation of
the reference solution is necessary, which can deteriorate the duality gap.
See appendix F for further details.
3.4 Regularization Path Bound. In Wang et al. (2014), a hypersphere is
proposed specifically for the regularization path, in which the optimization
problem should be solved for a sequence of λs. Suppose that λ0 has already
been optimized and it is necessary to optimize λ1. Then the same approach
as Wang et al. (2014) is applicable to our RTLM, which derives a bound
depending on the optimal solution for λ0 as a reference solution:
Theorem 5 (RPB). Let M*_0 be the optimal solution for λ_0. Then the optimal solution M*_1 for λ_1 exists in the following hypersphere:
$$\left\|M^\star_1 - \frac{\lambda_0 + \lambda_1}{2\lambda_1}M^\star_0\right\|_F \le \frac{|\lambda_0 - \lambda_1|}{2\lambda_1}\left\|M^\star_0\right\|_F.$$
The proof is in appendix G. We refer to this bound as the regularization path
bound (RPB).
The RPB requires the theoretically optimal solution M*_0, which is numerically impossible to obtain. Furthermore, because the reference solution is fixed at M*_0, the RPB can be applied only once for a specific pair of λ_0 and λ_1 even if the optimal M*_0 is available. The other bounds can be applied multiple times during the optimization by regarding the current approximate solution as a reference solution.
3.5 Relaxed Regularization Path Bound. To use the RPB in practice, we
modify this bound in such a way that the approximate solution can be used
as a reference solution. Assume that M_0 satisfies
$$\left\|M^\star_0 - M_0\right\|_F \le \epsilon,$$
where ε ≥ 0 is a constant. Given M_0, which satisfies the above condition, we obtain the relaxed regularization path bound (RRPB):
Theorem 6 (RRPB). Let M_0 be an approximate solution for λ_0, which satisfies ‖M*_0 − M_0‖_F ≤ ε. The optimal solution M*_1 for λ_1 exists in the following hypersphere:
$$\left\|M^\star_1 - \frac{\lambda_0 + \lambda_1}{2\lambda_1}M_0\right\|_F \le \frac{|\lambda_0 - \lambda_1|}{2\lambda_1}\left\|M_0\right\|_F + \frac{|\lambda_0 - \lambda_1| + \lambda_0 + \lambda_1}{2\lambda_1}\epsilon. \tag{3.1}$$
The proof is in appendix H. The intuition behind the RRPB is shown in
Figure 2d, in which the approximation error for the center of the RPB is
depicted. In the theorem, the RRPB also considers the error in the radius,
although it is not illustrated in the figure for simplicity. To the best of our
knowledge, this approach has not been introduced in other existing screen-
ing studies.
For example, ε can be set from theorem 4 (DGB) as follows:
$$\epsilon = \sqrt{2\left(P_{\lambda_0}(M_0) - D_{\lambda_0}(\alpha_0, \Gamma_0)\right)/\lambda_0}. \tag{3.2}$$
When the optimization for λ0 terminates, the solution M0 should be accu-
rate in terms of some stopping criterion such as the duality gap. Then ε is expected to be quite small, and the RRPB can provide a tight bound for λ_1, which is close to the ideal (but not computable) RPB. As a special case, by setting λ_1 = λ_0, the RRPB can be applied to perform the screening of λ_1 using any approximate solution having ‖M*_1 − M‖_F ≤ ε, and then the RRPB is equivalent to the DGB.
3.6 Analytical Relation between Bounds. The following theorem de-
scribes the relation between PGB and RPB:
Table 1: Comparison of Sphere Bounds.

Bound              Radius Convergence   Dynamic Screening   Reference Solution   Exact Optimality of Reference
GB                 Can be > 0           Applicable          Primal               Not necessary
PGB                = 0 (a)              Applicable          Primal               Not necessary
DGB                = 0                  Applicable          Primal/dual          Not necessary
CDGB               = 0                  Applicable          Primal/dual          Not necessary
RPB                NA                   Not applicable      Primal               Necessary
RRPB (RPB + DGB)   = 0                  Applicable          Primal/dual          Not necessary

Note: The radius convergence indicates the radius when the reference solution is the optimal solution.
(a) For the hinge loss (γ = 0) case, a subgradient is required to be selected appropriately for achieving this convergence.
Theorem 7 (Relation between PGB and RPB). Suppose that the optimal solution M*_0 for λ_0 is substituted into the reference solution M of PGB. Then there exists a subgradient ∇P_{λ_1}(M*_0) by which the PGB and RPB provide the same center and radius for M*_1.

The proof is presented in appendix I. The following theorem describes the relation between the DGB and RPB:

Theorem 8 (Relation between DGB and RPB). Suppose that the optimal solutions M*_0, α*_0, and Γ*_0 for λ_0 are substituted into the reference solutions M, α, and Γ of the DGB. Then the radii of DGB and RPB for λ_1 have the relation r_DGB = 2 r_RPB, and the hypersphere of RPB is included in the hypersphere of DGB.
The proof is in appendix J. Figure 2c illustrates the relation between the
DGB and RPB, which shows the theoretical advantage of the RPB for the
regularization path setting.
Using the analytical results obtained thus far, we summarize relative re-
lations between the bounds as follows. First, we consider the case in which
the reference solution is optimal for λ0 in the regularization path calculation.
We obviously see rGB ≥ rPGB from Figure 2a, and from theorems 7 and 8, we
see DGB ⊇ PGB = RPB = RRPB. When the reference solution is an approxi-
mate solution in the regularization path calculation, we see only rGB ≥ rPGB.
For dynamic screening in which the reference solution is always an approx-
imate solution, we see rGB ≥ rPGB, and we also see RRPB = DGB when ε is
determined by DGB as written in equation 3.2.
Other properties of the bounds are summarized in Table 1. Although
DGB and RRPB (RPB + DGB) have the same properties, our empirical eval-
uation in section 7.2 shows that RRPB often outperforms DGB in the reg-
ularization path calculation. (Note that although CDGB also has the same
properties as the above two methods, we omit it in the empirical evaluation
because of its practical limitation, as we see in section 3.3.)
4 Safe Rules for Triplets
Our safe triplet screening can reduce the number of triplets by identifying
a part of L* and R* before solving the optimization problem based on the
following procedure:
Step 1: Identify the spherical region in which the optimal solution M* lies, based on the current feasible solution, which we refer to as the reference solution.
Step 2: For each triplet (i, j, l) ∈ T, verify the possibility of (i, j, l) ∈ L* or (i, j, l) ∈ R* under the condition that M* is in the region.
In section 3, we showed that there exist a variety of approaches to creating
the spherical region for step 1. In this section, we describe the procedure of
step 2 given the sphere region.
Letting B be a region that contains M*, the following screening rules can be derived from equation 2.5:
$$\max_{X \in \mathcal{B}} \langle X, H_{ijl} \rangle < 1 - \gamma \Rightarrow (i, j, l) \in \mathcal{L}^\star, \tag{R1}$$
$$\min_{X \in \mathcal{B}} \langle X, H_{ijl} \rangle > 1 \Rightarrow (i, j, l) \in \mathcal{R}^\star. \tag{R2}$$
Based on these rules, L̂ ⊆ L* and R̂ ⊆ R* are constructed as
$$\hat{\mathcal{L}} = \left\{(i, j, l) \;\middle|\; \max_{X \in \mathcal{B}} \langle X, H_{ijl} \rangle < 1 - \gamma\right\}, \quad \hat{\mathcal{R}} = \left\{(i, j, l) \;\middle|\; \min_{X \in \mathcal{B}} \langle X, H_{ijl} \rangle > 1\right\}.$$
We present an efficient approach to evaluating these rules. Because equation
R1 can be evaluated in the same way as R2, we are concerned only with
equation R2 henceforth.
4.1 Spherical Rule. Suppose that the optimal M* lies in a hypersphere defined by a center Q ∈ R^{d×d} and a radius r ∈ R₊. To evaluate the condition of equation R2, we consider the following minimization problem, equation P1:
$$\min_X \langle X, H_{ijl} \rangle \quad \text{s.t. } \|X - Q\|_F^2 \le r^2. \tag{P1}$$
Figure 3: Spherical rule defined by equation P1. The yellow sphere indicates the
region in which the optimal M* must exist. The terms "max" and "min" indicate the points at which the maximum and minimum values of the inner product ⟨X, H_ijl⟩ are attained. If ⟨X*, H_ijl⟩ > 1 holds, condition R2 is guaranteed to be satisfied.
Letting Y := X − Q, this problem is transformed into
$$\min_Y \langle Y, H_{ijl} \rangle + \langle Q, H_{ijl} \rangle \quad \text{s.t. } \|Y\|_F^2 \le r^2.$$
Because ⟨Q, H_ijl⟩ is a constant, this optimization problem entails minimizing the inner product ⟨Y, H_ijl⟩ under the norm constraint. The optimal Y* of this optimization problem is easily derived as
$$Y^\star = -r H_{ijl} / \|H_{ijl}\|_F,$$
and then the minimum value of P1 is ⟨H_ijl, Q⟩ − r‖H_ijl‖_F. Figure 3 shows a schematic illustration. This derives the following spherical rule:
$$\langle H_{ijl}, Q \rangle - r\|H_{ijl}\|_F > 1 \;\Rightarrow\; (i, j, l) \in \mathcal{R}^\star. \tag{4.1}$$
This condition can be easily evaluated for a given Q and r.
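A minimal vectorized sketch of this rule evaluation (an added illustration with a hypothetical function name), assuming the inner products ⟨H_ijl, Q⟩ and norms ‖H_ijl‖_F have been precomputed for all triplets; it returns the triplets screened into R* by equation 4.1 and into L* by the analogous rule derived from R1.

```python
import numpy as np

def spherical_rules(inner_HQ, norm_H, r, gamma):
    """Evaluate the spherical screening rules given, for each triplet,
    the precomputed inner product <H_ijl, Q> and norm ||H_ijl||_F.
    Returns boolean masks for triplets safely assigned to R* and L*."""
    inner_HQ = np.asarray(inner_HQ)
    norm_H = np.asarray(norm_H)
    in_R = inner_HQ - r * norm_H > 1.0           # min over the sphere exceeds 1
    in_L = inner_HQ + r * norm_H < 1.0 - gamma   # max over the sphere is below 1 - gamma
    return in_R, in_L
```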
4.2 Spherical Rule with a Semidefinite Constraint. The spherical rule
does not utilize the positive semidefiniteness of M*; therefore, a stronger rule can be constructed by incorporating a semidefinite constraint into equation P1:
$$\min_X \langle X, H_{ijl} \rangle \quad \text{s.t. } \|X - Q\|_F^2 \le r^2, \; X \succeq O. \tag{P2}$$
Although the analytical solution is not available, equation P2 can be solved
efficiently by transforming it into the semidefinite least squares (SDLS)
problem (Malick, 2004).
Let B_PSD := {X | ‖X − Q‖²_F ≤ r², X ⪰ O} be the feasible region of the optimization problem P2. To present the connection between SDLS and equation P2, we first assume that there exists a feasible solution X_0 for equation P2 that satisfies ⟨X_0, H_ijl⟩ > 1:
$$\exists X_0 \text{ such that } \langle X_0, H_{ijl} \rangle > 1 \text{ and } X_0 \in \mathcal{B}_{PSD}. \tag{4.2}$$
Instead of equation P2, we consider the following SDLS problem:
$$\min_{X \in \mathbb{R}^{d\times d}} \|X - Q\|_F^2 \quad \text{s.t. } \langle X, H_{ijl} \rangle = 1, \; X \succeq O. \tag{SDLS}$$
If the optimal value of this problem is greater than r² (i.e., ‖X* − Q‖²_F > r²), there is no intersection between B_PSD and the subspace defined by ⟨X, H_ijl⟩ = 1:
$$\left\{X \mid \langle X, H_{ijl} \rangle = 1, \; X \in \mathcal{B}_{PSD}\right\} = \emptyset. \tag{4.3}$$
From assumption 4.2, we have
$$\left\{X \mid \langle X, H_{ijl} \rangle > 1, \; X \in \mathcal{B}_{PSD}\right\} \neq \emptyset. \tag{4.4}$$
As B_PSD is a convex set, based on the two conditions 4.3 and 4.4, we derive
$$\left\{X \mid \langle X, H_{ijl} \rangle \le 1, \; X \in \mathcal{B}_{PSD}\right\} = \emptyset,$$
which indicates
$$\min_{X \in \mathcal{B}_{PSD}} \langle X, H_{ijl} \rangle > 1.$$
Thus, the condition of equation R2 is satisfied.
Based on the connection shown above, the rule evaluation, equation R2,
with the semidefinite constraint is summarized as follows:
1. Select an arbitrary feasible solution X_0 ∈ B_PSD. If ⟨X_0, H_ijl⟩ ≤ 1, we immediately see that the condition of equation R2 is not satisfied for the triplet (i, j, l). Otherwise, go to the next step. Note that in this case, assumption 4.2 is confirmed because ⟨X_0, H_ijl⟩ > 1.
2. Solve SDLS. If the optimal value satisfies ‖X* − Q‖²_F > r², the triplet (i, j, l) is guaranteed to be in R*.
For calculating the second step, we derive the following dual problem of equation SDLS based on Malick (2004):
$$\max_y \; D_{SDLS}(y) := -\left\|\left[Q + yH_{ijl}\right]_+\right\|_F^2 + 2Cy + \|Q\|_F^2,$$
where y ∈ R is a dual variable, and C = 1 for equation R2 and C = 1 − γ
for equation R1. Unlike the primal problem, the dual version is an un-
constrained problem that has only one variable, y, and thus, standard
gradient-based algorithms rapidly converge. We refer to the quasi-Newton
optimization for this problem as the SDLS dual ascent method. During dual
ascent, we can terminate the iteration before convergence if DSDLS(y) be-
comes larger than r2 because the value of the dual problem does not exceed
the value of the primal problem (weak duality).
Although the computation of [Q + yH i jl]+ requires an eigenvalue de-
composition, this computational requirement can be alleviated when the
center Q of the hypersphere is positive semidefinite. The definition deter-
mines that H i jl has at most one negative eigenvalue, and then Q + yH i jl
also has at most one negative eigenvalue. Let λmin be the negative (min-
imum) eigenvalue of Q + yH i jl, and qmin be the corresponding eigenvec-
tor. The projection [Q + yH_ijl]₊ can be expressed as [Q + yH_ijl]₊ = (Q + yH_ijl) − λ_min q_min q_min^⊤. Computation of the minimum eigenvalue and eigen-
vector is much easier than the full eigenvalue decomposition (Lehoucq &
Sorensen, 1996).
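The following sketch (an added illustration; not the authors' implementation) maximizes D_SDLS(y) with an off-the-shelf one-dimensional optimizer instead of the quasi-Newton ascent, and uses a full eigenvalue decomposition for the projection; when Q is positive semidefinite, the rank-one update described above can replace psd_part.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def psd_part(A):
    """Projection [A]_+ onto the positive semidefinite cone (full eigendecomposition;
    a minimum-eigenvalue rank-one correction suffices when Q is PSD)."""
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

def sdls_dual_value(y, Q, H, C):
    """Dual objective D_SDLS(y) of the SDLS problem."""
    return -np.sum(psd_part(Q + y * H) ** 2) + 2.0 * C * y + np.sum(Q ** 2)

def sdls_rule_fires(Q, H, r, C=1.0):
    """Sketch of the rule of section 4.2: by weak duality, if the maximized
    dual value exceeds r**2, the triplet can be screened (into R* for C = 1)."""
    res = minimize_scalar(lambda y: -sdls_dual_value(y, Q, H, C))
    return -res.fun > r ** 2
```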
As a special case, when M is a diagonal matrix, the semidefinite con-
straint is reduced to the nonnegative constraint, and analytical calculation
of rule P2 is possible (see section 6.2).
4.3 Spherical Rule with Linear Constraint. Here, we reduce the com-
putational complexity by considering the relaxation of the semidefinite
constraint into a linear constraint. Suppose that a region defined by the linear inequality {X ∈ R^{d×d} | ⟨P, X⟩ ≥ 0} contains the semidefinite cone, R^{d×d}_+ ⊆ {X ∈ R^{d×d} | ⟨P, X⟩ ≥ 0}, for which we describe the determination of P ∈ R^{d×d} later. Using this relaxed constraint, the problem for condition R2 becomes
$$\min_X \langle X, H_{ijl} \rangle \quad \text{s.t. } \|X - Q\|_F^2 \le r^2, \; \langle P, X \rangle \ge 0. \tag{P3}$$
This problem can be solved analytically by considering the KKT conditions
as follows (see appendix K).
Theorem 9 (Analytical Solution of Equation P3). The optimal solution of equa-
tion P3 is as follows:
$$\langle H_{ijl}, X^\star \rangle = \begin{cases} 0, & \text{if } H_{ijl} = aP, \\ \langle H_{ijl}, Q \rangle - r\|H_{ijl}\|_F, & \text{if } \left\langle P, \; Q - r\frac{H_{ijl}}{\|H_{ijl}\|_F} \right\rangle \ge 0, \\ \left\langle H_{ijl}, \; \frac{\beta P - H_{ijl}}{\alpha} + Q \right\rangle, & \text{otherwise}, \end{cases}$$
Figure 4: Linear relaxation of semidefinite constraint. From the projection of A to A₊, the supporting hyperplane ⟨−A₋, X⟩ = 0 is constructed, and the half-space {X | ⟨−A₋, X⟩ ≥ 0} contains the semidefinite cone X ⪰ O.
where a is a constant and
$$\alpha = \sqrt{\frac{\|P\|_F^2 \|H_{ijl}\|_F^2 - \langle P, H_{ijl} \rangle^2}{r^2\|P\|_F^2 - \langle P, Q \rangle^2}}, \quad \beta = \frac{\langle P, H_{ijl} \rangle - \alpha\langle P, Q \rangle}{\|P\|_F^2}.$$
A simple way to obtain P is to utilize the projection onto the
semidefinite cone. Let A ∈ Rd×d be a matrix external to the semidefinite cone
as illustrated in Figure 4. In the figure, A+ is the projection of A onto the
semidefinite cone. For example, when the projected gradient for the primal
problem (Weinberger & Saul, 2009) is used as an optimizer, A can be an up-
date of the gradient descent A = M − η∇Pλ(M) with some step size η > 0.
Because M − η∇Pλ(M) is projected onto the semidefinite cone at every iter-
ative step of the optimization, no additional calculation is required to obtain
A and A₊. Defining A₋ := A − A₊, for any X ⪰ O, we obtain
$$\langle A_+ - A, X - A_+ \rangle \ge 0 \;\Leftrightarrow\; \langle -A_-, X \rangle \ge 0.$$
The inequality on the left has its origins in the property of a supporting hyperplane (Boyd & Vandenberghe, 2004), and for the inequality on the right, we use ⟨A₊, A₋⟩ = 0. By setting P = −A₋, we obtain a linear approximation
of the semidefinite constraint, which is a superset of the original semidefi-
nite cone.
A necessary condition for performing our screening is that a loss function
needs to have at least one linear region or a zero region. For example, the
logistic loss cannot be used for screening because it has neither a linear nor
a zero region.
5 Computations
Algorithm 1 shows the detailed procedure of our safe screening with sim-
ple fixed step-size gradient descent. (Note that any other optimization
algorithm can be combined with our screening procedure.) In the algo-
rithm, for every freq iteration of the gradient descent, the screening rules
are evaluated by using the current solution M as the reference solution. As
the quality of the approximate solution M improves, a larger number of triplets can be removed from T. Thus, the quality of the initial solu-
tion affects the efficiency. In the case of the regularization path calculation,
in which RTLM is solved for a sequence of λs, a reasonable initial solution is
the approximate solution to the previous λ. We discuss a further extension
specific to the regularization path calculation in section 6.1.
Considering the computational cost of the screening procedure of algo-
rithm 1, the rule evaluation (step 2) described in section 4 is often domi-
nant, because the rule needs to be evaluated for each one of the triplets. The
sphere, constructed in step 1, can be fixed during the screening procedure
as long as the reference solution is fixed.
To evaluate the spherical rule, equation 4.1, given the center Q and the radius r, the inner product ⟨H_ijl, Q⟩ and the norm ‖H_ijl‖_F need to be evaluated. The inner product ⟨H_ijl, Q⟩ can be calculated in O(d²) operations because it is expanded as a sum of quadratic forms: ⟨H_ijl, Q⟩ = (x_i − x_l)^⊤ Q (x_i − x_l) − (x_i − x_j)^⊤ Q (x_i − x_j). Further, we can reuse this term from the objective function P_λ(M) calculation in the case of the DGB, RPB, and RRPB. The norm ‖H_ijl‖_F can be calculated in O(d) operations, and this is constant throughout the optimization process. Thus, for the DGB, RPB, or RRPB, it is possible to reduce the additional computational cost of the spherical rule for (i, j, l) to O(1) by calculating ‖H_ijl‖_F beforehand. The computational cost of the spherical rule with the semidefinite constraint (see section 4.2) is that of the SDLS algorithm. The SDLS algorithm needs O(d³) because of the eigenvalue decomposition in every iterative cycle, which may considerably increase the computational cost. The computational cost of the spherical rule with the linear constraint (see section 4.3) is O(d²).
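A sketch of these evaluations (an added illustration with a hypothetical function name): the inner product is computed in O(d²) from two quadratic forms, and ‖H_ijl‖_F in O(d) from inner products of the two difference vectors, without ever forming H_ijl.

```python
import numpy as np

def rule_quantities(x_i, x_j, x_l, Q):
    """Quantities needed by the spherical rule without forming H_ijl:
      <H_ijl, Q>  in O(d^2) via two quadratic forms,
      ||H_ijl||_F in O(d)   via inner products of the difference vectors."""
    a = x_i - x_l        # different-class difference
    b = x_i - x_j        # same-class difference
    inner_HQ = a @ Q @ a - b @ Q @ b
    # ||a a^T - b b^T||_F^2 = (a^T a)^2 + (b^T b)^2 - 2 (a^T b)^2
    norm_H = np.sqrt((a @ a) ** 2 + (b @ b) ** 2 - 2.0 * (a @ b) ** 2)
    return inner_HQ, norm_H
```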
6 Extensions
6.1 Range-Based Extension of Triplet Screening. The screening rules
presented in section 4 relate to the problem of a fixed λ. In this section, we
regard a screening rule as a function of λ to derive a range of λs in which
the screening rule is guaranteed to be satisfied. This is particularly useful
for calculating the regularization path for which we need to optimize the
metric for a sequence of λs. If a screening rule is satisfied for a triplet (i, j, l)
in a range (λ_a, λ_b), we can fix the triplet (i, j, l) in L̂ or R̂ as long as λ is in (λ_a, λ_b), without computing the screening rules.
6.1.1 Deriving the Range. Let
$$Q = A + \frac{1}{\lambda}B \tag{6.1}$$
be the general form of the center of a hypersphere for some constant matri-
ces A ∈ Rd×d and B ∈ Rd×d and
$$r^2 = a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2} \tag{6.2}$$
be the general form of the radius for some constants a ∈ R, b ∈ R, and c ∈ R.
The GB, DGB, RPB, and RRPB can be in this form (details are provided in
appendix L, section L.1). Note that in the RRPB, equation 3.1, λ1 is regarded
as λ in the general form and λ_0 is a constant. The condition of the spherical rule ⟨H_ijl, Q⟩ − r‖H_ijl‖_F > 1 in equation 4.1 can be rewritten as
$$\left(\langle H_{ijl}, Q \rangle - 1\right)^2 > r^2\|H_{ijl}\|_F^2$$
with the assumption
$$\langle H_{ijl}, Q \rangle - 1 > 0.$$
Because ⟨H_ijl, Q⟩ = ⟨H_ijl, A⟩ + ⟨H_ijl, B⟩(1/λ), these two inequalities can be transformed into quadratic and linear functions of λ, respectively. The range of λ that satisfies the two inequalities simultaneously represents the range of λ in which a triplet (i, j, l) must be in R*. The following theorem shows the range for the case of RRPB given a reference solution M_0, which is an approximate solution for λ_0:
Theorem 10 (Range-Based Extension of RRPB). Assuming ⟨H_ijl, M_0⟩ − 2 + ‖H_ijl‖_F‖M_0‖_F > 0 and ‖M*_0 − M_0‖_F ≤ ε, a triplet (i, j, l) is guaranteed to be in R* for the following range of λ:
$$\lambda \in (\lambda_a, \lambda_b),$$
where
$$\lambda_a = \frac{\lambda_0\left(\|M_0\|_F\|H_{ijl}\|_F - \langle H_{ijl}, M_0 \rangle + 2\epsilon\|H_{ijl}\|_F\right)}{\langle H_{ijl}, M_0 \rangle - 2 + \|H_{ijl}\|_F\|M_0\|_F},$$
$$\lambda_b = \frac{\lambda_0\left(\|M_0\|_F\|H_{ijl}\|_F + \langle H_{ijl}, M_0 \rangle\right)}{\|H_{ijl}\|_F\|M_0\|_F - \langle H_{ijl}, M_0 \rangle + 2 + 2\epsilon\|H_{ijl}\|_F}.$$
Refer to section L.2 for the proof. The computational procedure for range-
based screening is shown in algorithm 2.
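A direct transcription of theorem 10 as a short sketch (an added illustration; the function name is hypothetical and the matrices are assumed to be formed explicitly):

```python
import numpy as np

def rrpb_range(M0, H, lam0, eps):
    """Interval (lambda_a, lambda_b) of regularization parameters for which a
    triplet with matrix H = H_ijl is guaranteed to stay in R*, given an
    approximate solution M0 for lambda_0 with ||M0* - M0||_F <= eps.
    Returns None when the assumption of theorem 10 does not hold."""
    inner = np.sum(M0 * H)
    nH = np.linalg.norm(H)        # Frobenius norm of H_ijl
    nM = np.linalg.norm(M0)       # Frobenius norm of M0
    denom_a = inner - 2.0 + nH * nM
    if denom_a <= 0.0:            # assumption of theorem 10 violated
        return None
    lam_a = lam0 * (nM * nH - inner + 2.0 * eps * nH) / denom_a
    lam_b = lam0 * (nM * nH + inner) / (nH * nM - inner + 2.0 + 2.0 * eps * nH)
    return lam_a, lam_b
```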
6.1.2 Consideration for Range Extension with Other Bounds. As shown in
equation 3.1, the RRPB is based on the optimality ε for the current λ0, and
does not depend on the optimality for λ1, which is regarded as λ in the gen-
eral form of equations 6.1 and 6.2. Because of this property, the RRPB is
3
0 (M0) − Dλ
particularly suitable to range-based screening among the spheres we de-
rived thus far. To calculate (cid:8), equation 3.2 for the RRPB, the duality
0 (M0) −
gap Pλ
0 (α0, (cid:4)0),
Dλ
= 0 for
for efficient computation, where ˜Dλ
i ∈ ˆR and α
= 1 for i ∈ ˆL are fixed. Because the reduced-size problem shares
i
0 (α0, (cid:4)0) is required. Instead of the original Pλ
0 (M0) − ˜Dλ
0 is the dual objective in which α
0 (α0, (cid:4)0), we can use problems with a reduced size, ˜Pλ
i
Figure 5: (a) Suppose that M* is the optimal solution for λ_0, and some iterative optimization algorithm obtains M_prev in the middle of the optimization process. The circle around M_prev represents the DGB, which contains M*. Then the screening rule can eliminate the triplet (i, j, l) because ⟨M, H_ijl⟩ > 1 holds for any point in the circle. Now suppose that M_0 is the approximate solution we obtain after the optimization algorithm terminates with some small tolerance of the duality gap. The circle with the dashed line represents the region in which the duality gap is less than the tolerance. Although M_0 satisfies the termination condition, the inequality ⟨M, H_ijl⟩ > 1 does not hold for M_0. In this case, we cannot ignore this triplet (i, j, l) to evaluate the duality gap for a different λ ≠ λ_0 because it causes a nonzero penalty. (b) An enlarged bound. Because of the inequality of DGB ‖M* − M_0‖_F ≤ √(2ε/λ), this enlarged region contains any approximate solution with the duality gap ≤ ε.
In the other bounds, the loss term in the primal objective needs to be
carefully considered. Suppose that we have an approximate solution M0 for
λ0 as a reference solution. To regard a bound as a function of λ in the GB and
PGB, it is necessary to consider the gradient for λ (i.e., ∇Pλ(M0)), and the
DGB requires the objective value for λ (i.e., Pλ(M0)). These two terms may
not be correctly calculated if we replace them with the reduced-size primal
created for λ0. Figure 5a illustrates an example of this problem in the case
of DGB. To safely replace Pλ(M0) with the reduced-size primal ˜Pλ(M0) for
these cases, the following conditions need to hold:
$$\langle M_0, H_{ijl} \rangle < 1 - \gamma, \;\; \text{for } \forall (i, j, l) \in \hat{\mathcal{L}}, \qquad \langle M_0, H_{ijl} \rangle > 1, \;\; \text{for } \forall (i, j, l) \in \hat{\mathcal{R}}.$$
If the reference solution M0 is exactly optimal for λ0, these conditions hold.
However, in practice, this cannot be true because of numerical errors, and
furthermore, the optimization algorithm is usually terminated with some
tolerance.
This problem can be avoided by enlarging the radius of spherical bounds
such that the bound contains the approximate solution. Assuming that M0
is an approximate solution with the duality gap ε, then from the DGB, we see that the distance between M_0 and the optimal solution M* satisfies
$$\left\|M^\star - M_0\right\|_F \le \sqrt{\frac{2\epsilon}{\lambda}}.$$
This inequality indicates that by enlarging the radius of the hypersphere by √(2ε/λ), we can guarantee that the bound includes any approximate solution (Figure 5b shows an illustration). Using the radius r introduced in
section 6.1, we obtain the enlarged radius R as follows:
$$R = \underbrace{\sqrt{a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}}}_{r} + \sqrt{\frac{2\epsilon}{\lambda}}. \tag{6.3}$$
The reduced-size problems created by this enlarged radius can be safely
used to evaluate the duality gap for any λ. However, this enlarged radius
no longer has the general form of the radius, equation 6.2, that we assumed to derive the
range. Although it is possible to derive a range even for the enlarged radius
R, the calculation becomes quite complicated, and thus we do not pursue
this direction in this study. (Appendix M shows the computational proce-
dure.) Further, an increase in the radius may decrease the screening rate.
6.2 Screening with Diagonal Constraint. When the matrix M is con-
strained to be a diagonal matrix, metric learning is reduced to feature
weighting in which the Mahalanobis distance, equation 2.1, simply adapts a
weight of each feature without combining different dimensions. Although
correlation in different dimensions is not considered, this simpler formula-
tion is useful to avoid a large computational cost for high-dimensional data
mainly because of the following two reasons:
• The number of variables in the optimization decreases from d2 to d.
• The semidefinite constraint for a diagonal matrix is reduced to the
nonnegative constraint of diagonal elements.
Both properties are also beneficial for efficient screening rule evaluation; in
particular, the second property makes the screening rule with the semidef-
inite constraint easier to evaluate.
The minimization problem of the spherical rule with the semidefinite constraint, equation P2, is simplified as

\min_{x \in \mathbb{R}^d} \; x^\top h_{ijl} \quad \text{s.t.} \quad \|x - q\|_2^2 \le r^2, \; x \ge 0,   (P4)

where h_{ijl} := \mathrm{diag}(H_{ijl}). Let

L(x, \alpha, \beta) := x^\top h_{ijl} - \alpha (r^2 - \|x - q\|_2^2) - \beta^\top x

be the Lagrange function of equation P4, where α ≥ 0 and β ≥ 0 are dual variables. The KKT conditions are written as

\partial L / \partial x = h_{ijl} + 2\alpha (x - q) - \beta = 0,   (6.4a)
\alpha (r^2 - \|x - q\|_2^2) = 0, \quad \beta^\top x = 0,   (6.4b)
\alpha \ge 0, \quad r^2 - \|x - q\|_2^2 \ge 0, \quad \beta \ge 0, \quad x \ge 0.   (6.4c)

We derive the analytical representation of the optimal solution for the cases of α > 0 and α = 0, respectively. For α > 0, the following theorem is obtained.

Theorem 11. If the optimal dual variable satisfies α > 0, the optimal x and β of equation P4 can be written as

x_k = \begin{cases} q_k - h_{ijl,k} / (2\alpha), & \text{if } h_{ijl,k} - 2\alpha q_k \le 0, \\ 0, & \text{otherwise}, \end{cases}   (6.5)

and

\beta = h_{ijl} + 2\alpha (x - q).   (6.6)

Then x also satisfies

\|x - q\|_2^2 = r^2.   (6.7)

For α = 0, the following theorem is obtained.

Theorem 12. If the optimal dual variable satisfies α = 0, the optimal x and β of equation P4 can be written as

x_k = \begin{cases} 0, & \text{if } h_{ijl,k} > 0, \\ \max\{q_k, 0\}, & \text{otherwise}, \end{cases}   (6.8)

and

\beta = h_{ijl}.   (6.9)
The proofs for theorems 11 and 12 are in sections N.1 and N.2, respectively.
Based on the theorems, the optimal solution of equation P4 can be calculated analytically. The details of the procedure are shown in section N.3, which requires O(d^2) computations. Although this procedure obtains the solution by a fixed number of analytical steps, for larger values of d, iterative optimization algorithms can be faster. For example, we can apply the SDLS dual ascent to problem P4, in which each iterative step takes O(d).
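The following is a minimal sketch of solving problem P4 numerically by using theorems 11 and 12. It is not the authors' O(d^2) procedure of section N.3; instead, when α > 0 is required, α is located by bisection on the active-ball condition of equation 6.7. All names are illustrative.

```python
import numpy as np

def solve_p4_diagonal(h, q, r, tol=1e-10):
    """Minimize <h, x> over the ball ||x - q||_2 <= r intersected with x >= 0
    (problem P4 for diagonal M).  Sketch based on theorems 11 and 12."""
    h, q = np.asarray(h, float), np.asarray(q, float)

    # Theorem 12 (alpha = 0): if h >= 0, no descent direction remains and the
    # minimum value is 0, attained at x_k = 0 for h_k > 0 and max(q_k, 0) otherwise.
    if np.all(h >= 0.0):
        return np.where(h > 0.0, 0.0, np.maximum(q, 0.0))

    # Theorem 11 (alpha > 0): closed-form x for a given alpha.
    def x_of(alpha):
        return np.where(h - 2.0 * alpha * q <= 0.0, q - h / (2.0 * alpha), 0.0)

    # Enforce ||x(alpha) - q||^2 = r^2 (equation 6.7); phi is nonincreasing in alpha.
    phi = lambda a: float(np.sum((x_of(a) - q) ** 2) - r ** 2)
    lo, hi = 1e-12, 1.0
    for _ in range(200):          # grow hi until the ball constraint is satisfied
        if phi(hi) <= 0.0:
            break
        hi *= 2.0
    for _ in range(200):          # bisection for the root of phi
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) > 0.0 else (lo, mid)
        if hi - lo < tol * max(1.0, hi):
            break
    return x_of(hi)
```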
6.3 Applicability to More General Formulation. Throughout the letter,
we analyze screening theorems based on the optimization problem defined
by Primal. RTLM is the Frobenius norm-regularized triplet loss minimiza-
tion, which has been shown to be an effective formulation of metric learning
(Schultz & Joachims, 2004; Shen et al., 2014). Further, with slight modifica-
tions, our screening framework can accommodate a wider range of metric
learning methods. Here we redefine the optimization problem as follows:
\min_{M \succeq O} \; \sum_{i} \ell(\langle M, C_i \rangle) + \frac{\lambda}{2}\|M\|_F^2,   (6.10)

where C_i ∈ R^{d×d} is a constant matrix. All our sphere bounds (GB, PGB, DGB, RPB, and RRPB) still hold for this general representation if the loss function ℓ : R → R is convex subdifferentiable. The rules (spherical rule, sphere with semidefinite constraint, and sphere with linear constraint) can also be constructed if the loss function has the form of the hinge-type loss function, equations 2.2 and 2.3, by which standard hinge-, smoothed hinge–, and squared hinge–loss functions are included.
We can incorporate an additional linear term into the objective function, equation 6.10. Defining a pseudo-loss function ℓ̃(x) = −x, we write the primal problem with a linear term as

\min_{M \succeq O} \; P_\lambda(M) := \sum_{ijl} \ell(\langle M, H_{ijl} \rangle) + \tilde{\ell}(\langle M, C \rangle) + \frac{\lambda}{2}\|M\|_F^2,   (6.11)

which can be seen as a special case of equation 6.10 because ℓ̃ is convex subdifferentiable. Suppose that η_{ij} ∈ {0, 1} indicates whether x_j is a target neighbor of x_i, which is a neighbor of x_i having the same label. When we define C := −\sum_{ij} \eta_{ij} (x_i - x_j)(x_i - x_j)^\top and employ the hinge loss, equation 2.2, this formulation is the well-known LMNN (Weinberger & Saul, 2009) with the additional Frobenius norm regularization. Another interpretation of this linear term is the trace norm regularization (Kulis, 2013). For the pseudo-loss term ℓ̃, the derivative is ∇ℓ̃(x) = −1, and the conjugate is ℓ̃^*(−a) = 0 if −a = −1 and ∞ otherwise, where a is a dual variable. Then, by
using the derivation of the dual in appendix A, the dual problem is modified as

\max_{0 \le \alpha \le 1, \, a = 1, \, \Lambda \succeq O} \; D_\lambda(\alpha, \Lambda) := -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\|M_\lambda(\alpha, a, \Lambda)\|_F^2,

where

M_\lambda(\alpha, a, \Lambda) := \frac{1}{\lambda}\Bigl[\sum_{ijl} \alpha_{ijl} H_{ijl} + aC + \Lambda\Bigr].

Because equation 6.11 is a special case of equation 6.10, all spheres can be derived, and we can construct the same screening rules for α_{ijl} with (i, j, l) ∈ T. The only difference is that the dual variable a is not associated with any screening rule because it is fixed to 1 by the dual constraint.
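As a small illustration of this linear term, the sketch below assembles the constant matrix C = −Σ_{ij} η_{ij}(x_i − x_j)(x_i − x_j)^⊤ used in the LMNN-style special case of equation 6.11. It assumes the target-neighbor relation is given as a list of index pairs; the function name is ours.

```python
import numpy as np

def linear_term_C(X, target_pairs):
    """Build C = -sum_{(i,j)} (x_i - x_j)(x_i - x_j)^T over target-neighbor
    pairs (eta_ij = 1).  X is an (n, d) data matrix and `target_pairs` an
    iterable of (i, j) index pairs (an assumption about the input format)."""
    d = X.shape[1]
    C = np.zeros((d, d))
    for i, j in target_pairs:
        diff = X[i] - X[j]
        C -= np.outer(diff, diff)
    return C
```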
About the loss term, pairwise- and quadruplet-loss functions can also be incorporated into our framework. The pairwise approach considers a set of pairs in the same class S and a set of pairs in different classes D. Davis et al. (2007) considered constraints with threshold parameters U and L: d^2_M(x_i, x_j) ≤ U for (i, j) ∈ S and d^2_M(x_i, x_l) ≥ L for (i, l) ∈ D. Let ℓ_t(x) = [t − x]_+ be a hinge loss function with threshold t. By using ℓ_t, the above two constraints can be relaxed to soft constraints that result in

\min_{M \succeq O} \; \sum_{(i,j) \in S} \ell_{-U}(\langle M, -C_{ij} \rangle) + \sum_{(i,l) \in D} \ell_{L}(\langle M, C_{il} \rangle) + \frac{\lambda}{2}\|M\|_F^2.

Because of the threshold parameters, the second term of the dual problem, Dual2, changes from \alpha^\top \mathbf{1} to \alpha^\top t, where t := [L, \ldots, L, -U, \ldots, -U]^\top \in \mathbb{R}^{|D|+|S|}. Our bounds still hold because ℓ_t is convex subdifferentiable, and screening rules are formulated as evaluating whether the inner product \langle M, C_{il} \rangle (or \langle M, -C_{ij} \rangle) is larger or smaller than the threshold t.
Law et al. (2013) proposed a loss function based on a quadruplet of instances. The basic idea is to compare pairs of dissimilarity d^2_M(x_i, x_j) and d^2_M(x_k, x_l). For example, when (k, l) should be more dissimilar than (i, j), the loss is defined as ℓ(d^2_M(x_k, x_l) − d^2_M(x_i, x_j)). They define the following optimization problem,

\min_{M \succeq O} \; \sum_{(i,j,k,l) \in Q} \ell(\langle M, C_{ijkl} \rangle) + \frac{\lambda}{2}\|M\|_F^2,
where C_{ijkl} = (x_k - x_l)(x_k - x_l)^\top - (x_i - x_j)(x_i - x_j)^\top and Q is a set of quadruplets. This is also a special case of equation 6.10.
Table 2: Summary of Data Sets.

                   #dimension   #sample   #classes    k     #triplet     λmax      λmin
Iris                     4          150        3      ∞      546,668    1.3e+7    2.3e+1
Wine                    13          178        3      ∞      910,224    2.0e+7    5.1e+1
Segment                 19         2310        7     20      832,000    2.5e+6    4.2e+0
Satimage                36         4435        6     15      898,200    1.0e+7    8.8e+0
Phishing                68       11,055        2      7      487,550    5.0e+3    2.0e-1
SensIT Vehicle         100       78,823        3      3      638,469    1.0e+4    2.9e+0
a9a                     16^a     32,561        2      5      732,625    1.2e+5    3.1e+2
Mnist                   32^a     60,000       10      5    1,350,025    7.0e+3    9.6e-1
Cifar10                200^a     50,000       10      2      180,004    2.0e+3    3.3e+1
Rcv1.multiclass        200^b     15,564       53      3      126,018    3.0e+2    6.0e-4

Note: #triplet and λmin are the average values over the subsampled random trials.
a: The dimension was reduced by AutoEncoder.
b: The dimension was reduced by PCA.
Note that pairwise-, triplet- and quadruplet-loss functions can be used
simultaneously, and safe screening can be applied to remove any of those
loss terms.
7 Experiment
We evaluate the performance of safe triplet screening using the benchmark
data sets listed in Table 2, which are from LIBSVM (Chang & Lin, 2011) and
the Keras data set (Chollet et al., 2015). We create a set of triplets by follow-
ing the approach by Shen et al. (2014), in which k neighborhoods in the same
class x j and k neighborhoods in a different class xl are sampled for each xi.
We employed the regularization path setting in which RTLM is optimized
for a sequence of λ0, λ1, . . . , λT . To determine λ0 = λmax, from a sufficiently
large λ in which R is empty, we gradually reduced λ by multiplying it by
0.9 and started the regularization path calculation from λ in which R is not
empty. To generate the next value of λ, we used λt = 0.9λt−1, and the path
terminated when the following condition is satisfied:
loss(λt−1) − loss(λt )
loss(λt−1)
×
λt−1
λt−1 − λt
< 0.01,
where loss(λt ) is the loss function value at λt. We randomly selected 90%
of the instances of each data set five times, and the average is shown as
the experimental result. As a base optimizer, we employed the projected
gradient descent of the primal problem, and the iteration terminated when
the duality gap became less than 10−6. For the loss function (cid:2), we used the
smoothed hinge loss of γ = 0.05. (We also provide results for the hinge loss
in section 7.4.1). We performed safe triplet screening after every 10 iterative
cycles of the gradient descent. We refer to the first screening for a specific λt, in which the solution of the previous λt−1 is used as the reference solution, as regularization path screening. The screening performed during the op-
timization process (after regularization path screening) is termed dynamic
screening. We performed both of these screening procedures for all exper-
iments. As a baseline, we refer to the RTLM optimization without screen-
ing as naive optimization. We initialized with M = O at λ0 because M = O
is the optimal solution when λ → ∞. When the regularization coefficient
changes, M starts from the previous solution ˆM (warm start). The step size
of the gradient descent was determined by
(cid:29)
(cid:29)
(cid:29)
(cid:29)
1
2
(cid:5)(cid:12)M, (cid:12)G(cid:6)
(cid:5)(cid:12)G, (cid:12)G(cid:6)
+
(cid:5)(cid:12)M, (cid:12)M(cid:6)
(cid:5)(cid:12)M, (cid:12)G(cid:6)
(cid:29)
(cid:29)
(cid:29)
(cid:29) ,
where (cid:12)M = Mt − Mt−1, (cid:12)G = ∇Pλ(Mt ) − ∇Pλ(Mt−1) (Barzilai & Borwein,
1988). In SDLS dual ascent, we used the conjugate gradient method (Yang,
1993) to find the minimum eigenvalue.
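A minimal sketch of this experimental protocol is given below: the Barzilai-Borwein-style step size just stated and the regularization path with the stopping criterion above. The callables `loss_at` and `solve_at` (assumed to warm-start from the previous solution and to stop at duality gap 10^-6) are placeholders for the optimizer, not the authors' code.

```python
import numpy as np

def bb_step_size(M, M_prev, G, G_prev):
    """Step size 0.5 * | <dM,dG>/<dG,dG> + <dM,dM>/<dM,dG> | (Barzilai & Borwein, 1988)."""
    dM, dG = M - M_prev, G - G_prev
    return 0.5 * abs(np.sum(dM * dG) / np.sum(dG * dG)
                     + np.sum(dM * dM) / np.sum(dM * dG))

def regularization_path(lam_max, loss_at, solve_at, ratio=0.9, tol=0.01):
    """Generate lam_t = 0.9 * lam_{t-1} and stop when the relative loss change,
    scaled by lam_{t-1} / (lam_{t-1} - lam_t), falls below 0.01."""
    lams, lam_prev = [lam_max], lam_max
    solve_at(lam_prev)
    while True:
        lam = ratio * lam_prev
        solve_at(lam)                      # warm-started solve for the new lambda
        change = (loss_at(lam_prev) - loss_at(lam)) / loss_at(lam_prev)
        if change * lam_prev / (lam_prev - lam) < tol:
            break
        lams.append(lam)
        lam_prev = lam
    return lams
```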
7.1 Comparing Rules. We first validate the screening performance
(screening rate and CPU time) of each screening rule introduced in section 4
by using algorithm 2 without the range-based screening process. Here, the screening rate is defined by #screened triplets / |{(i, j, l) ∈ T : ⟨H_{ijl}, M̂⟩ > 1 or ⟨H_{ijl}, M̂⟩ < 1 − γ}|, where M̂ is the solution after convergence.
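To make the quantity being measured concrete, the sketch below evaluates a spherical rule for a list of triplet matrices H_ijl given a sphere center Q and radius R, and then computes the screening rate just defined. The upper-side test follows the rule stated in appendix M; the lower-side test is its symmetric counterpart implied by the categorization of appendix A. Names and the list-of-matrices input format are assumptions.

```python
import numpy as np

def spherical_rule(H_list, Q, R, gamma):
    """If the whole sphere {M : ||M - Q||_F <= R} lies on one side of the hinge
    thresholds, the triplet's dual variable is fixed and it can be removed."""
    L_hat, R_hat, remaining = [], [], []
    for idx, H in enumerate(H_list):
        ip = float(np.sum(H * Q))           # <H_ijl, Q>
        nrm = float(np.linalg.norm(H))      # ||H_ijl||_F
        if ip + R * nrm < 1.0 - gamma:
            L_hat.append(idx)               # alpha_ijl = 1 at the optimum
        elif ip - R * nrm > 1.0:
            R_hat.append(idx)               # alpha_ijl = 0 at the optimum
        else:
            remaining.append(idx)
    return L_hat, R_hat, remaining

def screening_rate(n_screened, H_list, M_hat, gamma):
    """#screened / #{triplets with <H, M_hat> > 1 or < 1 - gamma}."""
    denom = sum(1 for H in H_list
                if np.sum(H * M_hat) > 1.0 or np.sum(H * M_hat) < 1.0 - gamma)
    return n_screened / denom
```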
7.1.1 GB-Based Rules. Here we use the GB and PGB as spheres, and we
observe the effect of the semidefinite constraint in the rules. As a repre-
sentative result, Figure 6a compares the performance of the rules by using
segment data.
First, except for the GB, the rules maintain a high screening rate for the
entire regularization path, as shown in the top left plot. Note that this rate is
only for regularization path screening, meaning that dynamic screening can
further increase the screening rate during the optimization, as discussed in
section 7.1.2. The bottom left plot of the same figure shows that PGB and
GB+Linear are the most efficient and achieved CPU times approximately 2
to 10 times faster than the naive optimization. The screening rate of the GB
was severely reduced along the latter half of the regularization path. As
illustrated in Figure 2a, the center of the GB can be external to the semidef-
inite cone by which the sphere of GB contains a larger proportion of the re-
gion violating the constraint M (cid:8) O, compared with the spheres with their
center inside the semidefinite cone. This causes performance deterioration
particularly for smaller values of λ, because the minimum of the loss term
is usually outside the semidefinite cone.
Figure 6: Comparison of screening rules on the segment data set. For both pan-
els a and b, the plots are aligned as follows. (Top left) Performance of regulariza-
tion path screening. (Bottom left) Ratio of CPU time compared with the naive
optimization for each λ. (Right) Enlargement of the upper left plot for the range
− log10(λ) ∈ [−1, −0.6].
The screening rates of GB+Linear and GB+Semidefinite are slightly
higher than that of the PGB (the plot on the right), which can be seen
from their geometrical relation illustrated in Figure 2a. GB+Semidefinite
achieved the highest screening rate, but eigenvalue decomposition is
necessary to repeatedly perform the calculation in SDLS, which resulted
in the CPU time increasing along the latter half of the path. Although
PGB+Semidefinite is also tighter than PGB, the CPU time increased from
approximately − log10(λ) ≈ −4 to −3. Because the center of PGB is positive
semidefinite, only the minimum eigenvalue is required (see section 4.2), but
it can still increase the CPU time.
Among the screening methods compared here, our empirical analysis
suggests that the use of the spherical rule with PGB, in which a semidef-
inite constraint is implicitly incorporated in the projection process, is the
most cost-effective. We did not observe that the other approach to consid-
ering the semidefinite (or relaxed linear) constraint in the rule substantially
outperforms PGB in terms of CPU time despite its high screening rate. We
observed the same tendency for DGB. The screening rate did not change
markedly even if the semidefinite constraint was explicitly considered.
7.1.2 DGB-Based Rules. Next, by using the DGB, we compared the per-
formance of the three rules presented in section 4. Figure 6b shows the re-
sults, which are similar to those obtained for the GB, shown in Figure 6a.
The semidefinite and the linear constraint slightly improve the rate. How-
ever, the large computational cost for screening with the semidefinite con-
straint caused the overall CPU time to increase. Therefore, although the
linear constraint is much easier to evaluate, the CPU time was almost the
same as that required for the DGB because of the slight improvement in the
screening rate.
7.2 Comparing Bounds. Here we compare the screening performance
(screening rate and CPU time) of each bound introduced in section 3 by
using algorithm 2 without the range-based screening process. We do not
use RPB because it needs the strictly optimal previous solution.
Based on the results in the previous section, we employed the spherical rule. The result obtained for the phishing data set is shown in Figure 7. The screening rate of the GB (top right) again decreased from the middle of the horizontal axis compared with the other spheres. The other spheres also have lower screening rates for small values of λ. As mentioned in section 6.1, the radii of GB, DGB, RPB, and RRPB have the form r^2 = a + b(1/λ) + c(1/λ^2), meaning that if λ → 0, then r → ∞. In the case of the PGB,
although the dependency on λ cannot be written explicitly, the same ten-
dency was observed. We see that the PGB and RRPB have similar results
as suggested by theorem 7, and the screening rate of the DGB is lower than
that of the RRPB, as suggested by theorem 8. A comparison of the PGB and
RRPB indicated that the former achieved a higher screening rate, but the
latter is more efficient with less CPU time, as shown in the plot at the bot-
tom right, because the PGB requires a matrix inner product calculation for
each triplet. Bounds other than the GB are more than twice as fast as the
naive calculation for most values of λ.
Figure 7: Comparison of spherical bounds on the phishing data set. The heat
maps on the left show the dynamic screening rate. The vertical axes of these
heat maps represent the number of iterative steps required for the optimization
divided by 10 to perform the screening. (Top right) Rate of regularization path
screening. (Bottom right) Ratio of CPU time compared with naive optimization.
A comparison of the dynamic screening rate (the three plots on the left
in Figure 7) of PGB and RRPB shows that the rate of PGB is higher. In terms
of the regularization path screening (top right), RRPB and PGB have simi-
lar screening rates, but PGB has a higher dynamic screening rate. Along the
latter half of the regularization path, the number of gradient descent iter-
ations increases; consequently, the dynamic screening significantly affects
the CPU time, and the PGB becomes faster despite the additional computa-
tion it requires to compute the inner product.
We further evaluate the performance of the range-based extension de-
scribed in section 6.1. Figure 8 shows the rate of the range-based screening
for the segment data set. The figure shows that a wide range of λ can be
screened, particularly for small values of λ; although the range is smaller
for large values of λ, than for the small values, a high screening rate is ob-
served when λ approaches λ
0. A significant advantage of this approach is
that for those triplets screened by using the specified range, we no longer
need to evaluate the screening rule as long as λ is within the range.
The total CPU time for the regularization path is shown in Figure 9. In
addition to GB, PGB, DGB, and RRPB, we further evaluate the performance
when PGB and RRPB are used simultaneously. The use of two rules can
improve the screening rate; however, additional computations are required
to evaluate the rule. In the figure, for four out of six data sets, the PGB+RRPB
combination requires the least CPU time.
Figure 8: Screening rate of range-based screening on the segment data set. The
color indicates the screening rate for λ on the vertical axis based on the reference
solution using λ
0 on the horizontal axis. The accuracy of the reference solution
is 10^{-4} and 10^{-6} for the plots on the left and right, respectively.
Figure 9: Total CPU time of regularization path (seconds). The term RRPB+PGB
indicates that the spherical rules are performed with these two spheres. “Base-
line” indicates the computational time without screening. The bars show the
total time and the proportion of the optimization algorithm and the screening
calculation.
7.3 Evaluating the Practical Efficiency. We next considered a compu-
tationally more expensive setting to evaluate the effectiveness of the safe
screening approach in a practical situation. To investigate the regularization
path more precisely, we set a finer grid of regularization parameters defined
as λt = 0.99λt−1. We also incorporated the well-known active set heuristics
to conduct our experiments on larger data sets. Note that because of the
Table 3: Evaluation of the Total CPU Time (Seconds) with the Active Set Method.

Method\Data Set        phishing     SensIT      a9a     mnist    cifar10       rcv
ActiveSet                7989.5    16,352.1    758.7    3788.1    11085.7   94996.3
ActiveSet+RRPB           2126.2      3555.6     70.1     871.1     1431.3   43174.9
ActiveSet+RRPB+PGB       2133.2      3046.9     72.1     897.9     1279.7   38231.1
Note: Results in bold indicate the fastest method.
above differences, the computational time shown here cannot be directly
compared with the results in sections 7.1 and 7.2. The active set method
uses only a subset of triplets of which the loss is greater than 0 as the active
set. The gradient is calculated by using only the active set, and the overall
optimality is confirmed when the iteration converges. We employed the ac-
tive set update strategy shown by Weinberger and Saul (2009), in which the
active set is updated once every ten iterative cycles.
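The following is a minimal sketch of the active-set selection just described: only triplets whose (smoothed) hinge loss is positive under the current M, that is, ⟨H_ijl, M⟩ < 1, are retained for the gradient computation, and the subset is refreshed every ten iterations. The function name and the list-of-matrices input are assumptions.

```python
import numpy as np

def update_active_set(H_list, M):
    """Indices of triplets with positive loss under the current M
    (<H_ijl, M> < 1); the gradient is then evaluated on this subset only."""
    return [idx for idx, H in enumerate(H_list) if float(np.sum(H * M)) < 1.0]
```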
Table 3 compares the CPU time for the entire regularization path. Based
on the results in the previous section, we employed RRPB and RRPB+PGB
(evaluating rules based on both spheres) for triplet screening. Further, the
range-based screening described in section 6.1 is also performed using
RRPB, for which we evaluate the range at the beginning of the optimization
for each λ, as shown in algorithm 2. Our safe triplet screening accelerates
the optimization process by up to 10 times compared to the simple active
set method. The results for higher-dimensional data sets with a diagonal M
are presented in section 7.4.1.
7.4 Empirical Evaluation of Three Special Cases. Here we evaluate
three special cases of our formulation: nonsmoothed hinge loss, the Maha-
lanobis distance with a diagonal matrix, and dynamic screening for a certain
value of λ.
7.4.1 Nonsmoothed Hinge Loss. In previous experiments, we used the
smoothed hinge loss function with γ = 0.05. However, the hinge loss function (γ = 0) is also widely used. Figure 10 shows the screening result of the PGB
spherical rule for the segment data. Here, the loss function of RTLM is the
hinge loss function, and the other settings are the same as those of the exper-
iments in the main text. The results show that PGB achieved a high screen-
ing rate and that the CPU time substantially improved.
7.4.2 Learning with Higher-Dimensional Data Using Diagonal Matrix. Here
we evaluate the screening performance when the matrix M is confined to
being a diagonal matrix. Based on the same setting as section 7.3, com-
parison with the ActiveSet method is shown in Table 4. We used RRPB
and RRPB+PGB, both of which largely reduced the CPU time. Attempts to
Figure 10: Performance evaluation of PGB for the hinge loss setting. The heat
map on the left shows the dynamic screening rate with the vertical axis showing
the number of iterative cycles for optimization divided by 10 at which screening
is performed. (Top right) Rate of regularization path screening. (Bottom right)
Ratio of CPU time compared with the naive optimization.
Table 4: Total Time (Seconds) of the Regularization Path for Diagonal M.
Method\Data Set          USPS      Madelon    Colon-Cancer     Gisette
ActiveSet              2485.5       7005.8        3149.8            --
ActiveSet+RRPB          326.7        593.4         632.2     133,870.0
ActiveSet+RRPB+PGB      336.6        562.4         628.2     127,123.8
#dimension                256          500          2000          5000
#samples                 7291         2000            62          6000
#triplet              656,200      720,400        38,696     1,215,225
k                          10           20             ∞            15
λmax                   1.0e+7      2.0e+14        5.0e+7        4.5e+8
λmin                   1.9e+3      4.7e+11        7.0e+3        2.1e+3
Notes: The results in bold indicate the fastest method. The Gisette data set
did not produce results by ActiveSet because of the time limitation.
process the Gisette data set, which has the largest dimension, 5,000, with
the active set method were unsuccessful and the method did not terminate
even after 250,000 s.
7.4.3 Dynamic Screening for Fixed λ. Here, we evaluate the performance
of dynamic screening for a fixed λ. For λ, we used λmin in Table 2 for
which the screening rate was relatively low in our results thus far (e.g., see
/
l
Figure 11: Evaluation of the computational time for dynamic screening. The
computational time required for dynamic screening (a) without the active
set and (b) with the active set. “Baseline” indicates the results obtained for the
naive method without screening and the active set strategy. The bars show the
total time and the proportion of the optimization algorithm and the screening
calculation.
Figure 6a). Figure 11 compares the computational time of the naive ap-
proach without screening and with the dynamic screening shown in algo-
rithm 1. The plots in Figure 11a show that dynamic screening accelerates the
learning process. The plots in Figure 11b show the performance of the ac-
tive set strategy, indicating that the combination of dynamic screening and
the active set strategy is effective for further acceleration.
7.5 Effect of Number of Triplets on Prediction Accuracy. Finally, we
examine the relation between the number of triplets contained in T and the
prediction accuracy of the classification. We employed the nearest-neighbor
(NN) classifier to measure the prediction performance of the learned met-
ric. The data set was randomly divided into training data (60%), validation
data (20%), and test data (20%). The regularization parameter λ changed
from 10^5 to 0.1 and was chosen by minimizing the validation error. The ex-
periment was performed 10 times by randomly partitioning the data set in
different ways.
The results are shown in Figure 12, which summarizes the CPU time and
test error rate for different settings of the number of triplets. The horizontal
/
Figure 12: CPU time (seconds) and test error rate on the phishing data set.
axes in all four plots, a to d, represent the number of neighbors k used to
define the original triplet set T as described at the beginning of section 7.
Figure 12a shows the CPU time to calculate the entire regularization path
with and without screening. Here “Without Screening” indicates the Ac-
tiveSet approach, and “With Screening” indicates the ActiveSet+RRPB ap-
proach. These results show that the learning time increases as k increases,
and safe triplet screening shows larger decreases in the CPU time for larger
values of k. Figures 12b to 12d show the test error rates, each calculated by
10 NN, 20 NN, and 30 NN classifiers, respectively. In Figure 12b, the 10 NN
test error is minimized at k = 6, with screening requiring less than approxi-
mately 2,000 seconds, whereas the naive approach (Without Screening) can
calculate only approximately k = 4 in the same computational time. In Fig-
ure 12c, the 20 NN test error is minimized at k = 12, with screening requir-
ing approximately 4000 seconds, whereas the naive approach can calculate
only approximately k = 8. In Figure 12d, the 30 NN test error is minimized
at k = 15, with screening requiring approximately 5000 seconds, whereas
the naïve approach can calculate only approximately k = 9. These results
indicate that the number of neighbors, k, significantly affects the prediction
accuracy, and sufficiently large k is often necessary to achieve the best pre-
diction performance.
8 Conclusion
We introduced safe triplet screening for large-margin metric learning. Three
screening rules and six spherical bounds were derived, and the relations
among them were analyzed. We further proposed a range-based extension
for the regularization path calculation. Our screening technique for metric
learning is particularly significant compared with other screening studies
because of the large number of triplets and the semidefinite constraint. Our
numerical experiments verified the effectiveness of safe triplet screening
using several benchmark data sets.
Appendix A: Dual Formulation
To derive the dual problem, we first rewrite the primal problem as
\min_{M, t} \; \sum_{ijl} \ell(t_{ijl}) + \lambda R(M) \quad \text{s.t.} \quad M \succeq O, \; t_{ijl} = \langle M, H_{ijl} \rangle,
where t is a |T|-dimensional vector that contains all t_{ijl} for (i, j, l) ∈ T and

R(M) = \frac{1}{2}\|M\|_F^2.   (A.1)

The Lagrange function is

L(M, t, \alpha, \Lambda) := \sum_{ijl} \ell(t_{ijl}) + \lambda R(M) + \sum_{ijl} \alpha_{ijl} (t_{ijl} - \langle M, H_{ijl} \rangle) - \langle M, \Lambda \rangle,

where \alpha \in \mathbb{R}^{|T|} and \Lambda \in \mathbb{R}^{d \times d}_{+} are Lagrange multipliers. Let

\ell^*(-\alpha_{ijl}) := \sup_{t_{ijl}} \{ (-\alpha_{ijl}) t_{ijl} - \ell(t_{ijl}) \},   (A.2)
R^*(M_\lambda(\alpha, \Lambda)) := \sup_{M} \{ \langle M_\lambda(\alpha, \Lambda), M \rangle - R(M) \},   (A.3)

be convex conjugate functions (Boyd & Vandenberghe, 2004) of \ell and R, where

M_\lambda(\alpha, \Lambda) := \frac{1}{\lambda} \Bigl[ \sum_{ijl} \alpha_{ijl} H_{ijl} + \Lambda \Bigr].   (A.4)
Then the dual function is written as

D_\lambda(\alpha, \Lambda) := \inf_{M, t} L(M, t, \alpha, \Lambda)
= -\sum_{ijl} \sup_{t_{ijl}} \{ (-\alpha_{ijl}) t_{ijl} - \ell(t_{ijl}) \} - \lambda \sup_{M} \{ \langle M_\lambda(\alpha, \Lambda), M \rangle - R(M) \}
= -\sum_{ijl} \ell^*(-\alpha_{ijl}) - \lambda R^*(M_\lambda(\alpha, \Lambda)).

From the Karush-Kuhn-Tucker (KKT) condition, we obtain

\nabla_M L = \lambda \nabla R(M) - \lambda M_\lambda(\alpha, \Lambda) = O,   (A.5a)
\nabla_{t_{ijl}} L = \nabla \ell(t_{ijl}) + \alpha_{ijl} = 0,   (A.5b)
\Lambda \succeq O, \; M \succeq O, \; \langle M, H_{ijl} \rangle = t_{ijl}, \; \langle M, \Lambda \rangle = 0,   (A.5c)

where, in the case of hinge loss,

\nabla \ell(x) = \begin{cases} 0, & x > 1, \\ -c, & x = 1, \\ -1, & x < 1, \end{cases}

where ∀c ∈ [0, 1], and in the case of smoothed hinge loss,

\nabla \ell(x) = \begin{cases} 0, & x > 1, \\ -\frac{1}{\gamma}(1 - x), & 1 - \gamma \le x \le 1, \\ -1, & x < 1 - \gamma. \end{cases}

From these two equations and equation A.5b, we see that 0 ≤ α ≤ 1. Substituting equation A.5b into equation A.2 and considering the above constraint, the conjugate of the loss function \ell can be transformed into

\ell^*(-\alpha_{ijl}) = \frac{\gamma}{2} \alpha_{ijl}^2 - \alpha_{ijl}.

Note that this equation holds for the cases of both hinge loss (by setting γ = 0) and smoothed hinge loss (γ > 0). Substituting equation A.5a into A.3, the conjugate of the regularization term R is written as

R^*(M_\lambda(\alpha, \Lambda)) = R(M_\lambda(\alpha, \Lambda)) = \frac{1}{2} \|M_\lambda(\alpha, \Lambda)\|_F^2.
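As a quick numerical check of the loss conjugate just derived, the sketch below evaluates sup_t {(-α)t - ℓ(t)} on a grid and compares it with (γ/2)α^2 - α. The smoothed hinge used here is the standard form consistent with the gradient above; treating it as equation 2.3 is an assumption, since that equation is stated earlier in the letter.

```python
import numpy as np

def smoothed_hinge(t, gamma):
    """0 for t > 1, quadratic on [1 - gamma, 1], linear for t < 1 - gamma."""
    if t > 1.0:
        return 0.0
    if t >= 1.0 - gamma:
        return (1.0 - t) ** 2 / (2.0 * gamma)
    return 1.0 - t - gamma / 2.0

def conjugate_numeric(alpha, gamma, grid=np.linspace(-5.0, 5.0, 20001)):
    """Brute-force sup_t {(-alpha) t - loss(t)} over a grid."""
    return max((-alpha) * t - smoothed_hinge(t, gamma) for t in grid)

gamma = 0.05
for alpha in (0.0, 0.3, 1.0):
    # The two printed values should agree up to the grid resolution.
    print(conjugate_numeric(alpha, gamma), (gamma / 2.0) * alpha ** 2 - alpha)
```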
Therefore, the dual problem is

\max_{0 \le \alpha \le 1, \, \Lambda \succeq O} \; D_\lambda(\alpha, \Lambda) = -\sum_{ijl} \ell^*(-\alpha_{ijl}) - \frac{\lambda}{2} \|M_\lambda(\alpha, \Lambda)\|_F^2.   (Dual1)

Because the second term, \max_{\Lambda \succeq O} -\frac{1}{2}\|M_\lambda(\alpha, \Lambda)\|_F^2, is equivalent to the projection onto a semidefinite cone (Boyd & Xiao, 2005; Malick, 2004), the above problem (Dual1) can be simplified as

\max_{0 \le \alpha \le 1} \; D_\lambda(\alpha) := -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\|M_\lambda(\alpha)\|_F^2,   (Dual2)

where

M_\lambda(\alpha) := \frac{1}{\lambda} \Bigl[ \sum_{ijl} \alpha_{ijl} H_{ijl} \Bigr]_{+}.
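The operator [·]_+ appearing in M_λ(α) is the projection onto the positive semidefinite cone. A minimal sketch of this projection via eigendecomposition, together with the assembly of M_λ(α), is given below; the matrix list and function names are illustrative.

```python
import numpy as np

def psd_projection(A):
    """[A]_+ : project a symmetric matrix onto the PSD cone by
    clipping negative eigenvalues at zero."""
    A = 0.5 * (A + A.T)                      # symmetrize for numerical safety
    w, V = np.linalg.eigh(A)
    return (V * np.clip(w, 0.0, None)) @ V.T

def M_lambda(alpha, H_list, lam):
    """M_lambda(alpha) = (1/lam) [ sum_ijl alpha_ijl H_ijl ]_+  (Dual2)."""
    S = sum(a * H for a, H in zip(alpha, H_list))
    return psd_projection(S) / lam
```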
For the optimal M^\star, each triplet in T can be categorized into the following three groups:

\mathcal{L}^\star := \{ (i,j,l) \in T \mid \langle H_{ijl}, M^\star \rangle < 1 - \gamma \},
\mathcal{C}^\star := \{ (i,j,l) \in T \mid 1 - \gamma \le \langle H_{ijl}, M^\star \rangle \le 1 \},   (A.6)
\mathcal{R}^\star := \{ (i,j,l) \in T \mid \langle H_{ijl}, M^\star \rangle > 1 \}.

Based on equations A.5b and A.5c, it becomes clear that \alpha^\star_{ijl} = -\nabla \ell(\langle M^\star, H_{ijl} \rangle), by which the following rules are obtained:

(i,j,l) \in \mathcal{L}^\star \Rightarrow \alpha^\star_{ijl} = 1,
(i,j,l) \in \mathcal{C}^\star \Rightarrow \alpha^\star_{ijl} \in [0, 1],   (A.7)
(i,j,l) \in \mathcal{R}^\star \Rightarrow \alpha^\star_{ijl} = 0.
Appendix B: Proof of Lemma 1

The reduced-size problem can be represented by
\min_{M, t} \; \sum_{(i,j,l) \in \tilde{T}} \ell(t_{ijl}) + \sum_{(i,j,l) \in \hat{L}} \Bigl( 1 - \frac{\gamma}{2} - t_{ijl} \Bigr) + \frac{\lambda}{2}\|M\|_F^2
\quad \text{s.t.} \quad t_{ijl} = \langle M, H_{ijl} \rangle, \; (i,j,l) \in T, \quad M \succeq O.
Then the Lagrangian is

\tilde{L}(M, t, \alpha, \Lambda) = \sum_{(i,j,l) \in \tilde{T}} \ell(t_{ijl}) + \sum_{(i,j,l) \in \hat{L}} \Bigl( 1 - \frac{\gamma}{2} - t_{ijl} \Bigr) + \frac{\lambda}{2}\|M\|_F^2 + \sum_{(i,j,l) \in T} \alpha_{ijl} (t_{ijl} - \langle M, H_{ijl} \rangle) - \langle M, \Lambda \rangle.   (B.1)

The dual function is written as

\tilde{D}_\lambda(\alpha, \Lambda) := \inf_{M, t} \tilde{L}(M, t, \alpha, \Lambda)
= -\sum_{(i,j,l) \in \tilde{T}} \sup_{t_{ijl}} \{ (-\alpha_{ijl}) t_{ijl} - \ell(t_{ijl}) \} - \sum_{(i,j,l) \in \hat{L}} \sup_{t_{ijl}} \{ (1 - \alpha_{ijl}) t_{ijl} \} + \Bigl( 1 - \frac{\gamma}{2} \Bigr)|\hat{L}|
\quad - \sum_{(i,j,l) \in \hat{R}} \sup_{t_{ijl}} \{ (-\alpha_{ijl}) t_{ijl} \} - \lambda \sup_{M} \{ \langle M_\lambda(\alpha, \Lambda), M \rangle - R(M) \},
where R(M) and M_\lambda(\alpha, \Lambda) are defined by equations A.1 and A.4, respectively. Based on the second and third terms of the previous equation, we see

\alpha_{ijl} = 0, \quad \forall (i,j,l) \in \hat{R},   (B.2)
\alpha_{ijl} = 1, \quad \forall (i,j,l) \in \hat{L},   (B.3)

which prevent \tilde{D}_\lambda from diverging to -\infty. Then constraints B.2 and B.3 enable us to further transform the dual objective into
\tilde{D}_\lambda(\alpha, \Lambda) = -\sum_{(i,j,l) \in \tilde{T}} \ell^*(-\alpha_{ijl}) + \Bigl( 1 - \frac{\gamma}{2} \Bigr)|\hat{L}| - \frac{\lambda}{2}\|M_\lambda(\alpha, \Lambda)\|_F^2
= -\frac{\gamma}{2}\|\alpha_{\tilde{T}}\|_2^2 + \alpha_{\tilde{T}}^\top \mathbf{1} + \Bigl( 1 - \frac{\gamma}{2} \Bigr)|\hat{L}| - \frac{\lambda}{2}\|M_\lambda(\alpha, \Lambda)\|_F^2
= -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\|M_\lambda(\alpha, \Lambda)\|_F^2.

Thus, the dual problem is written as

\max_{0 \le \alpha \le 1, \, \Lambda \succeq O} \; D_\lambda(\alpha, \Lambda) = -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\|M_\lambda(\alpha, \Lambda)\|_F^2
\quad \text{s.t.} \quad \alpha_{\hat{L}} = \mathbf{1}, \; \alpha_{\hat{R}} = \mathbf{0}.   (B.4)
This is the same optimization problem as Dual1 except that \alpha_{\hat{L}} and \alpha_{\hat{R}} are fixed as the optimal value in Dual1. This obviously indicates that problems B.4 and Dual1 have the same optimal solution. Given the optimal dual variables \alpha^\star and \Lambda^\star, the optimal primal M^\star can be derived by

M = \frac{1}{\lambda} \Bigl[ \sum_{(i,j,l) \in T} \alpha_{ijl} H_{ijl} + \Lambda \Bigr],   (B.5)

which is from \nabla_M \tilde{L} = 0. Because equation B.5 is exactly the same transformation as equation 2.4, the same optimal primal M^\star must be obtained. □
Appendix C: Proof of Theorem 1 (GB)
The following theorem is a well-known optimality condition for the general
convex optimization problem:
Theorem 12 (Optimality Condition of Convex Optimization, Bertsekas, 1999). In the minimization problem \min_{x \in F} f(x), where the feasible region F and the function f(x) are convex, the necessary and sufficient condition that x^\star is the optimal solution is

\exists \nabla f(x^\star) \in \partial f(x^\star) \quad \bigl\{ \nabla f(x^\star)^\top (x^\star - x) \le 0, \; \forall x \in F \bigr\},

where \partial f(x^\star) represents the set of subgradients at x^\star.

From theorem 12, the following holds for the optimal solution M^\star:

\langle \nabla P_\lambda(M^\star), M^\star - M \rangle \le 0, \quad \forall M \succeq O.   (C.1)

Let \Phi_{ijl}(M) be a subgradient of the loss function \ell(\langle M, H_{ijl} \rangle) at M. Then \nabla P_\lambda(M) is written as

\nabla P_\lambda(M) = \sum_{ijl} \Phi_{ijl}(M) + \lambda M.   (C.2)

From the convexity of the (smoothed) hinge loss function \ell(\langle M, H_{ijl} \rangle), we obtain

\ell(\langle M^\star, H_{ijl} \rangle) \ge \ell(\langle M, H_{ijl} \rangle) + \langle \Phi_{ijl}(M), M^\star - M \rangle,
\ell(\langle M, H_{ijl} \rangle) \ge \ell(\langle M^\star, H_{ijl} \rangle) + \langle \Phi_{ijl}(M^\star), M - M^\star \rangle,

for any subgradient. The addition of these two equations shows that

\langle \Phi_{ijl}(M^\star), M^\star - M \rangle \ge \langle \Phi_{ijl}(M), M^\star - M \rangle.   (C.3)
Combining equations C.1 to C.3 results in

\Bigl\langle \sum_{ijl} \Phi_{ijl}(M) + \lambda M^\star, \; M^\star - M \Bigr\rangle \le 0
\Leftrightarrow \langle \nabla P_\lambda(M) - \lambda M + \lambda M^\star, \; M^\star - M \rangle \le 0.

By transforming this inequality based on completing the square, we obtain GB. □
Appendix D: Proof of Theorem 2 (PGB)
Let QGB be the center of the GB hypersphere and rGB be the radius. The
optimal solution exists in the following set:
\{ X \mid \|X - Q_{\mathrm{GB}}\|_F^2 \le r_{\mathrm{GB}}^2, \; X \succeq O \}.   (D.1)

By transforming the sphere of GB, we obtain

\|X - Q_{\mathrm{GB}}\|_F^2 = \|X - (Q_{\mathrm{GB}+} + Q_{\mathrm{GB}-})\|_F^2
= \|X - Q_{\mathrm{GB}+}\|_F^2 + 2\langle X, -Q_{\mathrm{GB}-} \rangle + 2\langle Q_{\mathrm{GB}+}, Q_{\mathrm{GB}-} \rangle + \|Q_{\mathrm{GB}-}\|_F^2.

Because X \succeq O and -Q_{\mathrm{GB}-} \succeq O, we see \langle X, -Q_{\mathrm{GB}-} \rangle \ge 0. Furthermore, using \langle Q_{\mathrm{GB}+}, Q_{\mathrm{GB}-} \rangle = 0, we obtain the following sphere:

r_{\mathrm{GB}}^2 \ge \|X - Q_{\mathrm{GB}}\|_F^2 \ge \|X - Q_{\mathrm{GB}+}\|_F^2 + \|Q_{\mathrm{GB}-}\|_F^2
\therefore \|X - Q_{\mathrm{GB}+}\|_F^2 \le r_{\mathrm{GB}}^2 - \|Q_{\mathrm{GB}-}\|_F^2.

Letting Q_{\mathrm{PGB}} := Q_{\mathrm{GB}+} and r_{\mathrm{PGB}}^2 := r_{\mathrm{GB}}^2 - \|Q_{\mathrm{GB}-}\|_F^2, PGB is obtained. Note that by considering \langle X, -Q_{\mathrm{GB}-} \rangle \ge 0 instead of X \succeq O in equation D.1, we can immediately see that GB with the linear constraint \langle X, -Q_{\mathrm{GB}-} \rangle \ge 0 is tighter than PGB. □
Appendix E: Proof of Theorem 4 (DGB)
In general, a function f(x) is an m-strongly convex function if f(x) - \frac{m}{2}\|x\|_2^2 is convex. Because the objective function P_\lambda(M) is a λ-strongly convex function, we obtain

P_\lambda(M) \ge P_\lambda(M^\star) + \langle \nabla P_\lambda(M^\star), M - M^\star \rangle + \frac{\lambda}{2}\|M - M^\star\|_F^2.
From the optimal condition, equation C.1, the second term on the right-hand side is greater than or equal to 0, and from the weak duality, P_\lambda(M^\star) \ge D_\lambda(\alpha, \Lambda). Therefore, we obtain theorem 4. □
Appendix F: Constrained Duality Gap Bound (CDGB)
For the DGB, we show that if the primal and dual reference solutions satisfy equation 2.4, the radius can be \sqrt{2} times smaller. We extend the dual-based screening of SVM (Zimmert et al., 2015) for RTLM.

Theorem 14 (CDGB). Let α and Λ be the feasible solutions of the dual problem. Then the optimal solution of the primal problem M^\star exists in the following hypersphere:

\|M^\star - M_\lambda(\alpha, \Lambda)\|_F^2 \le G_{D_\lambda}(\alpha, \Lambda) / \lambda.
Proof. Let G_{D_\lambda}(\alpha, \Lambda) := P_\lambda(M_\lambda(\alpha, \Lambda)) - D_\lambda(\alpha, \Lambda) be the duality gap as a function of the dual feasible solutions α and Λ. The following equation is the duality gap as a function of the primal feasible solution M in which the dual solutions are optimized:

G_{P_\lambda}(M) := \min_{0 \le \alpha \le 1, \; \Lambda \succeq O, \; M_\lambda(\alpha, \Lambda) = M} G_{D_\lambda}(\alpha, \Lambda) = P_\lambda(M) - \max_{0 \le \alpha \le 1, \; \Lambda \succeq O, \; M_\lambda(\alpha, \Lambda) = M} D_\lambda(\alpha, \Lambda).

From the definition, we obtain

G_{D_\lambda}(\alpha, \Lambda) \ge G_{P_\lambda}(M_\lambda(\alpha, \Lambda)).   (F.1)
From the strong convexity of G_{P_\lambda} shown in section F.1, the following holds for any M_\lambda(\alpha, \Lambda) and M^\star \succeq O:

G_{P_\lambda}(M_\lambda(\alpha, \Lambda)) \ge G_{P_\lambda}(M^\star) + \langle \nabla G_{P_\lambda}(M^\star), M_\lambda(\alpha, \Lambda) - M^\star \rangle + \lambda \|M_\lambda(\alpha, \Lambda) - M^\star\|_F^2.

We assume that M^\star is the optimal solution of the primal problem. Then, because M^\star is also a solution to the convex optimization problem \min_{M \succeq O} G_{P_\lambda}(M), it becomes clear that \langle \nabla G_{P_\lambda}(M^\star), M_\lambda(\alpha, \Lambda) - M^\star \rangle \ge 0 from theorem 12. Considering G_{P_\lambda}(M^\star) = 0 and G_{D_\lambda}(\alpha, \Lambda) \ge G_{P_\lambda}(M_\lambda(\alpha, \Lambda)), both of which are from the definition, we obtain

G_{D_\lambda}(\alpha, \Lambda) \ge G_{P_\lambda}(M_\lambda(\alpha, \Lambda)) \ge \lambda \|M_\lambda(\alpha, \Lambda) - M^\star\|_F^2.

Dividing by λ, CDGB is derived. □
We name this bound the constrained duality gap bound (CDGB), of which
the radius converges to 0 at the optimal solution, because the CDGB also has
a radius proportional to the square root of the duality gap. For primal-based
optimizers, additional calculation is necessary for Pλ(Mλ(α, (cid:4))), whereas
dual-based optimizers calculate this term in the optimization process.
F.1 Proof of Strong Convexity of G_{P_\lambda}. We first define an m-strongly convex function as follows:

Definition 1 (m-strongly Convex Function). When f(x) - \frac{m}{2}\|x\|_2^2 is a convex function, f(x) is an m-strongly convex function.

According to definition 1, to show that G_{P_\lambda} is strongly convex, we need to show that the term other than \lambda\|M\|_F^2 is convex:

G_{P_\lambda}(M) = \underbrace{\sum_{ijl} \ell(\langle M, H_{ijl} \rangle)}_{\text{convex}} + \lambda\|M\|_F^2 + \underbrace{\min_{0 \le \alpha \le 1, \; \Lambda \succeq O, \; M_\lambda(\alpha, \Lambda) = M} \underbrace{\sum_{ijl} \ell^*(-\alpha_{ijl})}_{:= g(\alpha)}}_{:= f(M)}.

Because the loss \ell is convex, we need to show that f(M) is convex. This can be shown as

f(M) = \min_{0 \le \alpha \le 1, \; \Lambda \succeq O, \; \frac{1}{\lambda}[\sum_{ijl} \alpha_{ijl} H_{ijl} + \Lambda] = M} g(\alpha) = \min_{0 \le \alpha \le 1, \; \frac{1}{\lambda}\sum_{ijl} \alpha_{ijl} H_{ijl} \preceq M} g(\alpha).

Consider a point M_2 = t M_0 + (1 - t) M_1 (t ∈ [0, 1]), which internally divides two points M_0 and M_1. Let

\alpha^\star_i := \operatorname*{argmin}_{0 \le \alpha \le 1, \; \frac{1}{\lambda}\sum_{ijl} \alpha_{ijl} H_{ijl} \preceq M_i} g(\alpha),

which means that \alpha^\star_i is the minimizer of this problem for a given M_i (i ∈ {0, 1, 2}), and from the definition, we see f(M_i) = g(\alpha^\star_i). Further, let \alpha_2 = t\alpha^\star_0 + (1 - t)\alpha^\star_1. Then 0 \le \alpha_2 \le 1 and \frac{1}{\lambda}\sum_{ijl} \alpha_{2,ijl} H_{ijl} \preceq M_2. Because g is convex because of the convexity of \ell^*, we have
t f(M_0) + (1 - t) f(M_1) = t\, g(\alpha^\star_0) + (1 - t)\, g(\alpha^\star_1) \ge g(\underbrace{t\alpha^\star_0 + (1 - t)\alpha^\star_1}_{\alpha_2}) \ge g(\alpha^\star_2) = f(\underbrace{t M_0 + (1 - t) M_1}_{M_2}).

Hence, f(M) is convex and G_{P_\lambda} is a strongly convex function.
Appendix G: Proof of Theorem 5 (RPB)
The optimality condition, theorem 12, in the dual problem, Dual1, for λ0, λ1
determines that
\nabla_\alpha D_{\lambda_0}(\alpha^\star_0, \Lambda^\star_0)^\top (\alpha^\star_1 - \alpha^\star_0) + \langle \nabla_\Lambda D_{\lambda_0}(\alpha^\star_0, \Lambda^\star_0), \Lambda^\star_1 - \Lambda^\star_0 \rangle \le 0,
\nabla_\alpha D_{\lambda_1}(\alpha^\star_1, \Lambda^\star_1)^\top (\alpha^\star_0 - \alpha^\star_1) + \langle \nabla_\Lambda D_{\lambda_1}(\alpha^\star_1, \Lambda^\star_1), \Lambda^\star_0 - \Lambda^\star_1 \rangle \le 0.

By adding these two equations, we obtain

[\nabla_\alpha D_{\lambda_0}(\alpha^\star_0, \Lambda^\star_0) - \nabla_\alpha D_{\lambda_1}(\alpha^\star_1, \Lambda^\star_1)]^\top (\alpha^\star_1 - \alpha^\star_0) + \langle \nabla_\Lambda D_{\lambda_0}(\alpha^\star_0, \Lambda^\star_0) - \nabla_\Lambda D_{\lambda_1}(\alpha^\star_1, \Lambda^\star_1), \Lambda^\star_1 - \Lambda^\star_0 \rangle \le 0.

Next, we consider the following difference of gradient:

\nabla_{\alpha_{ijl}} D_{\lambda_0}(\alpha^\star_0, \Lambda^\star_0) - \nabla_{\alpha_{ijl}} D_{\lambda_1}(\alpha^\star_1, \Lambda^\star_1) = -\gamma(\alpha^\star_{0\,ijl} - \alpha^\star_{1\,ijl}) - \langle H_{ijl}, M^\star_0 - M^\star_1 \rangle,
\nabla_{\Lambda} D_{\lambda_0}(\alpha^\star_0, \Lambda^\star_0) - \nabla_{\Lambda} D_{\lambda_1}(\alpha^\star_1, \Lambda^\star_1) = -(M^\star_0 - M^\star_1).

Defining H^\star_t := \sum_{ijl} \alpha^\star_{t\,ijl} H_{ijl}, M^\star_t is rewritten as M^\star_t = \frac{1}{\lambda_t}[H^\star_t + \Lambda^\star_t]. Then

\gamma\|\alpha^\star_1 - \alpha^\star_0\|_2^2 - \langle H^\star_1 - H^\star_0, M^\star_0 - M^\star_1 \rangle - \langle \Lambda^\star_1 - \Lambda^\star_0, M^\star_0 - M^\star_1 \rangle \le 0
\Leftrightarrow \gamma\|\alpha^\star_1 - \alpha^\star_0\|_2^2 - \langle \lambda_1 M^\star_1 - \lambda_0 M^\star_0, M^\star_0 - M^\star_1 \rangle \le 0
\Rightarrow -\langle \lambda_1 M^\star_1 - \lambda_0 M^\star_0, M^\star_0 - M^\star_1 \rangle \le 0.

Transformation of this inequality based on completing the square allows the RPB to be obtained. □
Appendix H: Proof of Theorem 6 (RRPB)
Considering a hypersphere that expands the RPB radius by \frac{\lambda_0 + \lambda_1}{2\lambda_1}\epsilon and replaces the RPB center with \frac{\lambda_0 + \lambda_1}{2\lambda_1} M_0, we obtain

\Bigl\| M^\star_1 - \frac{\lambda_0 + \lambda_1}{2\lambda_1} M_0 \Bigr\|_F \le \frac{|\lambda_0 - \lambda_1|}{2\lambda_1} \|M^\star_0\|_F + \frac{\lambda_0 + \lambda_1}{2\lambda_1} \epsilon.
Because \epsilon is defined by \|M^\star_0 - M_0\|_F \le \epsilon, this sphere covers any RPB created by M^\star_0, which satisfies \|M^\star_0 - M_0\|_F \le \epsilon (see Figure 2d for a geometrical illustration). Using the reverse triangle inequality,

\|M^\star_0\|_F - \|M_0\|_F \le \|M^\star_0 - M_0\|_F \le \epsilon,

we obtain

\Bigl\| M^\star_1 - \frac{\lambda_0 + \lambda_1}{2\lambda_1} M_0 \Bigr\|_F \le \frac{|\lambda_0 - \lambda_1|}{2\lambda_1} (\|M_0\|_F + \epsilon) + \frac{\lambda_0 + \lambda_1}{2\lambda_1} \epsilon.

By rearranging this, RRPB is obtained. □
Appendix I: Proof of Theorem 7 (Relationship between PGB
and RPB)
When the dual variable is used as the subgradient of the (smoothed) hinge loss at the optimal solution M^\star_0 of λ_0 (from equation A.7, the optimal dual variable provides a valid subgradient), the gradient of the objective function in the case of λ_1 is written as

\nabla P_{\lambda_1}(M^\star_0) = -H^\star_0 + \lambda_1 M^\star_0,

where

H^\star_0 := -\sum_{ijl} \nabla \ell(\langle M^\star_0, H_{ijl} \rangle) H_{ijl} = \sum_{ijl} \alpha^\star_{0\,ijl} H_{ijl}.

Because \lambda_0 M^\star_0 = H^\star_{0+},

\nabla P_{\lambda_1}(M^\star_0) = -(H^\star_{0+} + H^\star_{0-}) + \lambda_1 M^\star_0 = (\lambda_1 - \lambda_0) M^\star_0 - H^\star_{0-}.
Then the center and radius of GB are
(cid:4)
0
∇Pλ
− 1
2λ1
QGB = M
(cid:21)
(cid:21)
(cid:21)(λ1 − λ0)M(cid:4)
0
4λ2
1
r2
GB
=
1 (M
(cid:4)
0) =
(cid:21)
(cid:21)
(cid:21)
(λ0 + λ1)M(cid:4)
2λ1
0
+ H (cid:4)
0−
,
2
F
− H (cid:4)
0−
(cid:21)
(cid:21)
=
(λ1 − λ0)M(cid:4)
0
(cid:21)
(cid:21)2
F
− 2(λ1 − λ0)(cid:5)M(cid:4)
4λ2
1
0
, H (cid:4)
0−
(cid:6) +
(cid:21)
(cid:21)
(cid:21)H (cid:4)
0−
(cid:21)
(cid:21)
2
(cid:21)
F
= \frac{\|(\lambda_0 - \lambda_1) M^\star_0\|_F^2 + \|H^\star_{0-}\|_F^2}{4\lambda_1^2}.

Here, the last equation of r_{\mathrm{GB}}^2 uses the fact that M^\star_0 and H^\star_{0-} are orthogonal.
Here, the last equation of r2
Using QGB and r2
GB, the center and radius of PGB are found to be
+ = (λ0 + λ1)M(cid:4)
, QGB
− =
,
0
QPGB = QGB
r2
PGB
= r2
GB
−
(cid:21)
(cid:21)
QGB
−
H (cid:4)
0−
2λ1
(cid:21)
(cid:21)2
F
0
.
2λ1
(cid:21)
(cid:21)2
F
=
(cid:21)
(cid:21)
(λ0 − λ1)M(cid:4)
4λ2
1
Therefore, PGB coincides with RPB.
(cid:2)
Appendix J: Proof of Theorem 8 (Relationship between DGB
and RPB)
At the optimal solution M(cid:4)
0
0 (α(cid:4)
0 (M(cid:4)
tion from Pλ
0
(cid:16)
0) = Dλ
(cid:16)
, α(cid:4)
, (cid:4)(cid:4)
0 and (cid:4)(cid:4)
0) and Mλ
0 (α(cid:4)
0
, (cid:4)(cid:4)
0 of λ0, we obtain the following equa-
(cid:2)((cid:5)M
(cid:4)
0
, H i jl
(cid:6)) +
(cid:2)∗
(−α(cid:4)
0 i jl ) = −λ0
0) = M(cid:4)
0:
(cid:21)
(cid:21)
(cid:21)2
(cid:21)
F
M
(cid:4)
0
.
i jl
i jl
We also see Mλ
Mλ
value of the duality gap for λ1 is
0
1
0
, (cid:4)(cid:4)
0) = λ
λ
1 (α(cid:4)
0 (α(cid:4)
0
, (cid:4)(cid:4)
0) = λ
λ
M(cid:4)
0. Using these results, the
0
1
Pλ
1 (M
(cid:4)
0) − Dλ
1 (α(cid:4)
0
, (cid:4)(cid:4)
0) = (λ0 − λ1)2
2λ1
(cid:21)
(cid:21)
M
(cid:21)
(cid:21)2
F
(cid:4)
0
.
Therefore, the radius of DGB rDGB and the radius of RPB rRPB satisfy the
following relationship:
= 2(Pλ
r2
DGB
1 (M(cid:4)
1 (α(cid:4)
0
, (cid:4)(cid:4)
0))
0) − Dλ
λ1
(cid:21)
(cid:21)
M
(cid:21)
(cid:21)2
F
(cid:4)
0
= (λ0 − λ1)2
λ2
1
= 4 r2
RPB
.
Furthermore, the centers of these hyperspheres are
QDGB = M
(cid:4)
0
, QRPB =
λ0 + λ1
2λ1
M
,
(cid:4)
0
and the distance between the centers is

\|Q_{\mathrm{DGB}} - Q_{\mathrm{RPB}}\|_F = \frac{|\lambda_0 - \lambda_1|}{2\lambda_1} \|M^\star_0\|_F = r_{\mathrm{RPB}}.

Thus, the DGB includes the RPB, as illustrated in Figure 2c. □
Appendix K: Proof of Theorem 9
The Lagrange function is defined as

L(X, \alpha, \beta) := \langle X, H_{ijl} \rangle - \alpha \frac{1}{2}(r^2 - \|X - Q\|_F^2) - \beta \langle P, X \rangle.

From the KKT condition, we obtain

\partial L / \partial X = H_{ijl} + \alpha(X - Q) - \beta P = O,   (K.1a)
\alpha \ge 0, \; \beta \ge 0, \; \|X - Q\|_F^2 \le r^2, \; \langle P, X \rangle \ge 0,   (K.1b)
\alpha(r^2 - \|X - Q\|_F^2) = 0, \; \beta \langle P, X \rangle = 0.   (K.1c)

If α = 0, then H_{ijl} = \beta P from equation K.1a, and the value of the objective function becomes \langle X, H_{ijl} \rangle = \beta \langle X, P \rangle = 0 from equation K.1c. Let us consider the case of α ≠ 0. From equation K.1c, it becomes clear that \|X - Q\|_F^2 = r^2. If β = 0, the linear constraint is not an active constraint (i.e., \langle P, X \rangle > 0 at the optimal); hence, it is the same as problem P1, which can be analytically solved. If this solution satisfies the linear constraint \langle P, X \rangle \ge 0, it becomes the optimal solution. Next, we consider the case of β ≠ 0. From equations K.1a and K.1c, α and β are obtained as

\alpha = \pm \sqrt{\frac{\|H_{ijl}\|_F^2 \|P\|_F^2 - \langle P, H_{ijl} \rangle^2}{r^2 \|P\|_F^2 - \langle P, Q \rangle^2}}, \qquad \beta = \frac{\langle P, H_{ijl} \rangle - \alpha \langle P, Q \rangle}{\|P\|_F^2}.

Of the solutions of the two values of α, α > 0 gives the minimum value from equation K.1b. □
Appendix L: Range-Based Extension
L.1 Generalized Form of GB, DGB, RPB, and RRPB.
L.1.1 GB. The gradient is written as
\nabla P_\lambda(M) = \Phi + \lambda M, where \Phi := \sum_{ijl} \Phi_{ijl}(M) denotes the sum of the loss subgradients.
T. Yoshida, I. Takeuchi, and M. Karasuyama
Then, the squared norm of this gradient is
(cid:7)∇Pλ(M)(cid:7)2
F
= (cid:7)(cid:6)(cid:7)2
F
+ 2λ(cid:5)(cid:6), M(cid:6) + λ2(cid:7)M(cid:7)2
F
.
By substituting this into the center and the radius of GB, we obtain
r2
GB
= 1
4λ2
= 1
4λ2
(cid:7)∇Pλ(M)(cid:7)2
F
(cid:3)
(cid:7)(cid:6)(cid:7)2
F
+ 2λ(cid:5)(cid:6), M(cid:6) + λ2(cid:7)M(cid:7)2
F
(cid:4)
(cid:5)(cid:6), M(cid:6) 1
4λ2
(cid:7)(cid:6)(cid:7)2
F
,
(cid:7)M(cid:7)2
F
= 1
4
QGB = M − 1
+ 1
2λ
2λ ((cid:6) + λM)
(cid:6).
= 1
2
M − 1
2λ
L.1.2 DGB. The duality gap is written as
gap =
(cid:16)
i jl
((cid:2)((cid:5)M, H i jl
(cid:6)) + (cid:2)∗
(−α
i jl )) +
λ
2
(cid:7)M(cid:7)2
F
+ 1
2λ
(cid:7)
(cid:16)
i jl
α
i jlH i jl
+ (cid:4)(cid:7)2
F
.
Then the center and radius of DGB are
λ
2
(cid:7)M(cid:7)2
F
⎞
r2
DGB
= 2gap
λ
)
(cid:16)
= 2
λ
((cid:2)((cid:5)M, H i jl
(cid:6)) + (cid:2)∗
(−α
i jl )) +
(cid:21)
(cid:21)
(cid:21)
(cid:21)
(cid:21)
(cid:21)
+ (cid:4)
*
2
F
i jl
(cid:21)
(cid:21)
(cid:21)
(cid:21)
(cid:21)
(cid:21)
+ 1
2λ
(cid:16)
i jl
= (cid:7)M(cid:7)2
F
+ 2
λ
α
i jlH i jl
⎛
(cid:16)
⎝
i jl
(cid:21)
(cid:21)
(cid:21)
(cid:21)
(cid:21)
(cid:21)
(cid:16)
i jl
α
i jlH i jl
+ (cid:4)
(cid:21)
(cid:21)
2
(cid:21)
(cid:21)
(cid:21)
(cid:21)
F
,
+ 1
λ2
QDGB = M.
((cid:2)((cid:5)M, H i jl
(cid:6)) + (cid:2)∗
(−α
i jl ))
⎠
L.1.3 RPB. With respect to RPB, we regard λ_1 as the target λ for which we consider the range. From the definition, we see

Q_{\mathrm{RPB}} = \frac{\lambda_0 + \lambda}{2\lambda} M^\star_0 = \frac{1}{2}M^\star_0 + \frac{\lambda_0}{2\lambda}M^\star_0,
r_{\mathrm{RPB}} = \frac{\lambda_0 - \lambda}{2\lambda}\|M^\star_0\|_F = -\frac{1}{2}\|M^\star_0\|_F + \frac{\lambda_0}{2\lambda}\|M^\star_0\|_F.
L.1.4 RRPB. Here again, we regard λ_1 as the target λ for which we consider the range. First, we assume λ ≤ λ_0. Then we have

Q_{\mathrm{RRPB}} = \frac{\lambda_0 + \lambda}{2\lambda} M_0 = \frac{1}{2}M_0 + \frac{\lambda_0}{2\lambda}M_0,
r_{\mathrm{RRPB}} = \frac{\lambda_0 - \lambda}{2\lambda}\|M_0\|_F + \frac{\lambda_0}{\lambda}\epsilon = -\frac{1}{2}\|M_0\|_F + \frac{1}{\lambda}\Bigl(\frac{\lambda_0}{2}\|M_0\|_F + \lambda_0\epsilon\Bigr).

In the case of λ ≥ λ_0, we have

Q_{\mathrm{RRPB}} = \frac{\lambda_0 + \lambda}{2\lambda} M_0 = \frac{1}{2}M_0 + \frac{\lambda_0}{2\lambda}M_0,
r_{\mathrm{RRPB}} = \frac{\lambda - \lambda_0}{2\lambda}\|M_0\|_F + \epsilon = \Bigl(\epsilon + \frac{1}{2}\|M_0\|_F\Bigr) - \frac{\lambda_0}{2\lambda}\|M_0\|_F.
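The following sketch assembles the RRPB center and radius above as functions of the target λ and tests the spherical rule over a grid of λ values; it is a brute-force, numerical counterpart of the closed-form range derived in theorem 10. Function names and the grid-based search are our assumptions, not the authors' implementation.

```python
import numpy as np

def rrpb_sphere(lam, lam0, M0, eps):
    """Center and radius of RRPB as a function of the target lambda (section L.1.4)."""
    Q = (lam0 + lam) / (2.0 * lam) * M0
    nrm = np.linalg.norm(M0)
    if lam <= lam0:
        r = (lam0 - lam) / (2.0 * lam) * nrm + lam0 / lam * eps
    else:
        r = (lam - lam0) / (2.0 * lam) * nrm + eps
    return Q, r

def lambdas_screened(H, lam_grid, lam0, M0, eps):
    """Lambdas at which <H, Q> - r ||H||_F > 1, i.e. the triplet falls into
    R-hat under the RRPB spherical rule."""
    out = []
    for lam in lam_grid:
        Q, r = rrpb_sphere(lam, lam0, M0, eps)
        if float(np.sum(H * Q)) - r * float(np.linalg.norm(H)) > 1.0:
            out.append(lam)
    return out
```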
L.2 Proof of Theorem 10 (Range-Based Extension of RRPB). In RRPB, we replace λ_1 with λ and assume λ ≤ λ_0. Then,

Q_{\mathrm{RRPB}} = \frac{\lambda_0 + \lambda}{2\lambda} M_0, \qquad r_{\mathrm{RRPB}} = \frac{\lambda_0 - \lambda}{2\lambda}\|M_0\|_F + \frac{\lambda_0}{\lambda}\epsilon.

From the spherical rule, equation 4.1, we obtain

\frac{\lambda + \lambda_0}{2\lambda}\langle H_{ijl}, M_0 \rangle - \Bigl(\frac{\lambda_0 - \lambda}{2\lambda}\|M_0\|_F + \frac{\lambda_0}{\lambda}\epsilon\Bigr)\|H_{ijl}\|_F > 1
\Leftrightarrow \bigl(\langle H_{ijl}, M_0 \rangle - 2 + \|H_{ijl}\|_F\|M_0\|_F\bigr)\lambda > \lambda_0\bigl(\|M_0\|_F\|H_{ijl}\|_F - \langle H_{ijl}, M_0 \rangle + 2\epsilon\|H_{ijl}\|_F\bigr).

The Cauchy-Schwarz inequality determines that the right-hand side is equal to or greater than 0; therefore, the left-hand side must be greater than 0:

\therefore \; \lambda_0 \ge \lambda > \frac{\lambda_0\bigl(\|M_0\|_F\|H_{ijl}\|_F - \langle H_{ijl}, M_0 \rangle + 2\epsilon\|H_{ijl}\|_F\bigr)}{\langle H_{ijl}, M_0 \rangle - 2 + \|H_{ijl}\|_F\|M_0\|_F}.
In the case of λ ≥ λ0,
Q_{RRPB} = \frac{\lambda_0 + \lambda}{2\lambda} M_0,
\qquad
r_{RRPB} = \frac{\lambda - \lambda_0}{2\lambda}\|M_0\|_F + \varepsilon.

From spherical rule 4.1,

\frac{\lambda + \lambda_0}{2\lambda}\langle H_{ijl}, M_0\rangle
- \Bigl(\frac{\lambda - \lambda_0}{2\lambda}\|M_0\|_F + \varepsilon\Bigr)\|H_{ijl}\|_F > 1

\Leftrightarrow
\bigl(\underbrace{\|H_{ijl}\|_F\|M_0\|_F - \langle H_{ijl}, M_0\rangle}_{\geq 0} + 2 + 2\varepsilon\|H_{ijl}\|_F\bigr)\,\lambda
< \lambda_0\bigl(\|M_0\|_F\|H_{ijl}\|_F + \langle H_{ijl}, M_0\rangle\bigr).

Similarly, the Cauchy-Schwarz inequality determines that the left-hand side is greater than 0:

\therefore\quad
\lambda_0 \leq \lambda <
\frac{\lambda_0\bigl(\|M_0\|_F\|H_{ijl}\|_F + \langle H_{ijl}, M_0\rangle\bigr)}
{\|H_{ijl}\|_F\|M_0\|_F - \langle H_{ijl}, M_0\rangle + 2 + 2\varepsilon\|H_{ijl}\|_F}.
\qquad\Box

Appendix M: Calculation for Range-Based Extension Other Than RPB and RRPB

The spherical rule is written as

\langle H_{ijl}, Q\rangle - R\|H_{ijl}\|_F > 1
\ \Rightarrow\ (i, j, l) \in \mathcal{R}^*.
This inequality is equivalent to the following two inequalities:
(\langle H_{ijl}, Q\rangle - 1)^2 > R^2\|H_{ijl}\|_F^2,
\qquad
\langle H_{ijl}, Q\rangle > 1.
By using equations 6.1 and 6.3, these inequalities can be transformed into
\Bigl(\Bigl\langle H_{ijl}, A + B\tfrac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2
> \biggl(\sqrt{a + b\tfrac{1}{\lambda} + c\tfrac{1}{\lambda^2}} + \sqrt{\tfrac{2\varepsilon}{\lambda}}\biggr)^2\|H_{ijl}\|_F^2,
\qquad(\mathrm{M.1})

\Bigl\langle H_{ijl}, A + B\tfrac{1}{\lambda}\Bigr\rangle > 1.
\qquad(\mathrm{M.2})
Note that the definitions of a, b, c, A, and B for each bound are shown in
section L.1. Because inequality M.2 can be written as a linear inequality of
λ, we can easily obtain the range of λ that satisfies the inequality. On the
other hand, inequality M.1 is equivalent to
\Bigl(\Bigl\langle H_{ijl}, A + B\tfrac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2
> \Bigl(a + b\tfrac{1}{\lambda} + c\tfrac{1}{\lambda^2} + \tfrac{2\varepsilon}{\lambda}\Bigr)\|H_{ijl}\|_F^2
+ 2\sqrt{a + b\tfrac{1}{\lambda} + c\tfrac{1}{\lambda^2}}\sqrt{\tfrac{2\varepsilon}{\lambda}}\,\|H_{ijl}\|_F^2

\Leftrightarrow\
\Bigl(\Bigl\langle H_{ijl}, A + B\tfrac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2
- \Bigl(a + b\tfrac{1}{\lambda} + c\tfrac{1}{\lambda^2} + \tfrac{2\varepsilon}{\lambda}\Bigr)\|H_{ijl}\|_F^2
> 2\sqrt{a + b\tfrac{1}{\lambda} + c\tfrac{1}{\lambda^2}}\sqrt{\tfrac{2\varepsilon}{\lambda}}\,\|H_{ijl}\|_F^2.
The last inequality can be transformed into the following two inequalities:
\biggl(\Bigl(\Bigl\langle H_{ijl}, A + B\tfrac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2
- \Bigl(a + b\tfrac{1}{\lambda} + c\tfrac{1}{\lambda^2} + \tfrac{2\varepsilon}{\lambda}\Bigr)\|H_{ijl}\|_F^2\biggr)^2
> 4\Bigl(a + b\tfrac{1}{\lambda} + c\tfrac{1}{\lambda^2}\Bigr)\Bigl(\tfrac{2\varepsilon}{\lambda}\Bigr)\|H_{ijl}\|_F^4,
\qquad(\mathrm{M.3})

\Bigl(\Bigl\langle H_{ijl}, A + B\tfrac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2
> \Bigl(a + b\tfrac{1}{\lambda} + c\tfrac{1}{\lambda^2} + \tfrac{2\varepsilon}{\lambda}\Bigr)\|H_{ijl}\|_F^2.
\qquad(\mathrm{M.4})
Inequality M.4 is also a quadratic inequality for which we can obtain the
range of λ that satisfies the inequality. Although inequality M.3 is a fourth-
order inequality, the range of λ can be calculated by using a fourth-order
equation solver. Then we obtain the range of λ as the intersection of the
ranges derived from equations M.2 to M.4.
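One possible way to carry out this intersection numerically is sketched below: working in t = 1/λ, inequalities M.2 to M.4 become a linear, a quadratic, and a quartic polynomial condition, so candidate endpoints of the range are polynomial roots (here obtained with numpy.roots). The helper and its argument names (ipA = ⟨H, A⟩, ipB = ⟨H, B⟩, nH2 = ‖H‖²_F) are assumptions for illustration, not part of any released implementation.

    import numpy as np

    def lambda_breakpoints(ipA, ipB, a, b, c, eps, nH2):
        # Candidate breakpoints, in t = 1/lambda, of the screening range given by
        # inequalities (M.2)-(M.4); ipA = <H, A>, ipB = <H, B>, nH2 = ||H||_F^2.
        lhs = np.poly1d([ipB, ipA - 1.0])               # <H, A + B t> - 1
        quad = np.poly1d([c, b, a])                     # a + b t + c t^2
        et = np.poly1d([2.0 * eps, 0.0])                # 2 eps t
        m2 = lhs                                        # (M.2): m2 > 0 (linear)
        m4 = lhs * lhs - (quad + et) * nH2              # (M.4): m4 > 0 (quadratic)
        m3 = m4 * m4 - quad * et * (4.0 * nH2 ** 2)     # (M.3): m3 > 0 (quartic)
        pts = []
        for p in (m2, m4, m3):
            pts += [r.real for r in np.roots(p.coeffs)
                    if abs(r.imag) < 1e-10 and r.real > 0.0]
        return sorted(pts)

The admissible range of λ is then recovered by checking M.2 to M.4 on the intervals between consecutive breakpoints and mapping each surviving value of t back to λ = 1/t.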
Appendix N: Spherical Rule with Semidefinite Constraint for Diagonal
Case
N.1 Proof of Theorem 11. Rearranging equation 6.4a, we obtain
\beta_k = 2\alpha x_k + (h_{ijl,k} - 2\alpha q_k).

When we assume h_{ijl,k} − 2αq_k > 0, we see

h_{ijl,k} - 2\alpha q_k > 0
\ \Rightarrow\
\beta_k = \underbrace{2\alpha x_k}_{\geq 0} + \underbrace{(h_{ijl,k} - 2\alpha q_k)}_{>0} > 0
\ \Rightarrow\ x_k = 0.
The previous equation is derived from the complementary condition β_k x_k = 0. When we assume h_{ijl,k} − 2αq_k ≤ 0, we have

h_{ijl,k} - 2\alpha q_k \leq 0
\ \Rightarrow\ 2\alpha x_k - \beta_k = -(h_{ijl,k} - 2\alpha q_k) \geq 0
\ \Rightarrow\ \beta_k = 0
\ \Rightarrow\ x_k = q_k - h_{ijl,k}/(2\alpha).
The third equation, β_k = 0, is derived from x_k ≥ 0, β_k ≥ 0, and the complementary condition β_k x_k = 0, and in the previous equation, the assumption α > 0 is used. Using the above two derivations, we obtain equation 6.5. Further, from the complementary condition α(r² − ‖x − q‖²_2) = 0, it is clear that ‖x − q‖²_2 = r² because of the assumption α > 0. □
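The two cases above give the closed form of x for a fixed α > 0 (equation 6.5); a minimal sketch, assuming h and q are one-dimensional NumPy arrays:

    import numpy as np

    def x_of_alpha(h, q, alpha):
        # Componentwise solution implied by the two cases above (alpha > 0):
        # x_k = 0 when h_k - 2*alpha*q_k > 0, and x_k = q_k - h_k/(2*alpha) otherwise.
        x = q - h / (2.0 * alpha)
        return np.where(h - 2.0 * alpha * q > 0.0, 0.0, x)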
N.2 Proof of Theorem 12. Because we assume that α = 0, we obtain
\beta = h_{ijl}

by using the KKT condition, equation 6.4a. Note that this implicitly indicates that h_{ijl} ≥ 0 should be satisfied because of the nonnegativity of β. The complementary condition β_k x_k = 0 determines that

x_k = 0 \quad \text{if } h_{ijl,k} > 0.
\qquad(\mathrm{N.1})

To satisfy all the KKT conditions, equation 6.4, we need to set the other x_k in such a way that ‖x − q‖²_2 ≤ r² and x ≥ 0 are satisfied. Note that the other conditions in equation 6.4 are satisfied for any x because of the assumption α = 0. By setting

x_k = \max\{q_k, 0\} \quad \text{for } h_{ijl,k} = 0,
\qquad(\mathrm{N.2})
‖x − q‖²_2 is minimized under the conditions x ≥ 0 and equation N.1, and thus the condition ‖x − q‖²_2 ≤ r² should be satisfied when the optimal α is 0.
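A hedged sketch of this α = 0 check, with the same illustrative array names as before; it returns the candidate x only when the ball constraint is met:

    import numpy as np

    def alpha_zero_candidate(h, q, r):
        # Candidate solution for alpha = 0 (Theorem 12): x_k = 0 where h_k > 0 and
        # x_k = max(q_k, 0) where h_k = 0, accepted only if ||x - q||_2^2 <= r^2.
        if np.any(h < 0.0):
            return None                       # beta = h must be nonnegative
        x = np.where(h > 0.0, 0.0, np.maximum(q, 0.0))
        return x if np.sum((x - q) ** 2) <= r ** 2 else None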
N.3 Analytical Procedure of Rule Evaluation for the Diagonal Case.
We first verify the case α = 0. If the solution equations 6.8 and 6.9 satisfy
all the KKT conditions, equation 6.4, then the solution is optimal. Other-
wise, we consider the case α > 0. Let S := {k | xk
> 0} be the support set of
x, where xk is defined by equation 6.5. When S is regarded as a function of
α, an element of S can change at an α that satisfies

h_{ijl,k} - 2\alpha q_k = 0

for some k ∈ [d]. Let α_1 ≤ α_2 ≤ ··· ≤ α_{d'} for d' ≤ d be a sequence of those change points that can be found by sorting

\Bigl\{\, \frac{h_{ijl,k}}{2 q_k} \,\Bigm|\, \frac{h_{ijl,k}}{2 q_k} > 0,\ q_k \neq 0,\ k \in [d] \,\Bigr\}.
\qquad(\mathrm{N.3})

For notational convenience, we define α_0 := 0. Based on the definition, S is fixed for any α in an interval (α_k, α_{k+1}), to which we refer as S_k. This means
that the support set of the optimal x should be one of S_k for k = 1, . . . , d'. Algorithm 3 shows an analytical procedure for calculating the optimal x, which verifies the optimality of each one of S_k after considering the case of α = 0. For each iterative cycle in algorithm 3, O(d) computation is required, and thus the solution can be found in O(d²) time.
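Algorithm 3 itself is not reproduced in this appendix, so the following sketch only mirrors the verbal description above: try α = 0 first, then scan the intervals between the sorted change points of equation N.3, solving ‖x − q‖²_2 = r² for α on each candidate support. It reuses the two helpers sketched above, and its details are an assumption for illustration rather than the letter's reference implementation.

    import numpy as np

    def solve_diag_projection(h, q, r):
        # Analytical procedure for the diagonal case (cf. algorithm 3):
        # alpha = 0 case first (Theorem 12), then each candidate support S_k.
        x0 = alpha_zero_candidate(h, q, r)
        if x0 is not None:
            return x0
        mask = h * q > 0.0                              # indices with h_k/(2 q_k) > 0
        bounds = np.concatenate(([0.0], np.sort(h[mask] / (2.0 * q[mask])), [np.inf]))
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            mid = lo + 1.0 if np.isinf(hi) else 0.5 * (lo + hi)
            S = h - 2.0 * mid * q < 0.0                 # support set, fixed on (lo, hi)
            if not np.any(S):
                continue
            rest = r ** 2 - np.sum(q[~S] ** 2)
            if rest <= 0.0:
                continue
            alpha = np.sqrt(np.sum(h[S] ** 2) / (4.0 * rest))   # from ||x - q||^2 = r^2
            if lo < alpha <= hi:                        # alpha must be consistent with S
                return x_of_alpha(h, q, alpha)
        return None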
Acknowledgments
This work was financially supported by grants from the Japanese Min-
istry of Education, Culture, Sports, Science and Technology awarded to I.T.
(16H06538, 17H00758) and M.K. (16H06538, 17H04694); from the Japan Sci-
ence and Technology Agency (JST) CREST awarded to I.T. (JPMJCR1302,
JPMJCR1502) and PRESTO awarded to M.K. (JPMJPR15N2); from the Ma-
terials Research by Information Integration Initiative (MI2I) project of the
Support Program for Starting Up Innovation Hub from the JST awarded to
I.T. and M.K.; and from the RIKEN Center for Advanced Intelligence Project
awarded to I.T.
Received November 14, 2018; accepted July 29, 2019.