Global Convergence of the (1 + 1) Evolución

Global Convergence of the (1 + 1) Evolución
Strategy to a Critical Point

Tobias Glasmachers
Institute for Neural Computation, Ruhr-University, Bochum, Alemania

tobias.glasmachers@ini.rub.de

https://doi.org/10.1162/evco_a_00248

Abstracto

We establish global convergence of the (1 + 1) evolution strategy, eso es, convergence
to a critical point independent of the initial state. Más precisamente, we show the existence
of a critical limit point, using a suitable extension of the notion of a critical point to mea-
surable functions. At its core, the analysis is based on a novel progress guarantee for
elitist, rank-based evolutionary algorithms. By applying it to the (1 + 1) evolution strat-
egy we are able to provide an accurate characterization of whether global convergence
is guaranteed with full probability, or whether premature convergence is possible. Nosotros
illustrate our results on a number of example applications ranging from smooth (non-
convex) cases over different types of saddle points and ridge functions to discontinuous
and extremely rugged problems.

Palabras clave

Evolution strategies, sufficient decrease, global convergence.

1

Introducción

Global convergence of an optimization algorithm refers to convergence of the iterates to
a critical point independent of the initial state—in contrast to local convergence, cual
guarantees this property only for initial iterates in the vicinity of a critical point.1 For ex-
amplio, many first order methods enjoy this property (Gilbert and Nocedal, 1992), mientras
Newton’s method does not. In the realm of direct search algorithms, mesh adaptive
search algorithms are known to be globally convergent (Torczon, 1997).

Evolution strategies (ES) are a class of randomized search heuristics for direct
search in Rd . El (1 + 1)-ES is the maybe simplest such method, originally developed
by Rechenberg (1973). A particularly simple variant thereof, which was first defined by
Kern et al. (2004), is given in Algorithm 1. Its state consists of a single parent individual
m ∈ Rd and a step size σ > 0. It samples a single offspring x ∈ Rd per generation from
the isotropic multivariate normal distribution N (metro, σ 2I ) and applies (1 + 1)-selección;
eso es, it keeps the better of the two points. Aquí, I ∈ Rd×d denotes the identity ma-
trix. The standard deviation σ > 0 of the sampling distribution, also called global step
tamaño, is adapted online. The mechanism maintains a fixed success rate usually chosen as
1/5, in accordance with Rechenberg’s original approach. It is discussed in more detail
en la sección 3. In effect, step size control enables linear convergence on convex quadratic
funciones (Jägersküpper, 2006a), and therefore locally linear convergence on twice dif-
ferentiable functions. A diferencia de, algorithms without step size adaptation can converge

1Some authors refer to global convergence as convergence to a global optimum. We do not use the

term in this sense.
Manuscrito recibido: 30 Abril 2018; revisado: 28 Septiembre 2018 y 24 Enero 2019; aceptado: 25 Enero
2019.
© 2019 Instituto de Tecnología de Massachusetts

Computación evolutiva 28(1): 27–53

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

mi
d
tu
mi
v
C
oh
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

/

2
8
1
2
7
2
0
2
0
3
7
3
mi
v
C
oh
_
a
_
0
0
2
4
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

t. Glasmachers

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

mi
d
tu
mi
v
C
oh
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

/

2
8
1
2
7
2
0
2
0
3
7
3
mi
v
C
oh
_
a
_
0
0
2
4
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

as slowly as pure random search (Hansen et al., 2015). Además, being rank-based
methods, ESs are invariant to strictly monotonic transformations of objective values. ESs
tend to be robust and suitable for solving difficult problems (rugged and multimodal
fitness landscapes), a capacity that is often attributed to invariance properties.

Aunque el (1 + 1)-ES is the oldest evolution strategy in existence, we do not yet
fully understand how generally it is applicable. In this article, we cast this open prob-
lem into the question on which functions the algorithm will succeed to locate a local
optimum, and on which functions it may converge prematurely, and hence fail. We aim
at an as complete as possible characterization of these different cases.

By modern standards, el (1 + 1)-ES cannot be considered a competitive optimiza-
tion method. The covariance matrix adaptation evolution strategy (CMA-ES) by Hansen
and Ostermeier (2001) and its many variants mark the state of the art. The algorithm
goes beyond the simple (1 + 1)-ES in many ways: it uses nonelitist selection with a pop-
ulación, it adapts the full covariance matrix of its sampling distribution (effectively re-
sembling second order methods), and it performs temporal integration of direction in-
formation in the form of evolution paths for step size and covariance matrix adaptation.
Still, its convergence order on many relevant functions is linear, and that is thanks to
the same mechanism as in the (1 + 1)-ES, namely step size adaptation.

Hasta la fecha, convergence guarantees for ESs are scarce. Some results exist for convex
quadratic problems, which essentially implies local convergence on twice continuously

28

Volumen de cálculo evolutivo 28, Número 1

Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point

differentiable functions. In this situation it is natural to start with the simplest ES,
which is arguably the (1 + 1)-ES. The variant defined by Kern et al. (2004) is given in
Algoritmo 1; it is discussed in detail in Section 3.

Jägersküpper (2003, 2005, 2006a,b) analyzed the (1 + 1)-ES2 on the sphere function
as well as on general convex quadratic functions. His analysis ensures linear conver-
gence with overwhelming probability, eso es, with a probability of 1 − exp
para
some ε > 0, where d is the problem dimension. En otras palabras, the analysis is asymp-
totic in the sense d → ∞, and for fixed (finite) dimension d ∈ N, no concrete value or
bound is attributed to this probability. A dimension-dependent convergence rate of
(cid:3)(1/d ) is obtained.

(cid:2)
(cid:2)(d ε )

(cid:3)

A related and more modern approach relying explicitly on drift analysis was pre-
sented by Akimoto et al. (2018), showing linear convergence of the algorithm on the
sphere function, and providing an explicit, non-asymptotic runtime bound for the first
hitting time of a level set.

The analysis by Auger (2005) is based on the stability of the Markov chain defined
by the normalized state m/σ , for a (1, λ)-ES on the sphere function. Since the chain is
shown to converge to a stationary distribution and the problem is scale-invariant, lin-
ear convergence or divergence is obtained, with full probability. There exists sufficient
empirical evidence for convergence; sin embargo, this is not covered by the result.

A different approach to proving global convergence is to modify the algorithm un-
der consideration in a way that allows for an analysis with well established techniques.
This route was explored by Diouane et al. (2015), where step size adaptation is subject
to a forcing function in order to guarantee a sufficient decrease condition, akin to, para
ejemplo, the Wolfe conditions for inexact line search (lobo, 1969). This is a powerful
approach since the resulting analysis is general in terms of the algorithms (lo mismo
step size forcing mechanism can be added to virtually all ES) and the objective func-
ciones (the function must be bounded from below and Lipschitz near the limit point)
at the same time. The price is that the analysis does not apply to algorithms regularly
applied within the EC community, and that we do not obtain new insights about the
mechanisms of these algorithms. Además, the forcing function decays slowly, forc-
ing a linearly convergent algorithm into sublinear convergence (but still much faster
than random search). From a more technical point of view the Lipschitz condition is
unfortunate since it is not preserved under monotonic transformations of fitness val-
ues. We improve on this approach by providing sufficient decrease of a transformed
objective function, which holds for all randomized elitist, rank-based algorithms, y
hence does not require a forcing function or any other algorithmic changes.

The global convergence guarantee by Akimoto et al. (2010) is closest to the present
artículo. También, that analysis is extremely general in the sense that it covers a broad range
of problems and algorithms. The objective function is assumed to be continuously dif-
ferentiable, and the only requirement for the algorithm is that it successfully diverges on
a linear function. This includes all state-of-the-art evolution strategies and many more
algoritmos. Since continuously differentiable functions are locally arbitrarily well ap-
proximated by linear functions (first order Taylor polynomial), it is concluded that any
limit point must be stationary, since there the linear term vanishes and higher order
terms take over. This is an elegant and powerful result. Its main restriction is that it ap-
plies only to continuously differentiable functions. This is a huge class, but it can still be

2Jägersküpper analyzed a different step size adaptation rule. Sin embargo, it exhibits essentially the

same dynamics as Algorithm 1.

Volumen de cálculo evolutivo 28, Número 1

29

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

mi
d
tu
mi
v
C
oh
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

/

2
8
1
2
7
2
0
2
0
3
7
3
mi
v
C
oh
_
a
_
0
0
2
4
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

t. Glasmachers

considered a relevant limitation because on continuously differentiable problems ESs
are in direct competition with gradient-based methods, which are usually more efficient
if gradients are available.

Por esta razón, solving smooth and otherwise easy problems cannot be the focus of
evolution strategies. Por lo tanto, in this article we seek to explore the most general class of
problems that can be solved with an evolution strategy. En otras palabras, we aim to push
the limits beyond the well-understood cases, towards really difficult ones. Our goal is to
establish the largest possible class of problems that can be solved reliably by an ES, y
we also want to understand its limitations, es decir., which problems cannot be solved, y
why. For this purpose, we focus on the simplest such algorithm, namely the (1 + 1)-ES
defined in Algorithm 1. It turns out that the limitations of the algorithm are closely tied
to its success-based step size adaptation mechanism. To capture this effect we introduce
a novel regularity condition ensuring proper function of success-based step-size control.
The new condition is arguably much weaker than continuous differentiability, in a sense
that will become clear as we discuss examples and counter-examples.
From a bird’s eye’s perspective, our contributions are as follows:

1. we provide a general progress or decrease guarantee for rank-based elitist algo-

rithms,

2. we show how general the (1 + 1)-ES is applicable, eso es, on which problems it

will find a local optimum.

The article and the proofs are organized as follows. In the next section we establish
a progress guarantee for rank-based elitist algorithms. This result is extremely general,
and it is in no way tied to continuous search spaces and the (1 + 1)-ES. Por lo tanto, es
stated in general terms, in the expectation that it will prove useful for the analysis of al-
gorithms other than the (1 + 1)-ES. Its role in the global convergence proof is to ensure
a sufficient rate of optimization progress as long as the step size is well adapted and
the progress rate is bounded away from zero. En la sección 3, we discuss properties of the
(1 + 1)-ES and introduce the regularity condition. Based on this condition we show that
the step size returns infinitely often to a range where non-trivial progress can be con-
cluded from the decrease theorem. Based on these achievements we establish a global
convergence theorem in Section 4, essentially stating that there exists a subsequence of
iterates converging to a critical point, the exact notion of which is defined in Section 3.
We also establish a negative result, showing that a nonoptimal critical point results in
premature convergence with positive probability, which excludes global convergence.
En la sección 5, we apply the analysis to a variety of settings and demonstrate their impli-
cations. We close with conclusions and open questions.

2 Optimization Progress of Rank-Based Elitist Algorithms

En esta sección, we establish a general theorem ensuring a certain rate of optimization
progress for randomized rank-based elitist algorithms. We consider a general search
space X. This space is equipped with a σ -algebra and a reference measure denoted (cid:4).
The usual choice of the reference measure is the counting measure for discrete spaces
and the Lebesgue measure for continuous spaces. The objective function f : X → R, a
be minimized, is assumed to be measurable. The parent selection and variation oper-
ations of the search algorithm are also assumed to be measurable; indeed we assume
that these operators give rise to a distribution from which the offspring is sampled, y
this distribution has a density with respect to (cid:4).

30

Volumen de cálculo evolutivo 28, Número 1

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

mi
d
tu
mi
v
C
oh
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

/

2
8
1
2
7
2
0
2
0
3
7
3
mi
v
C
oh
_
a
_
0
0
2
4
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point

A rank-based optimization algorithm ignores the numerical fitness scores (F –
valores), and instead relies solely on pairwise comparisons, resulting in exactly one of the
), or f (X) > f (X(cid:5)
relations f (X) < f (x(cid:5) ). This property renders it invariant to strictly monotonically increasing (rank preserving) transformations of the objective values. Therefore it “perceives” the objective function only in terms of its level sets, not in terms of the actual function values. For f : X → R let ), f (x) = f (x(cid:5) (cid:4) Lf (y) := x ∈ X f (y) := S< S≤ f (y) := (cid:4) (cid:4) x ∈ X x ∈ X (cid:5) (cid:5) (cid:5) f (x) = y (cid:5) (cid:5) (cid:5) f (x) < y (cid:5) (cid:5) (cid:5) f (x) ≤ y (cid:6) (cid:6) (cid:6) f (cid:3) . (cid:2) f (m) (cid:2) f (m) denote the level set of f , and the sub-level sets strictly below and including level y ∈ R. For m ∈ X we define the short notations Lf (m) := Lf and f (m) := S≤ S≤ Due to the assumption that the offspring generation distribution is (cid:4)-measurable, with full probability, the algorithm is invariant to the values of the objective function restricted to zero sets (sets Z of measure zero, fulfilling (cid:4)(Z) = 0). The following defi- nition captures these properties. It encodes the “essential” level set structure of an ob- jective function. Definition 1: We call two measurable functions f, g : X → R equivalent and write f (m) := S< (cid:2) f (m) (cid:3) , S< (cid:3) f f (cid:7)∼g if there exists a zero set Z ⊂ X and a strictly monotonically increasing function φ : f (X) → for all x ∈ X \ Z. Here f (X) and g(X) denote the images of g(X) such that g(x) = φ f and g, respectively. We denote the corresponding equivalence class in the set of measurable (cid:5) (cid:5) g(cid:7)∼f functions by [f ] := g : X → R (cid:2) f (x) (cid:8) (cid:9) (cid:3) . It follows immediately from the definition that the sublevel sets of equivalent objective functions f (cid:7)∼g coincide outside a zero set. In the next step we construct a canonical representative for each equivalence class, which we can think of as a normal form of an objective function. Definition 2: For f : X → R we define the spatial suboptimality functions (cid:7)f < (cid:4) : X → R ∪ {∞}, (cid:7)f ≤ (cid:4) : X → R ∪ {∞}, (cid:2) x (cid:10)→ (cid:4) S< f (x) (cid:3) (cid:2) S≤ x (cid:10)→ (cid:4) f (x) (cid:3) , (cid:7)f < computing the volume of the success domain, that is, the set of improving points. If (cid:4) and (cid:7)f ≤ (cid:4) coincide then we drop the upper index and simply denote the spatial suboptimality function (cid:7)f(cid:4). by The definition is illustrated with two examples in Figures 1 and 2. In the follow- ing, m ∈ X will denote the elite (or parent) point, and m(t ) is the elite point in iteration t ∈ N of an iterative algorithm, that is, an evolutionary algorithm with elitist selection. For two very different reasons, namely 1) to avoid divergence of the algorithm in the case of unbounded search spaces, and 2) for simplicity of the technical arguments in the proofs, we restrict ourselves to the case that the sublevel set S≤ of the initial iterate m(0) is bounded and has finite spatial suboptimality. For most reasonable refer- ence measures, boundedness implies finite spatial suboptimality. For X = Rd equipped (cid:2) m(0) (cid:3) f Evolutionary Computation Volume 28, Number 1 31 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 T. Glasmachers Figure 1: Objective function f : R → R with plateau and jump (left). Corresponding spatial suboptimality (cid:7)f ≤ (cid:4) (solid) (right). (cid:7)f < (cid:4) (dotted) and l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 2: All relevant properties of the sphere function f : R2 → R for rank-based opti- mization are specified by its circular level sets, illustrated in blue on the domain (ground plane). The spatial suboptimality of the point x is the Lebesgue measure of the gray area, (cid:7)f(cid:4)(x) indicated by the bold red vertical arrow. which coincides with the function value (cid:7)f(cid:4)(x) = π · (cid:11)x(cid:11)2, irrespective of the rank-preserving (and hence In this example it holds level-set preserving) transformation applied to f . with the Lebesgue measure this is equivalent to the topological closure S≤ compact. The assumptions immediately imply that S< (cid:3) , and that restricted to S≤ all y ≤ f m(0) the bounded range S≤ f (m(0)), we will from here on ignore the issue of infinite being f (y) are bounded for (cid:7)f ≤ (cid:4) take values in . Since an elitist algorithm never accepts points outside f (m(0)) the functions f (y) and S≤ (cid:7)f(cid:4)-values.3 (cid:7)f < (cid:4) and (cid:2) m(0) 0, (cid:7)f(cid:4) (cid:3)(cid:11) (cid:2) (cid:10) f (cid:3) (cid:2) m(0) In the continuous case, a plateau is a level set of positive Lebesgue measure. When defining a local optimum as the best point within an open neighborhood, then an 3An alternative approach to avoiding infinite values is to apply a bounded reference measure with full support, for example, a Gaussian on Rd . In the absence of a uniform distribution on X, the price to pay for a bounded and everywhere positive reference measure is a nonuniform measure, which does not allow for a uniform, positive lower bound. The resulting technical complications seem to outweigh the slightly increased generality of the results. 32 Evolutionary Computation Volume 28, Number 1 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point interior point of a plateau is a local optimum, which may not always be intended. Any- way, when analyzing the (1 + 1)-ES we will not handle plateaus and instead assume (cid:7)f < that level sets of f are zero sets. This also implies that (cid:4) agree. For now the only slightly weaker statement of the following lemma is sufficient, which does allow for plateaus. Lemma 1: Let f : X → R be measurable. If (cid:4) (cid:7)∼f (cid:7)∼(cid:7)f < (cid:7)f ≤ (cid:4) . (cid:7)f ≤ (cid:4) (x) is finite for all x ∈ X, then it holds (cid:7)f ≤ (cid:4) and (cid:7)f(cid:4) if possible) as a The proof is found in the appendix. We use canonical representative of its equivalence class (if the function values are finite, but see the discussion above). These functions have the property (cid:2) (cid:2) (cid:7)f < (cid:7)f ≤ (cid:4) (x) = (cid:4) (cid:4) (x) = (cid:4) S(cid:7)f < S(cid:7)f (cid:7)f < (cid:4) (or simply (cid:7)f ≤ (cid:4) and (cid:3) (cid:3) ≤ (cid:4) (x) (cid:4) (x) (cid:7)f(cid:4) encodes the Lebesgue measure of its own sublevel sets. We will measure (cid:7)f(cid:4) that is, optimization progress in terms of by δ > 0 amounts to reducing the volume of better points by δ.

(cid:7)F(cid:4)-valores. Decreasing the spatial suboptimality

Due to the rank-based nature of the algorithms under study we cannot expect to
fulfill a sufficient decrease condition based on f -values. This is because a functional gain
(cid:5) := f (X) − f (X(cid:5)) > 0 achieved by moving from x to x(cid:5) can be reduced to an arbitrarily
small or large gain φ(F (X)) − φ(F (X(cid:5)
)), where ϕ is strictly monotonically increasing, y
the class of transformations does not allow to bound the difference uniformly, neither
additively nor multiplicatively. En cambio, the following theorem establishes a progress or
(cid:7)F(cid:4). It gets
decrease guarantee measured in terms of the spatial suboptimality function
around the problem of inconclusive values in objective space (cual, in case of single-
objective optimization, is just the real line) by considering a quantity in search space,
namely the reference measure of the sublevel set.

The algorithm is randomized; hence the decrease follows a distribution. The fol-

lowing definition captures properties of this distribution.

Definición 3: Let P denote a probability distribution on X with a bounded density with respect
a (cid:4) and let f : X → R be a measurable objective function. The quantity
(cid:5)
(cid:13)
(cid:5)
(cid:5)
(cid:5) A ⊂ X measurable with (cid:4)(A) > 0

tu := sup

(cid:12)

PAG (A)
(cid:4)(A)

is an upper bound on the density. Consider a sample x ∼ P . Define the functions

r < : R → [0, 1], r ≤ : R → [0, 1], z (cid:10)→ Pr z (cid:10)→ Pr (cid:3) (cid:2) f (x) < z (cid:3) (cid:2) f (x) ≤ z = P = P (cid:3) (cid:15) (cid:2) S< f (z) (cid:14) S≤ f (z) of probabilities of strict and weak improvements. Furthermore, we define s : [0, 1] → R as a (cid:3) (cid:2) ≤ q ≤ r ≤ for all q ∈ [0, 1]. We collect measurable inverse function fulfilling r < s(q ) (cid:8) the discontinuities of r < and r ≤ in the set Z := z ∈ R (cid:2) s(q ) (cid:5) (cid:5) r <(z) < r ≤ and define the sum (z) (cid:9) (cid:3) ζ := (cid:14) r ≤ (cid:16) z∈Z (cid:15) 2 (z) − r <(z) of squared improvement jumps. Note that u, r <, r ≤ , s, Z, and ζ implicitly depend on (cid:4), P , and f . This is not indicated explicitly in order to avoid excessive clutter in the notation. Evolutionary Computation Volume 28, Number 1 33 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 T. Glasmachers If the function f is continuous with continuous domain X and without plateaus, then r < and r ≤ coincide, we have ζ = 0, and s maps each probability q ∈ [0, 1] to the corresponding unique quantile of the distribution of f (x) under P . However, if there exists a plateau within the support of P (a level set of positive P -measure, that is, if X is discrete), then ζ is positive and on Z the function s takes values anywhere between the lower quantile P (f (x) < z) and the upper quantile P (f (x) ≤ z). The exact value does not matter, since the only use of s-values is as arguments to one of the r-functions. Indeed, r <(s(q )) and r ≤ (s(q )) “round” the probability q down or up, respectively, to the closest value that is attainable as the probability of sampling a sublevel set. The freedom in the choice of s can also be understood in the context of Figure 1: if the point z in the definitions of r < and r ≤ is located on the plateau, then s(q ) can be the anywhere between the probability mass of the sub-level set excluding and including the plateau. With these definitions in place, the following theorem controls the expected value as well as the quantiles of the decrease distribution. Theorem 1: Let P denote a probability distribution on X with a bounded density with respect to (cid:4) and let f : X → R be a measurable objective function. We use the notation of the above definition. Fix a reference point m ∈ X and let p := r < denote the probability of strict (cid:7)f < improvement of a sample x ∼ P over m. Then for each q ∈ [0, p], the q-quantile of the (cid:4) - (cid:3) (cid:2) s(q ) (cid:2) f (m) (cid:3) and the q-quantile of the (cid:7)f ≤ (cid:4) -decrease is bounded decrease is bounded from below by p−r < u (cid:3) (cid:2) s(q ) p−r ≤ u by (cid:3) , i.e., (cid:2) + (cid:4) Lf (m) (cid:17) (cid:7)f < (cid:4) (m) − (cid:7)f < Pr (cid:4) (x) ≥ (cid:17) (cid:4) (m) − (cid:7)f ≤ (cid:7)f ≤ (cid:4) (x) ≥ Pr (cid:18) (cid:3) ≥ q, (cid:2) s(q ) p − r < u (cid:2) s(q ) p − r ≤ u (cid:3) (cid:2) + (cid:4) (cid:18) (cid:3) ≥ q. Lf (m) The expected (cid:7)f < (cid:4) -decrease is bounded from below by (cid:19) E (cid:8) max 0, (cid:7)f < (cid:4) (m) − (cid:7)f < (cid:4) (x) (cid:9)(cid:20) ≥ p2 + ζ 2u , and the expected (cid:7)f ≤ (cid:4) -decrease is bounded from below by (cid:19) E (cid:8) max 0, (cid:7)f ≤ (cid:4) (m) − (cid:7)f ≤ (cid:4) (x) (cid:9)(cid:20) ≥ p2 + ζ 2u (cid:2) + (cid:4) Lf (m) (cid:3) . Proof: We start with the first two claims, which provide lower bounds on the q- quantiles of probabilities of improvement by some margin δ ≥ 0. The argument here (cid:7)f(cid:4)-improvement of δ from m to x means that the (cid:7)f(cid:4)-sublevel set of is elementary: an x is smaller than that of m by (cid:4)-mass δ (due to the offspring x improving upon its (cid:7)f(cid:4)-sublevel sets of parent m). This corresponds to a difference in P -mass of the same at most u · δ, which will correspond to q in the following. Note that the probabilities (Pr(. . .)-notation) correspond to the same distribution P from which x is sampled, and (cid:7)f(cid:4)-values and s-values directly correspond to (cid:4)-mass. The situation is illustrated that in Figure 3. To make the above argument precise we fix q and define the f -level (cid:17) (cid:4) y ∈ R (cid:5) (cid:5) (cid:5) P (S≤ f (y)) ≥ q (cid:6) (cid:18) . yq := inf 34 Evolutionary Computation Volume 28, Number 1 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point Figure 3: Illustration of the quantile decrease, here in the continuous case. The optimum is marked with a flag. In this example, the level lines of the objective function f are star-shaped. The circle with the dashed shading on the right indicates the sampling distribution, which has ball-shaped support in this case. The probability of the area A ∪ B is the value p = P (A ∪ B ), and q = P (A) is the probability of the event of interest, corresponding to a significant improvement. The area (cid:4)(B ) is a lower bound on the (cid:7)f(cid:4). It is lower bounded by P (B ) u . The (bold) level line improvement in terms of separating A and B belongs to A, and not to B. Therefore, if this set has positive measure, then we can only guarantee q ≤ P (A) (in contrast to equality), and the lower bound becomes p−r <(s(q )) ≤ (cid:4)(B ). = p−q u u l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 (cid:2) (cid:3) S≤ f (yq ) f (m) \ S≤ f (yq ). The nested sublevel sets S< For q = 0 the first two statements are trivial. For q > 0 the infimum is attained and it
F (yq ), B := Lf (yq ), y
≥ q. We define three disjoint sets: A := S< thus holds P C := S< f (m) = A ∪ B ∪ C are unions of these sets. By the definitions of p and q the probability of the f (yq )) ≤ p − q, and the probability set C is upper bounded by P (C) = P (S< of A ∪ B is lower bounded by P (A ∪ B ) ≥ q. (cid:4) (m) − (cid:7)f < (cid:7)f < q := (cid:7)f < (cid:7)f < (cid:4) -level z< We will show that the event of interest for the first claim, namely f (yq ) = A ∪ B, and S< f (m)) − P (S≤ f (yq ) = A, S≤ (cid:4) (x) ≥ (cid:4) (m) − p−r <(s(q )) u p−r <(s(q )) u , implies x ∈ S≤ and the set (cid:5)< f (yq ) = A ∪ B. To this end we define the q := S< (z< (cid:7)f < (cid:14) q ). We have (cid:15) (cid:14) (m) \ S≤ (cid:7)f < (cid:15) (cid:4) (cid:4) (cid:4)((cid:5)< q ) = (cid:4) (cid:21) (m) (cid:24) S< (cid:7)f < (cid:22)(cid:23) (cid:4) =(cid:7)f < (cid:4) (m) − (cid:4) (cid:21) ≥(cid:7)f < (z< q ) S≤ (cid:7)f < (cid:24) (cid:22)(cid:23) (cid:4) (cid:4) (m)− p−r< (s(q )) u ≤ p − r <(s(q )) u , (cid:4) (z< ⊂ B ∪ C, and hence A ⊂ S≤ (cid:7)f < q ). However, due to the definition of S≤ q ) ≤ p − r <(s(q )) by the definition of u. Together with Lemma 1, this (in and hence P ((cid:5)< implies (cid:5)< q contrast to S<), the sublevel set A being a subset of S≤ (cid:7)f < set B is contained in S≤ (z< q ). This shows the first claim. (cid:7)f < (cid:7)f ≤ (cid:4) -level z≤ For the second claim we define the (zq ), and we note that it holds (cid:2) (cid:4) (m) − p−r ≤ − (cid:4) Lf (m) (cid:3) (cid:2) (cid:7)f ≤ and the set (cid:5)≤ = (cid:4) (m) − (cid:4) Lf (m) (cid:7)f < q ) ≤ p − r ≤ (cid:4) (m). Then, with an analogous argument as above we obtain P ((cid:5)< ⊂ C and hence at A ∪ B ⊂ S≤ this case we immediately arrive at (cid:5)≤ (cid:7)f q the second claim. (s(q )). In q ), which shows q ) implies that also the level (m) \ S≤ (cid:7)f q := (cid:7)f ≤ q := S< (cid:7)f (s(q )) u (z≤ (z< ≤ (cid:4) ≤ (cid:4) ≤ (cid:4) (cid:3) (cid:4) (cid:4) Evolutionary Computation Volume 28, Number 1 35 T. Glasmachers Let Q denote the quantile function (the generalized inverse of the cdf) of the . Then the expectation is lower bounded by (cid:7)f < (cid:4) -improvement max (cid:8) 0, (cid:7)f < (cid:9) (cid:4) (m) − (cid:7)f < (cid:9)(cid:20) (cid:4) (x) (cid:25) 1 (cid:19) E max (cid:8) 0, (cid:7)f < (cid:4) (m) − (cid:7)f < (cid:4) (x) = Q(q ) dq (cid:2) s(q ) p − r < u p − q u p − q u dq + dq + (cid:3) dq (cid:25) p 0 (cid:16) z∈Z 0 (cid:25) p 0 (cid:25) p 0 (cid:25) p 0 ≥ = = = p2 2u + (cid:2) (cid:16) z∈Z r ≤(z) − r <(z) 2u (cid:2) (cid:3) dq s(q ) q − r < u q − r <(z) u r ≤ (z) (cid:25) r <(z) (cid:3) 2 = p2 + ζ 2u . dq l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 The proof of the expected again comes from (cid:7)f ≤ (cid:4) (m) = (cid:7)f < (cid:2) (cid:7)f ≤ (cid:4) improvement is analogous. The additional term (cid:4) Lf (m) (cid:2) (cid:2) (cid:4) (m) + (cid:4) Lf (m) (cid:3) . (cid:3) In our application of the above theorem to the (1 + 1)-ES x corresponds to the offspring point sampled from a Gaussian centered on m. (cid:2) Due to the term (cid:4) in the decrease of (cid:7)f ≤ (cid:4) , the theorem covers the fitness-level method (Droste et al., 2002; Wegener, 2003). However, in particular for search distribu- tions spreading their probability mass over many level sets, the theorem is considerably stronger. Lf (m) (cid:3) In the continuous case, in the absence of plateaus, the statement can be simplified considerably: Corollary 1: Under the assumptions and with the notation of Definition 3 and Theorem 1 we assume in addition that all level sets of f have measure zero. Then for each q ∈ [0, p], the q-quantile of the (cid:7)f(cid:4)-decrease is bounded from below by and the expected (cid:27) Pr (cid:26) (cid:7)f(cid:4)(m) − (cid:7)f(cid:4)(x) ≥ p − q u (cid:7)f(cid:4)-decrease is bounded from below by (cid:8) 0, (cid:7)f(cid:4)(m) − (cid:7)f(cid:4)(x) max E (cid:19) (cid:9)(cid:20) ≥ q, ≥ p2 2u . The following corollary is a broken down version for Gaussian search distributions N (m, C) with mean m and covariance matrix C, which has the density (cid:27) ϕ(x) = 1 (cid:28) (2π )d/2 det(C) exp (x − m)T C−1(x − m) . (cid:26) − 1 2 Corollary 2: Consider the search space Rd and the Lebesgue measure (cid:4). Let f : Rd → R denote a measurable objective function with level sets of measure zero. Consider a normally 36 Evolutionary Computation Volume 28, Number 1 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point distributed sample x ∼ N (m, C). Under the assumptions and with the notation of Definition 3 (cid:7)f(cid:4)-decrease is bounded from below by and Theorem 1, for each q ∈ [0, p], the q-quantile of the (cid:14) (cid:7)f(cid:4)(m) − (cid:7)f(cid:4)(x) ≥ (2π )d/2 · (cid:28) Pr (cid:15) det(C) · (p − q ) ≥ q, and the expected (cid:7)f(cid:4)-decrease is bounded from below by (cid:19) E (cid:8) 0, (cid:7)f(cid:4)(m) − (cid:7)f(cid:4)(x) max (cid:9)(cid:20) ≥ (2π )d/2 · (cid:28) det(C) · p2 2 . An isotropic distribution with component-wise standard deviation (step size) σ > 0
has covariance matrix C = σ 2I , where I ∈ Rd×d is the identity matrix; hence we have
(cid:28)
det(C) = σ d . In the context of continuous search spaces, Jägersküpper (2003) refers
(cid:7)F(cid:4)-progress as “spatial gain.” He analyzes in detail the gain distribution of an
a
isotropic search distribution on the sphere model. This result is much less general than
the previous corollary, since we can deal with arbitrary objective functions, cuales son
characterized (en la zona) only by a single number, the success probability. For the special
case of a Gaussian mutation and the sphere function, Jägersküpper’s computation of the
spatial gain is more exact, since it is tightly tailored to the geometry of the case, en con-
trast to being based on a general bound. We lose only a multiplicative factor of the gain,
which does not impact our analysis significantly. Sin embargo, it should be noted that in
the problem analyzed by Jägersküpper, the factor grows with the problem dimension d.
The spatial gain is closely connected to the notion of a progress rate (Rechenberg, 1973),
in particular if the gain is lower bounded by a fixed fraction of the suboptimality. Para
fixed objective function like the sphere model f (X) = (cid:11)X(cid:11)2 it is easy to relate functional
suboptimality f (X) − f ∗

to spatial suboptimality

(cid:7)F(cid:4)(X).

Success-Based Step Size Control in the (1 + 1)-ES

3
En esta sección, we discuss properties of the (1 + 1)-ES algorithm and provide an anal-
ysis of its success-based step size adaptation rule that will allow us to derive global
convergence theorems. To this end we introduce a nonstandard regularity property.

From here on, we consider the search space Rd , equipped with the standard Borel
σ -algebra, y (cid:4) denotes the Lebesgue measure. Por supuesto, all results from the previous
section apply, with X = Rd .

In each iteration t ∈ N, the state of the (1 + 1)-ES is given by (metro(t ), pag (t )) ∈ Rd ×
. It samples one candidate offspring from the isotropic normal distribution x (t )
(cid:3)
. The parent is replaced by successful offspring, meaning that the off-

R+
norte (metro(t ),
spring must perform at least as well as the parent.

pag (t ))2I

(cid:2)

The goal of success-based step size adaptation is to maintain a stable distribution
of the success rate, Por ejemplo, concentrated around 1/5. This can be achieved with
a number of different mechanisms. Here we consider the maybe simplest such mech-
anism, namely immediate adaptation based on “success” or “failure” of each sample.
Pseudocode for the full algorithm is provided in Algorithm 1.

Constants c− < 0 and c+ > 0 in Algorithm 1 control the change of log(pag ) in case of
failure and success, respectivamente. They are parameters of the method. For c+ + 4 · c− = 0
we obtain an implementation of Rechenberg’s classic 1/5-rule (Rechenberg, 1973). Nosotros
call τ = c−
c−−c+ the target success probability of the algorithm, which is always assumed
to be strictly less than 1/2. This is equivalent to c+ > −c−. A reasonable parameter set-
ting is c−, c+ ∈ (cid:2)

(cid:3)
.

(cid:2)

1
d

Volumen de cálculo evolutivo 28, Número 1

37

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

mi
d
tu
mi
v
C
oh
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

/

2
8
1
2
7
2
0
2
0
3
7
3
mi
v
C
oh
_
a
_
0
0
2
4
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

t. Glasmachers

Two properties of the algorithm are central for our analysis: it is rank-based and
it performs elitist selection, ensuring that the best-so-far solution is never lost and the
sequence f (metro(t )) is monotonically decreasing.

Since step-size control depends crucially on the concept of a fixed rate of successful
offspring, we define the success probability of the algorithm, which is the probability
of a sampled point outperforming the parent in the search distribution center.
Definición 4: For a measurable function f : Rd → R, we define the success probability func-
ciones

F : Rd × R+ → [0, 1],
pag< (m, σ ) (cid:10)→ Pr (cid:25) = p≤ f : Rd × R+ → [0, 1], (m, σ ) (cid:10)→ Pr (cid:25) (cid:14) f (x) < f (m) 1 (2π )d/2σ d S< f (m) (cid:14) f (x) ≤ f (m) (cid:5) (cid:5) (cid:5) x ∈ N (m, σ 2I ) (cid:15) (cid:26) − (cid:11)x − m(cid:11)2 2σ 2 · exp (cid:5) (cid:5) (cid:5) x ∈ N (m, σ 2I ) (cid:15) 1 (2π )d/2σ d (cid:26) · exp − (cid:11)x − m(cid:11)2 2σ 2 = ≤ f (m) S (cid:27) (cid:27) dx, dx. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . f computes the probability of sampling a strictly better point. If p< The function p≤ f computes the probability of sampling a point at least as good as m, while p< f co- incide (i.e., if there are no plateaus), then we write pf . A nice property of the success probability is that it does not drop too quickly when increasing the step size: Lemma 2: For all m ∈ Rd , σ > 0 and a ≥ 1 it holds

f and p≤

F (metro, a · σ ) ≥ 1
pag< ad f (m, a · σ ) ≥ 1 p≤ ad · p< f (m, σ ), · p≤ f (m, σ ). The proof is found in the appendix; this is the case for a number of technical lemmas in this section. The next step is to define a plausible range for the step size. Definition 5: For p ∈ [0, 1] and a measurable function f : Rd → R, we define upper and lower bounds (cid:4) p (m) := inf ξ f p (m) := sup ηf σ ∈ R+ (cid:4) σ ∈ R+ (cid:5) (cid:5) (cid:5) p< (cid:5) (cid:5) (cid:5) p≤ f (m, σ ) ≤ p f (m, σ ) ≥ p (cid:6) (cid:6) on the step size guaranteeing lower and upper bounds on the probability of improvement. p (m) with p > τ as a “too small” step size at m. Similarmente, for p < τ , ηf We think of ξ f p (m) is a “too large” step size at m. Assume that the two values of p are chosen so that a sufficiently wide range of “well-adapted” step sizes exists in between the “too small” and “too large” ones. We aim to establish that if the step size is outside this range, then step size adaptation will push it back into the range. The main complication is that the range for σ depends on the point m. The following lemma establishes a gap between lower and upper step size bound, that is, a lower bound on the size of the step size range. √ √ pT (x) ≤ d Lemma 3: For 0 ≤ pH ≤ pT ≤ 1 it holds d pH · ξ f pT · ηf pH (x) for all x ∈ Rd . 38 Evolutionary Computation Volume 28, Number 1 / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point Figure 4: Illustration of a contour line with a kink opening up in an angle indicated by the dashed lines. The circles are iso-density lines on the isotropic Gaussian search distribution centered on the kink. The following definition is central. It captures the ability of the (1 + 1)-ES to recover from a state with a far too small step size. This property is needed to avoid premature convergence. Definition 6: For p > 0, a function f : Rd → R is called p-improvable in x ∈ Rd if ξ f
is positive. The function is called p-improvable on Y ⊂ Rd if ξ f
pag
to Y ) is lower bounded by a positive, lower semi-continuous function ˜ξ f
x ∈ Rd is called p-critical if it is not p-improvable for any p > 0.

pag (X)
p restricted
pag : Y → (0, 1]. A point

Y (the function ξ f

(cid:5)
(cid:5)

The property of p-improvability is a nonstandard regularity condition. The concept
applies to measurable functions; hence we do not need to restrict ourselves to smooth
or continuous objectives. On the one hand side, the property excludes many measur-
able and even some smooth functions. Por otro lado, it is far less restrictive than
continuity and smoothness, in the sense that it allows the objective function to jump
and the level sets to have kinks. Intuitivamente, in the two-dimensional case illustrated in
Cifra 4, if for each point the sublevel set opens up in an angle of more than 2πp, entonces
the function is p-improvable. This is the case for many discontinuous functions, cómo-
alguna vez, not for all smooth ones. The degree three polynomial f (x1, x2) = x3
2 can serve
1
as a counter example, since every point of the form (x1, 0) is p-critical. All of its con-
tour lines form cuspidal cubics; ver figura 6 en la sección 5.3. Local optima are always
p-critical, but many critical points of smooth functions are not (see below). The above
example demonstrates that some saddle points share this property; sin embargo, if x is p-
critical but not locally optimal, then p< f (x, σ ) > 0 for all σ > 0. This means that such a
point can be improved with positive probability for each choice of the step size, pero en
the limit σ → 0 the probability of improvement tends to zero.

+ x2

We should stress the difference between point-wise p-improvability, which simply
demands that ξ f
p is positive, and set-wise p-improvability, which in addition demands
that ξ f
p is lower bounded by a lower semicontinuous positive function. The latter prop-
erty ensures the existence of a positive lower bound for ξ f
p on a compact set. En esto
sense, set-wise p-improvability is uniform on compact sets. En secciones 5.5 y 5.6, nosotros
will see examples where this makes a decisive difference.

Intuitivamente, the value of p of a p-improvable function is critical: if it is below τ , entonces
the algorithm may be endangered to systematically decrease its step size while it should
better do the contrary.

Volumen de cálculo evolutivo 28, Número 1

39

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

mi
d
tu
mi
v
C
oh
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

/

2
8
1
2
7
2
0
2
0
3
7
3
mi
v
C
oh
_
a
_
0
0
2
4
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

t. Glasmachers

The next lemma establishes that smooth functions are p-improvable in all regular

puntos, and also in most saddle points.
Lema 4: sea ​​f : Rd → R be continuously differentiable.

1. For a regular point x ∈ Rd , f is p-improvable in x for all p < 1 2 . 2. Let Y denote the set of all regular points of f , then f is p-improvable on Y , for all p < 1 2 . 3. Let x ∈ Rd denote a critical point of f , let f be twice continuously differentiable in a neighborhood of x, and let H = ∇2f (x) denote the Hessian matrix. If H has at least one negative eigen value, then x is not p-critical. Similarly, we need to ensure that the step size does not diverge to ∞. This is easy, since the spatial suboptimality is finite: Lemma 5: Consider the state (m(t ), σ (t )) of the (1 + 1)-ES. For each p ∈ (0, 1), if (cid:29) σ (t ) ≥ d (cid:7)f(cid:4)(m(t )) p · (2π )d/2 then p< f (m(t ), σ (t )) ≤ p. In other words, a too large step size is very likely to produce unsuccessful offspring. The probability of success decays quickly with growing step size, since the step size bound grows slowly in the form (cid:3)(p−1/d ) as the success probability p decays to zero. Applying the above inequality to p < τ implies that for large enough step size σ (t ), the expected change E[log(σ (t+1)) − log(σ (t ))] in the (1 + 1)-ES (Algorithm 1) is negative. The following lemma is elementary. It is used multiple times in proofs, with the interpretation of the event “1” meaning that a statement holds true. It has a similar role as drift theorems in an analysis of the expected or high-probability behavior (Lehre and Witt, 2013; Lengler and Steger, 2016; Akimoto et al., 2018); however, here we aim for almost sure results. Lemma 6: Let X(t ) ∈ {0, 1} denote a sequence of independent binary random variables. If there exists a uniform lower bound Pr(X(t ) = 1) ≥ p > 0, then almost surely there exists an infinite
subsequence (tk )k∈N so that X(tk ) = 1 for all k ∈ N.

In applications of the lemma, the events of interest are not necessarily independent;
sin embargo, they can be “made independent” by considering a sequence of independent
events that imply the events of interest. In our applications, this is the case if the events
of actual interest hold with probability of at least p; then an i.i.d. sequence of Bernoulli
events implying corresponding sub-events with probability of exactly p does the job.
En otras palabras, we will have a sequence ˜X(t ) of independent events, where ˜X(t ) = 1
implies X(t ) = 1. The above lemma is then applied to ˜X(t ), which trivially yields the
same statement for X(t ). We imply this construction in all applications of the lemma.

The following lemma establishes, under a number of technical conditions, that the
step size control rule succeeds in keeping the step size stable. If the prerequisites are
fulfilled, then the result yields an impossible fact, namely that the overall reduction
of the spatial suboptimality is unbounded. So the lemma is designed with proofs by
contradiction in mind.
(cid:2)
metro(t ), pag (t )

denote the sequence of states of the (1 + 1)-ES on a measurable ob-
Lema 7: Dejar
jective function f : Rd → R. Let pT , pH ∈ (0, 1) denote probabilities fulfilling pH < τ < pT (cid:3) 40 Evolutionary Computation Volume 28, Number 1 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point and pH pT ≤ ed·c− , and assume the existence of constants 0 < bT < bH such that bT ≤ ξ f pT (m(t )) and ec+ · ηf pH (m(t )) ≤ bH for all t ∈ N. Then, with full probability, there exists an infinite subsequence (tk )k∈N of iterations fulfilling (cid:19) ξ f pT (cid:2) m(tk ) (cid:3) , ηf pH (cid:2) m(tk ) (cid:3)(cid:20) (1) for all k ∈ N. σ (tk ) ∈ Equation (1) is a rather weak condition demanding that step-size adaptation works as desired. However, the requirement of a uniform lower bound bT on the step size (cid:7)f(cid:4)-progress together with Theorem 1 implies that the (1 + 1)-ES would make infinite (cid:7)f(cid:4) is by definition in expectation. This is of course impossible if non-negative. Therefore the lemma does not describe a typical situation observed when running the (1 + 1)-ES, but quite in contrast, an impossible situation that needs to be excluded in the proof of the main result in the next section. (cid:7)f(cid:4)(m(0)) is finite, since 4 Global Convergence l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 In this section, we establish our main result. The theorem ensures the existence of a limit point of the sequence m(t ) in a subset of desirable locations. In many cases this amounts to convergence of the algorithm to a (local) optimum. Theorem 2: Consider a measurable objective function f : Rd → R with level sets of measure zero. Assume that K0 := S≤ ⊂ K0 denote a closed subset. If f is is compact, and let K1 (cid:3) p-improvable on K0 t∈N has a limit point in K1. Proof: Lemma 5 ensures the existence of 0 < pH < e−d·c− · τ and \ K1 for some p > t , then the sequence

(cid:2)
metro(0)

(cid:2)
metro(t )

(cid:3)

F

(cid:29)
bH := d

(cid:2)

(cid:3)

(cid:7)F(cid:4)
metro(0)
pH · (2Pi )d/2

such that it holds ηf
pH (X) ≤ bH uniformly for all x ∈ K0. En particular, bH is a uniform
upper bound on ηf
pH .

Let B(X, r ) denote the open ball of radius r > 0 around x ∈ Rd and define the com-
(cid:30) (cid:31)

pact set

k (r ) := K0

B(X, r ).

\ K1 and

r>0 K (r ) = K0

x∈K1
\ K1; hence K (r ) is a compact exhaustion

It holds K (r ) ⊂ K0
\ K1.
of K0

Fix r > 0, and assume for the sake of contradiction that all points m(t ), t > t0, son
pT denote the positive lower semicontinuous
pT , which is guaranteed to exist due to the p-improvability of f . Nosotros

contained in K (r ). We set pT := p. Let ˜ξ f
lower bound on ξ f
define

(cid:6)

bT := min

(cid:4)
˜ξ f
pT (metro)

(cid:5)
(cid:5)
(cid:5) m ∈ K (r )

> 0

and apply Lemma 7 to obtain an infinite subsequence of states with step size lower
bounded by σ (t ) ≥ bT > 0. According to Lemma 2, the success probability is lower
≥ pI := (bT /bH )d · pT > 0 for all m ∈ K (r ) and σ ∈ [bT , bH ].
bounded by pf
(cid:7)F(cid:4)-valor
· pI /2 is lower bounded by pI /2 > 0. We apply Lemma 6 con

Corolario 2 ensures that in each such state the probability to decrease the

by at least (2Pi )d/2 · bd
t

(cid:2)
metro(t ), pag (t )

(cid:3)

Volumen de cálculo evolutivo 28, Número 1

41

t. Glasmachers

t

the following construction. For each state (metro, pag ) we pick a set E(metro, pag ) ⊂ Rd of proba-
· pI /2. Then we model the
bility mass pI /2 improving on
sampling procedure of the (1 + 1)-ES in iteration t as a two-stage process: first we draw
a binary variable ˜X(t ) ∈ {0, 1} with Pr( ˜X(t ) = 1) = pI /2, and then we draw x (t ) from a
Gaussian restricted to E(metro(t−1), pag (t−1)) if ˜X(t ) = 1, and restricted to the complement oth-
erwise. The variables ˜X(t ) are independent, by construction.

(cid:7)F(cid:4)(metro) by at least (2Pi )d/2 · bd

(cid:7)F(cid:4)(metro(0)) is finite and

Then Lemma 6 implies that the overall

(cid:7)F(cid:4)-decrease is almost surely infinite, cual
(cid:7)F(cid:4) is lower bounded by zero. Por eso, el
contradicts the fact that
sequence m(t ) leaves K (r ) after finitely many steps, almost surely. For r = 1/n, let tn
(cid:3)
denote an iteration fulfilling m(tn ) (cid:16)∈ K (r ). The sequence
n∈N does not have a limit
\ K1 (since that point would be contained in K (r ) for some r > 0), sin embargo,
point in K0
due to the Bolzano-Weierstraß theorem it has at least one limit point in K0, which must
(cid:2)
therefore be located in K1.

metro(tn )

(cid:2)

The above theorem is of primary interest if K1 is the set of (local) minima of f , o
at least the set of critical or p-critical points. Due to the prerequisites of the theorem we
always have

(cid:4)

x ∈ K0

(cid:5)
(cid:5)
(cid:5) x is p-critical

(cid:6)

⊂ K1,

eso es, p-critical points are candidate limit points.

In accordance with Akimoto et al. (2010), the following corollary establishes con-

vergence to a critical point for continuously differentiable functions.

(cid:3)

(cid:2)

= S≤
F

Corolario 3: sea ​​f : Rd → R be a continuously differentiable function with level sets of mea-
(cid:3)
sure zero. Assume that K0
t∈N has a critical
limit point.
Prueba: Define K1 := {x ∈ K0
pact. Lema 4 ensures that f is p-improvable on K0
follows immediately from Theorem 2.

| ∇f (X) = 0} as the set of critical points. This set is com-
\ K1 for all p < 1/2. Then the claim (cid:2) is compact. Then the sequence m(t ) m(0) (cid:2) l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Technically the above statements do not apply to problems with unbounded sub- level sets. However, due to the fast decay of the tails of Gaussian search distributions we can often approximate these problems by changing the function “very far away” from the initial search distribution, in order to make the sublevel sets bounded. We may then even apply the theorem with empty K1, which implies that after a while the approxi- mation becomes insufficient since the algorithm diverges. In this sense we can conclude divergence, for example, on a linear function. We will use this argument several times in the next section, mainly to avoid unnecessary technical complications when defining saddle points and ridge functions. We may ask whether p-improvability for p > τ is not only a sufficient but also a
necessary condition for global convergence. This turns out to be wrong. The quadratic
saddle point case discussed in Section 5.2 is a counter example, where the algorithm di-
verges reliably even if the success probability is far smaller than τ . A diferencia de, the ridge
of p-critical saddle points analyzed in Section 5.3 results in premature convergence, de-
spite the fact that the critical points form a zero set, and this can even happen for a
ridge of p-improvable points with p < τ ; see Section 5.4. Drift analysis is a promising tool for handling all of these cases. Here we provide a rather simple result, which still suffices for many interesting cases. A related analysis for a nonelitist ES was carried out by Beyer and Meyer-Nieberg (2006). 42 Evolutionary Computation Volume 28, Number 1 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point Theorem 3: Consider a measurable objective function f : Rd → R with level sets of measure zero. Let m ∈ Rd be a p-critical point. If the success probability decays sufficiently quickly, that is, if ∞(cid:16) p≤ f (m, ek·c− ) < ∞ k=0 then for each given p < 1 there exists an initial condition such that the (1 + 1)-ES converges to m with probability of at least p. !∞ f (m, ek·c− ). For given p < 1, there exists Proof: Define the zero sequence SK := a K0 such that SK0 < 1 − p. By definition, the probability of never sampling a successful ·c− is given offspring when starting the algorithm in the initial state m(0) = m, σ (0) = eK0 (cid:2) by SK0 ; in this case we have m(t ) = m for all t ∈ N. The above theorem precludes global convergence to a (local) optimum with full proba- bility in the presence of a suitable nonoptimal p-critical point. k=K p≤ 5 Case Studies In this section, we analyze various example problems with very different character- istics, by applying the above convergence analysis. We characterize the optimization behavior of the (1 + 1)-ES, giving either positive or negative results in terms of global convergence. We start with smooth functions and then turn to less regular cases of non- smooth and discontinuous functions. On the one hand side, we show that the theorem is applicable to interesting and nontrivial cases; on the other hand we explore its limits. 5.1 The 2-D Rosenbrock Function The two-dimensional Rosenbrock function is given by − x2)2 + (x1 f (x1, x2) := 100(x2 1 − 1)2. This is a degree four polynomial. The function is unimodal (has a single local minimum), but not convex. Moreover, it does not have critical points other than the global optimum x∗ = (1, 1). The function is illustrated in Figure 5. The Rosenbrock function is a popular test problem because it requires a diverse set of optimization behaviors: the algorithm must descend into a parabolic valley, follow the valley while adapting to its curved shape, and finally converge into the global opti- mum, which is a smooth optimum with nontrivial (but still moderate) conditioning. Corollary 3 immediately implies convergence of the (1 + 1)-ES into the global opti- mum. It does not say anything about the speed of convergence; however, Jägersküpper (2006a) established linear convergence in the last phase with overwhelming probability (however, using a different step size adaptation rule). Taken together, these results give a rather complete picture of the optimization pro- cess: irrespective of the initial state we know that the algorithm manages to locate the global optimum without getting stuck on the way. Once the objective function starts to look quadratic in good enough approximation, Jägersküpper’s result indicates that linear convergence can be expected. The same analysis applies to all twice continuously differentiable unimodal functions without critical points other than the optimum. 5.2 Saddle Points—The p-Improvable Case We consider the quadratic objective function f (x1, x2) := a · x2 1 − x2 2 Evolutionary Computation Volume 28, Number 1 43 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 T. Glasmachers Figure 5: The 2-D Rosenbrock function in the range [−2, 2] × [−1, 3]. √ with parameter a > 0. The origin is a saddle point. It is p-improvable for all p < 2 cot−1( a)/π (see the appendix for details). For small enough a, the success proba- bility is larger than τ and Corollary 3 applies, while for large values of a the success probability decays to zero and we lose all guarantees. Simulations show that the ES overcomes the zero level set containing the saddle point without a problem, also for large values of a. It seems that p-improvable sad- dle points do not result in premature convergence of the algorithm, irrespective of the value of p > 0. Sin embargo, this statement is based on an empirical observation, not on a
rigorous proof.

5.3

Saddle Points—The p-Critical Case

The cubic polynomial

F (x1, x2) := x3
1

+ x2
2

has p-critical saddle points on the line R × {0} ⊂ R2 forming a ridge; ver figura 6. Con-
out loss of generality we consider m = 0 ∈ R2 in the following. A successful offspring
≤ 0. For small enough σ and hence for small enough (cid:11)X(cid:11) (cid:17) 1,
x ∈ R2 fulfills x3
1
| ∈ o(pag ). Plugging this
(cid:11)X(cid:11) ∈ (cid:3)(pag ), this implies −x1
into the above inequality we obtain |x2
pag ). Por lo tanto, for small σ we have

p≤
F (0, pag ) ∈ O(

| and hence −x1
·
| ∈ O(−x1
pag ). This implies that the cumulative success probability
∞(cid:16)

∈ (cid:3)(pag ) y |x2

(cid:18) |x2

+ x2
2

∞(cid:16)

(cid:18)

(cid:17)

(cid:26)

(cid:27)

p≤
F (0, et·c− ) =O

et·c−/2

=O

t=0

t=0

1
1 − ec−/2

=O(1)

is finite, and Theorem 3 yields (premature) convergence with arbitrarily high
probabilidad.

5.4

Linear Ridge

Consider the linear ridge objective

F (x1, x2) := x1

+ a · |x2

|

44

Volumen de cálculo evolutivo 28, Número 1

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

mi
d
tu
mi
v
C
oh
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

/

2
8
1
2
7
2
0
2
0
3
7
3
mi
v
C
oh
_
a
_
0
0
2
4
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point

Cifra 6: Level lines of the function f (x1, x2) = x3
1
shows a zoom of factor 10.

+ x2

2 in the range [−1, 1]2. The inset

with parameter a > 0. The function is continuous, and its level sets contain a kink.
De nuevo, the line R × {0} is critical; this is where the function is nondifferentiable. El
−1(a)/Pi < 1/2 (see the appendix). For a → ∞ the function is p-improvable for p < cot success probability decays to zero. As long as cot −1(a)/π > τ we can conclude divergence of the algorithm (el en-
tended behavior) from Theorem 2. Otherwise we lose this property, and it is well known
and easy to check with simulations that for large enough a the algorithm indeed con-
verges prematurely.

5.5

Sphere with Jump

Our next example is an “essentially discontinuous” problem in the sense that in general
no function in the equivalence class [F ] is continuous. We consider objective functions
of the form

F (X) := (cid:11)X(cid:11)2 + 11S (X),
where 11S denotes the indicator function of a measurable set S ⊂ Rd . If S has a suffi-
ciently simple shape then this problem is similar to a constrained problem where S is
the infeasible region (Arnold and Brauer, 2008), at least for small enough σ . As long as
metro(t ) ∈ S the (1 + 1)-ES essentially optimizes the sphere function, and as soon as m(t ) (cid:16)∈ S
el (soft) constraint comes into play.

If S is the complement of a star-shaped open neighborhood of the origin then it is
easy to see that the function is unimodal and p-improvable for all p < 1/2. Theorem 2 applied with K1 := {0} yields the existence of a subsequence converging to the origin, (cid:3) which implies convergence of the whole sequence due to monotonicity of f . The results of Jägersküpper (2005) and Akimoto et al. (2018) imply linear convergence. (cid:2) m(t ) Other shapes of S give different results. For example, for d ≥ 2, if S is a ball not containing the origin then the function is still unimodal. For example, define S as the = (1, 0, . . . , 0) ∈ Rd . Then at m := open ball of radius 1/2 around the first unit vector e1 3/2 · e1 we have ξ f p (m) = 0 for all p > 0, and according to Theorem 3 the algorithm can
converge prematurely if the step size is small. Alternativamente, if S is the closed ball, entonces

Volumen de cálculo evolutivo 28, Número 1

45

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

mi
d
tu
mi
v
C
oh
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

/

2
8
1
2
7
2
0
2
0
3
7
3
mi
v
C
oh
_
a
_
0
0
2
4
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

t. Glasmachers

all points except the origin are p-improvable for all p < 1/2; however, there does not exist a positive lower semicontinuous lower bound on ξ f p in any neighborhood of m = 3/2 · e1, and again the algorithm can converge to this point, irrespective of the target success probability τ . Now consider the strip S := (a, ∞) × (0, 1) ⊂ R2 with parameter a > 0. An elemen-
tary calculation of the success rate at m := (a + ε, 1) for σ → 0 shows that the (1 + 1)-
ES is guaranteed to converge to the optimum irrespective of the initial conditions if
−1(a)/(2Pi ) < τ (details are found in the appendix), that is, if a is large enough; oth- tan erwise the algorithm can converge prematurely to a point on the edge (a, ∞) × {1} of S. 5.6 Extremely Rugged Barrier Let us drive the above discontinuous problem to the extreme. Consider the one- dimensional problem f (x) := x + 11S (x), where S ⊂ [−1, 0] is a Smith-Volterra-Cantor set, also known as a fat Cantor set. S is closed, has positive measure (usually chosen as (cid:4)(S) = 1/2), but is nowhere dense. Counterintuitively, the function is unimodal in the sense that no point is optimal re- stricted to an open neighborhood (which is what commonly defines a local optimum). Still, intuitively, S should act as a barrier blocking optimization progress with high prob- ability. The function is point-wise p-improvable everywhere. However, similar to the closed ball case in the previous section, there is no positive, lower semicontinuous lower bound on ξ f p . Therefore Theorem 2 does not apply. Indeed, unsurprisingly, sim- ulations4 show that the algorithm gets stuck with positive probability when initialized with 0 < x (0) (cid:17) 1 and σ (cid:17) 1. When removing 0 from S, then analogous to Section 5.3 we obtain p≤ σ ) for m = 0 and small σ , and hence Theorem 3 applies. f (m, σ ) ∈ O( √ In contrast, if S is a Cantor set of measure zero then the algorithm diverges success- fully, since it ignores zero sets with full probability. 6 Conclusions and Future Work We have established global convergence of the (1 + 1)-ES for an extremely wide range of problems. Importantly, with the exception of a few proof details, the analysis captures the actual dynamics of the algorithm and hence consolidates our understanding of its working principles. Our analysis rests on two pillars. The first one is a progress guarantee for rank- based evolutionary algorithms with elitist selection. In its simplest form, it bounds the progress on problems without plateaus from below. It seems to be quite generally appli- cable, for example, to runtime analysis and hence to the analysis of convergence speed. The second ingredient is an analysis of success-based step size control. The cur- rent method barely suffices to show global convergence. It is not suitable for deducing stronger statements such as linear convergence on scale invariant problems. Control of the step size on general problems therefore needs further work. Many natural questions remain open, the most significant are listed in the follow- ing. These open points are left for future work. 4Special care must be taken when simulating this problem with floating point arithmetic. Our simula- tion is necessarily inexact; however, not beyond the usual limitations of floating point numbers. It does reflect the actual dynamics well. The fitness function is designed such that the most critical point for the simulation is zero, which is where standard IEEE floating point numbers have maximal precision. 46 Evolutionary Computation Volume 28, Number 1 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point • The approach does not directly yield results on the speed of convergence. How- ever, the progress guarantee of Theorem 1 is a powerful tool for such an anal- ysis. It can provide us with drift conditions and hence yield bounds on the expected runtime and on the tails of the runtime distribution. But for that to be effective we need better tools for bounding the tails of the step size distribution. Here, again, drift is a promising tool. • The current results are limited to step-size adaptive algorithms and do not in- clude covariance matrix adaptation. One could hope to extend the proceeding to the (1 + 1)-CMA-ES algorithm (Igel et al., 2007), or to (1 + 1)-xNES (Glas- machers et al., 2010). Controlling the stability of the covariance matrix is ex- pected to be challenging. It is not clear whether additional assumptions will be required. As an added benefit, it may be possible to relax the condition p > τ for p-improvability, by requiring it only after successful adaptation of
the covariance matrix.

• Plateaus are currently not handled. Teorema 1 shows how they distort the dis-
tribution of the decrease. Worse, they affect step size adaptation, and they make
it virtually impossible to obtain a lower bound on the one-step probability of a
strict improvement. Por lo tanto, proper handling of plateaus requires additional
argumentos.

• In the interest of generality, our convergence theorem only guarantees the exis-
tence of a limit point, not convergence of the sequence as a whole. Creemos
that convergence actually holds in most cases of interest (at least as long as
there are no plateaus; see above). This is nearly trivial if the limit point is an
isolated local optimum; sin embargo, it is unclear for a spatially extended opti-
mum, Por ejemplo, a low-dimensional variety or a Cantor set.

• Our current result requires a saddle point to be p-improvable for some p > τ ,
otherwise the theorem does not exclude convergence of the ES to the saddle
punto. We know from simulations that the (1 + 1)-ES overcomes p-improvable
saddle points reliably, also for p (cid:17) t . A proper analysis guaranteeing this be-
havior would allow the establishment of statements analogous to work on
gradient-based algorithms that overcome saddle points quickly and reliably;
see for example, Dauphin et al. (2014). Sin embargo, this is clearly beyond the
scope of the present article.

• We provide only a minimal negative result stating that the algorithm may in-
deed converge prematurely with positive probability if there exists a p-critical
point for which the cumulative success probability does not sum to infinity. En
Sección 5.5, it becomes apparent that this notion is rather weak, since the state-
ment is not formally applicable to the case of a closed ball, which however
differs from the open ball scenario only on a zero set. This makes clear that
there is still a gap between positive results (global convergence) and negative
resultados (premature convergence). Teorema 3 can certainly be strengthened, pero
the exact conditions remain to be explored. A single p-improvable point with
pag < τ is apparently insufficient. A p-critical point may be sufficient, but it is not necessary. Evolutionary Computation Volume 28, Number 1 47 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 T. Glasmachers Acknowledgments I would like to thank Anne Auger for helpful discussions, and I gratefully acknowledge support by Dagstuhl seminar 17191 “Theory of Randomized Search Heuristics.” References Akimoto, Y., Auger, A., and Glasmachers, T. (2018). Drift theory in continuous search spaces: Expected hitting time of the (1 + 1)-es with 1/5 success rule. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 1922–1925. Akimoto, Y., Nagata, Y., Ono, I., and Kobayashi, S. (2010). Theoretical analysis of evolutionary computation on continuously differentiable functions. In Genetic and Evolutionary Computa- tion Conference, pp. 1401–1408. Arnold, D., and Brauer, D. (2008). On the behaviour of the (1 + 1)-es for a simple constrained problem. In Parallel Problem Solving from Nature, pp. 1–10. Auger, A. (2005). Convergence results for the (1, λ)-SA-ES using the theory of ϕ-irreducible Markov chains. Theoretical Computer Science, 334(1–3): 35–69. Beyer, H.-G., and Meyer-Nieberg, S. (2006). Self-adaptation on the ridge function class: First re- sults for the sharp ridge. In Parallel Problem Solving from Nature, pp. 72–81. Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), Advances in neural information processing systems 27, pp. 2933–2941. Red Hook, NY: Curran Associates, Inc. Diouane, Y., Gratton, S., and Vicente, L. (2015). Globally convergent evolution strategies. Mathe- matical Programming, 152(1–2): 467–490. Droste, S., Jansen, T., and Wegener, I. (2002). On the analysis of the (1+ 1) evolutionary algorithm. Theoretical Computer Science, 276(1–2): 51–81. Gilbert, J., and Nocedal, J. (1992). Global convergence properties of conjugate gradient methods for optimization. SIAM Journal on Optimization, 2(1): 21–42. Glasmachers, T., Schaul, T., and Schmidhuber, J. (2010). A natural evolution strategy for multi- objective optimization. In Parallel Problem Solving from Nature, pp. 627–636. Hansen, N., Arnold, D. V., and Auger, A. (2015). Evolution strategies. In J. Kacprzyk and W. Pedrycz (Eds.), Handbook of computational intelligence, pp. 871–898. Berlin: Springer. Hansen, N., and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2): 159–195. Igel, C., Hansen, N., and Roth, S. (2007). Covariance matrix adaptation for multi-objective opti- mization. Evolutionary Computation, 15(1): 1–28. Jägersküpper, J. (2003). Analysis of a simple evolutionary algorithm for minimization in Eu- clidean spaces. Automata, Languages and Programming, p. 188. Jägersküpper, J. (2005). Rigorous runtime analysis of the (1 + 1) ES: 1/5-rule and ellipsoidal fitness landscapes. In International Workshop on Foundations of Genetic Algorithms, pp. 260– 281. Jägersküpper, J. (2006a). How the (1 + 1) ES using isotropic mutations minimizes positive definite quadratic forms. Theoretical Computer Science, 361(1): 38–56. 48 Evolutionary Computation Volume 28, Number 1 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point Jägersküpper, J. (2006b). Probabilistic runtime analysis of (1+, λ), ES using isotropic mutations. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO), pp. 461–468. Kern, S., Müller, S. D., Hansen, N., Büche, D., Ocenasek, J., and Koumoutsakos, P. (2004). Learn- ing probability distributions in continuous evolutionary algorithms—A comparative review. Natural Computing, 3(1): 77–112. Lehre, P. K., and Witt, C. (2013). General drift analysis with tail bounds. Technical Report. Retrieved from arXiv:1307.2559. Lengler, J., and Steger, A. (2016). Drift analysis and evolutionary algorithms revisited. Technical Re- port. Retrieved from arXiv:1608.03226. Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technisher systeme nach prinzipien der biolo- gischen evolution. Stuttgart: Frommann-Holzboog. Torczon, V. (1997). On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7(1): 1–25. Wegener, I. (2003). Methods for the analysis of evolutionary algorithms on pseudo-Boolean func- tions. In Evolutionary optimization, pp. 349–369. International Series in Operations Research and Management, Vol. 48. Boston: Springer. Wolfe, P. (1969). Convergence conditions for ascent methods. SIAM Review, 11(2): 226–235. Appendix Here we provide the proofs of technical lemmas that were omitted from the main text in the interest of readability. (cid:4) (x(cid:5) Proof of Lemma 1: We have to show that the level sets of all three functions agree out- side a set of measure zero. It is immediately clear from definition 1 that the level sets of (cid:7)f ≤ (cid:7)f ≤ (cid:4) (x) = f are a refinement of the level sets of (cid:4) and (cid:4) (x) < (cid:7)f ≤ (cid:7)f ≤ (cid:4) (x) = (cid:7)f < (cid:7)f < (cid:7)f ≤ (cid:4) (x(cid:5) ) both imply ) and f (x) < f (x(cid:5) ). (cid:7)f < (cid:4) , i.e., f (x) = f (x(cid:5) (cid:4) (x) < (cid:7)f < (cid:7)f < (cid:4) (x(cid:5) ) and (cid:7)f < (cid:7)f ≤ (cid:4) do not join f -level sets of positive measure. (cid:4) and It remains to be shown that (cid:3)−1 (cid:2) (cid:7)f < Let y ∈ R denote a level so that Y = (y) has positive measure (cid:4)(Y ) > 0. Tenemos
to show that this measure (not necessarily the whole set, only up to a zero set) is covered
by a single f -level set. Assume the contrary, for the sake of contradiction. Then we find
ourselves in one of the following situations:

) implies
(cid:4) (X(cid:5)

), y

(cid:4)

)

(cid:3)

1. There exist x, X(cid:5) ∈ Y fulfilling a := f (X) < f (x(cid:5) > 0
(cid:2)
f −1(a(cid:5)
y (cid:4)
> 0. So the mass of Y is split into at least two chunks of positive
(cid:2)
(cid:7)F < f −1(a) (cid:4) (x(cid:5) (cid:4) (x) ≥ (cid:4) > 0, which contradicts the
measure. This implies
(cid:7)F < assumption that x and x(cid:5) (cid:4) -level. belong to the same (cid:2) f −1(I ) 2. There exist x, x(cid:5) ∈ Y fulfilling a = f (x) < f (x(cid:5)) = a(cid:5) and it holds (cid:4) ) =: a(cid:5) (cid:3) ) − (cid:7)f < (cid:3) (cid:2) f −1(a) and it holds (cid:4) (cid:3) > 0
). So Y consists of a continuum of level sets of
(cid:2)
f −1(I )
(cid:4) (X) ≥ (cid:4)
> 0, conduciendo a

) − (cid:7)F < (cid:3) for the open interval I = (a, a(cid:5) (cid:7)f < (cid:4) (x(cid:5) measure zero. Again, this implies the same contradiction as in the first case. (cid:7)f ≤ (cid:4) is exactly analogous. The argument for Proof of Lemma 2: It holds (cid:25) f (m, a · σ ) = p< S< f (m) 1 (2π )d/2ad σ d Evolutionary Computation Volume 28, Number 1 (cid:26) · exp − (cid:27) dx (cid:11)x − m(cid:11)2 2a2σ 2 (cid:2) 49 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 T. Glasmachers (cid:25) ≥ 1 ad · 1 (2π )d/2σ d S< f (m) (cid:26) · exp − (cid:27) dx (cid:11)x − m(cid:11)2 2σ 2 = 1 ad · p< f (m, σ ). The computation for p≤ Proof of Lemma 3: Fix x and define ξ := ξ f so in the following we treat the case that both are positive. For a ≥ 1 it holds (cid:2) pT (x). The cases pH = 0 and ξ = 0 are trivial, f is analogous. (cid:25) pT = = ad · ≤ ad · S< f (x) (cid:25) S< f (x) (cid:25) S< f (x) (cid:26) 1 (2π )d/2ξ d exp − (cid:11)x(cid:5) − x(cid:11)2 2ξ 2 (cid:27) dx(cid:5) 1 (2π )d/2ad ξ d exp 1 (2π )d/2ad ξ d exp (cid:27) (cid:27) (cid:26) − (cid:26) − (cid:11)x(cid:5) − x(cid:11)2 2ξ 2 (cid:11)x(cid:5) − x(cid:11)2 2a2ξ 2 dx(cid:5) dx(cid:5). √ pT /ad (x) from below. Applying the above argument with a = d In other words, the success probability for step size a · ξ is at least pT /ad . Hence, in order to push the success probability below pT /ad , the step size must be at least ξ · a, which therefore bounds ηf pT /pH (cid:2) completes the proof. In a small enough neighborhood of a regular point x the function f Proof of Lemma 4: can be approximated arbitrarily well by a linear function (its first order Taylor polyno- mial). In particular, the level set of f is arbitrarily well approximated by a hyperplane, for which the probability of strict improvement is exactly 1/2. Hence we have which immediately implies the first statement. lim σ →0 f (x, σ ) = 1 p< 2 , We have already seen that the second statement holds point-wise. It remains to be shown that ξ f p |Y is lower bounded by a positive, lower semicontinuous function. To this end we show that ξ f p |Y takes positive values. Consider a convergent sequence (at )t∈N → x ∈ Rd and define ξa := lim inft→∞ ξ f p (at ) and ξx := ξ f p (x). We have to show that it holds ξx ≤ ξa for all choices of x and (at )t∈N. We define p itself is lower-semicontinuous, and we note that ξ f (cid:4) Sx := Sa := (cid:4) and σ ∈ R+ σ ∈ R+ (cid:6) f (x, σ ) ≤ p (cid:5) (cid:5) (cid:5) p< (cid:5) (cid:5) (cid:5) ∃(tk )k∈N : p< (cid:6) f (atk , σ ) ≤ p ∀k ∈ N , which allows us to write ξa = inf(Sa ) and ξx = inf(Sx ). Fix σ ∈ Sa and a correspond- f (atk , σ ) ≤ p ∀k ∈ N. From the continuity of f ing subsequence (tk )k∈N so that it holds p< it follows that the success probability function p< f is lower semicontinuous (and even = x and lower semi- continuous in its second argument, the step size). From limk→∞ atk continuity of p< f it follows σ ∈ Sx. We conclude Sa ⊂ Sx and therefore ξx ≤ ξa. To show the last statement we construct a cone of improving steps centered at x. This cone makes up a fixed fraction of each ball centered on x, which shows that x is p-improvable, where p is any number smaller than the volume of the intersection of 50 Evolutionary Computation Volume 28, Number 1 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point ball and cone divided by the volume of the ball, which is well-defined and positive in the limit when the radius tends to zero. Let v denote an eigen vector of H fulfill- ing vT H v < 0. For σ → 0, the objective function is well approximated by the quadratic Taylor expansion f (x(cid:5) ) ≈ g(x(cid:5) ) = f (x) + (x − x(cid:5) )T H (x − x(cid:5) ). f (x) is locally well approximated by S< The sublevel set S< g (x), which is a cone centered on x. Whether a ray x + R · z belongs to S< g (x) or not depends on whether zT H z < 0 or not. Now, the eigen vector v has this property, and due to continuity of g, the same holds for an open neighborhood N of v. The cone x + R · N is contained in S< g (x) and has the g (x, σ ) = p > 0 under N (X, σ 2I ) for all σ > 0. We conclude
same positive probability s< f (x, σ ) ≥ p > 0,
pag< lim σ →0 which completes the proof. Proof of Lemma 5: We use the short notation m := m(t ) and σ = σ (t ). Let S = S< denote the region of improvement, with Lebesgue measure sampling from this region is bounded by f (m) (cid:7)f(cid:4)(m). The probability of (cid:2) f (m) = p< = < = (cid:25) 1 S (2π )d/2σ d exp (cid:25) 1 (2π )d/2σ d exp S (cid:25) (cid:26) (cid:26) − − (cid:11)x − m(cid:11)2 2σ 2 (cid:11)x − m(cid:11)2 2σ 2 (cid:27) (cid:27) dx dx dx S 1 (2π )d/2σ d (cid:7)f(cid:4)(m) (2π )d/2σ d ≤ p, where the last inequality is equivalent to the assumption. Proof of Lemma 6: Assume the contrary, for the sake of contradiction. Then ∞(cid:16) X(t ) < ∞. (cid:2) t=1 Fix N ∈ N. Hoeffding’s inequality applied with ε = p/2 and n ≥ 2N p yields (cid:17) n(cid:16) Pr t=1 X(t ) ≤ N ≤ exp n → ∞−→0. (cid:18) (cid:27) (cid:26) −n · p2 2 Hence, for n → ∞, with full probability the infinite sum exceeds N . Since N was arbi- (cid:2) trary, we arrive at a contradiction. Proof of Lemma 7: In each iteration, the step size σ is multiplied by either ec− or ec+ . According to Lemma 3, the condition pH pT (cid:2) m(tk ) ≤ ed·c− yields (cid:3) (cid:3) ≥ e−c− . ηf pH ξ f pT (cid:2) m(tk ) An unsuccessful step of the (1 + 1)-ES in iteration t results in a reduction of the step size by the factor σ (t+1) = ec− < 1 and leaves m(t+1) = m(t ) unchanged. We conclude that no σ (t ) Evolutionary Computation Volume 28, Number 1 51 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 T. Glasmachers such step can overjump the interval (cid:2) and σ (t+1) ≤ ξ f m(t ) pT (cid:10) ξ f (cid:3) pT . The above property also implies bH bT (cid:3) , ηf pH (cid:2) m(t ) (cid:2) m(t ) (cid:3)(cid:11) ≥ e−c− . , in the sense of σ (t ) ≥ ηf pH (cid:3) (cid:2) m(t ) The central proof argument works as follows. First, we exclude that the step size remains outside [bT , bH ] for too long. The same argument does not work for the target interval defined in Equation (1) because of its time dependency—we could overjump the moving target. Instead we show that the only way for the step size to avoid the target interval for an infinite time is to overjump, that is, to find itself above and below the interval infinitely often. Finally, an argument exploiting the properties of unsuccessful steps allows us to consider a static target, which cannot be overjumped by the property already shown above. First, we show that there exists an infinite subsequence of iterations t fulfilling σ (t ) ∈ [bT , bH ]. This statement is strictly weaker than the assertion to be shown. It is still helpful in the following because then we know that the step sizes return to a fixed, t-independent interval for an infinite number of times. Assume for the sake of contradic- tion that there exists t0 such that σ (t ) ≤ bT for all t ≥ t0. The logarithmic step size change δ(t ) := log(σ (t+1)) − log(σ (t )) takes the values c+ > 0 with probability at least pT > τ and
c− < 0 with probability at most 1 − pT < 1 − τ , hence (cid:11) (cid:10) δ(t ) E ≥ (cid:5) := pT · c+ + (1 − pT ) · c− > 0.

For t1 > t0 we consider the random variable log(pag (t1 )) = log(pag (t0 )) +
δ(t ). The vari-
ables δ(t ) are not independent. We create independent variables as follows. For each
candidate state (metro, pag ) fulfilling σ < bT we fix a set I (m, σ ) ⊂ S< f (m) of improving steps with probability mass exactly pT under the distribution N (m, σ 2I ). Let ˜δ(t ) denote the step size change corresponding to δ(t ) for which the step size is increased only if the iterate m(t+1) is contained in I (m, σ ). Note that these hypothetical step size changes do not influence the actual sequence of algorithm states. Therefore, the sequence is i.i.d., and it holds ˜δ(t ) ≤ δ(t ). From Hoeffding’s inequality applied with ε = (cid:5)/2 to ˜δ(t ) ≤ ! −1 t1 t=t0 ! ! −1 t1 t=t0 −1 t1 t=t0 δ(t ) we obtain (cid:12) Pr log(σ (t1 )) ≤ log(σ (t0 )) + (t1 (cid:26) ≤ exp −(t1 − t0) · (cid:5)2 2(c+ − c−)2 , (cid:13) (cid:5) 2 − t0) · (cid:27) that is, the probability that the log step size grows by less than (cid:5)/2 per iteration on − t0. For t1 average is exponentially small in t1 the → ∞ it vanishes completely. Hence, with probability becomes minuscule, and for t1 full probability, we arrive at a contradiction. The same logic contradicts the assumption that σ (t ) ≥ bH for all t ≥ t0. Hence, with full probability, subepisodes of very small and very large step size are of finite length, and according to Lemma 6 the sequence of step sizes returns infinitely often to the interval [bT , bH ]. log(bT ) − log(σ (t0 )) + 2/(cid:5) · (cid:18) t0 (cid:3) (cid:2) Next we show that there exists an infinite subsequence of iterations fulfilling Equa- tion (1). Again, assume the contrary. We know already that σ (t ) does not stay below bT or above bH for an infinite time. Hence, there must exist an infinite subsequence fulfilling either (cid:19) (cid:3)(cid:20) (cid:2) m(t ) bT , ξ f pT σ (t ) ∈ (cid:19) ηf pH σ (t ) ∈ (cid:2) m(t ) (cid:3) , bH (cid:20) . (2) (3) Evolutionary Computation Volume 28, Number 1 or 52 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Global Convergence of the (1 + 1) Evolution Strategy to a Critical Point Assume an infinite subsequence fulfilling Equation (2). For each of these iterations, the success probability is lower bounded by pT . Consider the case of consecutive successes. Until the event σ (t ) ≥ ξ f (4) pT the probability of success remains lower bounded by pT > 0. The condition is fulfilled
after at most n+
c+ successes in a row, hence the probability of
such an episode occurring is lower bounded by pn+
T > 0. Lema 6 ensures the existence
of an infinite subsequence of iterations with this property. Each such episode contains a
point fulfilling either Equation (1) or Equation (4). By assumption, the former happens
only finitely often, which implies that the latter happens infinitely often.

registro(bH ) − log(bT )

(cid:2)
metro(t )

:=

(cid:3)»

(cid:3)

(cid:2)

Por eso, this case as well as the alternative assumption of an infinite sequence fulfill-
ing Equation (3), handled with an analogous argument, result in an infinite subsequence
with the property

pag (t ) ∈

(cid:19)
ηf
pH

(cid:2)
metro(t )

(cid:3)
, ec+ · bH

(cid:20)
.

Following the same line of arguments as above, as long as σ (t ) ≥ ηf
pH
ability of an unsuccessful step is lower bounded by 1 − pH > 0. After at most n−
(cid:2)
registro(bT ) − log(bH ) + c+
mugiendo, the step size must have dropped below bT ≤ ηf
pH
such an episode occurring is lower bounded by (1 − pH )norte-
an infinite number of such episodes occurs.

(cid:3)
, the prob-
:=
c− unsuccessful steps in a row, called an episode in the fol-
(cid:2)
metro(t )
, hence the probability of
> 0. According to Lemma 6,

(cid:2)
metro(t )

(cid:3)»

(cid:3)

Por construcción, these episodes consist entirely of unsuccessful steps, and therefore
metro(t ) remains unchanged for the duration of an episode. This comes in handy, since this
metro(t )
means that also the target interval
remains fixed, and this again
means that at least one iteration of the episode falls into this interval. We have thus con-
structed an infinite subsequence of iteration within the above interval, in contradiction
(cid:2)
to the assumption.

(cid:3)
, ηf
pH

(cid:19)
ξ f
pT

metro(t )

(cid:3)(cid:20)

(cid:2)

(cid:2)

Finalmente, we provide details on the computations of success rates in the examples.
En la sección 5.2, the set where the function f (x1, x2) := a · x2
− x2
2 takes the value zero

1
a) y (−1,
a). The cone
consists of two lines through the origin in directions (1,
is bounded by these lines in the success domain. The angle between their directions
a)
divided by π corresponds to the success rate. It is two times the angle between (1,
y (1, 0), and hence 2 cot
The threshold p < cot −1( −1(a)/π in Section 5.4 follows the exact same logic, with the difference that the square root vanishes in the direction vectors, and we lose a factor of two, since the success domain is only one half of the cone. a). Dividing by π yields the result. √ √ √ In Section 5.5, the circular level line in the corner point (a, 1) is tangent to the vector −1(a) between (−1, a) and (−1, 0), divided by 2π , is a lower bound (−1, a). The angle tan on the success rate at m = (a + ε, 1) with σ (cid:17) ε. The bound is precise for ε → 0. Evolutionary Computation Volume 28, Number 1 53 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / e d u e v c o a r t i c e - p d l f / / / / / 2 8 1 2 7 2 0 2 0 3 7 3 e v c o _ a _ 0 0 2 4 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3Global Convergence of the (1 + 1) Evolution image

Descargar PDF