Large Linguistic Corpus Reduction with

Large Linguistic Corpus Reduction with
SCP Algorithms

Nelly Barbot∗
IRISA, University of Rennes 1

Olivier Bo¨effard∗
IRISA, University of Rennes 1

Jonathan Chevelu∗
IRISA, University of Rennes 1

Arnaud Delhay∗
IRISA, University of Rennes 1

Linguistic corpus design is a critical concern for building rich annotated corpora useful in
different domains of applications. For example, speech technologies such as ASR (Automatic
Speech Recognition) or TTS (Text-to-Speech) need a huge amount of speech data to train data-
driven models or to produce synthetic speech. Collecting data is always related to costs (recording
speech, verifying annotations, etc.), and as a rule of thumb, the more data you gather, the
more costly your application will be. Within this context, we present in this article solutions
to reduce the amount of linguistic text content while maintaining a sufficient level of linguistic
richness required by a model or an application. This problem can be formalized as a Set Covering
Problem (SCP) and we evaluate two algorithmic heuristics applied to design large text corpora
in English and French for covering phonological information or POS labels. The first considered
algorithm is a standard greedy solution with an agglomerative/spitting strategy and we propose
a second algorithm based on Lagrangian relaxation. The latter approach provides a lower bound
to the cost of each covering solution. This lower bound can be used as a metric to evaluate the
quality of a reduced corpus whatever the algorithm applied. Experiments show that a suboptimal
algorithm like a greedy algorithm achieves good results; the cost of its solutions is not so far
from the lower bound (about 4.35% for 3-phoneme coverings). Usually, constraints in SCP are
binary; we proposed here a generalization where the constraints on each covering feature can be
multi-valued.

∗ IRISA, University of Rennes 1, 6 rue de Kerampont, Lannion 22300, France.

E-mail: {nelly.barbot, olivier.bo(cid:127)effard, jonathan.chevelu, arnaud.delhay}@irisa.fr.
This work is partially supported by the French National Research Agency (ANR) in the framework of the
project Phorevox.

Submission received: 7 May 2013; revised submission received: 20 November 2014; accepted for publication:
18 March 2015.

doi:10.1162/COLI a 00225

© 2015 Association for Computational Linguistics

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

1. Introduction

In automatic speech and language processing, many technologies make extensive use
of written or read text sets. These linguistic corpora are a necessity to train models or to
extract rules, and the quality of the results strongly depends on a corpus’ content. Often,
the reference corpus should provide a maximum diversity of content. For example, in
Tian, Nurminen, and Kiss (2005), and Tian and Nurminen (2009), it turns out that max-
imizing the text coverage of the learning corpus improves an automatic syllabification
based on a neural network. Similarly, a high quality speech synthesis system based on
the selection of speech units requires a rich corpus in terms of diphones, diphones
in context, triphones, and prosodic markers. In particular, Bunnell (2010) shows the
importance of a good coverage of diphones and triphones for the intelligibility of a
voice produced by a unit selection speech synthesis system.

To cover the best attributes needed for a task, several strategies are then possible. A
first method—very simple—is to collect text randomly, but it soon becomes expensive
because of the natural distribution of linguistic events following the Zipf’s law. Very
few events are extremely frequent and many events are very rare. This problem is
often made difficult by the fact that many technologies require several variants of the
same event (as in a Text-to-Speech [TTS] system using several acoustical versions of
the same phonological unit). Usually, a large volume of data needs to be collected.
However, and depending on the applications, building such corpora is often achieved
under a constraint of parsimony. As an example, for a TTS system, a high-quality
synthetic voice generally needs a huge number of speech recordings. But minimizing
the duration of a recording is also a critical point to ensure uniform quality of the
voice, to reduce the drudgery of the recording, to reduce the financial cost, or to
follow a technical constraint on the amount of collected data for embedded systems.
Moreover, a reduced set tends to limit the need of human implication for checking
the data (transcription and annotation). Similarly, in the natural language processing
field (NLP), the adaptation of a generic model to a specific domain often requires new
annotated data that illustrate its specificities (as in Candito, Anguiano, and Seddah
2011). However, the creation cost of such data highly depends on the kind of labels
used to adapt the model. In particular, the annotation in syntax trees is really more
expensive than in Part-of-Speech (POS) tags. Then, it could be more efficient to an-
notate a compact corpus that reflects the phenomena variability than a corpus with
a natural distribution of events, which implies many redundancies (see Neubig and
Mori 2010).

In a machine learning framework, the active learning strategy can be used as
an alternative that reduces the manual data annotation effort to design the training
corpus without diminishing the quality of the model to train (see Settles 2010 or Schein,
Sandler, and Ungar 2004). It consists of building the corpus iteratively by choosing
an item according to an external source of information (a user or an experimental
measure). This approach has been applied in NLP, speech recognition, and spoken
language understanding (see for instance Tomanek and Olsson 2009 and Gotab, B´echet,
and Damnati 2009).

A second alternative, when no direct quality measure is available, consists of
covering a large set of attributes that may impact the final quality (after annotation
or recording). This kind of approach might also be preferred when the final corpus
is built in one batch (for instance, because of out-sourcing or annotator/performer
consistency constraints). A method could be an automatic extraction from a huge
text corpus of a minimal sized subset that covers the identified attributes. This

356

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

problem is a generalization of the Set-Covering Problem (SCP), which is an NP-
hard problem, as shown in Karp (1972). It is then necessary to use heuristics or
sub-optimal algorithms for a reasonable computation time. Moreover, Raz and Safra
(1997) and Alon, Moshkovitz, and Safra (2006) have shown that the SCP cannot
be polynomially approximated with ratio c × ln(n) unless P = NP, when c is a con-
stant, and n refers to the size of the universe to cover. That means that one can-
not be certain to obtain a result under this ratio with any polynomial algorithm.
However, the latter complexity results are given for any kind of distribution in the
mono-representation case. One can ask if good multi-represented coverages can be
achieved efficiently on data following Zipf’s law, which is usual in the domain of
NLP.

Within the field of speech processing, the most frequently used strategy is a greedy
method based on an agglomeration policy. This iterative algorithm selects the sen-
tence with the highest score at each iteration. The score reflects the contribution of
the sentence to the covering under construction. In Gauvain, Lamel, and Esk´enazi
(1990), this methodology has been applied to build a database of read speech from
a text corpus for the evaluation of speech recognition systems using hierarchically
organized covering attributes. Van Santen and Buchsbaum (1997) have tested different
variants of greedy selection of texts by varying the units to cover (diphones, dura-
tion, etc.) and the “scores” for a sentence depending on the considered applications.
In Tian, Nurminen, and Kiss (2005), the learning corpus for an automatic system of
syllabification is designed using a greedy approach with the Levenshtein distance
as a score function in order to maximize its text diversity. In Franc¸ois and Bo¨effard
(2001), the methodology gives a priority to the rarest categories of allophones. The
latter methodology has been implemented for the definition of the multi-speaker corpus
Neologos in Krstulovi´c et al. (2006). In the article of Krul et al. (2006), the authors
constructed a corpus where the distribution of diphonemes/triphonemes matches a
uniform distribution. A greedy algorithm is led by a score function based on the
Kullback-Liebler divergence. A similar method is used in Krul et al. (2007) to design a
reduced database in accordance with a specific domain distribution. Kawai et al. (2000)
propose a pair exchange mechanism that Rojc and Kaˇciˇc (2000) apply after a first reverse
greedy algorithm—also called spitting greedy—deleting the useless sentences. In Cadic,
Boidin, and d’Alessandro (2010), the covering of “sandwich” units (defined to be more
adapted to corpus-based speech synthesis) is carried out by generating new sentences
in a semi-automatic way. Candidates are generated using finite state transducers. The
sentences are ordered according to a greedy criterion (their sandwiches richness) and
presented to a human evaluator. This collection of artificial and rich sentences en-
ables an effective reduction of the size of the covering but requires expensive human
intervention to obtain semantically correct sentences that will be therefore easier to
record.

The results of these previously cited studies are difficult to compare because of
the different initial corpora and covering constraints (partial or full covering) and
evaluation criteria (the number of gathered sentences, the Kullback divergence, etc.).
In Zhang and Nakamura (2008), a priority policy for the rare units is added into an
agglomerative greedy algorithm in order to get a covering of triphoneme classes from a
large text corpus in Chinese language. The results show that this priority policy driven
by the score function and the phonetic content of the sentences reduces the covering
size compared with a standard agglomerative greedy algorithm.

Similarly, in Franc¸ois and Bo¨effard (2002), several combinations of greedy algo-
rithms (agglomeration, spitting, pair exchange, or priority to rare units) were applied

357

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

to the construction of a corpus for speech synthesis in French containing at least three
representatives of the most frequent diphones. Based on this work, the best strategy
would be the application of an agglomerative greedy followed by a spitting greedy
algorithm. During the agglomeration phase, the score of a sentence corresponds to the
number of its unit instances that remain to be covered normalized by its length. During
the spitting phase, at each iteration, the longest redundant sentence is removed from the
covering. This algorithm is called the Agglomeration and Spitting Algorithm (ASA).

As an alternative to a greedy algorithm, which is sub-optimal, solving the SCP using
Lagrangian relaxation principles can provide an exact solution for problems of reason-
able size. However, for speech processing, the SCP has several millions of sentences with
tens of thousands of covering features. Considering these practical constraints, Chevelu
et al. (2007) adapted a Lagrangian relaxation based algorithm proposed by Caprara,
Fischetti, and Toth (1999). In the context of Italian railways, Caprara, Fischetti, and Toth
proposed heuristics to solve scheduling problems and won a competition, called
Faster, organized by the Italian Operational Research Society in 1994, ahead of other
Lagrangian relaxation heuristics–based algorithms, like Ceria, Nobili, and Sassano
(1998).

In Chevelu et al. (2007, 2008), the algorithm takes into account the constraints of
multi-representation. A minimal number of representatives for the same unit may be
required. The proposed algorithm, called LamSCP — Lagrangian-based Algorithm
for Multi-represented SCP — is applied to extract coverings of diphonemes with a
mono- or a 5-representation and coverings of triphonemes with mono-representation
constraints. These results are compared with the greedy strategy ASA and are about 5%
to 10% better. Besides, the LamSCP provides a lower bound for the cost of the optimal
covering and allows for evaluating the quality of the results.

In Barbot, Bo¨effard, and Delhay (2012), phonological content of diphoneme cover-
ings is studied regarding many parameters. These coverings are obtained by different
algorithms (LamSCP, ASA, greedy based on the Kullback divergence) and some of
the coverings are randomly completed to reach a given size (from 20,000 to 30,000
phones). It turns out that the coverings obtained using LamSCP and ASA provide a
good representation of short units and the representation of long units mainly depends
on the length of the corpus.

In this article, we present in more detail the LamSCP algorithm and its score func-
tions and heuristics that take into account multi-representation constraints. We deepen
the study about the performance of LamSCP for the construction of a phonologically
rich corpus according to the size of the search space. We evaluate LamSCP and ASA
algorithms on a corpus of sentences in English for a covering of multi-represented
diphones, where the minimal number of required unit representatives varies from one
to five. We also compare them in the case of very constrained triphoneme coverings in
English and French, which represent about 12 times more units to cover. Additionally,
both algorithms are tested to provide multi-represented coverings of POS tags in order
to assess their ability to deal with different kinds of linguistic data. A particular effort
has been made on methodology to obtain comparable measures, to study the stability
of both algorithms, and to establish confidence intervals for each solution.

This article is organized as follows. In Section 2, the SCP framework and the
associated notations are introduced. The ASA algorithm is described in Section 3 and
the LamSCP is detailed in Section 4. The experimental methodology is presented in
Section 5 and results are discussed in Section 6. Before concluding in Section 8, we
present experiments in the context of TTS where we evaluate on that task the benefits
of a reduction in section 7.

358

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

2. The Set-Covering Problem

Before describing the SCP-solving algorithms proposed in this article, we introduce in
this section some notations and the Lagrangian properties used by LamSCP.

Let us consider a corpus A composed of n sentences s1, : : : , sn. According to the
target applications, these sentences are annotated with respect to phonological, acoustic,
prosodic attributes, and so forth. Each sentence is then associated with a family of units
} and A can
of different types. The set of units present in A is denoted U = {u1, : : : , um
be represented by a matrix A = (aij), where aij is the number of instances of unit ui in the
sentence sj. Therefore, the j th column of A corresponds to sentence sj in A. To simplify
the writing, we define the sets M = {1, : : : , m} and N = {1, : : : , n}.

For a given vector of integers B = (b1, : : : , bm)T, a reduction X of A, also called
covering of U, is defined as a subset of A which contains, for every i ∈ M, at least bi
instances of ui. It can be described by a vector X = (x1, : : : , xn)T where xj = 1 if sj belongs
to X and xj = 0 otherwise. In other words, a covering is a solution X ∈ {0, 1}n of the
following system

∀i ∈ M,

j2N

aijxj

≥ bi

(1)

that is, AX ≥ B where B is called the constraint vector.

Our aim is to optimize a covering according to a cost function minimization crite-
rion. The covering cost is given by summing the costs of the sentences that compose the
covering. The optimization problem can be formulated as the following SCP:

X

(cid:3) = arg min
X∈{0,1}n
AX≥B

CX

(2)

where C = (c1, : : : , cn) is the cost vector and cj the cost of the sentence sj. Because of the
objective to minimize the total length of the covering, we have chosen to define the cost
of a sentence as one of its length features. According to the considered application, the
sentence cost can be defined as its number of phones (one of our objectives is to design
a phonetically rich script with a minimal speech recording duration), or its number of
words, part-of-speech tags, breath groups, and so on.

In Caprara, Fischetti, and Toth (1999), Caprara, Toth, and Fischetti (2000), and Ceria,
Nobili, and Sassano (1998), the studied crew scheduling problem is a particular case of
Equation (2) where A is a binary matrix and B = 1Rm (i.e., with mono-representation
constraints). In order to ensure that Equation (1) admits a solution, we assume that, for
each i ∈ M, the minimal number bi of ui instances required in the covering is not greater
than the number (A1Rn )i of ui instances in A, that is A1Rn ≥ B. Under this assumption, A
is the maximal size solution of Equation (1), represented by X = 1Rn. In the case where
bi is greater than the number of ui instances in A, bi is set to (A1Rn )i.

To drive the SCP algorithms during the sentence selection phase, the covering
capacity (cid:22)j of sentence sj is defined as the number of its unit instances required in the
covering in view of the constraint vector:

(cid:22)j =

i2M

{

}

min

aij, bi

(3)

359

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

Let us notice that (cid:22)j does not consider the excess unit instances: For example, if sj
contains aij = 10 instances of ui and at least bi = 3 instances of ui are required, the
contribution of ui to (cid:22)j derivation only takes into account three instances of ui.

3. Greedy Algorithm ASA

In this section, the two main steps that compose the algorithm ASA are briefly de-
scribed. First, an agglomerative greedy procedure is applied to A so as to derive a
covering. Next, a spitting greedy procedure reduces this covering in order to approach
the optimal solution of Equation (2).

3.1 Agglomerative Greedy Strategy

The greedy strategy builds a sub-optimal solution to the SCP Equation (2) in an iterative
way. At each iteration, the lowest cost sentence is chosen from A. If several sentences
correspond to the lowest cost, the one coming first (i.e., the one with the lowest index)
is chosen. Initially, the set of selected sentences X is empty, the matrix ˜A associated with
the candidate sentences is assigned to A, the current covering capacity of sj is given by
˜(cid:22)j = (cid:22)j, and the current constraint vector ˜B = B. The cost of sentence sj is defined by

{

(cid:27)j =

cj= ˜(cid:22)j if ˜(cid:22)j
∞ otherwise

̸= 0

(4)

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Indeed, if ˜(cid:22)j = 0, it turns out that sj does not cover any unit missing in the solution X
under construction and its infinite cost (cid:27)j avoids its selection.

At each iteration, the selected sentence s is added to X . Taking into account the
content of s, ˜B is updated to max
where the jth entry of ∆ equals 1 if
sj = s and 0 otherwise. Next, the associated column of s in ˜A is set to 0Rm . For each
sentence sj with a non-zero ˜(cid:22)j feature, ˜(cid:22)j is then updated using ˜A and ˜B in Equation (3).
At last, the agglomerative greedy algorithm is stopped as soon as all the constraints

˜B − ˜A∆, 0Rm

{

}

are satisfied, that is, ˜B = 0Rm .

3.2 Spitting Greedy Strategy

The spitting greedy strategy also consists in building iteratively a sub-optimal solution
Y to Equation (2) by reducing the size of a covering. The initial covering Y is set to the
solution X derived by the agglomerative phase described earlier. At each iteration, the
set of the redundant sentences of Y is calculated and the costliest one (according to
the cost function C) is removed from Y. An element s of Y is said to be redundant if for
∈ U, its number of instances into Y, denoted mi(Y ), and into Y \ {s}, denoted
each ui
}. In other words, s is a redun-
} = min {mi(Y \ {s}), bi
mi(Y \ {s}), check min {mi(Y ), bi
dant element of the covering Y if Y \ {s} is also a covering solution of Equation (1).
The spitting greedy algorithm stops when the redundant sentence set is empty.

4. Lagrangian Relaxation Based–Algorithm

This section describes the main phases of the algorithm called LamSCP. This algorithm
takes advantage of the Lagrangian relaxation properties reviewed herein in order to
approach the optimal solution of Equation (2) as close as possible. Strongly inspired by

360

Barbot et al.

Large Linguistic Corpus Reduction

Caprara, Fischetti, and Toth (1999), but generalized to the multi-representation prob-
lem, this algorithm provides a lower bound of the optimal solution cost. Having such
information is very useful for assessing the achievements of the SCP algorithms.

4.1 Lagrangian Relaxation Principles

Let us briefly recall the main principles of Lagrangian relaxation on which LamSCP is
based to solve Equation (2) (see Fisher [1981] for more details on Lagrangian relaxation).

First, the Lagrangian function associated with Equation (2) is defined by

L(X, Λ) = CX + ΛT(B − AX)

= ΛTB + C(Λ)X

(5)

)
m

(
R+

, X ∈ {0, 1}n, and C(Λ) = C − ΛTA. The coordinates of Λ =
where Λ ∈
((cid:21)1, : : : , (cid:21)m)T are called Lagrangian multipliers and can be interpreted as a weighting
of constraints (1). The jth entry of C(Λ), called Lagrangian cost cj(Λ) of sentence sj, takes
into account its cost cj and the adequacy of its composition to address Equation (2). For
, the Lagrangian function satisfies L(X, Λ) ≤ CX.
every covering X and every Λ ∈
Thus, the dual Lagrangian function defined by

(
R+

)
m

L(Λ) = min

X2f0,1gn

L(X, Λ)

(6)

presents the following fundamental property: For every Λ ∈ Rm
+ and every covering
(cid:3)
X, we have L(Λ) ≤ CX. Hence, L(Λ) is a lower bound of the minimal covering cost, CX
,
but does not necessarily correspond to the cost of a covering. In order to compute L(Λ),
an acceptable solution for the vector X minimizing L(X, Λ) is X(Λ) = (x1(Λ), : : : , xn(Λ))T
where

{

xj(Λ) =

1 if cj(Λ) < 0 0 if cj(Λ) > 0
∈ {0, 1} otherwise

(7)

Additionally, the dual Lagrangian function and the Lagrangian costs inform about
the potential usefulness of sentences in the optimal covering. More precisely, for a given
Λ and a known upper bound UB of minimal covering cost, the gap g(Λ) = UB − L(Λ)
measures the relaxation quality. If cj(Λ) is strictly greater than g(Λ), we can check that
any covering containing sj has a cost value strictly greater than UB. Hence, sentence
sj is not selected and xj can be fixed at zero. Similarly, if cj(Λ) < −g(Λ), any covering with a cost lower than UB contains sj and one can fix xj to 1. Therefore, an optimal covering is made up of sentences with a low Lagrangian cost, as done in Caprara, Toth, and Fischetti (2000) and Ceria, Nobili, and Sassano (1998), and the higher the relaxation quality (i.e., the lower g(Λ)) is, the cheaper the covering will be. The resolution of the dual problem of Equation (2) consists in finding Λ(cid:3) ∈ Rm + that maximizes the lower bound L(Λ), that is Λ(cid:3) = arg max (cid:3)2Rm + L(Λ) (8) 361 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 3 5 5 1 8 0 6 7 0 9 / c o l i _ a _ 0 0 2 2 5 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 Because this real variable function L is concave and piecewise affine, a well-known approach for finding a near-optimal multiplier vector is the subgradient algorithm. 4.2 The Three Phases The LamSCP is an iterative algorithm, composed of several procedures that aim to either improve the current best solution or reduce the combinatorial issue related to the considered problem. In order to derive a good solution, the algorithm calls on a great number of greedy procedures to solve sub-problems with the help of the Lagrangian costs. As for the combinatorial reduction, the most frequently used heuristic consists of downsizing the problem by mainly considering the sentences with low Lagrangian costs. The algorithm is organized around a main procedure called 3-phases. This pro- cedure can single-handedly solve a multi-represented SCP. As its name suggests, the 3-phases functioning consists in iterating a sequence of the three following sub- procedures as shown in Figure 1: r r r The subgradient phase calculates an estimation ˜Λ of Λ(cid:3) that maximizes the dual Lagrangian function. This procedure requires an upper bound UB of the optimal covering cost. UB is initialized by a greedy algorithm (rather than the cost of the whole corpus A). This phase is detailed in Section 4.2.1. The heuristic phase explores the neighborhood of ˜Λ by generating a great number of Lagrangian vectors ˜Λp. A greedy-like procedure is associated with each ˜Λp so as to compute a covering using the Lagrangian cost vector C( ˜Λp). If, during this exploration, a less costly covering than the best known one (corresponding to the cost UB) is found, the upper bound UB is then updated to the cost of this less costly solution. Similarly, if a better estimation of Λ(cid:3) ˜Λ is updated. This phase is described in Section 4.2.2. The column fixing phase selects a set F of sentences that are most likely to belong to the optimal covering. This phase is detailed in Section 4.2.3. is obtained, Following the column fixing phase, the constraint vector is updated and the un- selected sentences define a set-covering sub-problem, called a residual problem. This sub-problem is processed similarly, via an additional iteration of the three phases. This iterative process is stopped when the residual problem is empty or when the associated dual Lagrangian function indicates a cost is too high. Indeed, because this function indicates a minimal cost for covering the sub-problem, its addition to the cost CF of the sentences already retained in F gives a lower bound of the total cost of the solution under construction, which should not rise beyond the cost UB of the best known solution so as to be potentially more advantageous. 4.2.1 Subgradient Phase. In order to reach the quality goal, the subgradient phase pro- vides a near-optimal solution ˜Λ of the dual Lagrangian problem (8) using a subgradient 362 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 3 5 5 1 8 0 6 7 0 9 / c o l i _ a _ 0 0 2 2 5 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Barbot et al. Large Linguistic Corpus Reduction Figure 1 LamSCP algorithm. The rectangle blocks are the steps that aim at improving solution quality and the ellipses are those that try to reduce the size of the problem. type algorithm. This iterative approach generates a sequence (Λp) using the following updating formula (see Caprara, Fischetti, and Toth 1999) { Λp+1 = max Λp + (cid:22) } g(Λp) ||S(Λp)||2 S(Λp), 0 (9) where S(Λp) = B − AX(Λp) so as to take into account the multi-representation con- straints. Parameter (cid:22) is adjusted to fit the convergence fastness according to the method proposed by Caprara, Fischetti, and Toth (1999). At the first call of 3-phases, Λ0 is arbitrarily defined as follows: for each i ∈ M, (cid:21)0 i = min j∈N ̸=0 aij cj (cid:22)j (10) As for UB, its initial value is set to the cost of a covering previously calculated. In order to evaluate how much the covering cost derived by ASA can be improved, we have chosen to initialize UB by this value. At the following iterations of 3-phases, Λ0 is given by a random perturbation (less than 10%) of the best known vector ˜Λ (of which the entries of the sentences fixed in the last column fixing phase are removed) and UB corresponds to the cost of the best covering found (after subtraction of the cost of the sentences fixed in the last column fixing phase). In another approach proposed by Ceria, Nobili, and Sassano (1998), UB corresponds to the upper bound of a dual Lagrangian problem, and the subgradient procedure simultaneously estimates the upper bound and the lower bound, generating two sequences of multipliers. The subgradient phase also calls two procedures: pricing and spitting. Procedure pricing aims to reduce the size of the search space. For each unit ui, the pricing selects the 5bi smallest Lagrangian cost sentences covering ui. If this selection contains less than 5m sentences, where m is the maximal entry of B, it is completed by low Lagrangian cost sentences (less than 0.1) to the limit of 5m sentences. The set of the chosen sentences is denoted P and its design guarantees a sufficient number of instances for each unit to cover and a small variety in its composition. Actually, the subgradient method is applied on P instead of A, and P is updated every 10 subgradient iterations. 363 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 3 5 5 1 8 0 6 7 0 9 / c o l i _ a _ 0 0 2 2 5 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 Finally, at each iteration, definition (7) of X(Λp) and the large number of Lagrangian costs close to zero allow a considerable number of vectors S(Λp). In order to get around the computational difficulty of finding the steepest accent direction, Caprara, Fischetti, and Toth (1999) propose a heuristic that, according to the experimental results, accel- erates the convergence of the subgradient phase. This heuristic is implemented in the spitting procedure. Called at each iteration, this procedure extracts from P the subset S of sentences with a Lagrangian cost lower than 0.001. S is then reduced using a spitting greedy algorithm to remove its redundant elements in decreasing Lagrangian cost order. At last, for every j ∈ M, xj(Λp) = 1 if sj ∈ S and xj(Λp) = 0 otherwise. Thus, S(Λp) does not necessarily correspond to a subgradient vector. 4.2.2 Heuristic Phase. The heuristic phase calculates a large number of coverings before keeping the best one. To that end, a sequence of 150 multiplier vectors is generated by perturbing ˜Λ using the formula ˜Λp+1 = max where ˜Λ0 = ˜Λ and (cid:22) is provided by the subgradient phase, so as to allow for a change in a large number of ˜Λp. With each ˜Λp, an agglomerative greedy algorithm followed by a spitting greedy one are associated in order to calculate a covering. ˜Λp + (cid:22)g( ˜Λp)S( ˜Λp), 0 { } The agglomerative greedy chooses at each iteration the sentence sj with the lowest cost (cid:27)j( ˜Λp) where    (cid:27)j( ˜Λp) = cj( ˜Λp) ∗ ˜(cid:22)j if cj( ˜Λp) < 0 and ˜(cid:22)j > 0
if cj( ˜Λp) ≥ 0 and ˜(cid:22)j > 0
cj( ˜Λp)= ˜(cid:22)j

if ˜(cid:22)j = 0

(11)

This cost function provides an advantage to low Lagrangian cost sentences sj containing
˜(cid:22)j unit instances that could be helpful to the covering under construction. The agglom-
erative step uses several heuristics so as to reduce the search space. It is run within a
l of P, composed of the sentences sj with the lowest costs (cid:27)j( ˜Λk). At
limited subset P
l is selected and the cost of the sentences of P are updated.
each iteration, a sentence of P
l becomes greater than the minimal cost in P \ P
If the maximum sentence cost in P
l,
the working subset P
l is also updated. The definitions of P and P
l guarantee that the
agglomeration step provides a nonpartial solution of the considered SCP. This solution
is then reduced during the spitting step by iteratively removing its redundant sentences
sj with the highest costs cj.

At the end of the heuristic phase, the best found covering X (cid:3)

and its cost CX
(stored in UB) are kept as well as the highest value of L( ˜Λp) (found during the sub-
gradient or heuristic phases).

(cid:3)

4.2.3 Column Fixing Phase. The column fixing phase aims to reduce the problem size
by choosing “promising” sentences among the ones with very low Lagrangian cost or
containing rare unit instances. The unselected sentences are less interesting for resolving
the SCP and the residual problem associated with these sentences should be the subject
of another call of the 3-phases.

More precisely, the column fixing phase calculates the subset Q composed of sen-
tences sj with a negative Lagrangian cost cj( ˜Λ). Q is represented by the binary vector
∈ U, the number of its instances
∈ Q. For each ui
Q = (q1, : : : , qn)T where qj = 1 if sj
≤ bi, then ui is considered as a rare unit
covered by Q is given by (AQ)i. If (AQ)i
and all the elements of Q containing some instances of ui are fixed in a set F. The
covering constraints that are not satisfied by F constitute a residual SCP. In order to

364

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

)

(
BT1Rm

complete F with a few sentences of the best known covering, a greedy-type algorithm
is run on X (cid:3) \ F to derive a solution of this residual SCP. From the obtained solution,
the max{
=20, 1} lowest Lagrangian cost sentences are also added in F. The
sentences that are “fixed” in F during the column fixing phase stay in F up to the end
of the 3-phases. After the column fixing phase, the residual sub-problem is processed
by iterating the three phases and the next fixed sentences are added to F.

4.3 Refining Procedure

The 3-phases procedure is encapsulated in an outer loop that permits the partial re-
consideration of the solution X (cid:3)
provided by this procedure. To that end, the refining
procedure, proposed in Caprara, Fischetti, and Toth (1999), is used in order to select
elements of X (cid:3)
that contribute at least to the gap g( ˜Λ). In the case of the SCP with multi-
representation constraints, the definition of the contribution of sj
can be adapted
as follows:

∈ X (cid:3)

(cid:14)j = max{cj( ˜Λ), 0} +

i∈M
aij>0

˜(cid:21)i(AX

(cid:3) − B)i

aij
(AX(cid:3))i

(12)

(cid:3) − B)i of the
The second term of Equation (12) consists of sharing the contribution ˜(cid:21)i(AX
excess instances of ui in X (cid:3)
according to the distribution of ui instances in X (cid:3)
. Therefore,
the refining procedure ranks in an increasing order the elements sj of X (cid:3)
according to
their (cid:14)j value, and fixes the first elements in a set G until its given covering rate (cid:28)G
reaches (cid:25). (cid:28)G represents the rate of covering constraints satisfied by G and is defined by

(cid:28)G = 1 −

i2M max{bi

− (AG)i, 0}

BT1Rm

(13)

where G denotes the binary vector corresponding to G.

4.4 The Overall LamSCP

The LamSCP is made up of the main procedures introduced in the previous sections
interlinked by the following steps. First, because of the adaptation of the algorithm to
the SCP with multi-representation constraints, the entries of matrix A are clipped to
the constraint vector in order to simplify the calculations such as the ones of (cid:22)1, : : : , (cid:22)n.
This threshold application implies that the excess instances of each unit ui are taken into
account in each sentence sj beyond the number bi, in the derivation of (cid:14)j.

After the initialization of Λ0 using Equation (10) and the upper bound UB, pro-
cedure 3-phases is called and provides a solution X to the complete SCP. The refining
function fixes a sentence subset G such that G covers a given rate (cid:25) of the covering
constraints. Parameter (cid:25) starts at a minimum value (cid:25)min = 0:3. The residual SCP is then
processed by 3-phases. The (cid:25) value grows at a rate of 20 percent whenever 3-phases
does not improve the solution to the complete SCP. If (cid:25) is greater or equal to 1, the
refining procedure fixes the whole best solution and the residual problem is then empty.
On the other hand, (cid:25) is set to (cid:25)min if a better solution is found in order to challenge
G and improve this solution. This iterative sequence composed of the 3-phases and
refining procedures is carried out until the residual problem is not empty, the gap g( ˜Λ)
is positive, and the number of iterations has not reached 20.

365

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

5. Experiments

We propose a twofold comparison of the ASA and LamSCP algorithms: One part is fo-
cused on the covering cost for a large SCP, and the other on the stability of the solutions.
Moreover, in order to assess the ability and the behavior of both algorithms to process
different linguistic data, a first set of experiments deals with phonological attributes
(mainly covering co-occurrences of phonemes) and a second set with grammatical
labels (mainly covering co-occurrences of POS labels).

Both attribute types, often involved in automatic linguistic processing, were chosen
because their distribution consists of few highly frequent events and numerous rare
events. On the one hand, for TTS tasks, the phonological type covering is a useful
preliminary step of the text corpus design before the recording step. In order to produce
the signal corresponding to a requested sentence, the unit selection engine requires at
least one instance of each phone (or 2-phone, depending on the concatenation process).
Because the recording and the post-recording annotation process are expensive tasks,
the recording length of such a corpus has to be as short as possible. On the other hand,
in order to train a domain-specific dependency parser, the covering of POS sequences
may be useful for increasing the diversity of syntax patterns. Because the dependency
annotation is a highly expensive task, the adaptation corpus to annotate needs to be
as small as possible, containing characteristic examples of the specific lexical variation
rather than following the natural distribution. One can expect that increasing its diver-
sity of POS sequences may lead to more diversity in the syntax trees.

Experiments on covering co-occurrences of phonemes are carried out on two large
phonologically annotated text corpora, and consist of covering at least k instances of
each phoneme, diphoneme until n-phoneme (i.e., triphoneme if n = 3, diphoneme if
n = 2). The cost cj of the sentence sj is given by its number of phones. From this point,
this kind of SCP is called a “k-covering of n-phonemes.”

A first corpus, Gutenberg, is composed of texts in English, mainly extracted from
novels and short stories. This corpus is the production of the Gutenberg Project, pre-
sented by Hart (2003), and has been used by Kominek and Black (2003) to design the
speech corpus Arctic. A second corpus, in French, named Le-Monde, is extracted from
articles published in the newspaper Le Monde in 1997. Table 1 summarizes the main
features of both corpora.

The phonological annotation of the Gutenberg corpus comes from the Arctic/
Festvox database (see Kominek and Black 2003), and the annotation of the Le-Monde
corpus is a by-product of the Neologos project, detailed by Krstulovi´c et al. (2006).

Table 1
Statistics of the studied corpora.

Le-Monde

Gutenberg

Number of sentences
Number of phonemes
Corpus size (number of phones)
Number of diphonemes
Number of diphones
Number of triphonemes
Number of triphones
Sentence length mean (phones) & Std. Dev.

172,168
35
16,668,609
1,172
16,496,441
26,443
16,324,273
96.81 (60.46)

53,996
57
1,539,735
1,955
1,485,739
27,477
1,431,743
28.51 (10.52)

366

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

Table 2
Statistics of 1-POS and 2-POS occurrences, corpus Le-Monde. On average, a sentence contains
27:64 POS with a standard deviation of 16:45.

Number of distinct units Number of occurrences

1-POS
2-POS

141
6,716

4,587,320
4,421,371

For each corpus, we have collected every phoneme, diphoneme, triphoneme, and
their occurrences in each sentence so as to define the set U of units to cover and the
matrix A. A is built by collecting one sentence after the other following the ordering
inside the corpus, and one unit after the other inside the sentences. After this matrix
translation, we obtain two description files and two index files. The first file describes
the matrix A and the second one the cost vector C. Because of the low matrix density, we
have chosen a sparse representation to save space and computation time: For instance,
the 2-phoneme Gutenberg matrix is about 2.2% dense. We only store the cells of A that
have a non-zero value so as to get a sparse matrix. The index files are made for the
correspondence between the general covering problem and the application domain.
The implementation is made in C. In terms of software engineering, our algorithms
are working on an SCP that does not depend on the application data. For example,
there is no information on what types of units are to be covered. The algorithms only
have the matrix of occurrences A, the cost vector C, and the constraint vector B. A set
of translation files (from application data to SCP and from SCP to application data) is
built before each computation. As a consequence, there is no difficulty in addressing a
different set of features to cover on the same or on a different corpus.

To study the achievements of ASA and LamSCP on different types of data, we
have also chosen to address the “k-covering of n-POS” on the corpus Le-Monde. The
grammatical and syntactical analyses are processed by the Synapse development
analyzer presented in Synapse (2011). In order to consider a SCP with a substantial
number of required units, a very detailed level of POS tagging has been selected,
providing 141 distinct tags after analyzing Le-Monde. For example, this level provides
tags like “Determiner male singular Article” or “Noun female singular,” whereas the
simplest level gives “Determiner” or “Noun.” This latter level of description would
have given only nine different POS tags after analyzing Le-Monde. The main associated
statistics are given in Table 2. For these experiments of POS covering, the cost of a
sentence is defined as its number of POS occurrences.

We used a PC with 2 CPUs (E5320/1.86GHz/4 cores/64bits) and 32 GB RAM for the
phonological coverings and the POS coverings were computed using a PC with 8 CPUs
(Intel Xeon X7550/2.00Ghz/8 cores/64bits) and 128 GB RAM. Our implementations do
not take advantage of any parallelism.

The following sections detail more precisely the different experiments conducted

on French or English.

5.1 Performance of the Algorithms for 1-Covering of 2-Phonemes in French

The aim of Experiment 1 is the assessment of the achievements of both algorithms, ASA
and LamSCP, and the robustness of the results when the sentence ordering is modified

367

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

in the corpus to reduce. Indeed, one of the difficulties of the greedy methodology is that
the score function has discrete values and several sentences can yield the same score. In
our implementation, among the sentences showing the best current score, the first one
encountered is chosen. We would like to measure the influence of this random choice on
the stability of the results. LamSCP uses greedy strategies based on Lagrangian costs.
Because the Lagrangian costs cj(Λ) take the SCP in its entirety into account and are
continuous real-value functions of Λ, they would be more selective than the sentence
costs used by ASA. A simple solution for evaluating the stability consists of proceeding
with an important amount of experiments on the same SCP by randomly modifying
the sentence ranking in A. Experiment 1 measures the impact of these permutations
on the solutions computed by both algorithms. The considered SCP is the 1-covering
of 2-phonemes on the corpus Le-Monde. Considering the computation time (more than
5 hours for LamSCP), only 47 instances of the SCP are considered, each instance corre-
sponding to a random sentence ordering in Le-Monde. The 95% confidence intervals are
derived using the bootstrap method, concerning the covering cost, the number and the
length of sentences in coverings, the computation time, and the “distance” between the
covering costs and the associated lower bound L( ˜Λ).

5.2 Stability of the Algorithms for the k-Covering of 2-Phonemes in English

One of the goals of Experiment 2 is to compare the achievements of both algorithms on
corpus Gutenberg, which has different features from the ones of Le-Monde. The sentences
in Gutenberg are shorter on average and the associated variation of sentence length is
lower. Furthermore, Gutenberg is 10 times smaller than Le-Monde.

In order to compare with the results of the previous experiment done on Le-Monde,
we first observed a 1-covering of 2-phonemes on the Gutenberg corpus. The search
space seems smaller than in Experiment 1. According to Table 1, Le-Monde is com-
posed of more sentences than Gutenberg. Le-Monde contains 33,165,050 occurrences of
2-phonemes and Gutenberg only 3,025,474. Moreover, the number of attributes to cover
is lower: 1,207 2-phonemes in Le-Monde and 2,012 in Gutenberg. We can also notice that
five 2-phonemes have only one occurrence in Le-Monde and that the total cost of the
five sentences covering these rare units is 751 phones, whereas this is the case for 109
2-phonemes in Gutenberg and the total cost of the 104 concerned sentences is 3,606
phones. Finally, if we consider the density of matrix A, 8.4% of the cells are non-empty
for Experiment 1 and 2.2% for Experiment 2. As with Experiment 1, the sentence order-
ing in Gutenberg has been randomly modified to produce 60 instances of the SCP and
similar solution statistics have been computed for both algorithms.

The second objective is to test and compare the ability of two algorithms to deal
with the constraints of multi-representation. For this, we apply the same methodology
to the k-covering of 2-phonemes in Gutenberg, for k from 2 to 5. We note that for the same
original corpus to reduce, the size of the search space decreases when k increases. These
different SCPs enable us to compare the performance of two algorithms depending on
the size of the search space.

5.3 1-Covering of 3-Phonemes in English

In Experiment 3, the aim is to observe the behavior of both algorithms on very con-
strained problems. For this, we study their ability to treat a covering of 3-phonemes. We
try to assess the impact on the solution features and on the stability of such an increase
in the number of attributes to cover with many rare events. So as to compute statistics

368

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

on the 1-covering of 3-phonemes, an instance of Gutenberg has been proposed to ASA
and to LamSCP. This instance counts 29,489 units to cover and the density of matrix
A is 0.24%. The computation time is nearly 5 days for LamSCP and we then chose
to carry out 35 instances of the 1-covering of 3-phonemes on Gutenberg. Additionally,
it is interesting to compare these results with those of Experiment 2 concerning the
1-covering of 2-phonemes on the same corpus, which corresponds to a larger search
space.

5.4 1-Covering of 3-Phonemes in French

In order to pursue the objective set out in the description of Experiment 3, that is, the
ability of algorithms to treat with numerous constraints and a heavy-tail distribution
of units, Experiment 4 consisted of testing both algorithms on the 1-covering of 3-
phonemes on Le-Monde. The search space seems larger than in the previous experiment.
Let us recall that Le-Monde contains 3.18 times more sentences than Gutenberg and it
counts 27,650 units to cover. Furthermore, Gutenberg contains 5,000 3-phonemes with
only one occurrence, which requires the selection of 4,180 sentences with a total length
equal to 137,714 phones, whereas Le-Monde contains 2,274 rare 3-phonemes scattered
in 2,107 sentences measuring a total of 283,208 phones. The associated matrix density
is 0.69%. Because the computation of a first instance takes more than 8 days, we have
limited the number of instances to 30 for this SCP.

5.5 k-Covering of 1-POS and 2-POS in French

The main goal of Experiment 5 is to study the behavior of both algorithms, ASA and
LamSCP, dealing with another kind of linguistic attribute, and to compare this with
the previous experiments. To achieve this goal, we consider POS attributes and the
associated SCP: 1- and 5-coverings of POS, 1- and 5-coverings of 2-POS defined on
Le-Monde. Indeed, we can observe in Table 2 that the global statistics of POS tags
in Le-Monde are quite different from their phonological counterparts summarized in
Table 1. In particular, the density of matrix A is 11.03% for a 1-POS covering and 0.57%
for a 2-POS covering. Also, the search space size seems to decrease when considering
successively the mono- and the multi-coverings of 1-POS, and the mono- and the multi-
coverings of 2-POS, permitting us to compare them with the results coming from the
experiments on phonological coverings. We evaluate the stability by computing 50
randomly mixed versions of the corpus Le-Monde.

6. Results and Discussion

In this section, the results of the experiments described in Section 5 are provided and
discussed. As a consequence, the organization of this section and Section 5 are similar.

6.1 Performance of Algorithms for 1-Covering of 2-Phonemes in French

Table 3 shows the main results of Experiment 1, concerning the 1-covering of 2-
phonemes from the corpus Le-Monde. Symbol ± indicates that the mentioned value
corresponds to a 95% confidence interval, calculated using the bootstrap method from

369

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

Table 3
Statistics of the solutions of 1-covering of 2-phonemes computed by ASA and LamSCP from
Le-Monde. The last column represents the best lower bounds found by LamSCP.

Experiment 1: 1-covering of 2-phonemes, corpus Le-Monde

ASA

LamSCP

L( ˜Λ)

Covering size (phones)
Reduction rate relative to ASA (%)
Covering size [min; max]
Covering size Std. Dev.
Sentence number
Sentence length
Time CPU (seconds)

8,555 ± 22

[8,447; 8,669]
57.40
335.73 ± 1.69
25.48 ± 0.10
51 ± 0

7,786 ± 4
–9.00 ± 0.20
[7,767; 7,829]
13.01
268.76 ± 1.08
28.97 ± 0.12
20,478 ± 508

7,689 ± 5
–10.13 ± 0.19
[7,649; 7,715]
13.83


the 47 instances of the SCP. In order to cover each of the 1,207 2-phonemes of Le-Monde,
ASA drastically reduces the size of the initial corpus by 99.94% (±0.00). However, on
average, LamSCP calculates a 9.00% shorter covering. The lower bound L( ˜Λ) for the
optimal covering cost is 7,689 ± 5 phones. L( ˜Λ) is not a minimum value and may
not correspond to the cost of a real covering. Because this lower bound is updated all
along the execution of LamSCP, we do not mention a specific calculation time for this
result.

For one instance of the SCP, let CX

(cid:3)
ASA be the size of the solution given by ASA. The
(cid:3)
quantity (cid:28)ASA = 1 − L( ˜Λ)=CX
ASA indicates that the optimum solution to SCP is at most
(cid:28)ASA times shorter than the covering calculated by ASA. It can be observed that the
optimal solution is at most 10.13% (±0.19) shorter than the one yielded by ASA and at
most 1.24% (±0.08) shorter than the solution yielded by LamSCP. The solutions obtained
by LamSCP and the optimal solution to the SCP are therefore very close. Considered
among the 47 instances of the SCP the best solutions yielded by ASA (8,447 phones)
and LamSCP (7,767 phones), LamSCP is 8.75% better than ASA in terms of covering
costs, while the best lower bound for the SCP is 7,715 phones, only 0.67% (respectively,
8.66%) shorter than the best covering by LamSCP (respectively, ASA).

The average length of the sentences selected by both algorithms is far below the
average length of the sentences in the corpus (96.81 phones). LamSCP tends to choose
sentences that are slightly longer than ASA, with an average 28.97 (±0.12) phones
compared with 25.48 (±0.10) phones. Moreover, ASA selects on average 335.73 (±1.69)
sentences per solution, about 24.91% more than LamSCP, which selects 268.76 (±1.08)
sentences on average. This seems to indicate that LamSCP makes fewer local choices
than ASA. This hypothesis can also be validated through the analysis of the variability
of the results. The relative variation of the covering costs calculated by LamSCP is
13.01/7,786 = 0.16%, and 57.40/8,555 = 0.67% by ASA; that is to say a stability of the
costs 4 times greater for the solutions yielded by LamSCP than for ASA. Moreover, the
solutions are composed of a very stable number of sentences: The associated relative
standard deviation is 5.31/335.73 = 1.58% for the 47 instances solved by ASA, and
3.85/268.76 = 1.43% for the instances solved by LamSCP. It turns out that the results
of both algorithms are very stable when the order of the sentences is modified in the
original corpus.

Finally, concerning computation time, the resolution of an instance of the SCP
lasts on average 5 hr 41 min 18 sec (±8 min 28 sec) for LamSCP versus 51 sec

370

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

(±0 sec) for ASA. On average over the 47 instances, LamSCP takes 390 (±9) times as
long as ASA.

6.2 Stability of the Algorithms for k-Covering of 2-Phonemes in English

The considered SCP consists of covering at least k times each of the 2,012 2-phonemes of
the Gutenberg corpus, with k varying from 1 to 5. The results are summarized in Table 4.
For all instances of these SCPs, it has been observed that LamSCP computes shorter
coverings than ASA. However, that advantage diminishes as k grows: The cost advan-
tage offered by LamSCP compared with ASA decreases from 9.73% (±0.13) for k = 1
to 4.50% (±0.04) for k = 5. Also, the solutions obtained from ASA and LamSCP seem
to get closer to the optimal solution as k rises. The corresponding figures are presented
in Table 5: For instance, the optimal solution is at most 0.75% (±0.02) shorter than that
obtained by LamSCP for k = 1, and 0.27% (±0.00) for k = 5.

Table 4
Experiment 2: Statistics based on 60 instances of a k-covering of 2-phonemes from Gutenberg.

ASA

LamSCP

L( ˜Λ)

k = 1

k = 2

k = 3

k = 4

k = 5

Covering size (phones)
Reduction rate relative to ASA (%)
Covering size [min; max]
Sentence number
Sentence length
Time CPU (seconds)

Covering size (phones)
Reduction rate relative to ASA (%)
Covering size [min; max]
Sentence number
Sentence length
Time CPU (seconds)

Covering size (phones)
Reduction rate relative to ASA (%)
Covering size [min; max]
Sentence number
Sentence length
Time CPU (seconds)

Covering size (phones)
Reduction rate relative to ASA (%)
Covering size [min; max]
Sentence number
Sentence length
Time CPU (seconds)

Covering size (phones)
Reduction rate relative to ASA (%)
Covering size [min; max]
Sentence number
Sentence length
Time CPU (seconds)

14,909 ± 23

[14,700; 15,107]
623.75 ± 1.61
23.90 ± 0.04
8 ± 0

27,518 ± 25

[27,278; 27,727]
1080.19 ± 1.60
25.48 ± 0.02
14 ± 0

39,319 ± 34

[39,091; 39,580]
1,507.75 ± 1.75
26.07 ± 0.01
20 ± 0

50,491 ± 28

[50,300; 50,753]
1,908.10 ± 1.73
26.46 ± 0.01
28 ± 0

61,375 ± 30

[61,137; 61,683]
2,288.91 ± 1.88
26.81 ± 0.01
39 ± 0

13,458 ± 2
–9.73 ± 0.13
[13,441; 13,485]
520.85 ± 0.57
25.83 ± 0.02
2,711 ± 68

25,604 ± 4
–6.94 ± 0.09
[25,576; 25,642]
948.30 ± 0.68
26.99 ± 0.01
3,931 ± 68

36,985 ± 4
–5.92 ± 0.07
[36,946; 37,028]
1,359.28 ± 1.13
27.20 ± 0.02
5,974 ± 114

48,023 ± 4
–4.89 ± 0.06
[47,987; 48,075]
1,744.40 ± 1.35
27.53 ± 0.01
6,589 ± 239

58,610 ± 4
–4.50 ± 0.04
[58,582; 58,645]
2,101.62 ± 0.84
27.88 ± 0.01
7,599 ± 344

13,357 ± 2
–10.41 ± 0.12
[13,341; 13,378]


25,413 ± 1
–7.65 ± 0.08
[25,400; 25,430]


36,746 ± 2
–6.53 ± 0.07
[36,724; 36,769]


47,820 ± 4
–5.29 ± 0.05
[47,780; 47,842]


58,447 ± 2
–4.76 ± 0.04
[58,420; 58,465]


371

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

Table 5
Ratios (cid:28)LamSCP and (cid:28)ASA for the k-covering of 2-phonemes from Gutenberg.

k

(cid:28)LamSCP

(cid:28)ASA

1

2

3

4

5

0.75%
(± 0.02)
10.41%
(± 0.12)

0.74%
(± 0.01)
7.65%
(± 0.08)

0.64%
(± 0.01)
6.53%
(± 0.07)

0.42%
(± 0.01)
5.29%
(± 0.05)

0.27%
(± 0.00)
4.76%
(± 0.04)

Because the search area diminishes as k increases, it may be observed that the
algorithms tend to be more stable. This is true both for the size of the solutions, as well as
for the number of sentences that define them. Table 6 represents the variation of the size
of the solutions as a function of k. This variation is calculated as follows: For a given
k number and a given algorithm, the standard deviation of the size of the k-covering
computed by that algorithm is divided by the average size of these coverings. Thus, it
can be noted that LamSCP offers a stability 4 to 8 times superior to ASA concerning the
size of the coverings. As for the number of sentences, the relative standard deviation
similarly decreases from 0.97% to 0.28% when k increases from 1 to 5 for ASA solutions,
and from 0.42% to 0.15% for LamSCP ones.

One can note that the increase of the minimal number k of instances of each unit
to cover leads to a selection, by LamSCP and ASA, of longer sentences on average. The
average length of the sentences picked for a 1-covering was quite low. As the constraints
increase along with k, it only seems natural that the algorithms tend to select longer
sentences, as shorter sentences no longer contain enough occurrences of 2-phonemes.
Moreover, as described in Section 2, when the minimal number bi of a unit ui demanded
in the covering exceeds the number of instances of that unit in the initial corpus, all
sentences containing instances of ui in the initial corpus are selected, and bi is set to
(A1Rn )i. Thus, as k increases, the algorithm tends to select more and more sentences,
and their length tends towards the average value over the whole corpus, which is
28.51 phones for Gutenberg.

As for computation time, although it increases as k grows, because of the increasing
number of constraints to update, the ratio between the computation time of LamSCP
and ASA tends to diminish, as shown in Table 7. This tendency may find an explana-
tion in the fact that the search space diminishes as k increases, which causes a lesser
number of selected sentences to be questioned during the 3-phase iteration of LamSCP.
Also, we notice that the average computation time of the two algorithms is greater
in Experiment 1, owing to a greater number of sentences in corpus Le-Monde and a
higher density of matrix A. Moreover, the ratio between the computation times of

Table 6
Relative standard deviation of solution cost for both algorithms of a k-covering of 2-phonemes
from Gutenberg.

k

1

2

3

4

5

LamSCP
ASA

0.07% 0.05% 0.05% 0.04% 0.02%
0.57% 0.37% 0.29% 0.18% 0.17%

372

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

Table 7
Computation time ratio between LamSCP and ASA for a k-covering of 2-phonemes from
Gutenberg.

k

1

2

3

4

5

Time LamSCP / Time ASA 333 (± 7)

280 (± 5)

292 (± 6)

218 (± 9)

194 (± 8)

LamSCP and ASA decreases between Experiments 1 and 2, going from 390 (±9) to
333 (±7). Again, this can be explained by the diminishing of the search space.

For k = 1, the advantage offered by LamSCP on the covering costs compared with
ASA is slightly higher than that observed in Experiment 1: 9.73% (±0.13) in this case,
versus 9.00% (±0.20) in the previous experiment. This seems to contradict the idea that
the performance of LamSCP improves as the search area becomes wider. However,
the distributions of the units to cover in Gutenberg and Le-Monde are different, and the
variation on the length of the sentences in Le-Monde is very high, which may account
for this slight difference in terms of gain. Note that the size of the calculated coverings
and the lower value L( ˜Λ) are closer in the experiment carried out on Gutenberg. It is
difficult, however, to perform further comparisons with Experiment 1 regarding the
“distance“ between the costs of the solutions computed by these algorithms, and the
optimal covering cost, given that the quality of the lower bound cannot be evaluated.
The gain in stability offered by LamSCP, both for the costs of the solutions or the
number of sentences, is more important than that noticed during the previous exper-
iment. We think that the increase is due to a more restricted search space, and less
variability of the length of the sentences in corpus Gutenberg, which may be observed in
Table 1.

6.3 1-Covering of 3-Phonemes in English

Table 8 sums up the main results of Experiment 3, where 35 instances of 1-covering
of 3-phonemes from Gutenberg were processed. According to the L( ˜Λ) values, covering
all 3-phonemes requires a solution size greater than or equal to 226,635 phones. On
average, the solution measures 227,360 ± 12 phones using LamSCP, and 236,828 ± 94

Table 8
Experiment 3: Statistics based on 35 experiments of a 1-covering of 3-phonemes from Gutenberg.

ASA

LamSCP

L( ˜Λ)

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Covering size (phones)
Reduction rate relative to ASA (%)
Covering size [min; max]
Covering size Std. Dev.
Sentence number
Sentence length
Time CPU (seconds)

236,828 ± 94

[236,075; 237,615]

301.75
8,005.20 ± 5.02
29.88 ± 0.00
2,834 ± 20

227,360 ± 12

–3.99 ± 0.04
[227,317; 227,425]
33.12
7,606.77 ± 2.14
29.58 ± 0.01

371,501 ± 10

226,559 ± 8

–4.33 ± 0.04
[226,518; 226,635]

23.67



373

Computational Linguistics

Volume 41, Number 3

phones using ASA. The optimal covering is at most 0.35% (±0.00) shorter than solutions
derived by LamSCP and 4.33% (±0.04) shorter than the ones derived by ASA. We can
then observe that both algorithms manage to compute solutions with close sizes when
scaling up the required attribute set. The solutions are very stable, even more than
in Experiment 2: The relative variation of their size is 0.12% for ASA and 0.01% for
LamSCP; the relative variation of their sentence number is 0.17% for ASA and 0.07%
for LamSCP. This increase of stability is due to a smaller search space and the increase
of the number of rare units required, which also compels the algorithms to select a
higher number of inevitable sentences for all the instances of the SCP. Furthermore,
the decrease of the ratio between the computation time of LamSCP and ASA from 332
for the 1-covering of 2-phonemes to 130 for the one of 3-phonemes on Gutenberg may
confirm this idea, which has also been put forward in Experiment 2.

Concerning the length of the selected sentences by both algorithms, it is greater
than the one for the 5-covering of 2-phonemes, observed in Experiment 2, and slightly
deviates from the average sentence length for the whole corpus. Consequently, it turns
out that covering longer and generally rarer units involves a selection of longer sen-
tences. This is confirmed by the fact that the sentences of Gutenberg covering units with
a single occurrence in Gutenberg represent more than half the size of the solutions and
are composed of 33 phones on average.

6.4 1-Covering of 3-Phonemes in French

In this section, we analyze the results of Experiment 4, the 30 instances of the 1-
covering of 3-phonemes from Le-Monde carried out by ASA and LamSCP. The results
are given in Table 9. First, although the main features of Le-Monde and Gutenberg are
different, notice that the closeness between the size of coverings calculated by both
algorithms and the lower bound L( ˜Λ) is comparable to the one observed in Experiment
3. Indeed, the optimal covering size is at most 0.48% (±0:03) and 4.35% (±0:05) shorter
than the solution size derived by LamSCP and ASA, respectively. Similarly, the size
of solutions and the number of selected sentences are as stable as those observed in
the previous experiment: The solution length varies from 0.01% for LamSCP to 0.10%
for ASA and the number of sentences fluctuates about 0.16% for ASA and 0.10% for
LamSCP.

As for the comparison with the results of Experiment 1 (1-covering of 2-phonemes
from Le-Monde), the main trends are similar to the ones observed for the transition
from the 1-covering of 2-phonemes to the 1-covering of 3-phonemes from Gutenberg.

Table 9
Experiment 4: Statistics based on 30 experiments of a 1-covering of 3-phonemes from Le-Monde.

ASA

LamSCP

L( ˜Λ)

Covering size (phones)
Reduction rate relative to ASA (%)
Covering size [min; max]
Covering size Std. Dev.
Sentence number
Sentence length
Time CPU (seconds)

620,434 ± 222

[619,319; 622,201]

633.03
6,969.10 ± 4.63
89.02 ± 0.03
7,845 ± 78

596,323 ± 27

–3.89 ± 0.03
[596,190; 596,486]

88.59
6,436.76 ± 2.56
92.64 ± 0.03
660,928 ± 24,355

593,417 ± 127
–4.35 ± 0.04
[592,640; 594,250]
391.82



374

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

in Experiment 4,

However,
the average selected sentence length has markedly
increased, approaching the mean value on the whole corpus: 89.02 (±0:03) for ASA
and 92.64 (±0:03) for LamSCP, whereas in Experiment 1 these values are, respectively,
25.48 (±0:10) and 28.97 (±0:12). We have already observed in Experiment 3 that
covering longer units increases the length of selected sentences but this high amplitude
seems to be inherent to the design of corpus Le-Monde. Furthermore, notice that the
1-coverings of 2-phonemes from Le-Monde are almost half as small as the ones from
Gutenberg, whereas the 1-coverings of 3-phonemes from Le-Monde are between twice
and three times longer than the ones from Gutenberg. This is due to the fact that the
3-phonemes with a single instance in Le-Monde are very scattered in long sentences
(their length mean is about 134 phones), and these indispensable sentences represent
nearly half the size of the solutions. The other sentences of the solutions are around
70 phones long. Lastly, the ratio between the computation time of both algorithms is
about 84, which is smaller than the ratios previously observed, but this SCP is the most
time consuming: 2 hr 10 min for ASA and more than 7 days for LamSCP.

6.5 k-Covering of 1-POS and 2-POS in French

Table 10 sums up the main results of Experiment 5, dealing with the 1- and 5-coverings
of 1-POS and 2-POS. For all these SCP, LamSCP produces smaller coverings, com-
posed of longer sentences, than the coverings obtained with ASA. When the search
space diminishes, the relative “distance” between the size of solutions provided by
both algorithms decreases, as well as between the lower bound L( ˜Λ) and the size of
solutions obtained by ASA. These trends were also observed in the earlier experiments.
In particular, as for the 1-covering of 1-POS, not only does LamSCP provide 10.06%
(± 0.00) shorter solutions than ASA, but its solutions are optimal for all 50 instances
of this SCP. Indeed, the lower bound value varies from 482.51 to 482.87 occurrences of
1-POS while all the solutions given by LamSCP are made of exactly 483 occurrences
of POS. For the other k-coverings of n-POS, the optimal solution is at worst 0.39%
(±0.02) shorter than the covering given by LamSCP for (k, n) = (5, 1), 0.11% (±0.00) for
(k, n) = (1, 2), and 0.22% (±0.00) for (k, n) = (5, 2). The solutions obtained by ASA or by
LamSCP are very stable. For example, the relative standard deviation of number of POS
in a covering solution varies from 0.00% to 1.23% for ASA, and from 0.00% to 0.04%
for LamSCP.

As previously observed for both algorithms, their computation times grow when
the number of required covering features increases. However, the ratio between the
computation time of LamSCP and ASA does not behave as in Experiment 2 (see Table 7):
For the k-covering of 1-POS, this ratio increases from 290 (±11) to 657 (± 43) when k goes
from 1 to 5, and for the k-covering of 2-POS, it increases from 75 (±5) to 108 (±6).

7. Evaluation on a Text-to-Speech Synthesis System

In the previous sections, different algorithms dealing with corpus reduction were in-
troduced and studied. The proposed experiments mainly evaluate the effects of these
algorithms in terms of corpus reduction but not according to a practical task. This
section proposes an experiment to assess the impact of the corpus reduction on a unit
selection speech synthesis system.

As explained in Section 1, a corpus reduction for a TTS system is a trade-off between
minimizing the recording and post-processing time to build the speech corpus and

375

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

Table 10
Statistics of 50 instances of a k-covering of n-POS from Le-Monde.

Experiment 5: k-covering of n-POS, corpus Le-Monde

n = 1 and k = 1

ASA

LamSCP

Covering size (POS)
Reduction rate relative to ASA (%)
Covering size [min; max]
Covering size Std. Dev.
Sentence number
Sentence length
Time CPU (seconds)

Covering size (POS)
Reduction rate relative to ASA (%)
Covering size [min; max]
Covering size Std. Dev.
Sentence number
Sentence length
Time CPU (seconds)

Covering size (POS)
Reduction rate relative to ASA (%)
Covering size [min; max]
Covering size Std. Dev.
Sentence number
Sentence length
Time CPU (seconds)

Covering size (POS)
Reduction rate relative to ASA (%)
Covering size [min; max]
Covering size Std. Dev.
Sentence number
Sentence length
Time CPU (seconds)

537.70 ± 1.78

[524; 552]
6.64
72.07 ± 0.51
7.46 ± 0.04
2 ± 0

n = 1 and k = 5

2,659 ± 4

[2,628; 2,682]
12.42
285.00 ± 0.82
9.33 ± 0.02
5 ± 0

n = 2 and k = 1

77,022 ± 18

[76,913; 77,187]
63.33
3,288.22 ± 1.48
23.42 ± 0.00
175 ± 0

n = 2 and k = 5

281,532 ± 17

[281,362; 281,652]
72.09
10,379.67 ± 1.67
27.12 ± 0.00
2,863 ± 113

483.00 ± 0.00
–10.06 ± 0.00
[483; 483]
0

60.65 ± 0.31
7.97 ± 0.03
735 ± 25

2,502 ± 0
–5.90 ± 0.11
[2,501; 2,505]
1.11
238.91 ± 0.65
10.47 ± 0.02
2,712 ± 162

L( ˜Λ)

482.68 ± 0.02
–10.17 ± 0.00
[482.51; 482.87]
0.08



2,492 ± 0
–6.27 ± 0.09
[2,482; 2,493]

1.85



l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

75,281 ± 1
–2.25 ± 0.02
[75,273; 75,297]

3.44
3,120.89 ± 0.91
24.12 ± 0.00
13,095 ± 881

75,192 ± 2
–2.37 ± 0.02
[75,176; 75,220]

8.26



278,114 ± 5

–1.21 ± 0.00
[278,084; 278,152]

15.27
10,034.96 ± 2.66
27.71 ± 0.00
309,095 ± 901

277,497 ± 4

–1.43 ± 0.00
[277,459; 277,524]

14.33



/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

keeping the highest phonological richness of the corpus to ensure the quality of the
synthetic speech. The goal of this experiment is to measure this trade-off by evalu-
ating the quality of the same TTS system fed with different speech corpora uttered
by the same speaker. Note that the intrinsic quality of this system is not the purpose
here.

Firstly, a brief presentation of a state-of-the-art unit selection–based TTS system
is proposed in Section 7.1. The linguistic parameters used by the TTS system are de-
tailed because they are linked to the required features in the reduction stage. In Sec-
tion 7.2, corpora used in the experiment are introduced. The attributes to cover and the

376

Barbot et al.

Large Linguistic Corpus Reduction

methodology of evaluation are described in Section 7.3; the results are given and dis-
cussed in Section 7.4.

7.1 Text-to-Speech System

For this experiment, a state-of-the-art unit selection–based TTS system is used to pro-
duce an acoustic signal from an input text. A linguistic front end processes the text to
extract features taken into account by the algorithm that selects segments in a speech
corpus (see Bo¨effard and d’Alessandro 2012).

The input text is converted into a sequence of phonemes using a French phonetizer
proposed by B´echet (2001). Non-speech sound labels can be added to this sequence
(silences, breaths, para-verbal events, etc.). A vector of features is defined as follows:

1.

2.

3.

4.

5.

6.

7.

The phone or non-speech sound label

Is the described segment a non-speech sound?

Is the phone in the onset of the syllable?

Is the phone in the coda of the syllable?

Is the phone in the last syllable of its syntagm?

Is the current syllable at the end of a word?

Is the current syllable at the beginning of a word?

Extraction of features is done using the ROOTS toolkit described in Bo¨effard et al.
(2012). The unit selection process aims to associate a signal segment from the speech
corpus to each vector of features computed from the input text. This is performed
in two steps. In the first step, for each unit, a set of candidates that match the same
features are extracted from the speech corpus. In the second step, given all candidates,
the best path is searched using an optimization algorithm so as to produce the sequence
of speech units. The algorithm tries to minimize three sub-costs, commonly used in unit
selection based TTS systems, which are spectral discrepancies based on MFCC distance,
amplitude, and f0 distances.

7.2 Corpora

Two corpora are used in this experiment. The first one, Learning corpus, is an annotated
acoustic corpus used to provide speech data for the TTS engine. It is an expressive
corpus in French, spoken by a male speaker reading Albertine disparue, an excerpt from `A
la recherche du temps perdu by Marcel Proust. The corpus is composed of 3,138 sentences
automatically annotated using a process described in Bo¨effard et al. (2012). The overall
length of the speech corpus is 9 hr 57 min. When creating a voice for a unit selection–
based TTS system, long sentences are generally removed or split into syntagm groups
in order to help the speaker.

A second corpus, named Test corpus, is a text corpus that is synthesized and used
in the listening experiment. It is composed of 30 short sentences randomly extracted
from a phonetically balanced corpus in French, proposed by Combescure (1981). The
use of a corpus with a different linguistic style minimizes the bias introduced by the
learning corpus.

Statistics are given in Table 11.

377

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

Table 11
Characteristics of the two corpora.

Learning corpus

Test corpus

Number of syntagms
Number of labels
Corpus size (labels)
Number of 2-phoneme
Number of diphones
Syntagm length mean (phones) & Std. Dev.

19,587
36
392,865
1,033
373,278
20.1 (11.9)

30
36
813
317
787
27.1 (5.6)

7.3 Methodology

For this experiment, two reduced corpora are evaluated. They are built by reducing
the full learning corpus using the two different algorithms presented in the previous
sections: ASA and LamSCP.

As described in Section 7.1, the unit selection process of the speech synthesis system
is based on a set of phonological attributes. It seems natural to try to cover features that
reflect the variability of these attributes. For this experiment, algorithms must cover all
the units at least once, where a unit is described by the following:

r
r

r
r

Its label, that is, one of the 35 phonemes or a non-speech sound label

The structure of the syllable that contains the phoneme, if it is a vowel

The position of the associated syllable in the word (start, middle, or end)

A Boolean indicated if the associated syllable is at the end of a syntagm

The feature extraction is performed by the same set of tools used by the speech synthesis
engine. Given this set of features, the learning corpus contains 1,497 classes of units. The
cost function to minimize by the reduction algorithms is the total length of the set of the
selected syntagms, in phones.

Two speech synthesis systems are defined, extracting the speech units to concate-
nate from the coverings provided by LamSCP and ASA. Two other systems are added as
baselines. First, a system named Full, built with the whole learning corpus, is used as an
upper bound. Second, a system named Random uses a random reduction of the Learning
corpus as a pool corpus of speech units. This reduction is done by randomly selecting
sentences from the whole learning corpus until the size of the covering obtained by
LamSCP is reached. Random is used as a lower bound.

Whereas the optimization efficiency is measured by statistics on the reduced cor-
pora, the quality of the synthesized speech signals is evaluated by a listening test.
The protocol is based on a MUSHRA test, presented in ITU-R (2003), where for every
sentence of Test corpus, the signals synthesized by the four systems are presented to
each tester in a random order. If a system is not able to produce a signal for a re-
quested sentence (because of a missing 2-phone in the pool corpus), an empty signal
is presented. Ten native French testers (four naives and six experts) are asked to eval-
uate the overall quality of the stimuli and to give a mark from 0 to 100 (by steps of
5 points).

378

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

Table 12
Statistics about the reduced corpora computed by ASA and LamSCP from Learning corpus.
The last column concerns the best lower bound found by LamSCP.

Speech synthesis experiment: corpus reduction

ASA

LamSCP

Covering size (phones)
Reduction rate relative to the full corpus (%)
Reduction rate relative to ASA (%)
Number of syntagms

36,538
–90.70

2,191

35,681
–90.92
–2.35

2,119

L( ˜Λ)

35,655
–90.92
–2.42

7.4 Results and Discussion

The Learning corpus composed of 19,587 syntagms is first reduced using ASA and
LamSCP. Statistics of the resulting solutions are summarized in Table 12. The covering
of the 1,497 constraints divides the input corpus size by almost 10 and reduces the
10 hours of speech to around 1 hr 20 min. As for the previous experiments, even though
different kinds of features are mixed (phonemes, syllable structures, position in a word
or a syntagm) the ASA algorithm produces a solution close to the optimal one. However,
LamSCP is again slightly better in terms of covering size.

For the measure of the acoustic impact of corpus reduction, the listening test results
are presented in Table 13 for the average marks and in Table 14 for the average ranks.
Note that without a natural speech reference during the test, the marks should not be
seen as an absolute score. Even if the LamSCP corpus is slightly smaller than the one
from ASA, the acoustic quality of both systems is comparable according to the testers
(with a slight advantage for the LamSCP corpus). In comparison with the baseline,
which uses the whole learning corpus, the acoustic degradation is significant. This
illustrates well the trade-off between corpus size and speech quality: For a 90% corpus
size reduction, the acoustic quality drops by 10 points. Further research should be
focused on the set of attributes to cover and their number of occurrences in order to
improve this compromise. As expected, the baseline built from random sentences is
preferred significantly less than the other systems because of the lack of relatively rare
acoustic units.

Table 13
MUSHRA average marks of 30 sentences from the Test corpus with 10 listeners per sentence
(the higher the better).

379

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

Table 14
MUSHRA average ranks for 30 sentences from the Test corpus with 10 listeners per sentence
(the lower the better).

8. Conclusion

This article discussed the building of linguistic-rich corpora under an objective of
parsimony. This task, a generalization of SCP, turns out to be an NP-hard problem that
cannot be polynomially approximated. We studied the behavior of several algorithms
in the particular domain of NLP, where the considered events follow a heavy-tailed type
distribution.

The proposed algorithms have been compared through three kinds of experiments:
The first one is the coverings of 2- and 3-phonemes from two text corpora, one in French,
the other in English; the second one consists of the coverings of part-of-speech labels
from a corpus in French; the third one evaluates the impact of both algorithms on the
acoustic quality of a corpus-based TTS system. The first algorithm, ASA, is composed of
an agglomerative greedy strategy followed by a spitting greedy stage. The second one,
LamSCP, is based on Lagrangian relaxation principles combined with greedy strate-
gies. LamSCP is our adaptation of an algorithm proposed in Caprara, Fischetti, and
Toth (1999) to the multi-representation constraints. The comparison of SCP solutions
is mainly about their size, their maximal distance with the optimal covering, and their
robustness in case of perturbation of the initial corpus ordering.

Although ASA is much faster than LamSCP, it does not permit us to single-
handedly assess the quality of its solution in terms of size. The main assets of LamSCP
are the calculation of a lower bound to the optimal covering size and shorter solutions
than the ones obtained by ASA. Indeed, in our experiments of phonological coverings,
the optimal solution is at most 1.24% (10.13%, respectively) smaller than the solutions
derived by LamSCP (ASA, respectively). As for the coverings of 1-POS, LamSCP pro-
vides the optimal solution in a case of a mono-representation constraint, whereas the
ASA solution is 10.17% greater than the optimal one. These relative gaps between
the lower bounds and solution sizes of both algorithms generally decrease when the
size of the search space decreases. Thanks to the lower bound derived by LamSCP,
we empirically show that it is possible to get almost optimal solutions in a linguistic
framework following Zipf’s law distribution, despite the theoretic complexity of the
multi-represented SCP. Concerning the last experiment in the TTS framework, even if
LamSCP provides a smaller corpus, the subjective test shows no significant difference
between the TTS systems based on LamSCP and ASA corpora. Therefore, we think that
ASA remains the most adequate strategy, in terms of performance, ease of development,
and computation time to solve SCP in the NLP field. However, it would be interesting

380

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

to test a parallelized version of the heuristic phase that calls an important number of
greedy sub-procedures.

Our future prospects for this work are in automatic language processing and speech
synthesis. First, in the framework of the Phorevox project supported by the French
National Research Agency, we are considering the automatic design of exercise contents
for language learning by the selection of texts covering some phonological or linguistic
difficulties. Secondly, this work is a preliminary step to building a phonetically rich
script before its recording in order to produce a high quality speech synthesis. The
covering choices, such as the attributes to cover, the number of required occurrences,
or the “sentence” length (utterances, syntagms, etc.) need to be validated. Moreover,
in this article, we have observed the great impact of the distribution of rare units in
the corpus to reduce, and we believe it will be interesting to adapt the “sentence”
granularity according to this distribution.

References
Alon, Noga, Dana Moshkovitz, and Shmuel
Safra. 2006. Algorithmic construction of
sets for k-restrictions. ACM Transactions on
Algorithms (TALG), 2(2):153–177.

Barbot, Nelly, Olivier Bo¨effard, and Arnaud
Delhay. 2012. Comparing performance of
different set-covering strategies for
linguistic content optimization in speech
corpora. In Proceedings of the International
Conference on Language Resources and
Evaluation (LREC), pages 969–974, Istanbul.

B´echet, Fr´ed´eric. 2001. Liaphon: un systeme

complet de phon´etisation de textes.
Traitement automatique des langues,
42(1):47–67.

Bo¨effard, Olivier, Laure Charonnat, S´ebastien
Le Maguer, Damien Lolive, and Ga¨elle
Vidal. 2012. Towards fully automatic
annotation of audiobooks for TTS. In
Proceedings of the International Conference on
Language Resources and Evaluation (LREC),
pages 975–980, Istanbul.

Bo¨effard, Olivier and Christophe

d’Alessandro, 2012. Speech Synthesis. Wiley.

Bunnell, H. Timothy. 2010. Crafting small

databases for unit selection TTS: Effects on
intelligibility. In Proceedings of the ISCA
Tutorial and Research Workshop on Speech
Synthesis (SSW7), pages 40–44, Kyoto.

Cadic, Didier, C´edric Boidin, and Christophe
d’Alessandro. 2010. Towards optimal TTS
corpora. In Proceedings of the International
Conference on Language Resources and
Evaluation (LREC), pages 99–104, Malta.

Candito, Marie, Enrique Henestroza

Anguiano, and Djam´e Seddah. 2011. A
word clustering approach to domain
adaptation: Effective parsing of biomedical
texts. In Proceedings of the 12th International
Conference on Parsing Technologies,
pages 37–42, Dublin.

Caprara, Alberto, Matteo Fischetti, and Paolo
Toth. 1999. A heuristic method for the set
covering problem. Operations Research,
47(5):730–743.

Caprara, Alberto, Paolo Toth, and Matteo
Fischetti. 2000. Algorithms for the set
covering problem. Annals of Operations
Research, 98(1-4):353–371.

Ceria, Sebasti´an, Paolo Nobili, and Antonio

Sassano. 1998. A Lagrangian-based
heuristic for large-scale set covering
problems. Mathematical Programming,
81(2):215–228.

Chevelu, Jonathan, Nelly Barbot, Olivier
Bo¨effard, and Arnaud Delhay. 2007.
Lagrangian relaxation for optimal corpus
design. In Proceedings of the ISCA Tutorial
and Research Workshop on Speech Synthesis
(SSW6), pages 211–216, Bonn.

Chevelu, Jonathan, Nelly Barbot, Olivier
Bo¨effard, and Arnaud Delhay. 2008.
Comparing set-covering strategies for
optimal corpus design. In Proceedings of the
International Conference on Language
Resources and Evaluation (LREC),
pages 2951–2956, Marrakech.

Combescure, Pierre. 1981. 20 listes de 10
phrases phon´etiquement ´equilibr´ees.
Revue d’Acoustique, 56:34–38.

Fisher, Marshall L. 1981. The Lagrangian
relaxation method for solving integer
programming problems. Management
Science, 27(1):1–18.

Franc¸ois, H´el`ene and Olivier Bo¨effard. 2001.
Design of an optimal continuous speech
database for text-to-speech synthesis
considered as a set covering problem. In
Proceedings of the European Conference on
Speech Communication and Technology
(Eurospeech), pages 829–832, Aalborg.

Franc¸ois, H´el`ene and Olivier Bo¨effard. 2002.
The greedy algorithm and its application

381

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Computational Linguistics

Volume 41, Number 3

to the construction of a continuous
speech database. In Proceedings of the
International Conference on Language
Resources and Evaluation (LREC),
pages 1420–1426, Las Palmas,
Canary Islands.

Gauvain, Jean-Luc, Lori Lamel, and Maxine
Esk´enazi. 1990. Design considerations and
text selection for Bref, a large French
readspeech corpus. In Proceedings of the
International Conference of Spoken Language
Processing (ICSLP), pages 1097–1100,
Kobe.

synthesis. In Proceedings of the ISCA
Tutorial and Research Workshop on Speech
Synthesis (SSW6), pages 217–222,
Bonn.

Krul, Aleksandra, G´eraldine Damnati,

Franc¸ois Yvon, and Thierry Moudenc.
2006. Corpus design based on the
Kullback-Leibler divergence for
Text-To-Speech synthesis application.
In Proceedings of the International
Conference on Spoken Language Processing
(ICSLP), pages 2030–2033,
Pittsburgh, PA.

Gotab, Pierre, Fr´ed´eric B´echet, and G´eraldine

Neubig, Graham and Shinsuke Mori. 2010.

Damnati. 2009. Active learning for
rule-based and corpus-based spoken
language understanding models. In
Proceedings of the IEEE workshop on
Automatic Speech Recognition and
Understanding (ASRU), pages 444–449,
Merano.

Hart, Michael. 2003. Project gutenberg.
http://www.gutenberg.org/ (Last
consulted April 2015).

ITU-R. 2003. ITU-R recommendation bs.1534:
Method for the subjective assessment of
intermediate quality levels of coding
systems. Radio communication Bureau,
Geneva.

Karp, Richard M. 1972. Reducibility
among combinatorial problems. In
Complexity of Computer Computations,
the IBM Research Symposia Series.
Springer, pages 85–103.

Kawai, Hisashi, Seiichi Yamamoto, Norio
Higuchi, and Tohru Shimizu. 2000. A
design method of speech corpus for
text-to-speech synthesis taking
account of prosody. In Proceedings of the
International Conference on Spoken Language
Processing (ICSLP), pages 420–425,
Beijing.

Kominek, John and Alan W. Black. 2003. The
CMU Arctic speech databases for speech
synthesis research. Technical Report
CMU-LTI-03-177, Carnegie Mellon
University Language Technologies
Institute. Pittsburg, PA.

Krstulovi´c, Sacha, Fr´ed´eric Bimbot, Olivier
Bo¨effard, Delphine Charlet, Dominique
Fohr, and Odile Mella. 2006. Optimizing
the coverage of a speech database through
a selection of representative speaker
recordings. Speech Communication,
48(10):1319–1348.

Krul, Aleksandra, G´eraldine Damnati,

Franc¸ois Yvon, C´edric Boidin, and Thierry
Moudenc. 2007. Adaptive database
reduction for domain specific speech

382

Word-based partial annotation for efficient
corpus construction. In Proceedings of the
International Conference on Language
Resources and Evaluation (LREC),
pages 2723–2727, Malta.

Raz, Ran and Shmuel Safra. 1997. A

sub-constant error-probability low-degree
test, and a sub-constant error-probability
PCP characterization of NP. In Proceedings
of the twenty-ninth annual ACM symposium
on Theory of computing, STOC ’97,
pages 475–484, El Paso, TX.

Rojc, Matej and Zdravko Kaˇciˇc. 2000. Design
of optimal Slovenian speech corpus for use
in the concatenative speech synthesis
system. In Proceedings of the International
Conference on Language Resources and
Evaluation (LREC), pages 321–326,
Athens.

Schein, Andrew I., Ted S. Sandler, and

Lyle H. Ungar. 2004. Bayesian example
selection using BaBiES. Technical Report
MS-CIS-04-08, Department of Computer
and Information Science, University of
Pennsylvania.

Settles, Burr. 2010. Active learning literature
survey. Technical Report 1648, Department
of Computer Sciences, University of
Wisconsin, Madison.

Synapse. 2011. Documentation technique:

Composant d’´etiquetage et lemmatisation.
http://www.synapse-fr.com/.
Tian, Jilei and Jani Nurminen. 2009.

Optimization of text database using
hierachical clustering. In Proceedings of the
IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP),
pages 4269–4272, Taipei.

Tian, Jilei, Jani Nurminen, and Imre

Kiss. 2005. Optimal subset selection
from text databases. In Proceedings of
the IEEE International Conference on
Acoustics, Speech, and Signal
Processing (ICASSP), pages 305–308,
Philadelphia, PA.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Barbot et al.

Large Linguistic Corpus Reduction

Tomanek, Katrin and Fredrik Olsson. 2009. A
Web survey on the use of active learning to
support annotation of text data. In
Proceedings of the NAACL HLT 2009
Workshop on Active Learning for Natural
Language Processing, pages 45–48,
Boulder, CO.

Van Santen, Jan P. H. and Adam L.

Buchsbaum. 1997. Methods for optimal
text selection. In Proceedings of the European

Conference on Speech Communication and
Technology (Eurospeech), pages 553–556,
Rhodes.

Zhang, Jin-Song and Satoshi Nakamura.
2008. An improved greedy search
algorithm for the development
of a phonetically rich speech
corpus. IEICE Transactions
on Information and Systems,
E91-D(3):615–630.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u
/
c
o

l
i
/

l

a
r
t
i
c
e

p
d

f
/

/

/

/

4
1
3
3
5
5
1
8
0
6
7
0
9
/
c
o

l
i

_
a
_
0
0
2
2
5
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

383Large Linguistic Corpus Reduction with image
Large Linguistic Corpus Reduction with image
Large Linguistic Corpus Reduction with image
Large Linguistic Corpus Reduction with image

Download pdf