The Complexity of Ranking Hypotheses in

The Complexity of Ranking Hypotheses in
Optimality Theory

Jason Riggle∗
University of Chicago

Given a constraint set with k constraints in the framework of Optimality Theory (OT), what is
its capacity as a classification scheme for linguistic data? One useful measure of this capacity
is the size of the largest data set of which each subset is consistent with a different grammar
hypothesis. This measure is known as the Vapnik-Chervonenkis dimension (VCD) y es un
standard complexity measure for concept classes in computational learnability theory. En esto
trabajar, I use the three-valued logic of Elementary Ranking Conditions to show that the VCD of
Optimality Theory with k constraints is k−1. Analysis of OT in terms of the VCD establishes
that the complexity of OT is a well-behaved function of k and that the ‘hardness’ of learning in
OT is linear in k for a variety of frameworks that employ probabilistic definitions of learnability.

1. Introducción

Given a set CON of k constraints in the framework of Optimality Theory (OT; Príncipe
and Smolensky 1993), what is the capacity of CON as a classification scheme for samples
of language data? In OT, constraints are functions that map candidates to natural num-
beres, where each candidate is a member of the (possibly infinite) set of possible deriva-
tions of an input form i supplied by the candidate generating function GEN(i). El
number that a constraint Ci assigns to a candidate indicates how many times that
candidate violates Ci. A grammar is a ranking of the constraints that imposes a total
ordering on CON, R
CON (or simply R when CON is clear from the context), y el
language that is generated by grammar R is the set of candidates that are optimal
according to R as in Definition 1.

Definición 1

a. Candidate a is more harmonic than candidate b according to R, written a (cid:4) b,
if they share the same input and a is assigned fewer violations by the highest-
ranked constraint that assigns different numbers of violations to a and b.
b. Candidate a is optimal according to ranking R iff no other candidate generated

by GEN is more harmonic than a.

Because each of the k! rankings of CON is a different grammar that generates a
potentially unique language, one natural measure of the classificatory capacity of CON

∗ Department of Linguistics, University of Chicago, 1010 mi. 59th St., chicago, IL 60637.

jriggle@uchicago.edu. Many thanks to Alan Prince, Jeff Heinz, Greg Kobele, Colin Wilson, and two
anonymous Computational Linguistics reviewers for helpful comments and suggestions. Any errors are, de
curso, my own.

Envío recibido: 22 December 2006; revised submission received: 9 Octubre 2007; aceptado para
publicación: 17 December 2007.

© 2009 Asociación de Lingüística Computacional

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 1

is the upper bound of k! languages in what Prince and Smolensky (1993, página 27) dub
the factorial typology of the constraint set. Another complexity metric that is useful in
analyses of learnability (especially for non-finite concept classes) is the cardinality of
the largest data set of which each subset corresponds to a different ranking hypothesis.
The idea of measuring the complexity of a concept class (in the case at hand, a set of
grammars) in this way comes from the work of Vapnik and Chervonenkis (1971) and is
known as the Vapnik-Chervonenkis dimension (VCD). In OT, the VCD of a constraint
set CON (es decir., the concept class consisting of languages generated by rankings of CON)
is the size of the largest sample (set of candidates) that is shatterable as in Definition 2.

Definición 2
A sample S is shatterable by a constraint set CON iff, for every partitioning of S into two
disjoint sets T and F (including the null/whole partition), there is at least one ranking
R

CON that makes every s ∈ T optimal but no s ∈ F optimal.

Vapnik and Chervonenkis’s definition of shatterability has interesting implications
for samples consisting of OT candidates. Por ejemplo, each candidate in a shatterable
sample S must be an input → output mapping for a unique input form because two
candidates a and b with the same input would either tie with identical sets of violations
or show harmonic inequality. In the case of a tie, no ranking could realize a partitioning
that separates a and b and, in the case of harmonic inequality, no ranking could realize
a partitioning in which a and b are simultaneously optimal. More generally, the VCD
places an upper bound on the number of distinct grammar hypotheses that can be real-
ized over any sample of linguistic data consisting of OT candidates, and thus provides
a ready measure of the complexity of the hypothesis space in Optimality Theory.

The VCD of a concept class is obviously not independent of its size. As Blumer
|C|
et al. (1989) point out, for any finite concept class C, the VCD is bounded at log2
because it takes at least 2d hypotheses to associate a unique hypothesis with every sub-
set of a sample of size d. De este modo, because the number of grammars (hypotheses) encima
k constraints is finite—one grammar for each of the k! rankings—the VCD of OT is
bounded at log2 k!. O, put more simply, because log2 x! ≤ x log2 x, this establishes
k log2 k as an upper bound on the VCD of OT. In this article, I will show how the
structure of the hypothesis space in Optimality Theory provides a tighter bound on
the VCD of OT than the bound established by the finitude of the hypothesis space. I
will improve upon the inherent bound of k log2 k by showing that the VCD of OT with
k constraints is actually bounded at k − 1 and thus grows linearly with the size of |CON|.
The complexity measured by the VC dimension has a number of ramifications for
learning in Optimality Theory. Por ejemplo, the VCD of a concept class places an abso-
lute lower bound on the number of mistakes that any error-driven learning algorithm
can be guaranteed of achieving (Littlestone 1988). This fact tells us that it may yet
be possible to improve upon the quadratic mistake bound of (k2 − k)/2 for Recursive
Constraint Demotion (Tesar and Smolensky 1993, 2000; Tesar 1995, 1997, 1998), el
reigning mistake bound for any OT learning algorithm. The VCD of a concept class
also provides a very general bound on the number of data samples that are required for
learning in probabilistic models of learning that will be discussed in Section 5.

2. Elementary Ranking Conditions

The main result for the VC dimension of OT will be given in Section 4. Primero, alguno
supporting results will be established showing that there is an upper bound of k − 1 en

48

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Riggle

Ranking Hypotheses in OT

shatterable sets of statements about constraint rankings that are expressed with Prince’s
(2002) Elementary Ranking Conditions.

If our sample space X consists of candidates, then any x ∈ X can be described
in terms of the set of constraint rankings under which x is optimal. Príncipe (2002)
provides a scheme for encoding this kind of ranking information called an Elementary
Ranking Condition (ERC). En esta sección, I will review some formal properties of ERCs
that are relevant for establishing the VC dimension of OT. Prince demonstrates many
formal properties of ERCs beyond those covered here and shows that ERCS are equiv-
alent to the implication-negation fragment of the three-valued relevance logic RM3 (cf.
Anderson and Belnap 1975). This section will review properties of ERCs that are most
relevant for the results at hand. For formal proofs and a complete exposition of the logic
of ERCs, see Prince (2002).

For a constraint set CON containing k constraints, ERCs are k-length vectors that
use the symbols L, mi, and W to encode logical statements about rankings. Each con-
straint is assigned an arbitrary numeric index, and in each ERC α, the ith coordinate αi
refers to the constraint with ith index Ci. The meaning of an ERC is that at least one
constraint whose corresponding coordinate contains a W outranks all of the constraints
whose coordinates contain L’s. De este modo, (cid:9)W., mi, l, l(cid:10) means that C1 outranks both C3 and
C4, mientras (cid:9)l, l, W., W.(cid:10) means that either C3 or C4 outranks both C1 and C2. ERCs can
be constructed by comparing candidates as in Definition 3. Note that Ci(a) denotes the
number of times candidate a violates the constraint with index i.

Definición 3
Given a constraint set CON with k constraints indexed {1 … k} and two candidates
that share the same input, the function ercCON(a, b) returns an ERC α = (cid:9)α1, …, αk(cid:10) eso
describes the rankings under which a (cid:4) b.1

erc(a, b) = (cid:9)α1, …, αk(cid:10) dónde






αi = W if Ci(a) < Ci(b) αi = L if Ci(a) > Ci(b)
αi = e if Ci(a) = Ci(b)

The symbol W in αi of erc(a, b) = α is a mnemonic for the fact that Ci favors a
(the winner), whereas an L in coordinate i is a mnemonic for the fact that Ci favors b
(the loser). An e in αi indicates that the candidates are equivalent according to Ci.

Ejemplo 1

aporte

cand. a
cand. b

C1
*
**

C2
**
*

C3
*

erc(b, a) = (cid:9)l, W., W. (cid:10) = b (cid:4) a if C2 or C3 outranks C1
erc(a, b) = (cid:9)W., l, l (cid:10) = a (cid:4) b if C1 outranks C2 and C3

(*’s indicate number of violations)
Note the symmetry between erc(a, b) = (cid:9)W., l, l (cid:10), which says that candidate a is more
harmonic than b under any ranking where C1 outranks both C2 and C3, and erc(b, a),

1 The function ercCON (a, b), or simply erc(a, b) when CON is clear from context, is undefined for candidate

input → output mappings with different inputs because they cannot be meaningfully compared.

49

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 1

which says that b is more harmonic than a under any ranking where either C2 or C3
outranks C1. This symmetry reflects the fact that erc(a, b) and erc(b, a) encode antithetical
ranking conditions. The opposition between these ERCs follows straightforwardly from
the fact that only one of the two candidates can be optimal under any given ranking.

The illustrative tableaux presented with OT analyses can be turned into sets of
ERCs by making pairwise comparisons between the violations for one designated (o
observado) winner and the violations for each other candidate.

Ejemplo 2

aporte

cand. a
cand. b
cand. C
cand. d
cand. mi
cand. F

C1
*
**
*
*
**

C2
**
*

*
**
***

C3

**
*

*

winner
erc(a, b) = (cid:9)W., l, mi (cid:10) = a (cid:4) b if C1 outranks C2
erc(a, C) = (cid:9)mi, l, W.(cid:10) = a (cid:4) c if C3 outranks C2
erc(a, d) = (cid:9)mi, l, W.(cid:10) = a (cid:4) d if C3 outranks C2
erc(a, mi) = (cid:9)W., mi, mi (cid:10) = a (cid:4) e under every ranking
erc(a, F ) = (cid:9)l, W., W.(cid:10) = a (cid:4) f if C2 or C3 outranks C1

The comparison of candidate a with candidate e in Example 2, erc(a, mi) = (cid:9)W., mi, mi(cid:10),
yields an odd ranking condition that does not actually express a particular ranking (No
constraint has an L), but instead indicates that C1 favors candidate a and no constraint
favors candidate e. En este caso, candidate e is said to be harmonically bounded by
candidate a because there can be no ranking under which e is more harmonic than
a. En cambio, if candidate e were designated the winner, then erc(mi, a) = (cid:9)l, mi, mi(cid:10). Este
ERC also does not encode a specific ranking, but rather indicates that the mere existence
of candidate a as an alternative means that no ranking can make candidate e optimal.

Like most OT tableaux, Ejemplo 2 is an illustration of how a handful of candidates
fare with respect to one another according to a particular set of constraints. To know
which rankings (if any) make candidate a globally optimal, it would be necessary to
define the candidate generating function GEN in order to obtain a representation of the
entire set of ERCs {erc(a, X) | x ∈ GEN(aporte)} = ERCS(a). This is not as daunting as it
might appear because, even though |GEN(aporte)| may be infinite, the fact that the num-
ber of k-length ERCs is finite guarantees that each of the candidates in GEN(aporte) will
map to one of a finite number of ERC sets. Además, as Riggle (2004) demonstrates,
the standard OT assumption of the universality of faithfulness constraints that pe-
nalize changes to the input guarantees that all but finitely many of the members of
GEN(aporte) will be harmonically bounded. Riggle also presents an algorithm for com-
puting this finite set of contenders (es decir., candidates that are not harmonically bounded)
that can be used in cases where GEN is restricted so that it is a rational function.2
Regardless of how optimization is computed, what is relevant for the assessment of
the VCD of OT is the definition of optimality. Following Definition 1, a ranking R
CON can
be seen as a function from candidates to True (if they are optimal) or False (if they are

2 GEN is rational if it is representable as a finite state transducer. Riggle’s (2004) CONTENDERS algorithm is
an extension of Ellison’s (1994) application of Dijkstra’s (1959) “shortest paths” algorithm to optimization
in OT that operates over finite-state representations of GEN and EVAL. Ellison showed that if harmony is
used as the “distance” to be optimized, then optimal outputs can be efficiently found. The CONTENDERS
algorithm follows a similar strategy but, instead of finding the shortest (es decir., most harmonic) path for one
ranking, the algorithm finds all non-harmonically-bounded paths and thereby optimizes for all rankings.

50

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Riggle

Ranking Hypotheses in OT

no). The entire ERC set for a candidate ERCS(a) describes exactly the rankings under
which candidate a is a globally optimal candidate.

The reduction of candidates to ERC sets makes it possible to use the logic of ERCs
to reason about candidates. Most of the time, the ERCs of interest are those that contain
at least one L and one W—what Prince calls nontrivial ERCs. ERCs that contain W’s
but no L’s are generated when a candidate is compared with another candidate that it
harmonically bounds, such as erc(a, mi) = (cid:9)W., mi, mi (cid:10) in Example 2. This ERC reveals that
candidate e cannot be optimal but yields no information about what rankings make
candidate a optimal. Similarmente, no ranking information can be gleaned from the all-e
ERC that results from comparing “tied” candidates that have the same violations.
Finalmente, ERCs like erc(mi, a) = (cid:9)l, mi, mi (cid:10), with L’s but no W’s reveal nothing other than the
fact that candidate e cannot be optimal under any ranking.

The most relevant logical relation for ERCs is that of entailment. The entailment
relation among nontrivial ERCs is given in Definition 4 (Príncipe 2002, página 6, Proposi-
ción 1.1).

Definición 4
For nontrivial ERCs α and β, α → β iff each αi ∈ α entails βi ∈ β where L → e → W.

Because nontrivial ERCs encode disjunctions of conjunctions (es decir., [C1 or … cn] outranks
[C1(cid:1) y … cn(cid:1) ]), entailments of the form α → β line up with the logical operations of
disjunction introduction (whenever β has W where α has an L or an e) and conjunction
elimination (whenever β has an e where α has an L).

Ejemplo 3

(cid:9)W., l, l, mi(cid:10) → (cid:9)W., mi, l, mi(cid:10)
(cid:9)W., mi, l, mi(cid:10) → (cid:9)W., mi, l, W.(cid:10)
(cid:9)W., l, l, mi(cid:10) → (cid:9)W., mi, l, W.(cid:10)

es decir., If C1 outranks C2 and C3 then C1 outranks C3.
es decir., If C1 outranks C3 then C1 or C4 outranks C3.
es decir., If C1 outranks C2 and C3 then C1 or C4 outranks C3.

In addition to revealing entailments among individual ranking conditions, la lógica
of ERCs makes it possible to derive new ranking conditions that are entailed by the
combination of other ERCs. Príncipe (2002, página 8) provides a logical operation called
fusion that derives entailments from sets of ERCs.

Definición 5
The fusion of ERC set Φ is a single ERC φ that is entailed by Φ where:
φi = L if any ERC in Φ has an L in its ith coordinate,
φi = e if every ERC in Φ has an e in its ith coordinate,
φi = W otherwise.

Every ERC entailed by Φ is entailed by the fusion of a subset of Φ (Príncipe 2002,
página 14). De este modo, the operation of fusion can reveal nonobvious entailments among ERCs.
Consider Φ = {(cid:9)W., W., mi, l(cid:10), (cid:9)l, W., W., mi(cid:10), (cid:9)W., mi, l, W.(cid:10)}. The ERCs in Φ denote, respectivamente,
“C1 or C2 outranks C4,” “C2 or C3 outranks C1,” and “C1 or C4 outranks C3.” The fusion
of Φ is (cid:9)l, W., l, l(cid:10), which encodes the inference from Φ that C2 outranks C1, C3, and C4.
The operation of fusion can also reveal inconsistencies in ERC sets. Consider the
set Ψ = {(cid:9)W., l, W.(cid:10), (cid:9)l, W., W.(cid:10), (cid:9)W., W., l(cid:10)}. Fusing Ψ yields (cid:9)l, l, l(cid:10). As with harmonically
bounded candidates, this ERC shows that no constraint ranking is consistent with
the statements in Ψ (En realidad, they are circular). Prince refers to the class of ERCs with

51

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 1

L’s but no W’s as L+. He shows that these ERCs arise from fusion if and only if the
fused set contains incompatible ranking conditions.

Definición 6
An ERC set is consistent iff it has no subset that fuses to an ERC in L+ (Príncipe 2002,
página 11).

For any consistent ERC set there is a constraint ranking (often several) of which
all of its ERCs are true statements (Príncipe 2002, página 21). The ERCs in an inconsistent
colocar, por otro lado, can never all be true of a single ranking. Inconsistency can arise
from a single pair of candidates (p.ej., ERCS(mi) in Example 2 contains erc(mi, a) = (cid:9)l, mi, mi(cid:10)).
Inconsistency can also arise across multiple candidate comparisons (p.ej., ERCS(d) en
Ejemplo 2 contains erc(d, a) = (cid:9)mi, W., l(cid:10) and erc(d, C) = (cid:9)mi, l, W.(cid:10)). This latter type of in-
consistencia, where several of the ERCs associated with a candidate fuse to L+, arises
from what Samek-Lodovici and Prince (1999) call collective harmonic bounding.
Finalmente, it is possible for inconsistencies to arise when ERCs for several candidates
with distinct inputs are combined. Por ejemplo, if ERCS(X) = {(cid:9)W., l, W.(cid:10)}, ERCS(y) =
{(cid:9)l, W., W.(cid:10)}, and ERCS(z) = {(cid:9)W., W., l(cid:10)} entonces, even though x, y, and z may be candidates
for distinct inputs (es decir., come from different tableaux), the union of their ERCs fuses
a (cid:9)l, l, l(cid:10) ∈ L+ and thereby reveals that there is no ranking under which all three
candidates are simultaneously optimal.

Inverting the W’s and L’s of an ERC produces its antithetical counterpart that is true
whenever the original ERC is false and vice versa. This opposition can be exploited in
describing the range of consistent ERC sets.

Definición 7
The negation of α is α where: αi = W if αi = L, αi = L if αi = W, and αi = e if αi = e.

Provided that α is not all e’s, every ranking is described by either α or α but not both
(Príncipe 2002, página 42). In this way, ERC negation is just the standard notion of negation
in three-valued logics. The opposition between α and α makes a binary partition on
the space of rankings. This is intuitively obvious for simple statements like (cid:9)W., l, mi(cid:10) y
(cid:9)l, W., mi(cid:10). The opposition is a bit less intuitive for more complex conditions like (cid:9)W., l, l(cid:10)
y (cid:9)l, W., W.(cid:10), but the fact that erc(a, b) is the antithesis of erc(b, a) makes it abundantly
clear (es decir., if a and b are not tied, then every ranking must prefer one or the other). El
antithetical relationship between an ERC and its negation is reflected in the operation
of fusion by the fact that fusing antithetical ERCs will always yield an ERC in L+.

3. The VCD of Elementary Ranking Conditions

Before turning to the question of the VC dimension of the sample space in OT, it will be
helpful to define shatterability purely in terms of ERCs and thereby to establish a bound
on the VCD of sets of ERCs. We will say that an ERC α is true of a given ranking R if
the condition imposed by α is consistent with the linear ordering of the constraints
defined by R.

Definición 8
An ERC set Ω over constraints CON is shatterable iff for every subset ∆ ⊆ Ω, there is a
ranking R

CON of which all ERCs in Ω–∆ are true while all the ERCs in ∆ are false.

52

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Riggle

Ranking Hypotheses in OT

From this definition of shatterability for sets of ERCs, it is immediately clear that only
nontrivial ERCs can occur in shatterable sets.

Lema 1
Every ERC in a shatterable set must contain at least one L and one W.
Prueba : The ERCs of L+ cannot occur in a shatterable set because there is no ranking
of which they are true. En cambio, ERCs with no L’s cannot occur in shatterable sets
(cid:1)
because there is no ranking of which they are false.
With Definition 8 in hand, and having excluded the trivial ERCs from the picture,
it will be possible to reduce shatterability for ERC sets to consistency under negation.
Primero, a definition of negation for sets of ERCs.

Definición 9
A partial negation of ERC set Ω is obtained by negating every ERC in a subset ∆ ⊆ Ω.
Por ejemplo: Ω = {(cid:9)W., l, l(cid:10), (cid:9)mi, W., l(cid:10)} has four partial negations: one per subset.
(cid:5)

(cid:6) (cid:5)

(cid:6) (cid:5)

(cid:6) (cid:5)

(cid:6)

un = (cid:9)W., l, l(cid:10)
γ = (cid:9)mi, W., l(cid:10)

un = (cid:9)l, W., W.(cid:10)
γ = (cid:9)mi, W., l(cid:10)

un = (cid:9)W., l, l(cid:10)
γ = (cid:9)mi, l, W.(cid:10)

un = (cid:9)l, W., W.(cid:10)
γ = (cid:9)mi, l, W.(cid:10)

Teorema 1
An ERC set Ω is shatterable iff every partial negation of Ω is consistent.
Prueba : Suppose every partial negation of Ω is consistent. De este modo, for any partial negation
in which ∆ is the negated subset of Ω and Γ is the rest of Ω, it is the case that there
is a ranking R of which all the ERCs of ∆ + Γ are true. Because a nontrivial ERC and
its negation are never both true of the same ranking and trivial ERCs cannot occur in
shatterable sets, the ERCs in Ω − Γ are false of ranking R while the ERCs of Γ are true.
Because ∆ was arbitrary, it is the case that for every subset of Ω, there is a ranking of
which the ERCs in that subset are false while the rest are true, and thus consistency
under partial negation is sufficient for shatterability. Si, por otro lado, there is a
partial negation that is not consistent, then there is a subset of Ω such that if the ERCs
in that subset are negated, the resulting ∆ + Γ is not consistent. Sin embargo, because there
is no ranking of which the members of an inconsistent ERC set are all true, Ω is not
shatterable because there is no ranking of which the ERCs in Γ are true while the ERCs in
Ω − Γ are false. De este modo, consistency under partial negation is both necessary and sufficient
(cid:1)
for shatterability.

Corolario 1
Every subset of a shatterable ERC set is itself shatterable.
Prueba : Because each partial negation of a shatterable ERC set must, by definition, ser
consistent and because every subset of a consistent set must also be consistent, it is the
case that every subset of a shatterable set is consistent under every partial negation and
(cid:1)
is thus shatterable.

Defining shatterability in terms of partial negation lines up with the commonsense
observation that no set containing α and γ where α → γ is shatterable because there can
be no ranking of which the former is true while the latter is false. This is neatly captured
by the fact that if α → γ, no superset of {a, γ} can be shattered because fusing {a, γ}
is guaranteed to yield an ERC in L+. The requirement of consistency under partial
negation also shows why relatively weak conditions like (cid:9)W., W., l(cid:10) y (cid:9)W., l, W.(cid:10) cannot
co-occur in shatterable sets even though neither entails the other. En este caso, fusing the

53

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 1

negation of both ERCs yields (cid:9)l, l, l(cid:10) ∈ L+. This follows transparently from the fact that
either the statement ”C1 or C2 outranks C3” or the statement ”C1 or C3 outranks C2” is
true of any ranking of three constraints.

The definition of shatterability for ERC sets in terms of consistency under partial
negation makes it easy to demonstrate that for |CON| =k, there are shatterable ERC
sets of size k − 1. Diagonal ERC sets provide a particularly simple example of a class of
shatterable ERC sets of this size.

Definición 10
ERC set Ω is diagonal if its members can be given
as a list LΩ in which each nth ERC in the list has a
W in its nth coordinate, an L in its n + 1th coordinate,
and e’s everywhere else.

E.g., Ω =






(cid:9) W., l, mi, mi, mi (cid:10)
(cid:9) mi, W., l, mi, mi (cid:10)
(cid:9) mi,
mi, W., l, mi (cid:10)
(cid:9) mi,
mi, mi, W., l (cid:10)






Lema 2
Diagonal ERC sets are shatterable.
Prueba : Assume that Ω is a diagonal ERC set and Ω(cid:1) is an arbitrary subset of an arbitrary
partial negation of Ω. If n is the number of ERCs in Ω(cid:1) entonces, by the definition of diagonal
ERC sets, there must be at least n + 1 coordinates (columnas) in Ω(cid:1) that are filled with
L or W for some ERC in Ω(cid:1) (es decir., are not all-e columns). Because each of the n ERCs has
only one L, at most n columns contain L’s, thus the fusion of Ω(cid:1) contains at least one W.
Because Ω(cid:1) was an arbitrary subset, no subset fuses to L+. Because the partial negation
(cid:1)
was arbitrary, every partial negation is consistent and thus Ω is shatterable.

From the shatterability of diagonal ERC sets (with k − 1 members if |CON| =k), nosotros
obtain a lower bound of k − 1 on the VCD of ERC sets. Having established that there
are shatterable sets of k-length ERCs with k − 1 miembros, what remains to be shown is
that no set larger than k − 1 is shatterable.

Definición 11
Coordinate Ci is W-unique in ERC set Φ if Φ has a partial negation Φ(cid:1) such that in the
fusion of Φ(cid:1), φ = (cid:9)φ1, …, φk(cid:10), the only coordinate that contains a W is φi.

Definición 12
The minor Ωα,j of an ERC set Ω is a new set Ω(cid:1) in which ERC α has been removed and
the jth coordinate has been removed from the remaining ERCs.








a : (cid:9) l, l, W., mi, W. (cid:10)
b : (cid:9) mi, mi,
mi, l, W. (cid:10)
γ : (cid:9) mi, mi, W., l, mi (cid:10)
δ : (cid:9) l, W., mi, mi, mi (cid:10)

then Ωγ, 3 =



a : (cid:9) l, l, mi, W. (cid:10)
b : (cid:9) mi, mi, l, W. (cid:10)
δ : (cid:9) l, W., mi, mi (cid:10)

Por ejemplo, if Ω =



As illustrated in Definition 12, the term “minor” used here is analogous to the
standard notion of the minor of a matrix. It is straightforward to show that every
shatterable ERC set contains shatterable minors that can be obtained by removing one
constraint’s coordinate (columna) and one ERC (row).

Lema 3
Reduction Lemma. – If Ω is a shatterable ERC set, then it has a shatterable minor Ωα,j.

54

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Riggle

Ranking Hypotheses in OT

Prueba : By Corollary 1, for any α ∈ Ω, Ω − {a} is shatterable. In Ω − {a} there must
be at least one coordinate Cj that is not W-unique. If this were not the case and every
coordinate in Ω − {a} was W-unique, then one of the L’s in α would occlude the only
W in a partial negation of Ω, making it inconsistent contra the assumption that Ω is
shatterable (α must have at least one L by Lemma 1). Because Cj is not W-unique, para
every partial negation of every subset of Ω − {a}, there is a coordinate other than Cj
that fuses to W. This being the case, shatterability is preserved if Cj is eliminated. De este modo,
(cid:1)
the minor Ωα,j is shatterable as required.

Teorema 2
For k > 1, the largest shatterable set of k-length ERCs has k − 1 miembros.
Prueba : If x is the size of the largest shatterable set of k-length ERCs and y is the size of
the largest shatterable set of (k + 1)-length ERCs, then y is not greater than x + 1. Este
must be so because if y ≥ x + 2 then x could not be the size of the largest shatterable
set of k-length ERCs because a set of (k + 1)-length ERCs would have a shatterable
minor larger than x. Porque (cid:9)W., l(cid:10) y (cid:9)l, W.(cid:10) are the only nontrivial ERCs for k = 2
and because they are antithetical and thus cannot co-occur in a shatterable set, el
largest shatterable ERC set at k = 2 consists of a single ERC. This base case establishes
an upper bound of k − 1 on the size of shatterable ERC sets and the diagonal ERC sets
provide a lower bound of k − 1. Together these bounds place the cardinality of the
largest shatterable set at exactly k − 1.
(cid:1)

Along with the diagonal ERC sets, there are many shatterable ERC sets with k − 1
miembros, but no shatterable sets with more than k − 1 miembros. What remains now is
to connect this result for ERC sets back to the realm of candidates.

4. The VCD of Optimality Theory

CON

The question posed at the outset of this article was: for a constraint set CON with k
constraints that map candidates to natural numbers, what is the cardinality of the largest
set of candidates S such that, for each subset T ⊆ S, there is at least one ranking R
under which every t in T is optimal, but no s in S − T is optimal? Claramente, the answer
to this question depends greatly on details of the constraints in CON. Sin embargo, if we
reduce candidates to the ERC sets associated with them, it is possible to place an upper
bound on the size of S without knowing anything about CON other than its size k.

Recall that a candidate c is mapped to True by ranking R

CON just in case every ERC
in ERCS(C) is consistent with R. En cambio, c is mapped to False by R if any of the ERCs
in ERCS(C) is not consistent with R. This notion can be extended to sets of candidates
como sigue. If S is a set of candidates, then ERCS(S) is the union of ERCS(s) for all s ∈ S.
A sample S is accepted by ranking R just in case every α in ERCS(S) is consistent with R.
En cambio, S is rejected by R if any α in ERCS(S) is not consistent with R. Además,
if ERCS(S) is consistent, then there must be at least one ranking that accepts S. En esto
caso, we will refer to S as a consistent sample. The concepts of partial negation and
W-uniqueness also have analogs for candidate sets.

Definición 13
For a consistent sample S, a partial exclusion is a partial negation of ERCS(S) eso
rejects some F ⊆ S by rendering ercs( F ) inconsistent for each f ∈ F while preserving the
consistency of ercs(s) for every s ∈ (S − F).

55

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 1

Definición 14
Ci is w-unique in S if there is a partitioning of S into T and F under which Ci is the only
coordinate that fuses to W in ERCS(t) for every partial exclusion that rejects F.

The property of W-uniqueness in samples crucially contrasts with what one might call
being semi-unique—the case where for at least one, but not all, of the partial exclusions
that reject F, Ci is the only column that fuses to W in ERCS(t).

Definición 15
Given a constraint set CON and a sample S, the minor Sx,j is obtained by removing
candidate x from S and removing constraint Cj from CON.

By extending partial negation, W-uniqueness, and the concept of minors to the realm of
muestras, it is straightforward to show that shatterable samples have shatterable minors.

Lema 4
Shatterable samples have shatterable minors.

Prueba : Assume that S is a shatterable sample. Because removing candidate x from S has
no effect on whether the remainder of S can be shattered, S − {X} is also shatterable.
Sample S − {X} must have at least one coordinate that is not W-unique. If this were
not the case, entonces, because ERCS(X) must contain at least one coordinate with an L
(else there would be no way to reject {X}), the presence of x in S would place an L in a
W-unique coordinate in S − {X}. Sin embargo, this would make it impossible to associate a
ranking with at least one partitioning of S into accepted and rejected subsets contra the
assumption that S is shatterable. If Cj is a coordinate in S − {X} that is not W-unique,
entonces, for every partial exclusion that rejects F, under each partitioning of S − {X} en
T and F, there is at least one other coordinate Ci that fuses to W. De este modo, Cj could be
removed from CON while preserving the shatterability of S − {X}. Por lo tanto, Sx,j is a
(cid:1)
shatterable minor as required.

The crucial piece of the proof in Section 3 that the VCD of ERC sets is k − 1 era
the illustration of a one-to-one relationship between k and the bound on shatterable sets
by showing that removing an ERC from a shatterable set makes it possible to remove
a coordinate from the remaining ERCs while preserving shatterability. If shatterable
ERC sets could be larger than k − 1, then it would have been necessary to remove sev-
eral ERCs before it was possible to safely remove a coordinate from the remaining
ERCs. Because ERC sets and candidate samples both have shatterable minors, a similar
strategy will show that shatterable samples must also grow at a one-to-one rate with k.

Teorema 3
Si |CON| =k, then the size of shatterable sample sets is bounded at k − 1.
Prueba : If k = 2, a sample consisting of a single candidate can be shattered if ERCS(s) es
{(cid:9)W., l(cid:10)} o {(cid:9)l, W.(cid:10)}, but no larger sample can be shattered. If there were such a sample,
it would contain at least two candidates a and b and there would be a ranking under
which both candidates were optimal, a ranking under which neither candidate was
óptimo, a ranking that made a but not b optimal, and another ranking that made b
but not a optimal. This state of affairs requires at least four distinct rankings, cual es
impossible with only two constraints. De este modo, it is established that, at k = 2, the largest
shatterable sample set has at most one candidate.

56

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Riggle

Ranking Hypotheses in OT

If S is the largest shatterable sample for k constraints, entonces, at k + 1, the size of
the largest shatterable sample is |S| + 1. If this were not the case, there would be a
shatterable sample X such that |X| ≥ |S| + 2 for k + 1 constraints. Sin embargo, porque
shatterable samples have shatterable minors (Lema 4), this would mean that there
was a shatterable sample of size |S| + 1 for k constraints, contrary to the assumption
that S was largest. Given the base case that |S| = 1 when k = 2, the cardinality of
shatterable samples is thus bounded at k − 1 as required.
(cid:1)

The bound of k − 1 defines the limiting case that is obtained when there can be can-
didates in the sample space for any ERC set. In actual practice, the specific details of
the constraints in CON and the range of ways that they interact will determine which
elements of the powerset of the set of k-length ERCs are associated with candidates in
the sample space. This means that the VC dimension of a specific constraint set CON
can be much lower than |CON| − 1. Sin embargo, the result that the VCD of OT can be
a lo sumo |CON| − 1 is propitious for the learnability of Optimality Theoretic grammars.

5. Conclusions

Bounding the VC dimension of OT according to the number of constraints in CON es-
tablishes a general property of the sets of ranking hypotheses that can be associated
with sets of candidates. This bound is independent of any assumptions about how the
ERC sets for candidates are computed, independent of any assumptions about how
optimizations are computed, and independent of any assumptions about the formal
properties of constraints other than that they map candidates to N.

The linear growth of the VCD with |CON| = k provides a very general positive
learnability result for OT. Blumer et al. (1989), building on the learning model of Valiant
(1984), define a concept class C as uniformly learnable if there is a learning algorithm
A such that, for any error threshold (cid:10) and confidence level δ, if A is given m training
samples randomly drawn according to a probability distribution π over the sample
espacio, then A has at least probability δ of generating a hypothesis whose likelihood
of misclassifying any point in the sample space drawn randomly according to π is less
than (cid:10). Blumer et al. link the VC dimension to learnability by showing that concept
classes are uniformly learnable if and only if they have a finite VCD. Además, ellos
show that upper bounds on m can be established for learning that depend only on the
VC dimension of the concept class to be learned. The bound on m according to d =VCD
from Blumer et al. is given in Equation (1).

m ≤

(cid:10)

4
(cid:10)

(cid:11)
d ln 12

(cid:10) + ln 2
δ

(cid:12)(cid:13)

(1)

This is a worst-case bound that holds for the most adversarial probability distributions
over the sample space and the worst consistent learning algorithms (es decir., algoritmos
that are consistent in that they correctly classify all data in the training set, but worst-
case in that they err maximally on all unobserved data). Specific OT learning algorithms
that have tighter bounds and non-worst-case probability distributions over samples will
certainly present a different picture.

For a concrete example of OT learning, consider a version of Prince and Smolensky’s
(1993) basic CV syllable theory in which candidates are mappings from {C, V}* a {C,
V, .}*, and for each input i ∈ {C, V}*, the candidate set produced by GEN(i) represents

57

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 1

all ways of modifying i through deletion and insertion of ., C, and V.3 If CON contains (i)
a constraint against deletion, (ii) a constraint against V insertion, (iii) a constraint against
C insertion, (iv) a constraint against syllables with codas, y (v) a constraint against
syllables without onsets, then the range of possible rankings of these five constraints
allows for 120 different grammars which in turn define twelve different languages
(es decir., twelve subsets of the sample space X = {C, V}* × {C, V, .}*).

If learners are trained with positive evidence in the form of optimal input → output
mappings, then the probability distribution over the sample space can be characterized
in terms of the probability distribution over the input strings in {C, V}*. Each optimal
candidate a = (i → o) provides information about the teacher’s ranking in the form of
ERCS(a) = {erc(a, b)|b = (i → x) ∈ GEN(i)}. Riggle (2004) shows that, because the func-
tions in this system are all rational (es decir., finite state representable), the set ERCS(a) poder
be derived via an algorithm called CONTENDERS. In this system, ERCS(a) can contain
from zero to twelve ERCs. The zero-ERC cases arise for input strings that share the
same optimal output under all rankings (es decir., /CV/→[.CV.]). The sets top out at twelve
because there are never more than twelve contenders (es decir., non-harmonically-bounded
candidates) for any given input string. The twelve ERC bound is a consequence of the
fact the 120 rankings only realize twelve distinct languages.4

As noted in Section 1, candidates with the same input cannot co-occur in shatterable
conjuntos. Because of this, the bound on shatterable samples established in Section 4 carries
transparently over to the more general case where learners are trained with optimal
(i, oh) mappings and then tested with novel inputs. Because the set of contenders is
determined solely by GEN and CON (which the learner is presumed to have access to)
if the learner can compute CONTENDERS(i), then testing on novel inputs reduces to
having the learner select one optimal candidate from the set of contenders, cual
in turn reduces to binary questions of harmonic inequality between pairs a and b in
CONTENDERS(i), which in turn reduces to the question of which of erc(a, b) or erc(b, a) es
consistent with the ERCs gleaned from previous observations.

This is merely one sketch of how the learning problem in OT can be formulated
so that the VC dimension can predict its success. There are undoubtedly other possible
formulations. Además, as noted, real-world cases will often contain details that are
more relevant than the VC dimension in predicting learnability. Por ejemplo, in syl-
lable structure grammar just described, there are inputs for which the CONTENDERS
algorithm generates one candidate per language in the factorial typology. In such a
caso, the ERC set for a single optimal candidate can serve as a “global trigger” that is
sufficient to uniquely identify the teacher’s language. Further analysis with specific
constraints and OT learning algorithms like Recursive Constraint Demotion (Tesar 1995,
1997, 1998; Tesar and Smolensky 1993, 2000), the Gradual Learning Algorithm (campesinos
1997, 1998; Boersma and Hayes 2001), and the ERC-Union learner (Riggle 2004) will
surely yield further insights and a less abstract picture of learning in Optimality Theory.
The VCD is an extremely robust metric that characterizes hardness in many learning
frameworks (Haussler, Kearns, and Schapire 1992) and is applicable without any as-
sumptions other than that the learner is consistent. Any learner that bases its hypotheses
on the union of the ERCs associated with the data on which it is trained is guaranteed to
be consistent, and thus an extremely simple ERC-union learner can learn OT grammars

3 C and V represent consonants and vowels respectively and “.” represents a syllable boundary marker.
4 Riggle (2004) extends Prince and Smolensky’s nine-way factorial typology to twelve with a slightly looser
GEN. En este caso, porque (cid:18)log2 12(cid:19) = 4, the ERC-based VCD bound is the same as that obtained by the
finitude of the typology. Usually, sin embargo, we do not have the luxury of knowing the size of C.

58

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Riggle

Ranking Hypotheses in OT

from random training texts whose size m is linear in k. This linear bound on the re-
lationship between k and sample complexity is a nice tightening of the k log2 k bound
that follows from the finitude of k! and contrasts starkly with pessimistic assessments
of learnability suggested by the factorial relationship between k and the number of
possible grammars.

Referencias
anderson, Alan R. and Nuel D. Belnap, Jr.

1975. EntailmentThe Logic of Relevance and
Necessity. Prensa de la Universidad de Princeton.
Blumer, Anselm, Andrzej Ehrenfeucht,
David Haussler, and Manfred K.
Warmuth. 1989. Learnability and the
Vapnik-Chervonenkis dimension. Diario
of the ACM, 36(4):929–965.

campesinos, Pablo. 1997. How we learn variation,
optionality, and probability. Actas de
the Institute of Phonetic Sciences, 21:43–58.
campesinos, Pablo. 1998. Functional Phonology:

Formalizing the Interactions between
Articulatory and Perceptual Drives. Doctor.
tesis, The Hague.

campesinos, Paul and Bruce Hayes. 2001.

Empirical tests of the gradual learning
algoritmo. Linguistic Inquiry, 32:45–86.

Dijkstra, Edsger. W.. 1959. A note on

two problems in connexion with graphs.
Numerische Mathematik, 1:269–271.
Ellison, t. Marca. 1994. Phonological
derivation in optimality theory. En
Proceedings of the Fifteenth Conference on
Ligüística computacional, pages 1007–1013,
Kioto, Japón. doi:dx.doi.org/10.3115/
991250.991312.

Haussler, David, Michael Kearns, and Robert

Schapire. 1992. Bounds on the sample
complexity of Bayesian learning using
information theory and the VC dimension.
Technical Report UCSC-CRL-91-44.
Littlestone, Nick. 1988. Learning quickly
when irrelevant attributes abound:
A new linear-threshold algorithm. Machine
Aprendiendo, 2(4):285–318.

Príncipe, alan. 2002. Entailed ranking arguments.
ROA 500. Available at http://roa.rutgers.edu.

Príncipe, Alan and Paul Smolensky. 1993.

Optimality Theory: Constraint Interaction in

Generative Grammar. Blackwell, Malden,
MAMÁ.

Riggle, Jason. 2004. Generación, Reconocimiento,
and Learning in Finite State Optimality
Teoría. Doctor. tesis, Universidad de
California, Los Angeles.

Samek-Lodovici, Vieri and Alan Prince. 1999.

Optima. ROA 785. Disponible en
http://roa.rutgers.edu.

Tesar, bruce. 1995. Computational Optimality

Teoría. Doctor. tesis, Universidad de
Colorado.

Tesar, bruce. 1997. Multi-recursive constraint

demotion. ROA 197. Disponible en
http://roa.rutgers.edu.

Tesar, bruce. 1998. Error-driven learning in

Optimality Theory via the efficient
computation of optimal forms. In Is the
Best Good Enough? Optimality and
Competition in Syntax, ed. Pilar Barbosa,
Danny Fox, Paul Hagstran, Martha J.
McGinnis, and David Pesetsky. CON prensa,
Cambridge, MAMÁ.

Tesar, Bruce and Paul Smolensky. 1993.
The learnability of optimality theory:
An algorithm and some basic complexity
resultados. Unpublished manuscript.
Departamento de Ciencias de la Computación
& Institute of Cognitive Science,
University of Colorado at Boulder.
Tesar, Bruce and Paul Smolensky. 2000.

Learnability in Optimality Theory. CON prensa,
Cambridge, MAMÁ.

Valiente, leslie g. 1984. A theory of the

learnable. Communications of the ACM,
27(11):1134–1142.

Vapnik, V. norte. y un. Chervonenkis. 1971.
On the uniform convergence of relative
frequencies of events to their probabilities.
Theory of Probability and its Applicaions,
16:264–280.

59

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

3
5
1
4
7
1
7
9
8
5
6
2
/
C
oh

yo
i
.

0
7

0
3
1

r
2

0
6

9
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Descargar PDF