The Complexity of Ranking Hypotheses in
Optimality Theory
Jason Riggle∗
Universität von Chicago
Given a constraint set with k constraints in the framework of Optimality Theory (OT), what is
its capacity as a classification scheme for linguistic data? One useful measure of this capacity
is the size of the largest data set of which each subset is consistent with a different grammar
Hypothese. This measure is known as the Vapnik-Chervonenkis dimension (VCD) and is a
standard complexity measure for concept classes in computational learnability theory. In diesem
arbeiten, I use the three-valued logic of Elementary Ranking Conditions to show that the VCD of
Optimality Theory with k constraints is k−1. Analysis of OT in terms of the VCD establishes
that the complexity of OT is a well-behaved function of k and that the ‘hardness’ of learning in
OT is linear in k for a variety of frameworks that employ probabilistic definitions of learnability.
1. Einführung
Given a set CON of k constraints in the framework of Optimality Theory (OT; Prince
and Smolensky 1993), what is the capacity of CON as a classification scheme for samples
of language data? In OT, constraints are functions that map candidates to natural num-
bers, where each candidate is a member of the (possibly infinite) set of possible deriva-
tions of an input form i supplied by the candidate generating function GEN(ich). Der
number that a constraint Ci assigns to a candidate indicates how many times that
candidate violates Ci. A grammar is a ranking of the constraints that imposes a total
ordering on CON, R
CON (or simply R when CON is clear from the context), und das
language that is generated by grammar R is the set of candidates that are optimal
according to R as in Definition 1.
Definition 1
A. Candidate a is more harmonic than candidate b according to R, written a (cid:4) B,
if they share the same input and a is assigned fewer violations by the highest-
ranked constraint that assigns different numbers of violations to a and b.
B. Candidate a is optimal according to ranking R iff no other candidate generated
by GEN is more harmonic than a.
Because each of the k! rankings of CON is a different grammar that generates a
potentially unique language, one natural measure of the classificatory capacity of CON
∗ Department of Linguistics, Universität von Chicago, 1010 E. 59th St., Chicago, IL 60637.
jriggle@uchicago.edu. Many thanks to Alan Prince, Jeff Heinz, Greg Kobele, Colin Wilson, and two
anonymous Computational Linguistics reviewers for helpful comments and suggestions. Any errors are, von
course, my own.
Einreichung erhalten: 22 Dezember 2006; revised submission received: 9 Oktober 2007; accepted for
Veröffentlichung: 17 Dezember 2007.
© 2009 Verein für Computerlinguistik
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 35, Nummer 1
is the upper bound of k! languages in what Prince and Smolensky (1993, page 27) dub
the factorial typology of the constraint set. Another complexity metric that is useful in
analyses of learnability (especially for non-finite concept classes) is the cardinality of
the largest data set of which each subset corresponds to a different ranking hypothesis.
The idea of measuring the complexity of a concept class (in the case at hand, a set of
grammars) in this way comes from the work of Vapnik and Chervonenkis (1971) und ist
known as the Vapnik-Chervonenkis dimension (VCD). In OT, the VCD of a constraint
set CON (d.h., the concept class consisting of languages generated by rankings of CON)
is the size of the largest sample (set of candidates) that is shatterable as in Definition 2.
Definition 2
A sample S is shatterable by a constraint set CON iff, for every partitioning of S into two
disjoint sets T and F (including the null/whole partition), there is at least one ranking
R
CON that makes every s ∈ T optimal but no s ∈ F optimal.
Vapnik and Chervonenkis’s definition of shatterability has interesting implications
for samples consisting of OT candidates. Zum Beispiel, each candidate in a shatterable
sample S must be an input → output mapping for a unique input form because two
candidates a and b with the same input would either tie with identical sets of violations
or show harmonic inequality. In the case of a tie, no ranking could realize a partitioning
that separates a and b and, in the case of harmonic inequality, no ranking could realize
a partitioning in which a and b are simultaneously optimal. Allgemeiner, the VCD
places an upper bound on the number of distinct grammar hypotheses that can be real-
ized over any sample of linguistic data consisting of OT candidates, and thus provides
a ready measure of the complexity of the hypothesis space in Optimality Theory.
The VCD of a concept class is obviously not independent of its size. As Blumer
|C|
et al. (1989) point out, for any finite concept class C, the VCD is bounded at log2
because it takes at least 2d hypotheses to associate a unique hypothesis with every sub-
set of a sample of size d. Daher, because the number of grammars (hypotheses) über
k constraints is finite—one grammar for each of the k! rankings—the VCD of OT is
bounded at log2 k!. Or, put more simply, because log2 x! ≤ x log2 x, this establishes
k log2 k as an upper bound on the VCD of OT. In diesem Artikel, I will show how the
structure of the hypothesis space in Optimality Theory provides a tighter bound on
the VCD of OT than the bound established by the finitude of the hypothesis space. ICH
will improve upon the inherent bound of k log2 k by showing that the VCD of OT with
k constraints is actually bounded at k − 1 and thus grows linearly with the size of |CON|.
The complexity measured by the VC dimension has a number of ramifications for
learning in Optimality Theory. Zum Beispiel, the VCD of a concept class places an abso-
lute lower bound on the number of mistakes that any error-driven learning algorithm
can be guaranteed of achieving (Littlestone 1988). This fact tells us that it may yet
be possible to improve upon the quadratic mistake bound of (k2 − k)/2 for Recursive
Constraint Demotion (Tesar and Smolensky 1993, 2000; Tesar 1995, 1997, 1998), Die
reigning mistake bound for any OT learning algorithm. The VCD of a concept class
also provides a very general bound on the number of data samples that are required for
learning in probabilistic models of learning that will be discussed in Section 5.
2. Elementary Ranking Conditions
The main result for the VC dimension of OT will be given in Section 4. Erste, manche
supporting results will be established showing that there is an upper bound of k − 1 An
48
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Riggle
Ranking Hypotheses in OT
shatterable sets of statements about constraint rankings that are expressed with Prince’s
(2002) Elementary Ranking Conditions.
If our sample space X consists of candidates, then any x ∈ X can be described
in terms of the set of constraint rankings under which x is optimal. Prince (2002)
provides a scheme for encoding this kind of ranking information called an Elementary
Ranking Condition (ERC). In diesem Abschnitt, I will review some formal properties of ERCs
that are relevant for establishing the VC dimension of OT. Prince demonstrates many
formal properties of ERCs beyond those covered here and shows that ERCS are equiv-
alent to the implication-negation fragment of the three-valued relevance logic RM3 (vgl.
Anderson and Belnap 1975). This section will review properties of ERCs that are most
relevant for the results at hand. For formal proofs and a complete exposition of the logic
of ERCs, see Prince (2002).
For a constraint set CON containing k constraints, ERCs are k-length vectors that
use the symbols L, e, and W to encode logical statements about rankings. Each con-
straint is assigned an arbitrary numeric index, and in each ERC α, the ith coordinate αi
refers to the constraint with ith index Ci. The meaning of an ERC is that at least one
constraint whose corresponding coordinate contains a W outranks all of the constraints
whose coordinates contain L’s. Daher, (cid:9)W, e, L, L(cid:10) means that C1 outranks both C3 and
C4, while (cid:9)L, L, W, W(cid:10) means that either C3 or C4 outranks both C1 and C2. ERCs can
be constructed by comparing candidates as in Definition 3. Note that Ci(A) denotes the
number of times candidate a violates the constraint with index i.
Definition 3
Given a constraint set CON with k constraints indexed {1 … k} and two candidates
that share the same input, the function ercCON(A, B) returns an ERC α = (cid:9)α1, …, αk(cid:10) Das
describes the rankings under which a (cid:4) b.1
erc(A, B) = (cid:9)α1, …, αk(cid:10) Wo
αi = W if Ci(A) < Ci(b)
αi = L if Ci(a) > Ci(B)
αi = e if Ci(A) = Ci(B)
The symbol W in αi of erc(A, B) = α is a mnemonic for the fact that Ci favors a
(the winner), whereas an L in coordinate i is a mnemonic for the fact that Ci favors b
(the loser). An e in αi indicates that the candidates are equivalent according to Ci.
Example 1
Eingang
cand. A
cand. B
C1
*
**
C2
**
*
C3
*
erc(B, A) = (cid:9)L, W, W (cid:10) = b (cid:4) a if C2 or C3 outranks C1
erc(A, B) = (cid:9)W, L, L (cid:10) = a (cid:4) b if C1 outranks C2 and C3
(*’s indicate number of violations)
Note the symmetry between erc(A, B) = (cid:9)W, L, L (cid:10), which says that candidate a is more
harmonic than b under any ranking where C1 outranks both C2 and C3, and erc(B, A),
1 The function ercCON (A, B), or simply erc(A, B) when CON is clear from context, is undefined for candidate
input → output mappings with different inputs because they cannot be meaningfully compared.
49
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 35, Nummer 1
which says that b is more harmonic than a under any ranking where either C2 or C3
outranks C1. This symmetry reflects the fact that erc(A, B) and erc(B, A) encode antithetical
ranking conditions. The opposition between these ERCs follows straightforwardly from
the fact that only one of the two candidates can be optimal under any given ranking.
The illustrative tableaux presented with OT analyses can be turned into sets of
ERCs by making pairwise comparisons between the violations for one designated (oder
observed) winner and the violations for each other candidate.
Example 2
Eingang
cand. A
cand. B
cand. C
cand. D
cand. e
cand. F
C1
*
**
*
*
**
C2
**
*
*
**
***
C3
**
*
*
winner
erc(A, B) = (cid:9)W, L, e (cid:10) = a (cid:4) b if C1 outranks C2
erc(A, C) = (cid:9)e, L, W(cid:10) = a (cid:4) c if C3 outranks C2
erc(A, D) = (cid:9)e, L, W(cid:10) = a (cid:4) d if C3 outranks C2
erc(A, e) = (cid:9)W, e, e (cid:10) = a (cid:4) e under every ranking
erc(A, F ) = (cid:9)L, W, W(cid:10) = a (cid:4) f if C2 or C3 outranks C1
The comparison of candidate a with candidate e in Example 2, erc(A, e) = (cid:9)W, e, e(cid:10),
yields an odd ranking condition that does not actually express a particular ranking (NEIN
constraint has an L), but instead indicates that C1 favors candidate a and no constraint
favors candidate e. In this case, candidate e is said to be harmonically bounded by
candidate a because there can be no ranking under which e is more harmonic than
A. Umgekehrt, if candidate e were designated the winner, then erc(e, A) = (cid:9)L, e, e(cid:10). Das
ERC also does not encode a specific ranking, but rather indicates that the mere existence
of candidate a as an alternative means that no ranking can make candidate e optimal.
Like most OT tableaux, Example 2 is an illustration of how a handful of candidates
fare with respect to one another according to a particular set of constraints. To know
which rankings (wenn überhaupt) make candidate a globally optimal, it would be necessary to
define the candidate generating function GEN in order to obtain a representation of the
entire set of ERCs {erc(A, X) | x ∈ GEN(Eingang)} = ERCS(A). This is not as daunting as it
might appear because, wenngleich |GEN(Eingang)| may be infinite, the fact that the num-
ber of k-length ERCs is finite guarantees that each of the candidates in GEN(Eingang) Wille
map to one of a finite number of ERC sets. Außerdem, as Riggle (2004) demonstriert,
the standard OT assumption of the universality of faithfulness constraints that pe-
nalize changes to the input guarantees that all but finitely many of the members of
GEN(Eingang) will be harmonically bounded. Riggle also presents an algorithm for com-
puting this finite set of contenders (d.h., candidates that are not harmonically bounded)
that can be used in cases where GEN is restricted so that it is a rational function.2
Regardless of how optimization is computed, what is relevant for the assessment of
the VCD of OT is the definition of optimality. Following Definition 1, a ranking R
CON can
be seen as a function from candidates to True (if they are optimal) or False (if they are
2 GEN is rational if it is representable as a finite state transducer. Riggle’s (2004) CONTENDERS algorithm is
an extension of Ellison’s (1994) application of Dijkstra’s (1959) “shortest paths” algorithm to optimization
in OT that operates over finite-state representations of GEN and EVAL. Ellison showed that if harmony is
used as the “distance” to be optimized, then optimal outputs can be efficiently found. The CONTENDERS
algorithm follows a similar strategy but, instead of finding the shortest (d.h., most harmonic) path for one
ranking, the algorithm finds all non-harmonically-bounded paths and thereby optimizes for all rankings.
50
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Riggle
Ranking Hypotheses in OT
nicht). The entire ERC set for a candidate ERCS(A) describes exactly the rankings under
which candidate a is a globally optimal candidate.
The reduction of candidates to ERC sets makes it possible to use the logic of ERCs
to reason about candidates. Meistens, the ERCs of interest are those that contain
at least one L and one W—what Prince calls nontrivial ERCs. ERCs that contain W’s
but no L’s are generated when a candidate is compared with another candidate that it
harmonically bounds, such as erc(A, e) = (cid:9)W, e, e (cid:10) in Example 2. This ERC reveals that
candidate e cannot be optimal but yields no information about what rankings make
candidate a optimal. Ähnlich, no ranking information can be gleaned from the all-e
ERC that results from comparing “tied” candidates that have the same violations.
Endlich, ERCs like erc(e, A) = (cid:9)L, e, e (cid:10), with L’s but no W’s reveal nothing other than the
fact that candidate e cannot be optimal under any ranking.
The most relevant logical relation for ERCs is that of entailment. The entailment
relation among nontrivial ERCs is given in Definition 4 (Prince 2002, page 6, Proposi-
tion 1.1).
Definition 4
For nontrivial ERCs α and β, α → β iff each αi ∈ α entails βi ∈ β where L → e → W.
Because nontrivial ERCs encode disjunctions of conjunctions (d.h., [C1 or … Cn] outranks
[C1(cid:1) Und … Cn(cid:1) ]), entailments of the form α → β line up with the logical operations of
disjunction introduction (whenever β has W where α has an L or an e) and conjunction
elimination (whenever β has an e where α has an L).
Example 3
(cid:9)W, L, L, e(cid:10) → (cid:9)W, e, L, e(cid:10)
(cid:9)W, e, L, e(cid:10) → (cid:9)W, e, L, W(cid:10)
(cid:9)W, L, L, e(cid:10) → (cid:9)W, e, L, W(cid:10)
d.h., If C1 outranks C2 and C3 then C1 outranks C3.
d.h., If C1 outranks C3 then C1 or C4 outranks C3.
d.h., If C1 outranks C2 and C3 then C1 or C4 outranks C3.
In addition to revealing entailments among individual ranking conditions, the logic
of ERCs makes it possible to derive new ranking conditions that are entailed by the
combination of other ERCs. Prince (2002, page 8) provides a logical operation called
fusion that derives entailments from sets of ERCs.
Definition 5
The fusion of ERC set Φ is a single ERC φ that is entailed by Φ where:
φi = L if any ERC in Φ has an L in its ith coordinate,
φi = e if every ERC in Φ has an e in its ith coordinate,
φi = W otherwise.
Every ERC entailed by Φ is entailed by the fusion of a subset of Φ (Prince 2002,
page 14). Daher, the operation of fusion can reveal nonobvious entailments among ERCs.
Consider Φ = {(cid:9)W, W, e, L(cid:10), (cid:9)L, W, W, e(cid:10), (cid:9)W, e, L, W(cid:10)}. The ERCs in Φ denote, jeweils,
“C1 or C2 outranks C4,” “C2 or C3 outranks C1,” and “C1 or C4 outranks C3.” The fusion
of Φ is (cid:9)L, W, L, L(cid:10), which encodes the inference from Φ that C2 outranks C1, C3, and C4.
The operation of fusion can also reveal inconsistencies in ERC sets. Consider the
set Ψ = {(cid:9)W, L, W(cid:10), (cid:9)L, W, W(cid:10), (cid:9)W, W, L(cid:10)}. Fusing Ψ yields (cid:9)L, L, L(cid:10). As with harmonically
bounded candidates, this ERC shows that no constraint ranking is consistent with
the statements in Ψ (in fact, they are circular). Prince refers to the class of ERCs with
51
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 35, Nummer 1
L’s but no W’s as L+. He shows that these ERCs arise from fusion if and only if the
fused set contains incompatible ranking conditions.
Definition 6
An ERC set is consistent iff it has no subset that fuses to an ERC in L+ (Prince 2002,
page 11).
For any consistent ERC set there is a constraint ranking (often several) of which
all of its ERCs are true statements (Prince 2002, page 21). The ERCs in an inconsistent
set, andererseits, can never all be true of a single ranking. Inconsistency can arise
from a single pair of candidates (z.B., ERCS(e) in Example 2 contains erc(e, A) = (cid:9)L, e, e(cid:10)).
Inconsistency can also arise across multiple candidate comparisons (z.B., ERCS(D) In
Example 2 contains erc(D, A) = (cid:9)e, W, L(cid:10) and erc(D, C) = (cid:9)e, L, W(cid:10)). This latter type of in-
consistency, where several of the ERCs associated with a candidate fuse to L+, arises
from what Samek-Lodovici and Prince (1999) call collective harmonic bounding.
Endlich, it is possible for inconsistencies to arise when ERCs for several candidates
with distinct inputs are combined. Zum Beispiel, if ERCS(X) = {(cid:9)W, L, W(cid:10)}, ERCS(j) =
{(cid:9)L, W, W(cid:10)}, and ERCS(z) = {(cid:9)W, W, L(cid:10)} Dann, even though x, j, and z may be candidates
for distinct inputs (d.h., come from different tableaux), the union of their ERCs fuses
Zu (cid:9)L, L, L(cid:10) ∈ L+ and thereby reveals that there is no ranking under which all three
candidates are simultaneously optimal.
Inverting the W’s and L’s of an ERC produces its antithetical counterpart that is true
whenever the original ERC is false and vice versa. This opposition can be exploited in
describing the range of consistent ERC sets.
Definition 7
The negation of α is α where: αi = W if αi = L, αi = L if αi = W, and αi = e if αi = e.
Provided that α is not all e’s, every ranking is described by either α or α but not both
(Prince 2002, page 42). In this way, ERC negation is just the standard notion of negation
in three-valued logics. The opposition between α and α makes a binary partition on
the space of rankings. This is intuitively obvious for simple statements like (cid:9)W, L, e(cid:10) Und
(cid:9)L, W, e(cid:10). The opposition is a bit less intuitive for more complex conditions like (cid:9)W, L, L(cid:10)
Und (cid:9)L, W, W(cid:10), but the fact that erc(A, B) is the antithesis of erc(B, A) makes it abundantly
clear (d.h., if a and b are not tied, then every ranking must prefer one or the other). Der
antithetical relationship between an ERC and its negation is reflected in the operation
of fusion by the fact that fusing antithetical ERCs will always yield an ERC in L+.
3. The VCD of Elementary Ranking Conditions
Before turning to the question of the VC dimension of the sample space in OT, it will be
helpful to define shatterability purely in terms of ERCs and thereby to establish a bound
on the VCD of sets of ERCs. We will say that an ERC α is true of a given ranking R if
the condition imposed by α is consistent with the linear ordering of the constraints
defined by R.
Definition 8
An ERC set Ω over constraints CON is shatterable iff for every subset ∆ ⊆ Ω, there is a
ranking R
CON of which all ERCs in Ω–∆ are true while all the ERCs in ∆ are false.
52
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Riggle
Ranking Hypotheses in OT
From this definition of shatterability for sets of ERCs, it is immediately clear that only
nontrivial ERCs can occur in shatterable sets.
Lemma 1
Every ERC in a shatterable set must contain at least one L and one W.
Proof : The ERCs of L+ cannot occur in a shatterable set because there is no ranking
of which they are true. Umgekehrt, ERCs with no L’s cannot occur in shatterable sets
(cid:1)
because there is no ranking of which they are false.
With Definition 8 in hand, and having excluded the trivial ERCs from the picture,
it will be possible to reduce shatterability for ERC sets to consistency under negation.
Erste, a definition of negation for sets of ERCs.
Definition 9
A partial negation of ERC set Ω is obtained by negating every ERC in a subset ∆ ⊆ Ω.
Zum Beispiel: Ω = {(cid:9)W, L, L(cid:10), (cid:9)e, W, L(cid:10)} has four partial negations: one per subset.
(cid:5)
(cid:6) (cid:5)
(cid:6) (cid:5)
(cid:6) (cid:5)
(cid:6)
α = (cid:9)W, L, L(cid:10)
γ = (cid:9)e, W, L(cid:10)
α = (cid:9)L, W, W(cid:10)
γ = (cid:9)e, W, L(cid:10)
α = (cid:9)W, L, L(cid:10)
γ = (cid:9)e, L, W(cid:10)
α = (cid:9)L, W, W(cid:10)
γ = (cid:9)e, L, W(cid:10)
Theorem 1
An ERC set Ω is shatterable iff every partial negation of Ω is consistent.
Proof : Suppose every partial negation of Ω is consistent. Daher, for any partial negation
in which ∆ is the negated subset of Ω and Γ is the rest of Ω, it is the case that there
is a ranking R of which all the ERCs of ∆ + Γ are true. Because a nontrivial ERC and
its negation are never both true of the same ranking and trivial ERCs cannot occur in
shatterable sets, the ERCs in Ω − Γ are false of ranking R while the ERCs of Γ are true.
Because ∆ was arbitrary, it is the case that for every subset of Ω, there is a ranking of
which the ERCs in that subset are false while the rest are true, and thus consistency
under partial negation is sufficient for shatterability. Wenn, andererseits, there is a
partial negation that is not consistent, then there is a subset of Ω such that if the ERCs
in that subset are negated, the resulting ∆ + Γ is not consistent. Jedoch, because there
is no ranking of which the members of an inconsistent ERC set are all true, Ω is not
shatterable because there is no ranking of which the ERCs in Γ are true while the ERCs in
Ω − Γ are false. Daher, consistency under partial negation is both necessary and sufficient
(cid:1)
for shatterability.
Corollary 1
Every subset of a shatterable ERC set is itself shatterable.
Proof : Because each partial negation of a shatterable ERC set must, by definition, Sei
consistent and because every subset of a consistent set must also be consistent, it is the
case that every subset of a shatterable set is consistent under every partial negation and
(cid:1)
is thus shatterable.
Defining shatterability in terms of partial negation lines up with the commonsense
observation that no set containing α and γ where α → γ is shatterable because there can
be no ranking of which the former is true while the latter is false. This is neatly captured
by the fact that if α → γ, no superset of {α, γ} can be shattered because fusing {α, γ}
is guaranteed to yield an ERC in L+. The requirement of consistency under partial
negation also shows why relatively weak conditions like (cid:9)W, W, L(cid:10) Und (cid:9)W, L, W(cid:10) kann nicht
co-occur in shatterable sets even though neither entails the other. In this case, fusing the
53
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 35, Nummer 1
negation of both ERCs yields (cid:9)L, L, L(cid:10) ∈ L+. This follows transparently from the fact that
either the statement ”C1 or C2 outranks C3” or the statement ”C1 or C3 outranks C2” is
true of any ranking of three constraints.
The definition of shatterability for ERC sets in terms of consistency under partial
negation makes it easy to demonstrate that for |CON| = k, there are shatterable ERC
sets of size k − 1. Diagonal ERC sets provide a particularly simple example of a class of
shatterable ERC sets of this size.
Definition 10
ERC set Ω is diagonal if its members can be given
as a list LΩ in which each nth ERC in the list has a
W in its nth coordinate, an L in its n + 1th coordinate,
and e’s everywhere else.
E.g., Ω =
(cid:9) W, L, e, e, e (cid:10)
(cid:9) e, W, L, e, e (cid:10)
(cid:9) e,
e, W, L, e (cid:10)
(cid:9) e,
e, e, W, L (cid:10)
Lemma 2
Diagonal ERC sets are shatterable.
Proof : Assume that Ω is a diagonal ERC set and Ω(cid:1) is an arbitrary subset of an arbitrary
partial negation of Ω. If n is the number of ERCs in Ω(cid:1) Dann, by the definition of diagonal
ERC sets, there must be at least n + 1 coordinates (columns) in Ω(cid:1) that are filled with
L or W for some ERC in Ω(cid:1) (d.h., are not all-e columns). Because each of the n ERCs has
only one L, at most n columns contain L’s, thus the fusion of Ω(cid:1) contains at least one W.
Because Ω(cid:1) was an arbitrary subset, no subset fuses to L+. Because the partial negation
(cid:1)
was arbitrary, every partial negation is consistent and thus Ω is shatterable.
From the shatterability of diagonal ERC sets (with k − 1 members if |CON| = k), Wir
obtain a lower bound of k − 1 on the VCD of ERC sets. Having established that there
are shatterable sets of k-length ERCs with k − 1 members, what remains to be shown is
that no set larger than k − 1 is shatterable.
Definition 11
Coordinate Ci is W-unique in ERC set Φ if Φ has a partial negation Φ(cid:1) such that in the
fusion of Φ(cid:1), φ = (cid:9)φ1, …, φk(cid:10), the only coordinate that contains a W is φi.
Definition 12
The minor Ωα,j of an ERC set Ω is a new set Ω(cid:1) in which ERC α has been removed and
the jth coordinate has been removed from the remaining ERCs.
α : (cid:9) L, L, W, e, W (cid:10)
β : (cid:9) e, e,
e, L, W (cid:10)
γ : (cid:9) e, e, W, L, e (cid:10)
δ : (cid:9) L, W, e, e, e (cid:10)
then Ωγ, 3 =
α : (cid:9) L, L, e, W (cid:10)
β : (cid:9) e, e, L, W (cid:10)
δ : (cid:9) L, W, e, e (cid:10)
Zum Beispiel, if Ω =
As illustrated in Definition 12, the term “minor” used here is analogous to the
standard notion of the minor of a matrix. It is straightforward to show that every
shatterable ERC set contains shatterable minors that can be obtained by removing one
constraint’s coordinate (column) and one ERC (Reihe).
Lemma 3
Reduction Lemma. – If Ω is a shatterable ERC set, then it has a shatterable minor Ωα,J.
54
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Riggle
Ranking Hypotheses in OT
Proof : By Corollary 1, for any α ∈ Ω, Ω − {α} is shatterable. In Ω − {α} there must
be at least one coordinate Cj that is not W-unique. If this were not the case and every
coordinate in Ω − {α} was W-unique, then one of the L’s in α would occlude the only
W in a partial negation of Ω, making it inconsistent contra the assumption that Ω is
shatterable (α must have at least one L by Lemma 1). Because Cj is not W-unique, für
every partial negation of every subset of Ω − {α}, there is a coordinate other than Cj
that fuses to W. This being the case, shatterability is preserved if Cj is eliminated. Daher,
(cid:1)
the minor Ωα,j is shatterable as required.
Theorem 2
For k > 1, the largest shatterable set of k-length ERCs has k − 1 members.
Proof : If x is the size of the largest shatterable set of k-length ERCs and y is the size of
the largest shatterable set of (k + 1)-length ERCs, then y is not greater than x + 1. Das
must be so because if y ≥ x + 2 then x could not be the size of the largest shatterable
set of k-length ERCs because a set of (k + 1)-length ERCs would have a shatterable
minor larger than x. Weil (cid:9)W, L(cid:10) Und (cid:9)L, W(cid:10) are the only nontrivial ERCs for k = 2
and because they are antithetical and thus cannot co-occur in a shatterable set, Die
largest shatterable ERC set at k = 2 consists of a single ERC. This base case establishes
an upper bound of k − 1 on the size of shatterable ERC sets and the diagonal ERC sets
provide a lower bound of k − 1. Together these bounds place the cardinality of the
largest shatterable set at exactly k − 1.
(cid:1)
Along with the diagonal ERC sets, there are many shatterable ERC sets with k − 1
members, but no shatterable sets with more than k − 1 members. What remains now is
to connect this result for ERC sets back to the realm of candidates.
4. The VCD of Optimality Theory
CON
The question posed at the outset of this article was: for a constraint set CON with k
constraints that map candidates to natural numbers, what is the cardinality of the largest
set of candidates S such that, for each subset T ⊆ S, there is at least one ranking R
under which every t in T is optimal, but no s in S − T is optimal? Clearly, the answer
to this question depends greatly on details of the constraints in CON. Jedoch, if we
reduce candidates to the ERC sets associated with them, it is possible to place an upper
bound on the size of S without knowing anything about CON other than its size k.
Recall that a candidate c is mapped to True by ranking R
CON just in case every ERC
in ERCS(C) is consistent with R. Umgekehrt, c is mapped to False by R if any of the ERCs
in ERCS(C) is not consistent with R. This notion can be extended to sets of candidates
as follows. If S is a set of candidates, then ERCS(S) is the union of ERCS(S) for all s ∈ S.
A sample S is accepted by ranking R just in case every α in ERCS(S) is consistent with R.
Umgekehrt, S is rejected by R if any α in ERCS(S) is not consistent with R. Außerdem,
if ERCS(S) is consistent, then there must be at least one ranking that accepts S. In diesem
Fall, we will refer to S as a consistent sample. The concepts of partial negation and
W-uniqueness also have analogs for candidate sets.
Definition 13
For a consistent sample S, a partial exclusion is a partial negation of ERCS(S) Das
rejects some F ⊆ S by rendering ercs( F ) inconsistent for each f ∈ F while preserving the
consistency of ercs(S) for every s ∈ (S − F).
55
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 35, Nummer 1
Definition 14
Ci is w-unique in S if there is a partitioning of S into T and F under which Ci is the only
coordinate that fuses to W in ERCS(T) for every partial exclusion that rejects F.
The property of W-uniqueness in samples crucially contrasts with what one might call
being semi-unique—the case where for at least one, aber nicht alles, of the partial exclusions
that reject F, Ci is the only column that fuses to W in ERCS(T).
Definition 15
Given a constraint set CON and a sample S, the minor Sx,j is obtained by removing
candidate x from S and removing constraint Cj from CON.
By extending partial negation, W-uniqueness, and the concept of minors to the realm of
Proben, it is straightforward to show that shatterable samples have shatterable minors.
Lemma 4
Shatterable samples have shatterable minors.
Proof : Assume that S is a shatterable sample. Because removing candidate x from S has
no effect on whether the remainder of S can be shattered, S − {X} is also shatterable.
Sample S − {X} must have at least one coordinate that is not W-unique. If this were
not the case, Dann, because ERCS(X) must contain at least one coordinate with an L
(else there would be no way to reject {X}), the presence of x in S would place an L in a
W-unique coordinate in S − {X}. Jedoch, this would make it impossible to associate a
ranking with at least one partitioning of S into accepted and rejected subsets contra the
assumption that S is shatterable. If Cj is a coordinate in S − {X} that is not W-unique,
Dann, for every partial exclusion that rejects F, under each partitioning of S − {X} into
T and F, there is at least one other coordinate Ci that fuses to W. Daher, Cj could be
removed from CON while preserving the shatterability of S − {X}. daher, Sx,j is a
(cid:1)
shatterable minor as required.
The crucial piece of the proof in Section 3 that the VCD of ERC sets is k − 1 War
the illustration of a one-to-one relationship between k and the bound on shatterable sets
by showing that removing an ERC from a shatterable set makes it possible to remove
a coordinate from the remaining ERCs while preserving shatterability. If shatterable
ERC sets could be larger than k − 1, then it would have been necessary to remove sev-
eral ERCs before it was possible to safely remove a coordinate from the remaining
ERCs. Because ERC sets and candidate samples both have shatterable minors, a similar
strategy will show that shatterable samples must also grow at a one-to-one rate with k.
Theorem 3
Wenn |CON| = k, then the size of shatterable sample sets is bounded at k − 1.
Proof : If k = 2, a sample consisting of a single candidate can be shattered if ERCS(S) Ist
{(cid:9)W, L(cid:10)} oder {(cid:9)L, W(cid:10)}, but no larger sample can be shattered. If there were such a sample,
it would contain at least two candidates a and b and there would be a ranking under
which both candidates were optimal, a ranking under which neither candidate was
optimal, a ranking that made a but not b optimal, and another ranking that made b
but not a optimal. This state of affairs requires at least four distinct rankings, welches ist
impossible with only two constraints. Daher, it is established that, at k = 2, the largest
shatterable sample set has at most one candidate.
56
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Riggle
Ranking Hypotheses in OT
If S is the largest shatterable sample for k constraints, Dann, at k + 1, the size of
the largest shatterable sample is |S| + 1. If this were not the case, there would be a
shatterable sample X such that |X| ≥ |S| + 2 for k + 1 constraints. Jedoch, Weil
shatterable samples have shatterable minors (Lemma 4), this would mean that there
was a shatterable sample of size |S| + 1 for k constraints, contrary to the assumption
that S was largest. Given the base case that |S| = 1 when k = 2, the cardinality of
shatterable samples is thus bounded at k − 1 as required.
(cid:1)
The bound of k − 1 defines the limiting case that is obtained when there can be can-
didates in the sample space for any ERC set. In actual practice, the specific details of
the constraints in CON and the range of ways that they interact will determine which
elements of the powerset of the set of k-length ERCs are associated with candidates in
the sample space. This means that the VC dimension of a specific constraint set CON
can be much lower than |CON| − 1. dennoch, the result that the VCD of OT can be
at most |CON| − 1 is propitious for the learnability of Optimality Theoretic grammars.
5. Conclusions
Bounding the VC dimension of OT according to the number of constraints in CON es-
tablishes a general property of the sets of ranking hypotheses that can be associated
with sets of candidates. This bound is independent of any assumptions about how the
ERC sets for candidates are computed, independent of any assumptions about how
optimizations are computed, and independent of any assumptions about the formal
properties of constraints other than that they map candidates to N.
The linear growth of the VCD with |CON| = k provides a very general positive
learnability result for OT. Blumer et al. (1989), building on the learning model of Valiant
(1984), define a concept class C as uniformly learnable if there is a learning algorithm
A such that, for any error threshold (cid:10) and confidence level δ, if A is given m training
samples randomly drawn according to a probability distribution π over the sample
Raum, then A has at least probability δ of generating a hypothesis whose likelihood
of misclassifying any point in the sample space drawn randomly according to π is less
als (cid:10). Blumer et al. link the VC dimension to learnability by showing that concept
classes are uniformly learnable if and only if they have a finite VCD. Darüber hinaus, Sie
show that upper bounds on m can be established for learning that depend only on the
VC dimension of the concept class to be learned. The bound on m according to d =VCD
from Blumer et al. is given in Equation (1).
m ≤
(cid:10)
4
(cid:10)
(cid:11)
d ln 12
(cid:10) + ln 2
δ
(cid:12)(cid:13)
(1)
This is a worst-case bound that holds for the most adversarial probability distributions
over the sample space and the worst consistent learning algorithms (d.h., Algorithmen
that are consistent in that they correctly classify all data in the training set, but worst-
case in that they err maximally on all unobserved data). Specific OT learning algorithms
that have tighter bounds and non-worst-case probability distributions over samples will
certainly present a different picture.
For a concrete example of OT learning, consider a version of Prince and Smolensky’s
(1993) basic CV syllable theory in which candidates are mappings from {C, V}* Zu {C,
V, .}*, and for each input i ∈ {C, V}*, the candidate set produced by GEN(ich) represents
57
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 35, Nummer 1
all ways of modifying i through deletion and insertion of ., C, and V.3 If CON contains (ich)
a constraint against deletion, (ii) a constraint against V insertion, (iii) a constraint against
C insertion, (iv) a constraint against syllables with codas, Und (v) a constraint against
syllables without onsets, then the range of possible rankings of these five constraints
allows for 120 different grammars which in turn define twelve different languages
(d.h., twelve subsets of the sample space X = {C, V}* × {C, V, .}*).
If learners are trained with positive evidence in the form of optimal input → output
mappings, then the probability distribution over the sample space can be characterized
in terms of the probability distribution over the input strings in {C, V}*. Each optimal
candidate a = (i → o) provides information about the teacher’s ranking in the form of
ERCS(A) = {erc(A, B)|b = (i → x) ∈ GEN(ich)}. Riggle (2004) zeigt, dass, because the func-
tions in this system are all rational (d.h., finite state representable), the set ERCS(A) can
be derived via an algorithm called CONTENDERS. In this system, ERCS(A) can contain
from zero to twelve ERCs. The zero-ERC cases arise for input strings that share the
same optimal output under all rankings (d.h., /CV/→[.CV.]). The sets top out at twelve
because there are never more than twelve contenders (d.h., non-harmonically-bounded
candidates) for any given input string. The twelve ERC bound is a consequence of the
fact the 120 rankings only realize twelve distinct languages.4
As noted in Section 1, candidates with the same input cannot co-occur in shatterable
sets. Because of this, the bound on shatterable samples established in Section 4 carries
transparently over to the more general case where learners are trained with optimal
(ich, Ö) mappings and then tested with novel inputs. Because the set of contenders is
determined solely by GEN and CON (which the learner is presumed to have access to)
if the learner can compute CONTENDERS(ich), then testing on novel inputs reduces to
having the learner select one optimal candidate from the set of contenders, welche
in turn reduces to binary questions of harmonic inequality between pairs a and b in
CONTENDERS(ich), which in turn reduces to the question of which of erc(A, B) or erc(B, A) Ist
consistent with the ERCs gleaned from previous observations.
This is merely one sketch of how the learning problem in OT can be formulated
so that the VC dimension can predict its success. There are undoubtedly other possible
formulations. Außerdem, as noted, real-world cases will often contain details that are
more relevant than the VC dimension in predicting learnability. Zum Beispiel, in syl-
lable structure grammar just described, there are inputs for which the CONTENDERS
algorithm generates one candidate per language in the factorial typology. In such a
Fall, the ERC set for a single optimal candidate can serve as a “global trigger” that is
sufficient to uniquely identify the teacher’s language. Further analysis with specific
constraints and OT learning algorithms like Recursive Constraint Demotion (Tesar 1995,
1997, 1998; Tesar and Smolensky 1993, 2000), the Gradual Learning Algorithm (Boersma
1997, 1998; Boersma and Hayes 2001), and the ERC-Union learner (Riggle 2004) Wille
surely yield further insights and a less abstract picture of learning in Optimality Theory.
The VCD is an extremely robust metric that characterizes hardness in many learning
frameworks (Haussler, Kearns, and Schapire 1992) and is applicable without any as-
sumptions other than that the learner is consistent. Any learner that bases its hypotheses
on the union of the ERCs associated with the data on which it is trained is guaranteed to
be consistent, and thus an extremely simple ERC-union learner can learn OT grammars
3 C and V represent consonants and vowels respectively and “.” represents a syllable boundary marker.
4 Riggle (2004) extends Prince and Smolensky’s nine-way factorial typology to twelve with a slightly looser
GEN. In this case, Weil (cid:18)log2 12(cid:19) = 4, the ERC-based VCD bound is the same as that obtained by the
finitude of the typology. Usually, Jedoch, we do not have the luxury of knowing the size of C.
58
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Riggle
Ranking Hypotheses in OT
from random training texts whose size m is linear in k. This linear bound on the re-
lationship between k and sample complexity is a nice tightening of the k log2 k bound
that follows from the finitude of k! and contrasts starkly with pessimistic assessments
of learnability suggested by the factorial relationship between k and the number of
possible grammars.
Verweise
Anderson, Alan R. and Nuel D. Belnap, Jr.
1975. Entailment – The Logic of Relevance and
Necessity. Princeton University Press.
Blumer, Anselm, Andrzej Ehrenfeucht,
David Haussler, and Manfred K.
Warmuth. 1989. Learnability and the
Vapnik-Chervonenkis dimension. Zeitschrift
of the ACM, 36(4):929–965.
Boersma, Paul. 1997. How we learn variation,
optionality, and probability. Proceedings of
the Institute of Phonetic Sciences, 21:43–58.
Boersma, Paul. 1998. Functional Phonology:
Formalizing the Interactions between
Articulatory and Perceptual Drives. Ph.D.
These, Den Haag.
Boersma, Paul and Bruce Hayes. 2001.
Empirical tests of the gradual learning
Algorithmus. Sprachliche Untersuchung, 32:45–86.
Dijkstra, Edsger. W. 1959. A note on
two problems in connexion with graphs.
Numerische Mathematik, 1:269–271.
Ellison, T. Markieren. 1994. Phonological
derivation in optimality theory. In
Proceedings of the Fifteenth Conference on
Computerlinguistik, pages 1007–1013,
Kyoto, Japan. doi:dx.doi.org/10.3115/
991250.991312.
Haussler, David, Michael Kearns, and Robert
Schapire. 1992. Bounds on the sample
complexity of Bayesian learning using
information theory and the VC dimension.
Technical Report UCSC-CRL-91-44.
Littlestone, Nick. 1988. Learning quickly
when irrelevant attributes abound:
A new linear-threshold algorithm. Machine
Learning, 2(4):285–318.
Prince, Alan. 2002. Entailed ranking arguments.
ROA 500. Available at http://roa.rutgers.edu.
Prince, Alan and Paul Smolensky. 1993.
Optimality Theory: Constraint Interaction in
Generative Grammar. Blackwell, Malden,
MA.
Riggle, Jason. 2004. Generation, Recognition,
and Learning in Finite State Optimality
Theory. Ph.D. These, Universität
Kalifornien, Los Angeles.
Samek-Lodovici, Vieri and Alan Prince. 1999.
Optima. ROA 785. Verfügbar um
http://roa.rutgers.edu.
Tesar, Bruce. 1995. Computational Optimality
Theory. Ph.D. These, Universität
Colorado.
Tesar, Bruce. 1997. Multi-recursive constraint
demotion. ROA 197. Verfügbar um
http://roa.rutgers.edu.
Tesar, Bruce. 1998. Error-driven learning in
Optimality Theory via the efficient
computation of optimal forms. In Is the
Best Good Enough? Optimality and
Competition in Syntax, Hrsg. Pilar Barbosa,
Danny Fox, Paul Hagstran, Martha J.
McGinnis, and David Pesetsky. MIT Press,
Cambridge, MA.
Tesar, Bruce and Paul Smolensky. 1993.
The learnability of optimality theory:
An algorithm and some basic complexity
results. Unpublished manuscript.
Department of Computer Science
& Institute of Cognitive Science,
University of Colorado at Boulder.
Tesar, Bruce and Paul Smolensky. 2000.
Learnability in Optimality Theory. MIT Press,
Cambridge, MA.
Valiant, Leslie G. 1984. A theory of the
learnable. Communications of the ACM,
27(11):1134–1142.
Vapnik, V. N. and A. Chervonenkis. 1971.
On the uniform convergence of relative
frequencies of events to their probabilities.
Theory of Probability and its Applicaions,
16:264–280.
59
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
3
5
1
4
7
1
7
9
8
5
6
2
/
C
Ö
l
ich
.
0
7
–
0
3
1
–
R
2
–
0
6
–
9
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
PDF Herunterladen