Consistent Unsupervised Estimators for Anchored PCFGs
Alexander Clark
Department of Philosophy
King’s College London
alexsclark@gmail.com
Nathanaël Fijalkow
CNRS, LaBRI, Bordeaux, and The
Alan Turing Institute of Data
Science, London
nathanael.fijalkow@labri.fr
Abstract
Learning probabilistic context-free grammars
(PCFGs) from strings is a classic problem in computational linguistics, going back to Horning (1969).
Here we present an algorithm based on distri-
butional learning that is a consistent estimator
for a large class of PCFGs that satisfy certain
natural conditions including being anchored
(Stratos et al., 2016). We proceed via a
reparameterization of (top–down) PCFGs that
we call a bottom–up weighted context-free
grammar. We show that if the grammar is
anchored and satisfies additional restrictions
on its ambiguity, then the parameters can be
directly related to distributional properties of
the anchoring strings; we show the asymptotic
correctness of a naive estimator and present
some simulations using synthetic data that
show that algorithms based on this approach
have good finite sample behavior.
1 Introduction
This paper presents an approach for strongly learn-
ing a linguistically interesting subclass of prob-
abilistic context-free grammars (PCFGs) from
strings in the realizable case. Unpacking this, we
assume that we have some PCFG that we are
interested in learning and that we have access
only to a sample of strings generated by the
PCFG (i.e., sampled from the distribution defined
by the context-free grammar). Crucially, we do
not observe the derivation trees—the hierarchical
latent structure. Strong learning means that we
want the learned grammar to define the same
distribution over labeled trees as the original
grammar and not just the same distribution over
strings.
Clearly, there can be many structurally different
PCFGs that define the same distribution over
strings. Consider for example the distribution that
generates a single string of length 3 with prob-
409
ability one and the various PCFGs that give rise to
that same distribution; for these obvious reasons,
that we discuss in more detail later, we cannot
have an algorithm that does this for all PCFGs.
Accordingly, we define some sufficient conditions
on PCFGs for this algorithm to perform correctly.
More precisely, we define some simple structural
conditions on the underlying CFGs (in Section 3),
and we will show that the resulting class of PCFGs
is identifiable from strings, in the sense that any
two PCFGs that define the same distribution over
strings will be isomorphic.
We then provide a computationally trivial learning algorithm in Section 4, together with a proof that it will strongly learn every grammar in this class. The algorithm is not intended to be a realistic algorithm, but merely to illustrate the fundamental correctness of this general approach. We then show, using simulations with synthetic data in Section 5, that general PCFGs in Chomsky normal form (CNF) that approximate the observable properties of natural language syntax are efficiently learnable.
Our primary scientific motivation is to under-
stand the process of first-language acquisition, in
particular the early phases of the acquisition of
syntactic structure. Importantly, the grammar is
not just a decision procedure that classifies strings
as being grammatical or ungrammatical, but addi-
tionally assigns a tree structure to the grammatical
sentences, a structure the primary role of which is
to support semantic interpretation. The standard
view is that children learn the syntactic structure
of their languages not by purely syntactic means,
but rather by using information about the range
of available interpretations, derived from the
situational context of the sentences they hear and
inferences about the intentions and goals of the
speaker (e.g., Abend et al., 2017). Indeed there
is ample direct evidence from the developmental
psycholinguistics literature that this does in fact
happen at certain stages of language acquisition:
For example, Gropen et al. (1991) showed that
the acquisition of argument structure of verbs
exploits semantic information about the verb and
the arguments. However, the children in these
experiments—the youngest cohort being nearly 4
years old—have already acquired a great deal of
knowledge about English syntax.
Here, we are exploring an alternative or perhaps
complementary hypothesis: namely, that the ac-
quisition of the syntactic categories and rules of the
language can to a certain extent be learned using
only information derived from the surface strings
without any appeal to external information about
the hierarchical structure of the language that is
being learned. In other words, the initial phases of
language acquisition are based on purely syntactic
information rather than the semantic bootstrapping
discussed above.
The contributions of this paper are as follows.
First, we provide a reparameterization of PCFGs
within the space of weighted context-free gram-
mars (WCFGs) that we call Bottom–up WCFGs.
Next, we define three structural conditions on
CFGs and show that they imply the identifiability
of the class of all PCFGs based on those grammars.
We then present a naive computationally trivial
estimator and prove its asymptotic consistency
for that class of PCFGs. We present some exper-
iments on synthetic grammars that show that a
variant of this algorithm has good finite sample
behavior. Finally, we examine the extent to which
these conditions are plausible, using a corpus of
child-directed speech.
2 Definitions
We assume we have a finite set of atomic symbols
Σ. The set of finite strings over this set is written
Σ∗, nonempty finite strings are denoted by Σ+,
and the empty string is λ. We will typically write
a, b, c, . . . for elements of Σ and u, v, w, . . . for
elements of Σ∗. A (formal) language L is a subset
of Σ∗. A context is an ordered pair of strings, that is, an element of Σ∗ × Σ∗, that we write as l, r. If U, V are languages, then their concatenation UV is defined in the normal way, and we will also write uV, where u is a string, instead of {u}V, and so on. Given a fixed language L, we define for a set of strings U a set of contexts U⊲ as

U⊲ = {(l, r) | lUr ⊆ L}
If U = {u} we will write u⊲ for the distribution
of u—the set of contexts in which it can occur.
A stochastic language is a function P from Σ∗ to [0, 1] such that ∑_{w ∈ Σ∗} P(w) = 1. Note
that the support of this distribution is a formal
language as defined above. We assume for the
rest of the paper that the expected length of strings
drawn from this distribution is finite.
We can define, for some u ∈ Σ+, the expected number of times that u will occur as a substring in a string distributed according to P.¹

E(u) = ∑_{(l,r) ∈ Σ∗ × Σ∗} P(lur)
We can also define, for a string u, its context distribution D(u), which is a probability distribution over its contexts, whose support will be u⊲, given for (l, r) ∈ Σ∗ × Σ∗ by

D(u)[l, r] = P(lur) / E(u).
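As a concrete illustration, the following is a minimal sketch of ours (not the authors' code), assuming whitespace-tokenized sentences, of how the empirical substring expectation Ê(u) and context distribution D(u) could be computed from a finite sample:

# Sketch: empirical substring expectations E_hat(u) and context distributions
# D(u) estimated from a list of whitespace-tokenized sentences.
from collections import Counter

def empirical_expectations(sample, max_len=2):
    """E_hat[u] = (number of occurrences of u as a substring) / N."""
    counts = Counter()
    for w in sample:
        toks = w.split()
        for i in range(len(toks)):
            for j in range(i + 1, min(i + max_len, len(toks)) + 1):
                counts[tuple(toks[i:j])] += 1
    n = len(sample)
    return {u: c / n for u, c in counts.items()}

def context_distribution(sample, u):
    """Empirical D(u)[l, r], proportional to the count of the whole string l u r."""
    u = tuple(u.split())
    ctx = Counter()
    for w in sample:
        toks = tuple(w.split())
        for i in range(len(toks) - len(u) + 1):
            if toks[i:i + len(u)] == u:
                ctx[(toks[:i], toks[i + len(u):])] += 1
    total = sum(ctx.values())
    return {c: k / total for c, k in ctx.items()} if total else {}

sample = ["the dog barks", "the cat sleeps", "a dog sleeps"]
print(context_distribution(sample, "dog"))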
Context-Free Grammars. We consider context-free grammars (CFGs) in Chomsky normal form ⟨Σ, V, S, P⟩,² where Σ is a nonempty finite set of terminal symbols; V is a nonempty finite set of nonterminal symbols, disjoint from Σ; S is a distinguished element of V, the start symbol; and P is a finite nonempty set of productions, each of which is either of the form A → BC, where A ∈ V and B, C ∈ V \ {S}, or A → a, where A ∈ V and a ∈ Σ.

We write A, B, C, . . . for elements of V and α for strings over V ∪ Σ. A derivation tree τ is a singly rooted ordered tree where every node is labeled with an element of V ∪ Σ and each local tree is in P. The yield of a derivation is the string of symbols of the leaves of the tree taken left to right; we write this as y(τ). The set of all derivations licensed by G, rooted by a nonterminal A, and with a yield in a set Γ is written Ω(G, A, Γ); here we follow the notation of Smith and Johnson (2007), among others. We will omit G when it is clear.
1This is the expectation because if u occurs n times in
a string w, there will be n distinct contexts l, r such that
lur = w.
2We follow the classical definition of Chomsky normal
form in not allowing S to occur on the right-hand side of
any rules. This simplifies various parts of the analysis, and
makes the learning problem slightly harder, but it is not hard
to remove this restriction if it is desired. Note that we do not
allow an empty right-hand side of a production.
We want to be able to combine trees using tree substitution; thus, if we have a tree τ1 whose yield is lBr, where l and r are strings over Σ, and a tree τ2 whose root is B and whose yield is α, we can combine them to get a tree τ1 ⊗ τ2 whose yield is lαr.
We define the string language defined by a nonterminal A to be

L(G, A) = {y(τ) : τ ∈ Ω(G, A, Σ+)}.

The string language defined by a CFG G is L(G) = L(G, S).
For a tree τ and a production A → α we write f(A → α; τ) for the number of times the production occurs in τ. We write |τ| for the number of nonterminal symbols in a tree, and |w| for the length of a string.
2.1 WCFGs
We will now consider the probabilistic case where
we have a (discrete) probability distribution over
trees, that is, over Ω(G, S, Σ+), which will then
define a stochastic language, whose support will
be a context-free language. We will only consider
those distributions which satisfy some simple
conditional independence assumptions and can
be represented by weighted CFGs.
A weighted CFG (WCFG) is a CFG together with a parameter function θ : P → R that maps productions to nonnegative real values; we will write this as G; θ. The weight or score of a tree τ is the product of the weights of its productions. Formally, s : Ω(G) → R is defined as
→
→
s(τ ; θ) = Y
α
A
P
θ(A
→
α)f (A
α;τ )
→
∈
→
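As a small illustration (ours, not from the paper), the score of a derivation tree can be computed recursively; the nested-tuple tree encoding and rule keys are assumptions of this sketch:

# Sketch: s(tau; theta) as the product of production weights. A tree is either
# (Nonterminal, terminal) for a lexical rule or (Nonterminal, left, right).
def tree_score(tree, theta):
    label = tree[0]
    if len(tree) == 2 and isinstance(tree[1], str):
        return theta[(label, tree[1])]               # lexical rule A -> a
    left, right = tree[1], tree[2]
    return (theta[(label, left[0], right[0])]        # binary rule A -> B C
            * tree_score(left, theta) * tree_score(right, theta))

theta = {("S", "NP", "VP"): 1.0, ("NP", "she"): 0.4, ("VP", "runs"): 0.2}
tau = ("S", ("NP", "she"), ("VP", "runs"))
print(tree_score(tau, theta))  # 1.0 * 0.4 * 0.2 = 0.08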
Note that s(τ1 ⊗ τ2) = s(τ1)s(τ2). In general we will define the score of a set of trees Ω to be the sum of the scores of the trees in that set: s(Ω) = ∑_{τ ∈ Ω} s(τ). The weight of a string w is the sum of the weights of each derivation tree that yields w: s(w) = s(Ω(G, S, w)).
Definition 2.1. The inside value of a nonterminal A, written I(A), is

I(A) = s(Ω(G, A, Σ+))

Note that this quantity is sometimes called the partition function, written Z(A). The outside value, O(A), is defined likewise as

O(A) = s(Ω(G, S, Σ∗AΣ∗))
Note that O(S) = 1 by definition, since
Ω(G, S, Σ∗SΣ∗) is a single element set consisting
of the trivial tree with one node S, which has
score 1.
A WCFG is globally normalized if I(S) = 1. In this case it defines a probability distribution over trees: we can identify the probability of a tree with its score, P(τ) = s(τ), and via that a stochastic language.
2.2 Expectations
We define expectations of nonterminals, terminals, and productions with respect to the distribution over trees defined by a globally normalized WCFG.

Given a globally normalized WCFG, the quantity E(A → α) is the expected number of times the production A → α occurs in a tree generated by the distribution induced by the grammar:

E(A → α) = ∑_{τ ∈ Ω(G,S,Σ+)} s(τ) f(A → α; τ)
Using this we define the expectation of a nonterminal:

E(A) = ∑_{α : A→α ∈ P} E(A → α)

Note that E(S) = 1 (because it can only occur at the root of every tree).
→
For nonterminals A, B, C and terminals a, the
following identities relate the expectations and the
inside and outside values, which can be established
using the methods of, for example, Chi (1999).
E(A) = I(A)O(A)
a) = O(A)θ(A
→
BC) = O(A)θ(A
→
→
E(A
E(A
→
(1)
a)
BC)I(B)I(C)
Note that for any nonterminal A that is not S, and any β > 0, we can scale all parameters for productions with A on the left-hand side by β, and every production with A on the right-hand side by β⁻¹ (or β⁻² if A occurs twice on the right-hand side), and the score of every tree will remain the same. There are two natural ways of resolving this arbitrariness: one is to stipulate that I(A) = 1 for all nonterminals, which gives us the familiar PCFG. The parameters of a tight PCFG satisfy

θ(A → α) = E(A → α) / E(A).    (2)
The learning approach we take here is based on
modeling the context distribution, and it is there-
fore more mathematically convenient to use the
second normalization method where we stipulate
that O(A) = 1 for all nonterminals. We now
define this alternative parameterization, which we
call a bottom–up WCFG, in contrast to the top–
down generative process associated with a PCFG.
Definition 2.2 (bottom–up WCFG). We say that a
WCFG is in bottom–up form if I(S) = 1, and for
all nonterminals A, O(A) = 1.
If a WCFG is in bottom–up form then the
parameters satisfy:
θ(A → BC) = E(A → BC) / (E(B)E(C)),    θ(A → a) = E(A → a).    (3)

Note that in this form, we condition the parameters on the right-hand side of the production, not on the left-hand side as is done with a PCFG.
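As a quick illustration of Equations 2 and 3, here is a sketch of ours (dict-based inputs are an assumption) that reads both parameterizations off the production and nonterminal expectations:

# Sketch: from production expectations E and nonterminal expectations E_nt
# to top-down PCFG parameters (Eq. 2) and bottom-up parameters (Eq. 3).
def to_pcfg(E, E_nt):
    # theta(A -> alpha) = E(A -> alpha) / E(A)
    return {rule: e / E_nt[rule[0]] for rule, e in E.items()}

def to_bottom_up(E, E_nt):
    theta = {}
    for rule, e in E.items():
        if len(rule) == 3:                 # binary rule (A, B, C)
            theta[rule] = e / (E_nt[rule[1]] * E_nt[rule[2]])
        else:                              # lexical rule (A, a)
            theta[rule] = e
    return theta

E = {("S", "NP", "VP"): 1.0, ("NP", "she"): 1.0, ("VP", "runs"): 1.0}
E_nt = {"S": 1.0, "NP": 1.0, "VP": 1.0}
print(to_pcfg(E, E_nt), to_bottom_up(E, E_nt))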
There is a unique bijection between the class
of tight PCFGs and bottom–up WCFGs; we can
easily convert from one form to the other. We can
efficiently compute the inside and outside values
of a convergent WCFG using standard techniques
(Hutchins, 1972; Nederhof and Satta, 2008;
Etessami et al., 2012); these involve solving a
system of quadratic equations (since the grammar
is in Chomsky normal form) in the case of the
inside values, which can be done using the Newton
method or a fixed point iteration, and a linear
system in the case of the outside values. The expec-
tations of each production can then be computed
using Equation 1 and then converted into a PCFG
or bottom–up WCFG as desired using Equations 2
and 3, respectively.
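A minimal sketch of ours (not the authors' implementation) of the fixed-point iteration for the inside values of a CNF WCFG follows; the dict-based grammar encoding is an assumption of the sketch.

# Sketch: inside values I(A) of a CNF WCFG by fixed-point iteration.
# binary[(A, B, C)] and lexical[(A, a)] hold the weights theta.
def inside_values(nonterminals, binary, lexical, iters=100):
    I = {A: 0.0 for A in nonterminals}
    for _ in range(iters):
        new = {A: 0.0 for A in nonterminals}
        for (A, a), w in lexical.items():
            new[A] += w
        for (A, B, C), w in binary.items():
            new[A] += w * I[B] * I[C]
        I = new
    return I

# Tiny example: S -> S S (0.3) | a (0.7); the least solution of
# I = 0.3*I^2 + 0.7 is I(S) = 1.0, which the iteration approaches.
print(inside_values(["S"], {("S", "S", "S"): 0.3}, {("S", "a"): 0.7}))

The outside values satisfy a linear system and can be solved similarly; with I and O in hand, Equation 1 gives the expectations needed for Equations 2 and 3.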
3 Identifiability
We assume that we have a sequence of strings
generated independently and identically distrib-
uted (i.i.d.) from some distribution generated by
an unknown PCFG or WCFG, which we call the
target grammar.
We are interested in the problem of producing a
PCFG from this input data that is close to the target
PCFG; namely, the underlying CFG is isomorphic
to the underlying CFG of the target grammar and
additionally the parameters are within ǫ of the cor-
responding parameters of the target grammar: we
call this being ǫ-close. Two CFGs are isomorphic
if they are identical apart from the labels of the
nonterminals; the isomorphism is just a bijection
between the nonterminals and productions in the
natural way.
Definition 3.1. Two WCFGs, G; θ and G′; θ′, are ǫ-close if there is a CFG-isomorphism φ from G to G′ such that for all A → α in the grammars,

|θ(A → α) − θ′(φ(A → α))| < ǫ
More precisely, we say that a learning algorithm A is a consistent estimator for a class 𝒢 of globally normalized WCFGs if, for every WCFG G∗; θ∗ in the class, for every ǫ, δ > 0, there is an N such that if the algorithm receives a sample of m strings, sampled i.i.d., where m ≥ N, then it outputs a WCFG ˆG, ˆθ such that with probability at least 1 − δ we have that ˆG, ˆθ is ǫ-close to G∗; θ∗.

3.1 Structural Conditions on Grammars
We now define three structural conditions on
PCFGs that will be sufficient
to guarantee
identifiability of the class from strings.
Condition 3.1. A grammar G is anchored if for every nonterminal A there exists a terminal a such that A → a ∈ P and, if B → a ∈ P, then B = A. In other words, a occurs on the right-hand side of exactly one production.
a
We will call such a terminal a characterizing
terminal of A, and if a characterizes A we will
sometimes write [[a]] for A.
This condition is very close to a number of
conditions that have been proposed in the literature
both for topic modeling and for grammatical
inference: We use here the terminology of Stratos
et al. (2016), but similar ideas occur in, for
example, Adriaans’s (1999) approach to learning
CFGs and Denis et al.’s (2004) approach to
learning regular languages. This is also very
closely related to what is called the 1-Finite Kernel
Property in distributional learning of CFGs (Clark
and Yoshinaka, 2016).
The key idea behind the learning algorithm is
this: If every nonterminal has a characterizing
terminal then we can infer the probabilities of the
productions of the grammar from distributional
properties of the strings of corresponding termi-
nals. Thus if A, B, and C are nonterminals chara-
cterized by a, b, and c, respectively, then we can
infer something about the parameter of the produc-
tion A → BC by looking at the distributional properties of a and bc. And if A is a nonterminal
characterized by a and b is any terminal, then we can infer something about the parameter of the production A → b by looking at the distributional properties of a and b.
3.2 Divergences
We start by defining some quantities that depend only on a distribution over strings. Recall that the Rényi α-divergence (Rényi, 1961) between two discrete distributions P and Q is defined for α = ∞ as

R∞(P ∥ Q) = log sup_x P(x) / Q(x)    (4)

Given two strings u, v we will be concerned with ρ(u → v), defined as

ρ(u → v) = R∞(D(u) ∥ D(v))    (5)
This is an asymmetric nonnegative measure of “distance” between the context distributions of u and v, which takes the value 0 only when they are identical. Note that, because u⊲ is the support of D(u),

e^{−ρ(u→v)} = (E(u) / E(v)) inf_{(l,r) ∈ u⊲} P(lvr) / P(lur)
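For concreteness, here is a small sketch of ours of the Rényi ∞-divergence of Equation 4 for two finite discrete distributions represented as dicts; for context distributions D(u), D(v) the supremum ranges over the support u⊲:

# Sketch: R_inf(P || Q) = log sup_x P(x)/Q(x) for dict-valued distributions.
import math

def renyi_inf(P, Q):
    sup = 0.0
    for x, p in P.items():
        if p == 0.0:
            continue
        q = Q.get(x, 0.0)
        if q == 0.0:
            return math.inf      # support of P not contained in support of Q
        sup = max(sup, p / q)
    return math.log(sup)

P = {"ctx1": 0.5, "ctx2": 0.5}
Q = {"ctx1": 0.25, "ctx2": 0.5, "ctx3": 0.25}
print(renyi_inf(P, Q))  # log(0.5 / 0.25) = log 2
print(renyi_inf(Q, P))  # infinity: ctx3 lies outside the support of P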
We can now state a foundational result, which
relates the parameters of a production to these
divergences. We will start by proving an inequal-
ity, that we will later strengthen to an equality
under additional conditions.
Theorem 3.1. Suppose G; θ is a bottom–up WCFG, and G is anchored. Let D be the distribution it defines, and P the set of productions. Suppose that a, b, c are characterizing terminals for nonterminals A, B, C respectively. Then for any terminal d, if A → d ∈ P,

θ(A → d) ≤ E(d) e^{−ρ(a→d)}

and if A → BC ∈ P,

θ(A → BC) ≤ (E(bc) / (E(b)E(c))) e^{−ρ(a→bc)}
Proof. Suppose A is a nonterminal in G that is characterized by a. Then, for every context l, r, since the only way that we can derive an a is via A, P(lar) = s(Ω(S, lAr))θ(A → a). Summing both sides with respect to l, r, we obtain

E(a) = O(A)θ(A → a)

Since O(A) = 1 in a bottom–up WCFG we have that

θ(A → a) = E(a)    (6)

and therefore

s(Ω(S, lAr)) = P(lar) / E(a)    (7)
Now consider lexical rules. Consider some production A → d in the grammar, where a characterizes A. Consider some l, r ∈ a⊲. Since a is an anchor of A, we know that s(Ω(S, lAr)) > 0, and therefore P(ldr) > 0. Clearly

P(ldr) ≥ s(Ω(S, lAr))θ(A → d)    (8)

since the probability on the left-hand side is a sum over the scores of many possible derivations, and the right-hand side is a sum over a subset of those derivations. Therefore:

θ(A → d) ≤ P(ldr) / s(Ω(S, lAr))

Now using Equation 7, we obtain

θ(A → d) ≤ E(a) P(ldr) / P(lar)

Because this is true for all l, r ∈ a⊲ we have

θ(A → d) / E(d) ≤ (E(a) / E(d)) inf_{(l,r) ∈ a⊲} P(ldr) / P(lar) = e^{−ρ(a→d)}
The same argument goes through for the binary rules. Suppose we have nonterminals A, B, C, characterized by a, b, c respectively, and a production A → BC with parameter θ(A → BC). Let l, r be some context in a⊲; then P(lar) > 0 and P(lbcr) > 0. Clearly

P(lbcr) ≥ s(Ω(S, lAr))θ(A → BC)θ(B → b)θ(C → c)    (9)

Therefore θ(A → BC) is smaller than or equal to

P(lbcr) / (s(Ω(S, lAr))θ(B → b)θ(C → c)).

Using Equation 6 twice, and Equation 7, we get

θ(A → BC) ≤ (E(bc) / (E(b)E(c))) (E(a) / E(bc)) P(lbcr) / P(lar)

Again, because this is true for all l, r ∈ a⊲, we have

θ(A → BC) ≤ (E(bc) / (E(b)E(c))) e^{−ρ(a→bc)}
This shows us that we have an upper bound on
the parameters from a distributional property. But
looking at Equations 8 and 9, we can consider the
circumstances under which this inequality will be
tight, in which case we can recover the parameters
directly.
In particular, if the grammar is unambiguous
(i.e., if every string has at most one derivation
tree) then if the left-hand side of the inequality is
nonzero we can immediately see that the inequality
will become an equality. As it happens, there
will also be equality under some much weaker
conditions that we now define.
3.3 Ambiguity
We now define two closely related conditions that
are both related to the degree of ambiguity of the
grammar.
Condition 3.2. Suppose a CFG G contains a production A → α. We say that G has an unambiguous context for that production if there is a string w and strings l, u, r such that w = lur, Ω(G, S, w) is nonempty and

Ω(G, S, w) = Ω(G, S, lAr) ⊗ Ω(G, A, u)

and all elements of Ω(G, A, u) have an occurrence of A → α at the root. A CFG is locally unambiguous if it has an unambiguous context for every production in its set of productions.

Informally this condition says that for every production there is some string which, although it can be ambiguous, always uses that production at the same point. Note that if G is locally unambiguous and is anchored, then for every binary production [[a]] → [[b]][[c]] there will be a context l, r such that lbcr satisfies the condition; and for every production [[a]] → b there will be a context l, r such that lbr satisfies the condition.

If a grammar is unambiguous, then every context is an unambiguous context for every derivation that uses it, but this condition is much weaker than that; indeed, we don't need there to be any unambiguous strings, since Ω(G, S, lAr) can have more than one element.

Lemma 3.1. If G; θ is a bottom–up WCFG and G is anchored and is locally unambiguous, then if [[a]] → b ∈ P,

θ([[a]] → b) = E(b) e^{−ρ(a→b)}

and if [[a]] → [[b]][[c]] ∈ P,

θ([[a]] → [[b]][[c]]) = (E(bc) / (E(b)E(c))) e^{−ρ(a→bc)}

Proof. If we have a production [[a]] → [[b]][[c]] in the grammar, we know there is a context such that Ω(S, lwr) = Ω(S, l[[a]]r) ⊗ Ω([[a]], w), where all the elements of Ω([[a]], w) have an occurrence of [[a]] → [[b]][[c]] at the root. Because we know that Ω([[a]], bc) consists of a single tree using [[a]] → [[b]][[c]], and Ω(S, l[[b]][[c]]r) = Ω(S, l[[a]]r) ⊗ Ω([[a]], [[b]][[c]]), therefore Ω(S, lbcr) = Ω(S, l[[a]]r) ⊗ Ω([[a]], bc). Now we apply the same manipulations to get that for this l, r

θ([[a]] → [[b]][[c]]) = (E(a) / (E(b)E(c))) P(lbcr) / P(lar)

and therefore

θ([[a]] → [[b]][[c]]) = (E(bc) / (E(b)E(c))) e^{−ρ(a→bc)}.

The argument for lexical rules is analogous.

We can understand this better by taking the log:

log θ([[a]] → [[b]][[c]]) = log (E(bc) / (E(b)E(c))) − ρ(a → bc)    (10)

The natural parameter is then the sum of two terms: The first is just the pointwise mutual information (Church and Hanks, 1990) between b and c.³ The second term penalizes cases where the right-hand side is distributionally dissimilar from the left-hand side. For the lexical productions, similarly we have two terms:

log θ([[a]] → b) = log E(b) − ρ(a → b)    (11)

3.4 Upward Monotonicity

We need one more condition, however. There may be many different grammars that define the same distribution over strings that satisfy these two conditions, because we may have multiple nonterminals that could be merged together.

Condition 3.3. A grammar G = ⟨Σ, V, S, P⟩ is strictly upward monotonic if for all Q ⊃ P, L(⟨Σ, V, S, Q⟩) ⊃ L(G). (Here Q is restricted to CNF productions over V × (Σ ∪ V²).)

3With an adjustment of log E(|w|) because they are expectations and not probabilities.
Informally, if we add a new production to the grammar, then the language defined increases. Note that of course all grammars have the property that if Q ⊇ P, then L(⟨Σ, V, S, Q⟩) ⊇ L(G). Here we require this monotonicity to be strict.

We define the set of derivation contexts of a nonterminal A to be

C(G, A) = {(l, r) : Ω(G, S, lAr) ≠ ∅}.
Lemma 3.2. Suppose G is anchored and upward monotonic: If A, B are nonterminals and C(G, A) = C(G, B) then A = B.

Proof. Let a be an anchor for A; we can clearly add the production B → a without increasing the language generated. Therefore, B → a is in the grammar, and so A = B as a is an anchor.
Lemma 3.3. Suppose G is anchored and upward monotonic: Then

[[a]] → b ∈ P iff a⊲ ⊆ b⊲

and

[[a]] → [[b]][[c]] ∈ P iff a⊲ ⊆ (bc)⊲
Using the same condition we can show that productions not in the grammar will have parameters zero, because of an infinite divergence term.

Lemma 3.4. Suppose G is anchored and upward monotonic. Then:

• If [[a]] → b is not in the grammar, then ρ(a → b) = ∞.

• If [[a]] → [[b]][[c]] is not in the grammar, then ρ(a → bc) = ∞.

Proof. If [[a]] → b is not in the grammar, then by Lemma 3.3, there is some l, r such that lar is in the language but lbr is not in the language, and so ρ(a → b) = ∞. Similarly for binary rules.
3.5 Selecting Nonterminals
The preceding discussion shows that if we have
a set of terminals that are anchors for the true
nonterminals in the original grammar, then the
productions and the (bottom–up) parameters of
the associated productions will be fixed correctly,
but it says nothing about parameters that might
be associated to productions that use other
nonterminals. However, it is easy to show that
under these assumptions there can be no other
nonterminals.
Lemma 3.5. Suppose G1 and G2 are anchored
and strictly monotonic, and are weakly equivalent.
Then they are isomorphic, and there is a unique
isomorphism between them.
Proof. Let A be a nonterminal in G1, and let a be an anchor for A. Suppose B → a is some production in G2. Let b be an anchor for B. Therefore a⊲ ⊇ b⊲. By a similar argument there must be a nonterminal C in G1 and a terminal c that anchors C such that b⊲ ⊇ c⊲. But because a⊲ ⊇ c⊲, we must have a production C → a in G1. Since a is an anchor, C = A, and therefore a⊲ = b⊲ = c⊲. Therefore C(G1, A) = C(G2, B).

Let φ then be the CFG-morphism from G1 → G2, defined by φ(A) = A′ iff C(G1, A) = C(G2, A′). This is well defined by Lemma 3.2, and is clearly a bijection. Given this bijection, by Lemma 3.3, they will have the same set of productions, and thus be isomorphic.
3.6 Identifiability
We can now define the classes of grammars
that we are interested in. Let GA be the set
of all trim CFGs that are in Chomsky normal
form, anchored (Condition 3.1), are locally
unambiguous (Condition 3.2), and are strictly
upward monotonic (Condition 3.3).
Let PA be the set of all tight PCFGs with finite expectations, with CFGs in GA, and let WA be the set of all WCFGs in bottom–up form with CFGs in GA.
Theorem 3.2. Suppose G1; θ1 and G2; θ2 are in WA and are stochastically equivalent: in other words, for all w ∈ Σ+, P(w; G1) = P(w; G2). Then G1 is isomorphic to G2, and if φ is the unique such morphism, then for all A → α, θ1(A → α) = θ2(φ(A → α)).

Proof. Because they are stochastically equivalent, the support of their distributions is equal, and thus G1 and G2 are weakly equivalent. Therefore by Lemma 3.5 there is a unique isomorphism between them, φ. By Lemma 3.1 the parameters of corresponding productions must also be equal.
Because there is a bijection between WA and
PA, PA is also identifiable from strings.
4 Naive Estimators
We now analyze the properties of a particular
estimator that we call the naive plugin estimator,
which we will show can learn all grammars in WA
and PA. This approach uses a trivial manner of
estimating the ρ values, and from this we derive
a consistent estimator for the class. This approach
has poor sample complexity but is algorithmically
trivial.
We will need to estimate the ρ divergences
from a sample of strings drawn i.i.d. from the
distribution defined by the grammar. Given a
sample of strings, the most naive approach is
to estimate P(w) and E(a) by the empirical
distribution, to estimate the ratio as the ratio of
these estimates, and to take the supremum over the
frequent contexts of a rather than over the infinite
set a⊲.
We are interested in convergence in probability, which we will write as ˆXN → X as N → ∞; in other words, for any ǫ, δ > 0, there is an n such that for all N > n, with probability greater than 1 − δ, we have |ˆXN − X| < ǫ.

Let w1, . . . , wN be the sample of N strings drawn i.i.d. from a target PCFG, let n(w) be the number of times that w occurs in the sample (as a whole string), and let m(w) be the number of times w occurs as a substring; clearly, ∑_{l,r} n(lwr) = m(w). Define ˆP(w) = n(w)/N to be the empirical probability of w and ˆE(u) = m(u)/N to be the empirical expectation of u. Clearly, for any string w we have ˆP(w) → P(w) and ˆE(w) → E(w) as N → ∞. The naive plugin estimator is given by:

Definition 4.1. For a, b, c ∈ Σ we define

ˆρN(a → bc) = log ( (ˆE(bc) / ˆE(a)) max_{l,r : n(lar) > √N} n(lar) / n(lbcr) )    (12)

And for a, b ∈ Σ we define

ˆρN(a → b) = log ( (ˆE(b) / ˆE(a)) max_{l,r : n(lar) > √N} n(lar) / n(lbr) )    (13)

Note that ˆρN(a → bc) = ∞ if there is some context l, r such that n(lar) > √N and n(lbcr) = 0.

We can show the convergence of the estimators when one side is anchored, starting with the case when the divergence is infinite.

Lemma 4.1. For some G; θ ∈ WA suppose that a is an anchor for a nonterminal A and suppose that for some b ∈ Σ, ρ(a → b) = ∞. Then for every δ > 0, there is an N such that with probability at least 1 − δ, ˆρN(a → b) = ∞. Similarly, if there is a c with ρ(c → a) = ∞, there is an N such that with probability at least 1 − δ, ˆρN(c → a) = ∞.

Lemma 4.2. For some G; θ ∈ WA suppose that a is an anchor for a nonterminal A, b for B, and c for C. If ρ(a → bc) = ∞, then for every δ > 0, there is an N such that with probability at least 1 − δ, ˆρN(a → bc) = ∞.

Proof. If A → BC were in P then ρ(a → bc) would be finite. So A → BC is not in P. By Condition 3.3, there must be some context l∗, r∗ in a⊲ but not in (bc)⊲, and so for sufficiently large N, l∗ar∗ will occur more than √N times.

Lemma 4.3. For some G; θ ∈ WA suppose that a is an anchor for a nonterminal A. Suppose ρ(a → b) is finite; then ˆρN(a → b) → ρ(a → b) as N → ∞.

Lemma 4.4. For some G; θ ∈ WA suppose that a is an anchor for a nonterminal A, b for B, and c for C; if ρ(a → bc) is finite, then ˆρN(a → bc) → ρ(a → bc) as N → ∞.

When ρ is finite the convergence is straightforward, since |{l, r : n(lar) > √N}| ≤ √N and so we can use Chernoff bounds in a standard way.
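A direct, deliberately inefficient sketch of ours (not the released code) of the plugin estimator of Definition 4.1, with sentences represented as tuples of terminals and assuming the left-hand-side terminal a occurs in the sample:

# Sketch of the naive plugin estimator rho_N(a -> v), Equations 12 and 13.
import math
from collections import Counter

def plugin_rho(sample, a, v):
    """a is a single terminal; v is a tuple of terminals (length 1 or 2)."""
    N = len(sample)
    n = Counter(tuple(w) for w in sample)             # whole-string counts n(w)
    def m(u):                                         # substring counts m(u)
        return sum(sum(1 for i in range(len(w) - len(u) + 1)
                       if tuple(w[i:i + len(u)]) == u) for w in sample)
    E_a, E_v = m((a,)) / N, m(v) / N
    best = 0.0
    for w in sample:
        for i, tok in enumerate(w):
            if tok != a:
                continue
            l, r = tuple(w[:i]), tuple(w[i + 1:])
            if n[l + (a,) + r] <= math.sqrt(N):       # only frequent contexts of a
                continue
            denom = n[l + v + r]
            if denom == 0:
                return math.inf
            best = max(best, n[l + (a,) + r] / denom)
    return math.log((E_v / E_a) * best) if best > 0 else math.inf

data = [("a", "b"), ("a", "c"), ("a", "b"), ("d", "b")] * 50
print(plugin_rho(data, "d", ("a",)))  # finite: log 1.5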
4.1 Definition of the Algorithm
We can now define the algorithm, taking as input a sequence of strings w1, . . . , wN and using the trivial plugin estimators ˆρN. The pseudocode is presented in Algorithm A. The algorithm starts by identifying the set of terminals that are anchors, which is illustrated in Figure 1. If a terminal d is not an anchor then there will be some terminal a which is an anchor such that ρ(a → d) < ∞ and ρ(d → a) = ∞; in other words, such that a⊲ ⊂ d⊲. If the ˆρN estimates are infinite iff ρ is infinite, then we can see that Γ will be the set of possible anchors; that is, those terminals that occur on the right-hand side of exactly one production. Clearly, if a and b are anchors for the same nonterminal then ρ(a → b) = ρ(b → a) = 0, and if they are anchors for different nonterminals then ρ(a → b) = ρ(b → a) = ∞, so we can just group them into equivalence classes and pick the most frequent one from each class as the anchor. The start symbol will be anchored by the symbol that occurs most frequently as a whole sentence.
Input: A sequence of strings D = w1, w2, . . . , wN
Output: A WCFG G; θ
1 Compute ˆρN(a → b) for all a, b ∈ Σ;
2 Γ ← {a ∈ Σ | ∀b ∈ Σ, ˆρN(a → b) < ∞ ∨ ˆρN(b → a) = ∞};
3 Define the equivalence relation on Γ given by a ∼ b iff ˆρN(a → b) < ∞ and ˆρN(b → a) < ∞. Let ∆ be the set formed by picking the terminal a with maximal m(a) from each equivalence class in Γ/∼;
4 V ← {[[a]] | a ∈ ∆};
5 s ← arg max{n(a) | a ∈ ∆};
6 PL ← {[[a]] → b | a ∈ ∆, b ∈ Σ, ˆρN(a → b) < ∞};
7 Compute ˆρN(a → bc) for all a, b, c ∈ ∆;
8 PB ← {[[a]] → [[b]][[c]] | a, b, c ∈ ∆, ˆρN(a → bc) < ∞};
9 G ← ⟨Σ, V, [[s]], PL ∪ PB⟩;
10 θ([[a]] → b) ← e^{−ˆρN(a→b)} ˆE(b);
11 θ([[a]] → [[b]][[c]]) ← e^{−ˆρN(a→bc)} ˆE(bc) / (ˆE(b)ˆE(c));
12 return G; θ

Algorithm A: WCFG learner.
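The selection and parameter-setting steps of Algorithm A (Lines 2–11) might look as follows in a sketch of ours that assumes the ˆρN estimates, empirical expectations, and counts have already been computed (for example with a routine like the plugin sketch above); the dict-based inputs and all names are assumptions, not the paper's code:

# Sketch: anchor selection and parameter setting given precomputed estimates.
# rho_lex[(a, b)] ~ rho_N(a -> b), rho_bin[(a, b, c)] ~ rho_N(a -> bc),
# E_hat[u] empirical expectations, n_hat[a] whole-sentence counts,
# m_hat[a] substring counts. Nonterminals are identified with their anchors [[a]].
import math

def learn_wcfg(sigma, rho_lex, rho_bin, E_hat, n_hat, m_hat):
    inf = math.inf
    # Line 2: candidate anchors
    gamma = [a for a in sigma
             if all(rho_lex[(a, b)] < inf or rho_lex[(b, a)] == inf for b in sigma)]
    # Line 3: one representative per equivalence class, the most frequent one
    delta = []
    for a in gamma:
        cls = [b for b in delta if rho_lex[(a, b)] < inf and rho_lex[(b, a)] < inf]
        if not cls:
            delta.append(a)
        elif m_hat[a] > m_hat[cls[0]]:
            delta[delta.index(cls[0])] = a
    start = max(delta, key=lambda a: n_hat[a])        # Line 5
    theta = {}
    for a in delta:                                   # Lines 6 and 10
        for b in sigma:
            if rho_lex[(a, b)] < inf:
                theta[(a, b)] = math.exp(-rho_lex[(a, b)]) * E_hat[(b,)]
    for a in delta:                                   # Lines 8 and 11
        for b in delta:
            for c in delta:
                r = rho_bin[(a, b, c)]
                if r < inf:
                    theta[(a, b, c)] = (math.exp(-r) * E_hat[(b, c)]
                                        / (E_hat[(b,)] * E_hat[(c,)]))
    return delta, start, theta

Taking the estimates as inputs keeps the sketch independent of how the counts are gathered; in the naive algorithm they come from whole-sentence and substring counts as in Definition 4.1.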
We can now prove that this algorithm is a consistent estimator for the class of WCFGs that we consider, WA.

Theorem 4.1. For every grammar G∗; θ∗ ∈ WA, for every ǫ, δ > 0, there is an n such that when Algorithm A is run on a sample of N strings, N > n, generated i.i.d. from G∗; θ∗, it produces a WCFG G; θ such that with probability at least 1 − δ:

• G∗ is CFG-isomorphic to G, and if φ is an isomorphism from G∗ to G,

• |θ∗(A → α) − θ(φ(A → α))| < ǫ

Proof. (Sketch) Assume first that N is sufficiently large that ˆρN(a → b) is close to ρ(a → b) for all a, b such that either a or b is an anchor; we can then show that Γ in Line 2 is just the set of possible anchors, and a ∼ b will be true iff a, b are anchors for the same nonterminal. We define a bijection between the nonterminals of the hypothesis and the target. Line 5 picks the start symbol to be the unique anchor that can occur in a length 1 string. The grammar will have the right productions via Lemma 3.3, and the parameters will converge via Lemmas 4.3 and 4.4.

The output of this is a WCFG that may be divergent: we therefore define Algorithm B, which uses the inside-outside (IO) algorithm (Eisner, 2016) to normalize the WCFG produced by Algorithm A; we take the output WCFG and run one iteration of the IO algorithm on the same data to estimate the expectations of all the rules, which are then normalized to produce a PCFG. Proving the convergence of this estimator requires a little bit of care. Chi (1999) shows that the result of this procedure will always be a tight PCFG; the finite expectation of |τ| allows us to apply a variant of the dominated convergence theorem combined with the law of large numbers to show that this is a consistent estimator for the class of grammars PA.
Figure 1: Diagram showing the terminal selection algorithm for a grammar with three nonterminals with anchors a, b, c. This diagram represents the space of context distributions: all terminals have a context distribution in the convex hull of the anchors. d ∉ Γ because ρ(a → d) < ∞ but ρ(d → a) = ∞, and it is therefore in the interior of the convex hull.
5 Experiments

The contributions of this paper are primarily theoretical, but the reader may have legitimate concerns about the practicality of this approach given the naive estimator, the assumptions that are required, and the asymptotic nature of the correctness result. Here we present some computational simulations that address these issues, using synthetic PCFGs that mimic to a certain
extent the observable properties of child-directed
speech (Pearl and Sprouse, 2012). We generate
CFGs that have 10 nonterminals, 1,000 terminal
symbols, and all possible rules in CNF; none of
these grammars are in GA. To obtain a PCFG, we
sample the parameters for the binary productions
and an extra parameter for the lexical rules from a
symmetric Dirichlet distribution with parameter α,
which we vary to control the degree of ambiguity
of the grammar. We then train these parameters
using the IO algorithm to get a distribution of
lengths close to a zero-truncated Poisson with
parameter 5. We then sample the conditional
lexical parameters from a multivariate log normal
distribution with σ = 5.⁴
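A rough sketch of ours of the grammar-sampling step just described follows; it draws the binary rule probabilities and a single lexical mass from a symmetric Dirichlet, but omits the inside-outside length calibration and replaces the log-normal lexical distribution with a uniform split, so it is only in the spirit of the setup.

# Sketch: sample a dense CNF PCFG with Dirichlet-distributed rule probabilities.
import numpy as np

def sample_pcfg(n_nonterminals=10, n_terminals=1000, alpha=0.01, seed=0):
    rng = np.random.default_rng(seed)
    nts = [f"N{i}" for i in range(n_nonterminals)]
    terms = [f"w{i}" for i in range(n_terminals)]
    rules = {}
    for A in nts:
        # one Dirichlet draw over all binary rules plus a single lexical slot
        probs = rng.dirichlet([alpha] * (n_nonterminals ** 2 + 1))
        lex_mass, bin_probs = probs[-1], probs[:-1]
        k = 0
        for B in nts:
            for C in nts:
                rules[(A, B, C)] = bin_probs[k]
                k += 1
        # spread the lexical mass uniformly here (the paper uses a log-normal)
        for a in terms:
            rules[(A, a)] = lex_mass / n_terminals
    return nts, terms, rules

nts, terms, rules = sample_pcfg()
print(sum(p for r, p in rules.items() if r[0] == "N0"))  # ~1.0 for each nonterminal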
To obtain a practical algorithm we follow
Stratos et al. (2016). We consider only the local
context—the immediate preceding and following
word including a distinguished sentence boundary
marker—and use Ney-Essen clustering (Ney et al.,
1994) with 20 clusters to get a low-dimensional
feature space. We give the learning algorithm the
true number of nonterminals as a hyperparameter
(in contrast to Algorithm A, which learns the
number of nonterminals) and run the NMF
algorithm of Stratos et al. (2016) to find the
anchors, considering only those that occur at
least 1,000 times. We set the lexical parameters
using the Frank-Wolfe algorithm, and the binary
parameters using the Rényi divergence with α = 5. To alleviate data sparsity when estimating
the distribution of the anchor bigrams when
computing the binary rule parameters, we use all
bigrams consisting of words that have probability
at least 0.9 of being derived from the respective
nonterminal. This produces a WCFG (A) which
may be divergent. We then run one iteration of
the IO algorithm5 to obtain a PCFG (B), and then
a further 10 iterations to get another PCFG (C);
this is guaranteed to increase the likelihood of the
model; if the PCFG B is sufficiently close to the
target then this will converge towards the global
optimum, the ML estimate; if not it will only
converge to a local optimum.
For efficiency reasons we only run the IO
algorithm on sentences of length at most 10; and
we evaluate on lengths up to 20. The performance
continues to improve with further iterations.
4This gives a Zipfian long-tailed distribution. We
experimented also with a truncation of a Pitman Yor process
with similar results.
5We are grateful to Mark Johnson for his efficient C
implementation.
Figure 2: Box and whisker plot showing labeled exact
match for 100 grammars sampled with α = 0.01. We
compare algorithms A, B, and C against gold (the
target PCFG) and ML (the maximum likelihood PCFG
learned by supervised learning from the training data).
Figure 3: Box and whisker plot showing unlabeled
accuracy. We add trivial baselines of left and right
branching and random trees. 100 grammars sampled
with α = 0.01.
5.1 Results
After fixing the hyperparameters, we generate 100
different PCFGs for each condition, and sample
10⁶ sentences from each. We evaluate the results
according to how well they recover the true tree
structures. We sample 1,000 trees from the target
PCFG and evaluate the Viterbi parse of the yield
of the tree using labeled exact match in Figure 2
and micro-averaged unlabeled precision/recall in
Figure 3.⁶ In all cases we exclude all forced choices, so it is possible to score zero. The
performance of the original grammar is a measure
of the ambiguity of the grammar.
To see the effect of varying the degree of ambiguity, Figure 4 plots unlabeled exact match against the supervised baseline for values of α ∈ {0.01, 0.1, 1.0}. For α = 1 both are close to

6Because both trees are binary, precision is equal to recall.
Figure 4: Scatter plot showing unlabeled exact match, with the x-axis showing the ML model and the y-axis showing Algorithm C, for three different values of the Dirichlet hyperparameter for the binary rules, α = 0.01, 0.1, and 1.0. The diagonal line is the theoretical upper bound.

the random baseline; apart from that extreme case we find the performance degrading smoothly as predicted by theory. The labeled exact match (not shown here) in contrast shows a more pronounced decrease.

These grammars are about an order of magnitude smaller than plausible natural language grammars for child-directed speech as derived from the treebank in Pearl and Sprouse (2012), but this is largely for resource limitations, because whereas Algorithm A is very fast, the IO algorithm is computationally expensive, and running these experiments on hundreds of synthetic grammars/languages at a time would be prohibitively expensive. It is certainly computationally feasible to run these experiments on single grammars with up to 100 nonterminals and 20,000 terminals. In small-scale experiments the results appear comparable with those we report here. The major failure mode is when there are nonterminals A where ∑_a E(A → a) is very small. In those cases, though the grammar may be technically anchored, the anchors will be below the frequency threshold being considered.7

6 Applicability to Natural Language Corpora

An important question is whether this approach is directly applicable to natural language corpora, either of transcribed child-directed speech or of text; a number of the assumptions we make are clearly false. First, even looking at English, we can see that the anchoring assumption is too strong. For example, the expletive pronouns in English, there and it, are both ambiguous, since there is also an adverb and it is also a personal pronoun, and so if there is a nonterminal representing such pronouns, then it will not be anchored.

When we consider phrasal categories, the question of whether such nonterminals are anchored requires asking two questions: first, whether such nonterminals generate single words at all, and secondly whether among those words we can find anchors. The existence of pro-forms, such as pronouns in the case of noun phrases, guarantees this for at least some categories. Clearly, this is genre-dependent, because it is sensitive to sentence length. Here we look at the Adam corpus of child-directed speech in English as syntactically annotated in the Penn treebank style by Pearl and Sprouse (2012). Table 1 shows the results. We can see that nonclausal categories are mostly anchored at this crude level of analysis, but that clausal categories are not. This implies that simple sentences without embedded clauses can be learned using this approach, but that learning complex clausal structures will require this approach to be extended at least to anchors of length more than one.

Most fundamentally, simple PCFGs of the type that we consider here are very poor models of natural language syntax. In order to obtain reasonable results, such grammars need to be lexicalized, because otherwise the independence assumptions of the PCFG are violated because of semantic relations, for example, between a verb and its subject. Thus the realizability assumption the approach relies on is dramatically false.
7 Discussion and Conclusion
There are two ways of thinking about PCFGs:
one is as a nontrivial CFG with parameters
attached, where the support of the distribution is
the language generated by the CFG, and the other
is where the CFG is trivial, containing all possible
productions, and where the support is the set of all
strings; we can call these sparse and dense PCFGs,
respectively. Hsu et al. (2013) show that in the
dense case the class of PCFGs is not identifiable
without additional constraints, even when one
can exclude a set of grammars of measure
7Full code for reproducing these experiments is available
at https://github.com/alexc17/locallearner.
t         P(l = 1)   wmax      P(t | wmax)
ADJP      0.67       careful   0.85
ADVP      0.84       already   1.0
FRAG      0.3        seal      0.2
INTJ      0.87       hmm       1.0
NP        0.7        he        1.0
PP        0.078      for       0.13
PRT       0.99       off       0.72
S         0.017      -         -
SBAR      0.0046     if        0.0024
SBARQ     0.0        -         -
SQ        0.021      -         -
VP        0.11       crying    0.82
WHADVP    0.98       when      1.0
WHNP      0.8        who       0.95

Table 1: Phrasal categories from the corpus of child-directed speech in Pearl and Sprouse (2012), showing the proportion of length-1 yields, the best anchor with frequency at least 10, and the proportion of tokens of that word that occur as a yield of that tag.
zero.8 The class of sparse PCFGs we consider,
PA, has measure zero in their framework, and
thus there is no incompatibility between their
result and Theorem 3.2. However, there is some
incompatibility between the empirical results in
Section 5 and Hsu et al. (2013)’s result. With
the protocol used in Section 5 we are indeed
trying to learn a nonidentifiable class because the
PCFGs are dense. However, the grammars are
approximately anchored in the sense that for each
nonterminal A there is a terminal a such that
E(A → a) is very close to E(a). In these cases, even though there are different parameter settings that give rise to the same distribution over strings, they will all be quite close to each other.
There have been many different attempts to
solve this problem over the decades since the
learning problem was initially introduced by
Horning (1969); a useful survey of older work
on learning CFGs is contained in Lee (1996). One
strand of research looks at using the IO algorithm
to train some heuristically initialized grammar
(Baker, 1979; Lari and Young, 1990; Pereira and
Schabes, 1992; de Marcken, 1999). However, this approach is only guaranteed to converge to a local maximum of the likelihood, and does not work well in practice. A related problem that we do not discuss in this paper is learning when the labeled tree structures are observed—essentially that of estimating a PCFG from a treebank, a problem which is algorithmically trivial and statistically well behaved, as Cohen and Smith (2012) show.

8For technical reasons they consider only grammars where all probability mass is evenly distributed over all possible binary trees of a given length, and which are as a result highly ambiguous.
The approach we take is most closely related to
the work by Stratos et al. (2016) and work on
weakly learning CFGs from samples generated
by PCFGs developed by Shibata and Yoshinaka
(2016). However, there are very few approaches
to learning PCFGs with any nontrivial theoretical
guarantees.
The approach here is essentially an exemplar-based model: the syntactic categories are based on single strings of length 1. This can be naturally extended, mutatis mutandis, to sets of exemplars, and to exemplars with length greater than 1. The extension beyond CFGs to mildly context-sensitive grammars such as MCFGs (Seki et al., 1991) seems to present some problems that do not occur in the nonprobabilistic case (Clark and Yoshinaka, 2016); although the same bounds on the bottom–up parameters can be derived, identifying the set of anchors seems to be challenging.
The variant of Algorithm A discussed in Section 5 is also interesting because it only uses local information in the initial phase: indeed, it only uses the bigram and trigram counts, and it is only in the use of the IO algorithm that a pass through the data using the full sentence is used; this is compatible with psycholinguistic evidence about infants’ abilities to track transitional probabilities (e.g., work following Saffran et al., 1996). Of course the original version in Section 4 uses complete sentences and not just the low-order counts.
Note that Equation 10 provides some theoret-
ical justification for the long literature (Harris,
1955; McCauley and Christiansen, 2019) on using
mutual information as a heuristic for unsupervised
chunking. Although it is intuitively reasonable
that chunks should correspond to subsequences
that have high pointwise mutual information, it
is gratifying to finally have some mathematical
basis for these intuitions.
Acknowledgments
This work was partially carried out while the first
author was a visiting researcher at The Alan Turing
Institute. The second author was supported by The
Alan Turing Institute under the EPSRC grant
EP/N510129/1 and the DeLTA project (ANR-16-
CE40-0007). We would like to thank the reviewers
for helpful comments that have improved the
paper.
References
Omri Abend, Tom Kwiatkowski, Nathaniel J.
Smith, Sharon Goldwater, and Mark Steedman.
2017. Bootstrapping language acquisition. Cog-
nition, 164:116–143.
Pieter Adriaans. 1999. Learning shallow context-
free languages under simple distributions.
Technical Report ILLC Report PP-1999-13,
Institute for Logic, Language and Computation,
Amsterdam.
James K. Baker. 1979. Trainable grammars for
speech recognition. In Speech Communication
Papers for the 97th Meeting of the Acoustic
Society of America, pages 547–550.
Zhiyi Chi. 1999. Statistical properties of proba-
bilistic context-free grammars. Computational
Linguistics, 25(1):131–160.
Kenneth Ward Church and Patrick Hanks. 1990.
Word association norms, mutual information,
and lexicography. Computational Linguistics,
16(1):22–29.
Alexander Clark and Ryo Yoshinaka. 2016. Distri-
butional learning of context-free and multiple
context-free grammars. In Jeffrey Heinz and
José M. Sempere, editors, Topics in Grammat-
ical Inference, pages 143–172, Springer Berlin
Heidelberg, Berlin, Heidelberg.
Shay B. Cohen and Noah A. Smith. 2012.
Empirical risk minimization for probabilistic
grammars: Sample complexity and hardness
of learning. Computational Linguistics, 38(3):
479–526.
François Denis, Aurélien Lemay, and Alain
Terlutte. 2004. Learning regular
languages
using RFSAs. Theoretical Computer Science,
313(2):267–294.
Jason Eisner. 2016. Inside-outside and forward-
backward algorithms are just backprop (tutorial
paper). In Proceedings of the Workshop on
Structured Prediction for NLP, pages 1–17.
Kousha Etessami, Alistair Stewart, and Mihalis
Yannakakis. 2012. Polynomial time algorithms
for multi-type branching processes and stochas-
tic context-free grammars. In Proceedings of
the Forty-Fourth Annual ACM Symposium on
Theory of Computing, pages 579–588. ACM.
Jess Gropen, Steven Pinker, Michelle Hollander,
and Richard Goldberg. 1991. Affectedness and
direct objects: The role of lexical semantics
in the acquisition of verb argument structure.
Cognition, 41(1):153–195.
Zellig Harris. 1955. From phonemes to mor-
phemes. Language, 31:190–222.
James Jay Horning. 1969. A Study of Grammatical Inference. Ph.D. thesis, Computer Science Department, Stanford University.
Daniel Hsu, Sham M. Kakade, and Percy Liang.
2013. Identifiability and unmixing of latent
parse trees. In Advances in Neural Information
Processing Systems (NIPS), pages 1520–1528.
Sandra E. Hutchins. 1972. Moments of string and
derivation lengths of stochastic context-free
grammars. Information Sciences, 4(2):179–191.
Karim Lari and Stephen J. Young. 1990. The
estimation of stochastic context-free grammars
using the inside-outside algorithm. Computer
Speech and Language, 4:35–56.
Lillian Lee. 1996. Learning of context-free lang-
uages: A survey of the literature. Technical
Report TR-12-96, Center
for Research in
Computing Technology, Harvard University.
Carl G. de Marcken. 1999. On the unsupervised
induction of phrase-structure grammars. In Nat-
ural Language Processing Using Very Large
Corpora, pages 191–208. Kluwer.
Stewart M. McCauley
and Morten H.
Christiansen. 2019. Language learning as lan-
guage use: A cross-linguistic model of child
language development. Psychological Review,
126(1):1.
Mark-Jan Nederhof and Giorgio Satta. 2008. Com-
puting partition functions of PCFGs. Research
on Language and Computation, 6(2):139–162.
Hermann Ney, Ute Essen, and Reinhard Kneser.
1994. On structuring probabilistic dependencies
in stochastic language modelling. Computer
Speech and Language, 8:1–38.
Lisa Pearl and Jon Sprouse. 2012. Computational
models of acquisition for islands. In J. Sprouse
and N. Hornstein, editors, Experimental Syntax
and Island Effects. Cambridge University Press,
Cambridge, UK.
Fernando Pereira and Yves Schabes. 1992. Inside-
outside reestimation from partially bracketed
corpora. In Proceedings of the 30th Annual
Meeting of the Association for Computational
Linguistics, pages 128–135.
Alfréd Rényi. 1961. On measures of entropy
and information. In Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics
and Probability, Volume 1: Contributions to
the Theory of Statistics. The Regents of the
University of California.
Jenny R. Saffran, Richard N. Aslin, and Elissa L.
Newport. 1996. Statistical learning by eight
month old infants. Science, 274:1926–1928.
Hiroyuki Seki, Takashi Matsumura, Mamoru
Fujii, and Tadao Kasami. 1991. On multiple
context-free grammars. Theoretical Computer
Science, 88(2):229.
Chihiro Shibata and Ryo Yoshinaka. 2016.
Probabilistic learnability of context-free gram-
mars with basic distributional properties from
positive examples. Theoretical Computer Sci-
ence, 620:46–72.
Noah A. Smith and Mark Johnson. 2007.
Weighted and probabilistic context-free gram-
mars are equally expressive. Computational
Linguistics, 33(4):477–491.
Karl Stratos, Michael Collins, and Daniel Hsu.
2016. Unsupervised part-of-speech tagging
with anchor hidden Markov models. Trans-
actions of the Association for Computational
Linguistics, 4:245–257.