Time-and-Space-Efficient Weighted Deduction

Time-and-Space-Efficient Weighted Deduction

Jason Eisner
Department of Computer Science
Johns Hopkins University
jason@cs.jhu.edu

Abstract

Many NLP algorithms have been described
in terms of deduction systems. Unweighted
deduction allows a generic forward-chaining
execution strategy. For weighted deduction,
however, efficient execution should propa-
gate the weight of each item only after it has
converged. This means visiting the items
in topologically sorted order (as in dynamic
programming). Toposorting is fast on a ma-
terialized graph; unfortunately, materializing
the graph would take extra space. Is there a
generic weighted deduction strategy which,
for every acyclic deduction system and every
input, uses only a constant factor more time
and space than generic unweighted deduc-
tion? After reviewing past strategies, we an-
swer this question in the affirmative by com-
bining ideas of Goodman (1999) and Kahn
(1962). We also give an extension to cyclic
deduction systems, based on Tarjan (1972).

1 Introduction

Many NLP algorithms have been described in terms
of deduction systems, starting with the seminal pa-
per “Parsing as Deduction” (Pereira and Warren,
1983). In general, deduction systems are an abstrac-
tion of dynamic computation graphs that can be
used to describe the structure of many algorithms,
including theorem provers, dynamic programming
algorithms, and structured neural networks (Sikkel,
1993; Eisner and Filardo, 2011).

Unweighted deduction admits a generic execu-
tion strategy known as “forward chaining.” Fur-
thermore, the behavior of this strategy is well-
understood. Its runtime and space can be easily
bounded based on a simple inspection of the deduc-
tion system (McAllester, 2002). Static analysis can
sometimes find tighter bounds (Vieira et al., 2022).
For weighted deduction, however, execution
should wait to propagate a derived item’s weight
until that weight has converged. Ideally it would

960

visit the items in topologically sorted order (i.e.,
dynamic programming), which is not required for
the unweighted case. Fortunately, toposorting is
fast. Thus, is there a generic weighted deduction
strategy which, for every acyclic deduction system
and every input, uses only a constant factor more
time and space than generic unweighted deduction?
In this paper, we provide this missing master
strategy, which is surprisingly simple. It establishes
a “don’t worry, be happy” meta-theorem (McFerrin,
1988): Asymptotic analyses of unweighted acyclic
deduction systems do transfer to the weighted case.
Past methods do not achieve this guarantee. For
some deduction systems, static analysis may be
able to identify a topological ordering. But even
then, one is left with increased runtime, either from
visiting all possible items (which fails to exploit the
sparsity of the set of items actually derived from a
particular input) or visiting only the items that have
been derived (which exploits sparsity, but requires
a priority queue that incurs logarithmic overhead).
The alternative is dynamic analysis: given the
input, first identify all the items that can be derived,
using unweighted deduction, and then toposort
them in order to compute the weights. Goodman
(1999) suggested a version of this approach that ma-
terializes the graph of dependencies among items,
but acknowledged that this generally increases the
asymptotic space requirements. In this paper, we
show how to avoid the space increase. We give a
practical two-pass weighted deduction algorithm,
inspired by Kahn (1962), that uses parent counting
to efficiently enumerate the derived items in topo-
logically sorted order. It stores the counts of edges
but no longer stores the edges themselves.

Also, for the case where the graph may have cy-
cles, we give a practical three-pass algorithm. The
first two passes efficiently enumerate the strongly
connected components in topologically sorted or-
der (Tarjan, 1972), with the same time and space
guarantees as above. The third pass solves for the
weights in each cyclic component, which may in-

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Transactions

of
Action

the
Editor:
c(cid:3)

Computational

Association
G
Carlos
ó
Association

for
guez.
mez-Rodr
í
Computational

2023

for

Submission

Linguistics,

vol.

11,
batch: 2/2023;

pp.

Linguistics.

Distributed

Revision
under

batch: 5/2023;
4.0
CC-BY
a

Published
license.

960–973,

2023.

https://doi.org/10.1162/tacl

00588

a
8/2023.

crease the asymptotic runtime if cycles exist.

As an application of our two- and three-pass
algorithms, consider extending Earley parsing (Ear-
ley, 1970; Graham et al., 1980) to probabilistic
(Stolcke, 1995) or semiring-weighted (Goodman,
1999) context-free grammars. Opedal et al. (2023)
present Earley-style deduction systems whose un-
weighted versions are easily seen to be executable
in O(n3|G|) time and O(n2|G|) space on a sen-
tence of length n and a grammar of size |G|. These
bounds are tight when every substring of the sen-
tence can be analyzed, in its left context, as every
sort of constituent or partial constituent. In practi-
cal settings, however, the parse chart is often much
sparser than this worst case and execution is cor-
respondingly faster. For instance, Earley (1970)
describes “bounded-state grammars” where pars-
ing is guaranteed to require only O(n|G|) time and
O(n|G|) space. Opedal et al. (2023, Appendix H.1)
simply invoked the present paper (“don’t worry”) to
offer all the same guarantees for weighted parsing.
In contrast, the various past methods would have
been slower by a log factor, or consumed Θ(n3|G|)
space for some grammars, or consumed Θ(n3|G|)
runtime even for bounded-state grammars.

While the solution seems obvious in retrospect
(at least once the problem is framed), it has some-
how gone unnoticed in the literature. It has also
evaded several hundred graduate and undergrad-
uate students in the author’s NLP class over two
decades. Students each year are asked to design and
implement a Viterbi Earley parser. As extra credit,
they are asked if they can achieve O(n3) runtime
while maintaining correctness. None of them have
ever spotted the Kahn-based solution, though some
have found the other methods mentioned above.

2 Formal Framework

2.1 Deduction Systems
A weighted deduction system P = (V, W, E, ⊕)
consists of

• a possibly infinite set of items, V, which may
be regarded as representing propositions about
the input to the system

• a set of weights, W, where the weight associ-
ated with an item (if any) might be intended to
characterize its truth value, probability, prove-
nance, description, or other properties

v

of which is written in the form
f↢ u1, . . . , uk
where k ≥ 1 is the in-degree of the hy-
peredge,1 v, u1, . . . , uk ∈ V are items, and
f : Wk → W is a combination function

• a function ⊕ that maps each item v ∈ V to an
associative and commutative binary operator
⊕v : W × W → W, called the aggregator
for item v

In short, P is an ordered B-hypergraph (V, E)
where each hyperedge is labeled with a combina-
tion function and each vertex is labeled with an
aggregator.2 We say that P is an acyclic system if
this hypergraph is acyclic. Note that our formalism
is not limited to semiring-weighted systems.

2.2 Evaluation

We now explain how to regard P as an arithmetic
circuit—a type of program that can be applied to
inputs. An input to P is a pair (V, ω) where

• V ⊆ V is a set of axioms
• ω : V → W assigns a weight to each axiom

Evaluating P on this input returns an output pair
( ¯V, ¯ω) where

• the derived items ¯V ⊆ V are those that are
reachable from the axioms V in the hyper-
graph (V, E); equivalently, ¯V is the smallest
set such that
– V ⊆ ¯V
– if (v

f↢ u1, . . . , uk) ∈ E and

u1, . . . , uk ∈ ¯V , then v ∈ ¯V

• for each derived item v ∈ ¯V , let ¯Ev ⊆ E be
the set of all hyperedges used to derive it:

¯Ev = {(v

f↢ u1, . . . , uk) ∈ E :
u1, . . . , uk ∈ ¯V }

1The artificial requirement that k ≥ 1 serves to simplify
f↢
our pseudocode without loss of generality. A hyperedge v
with k = 0 would state that v is an axiom of the deduction
system, with weight f (). It can be replaced by a k = 1 hyper-
f ′
↢ start, where f ′ is the constant function returning
edge v
f (), and start is a special item that is always included (with
arbitrary weight) in the set of input axioms in §2.2 below.

2In a B-hypergraph (Gallo et al., 1993), each hyperedge
pairs a set of input vertices with a single output vertex. In an
ordered B-hypergraph, each hyperedge pairs a sequence of
f↢ u1, u2 is not

input vertices with a single output vertex: v

• a possibly infinite set of hyperedges, E, each

the same as v

f↢ u2, u1.

961

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

• ¯ω : ¯V → W satisfies the following con-

straints: for each v ∈ ¯V \ V ,

word(x,i,k)
phrase(a,i,k)

rewrite(a,x)
rewrite(a,b,c)

¯ω(v) =

(cid:77)

v
f
↢u1,…,uk)∈ ¯Ev

(v

f (¯ω(u1), . . . , ¯ω(uk))

and for each v ∈ V ,

(1a)

and the hyperedges E are all objects of the forms
×↢ rewrite(a,x), word(x,i,k)
×↢ rewrite(a,b,c),

phrase(a,i,k)
phrase(a,i,k)

phrase(b,i,j), phrase(c,j,k)

¯ω(v) = ω(v) ⊕v

(cid:77)

v
f
↢u1,…,uk)∈ ¯Ev

f (¯ω(u1), . . . , ¯ω(uk))
(1b)

(v

Note that ¯ω(v) always has at least one sum-
mand, since ¯Ev = ∅ is not possible in (1a).3

v∈ ¯V

¯V is always uniquely determined, as is ¯E def=
(cid:83)
¯Ev. How about ¯ω? We say that v ∈ ¯V de-
pends on u if ¯Ev contains a hyperedge of the form
f↢ . . . , u, . . .. If ( ¯V, ¯E) is an acyclic hypergraph,
v
meaning that no v ∈ ¯V depends transitively on
itself, then evaluation has a unique solution ( ¯V, ¯ω).
This is guaranteed for acyclic P. In general, how-
ever, there could be multiple functions ¯ω or no func-
tions ¯ω that satisfy the system of constraints (1).

2.3 Special Cases

An unweighted deduction system is very similar
to the above, but where f , ⊕v, and ω are omitted.
Thus, evaluation returns only ¯V and not ¯ω. Equiv-
alently, an unweighted deduction system can be
regarded as the special case where there is only a
single weight: W = {⊤}. Then f, ⊕v, ω, and ¯ω
are trivial constant functions that only return ⊤.

A semiring-weighted deduction system is the
special case where ⊕v and f are fixed throughout
P to be two specific operations ⊕ and ⊗, respec-
tively, such that (W, ⊕, ⊗) forms a semiring. This
case has proved very useful for NLP algorithms
(Goodman, 1999; Eisner and Blatz, 2007; Vieira
et al., 2021).

2.4 Example: Weighted CKY Parsing

We give an acyclic weighted deduction system
P = (V, W, E, ⊕) that corresponds to the inside
algorithm for probabilistic context-free grammars
in Chomsky normal form (Baker, 1979). We take
the weights to be probabilities: W = [0, 1] ⊆ R.
The item set V consists of all objects of the forms

3When v ∈ V , typically ω(v) is the only summand. How-
ever, we have no reason to require that: for generality, our for-
malism and algorithms do allow an axiom to be derived again
from other items, which then contribute to its weight (1b).

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

where x ranges over terminal symbols, a, b, c range
over nonterminal symbols, i, j, k range over the
natural numbers N, and × is the ordinary multipli-
cation function on probabilities. To complete the
specification of P, define ⊕v = + for all v ∈ V.

To run the inside algorithm on a given sentence
under a given grammar, one must provide the input
(V, ω). Encode the sentence x1x2 · · · xn as the n
axioms {word(xi, i − 1, i): 1 ≤ i ≤ n}, each
having weight 1. Encode the grammar with one
rewrite axiom per production, whose weight is
the probability of that production: for example, the
production S → NP VP corresponds to the axiom
rewrite(S,NP,VP).

After evaluation, phrase(a,i,k) ∈ ¯V iff a ⇒∗
in which case ¯ω(phrase(a,i,k))
xi+1 . . . xk,
gives the total probability of all derivations of that
form—that is, the inside probability of nontermi-
nal a over the input span [i, k]. We remark that
furthermore, ¯E is the traditional packed parse for-
est, with the hyperedges in ¯Ev being the traditional
backpointers from item v ∈ ¯V .

A variant of this system obtains the same re-
sults with greater hardware parallelism by tak-
ing the elements of W to be tensors. For each
(i, k) pair, all items phrase(a,i,k) are collapsed
into a single item phrase(i,k) whose weight is
a 1-dimensional tensor (vector) indexed by non-
terminals a. Similarly, all items rewrite(a,b,c)
are collapsed into a single item rewrite_binary
whose weight is a 3-dimensional tensor, and all
items rewrite(a,x) into rewrite_unary whose
weight is a 2-dimensional tensor. The combination
functions and aggregators are modified to operate
over tensors. It is also possible to make them non-
linear to get a neural parser (Drozdov et al., 2019).
Another variant lets the grammar contain unary
nonterminal productions, represented by additional
axioms of the form rewrite(a,x), by adding
×↢
all hyperedges of the form phrase(a,i,k)
rewrite(a,b), phrase(b,i,k). The resulting P
is cyclic iff the grammar has unary production cy-
cles. Even then, for a probabilistic grammar, it
turns out that evaluation always gives a unique ¯ω.

962

2.5 Computational Framework

The definitions above were mathematical. To im-
plement evaluation on a computer, we assume that
the set of input axioms V is finite.4 We also hope
that for a given input, the set of derived items ¯V
will also be finite (even if V is infinite), since oth-
erwise the algorithms in this paper will not termi-
nate.5 Since P itself is typically infinite in order to
handle arbitrary inputs, we assume that its compo-
nents have been specified using code. In particular,
V and W are datatypes, ⊕ is a function, and E is
accessed through iterators described below.

In practice, a deduction system P is usually spec-
ified using finitely many patterns that contain vari-
ables, as was done in §2.4. The patterns that specify
V are called types, and the patterns that specify E
are called deductive rules or sequents. Once these
patterns are written down, the necessary code can
be generated automatically. The most popular for-
mal notation for the unweighted case is the deduc-
tive rules of sequent calculus (Sikkel, 1993; Shieber
et al., 1995). That formalism was generalized by
Goodman (1999) to the semiring-weighted case.
Sato (1995) and Eisner et al. (2005) developed no-
tations based on logic programming languages for
those special cases, and Eisner and Filardo (2011)
then generalized those notations.

Each algorithm in this paper takes (P, V, ω) as
its input. Each algorithm makes use of a chart C, a
data structure that maps keys in V (namely, derived
items) to values in W. If the algorithm terminates,
C then holds the result of evaluating P on (V, ω):
¯V is finite and is given by the keys of C, while
¯ω : ¯V → W is given by the mapping v (cid:55)→ C[v].

Each algorithm also uses an agenda A, which
is a set of items.6 Each derived item is added to A
(“pushed”) when it is added as a key to C for the
first time (or in the case of Algorithm 5, for the last

4To keep V finite, arithmetic facts such as plus(4,1,5)
are generally not included in V (or V). Deductive rules (§2.5)
may consult these built-in facts, but they are not treated as
items in our framework, only as conditions on hyperedges.

5The infinite ¯V case does arise when the deduction system
P is designed to derive provable mathematical theorems, valid
real-world inferences, reachable configurations of a Turing
machine, etc. Computational techniques in this case include
timeout, pruning, query-driven evaluation (§8), and lifted in-
ference. Timeout and pruning involve small modifications
to the algorithms in this paper. The other strategies can be
implemented by automatically transforming P to derive fewer
items (e.g. Beeri and Ramakrishnan, 1991) and leaving our
algorithms unchanged.

6The agenda A and chart C respectively correspond to the

open set and closed set of search algorithms like A*.

time). Later it is removed again (“popped”) and
new items are derived from it. The chart is com-
plete once the agenda is empty. Algorithm 2 allows
a popped item to be pushed again later because its
weight has changed; our other algorithms do not.
In our pseudocode, C and A are global but all
other variables are local (important for recursion).
The chart class provides iterators specific to P:

• C.in(v) yields all hyperedges in E of the form
f↢ u1, . . . , uk such that u1, . . . , uk have

v
already popped from A.

• C.out(u) yields all hyperedges v

f↢
u1, . . . , uk in E such that u1, . . . , uk have al-
ready popped from A and u ∈ {u1, . . . , uk}.
Note that if u appears more than once among
u1, . . . , uk, then the iterator should still only
yield the hyperedge once. This is necessary
f↢ u, u,
to ensure that with the hyperedge v
Algorithm 2 below will increment ¯ω(v) by
f (¯ω(u), ¯ω(u)) only once and not twice.

In practice, these iterators are made efficient via in-
dex data structures, as discussed below. Notice that
A.pop(u) must notify C to add u to these indexes.
The code that is automatically generated for P
implements the item datatype, the chart class in-
cluding iterators, and the agenda class.

Some of the algorithms in this paper require only
out() and not in(). We prefer to avoid in() for
reasons to be discussed in §5.3; see also §8.

In our efficiency analyses, we will sometimes
make the following standard assumptions (but
they are not needed for the meta-theorem of §1):

1. There exists a constant K such that all hyper-

edges in E have in-degree k ≤ K.

2. f and ⊕v can be computed in O(1) time.
3. Storing an item as a chart key, testing whether
it is one, or looking up or modifying its weight
in the chart, can be achieved in O(1) time.

4. The C.out(v) method and (when used) the
C.in(u) method create iterators in O(1) time.
Furthermore, calling such an iterator to get the
next hyperedge (or learn that there is no next
hyperedge) takes O(1) time. Hence iterating
over an item’s m ≥ 0 in- or out-edges can be
done in time proportional to m + 1.

5. A chart mapping N items to their weights
can be stored in O(N ) space, including any
private indexes needed to support the iterators.

963

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

In the author’s experience, these assumptions
can be made to hold for most deduction systems
used in NLP, at least given the Uniform Hashing As-
sumption7 and using the Word RAM model of com-
putation. However, a few tricks may be needed:

• To ensure that out does not double-count hy-
peredges, it may be convenient to transform
the deduction system: for example, replacing

f↢ u, u with v

f↢ u, u′ and u′

id↢ u.

v

• To ensure that each axiom can be stored in
O(1) space, one may have to transform the
deduction system. For instance, rather than
specifying each context-free grammar rule—
of any length—by a single axiom, a long rule
can be binarized so that it is specified by multi-
ple axioms (thus increasing N ), each of which
requires only O(1) space to store.

• Even when individual axioms and their
weights are bounded in size, the derived items
or their weights may be unbounded in size.
However, often the space and time require-
ments above can still be achieved through
structure sharing and hash consing.8

• To ensure the given iterator performance, it
may be necessary to transform the deduction
system, again increasing N .
In particular,
McAllester (2002) applies a transformation to
unweighted deduction systems that are writ-
ten in his notation; Eisner et al. (2005) apply
a very similar approach to weighted systems.
In the transformed systems,9 hyperedges have
in-degree ≤ 2. Hence C.out(u) must find
all previously popped items u′ such that E
contains hyperedges with inputs u, u′ or u′, u.
Moreover, those items can be found with a
constant number of hash lookups, by having
C maintain a constant number of hash tables
(indexes), each of which maps from a possible
property of u′ to a list of all previously popped
items u′ with that property. For instance, for
the weighted CKY system in §2.4, one such
hash table would map each integer j to all
items of the form phrase(b,i,j), if any.

Further implementation details are given in the
papers that were cited throughout the present sec-
tion. We do not dwell on them as they are orthogo-
nal to the concerns of the present paper.

3 Unweighted Forward Chaining

A basic strategy for unweighted deduction is for-
ward chaining (Algorithm 1), well-known in the
Datalog community (Ceri et al., 1990). This is
simply a reachability algorithm: it finds all of the
items that can be reached from the axioms V in the
B-hypergraph (V, E). If the standard assumptions
of §2.5 hold, it runs in time O(| ¯V |+| ¯E|) and space
O(| ¯V |).

The agenda A is a set that supports O(1)-time
push() and pop() methods. push() adds an ele-
ment to the set (our code is careful to never push
duplicates). pop() removes and returns some arbi-
trary element of the set. If pop() is implemented
to use a LIFO or FIFO policy, then the reachability
algorithm will be depth-first or breadth-first, re-
spectively, but any policy (“queue discipline”) will
do provided that ¯V is finite.

Algorithm 1 Unweighted forward chaining

▷ here C ⊆ V is just a set
▷ axioms

1: C ← ∅; A ← ∅
2: for v ∈ V :
3: C.add(v); A.push(v)
4: while A ̸= ∅ :
5:

u ← A.pop()

▷ remove some element

for (v

f↢ u1, . . . , uk) ∈ C.out(u) :

if v /∈ C :

▷ also implies v /∈ A

C.add(v); A.push(v)

6:

7:

8:

7Alternatively, we can use universal hashing. Then the
runtime bounds are bounds on the expected runtime, where
the expectation is over the choice of hash function.

8E.g., for weighted deduction systems written in Dyna
(Eisner et al., 2005), items may be unbounded lists or other
deeply nested Prolog-style terms. Yet structure sharing allows
out() or in() to build a complex item in O(1) time, and hash
consing finds a pointer in O(1) time to the copy of that item (if
any) that is stored as a chart key alongside the item’s weight.
9The transformation introduces intermediate items for par-
tially matched rules (“prefix firings”). These new items are
analogous to “saved states” in the RETE forward-chaining
algorithm for OPS-5 (Forgy, 1979; Perlin, 1990).

Since this is the unweighted case, the chart C
does not need to store weights. It is simply the
set of items derived so far, approaching the desired
¯V .10
In later algorithms, however, C will be a
map from those items to the weights that they have
accumulated so far, approaching the desired ¯ω.

10To guarantee that C actually converges to a possibly infi-
nite ¯V —in the sense that every item in ¯V is eventually added
to C—one must use an agenda policy that avoids starvation.
For example, if the policy periodically selects the least recently
pushed item, then every pushed item is eventually popped.

964

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Notice that since items are seen by the iterators
f↢
only after they have popped, the hyperedge v
u1, . . . , uk is considered only once—when the last
of its input items (not necessarily uk) pops. This
“considered only once” property makes Algorithm 1
more efficient, but is not required for correctness.
For the weighted algorithm below, however, it is
required to prevent double-counting of hyperedges.

4 Weighted Forward Chaining

Algorithm 2 is a first attempt to generalize to the
weighted case, where C becomes a map.

The pseudocode uses the following conventions:
If v is not a key of C, then C[v] returns a special
value ⊥ /∈ W. We define ⊥ ⊕v ω = ω for all ω.
Changing C[v] from ⊥ to a value in W adds v as a
new key to the chart C.

Algorithm 2 Weighted forward chaining

▷ map with keys in V, values in W

Then C[v] can only decrease at line 9, and propa-
gating its new (smaller) value will in effect replace
the old value.11 Separately, there exists a slightly
different family of algorithms which, each time u
pops from the agenda, propagates the difference
between C[u] and the old value of C[u] from the
last time it popped. Such “difference propagation
algorithms” exist for semiring-weighted deduction
systems (Eisner et al., 2005, Fig. 3) and also exist
for weighted deduction systems in which each op-
eration ⊕v has a corresponding operation ⊖v that
can be used to compute differences of values.

The approaches in the previous paragraph are
correct in the special cases where they apply. They
even work for cyclic deduction systems (provided
that they converge). However, they are an ineffi-
cient choice for the acyclic case—the same derived
item in ¯V could pop more than once (in fact far
more), which we refer to as repropagation,12 but
in the toposorted solution, it will pop exactly once.
We therefore focus on toposorting approaches.

▷ axioms

5 Previous Toposorted Strategies

1: C ← ∅
2: A ← ∅
3: for v ∈ V :
4: C[v] ⊕v= ω(v)
5: A.push(v)
6: while A ̸= ∅ :
7:

u ← A.pop()

▷ remove some element

8:

9:

10:

11:

12:

for (v

f↢ u1, . . . , uk) ∈ C.out(u) :
C[v] ⊕v= f (C[u1], . . . , C[uk])
if C[v] changed :

if v has already popped from A : error
if v /∈ A : A.push(v)

If C[v] has changed at line 10, we must have
it on the agenda to ensure that its new value is
propagated forward through the hypergraph. The
error at line 11 occurs when we would propagate
both old and new values, which in general leads
to incorrect results. The major challenge of this
paper is to efficiently avoid this error condition.

In an acyclic system, the “obvious” way to avoid
the error condition is to pop items from the agenda
in topologically sorted (toposorted) order—do not
pop v until every item that v depends on has already
popped. If that can be arranged, then v will only
pop after its value has converged; C[v] will never
change again, so v will never be pushed again.

We remark that toposorting is not the only op-
tion. Line 11 can simply be omitted in the special
case where W = R, ⊕v = min for all v ∈ V,
and all functions f are monotonic on all arguments.

Assume that P is acyclic and that unweighted for-
ward chaining terminates. We discuss previously
proposed strategies that visit the finitely many de-
rived items ¯V in a toposorted order, and explain
why these strategies do not satisfy our goals.

In addition to methods based on Algorithms 1
and 2, we consider algorithms that relax the items
in ¯V in a topologically sorted order. C.relax(v)
recomputes C[v] based on the current state of the
chart C as well as the input (V, ω):

1: C[v] ← if v ∈ V then ω(v) else⊥
f↢ u1, . . . , uk) ∈ C.in(v) :
2: for (v
3: C[v] ⊕v= f (C[u1], . . . , C[uk])

11This special case also applies when W is a lattice, ⊕v is
the meet operation, and f preserves the lattice’s partial order.
12Although for the special case with ⊕v = min that was
just mentioned at the start of the previous paragraph, there is
a well-known way to avoid repropagation even in the cyclic
case, inspired by Dijkstra’s (1959) shortest-path algorithm:
prioritize the agenda so that v with minimum C[v] pops first
(Knuth, 1977; Nederhof, 2003). This suffices provided that
each f is superior, meaning that f is not only monotonic (i.e.,
monotonically non-decreasing in all arguments) but also its
output is ≥ all of its inputs. Under the standard assumptions
of §2.5, a runtime of O(| ¯E| + | ¯V | log | ¯V |) can be achieved
by implementing the agenda as a Fibonacci heap (Fredman
and Tarjan, 1987). Note that the prioritization methods in §5.1
below are quite different: they prioritize based only on v, not
C[v], so they do not depend on any properties of W, ⊕v, or f ,
and do not require the decrease_key method.

965

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

▷ axiom?

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

If we relax every v ∈ ¯V once, after relaxing all
of the items that it depends on, then C will hold
the correct result of evaluation from (1). Note that
it is harmless to relax v /∈ ¯V : this will simply leave
C[v] = ⊥, so it will not be added to the chart.

5.1 Prioritization

One approach to evaluation is to run Algorithm 2
where the agenda A is a priority queue that pops the
derived items in a known toposorted order. This
requires specifying a priority function π such that
if v depends on u, then u pops first because π(u) ≺
π(v), where ≺ is a total ordering of the priorities
returned by π. π could depend on the input V .
It must be manually designed for a given P (or
perhaps in some cases, it could be automatically
constructed via some sort of static analysis of P).
We will assume in our discussion that π and ≺ are
constant-time functions.

Now the agenda push and pop methods are sim-
ply the insert and delete_min operations of a
priority queue. Each takes O(log | ¯V |) time when
the priority queue is implemented as a binary heap
with at most | ¯V | entries.13 Unfortunately this ap-
proach adds O(| ¯V | log | ¯V |) overall to our runtime
bound, which can make it asymptotically slower
than unweighted forward chaining (§3).

If the priorities are integers in the range [0, N )
for some N > 0, then this extra runtime can be
reduced to O(N ) using a bucket priority queue
(Dial, 1969). A bucket priority queue maintains
an array that maps each p ∈ [0, N ) to a bucket
(set) that contains all agenda items with priority
p. The bucket queue also maintains plast, the prior-
ity of the most recently popped item (initially 0).
A.push(v) simply adds v to bucket π(v). A.pop()
increments plast until it reaches the first non-empty
bucket, then removes any item from that bucket.
This works because our usage of the priority queue
is monotone (Thorup, 2000)—that is, the mini-
mum priority never decreases, so plast never needs
to decrease. Overall, Algorithm 2 will scan forward
once through all the empty and non-empty buckets,
taking O(N ) time. When there is a risk of large N ,
we observe that an alternative is to store only the
non-empty buckets on the integer priority queue of
Thorup (2000), so the next non-empty bucket can

13In contrast to shortest-path problems (footnote 12), there
is no asymptotic advantage here to using a more complicated
Fibonacci heap, since the decrease_key operation is not
needed: the priority (i.e., key) of an item u is fixed in ad-
vance at π(u) and hence never changes after it is pushed.

be found in O(log log B) time where B is the cur-
rent number of non-empty buckets (assuming the
Word RAM model of computation and assuming
that priorities fit in a single word). The mono-
tone property ensures that each bucket is added
and emptied only once, so the extra runtime is
now O(N ′ log log N ′) where N ′ bounds the num-
ber of distinct priority levels (buckets) ever used,
e.g., N ′ = min(| ¯V |, N ). This is an asymptotic
improvement if N ′ log log N ′ = o(N ).

An example permitting small N is the deduction
system for CKY parsing (§2.4). Take N = n,
where n is the length of the input sentence, and
define π(phrase(a,i,k)) = k − i − 1 ∈ [0, N ).
Many items will have the same integer priority.

Consider also the extension that allows acyclic
unary productions (§2.4). Here we can identify
each nonterminal a with an integer in [0, m) such
that b < a if rewrite(a,b) ∈ V . Take N = nm, π(phrase(a,i,k)) = (k − i − 1) · m + a ∈ [0, N ). While prioritization is often useful both theoret- ically and practically, it does not solve our main problem. First, it assumes we know a priority func- tion for P. Second, the runtime is not guaranteed to be proportional to that of Algorithm 1, even with integer priorities. In the last example above, N may not be O(| ¯E| + | ¯V |): consider a pathological class of inputs where the grammar has very long unary production chains (so m is large) that are not actu- ally reached from the input sentences. The alterna- tive N ′ log log N ′ may also fail to be O(| ¯E|+| ¯V |), e.g., on a class of inputs where | ¯E| = O(| ¯V |). 5.2 Dynamic Programming Tabulation Again by analyzing P manually (or potentially au- tomatically), it may be possible to devise a method which, given the axioms V , iterates in topologically sorted order over some finite “worst-case” set X ⊆ V that is guaranteed to be a superset of ¯V . By relax- ing all v ∈ X in this order, we get the correct result. This is called tabulation—systematically filling in a table (namely, the chart) that records ¯ω(v) ∈ W ∪ {⊥} for all v ∈ X. An example is CKY parsing (Kasami, 1965; Baker, 1979). Define V as in §2.4. Given V that specifies a sentence of length n and a context-free grammar in Chomsky Normal Form, let X def= V ∪ {phrase(x,i,k): 0 ≤ i < k ≤ n}. The CKY algorithm relaxes all items in X in increasing order of their widths k − i, where rewrite items are said to have width 0. Opedal et al. (2023, Appendix H.2) give the much less 966 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 8 8 2 1 5 4 4 5 9 / / t l a c _ a _ 0 0 5 8 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 obvious order for weighted Earley parsing. This strategy does away with the out() iterator. The trouble is that it calls C.in(v) on all items v ∈ X—and X may be much larger than ¯V . That ruins our goal of making weighted evaluation on every input as fast as unweighted evaluation (up to a constant factor). For example, in CKY pars- ing, |X| is Θ(n2), yet parsing a particular sentence with a particular grammar may lead to a very sparse ¯V . Algorithm 1 will be fast on this input whereas CKY remains slow (Ω(n2)). For a more precisely analyzed example, consider the deduction system corresponding to Earley’s algorithm, as already mentioned in §1. A bounded-state grammar is guar- anteed to give a sparse ¯V on every input sentence. Goodman (1999, §5) points out that on such a gram- mar, Algorithm 1 runs in O(n) time, but again, tabulation takes Ω(n2) time since |X| = Θ(n2).14 5.3 The Two-Pass Trick To avoid this problem, Goodman (1999, §5) pro- poses to (1) compute ¯V using unweighted forward chaining (Algorithm 1), and then (2) relax just the items in ¯V , in effect choosing X = ¯V . The ques- tion now—in our acyclic case—is how pass (2) can visit the items in ¯V in a topologically sorted order. Using a priority queue to do so would have the same overhead as using a priority queue with the simpler one-pass method (§5.1). Instead, we can use backward chaining during the second pass (Algorithm 3). This is a simple re- cursive algorithm that essentially interleaves relax- ation with depth-first search. It indeed relaxes the vertices in a toposorted order. The call stack acts like a LIFO agenda (made explicit in Algorithm 7). Note that C.in(v) is only called on v ∈ ¯V , and it only iterates over the hyperedges in ¯Ev, which are supported by items in ¯V . This is why a first pass to discover ¯V is helpful. By contrast, a pure one-pass backward chaining method would have to identify a finite X and iterate over all of E’s hyperedges to items v ∈ X, which would visit many items and hyperedges that are not in ¯V and ¯E. At first glance, Algorithm 3 appears to be ex- actly what we need: under our standard assump- tions (§2.5), it runs in time O(| ¯V | + | ¯E|) and space O(| ¯V |), just like Algorithm 1. The question is whether the standard assumptions still hold. Al- 14In principle, this might be fixed by making X a tighter upper bound on ¯V . Given V , evaluation would have to analyze the grammar and construct a special smaller X for the sentence with |X| = O(n), then use that X. However, §5.3 is easier. Algorithm 3 Weight computation by backward chaining 1: ▷ Algorithm 1 has already been run to compute ¯V ▷ now C is a map with keys in ¯V 2: C ← ∅ 3: for v ∈ ¯V : COMPUTE(v) ▷ ¯V is the old set C 4: procedure COMPUTE(v) 5: if C[v] = ⊥ : ▷ first visit 6: 7: 8: 9: 10: ▷ this iterator requires u1, . . . , uk to have been popped from Algorithm 1’s agenda f↢ u1, . . . , uk) ∈ C.in(v) : for i ← 1 to k : COMPUTE(ui) for (v C.relax(v) assert C[v] ̸= ⊥ ▷ because v ∈ ¯V gorithm 3 requires not only C.out(u) on the for- ward pass (like Algorithm 1) but also C.in(v) on the backward pass. Supporting the second type of iterator will generally require maintaining ad- ditional indexes in the chart C. For some deduc- tion systems, that is only a mild annoyance and a constant-factor overhead. Unfortunately, there do exist pathological cases where the second iterator is asymptotically more expensive. This implies that Algorithm 3 cannot serve as a generic master algorithm to establish our meta-theorem (§1). For such a case, consider a deduction sys- f↢ tem with hyperedges of the form t(p · q) r(p), s(q) where p and q are large primes and p · q is their product (mod m).15 Suppose there are n r items and n s items. On the forward pass, this requires all n2 combinations. On the backward pass, since factoring p · q is computationally hard, the out() iterator cannot easily work backwards from a given t item—the best strategy is probably to iterate over all (r,s) pairs that have popped from the agenda to find out which ones yield the given t item. Since this would be done for each t item, the overall runtime of the backward pass could be as bad as Θ(mn2), even though there are only n2 edges and the forward pass runs in time O(n2). What Goodman (1999, §5) actually proposed was to materialize the hypergraph of derived items during forward chaining, allowing the use of clas- sical toposorting algorithms, which run on materi- alized graphs. In particular, if each derived item v stores its incoming hyperedges ¯Ev, then it is easy to make C.in(v) satisfy the standard assumptions— it simply iterates over the stored list. Then Al- 15Or more dramatically, p and q are strings and p · q denotes a cryptographic hash of their concatenation into range [0, m). 967 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 8 8 2 1 5 4 4 5 9 / / t l a c _ a _ 0 0 5 8 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Algorithm 4 Unweighted forward chaining with parent counting (compare Algorithm 1) Algorithm 5 Weighted forward chaining with par- ent counting (compare Algorithm 2) ▷ here C ⊆ V is just a set ▷ axioms 1: C ← ∅; A ← ∅ 2: for v ∈ V : 3: C.add(v); A.push(v) 4: waiting_edges[v] += 1 ▷ # summands needed 5: while A ̸= ∅ : 6: 7: waiting_items += 1 ▷ remove some element ▷ # items popped u ← A.pop() 8: 9: 10: 11: for (v f↢ u1, . . . , uk) ∈ C.out(u) : if waiting_edges[v] = 0 : C.add(v); A.push(v) waiting_edges[v] += 1 ▷ v /∈ C 8: gorithm 3 achieves the desired runtime. Unfortu- nately, as Goodman acknowledged, materialization drives the space requirement up from Θ(| ¯V |) to Θ(| ¯V | + | ¯E|). In the example of the previous para- graph, the space goes from Θ(m+n) to Θ(m+n2). 5.4 Discussion In summary, none of the previous strategies for acyclic weighted deduction are guaranteed to achieve the same time and space performance as unweighted deduction (Algorithm 1). In addition, some of them require constructing priority func- tions or efficient in() iterators. While they are quite practical for some systems, it has not previ- ously been possible for a paper that presents an un- weighted deduction system to simply assert that its asymptotic analysis—for example, via McAllester (2002)—will carry through to weighted deduction. 6 Two-Pass Weighted Deduction via Parent Counting We now present our simple solution to the problem, in the acyclic case. As already mentioned, it allows acyclic weighted deduction systems such as the inside algorithm (§2.4) and weighted Earley’s algo- rithm (§1) to benefit from sparsity in the chart just as much as their unweighted versions do, without any asymptotic time or space overhead. The idea is to use a version of topological sort- ing (Kahn, 1962) that only requires following out- hyperedges, as in forward chaining. We first mod- ify our first pass (Algorithm 4). An item v ∈ C does not store its incoming hyperedges ¯Ev, but just stores the count of those incoming hyperedges in waiting_edges[v] (initially 0). Now, the second pass (Algorithm 5) computes 968 1: ▷ Algorithm 4 has already computed waiting_edges[v] and waiting_items. CONTRIBUTE(v, ω(v)) 2: C ← ∅; A ← ∅ ▷ now C is a map with keys in ¯V 3: for v ∈ V : ▷ axioms 4: 5: while A ̸= ∅ : 6: 7: waiting_items −= 1 ▷ remove some element ▷ # items unpopped u ← A.pop() for (v f↢ u1, . . . , uk) ∈ C.out(u) : CONTRIBUTE(v, f (C[u1], . . . , C[uk])) 9: 10: if waiting_items ̸= 0 : error ▷ cycle detected 11: procedure CONTRIBUTE(v,w) 12: C[v] ⊕v= w 13: waiting_edges[v] −= 1 ▷ # summands needed 14: if waiting_edges[v] = 0 : 15: A.push(v) ▷ delayed push: item is ready the weights, waiting to push an item v onto the agenda until all contributions to its total weight have been aggregated into C[v]. waiting_edges[v] holds the count of contributions that have not yet arrived; when this is reduced to 0, the item can finally be pushed.16 The agenda in the second pass plays the role of the “ready list” of Kahn (1962). All of the same work of deriving items will be done again on the second pass, with no memory of the first pass other than the counts. Both passes use forward chaining and derive the same items—but the second pass may derive them in a different or- der, as it deliberately imposes ordering constraints. Our pseudocode also maintains an extra counter, waiting_items, used only to check the requirement that the input graph must be acyclic. For any input, both passes consume only as much runtime and space as Algorithm 1, up to a constant factor.17 This constant-factor result does not re- quire all of the standard assumptions of §2.5. It needs only the assumptions that f and ⊕v can be computed in O(1) time, and that an item’s weight can be stored in space proportional to the space re- quired by the item itself. If even these assumptions fail, we can still say that the additional runtime of the weighted algorithm is only the total cost of 16Just as a garbage collector can destroy an object when its reference count is reduced to 0. 17More precisely, only as much as the maximum runtime and space of Algorithm 1 on that input, where the max is over possible pop orders. The pop order is left unspecified in Algorithm 1 but might affect the actual runtime of the iterators. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 8 8 2 1 5 4 4 5 9 / / t l a c _ a _ 0 0 5 8 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 f↢ · · · in running f and ⊕v for every hyperedge v | ¯E|, and the additional space is only what is needed to store the weights of the | ¯V | items. When all the standard assumptions apply, the method achieves O(| ¯V | + | ¯E|) time and O(| ¯V |) space, just like §3. Algorithm 6 Solving an SCC using only out() 1: procedure SOLVESCC(S) 2: ▷ This procedure assumes that C[v] already ag- gregates the non-cyclic contributions to ¯ω(v), from axioms and earlier SCCs. 7 Weighted Deduction With Cycles Goodman (1999) also considers the case where the derived hypergraph ( ¯V, ¯E) may be cyclic (while remaining finite). There, we should partition it into strongly connected components (SCCs), which we should visit and solve in toposorted order. 7.1 Solving the SCCs Let S ⊆ ¯V be an SCC. We assume a proce- dure SOLVESCC(S) that sets C[v] = ¯ω(v) for all v ∈ S, where the ¯ω(v) values jointly satisfy the |S| equations (1). The procedure will be called only af- ter previous SCCs have been solved. Thus, C[u] = ¯ω(u) already whenever v ∈ S depends on u /∈ S. Solution is commonly attempted by relaxing the items v ∈ S repeatedly until there is no further change.18 Since C.relax(v) (from §5) requires the in() iterator, we can optionally reorganize the computation to use only out(): see Algorithm 6.19 Repeated relaxation is not guaranteed to termi- nate, nor even converge in the limit to a solution ¯ω, even when one exists. However, under some deduc- tion systems, S may have special structure that en- ables a finite-time solution algorithm. For example, footnote 12 noted that Knuth (1977) solves certain weighted deduction systems in which ⊕v = min. As another example, the equations (1) may become linear once the weights ¯ω(u) from previous SCCs have been replaced with constants. If the ⊕ and ⊗ operations of this linear system of equations form a semiring that also has a closure operator for summing a geometric series in finite time, then the Kleene-Floyd-Warshall algorithm (Lehmann, 1977) solves the SCC S in O(|S|3) operations. 18Much like the Bellman-Ford shortest-paths algorithm (Bellman, 1958), although that algorithm also includes a test that halts with an error if convergence is impossible. 19Algorithm 6 relaxes all v ∈ S in parallel, which un- fortunately can cause oscillation in some cases where serial relaxation would converge. A serial variant, which requires more space, would temporarily materialize the out-edges from all u ∈ S: the out-edges to v ∈ S are stored at v to support serial relaxation within S, while the out-edges to v /∈ S may be saved in a separate list and used at the end to propagate to later SCCs. Alternatively, when a difference propagation algorithm exists (§4), one could run it on S to convergence. 3: 4: 5: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: ▷ Cprev, Cnew are local maps with keys in S. for v ∈ S : if v ∈ V : C[v] ⊕v= ω(v) Cprev[v] ← C[v] ▷ axiom? ▷ all acylic contributions ▷ update until convergence ▷ deep copy ▷ see footnote 23 6: 7: while true : 8: Cnew ← Cprev for u ∈ S : for (v f↢ u1, . . . , uk) ∈ C.out(u) : if v ∈ S : ▷ cyclic hyperedge back to S Cnew[v] ⊕v= f (C[u1], . . . , C[uk]) ▷ deep equality test if Cnew = C : break C ← Cnew ▷ Finally, propagate the solution to later SCCs, in case they too are solved with Algorithm 6. for u ∈ S : ▷ see footnote 23 for (v f↢ u1, . . . , uk) ∈ C.out(u) : if v /∈ S : ▷ hyperedge to later SCC C[v] ⊕v= f (C[u1], . . . , C[uk]) This may be improved to sub-cubic (e.g., Strassen, 1969) if an ⊖ operator is also available. However the SCCs are solved, the point is that solving n SCCs separately (if n > 1) is generally
faster than applying the same algorithm to solve
the entire graph (Purdom, 1970). For example, Tar-
jan (1981) noted that since Kleene-Floyd-Warshall
runs in cubic time on the number of items, the run-
times of these two strategies are proportional to
| ¯E| + (cid:80)
i |Si|3 < | ¯E| + | ¯V |3, respectively, when the SCCs have sizes |S1| + · · · + |Sn| = | ¯V |. Sim- ilarly, Nederhof (2003) suggested running Knuth’s algorithm on each SCC separately, giving runtimes proportional to (cid:80) i | ¯Ei| log(1+|Si|) < | ¯E| log(1+ | ¯V |). Applying the relaxation method to the SCCs of an acyclic hypergraph recovers the fast Algo- rithm 3, since each SCC Si has |Si| = 1 and can be solved to convergence with a single relaxation. 7.2 Finding the SCCs To solve the SCCs one at a time, we need to enumer- ate them in toposorted order. Accomplishing this in O(| ¯V | + | ¯E|) time is a job for Tarjan’s algorithm (Tarjan, 1972). We give two solutions.20 20Goodman (1999) offers three solutions, but §5.4 rejected them as too expensive even in the acyclic case. In particular, Goodman’s solution based on SCC decomposition could 969 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 8 8 2 1 5 4 4 5 9 / / t l a c _ a _ 0 0 5 8 8 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 If efficient in() iterators for P are available, then we can use Algorithm 7, a variant of Algorithm 3 that has been upgraded to use Tarjan’s algorithm rather than ordinary depth-first search. Again, this is a two-pass algorithm: (1) compute ¯V using un- weighted forward chaining (Algorithm 1), and then (2) run a Tarjan-enhanced backward pass to find (and solve) the SCCs in toposorted order.21 The SCCs that are closest to the axioms are found first, so each SCC can be solved as soon as it is found. But what if we do not have in() iterators that are as efficient as the out() iterators (see §5.3)? Fortunately, it is also possible to use pure forward chaining, since the forward graph has the same set of SCCs as the backward graph. This is (1) run a three-pass algorithm (Algorithm 8): Algorithm 1 to find ¯V , (2) run Tarjan’s algorithm on the forward graph to enumerate the SCCs in reverse order, pushing them onto a stack, (3) pop the SCCs off the stack to visit them in the desired order, solving each one when it is popped. For any K, this method consumes only as much time and space as Algorithm 117—excluding the extra time it uses for the SOLVESCC calls, which is unavoidable—up to a constant factor, over all inputs and all deduction systems whose hyperedges have in-degree ≤ K. The constant factor depends only on K (see footnote 22 for why it does). One might hope to combine passes (1) and (2). That would indeed work if we were dealing with an ordinary graph. However, hypergraphs are trickier. It is crucial for Tarjan’s algorithm that when it visits a node u, it immediately explores all paths from u, so as to discover any cycle containing u while u is still on the stack. But C.out(u) will not enumerate f↢ u, u′ if u′ is not yet known to be a hyperedge v in ¯V , so the necessary exploration from u to v will be blocked. Pass (1) solves this by precomputing the vertices ¯V . Now calling C.out(u) during pass (2) will not be blocked but will be able to find the employ Tarjan’s algorithm, but it materializes the hyperedges first. We use iterators instead to save space. 21Algorithm 7 uses a somewhat nonstandard presentation of Tarjan’s that exposes a helpful contract at line 6. Note that A no longer serves as an agenda of items to explore in future, but as a stack of items that are currently being explored. An item waits on the stack because we are backward chaining and its in- puts are not yet all known. It always depends transitively on all items above it on the stack. This is because each item v on the stack depends on the item immediately above v either directly, if COMPUTE(v) is still in progress (hence that item is some ui), or else transitively, via some item below v (then called A[low ]) that kept v from popping when COMPUTE(v) returned. Algorithm 7 Weighted cyclic backward chaining (compare Algorithm 3) 1: ▷ Algorithm 1 has already been run to compute ¯V 2: ▷ Here A is a stack of distinct items. A.index[v] records the current stack position of v (with the bottom and top elements at 0 and |A| − 1 respec- tively), or takes a special value if v /∈ A. 3: C ← ∅; A ← ∅ ▷ now C is a map with keys in ¯V 4: for v ∈ ¯V : COMPUTE(v) ▷ ¯V is the old set C 5: function COMPUTE(v) 6: ▷ Set C[v] = ¯ω(v), unless any of v’s own SCC is on the stack A. In that case, add v and its remaining SCC ancestors to A, setting their C values as required by Algorithm 6 line 2. Returns the # of items at the bottom of the stack that were not detected to be in v’s SCC. 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: low ← |A| if C[v] = ⊥ : A.push(v) if v ∈ V : C[v] ← ω(v) ▷ will become return value ▷ first visit ▷ we may undo this at line 22 ▷ axiom? for (v f↢ u1, . . . , uk) ∈ C.in(v) : ▷ acyclic hyperedge? r ← true for i ← 1 to k : if ui ∈ A : ▷ cycle detected r ← false; low min= A.index[ui] else low min= COMPUTE(ui) if r : C[v] ⊕v= f (C[u1], . . . , C[uk]) ▷ because v ∈ ¯V assert C[v] ̸= ⊥ if low = A.index[v] : ▷ low is unchanged ▷ pop v’s entire SCC into this set ▷ nothing beneath v is in v’s SCC S ← ∅ while |A| > low : S.add(A.pop())
SOLVESCC(S)

return low

hyperedge to v as desired, perhaps looping back
through u (or u′). In effect, pass (2) runs Tarjan’s
algorithm on the implicit dependency graph that
has an edge u → v whenever v was derived by
combining u with other items during pass (1). Pass
(2) can enumerate the dependency edges from u at
any time. To enable this, at Algorithm 8 line 13,
the C.out iterator must retain state from pass (1)
and only require u1, . . . , uk to have been popped
previously—not necessarily during pass (2): that
is, u1, . . . , uk ∈ ¯V .22 The backward methods also
used this more relaxed test (Algorithm 3 line 7 and
Algorithm 7 line 11), but Algorithm 5 line 8 did not.

22As a result, this hyperedge will be visited k times, but
only the first time will have an effect, due to the test at line 11.
23Each time we arrive at line 9 or 16, the loop must enumer-

970

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Algorithm 8 Weighted cyclic forward chaining
(compare Algorithm 7)

1: ▷ Algorithm 1 has already been run to compute ¯V
2: ▷ Again A is a stack of distinct items.
▷ here C ⊆ ¯V is a set
3: C ← ∅; A ← ∅
4: T ← ∅
▷ toposorted stack of SCCs
5: for v ∈ V : FINDNEWSCCS(v)
▷ pass (2)
▷ now C is a map with keys in ¯V
6: C ← ω
7: while T ̸= ∅ : SOLVESCC(T .pop()) ▷ pass (3)
8: function FINDNEWSCCS(u)
9:

▷ Push onto T all SCCs that are reachable from
u and not yet in T , unless any of u’s own SCC
is on the stack A. In that case, just add u and
its remaining SCC descendants to A. Returns
the # of items at the bottom of the stack that
were not detected to be in u’s SCC.

10:

11:

12:

13:

14:

15:

16:

17:

18:

19:

20:

21:

low ← |A|
if u /∈ C :

▷ will become return value
▷ first visit
C.add(u); A.push(u) ▷ we may undo push
f↢ u1, . . . , uk) ∈ C.out(u) :
if v ∈ A : low min= A.index[v]
else low min= FINDNEWSCCS(v)
if low = A.index[u] : ▷ low is unchanged

for (v

▷ pop u’s entire SCC into this set

▷ nothing beneath u is in u’s SCC
S ← ∅
while |A| > low : S.add(A.pop())
T .push(S)

return low

8 Limitations

We have not discussed parallel execution beyond
tensorization of a deduction system (§2.4). But
in forward chaining, one could pop multiple items
from the agenda and then process them simultane-
ously in multiple threads, taking care to avoid write
conflicts.24 Indeed, semi-naive bottom-up evalua-
tion, a classical strategy for evaluating unweighted
deduction systems (Ceri et al., 1990), is equivalent
to processing all items on the agenda in parallel at
each step. For backward chaining, Matei (2016)
reviews concurrent versions of Tarjan’s algorithm.

ate each hyperedge from S exactly once. Under our iterator
design (§2.5), this can be arranged if we iterate through u ∈ S
by pushing S’s items onto an agenda and then popping them
off, provided that we first roll back the state of the C.out
iterators as if S’s items had never previously popped.

24Caveat: The out iterator must avoid duplicates when
finding the hyperedges out of the just-popped set. For example,
f↢ u, u′, u, u′′ should be found only once if
the hyperedge v
u′ and u′′ are jointly popped after u, just as it should be found
only once if u is popped alone after u′ and u′′ (see §2.5).

We have not discussed dynamic algorithms,
where the axioms or their weights can be updated
after or during evaluation, and one wishes to up-
date the derived items and their weights without
recomputing them from scratch (Eisner et al., 2005;
Filardo and Eisner, 2012).

Finally, §2.2 defined evaluation to return all
of ¯V . We did not discuss query-driven evalua-
tion, which returns only ¯V ∩ Q (and corresponding
weights) for a given set of query items Q ⊆ V.
This can allow termination even in some cases with
infinite ¯V . Intuitively, we hope to backward-chain
from Q while also forward-chaining from V . This
can be achieved by a technique such as magic sets
(Beeri and Ramakrishnan, 1991), which transforms
the deduction system P to (1) accept additional
input axioms that specify the queries Q, (2) derive
additional items corresponding to recursive queries,
(3) augment the original hyperedges so that they
will derive an original item v only when it answers
a query. Evaluation of the new system P ′ requires
only forward chaining via out() (e.g., our Algo-
rithms 4–6 and 8)—but forward chaining for (2)
simulates backward chaining under P. As a result,
the out() iterators of P ′ must do additional work
corresponding to what the in() iterators of P did.

9 Conclusions

Often an NLP algorithm can be expressed as a
weighted deduction system (§2) that defines how
various quantities should be computed from one
another. Such a system is usually presented as a
small collection of rules that combine to generate a
large or even infinite computation graph (really a
hypergraph). The rules specify concisely what is
to be computed, leaving the details of scheduling
and data structures to generic strategies that are
well-understood, reliable, and easy to analyze.

We adopted a general formalism for weighted de-
duction systems (going beyond the usual semiring-
weighted case) and surveyed a range of generic
strategies that can be interpreted or compiled for
a given system when seeking a practical CPU im-
plementation. All of the strategies were presented
uniformly in terms of a chart C and agenda A.

Our main novel contribution was to exhibit a
generic strategy for acyclic weighted deduction that
is guaranteed to be as time- and space-efficient on
every system and input as the corresponding strat-
egy for unweighted deduction, up to a constant fac-
tor that is independent of the system and input. We

971

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

also exhibited a strategy with related guarantees for
the more general case of cyclic weighted deduction.

Acknowledgements

Tim Vieira, Andreas Opedal, and Matthew Francis-
Landau provided useful discussion. In particular,
Tim suggested discussing bucket queues in §5.1.
The anonymous reviewers contributed several use-
ful suggestions and questions about the presenta-
tion, which I have tried to address. Any errors are
my own.

References

J. K. Baker. 1979. Trainable grammars for speech
recognition. In Speech Communication Papers
Presented at the 97th Meeting of the Acoustical
Society of America.

Catriel Beeri and Raghu Ramakrishnan. 1991. On
the power of magic. Journal of Logic Program-
ming, 10(3):255–299.

Richard Bellman. 1958. On a routing problem.
Quarterly of Applied Mathematics, 16:87–90.

Stefano Ceri, Georg Gottlob, and Letizia Tanca.
Logic Programming and Databases.

1990.
Springer.

Robert B. Dial. 1969. Algorithm 360: Shortest-
path forest with topological ordering. Communi-
cations of the ACM, 12(11):632–633.

Edsger W. Dijkstra. 1959. A note on two problems
in connexion with graphs. Numerische Mathe-
matik, 1:269–271.

Andrew Drozdov, Patrick Verga, Mohit Yadav, Mo-
hit Iyyer, and Andrew McCallum. 2019. Unsu-
pervised latent tree induction with deep inside-
outside recursive auto-encoders. In Proceedings
of NAACL: HLT, pages 1129–1141.

Jay Earley. 1970. An efficient context-free pars-
ing algorithm. Communications of the ACM,
13(2):94–102.

Jason Eisner and John Blatz. 2007. Program
transformations for optimization of parsing al-
gorithms and other weighted logic programs. In
Proceedings of FG 2006: The 11th Conference
on Formal Grammar, pages 45–85. CSLI Publi-
cations.

Jason Eisner and Nathaniel W. Filardo. 2011.
In
Dyna: Extending Datalog for modern AI.
Oege de Moor, Georg Gottlob, Tim Furche, and
Andrew Sellers, editors, Datalog Reloaded, vol-
ume 6702 of Lecture Notes in Computer Science,
pages 181–220. Springer.

Jason Eisner, Eric Goldlust, and Noah A. Smith.
2005. Compiling comp ling: Weighted dynamic
programming and the Dyna language. In Pro-
ceedings of Human Language Technology Con-
ference and Conference on Empirical Methods in
Natural Language Processing, pages 281–290.

Nathaniel Wesley Filardo and Jason Eisner. 2012.
A flexible solver for finite arithmetic circuits.
In Technical Communications of the 28th In-
ternational Conference on Logic Programming
(ICLP), volume 17 of Leibniz International Pro-
ceedings in Informatics (LIPIcs), pages 425–
438.

C. L. Forgy. 1979. On the Efficient Implementation
of Production Systems. Ph.D. thesis, Carnegie-
Mellon University Department of Computer Sci-
ence.

Michael L. Fredman and Robert E. Tarjan. 1987.
Fibonacci heaps and their uses in improved net-
work optimization algorithms. Journal of the
ACM, 34(3):596–615. Announced at FOCS’84.

Giorgio Gallo, Giustino Longo, Stefano Pallottino,
and Sang Nguyen. 1993. Directed hypergraphs
and applications. Discrete Applied Mathematics,
42:177–201.

Joshua Goodman. 1999. Semiring parsing. Com-

putational Linguistics, 25(4):573–605.

Susan L. Graham, Michael A. Harrison, and Wal-
ter L. Ruzzo. 1980. An improved context-free
recognizer. ACM Transactions on Programming
Languages and Systems, 2(3):415–462.

Arthur B. Kahn. 1962. Topological sorting of
large networks. Commmunications of the ACM,
5(11):558–562.

Tadao Kasami. 1965. An efficient recognition and
syntax-analysis algorithm for context-free lan-
guages. Technical Report AFCRL-65-758, Air
Force Cambridge Research Laboratory, Bedford,
MA.

972

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

prefix probabilities. Computational Linguistics,
21(2):165–201.

Volker Strassen. 1969. Gaussian elimination is not
optimal. Numerische Mathematik, 13(4):354–
356.

Robert E. Tarjan. 1972. Depth-first search and
linear graph algorithms. SIAM Journal on Com-
puting, 1(2):146–160.

Robert Endre Tarjan. 1981. Fast algorithms for
solving path problems. Journal of the ACM,
28(3):594–614.

Mikkel Thorup. 2000. On RAM priority queues.
SIAM Journal on Computing, 30(1):86–109.

Tim Vieira, Ryan Cotterell, and Jason Eisner. 2021.
Searching for more efficient dynamic programs.
In Findings of EMNLP’21, pages 3812–3830.

Tim Vieira, Ryan Cotterell, and Jason Eisner. 2022.
Automating the analysis of parsing algorithms
(and other dynamic programs). Transactions of
the Association for Computational Linguistics.
Accepted for publication.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
5
8
8
2
1
5
4
4
5
9

/

/
t

l

a
c
_
a
_
0
0
5
8
8
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Donald E. Knuth. 1977. A generalization of Dijk-
stra’s algorithm. Information Processing Letters,
6(1):1–5.

Daniel J. Lehmann. 1977. Algebraic structures
for transitive closure. Theoretical Computer Sci-
ence, 4(1):59–76.

Vera Matei. 2016. Parallel algorithms for detecting
strongly connected components. Master’s thesis,
Vrije Universiteit Amsterdam.

David McAllester. 2002. On the complexity anal-
ysis of static analyses. Journal of the ACM,
49(4):512–537.

Bobby McFerrin. 1988. Don’t worry, be happy. In

Simple Pleasures. Manhattan Records.

Mark-Jan Nederhof. 2003. Weighted deductive
parsing and Knuth’s algorithm. Computational
Linguistics, 29(1):135–143.

Andreas Opedal, Ran Zmigrod, Tim Vieira, Ryan
Cotterell, and Jason Eisner. 2023. Efficient
semiring-weighted Earley parsing. In Proceed-
ings of ACL.

Fernando C. N. Pereira and David H. D. Warren.
1983. Parsing as deduction. In Proceedings of
ACL, pages 137–144.

Mark Perlin. 1990. Topologically traversing the
RETE network. Applied Artificial Intelligence,
4(3):155–177.

Paul Purdom, Jr. 1970. A transitive closure algo-
rithm. BIT Numerical Mathematics, 10(1):76–
94.

Taisuke Sato. 1995. A statistical learning method
for logic programs with distribution semantics.
In Proceedings of ICLP, pages 715–729.

Stuart M. Shieber, Yves Schabes, and Fernando
Pereira. 1995. Principles and implementation of
deductive parsing. Journal of Logic Program-
ming, 24(1–2):3–36.

Klaas Sikkel. 1993. Parsing Schemata. Ph.D. the-
sis, University of Twente, Enschede, The Nether-
lands. Published as a book by Springer-Verlag
in 1997.

Andreas Stolcke. 1995. An efficient probabilis-
tic context-free parsing algorithm that computes

973
Download pdf