Efficient Computation of Expectations under Spanning Tree Distributions
Ran Zmigrod, University of Cambridge, United Kingdom
Tim Vieira, Johns Hopkins University, United States
Ryan Cotterell, ETH Zürich, Switzerland
rz279@cam.ac.uk tim.f.vieira@gmail.com ryan.cotterell@inf.ethz.ch
Equal contribution.
Abstract
We give a general framework for inference in spanning tree models. We propose unified algorithms for the important cases of first-order expectations and second-order expectations in edge-factored, non-projective spanning-tree models. Our algorithms exploit a fundamental connection between gradients and expectations, which allows us to derive efficient algorithms. These algorithms are easy to implement with or without automatic differentiation software. We motivate the development of our framework with several cautionary tales of previous research, which has developed numerous inefficient algorithms for computing expectations and their gradients. We demonstrate how our framework efficiently computes several quantities with known algorithms, including the expected attachment score, entropy, and generalized expectation criteria. As a bonus, we give algorithms for quantities that are missing in the literature, including the KL divergence. In all cases, our approach matches the efficiency of existing algorithms and, in several cases, reduces the runtime complexity by a factor of the sentence length. We validate the implementation of our framework through runtime experiments. We find our algorithms are up to 15 and 9 times faster than previous algorithms for computing the Shannon entropy and the gradient of the generalized expectation objective, respectively.
1 Introduction
Dependency trees are a fundamental combinato-
rial structure in natural language processing. It
follows that probability models over dependency
trees are an important object of study. In terms
of graph theory, one can view a (non-projective)
dependency tree as an arborescence (commonly
known as a spanning tree) of a graph. To build a
dependency parser, we define a graph where the
nodes are the tokens of the sentence, and the edges
are possible dependency relations between the
tokens. The edge weights are defined by a model,
which is learned from data. In this paper, we focus
on edge-factored models where the probability of
a dependency tree is proportional to the product of
the weights of its edges. As there are exponen-
tially many trees in the length of the sentence, we
require clever algorithms for finding the normal-
ization constant. Fortunately, the normalization
constant for edge-factored models is efficient to
compute via the celebrated matrix–tree theorem.
The matrix–tree theorem (Kirchhoff, 1847)—more specifically, its counterpart for directed graphs (Tutte, 1984)—appeared before the NLP community in an onslaught of contemporaneous papers (Koo et al., 2007; McDonald and Satta, 2007; Smith and Smith, 2007) that leverage the classic result to efficiently compute the normalization constant of a distribution over trees. The result is still used in more recent work (Ma and Hovy, 2017; Liu and Lapata, 2018). We build upon
this tradition through a framework for computing
expectations of a rich family of functions under
a distribution over trees. Expectations appear in
all aspects of the probabilistic modeling process:
training, model validation, and prediction. There-
fore, developing such a framework is key to accel-
erating progress in probabilistic modeling of trees.
Our framework is motivated by the lack of
a unified approach for computing expectations
over spanning trees in the literature. We believe
this gap has resulted in the publication of
numerous inefficient algorithms. We motivate the
importance of developing such a framework by
highlighting the following cautionary tales.
• McDonald and Satta (2007) proposed an inefficient O(N^5) algorithm for computing
feature expectations, which was much slower than the O(N^3) algorithm obtained by Koo et al. (2007) and Smith and Smith (2007). The authors subsequently revised their paper.
• Smith and Eisner (2007) proposed an O(N^4) algorithm for computing entropy. Later, Martins et al. (2010) gave an O(N^3) method for entropy, but not its gradient. Our framework recovers Martins et al.'s (2010) algorithm, and additionally provides the gradient of entropy in O(N^3).
• Druck et al. (2009) proposed an O(N^5) algorithm for evaluating the gradient of the generalized expectation (GE) criterion (McCallum et al., 2007). The runtime bottleneck of their approach is the evaluation of a covariance matrix, which Druck and Smith (2009) later improved to O(N^4). We show that the gradient of the GE criterion can be evaluated in O(N^3).
We summarize our main results below:
• Unified Framework: We develop an algorithmic framework for calculating expectations over spanning arborescences. We give precise mathematical assumptions on the types of functions that are supported. We provide efficient algorithms that piggyback on automatic differentiation techniques, as our framework is rooted in a deep connection between expectations and gradients (Darwiche, 2003; Li and Eisner, 2009).
• Improvements to existing approaches: Noi
give asymptotically faster algorithms where
several prior algorithms were known.
• Efficient algorithms for new quantities: We demonstrate how our framework calculates several new quantities, such as the Kullback–Leibler divergence, which (to our knowledge) had no prior algorithm in the literature.
• Practicality: We present practical speed-ups in the calculation of entropy compared to Smith and Eisner (2007). We observe speed-ups in the range of 4.1 and 15.1 in five languages depending on the typical sentence length. We also demonstrate a 9 times speed-up for evaluating the gradient of the GE objective compared to Druck and Smith (2009).
• Simplicity: Our algorithms are simple to implement—requiring only a few lines of PyTorch code (Paszke et al., 2019). We have released a reference implementation at the following URL: https://github.com/rycolab/tree_expectations.
2 Distributions over Trees
We consider the distribution over trees in weighted directed graphs with a designated root node. A (rooted, weighted, and directed) graph is given by G = (N, E, ρ). N = {1, . . . , N} ∪ {ρ} is a set of N+1 nodes where ρ is a designated root node. E is a set of weighted edges where each edge (i --w_ij--> j) ∈ E is a pair of distinct nodes such that the source node i ∈ N points to a destination node j ∈ N with an edge weight w_ij ∈ R. We assume—without loss of generality—that the root node ρ has no incoming edges. Furthermore, we assume only one edge can exist between two nodes. We consider the multi-graph case in §2.2.

In natural language processing applications, these weights are typically parametric functions, such as log-linear models (McDonald et al., 2005b) or neural networks (Dozat and Manning, 2017; Ma and Hovy, 2017), which are learned from data.
A tree1 d of a graph G is a set of N edges such that all non-root nodes j have exactly one incoming edge and the root node ρ has at least one outgoing edge. Furthermore, a tree does not contain any cycles. We denote the set of all trees in a graph by D and assume that |D| > 0 (this is not necessarily true for all graphs).

The weight of a tree d ∈ D is defined as:

    w(d) ≜ ∏_{(i→j)∈d} w_ij    (1)

Normalizing the weight of each tree yields a probability distribution:

    p(d) ≜ w(d) / Z    (2)
where the normalization constant is defined as

    Z ≜ ∑_{d∈D} w(d) = ∑_{d∈D} ∏_{(i→j)∈d} w_ij    (3)

1The more precise graph-theoretic term is arborescence.

Of course, for (2) to be a proper distribution, we require w_ij ≥ 0 for all (i → j) ∈ E, and Z > 0.
2.1 The Matrix–Tree Theorem
The normalization constant Z involves a sum
over D, which can grow exponentially large with
N . Fortunately, there is sufficient structure in the
computation of Z that it can be evaluated in O(N^3)
time. The Matrix–Tree Theorem (MTT) (Tutte,
1984; Kirchhoff, 1847) establishes a connection
between Z and the determinant of the Laplacian
matrix, L ∈ R^{N×N}. For all i, j ∈ N\{ρ},

    L_ij ≜ { ∑_{i'∈N\{j}} w_{i'j}   if i = j
           { −w_ij                  otherwise    (4)
Theorem 1 (Matrix–Tree Theorem; Tutte (1984,
p. 140)). For any graph,

    Z = |L|    (5)

Furthermore, the normalization constant can be computed in O(N^3) time.2
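To make Theorem 1 concrete, here is a minimal PyTorch sketch (our own illustrative code, not the released implementation) that builds the Laplacian of (4) from a dense weight matrix and evaluates Z = |L|. It assumes a layout in which row and column 0 stand for the root ρ; the function names are ours.

    import torch

    def laplacian(W: torch.Tensor) -> torch.Tensor:
        """Laplacian of Eq. (4). W is (N+1) x (N+1); row/column 0 is the root."""
        W = W.clone()
        W.fill_diagonal_(0.0)                  # no self-loops
        W[:, 0] = 0.0                          # no edges into the root
        L = -W[1:, 1:].clone()                 # L_ij = -w_ij for i != j
        diag = W[:, 1:].sum(dim=0)             # L_jj = sum over sources i' != j of w_{i'j}
        idx = torch.arange(L.size(0))
        L[idx, idx] = diag
        return L

    def Z(W: torch.Tensor) -> torch.Tensor:
        """Normalization constant over multi-root trees (Theorem 1)."""
        return torch.det(laplacian(W))

    torch.manual_seed(0)
    W = torch.rand(4, 4)                       # three tokens plus the root
    print(Z(W))

The single determinant call is what gives the O(N^3) runtime stated above.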
2.2 Dependency Parsing and the Laplacian Zoo
Graph-based dependency parsing can be encoded
as follows. For each sentence of length N, we
create a graph G = (N , E, ρ) where each non-root
node represents a token of the sentence, and ρ
represents a special root symbol of the sentence.
Each edge (i → j) in the graph represents a
possible dependency relation between head word
i and modifier word j. Fig. 1 gives an example
dependency tree. In the remainder of this section,
we give several variations on the Laplacian matrix
that encode different sets of valid trees.3
In many cases of dependency parsing, we want
ρ to have exactly one outgoing edge. This is
motivated by linguistic theory, where the root of a
sentence should be a token in the sentence rather
than a special root symbol (Tesnière, 1959). There are exceptions to this, such as parsing Twitter (Kong et al., 2014) and parsing specific languages (e.g., the Prague Treebank [Bejček et al., 2013]). We call these multi-root trees4 and these are represented by the set D, as described earlier. Therefore, the normalization constant over all multi-root trees can be computed by a direct application of Theorem 1.

2For simplicity, we assume that the runtime of matrix determinants is O(N^3). However, we would be remiss if we did not mention that algorithms exist to compute the determinant more efficiently (Dumas and Pan, 2016).

3The reader may want to skip this section on their first reading.

Figure 1: Example of a dependency tree.
However, in most dependency parsing corpora, only one edge may emanate from the root (Nivre et al., 2018; Zmigrod et al., 2020). Thus, we consider the set of single-rooted trees, denoted D(1). Koo et al. (2007) adapt Theorem 1 to efficiently compute Z for the set D(1) with the root-weighted Laplacian,5 L̂ ∈ R^{N×N}:

    L̂_ij = { w_ρj                        if i = 1
            { ∑_{i'∈N\{ρ,j}} w_{i'j}     if i = j
            { −w_ij                       otherwise    (6)

Proposition 1. For any graph, the normalization constant over all single-rooted trees is given by the determinant of the root-weighted Laplacian (Koo et al., 2007, Prop. 1):

    Z = |L̂|    (7)

Furthermore, the normalization constant for single-rooted trees can be computed in O(N^3) time.
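Under the same assumed layout as the sketch after Theorem 1, the single-root case of Proposition 1 only changes the matrix construction: the root row is excluded from the column sums and row 1 is overwritten with the root weights, as in (6). A possible sketch, with names that are again our own:

    import torch

    def root_weighted_laplacian(W: torch.Tensor) -> torch.Tensor:
        """Root-weighted Laplacian of Eq. (6); W is laid out as before."""
        W = W.clone()
        W.fill_diagonal_(0.0)
        W[:, 0] = 0.0
        L = -W[1:, 1:].clone()
        diag = W[1:, 1:].sum(dim=0)            # excludes the root row
        idx = torch.arange(L.size(0))
        L[idx, idx] = diag
        L[0, :] = W[0, 1:]                     # row 1 holds the root edges w_ρj
        return L

    def Z_single_root(W: torch.Tensor) -> torch.Tensor:
        """Normalization constant over single-rooted trees (Proposition 1)."""
        return torch.det(root_weighted_laplacian(W))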
Labeled Trees. To encode labeled dependency
relations in our set of trees, we simply augment
edges with labels—resulting in a multi-graph in
which multiple edges may exist between pairs of nodes. Now, edges take the form (i --y/w_ijy--> j), where i and j are the source and destination nodes as before, y ∈ Y is the label, and w_ijy is their weight.
4We follow the conventions of Koo et al. (2007) and say
‘‘single-root’’ and ‘‘multi-root’’ when we technically mean
the number of outgoing edges from the root ρ, and not the
number of root nodes in a tree, which is always one.
5The choice to replace row 1 by the root edges is done by
convention; we can replace any row in the construction of L̂.
Proposition 2. For any multi-graph, the normalization constant for multi-root or single-rooted trees can be calculated using Theorem 1 or Proposition 1 (respectively) with the edge weights

    w_ij = ∑_{y∈Y} w_ijy    (8)

Furthermore, the normalization constant can be computed in O(N^3 + |Y| N^2) time.6
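For the labeled case, Proposition 2 amounts to summing out the label axis before building either Laplacian; a short sketch with a hypothetical dense weight tensor:

    import torch

    N, num_labels = 3, 5
    W_labeled = torch.rand(N + 1, N + 1, num_labels)   # w_ijy (illustrative values)
    W = W_labeled.sum(dim=-1)                           # Eq. (8): w_ij = sum_y w_ijy
    # W can now be passed to laplacian(...) or root_weighted_laplacian(...) above.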
Summary. We give common settings in which
the MTT can be adapted to efficiently compute Z
for different sets of trees. The choice is dependent
upon the task of interest, and one must be careful
to choose the correct Laplacian configuration. The
results we present in this paper are modular in the
specific choice of Laplacian. For the remainder of
this paper, we assume the unlabeled tree setting
and will refer to the set of trees as simply D and
our choice of Laplacian as L.
3 Expectations
In this section, we characterize the family of
expectations that our framework supports. Our
framework is an extension of Li and Eisner (2009)
to distributions over spanning trees. In contrast,
their framework considers expectations over dis-
tributions that can be factored as B-hypergraphs
(Gallo et al., 1993). Our distributions over trees
cannot be cast as polynomial-size B-hypergraphs.
Another important distinction between our frame-
work and that of Li and Eisner (2009) is that we
do not use the semiring abstraction as it is alge-
braically too weak to compute the determinant
efficiently.7
6The algorithms given in later sections will not provide full details for the labeled case due to space constraints, but we assure the reader that our algorithms can be straightforwardly generalized to the labeled setting.

7Indeed, Jerrum and Snir (1982) proved that the partition function for spanning trees requires an exponential number of additions and multiplications in the semiring model of computation (i.e., assuming that subtraction is not allowed). Interestingly, division is not required, but algorithms for division-free determinant computation run in O(N^4) (Kaltofen, 1992). An excellent overview of the power of subtraction in the context of dynamic programming is given in Miklós (2019, Ch. 3). It would appear as if commutative rings would make a good level of abstraction as they admit efficient determinant computation. Interestingly, this means that we cannot use the MTT in the max-product semiring to (efficiently) find the maximum weight tree because max does not have an inverse. Fortunately, there exist O(N^2) algorithms to find the maximum weight tree for both the single-root and multi-root settings (Zmigrod et al., 2020; Gabow and Tarjan, 1984).
The expected value of a function f : D ↦ R^F is defined as follows

    E_d[f(d)] ≜ ∑_{d∈D} p(d) f(d)    (9)

Without any assumptions on f, computing (9) is intractable.8 In the remainder of this section, we will characterize a class of functions f whose expectations can be efficiently computed.

The first type of functions we consider are functions that are additively decomposable along the edges of the tree. Formally, a function r : D ↦ R^R is additively decomposable if it can be written as

    r(d) = ∑_{(i→j)∈d} r_ij    (10)

where we abuse notation slightly: for any function r : D ↦ R^R, we consider the edge function r_ij as a vector of edge values. An example of an additively decomposable function is r(d) = −log p(d), whose expectation gives the Shannon entropy.9 Other first-order expectations include the expected attachment score and the Kullback–Leibler divergence. We demonstrate how to compute these in our framework in §6.1 and §6.3, respectively.
The second type of functions we consider are functions that are second-order additively decomposable along the edges of the tree. Formally, a function t : D ↦ R^{R×S} is second-order additively decomposable if it can be written as the outer product of two additively decomposable functions, r : D ↦ R^R and s : D ↦ R^S:

    t(d) = r(d) s(d)^⊤    (11)

Thus, t(d) ∈ R^{R×S} is generally a matrix. An example of such a function is the gradient of entropy (see §6.2) or the GE objective (McCallum et al., 2007) (see §6.4) with respect to the edge weights. Another example of a second-order additively decomposable function is the covariance matrix.
8Of course, one could use sampling methods, for example Monte Carlo, to approximate (9). Sampling methods may be efficient if the variance of f under p is not too large.

9Proof: −log p(d) = −log((1/Z) ∏_{(i→j)∈d} w_ij) = log Z − ∑_{(i→j)∈d} log w_ij, so we may take r_ij = (1/N) log Z − log w_ij. ∎
Given two feature functions r : D ↦ R^R and s : D ↦ R^S, their covariance matrix is E_d[r(d) s(d)^⊤] − E_d[r(d)] E_d[s(d)]^⊤. Thus, it is a second-order additively decomposable function as long as r(d) and s(d) are additively decomposable.
One family of functions which can be computed efficiently, but which we will not explore here, are those that are multiplicatively decomposable over the edges. A function q : D ↦ R^Q is multiplicatively decomposable if it can be written as

    q(d) = ∏_{(i→j)∈d} q_ij    (12)
where the product of q_ij is an element-wise vector product. These functions form a family that we will call zeroth-order expectations and can be computed with a constant number of calls to MTT (usually two or three). Examples of these include the Rényi entropy and ℓ_p-norms.10

10The ℓ_k norm of the distribution p, often denoted ||p||_k ≜ (∑_{d∈D} p(d)^k)^{1/k} for k ≥ 0, is computable from a zeroth-order expectation because it can be written as (Z^(k)/Z^k)^{1/k}, where Z^(k) ≜ ∑_{d∈D} w(d)^k = ∑_{d∈D} ∏_{(i→j)∈d} w_ij^k, which is clearly a zeroth-order expectation. Similarly, the Rényi entropy of order α ≥ 0 with α ≠ 1 is H_α(p) ≜ (1/(1−α)) log(∑_{d∈D} p(d)^α) = (1/(1−α)) log(Z^(α)/Z^α).

4 Connecting Gradients and Expectations

In this section, we build upon a fundamental connection between gradients and expectations (Darwiche, 2003; Li and Eisner, 2009). This connection allows us to build on work in automatic differentiation to obtain efficient gradient algorithms. While the propositions in this section are inspired from past work, we believe that the presentation and proofs of these propositions have previously not been clearly presented.11 We find it convenient to work with unnormalized expectations, or totals (for short). We denote the total of a function f as f̄ ≜ ∑_{d∈D} w(d) f(d). We recover the expectation with E_p[f] = f̄/Z. We note that totals (on their own) may be of interest in some applications (Vieira and Eisner, 2017, Section 5.3).

11Li and Eisner (2009, Section 5.1) provide a similar derivation to Proposition 3 and Proposition 4 for hypergraphs.

The First-Order Case. Specifically, the partial derivative ∂Z/∂w_ij is useful for determining the total weight of trees which include the edge (i → j),

    ŵ_ij ≜ ∑_{d∈D_ij} w(d)    (13)

where D_ij ≜ {d ∈ D | (i → j) ∈ d}. Furthermore, p((i → j) ∈ d) = ŵ_ij/Z = (w_ij/Z) ∂Z/∂w_ij.12

Proposition 3. For any edge i → j,

    ŵ_ij = (∂Z/∂w_ij) w_ij    (14)

Proof.

    ŵ_ij = ∑_{d∈D_ij} w(d)
         = ∑_{d∈D_ij} ∏_{(i'→j')∈d} w_{i'j'}
         = w_ij ∑_{d∈D_ij} ∏_{(i'→j')∈d\{i→j}} w_{i'j'}
         = w_ij ∂/∂w_ij ∑_{d∈D_ij} ∏_{(i'→j')∈d} w_{i'j'}
         = w_ij ∂/∂w_ij ∑_{d∈D} ∏_{(i'→j')∈d} w_{i'j'}
         = (∂Z/∂w_ij) w_ij    ∎
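Proposition 3 is what lets automatic differentiation do the heavy lifting: differentiating Z (equivalently, log Z) with respect to the weights yields the edge totals and marginals. Below is a minimal PyTorch sketch under the same graph layout as the earlier Laplacian sketches; `build_laplacian` is any of those helpers, and the names are our own rather than the paper's.

    import torch

    def edge_marginals(W: torch.Tensor, build_laplacian) -> torch.Tensor:
        """p((i -> j) in d) = (w_ij / Z) * dZ/dw_ij, via reverse-mode AD."""
        W = W.clone().requires_grad_(True)
        logZ = torch.logdet(build_laplacian(W))    # log Z = log |L|
        (g,) = torch.autograd.grad(logZ, W)        # g_ij = (1/Z) * dZ/dw_ij
        return W.detach() * g                      # = w-hat_ij / Z

Working with log Z rather than Z mirrors footnote 12: the division by Z and the multiplication by w_ij happen through the chain rule.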
Proposition 4 will establish a connection between the unnormalized expectation r̄ and ∇Z.

Proposition 4. For any additively decomposable function r : D ↦ R^R, the total r̄ can be computed using a gradient–vector product

    r̄ = ∑_{(i→j)∈E} ŵ_ij r_ij    (15)

12Some authors (e.g., Wainwright and Jordan, 2008) prefer to work with an exponentiated representation w_ij = exp(θ_ij) so that ∇_{θ_ij} log Z = p((i → j) ∈ d). This avoids an explicit division by Z, and multiplication by w_ij, as these operations happen by virtue of the chain rule.
Proof.

    r̄ = ∑_{d∈D} w(d) r(d)
      = ∑_{d∈D} w(d) ∑_{(i→j)∈d} r_ij
      = ∑_{d∈D} ∑_{(i→j)∈d} w(d) r_ij
      = ∑_{(i→j)∈E} ∑_{d∈D_ij} w(d) r_ij
      = ∑_{(i→j)∈E} ŵ_ij r_ij    ∎

The Second-Order Case. We can similarly use ∂²Z/∂w_ij∂w_kl to determine the total weight of trees which include both (i → j) and (k → l) with (i → j) ≠ (k → l),13

    ŵ_ij,kl ≜ ∑_{d∈D_ij,kl} w(d)    (16)

where D_ij,kl ≜ {d ∈ D | (i → j) ∈ d, (k → l) ∈ d}. Furthermore, ŵ_ij,kl/Z = p((i → j) ∈ d, (k → l) ∈ d).

Proposition 5. For any pair of edges i → j and k → l such that i → j ≠ k → l,

    ŵ_ij,kl = (∂²Z/∂w_ij∂w_kl) w_ij w_kl    (17)

Proof.

    ŵ_ij,kl = ∑_{d∈D_ij,kl} w(d)
            = ∑_{d∈D_ij,kl} ∏_{(k'→l')∈d} w_{k'l'}
            = w_ij w_kl ∂²/∂w_ij∂w_kl ∑_{d∈D} ∏_{(i'→j')∈d} w_{i'j'}
            = (∂²Z/∂w_ij∂w_kl) w_ij w_kl    ∎

Proposition 6 will relate ∇²Z to ∇r̄. This will be used in Proposition 7 to establish a connection between the total t̄ and ∇²Z, and additionally establishes a connection between t̄ and ∇r̄.

Proposition 6. For any additively decomposable function r : D ↦ R^R that does not depend on w,14 and edge i → j ∈ E,

    w_ij ∂r̄/∂w_ij = ŵ_ij r_ij + ∑_{(k→l)∈E} ŵ_ij,kl r_kl    (18)

Proof.

    w_ij ∂r̄/∂w_ij = w_ij ∂/∂w_ij (∑_{(k→l)∈E} (∂Z/∂w_kl) w_kl r_kl)
                  = w_ij ((∂Z/∂w_ij) r_ij + ∑_{(k→l)∈E} (∂²Z/∂w_ij∂w_kl) w_kl r_kl)
                  = ŵ_ij r_ij + ∑_{(k→l)∈E} ŵ_ij,kl r_kl    ∎

Proposition 7. For any second-order additively decomposable function t : D ↦ R^{R×S}, which is expressed as the outer product of additively decomposable functions r : D ↦ R^R and s : D ↦ R^S, t(d) = r(d) s(d)^⊤, where r does not depend on w, the total t̄ can be computed using a Jacobian–matrix product

    t̄ = ∑_{(i→j)∈E} (∂r̄/∂w_ij) (w_ij s_ij)^⊤    (19)

or a Hessian–matrix product

    t̄ = ∑_{(i→j)∈E} ŵ_ij r_ij s_ij^⊤ + ∑_{(i→j)∈E} ∑_{(k→l)∈E} ŵ_ij,kl r_ij s_kl^⊤    (20)

13As each edge can only appear once in a tree, ŵ_ij,ij = 0.

14More precisely, ∂r(d)/∂w_ij = 0 for all d ∈ D and i → j ∈ E.
Proof. We first prove (19):

    t̄ = ∑_{d∈D} w(d) r(d) s(d)^⊤
      = ∑_{d∈D} w(d) r(d) ∑_{(i→j)∈d} s_ij^⊤
      = ∑_{d∈D} ∑_{(i→j)∈d} w(d) r(d) s_ij^⊤
      = ∑_{(i→j)∈E} ∑_{d∈D_ij} w(d) r(d) s_ij^⊤
      = ∑_{(i→j)∈E} w_ij ∂/∂w_ij (∑_{d∈D} w(d) r(d)) s_ij^⊤
      = ∑_{(i→j)∈E} w_ij (∂r̄/∂w_ij) s_ij^⊤    ∎

Then (20) immediately follows by substituting (18) into (19) and expanding the summation.
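Propositions 3 and 5 can also be checked end-to-end with a generic autodiff Hessian. The following is a dense, deliberately slow sanity-check sketch (not one of the paper's algorithms); `build_laplacian` is again one of the earlier illustrative helpers.

    import torch
    from torch.autograd.functional import hessian

    def pairwise_edge_totals(W: torch.Tensor, build_laplacian) -> torch.Tensor:
        """w-hat_{ij,kl} = (d^2 Z / dw_ij dw_kl) * w_ij * w_kl, per Eq. (17)."""
        H = hessian(lambda V: torch.det(build_laplacian(V)), W)   # shape (N+1,)*4
        return H * W[:, :, None, None] * W[None, None, :, :]

Dividing by Z gives the pairwise probabilities p((i → j) ∈ d, (k → l) ∈ d) used in the second-order totals.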
Remark. There is a simple recipe to compute ∇r̄_n for each n = 1, . . . , R. First, some notation: let 1⃗_ij be a vector over E with a 1 in dimension (i → j), and zeros elsewhere. By plugging [r_ij]_n and s_ij = (1/w_ij) 1⃗_ij into (19), we can compute t̄_n = ∇r̄_n.15 However, if r depends on w, we must add the following first-order term, which is due to the product rule:

    ∇r̄_n = t̄_n + ∑_{(i→j)∈E} ŵ_ij ∇[r_ij]_n    (21)

where the sum on the right is the first-order term. We provide the details for computing the gradients of two first-order quantities, Shannon entropy and the KL divergence, using this recipe in §6.2 and §6.3, respectively.

5 Algorithms

Having reduced the computation of r̄ and t̄ to finding derivatives of Z in §4, we now describe efficient algorithms that exploit this connection. The main algorithmic ideas used in this section are based on automatic differentiation (AD) techniques (Griewank and Walther, 2008). These are general-purpose techniques for efficiently evaluating gradients given algorithms that evaluate the functions. In our setting, the algorithm in question is an efficient procedure for evaluating Z, such as the procedure we described in §2.1. While we provide derivatives in §5.1, in our algorithms these can also be evaluated using any AD library, such as JAX (Bradbury et al., 2018), PyTorch (our choice) (Paszke et al., 2019), or TensorFlow (Abadi et al., 2015).

Figure 2: Algorithm for first-order totals.

Proposition 4 is realized as T1 in Fig. 2 and (19) and (20) are realized as Tv2 and Th2 in Fig. 3, respectively. We provide the runtime complexity of each step in the algorithms. These will be discussed in more detail in §5.2.

5.1 Derivatives of Z

All three algorithms rely on first- or second-order derivatives of Z. Since Z = |L|, we can express its gradient via Jacobi's formula and an application of the chain rule16

    ∂Z/∂w_ij = Z ∑_{(i',j')∈L_ij} B_{i'j'} L'_{i'j',ij}    (22)

where

    B = L^{−⊤}    (23)
is the transpose of L^{−1}, L'_{i'j',ij} = ∂L_{i'j'}/∂w_ij, and L_ij is the set of pairs where (i', j') ∈ L_ij means that L'_{i'j',ij} ≠ 0. We define B_{ρj'} ≜ 0 for any j' ∈ N. Koo et al. (2007) show that for any i and j, |L_ij| ≤ 2 in the unlabeled case. Indeed, L'_{i'j',ij} is given by

    L'_{i'j',ij} = { 1    if i' ∈ {1, j}, j' = j
                   { −1   if i' = i, j' = j
                   { 0    otherwise    (24)

15Note that when w_ij = 0, we can set s_ij = 0.

16The derivative of |L| can also be given using the matrix adjugate, ∇Z = adj(L)^⊤. There are benefits to using the adjugate as it is more numerically stable and equally efficient (Stewart, 1998). Indeed, any algorithm that computes the determinant can be algorithmically differentiated to obtain an algorithm for the adjugate.
Their result holds for any Laplacian encoding
we gave in §2.2.17
The second derivative of Z can be evaluated as
follows18
    ∂²Z/∂w_ij∂w_kl = ∑_{(i',j')∈L_ij} ∑_{(k',l')∈L_kl} L'_{i'j',ij} (∂²Z/∂L_{i'j'}∂L_{k'l'}) L'_{k'l',kl}    (25)

where

    ∂²Z/∂L_{i'j'}∂L_{k'l'} = Z (B_{i'j'} B_{k'l'} − B_{i'l'} B_{k'j'})    (26)
Note that (25) also contains a term with ∇2L as
it is derived from the product rule. Because L is
a linear construction, its second derivative is zero
and so we can drop this term.
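As a small illustration of (26) only (the contraction with the sparse ∇L in (25) is left out), the following sketch forms B = L^{−⊤} once and reads off all second derivatives of Z with respect to the entries of the Laplacian; the function name is our own.

    import torch

    def d2Z_dL(L: torch.Tensor) -> torch.Tensor:
        """H[i, j, k, l] = d^2 Z / (dL_ij dL_kl) = Z (B_ij B_kl - B_il B_kj)."""
        Z = torch.det(L)
        B = torch.inverse(L).transpose(0, 1)        # B = L^{-T}
        return Z * (torch.einsum('ij,kl->ijkl', B, B)
                    - torch.einsum('il,kj->ijkl', B, B))

Because B is computed once, all O(N^4) entries follow without any further determinant evaluations; this sharing is what Th2 exploits in §5.2.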
5.2 Complexity Analysis
The efficiency of our approach is rooted in the
following result from automatic differentiation,
which relates the cost of gradient evaluation to the
cost of function evaluation. Given a function f, we denote the number of differentiable elementary operations (e.g., +, *, /, -, cos, pow) of f by Cost{f}.
Theorem 2 (Cheap Jacobian–vector Products).
For any function f : R^K ↦ R^M and any vector v ∈ R^M, we can evaluate (∇f(x))^⊤ v ∈ R^K with cost satisfying the following bound via reverse-mode AD (Griewank and Walther, 2008, page 44),

    Cost{(∇f(x))^⊤ v} ≤ 4·Cost{f}    (27)

Thus, O(Cost{(∇f(x))^⊤ v}) = O(Cost{f}).
As a special (and common) case, Theorem 2
implies a cheap gradient principle: The cost of
evaluating the gradient of a function of one output
(M = 1) is as fast as evaluating the function itself.
Algorithm T1. The cheap gradient principle tells us that ∇Z can be evaluated as quickly as Z itself, and that numerically accurate procedures for Z give rise to similarly accurate procedures for ∇Z. Additionally, many widely used software libraries can do this work for us, such as JAX, PyTorch, and TensorFlow. The runtime of evaluating Z is dominated by evaluating the determinant of the Laplacian matrix. Therefore, we can find both Z and ∇Z in the same complexity: O(N^3). Line 4 of Fig. 2 is a sum over N^2 scalar–vector multiplications of size R; this suggests a runtime of O(N^2 R). However, in many applications, r is a sparse function. Therefore, we find it useful to consider the complexities of our algorithms in terms of the size R, and the maximum density R' of each r_ij. We can then evaluate Line 4 in O(N^2 R'), leading to an overall runtime for T1 of O(N^3 + N^2 R'). The call to Z uses O(N^2) space to store the Laplacian matrix. Computing the gradient of Z similarly takes O(N^2) to store. Since storing r̄ takes O(R) space, T1 has a space complexity of O(N^2 + R).

17We have that |L_ij| ≤ 2|Y| in the labeled case.

18We provide a derivation in Appendix A. Druck and Smith (2009) give a similar derivation for the Hessian, which we have generalized to any second-order quantity.
(cid:3)
.
Algorithm Tv
2 . Second-order quantities (T),
appear to require ∇2Z and so do not directly
fit the conditions of the cheap gradient principle:
the Hessian (∇2Z) is the Jacobian of the gradient.
The approach of Tv
2 to work around this is to make
several calls to Theorem 2 for each element of
R. In questo caso, the function in question is (11),
which has output dimensionality R. Computing
∇r can thus be evaluated with R calls to reverse-
(cid:2)
mode AD, requiring O
time.
We can somewhat support fast accumulation
of S(cid:7)-sparse S in the summation of Tv
2 (Line
∂r
will generally be dense,
6). Unfortunately,
∂wij
so the cost of the outer product on Line 6 È
O(RS(cid:7)). Così, Tv
2 has an overall runtime of
R(N 3 + N 2R(cid:7)) + N 2R S(cid:7)
O
.19 Additionally,
(cid:3)
(cid:2)
2 requires O
Tv
N 2R + R S
of space because
(cid:3)
(cid:2)
O
N 2R
is needed to compute and store the
Jacobian of r and t has size O(R S).
R(N 3 + N 2R(cid:7))
(cid:3)
(cid:3)
(cid:2)
2 . The downside of Tv
Algorithm Th
2 is that
no work is shared between the R evaluations of
the loop on Line 3. For our computation of Z,
it turns out that substantial work can be shared
among evaluations. Specifically, ∇2Z only relies
on the inverse of the Laplacian matrix, as seen
In (26), leading to an alternative algorithm for
second-order quantities, Th
2 . This is essentially
the same observation made in Druck and Smith
(2009). Exploiting this allows us to compute ∇2Z
in O
time. Note that this runtime is only
achievable due to the sparsity of ∇L. The accu-
mulation component (Line 12) of Th
2 can be done
N 4
(cid:3)
(cid:2)
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
3
9
1
1
9
7
6
8
7
0
/
/
T
l
UN
C
_
UN
_
0
0
3
9
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
19If S