Efficient Global Learning of
Entailment Graphs

Jonathan Berant∗
Stanford University

Noga Alon∗∗
Tel Aviv University

Ido Dagan†
Bar-Ilan University

Jacob Goldberger‡
Bar-Ilan University

Entailment rules between predicates are fundamental to many semantic-inference applications.
Consequently, learning such rules has been an active field of research in recent years. Methods
for learning entailment rules between predicates that take into account dependencies between
different rules (e.g., entailment is a transitive relation) have been shown to improve rule quality,
but suffer from scalability issues, that is, the number of predicates handled is often quite small.
In this article, we present methods for learning transitive graphs that contain tens of thousands
of nodes, where nodes represent predicates and edges correspond to entailment rules (termed
entailment graphs). Our methods are able to scale to a large number of predicates by exploiting
structural properties of entailment graphs such as the fact that they exhibit a “tree-like” property.
We apply our methods on two data sets and demonstrate that our methods find high-quality
solutions faster than methods proposed in the past, and moreover our methods for the first time
scale to large graphs containing 20,000 nodes and more than 100,000 edges.

1. Introduction

Performing textual inference is at the heart of many semantic inference applications,
such as Question Answering (QA) and Information Extraction (IE). A prominent generic

∗ Computer Science Department, Stanford University, Stanford, CA 94305.

Web: http://www.stanford.edu/∼joberant.

∗∗ Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, 6997801, Israel.

Web: http://www.tau.ac.il/∼nogaa.

† Computer Science Department, Bar-Ilan University, Ramat-Gan, 52900, Israel.

Web: http://www.cs.biu.ac.il/∼dagan.

‡ Engineering Department, Bar-Ilan University, Ramat-Gan, 52900, Israel.

Web: http://www.eng.biu.ac.il/goldbej.

Submission received: 23 November 2013; revised version received: 26 September 2014; accepted for
publication: 25 November 2014.

doi:10.1162/COLI_a_00220

© 2015 Association for Computational Linguistics


paradigm for textual inference is Textual Entailment (Dagan et al. 2013). In textual en-
tailment, the goal is to recognize, given two text fragments termed text and hypothesis,
whether the hypothesis can be inferred from the text. For example, the text Cyprus
was invaded by the Ottoman Empire in 1571 implies the hypothesis The Ottomans attacked
Cyprus.

Semantic inference applications such as QA and IE crucially rely on entailment
normas (Ravichandran and Hovy 2002; Shinyama and Sekine 2006; Berant and Liang 2014)
or equivalently, inference rules—that is, rules that describe a directional inference rela-
tion between two fragments of text. An important type of entailment rule specifies the
entailment relation between natural language predicates, Por ejemplo, the entailment
rule X invade Y ⇒ X attack Y can be helpful in inferring the aforementioned hypothesis.
Consequently, substantial effort has been made to learn such rules (Lin and Pantel 2001;
Sekine 2005; Szpektor and Dagan 2008; Schoenmackers et al. 2010; Melamud et al. 2013).
Textual entailment is inherently a transitive relation, that is, the rules x ⇒ y and y ⇒
z imply the rule x ⇒ z. For example, from the rules X reduce nausea ⇒ X help with nausea
and X help with nausea ⇒ X associated with nausea we can infer the rule X reduce nausea ⇒
X associated with nausea (Figure 1). Accordingly, Berant, Dagan, and Goldberger (2012)
proposed taking advantage of transitivity to improve learning of entailment rules. Ellos
modeled learning entailment rules as a graph optimization problem, where nodes are
predicates and edges represent entailment rules that respect transitivity. To solve this
optimization problem, they formulated it as an Integer Linear Program (ILP) and used
an off-the-shelf ILP solver to find an exact solution. Indeed, they showed that applying
global transitivity constraints results in more accurate graphs (known as entailment
graphs) than methods that ignore the property of transitivity.

Although using an off-the-shelf ILP solver is straightforward, finding the optimal
set of edges respecting transitivity is NP-hard, and practically, transitivity constraints
impose substantial restrictions on the scalability of the methods. In fact, in some cases
finding the optimal set of edges for entailment graphs with even ∼50 nodes was quite
slow.

In this article, we develop algorithms for learning entailment graphs that take
advantage of structural properties such as transitivity, but substantially reduce the com-
putational cost of inference, thus allowing us to solve the aforementioned optimization
problem on very large graphs.

Figure 1
A fragment of an entailment graph about the concept nausea from the data set used by Berant,
Dagan, and Goldberger. Edges that can be inferred by transitivity are omitted for clarity.

[Figure 1 shows the following nodes: X-related-to-nausea, X-associated-with-nausea, X-prevent-nausea, X-help-with-nausea, X-reduce-nausea, X-treat-nausea]


Our method contains two main steps. The first step is based on a sparsity assump-
tion that even if an entailment graph contains many predicates, most of them do not
entail one another, and thus we can decompose the graph into smaller components that
can be solved more efficiently. For example, the predicates X parent of Y, X child of Y,
and X relative of Y are independent from the predicates X works with Y, X boss of Y, and
X manages Y, and thus we can consider each one of these two sets separately. We prove
that finding the optimal solution for each of the smaller components results in a global
optimal solution for our optimization problem.

The second step proposes a polynomial heuristic approximation algorithm for find-
ing a transitive set of edges in each one of the smaller components. It is based on a novel
modeling assumption that entailment graphs exhibit a “tree-like” property, which we
term forest reducible. For example, the graph in Figure 1 is not a tree, because the
predicates X related to nausea and X associated with nausea form a cycle. However, these
two predicates are synonymous, and if they were merged into a single node, then the
graph would become a tree. We propose a simple iterative approximation algorithm,
where in each iteration a single node is deleted from the graph and then inserted back
in a way that improves the objective function value. We prove that if we impose a
constraint that entailment graphs must be forest reducible, then each iteration can be
performed in linear time. This results in an algorithm that can scale to entailment graphs
containing tens of thousands of nodes.

We apply our algorithm on two data sets. The first data set includes medium-sized
entailment graphs where predicates are typed, that is, the arguments are restricted to
belong to a particular semantic type (for example, Xperson parent of Yperson). We show that
using our algorithm, we can substantially improve runtime, suffering from only a slight
reduction in the quality of learned entailment graphs. The second data set includes a
much larger graph containing 20,000 untyped nodes, where applying state-of-the-art
methods that use an ILP solver is completely impossible. We run our algorithm on this
data set and demonstrate that we can learn knowledge bases with more than 100,000
entailment rules at a higher precision compared with local learning methods.

The article is organized as follows. In Section 2 we survey prior work on the
learning of entailment rules. We first focus on local methods (Section 2.1), that is,
methods that handle each pair of predicates independently of other predicates, and
then describe global methods (Section 2.2), that is, methods that take into account
multiple predicates simultaneously. Section 3 is the core of the article and describes our
main algorithmic contributions. After formalizing entailment rule learning as a graph
optimization problem, we present the two steps of our algorithm. Section 3.1 describes
the first step, in which a large entailment graph is decomposed into smaller compo-
nents. Section 3.2 describes the second step, in which we develop an efficient heuristic
approximation based on the assumption that entailment graphs are forest reducible.
Section 4 describes experiments on the first data set containing medium-sized entail-
ment graphs with typed predicates. Section 5 presents an empirical evaluation on a large
graph containing 20,000 untyped predicates. We also perform a qualitative analysis in
Section 5.4 to further elucidate the behavior of our algorithm. Section 6 concludes the
article.

This article is based on previous work (Berant, Dagan, and Goldberger 2011; Berant
et al. 2012), but expands over it in multiple directions. Empirically, we present results
on a novel data set (Section 5) that is orders of magnitude larger than in the past.
Algorithmically, we present the Tree-Node-And-Component-Fix algorithm, which is an
extension of the Tree-Node-Fix algorithm presented in Berant et al. (2012) and achieves
best results in our experimental evaluation. Theoretically, we provide an NP-hardness


proof for the Max-Trans-Forest optimization problem presented in Berant et al. (2012)
and an ILP formulation for it. Last, we perform in Section 5.4 an extensive qualitative
analysis of the graphs learned by our local and global algorithms from which we draw
conclusions for future research directions.

2. Background

In this section we describe prior work relevant for entailment rule learning. First, we
describe local methods that estimate entailment for a pair of predicates, focusing on
methods that we employ in this article. Then, we describe global methods that perform
inference over a larger set of predicates. Specifically, we provide details on the method
and optimization problem developed by Berant, Dagan, and Goldberger (2012), for
which we propose scalable inference algorithms in this article. Last, we survey some
other related work in NLP that uses global inference.

2.1 Local Learning

In local learning, given a pair of predicates (i, j) we would like to determine whether
i ⇒ j. The main sources of information utilized in the past for local learning were
(1) lexicographic resources, (2) co-occurrence, and (3) distributional similarity. We
briefly describe the first two and then expand more on distributional similarity, which
is the most commonly used source of information.

Lexicographic resources are manually built knowledge bases from which semantic
information may be extracted. For example, the hyponymy, troponymy, and synonymy
relations in WordNet (Fellbaum 1998) can be used to detect entailment between nouns
and verbs. Although WordNet is the most popular lexicographic resource, other re-
sources such as CatVar, Nomlex, and FrameNet have also been utilized to extract
inference rules (Meyers et al. 2004; Budanitsky and Hirst 2006; Szpektor and Dagan
2009; Coyne and Rambow 2009; Ben Aharon, Szpektor, and Dagan 2010).

Pattern-based methods attempt to identify the semantic relation between a pair of
predicates by examining their co-occurrence in a large corpus. For example, the phrase
people snore while they sleep provides evidence that snore ⇒ sleep. While most pattern-
based methods focused on identifying semantic relations between nouns (for example,
Hearst patterns [Hearst 1992]), several works (Pekar 2008; Chambers and Jurafsky
2011; Weisman et al. 2012) attempted to extract relations between predicates as well.
Chklovski and Pantel (2004) used pattern-based methods to generate the commonly
used VerbOcean resource.

Both lexicographic as well as pattern-based methods suffer from limited coverage.
Distributional similarity is therefore used to automatically construct broad-scale
resources. Distributional similarity methods are based on the “distributional hypoth-
esis” (Harris 1954) that semantically similar predicates occur with similar arguments.
Quite a few methods have been suggested (Lin and Pantel 2001; Szpektor et al. 2004;
Bhagat, Pantel, and Hovy 2007; Szpektor and Dagan 2008; Yates and Etzioni 2009;
Schoenmackers et al. 2010), which differ in terms of the specifics of the ways in which
predicates are represented, the features that are extracted, and the function used to
compute feature vector similarity. Next, we elaborate on the methods that we use in this
article.

Lin and Pantel (2001) proposed an algorithm for learning paraphrase relations
between binary predicates, that is, predicates with two variables such as X treat Y.
For each binary predicate, Lin and Pantel compute two sets of features Fx and Fy,


which are the words that instantiate the arguments X and Y, respectivamente, in a large
cuerpo. Given a predicate u and its feature set for the X variable Fx, every feature fx ∈ Fx
is weighted by pointwise mutual information between the predicate and the feature:
$w(f_x) = \log \frac{\Pr(f_x \mid u)}{\Pr(f_x)}$, where the probabilities are computed using maximum likelihood
over the corpus. Given two predicates u and v, the Lin measure (Lin 1998) is computed
for the variable X in the following manner:

$$\text{Lin}_x(u, v) = \frac{\sum_{f \in F^u_x \cap F^v_x} \left[ w^u_x(f) + w^v_x(f) \right]}{\sum_{f \in F^u_x} w^u_x(f) + \sum_{f \in F^v_x} w^v_x(f)} \qquad (1)$$

The measure is computed analogously for the variable Y, and the final distribu-
tional similarity score, termed DIRT, is the geometric average of the scores for the
two variables: if DIRT(u, v) is high, this means that the templates u and v share many
“informative” arguments and so it is possible that u ⇒ v.

Szpektor and Dagan (2008) suggested two modifications to DIRT. First, they looked
at unary predicates, that is, predicates with a single variable such as X treat. Second,
they computed a directional score that is more suited for capturing entailment relations
compared to the symmetric Lin score. They proposed that if for two unary predicates
u ⇒ v, then relatively many of the features of u should be covered by the features of v.
This is captured by the asymmetric Cover measure (Weeds and Weir 2003):

Cover(tu, v) =

(cid:80)

f ∈Fu∩Fv wu( F )
(cid:80)
f ∈Fu wu( F )

(2)

The final directional score, termed BInc (Balanced Inclusion), is the geometric average
of the Lin measure and the Cover measure.
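
As a concrete illustration, the following sketch (our own, not the authors' code) computes the Lin, Cover, and BInc scores of Equations (1) and (2) from precomputed PMI feature weights; the input dictionaries mapping argument features to weights are hypothetical.

```python
import math

def lin(weights_u, weights_v):
    # Equation (1): weight of shared features over total feature weight,
    # computed for a single argument slot (e.g., the X variable).
    shared = set(weights_u) & set(weights_v)
    num = sum(weights_u[f] + weights_v[f] for f in shared)
    den = sum(weights_u.values()) + sum(weights_v.values())
    return num / den if den > 0 else 0.0

def cover(weights_u, weights_v):
    # Equation (2): the proportion of u's feature mass covered by v's features.
    num = sum(w for f, w in weights_u.items() if f in weights_v)
    den = sum(weights_u.values())
    return num / den if den > 0 else 0.0

def binc(weights_u, weights_v):
    # BInc: geometric average of the symmetric Lin score and the
    # directional Cover score.
    return math.sqrt(lin(weights_u, weights_v) * cover(weights_u, weights_v))
```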

Both Lin and Pantel as well as Szpektor and Dagan compute a similarity score using
a single argument. However, it is clear that although this alleviates sparsity problems,
considering pairs of arguments jointly provides more information. Yates and Etzioni
(2009), Schoenmackers et al. (2010), and even earlier, Szpektor et al. (2004), presented
methods that compute semantic similarity based on pairs of arguments.

A problem common to all local methods presented above is predicate ambiguity—
predicates may have different meanings and different entailment relations in different
contexts. Some works resolved the problem of ambiguity by representing predicates
with argument variables that are typed (Pantel et al. 2007; Schoenmackers et al. 2010).
For example, argument variables in the work of Schoenmackers et al. were restricted
to belong to one of 156 types, such as country or profession. A different solution that
has attracted substantial attention recently is to represent the various contexts in which
a predicate can appear in a low-dimensional latent space (for example, using Latent
Dirichlet Allocation [Blei, Ng, and Jordan 2003]) and infer entailment relations between
predicates based on the contexts in which they appear (Ritter, Mausam, and Etzioni
2010; Ó Séaghdha 2010; Dinu and Lapata 2010; Melamud et al. 2013). In the experiments
presented in this article we will use the representation of Schoenmackers et al. in one
experiment, and ignore the problem of predicate ambiguity in the other.


2.2 Global Learning

The idea of global learning is that, by jointly learning semantic relations between a
large number of natural language phrases, one can use the dependencies between the
relations to improve accuracy. A natural way to model that is with a graph where nodes
are phrases and edges represent semantic similarity. Snow, Jurafsky, and Ng (2006)
presented one of the early examples of global learning in the context of learning noun
taxonomies. In their work, they enforced a transitivity constraint over the taxonomy
using a greedy inference procedure and demonstrated that this improves the quality of
the taxonomy. Transitivity constraints were also enforced by Yates and Etzioni (2009),
who proposed a clustering algorithm for learning undirected synonymy relations.
Nakashole, Weikum, and Suchanek (2012) learned a taxonomy of binary predicates and
also enforced transitivity with a greedy algorithm.

Berant, Dagan, and Goldberger (2012) developed a global method for learning en-
tailment relations between predicates. In this article we present scalable approximation
algorithms for the optimization problem they propose, and so we provide further detail
on their method. The input to their algorithm is a large corpus and a lexicographic
resource, such as WordNet, and the output is a set of entailment rules that respect the
constraint of transitivity. The algorithm is composed of two main steps. In the first
step, a set of predicates is extracted from the corpus and a local entailment classifier
is trained based on examples automatically generated from the lexicographic resource.
At this point, we can derive for each pair of predicates (i, j) a score wij ∈ R that estimates
whether i ⇒ j.

In the second step of their algorithm, they construct a graph, where predicates are
nodes and edges represent entailment rules. Using the local scores they look for the set
of edges E that maximizes the objective function $\sum_{(i,j) \in E} w_{ij}$ under the constraint that
edges respect transitivity. They show that this optimization problem is NP-hard and
find an exact solution using an ILP solver over small graphs. They also avoid problems
of predicate ambiguity by partially contextualizing the predicates. In this article, we
present efficient and scalable heuristic approximation algorithms for the optimization
problem they propose.

In recent years, there has been substantial work on approximation algorithms for
global inference problems. Do and Roth (2010) suggested a method for the related task
of learning taxonomic relations between terms. Given a pair of terms, they construct a
small graph that contains the two terms and a few other related terms, and then impose
constraints on the graph structure. They construct these small graphs because their
work is geared towards scenarios where relations are determined on-the-fly for a given
pair of terms and no global knowledge base is ever explicitly constructed. Because they
independently construct a graph for each pair of terms, their method easily produces
solutions where global constraints, such as transitivity, are violated.

Another approximation method that violates transitivity constraints is LP relaxation
(Martins, Smith, and Xing 2009). In LP relaxation, binary variables are replaced by
continuous variables, transforming the problem from an ILP to a Linear Program (LP),
which is polynomial. An LP solver is then applied, and variables that are assigned a
fractional value are rounded to their nearest integer, so many violations of transitivity
may occur. The solution when applying LP relaxation is not a transitive graph; we will
show in Section 4 that our approximation method is substantially faster.

Global inference has gained popularity in recent years in NLP, and a common ap-
proximation method that has been extensively utilized is dual decomposition (Sontag,
Globerson, and Jaakkola 2011). Dual decomposition has been successfully applied in


tasks such as Information Extraction (Reichart and Barzilay 2012), Machine Translation
(Chang and Collins 2011), Parsing (Rush et al. 2012), and Named Entity Recogni-
tion (Wang, Che, and Manning 2013). To the best of our knowledge, it has not yet
been applied for the task of learning entailment relations. The graph decomposition
method we present in Section 3.1 can be viewed as an ideal case of dual decom-
posición, where we can decompose the problem into disjoint components in a way
that we do not need to ensure consistency of the results obtained on each component
separately.

3. Efficient Inference

Our goal is to learn a large knowledge base of entailment rules between natural
language predicates. Following Berant, Dagan, and Goldberger (2012), we formu-
late this task as a graph-learning problem, where, given the nodes of an entailment
graph, we would like to find the best set of edges that respect a global transitivity
constraint.

Our main modeling assumption is that textual entailment is a transitive relation.
Berant, Dagan, and Goldberger (2012) have demonstrated that this modeling assump-
tion holds for focused entailment graphs, that is, graphs in which predicates are disam-
biguated by one of their arguments. For example, a graph that focuses on the concept
nausea might contain an entailment rule between predicates such as X prevents nausea
⇒ X affects nausea, where the predicate X prevent Y is disambiguated by the instantiation
of the argument nausea. In this work, we will examine this modeling assumption for
more general graphs. In Section 4 we show that transitivity holds in typed entailment
graphs, that is, graphs where each predicate specifies the semantic type of its arguments
(for example, Xdrug prevent Ysymptom). In Section 5.4, we examine the assumption of
transitivity when predicates do not carry any typing information (for example, X prevent
Y); and we observe that in this setting the assumption of transitivity is often violated
because of predicate ambiguity. Nevertheless, we show that even in this set-up we
can empirically take advantage of the transitivity assumption.

Let V be a set of predicates, which are the nodes of the entailment graph, and let
w : V × V → R be an entailment weighting function. Given a pair of predicates (i, j), a
positive wij indicates local tendency to decide that i entails j, whereas a negative wij
indicates local tendency to decide that i does not entail j. We want to find a global
transitive entailment graph that is most consistent with the local cues. Formally, our
goal is to find the directed graph G = (V, E) that maximizes the sum of edge weights
$\sum_{(i,j) \in E} w_{ij}$, under the constraint that the graph is transitive—that is, for every triple of
nodes (i, j, k), if (i, j) ∈ E and (j, k) ∈ E, then (i, k) ∈ E.

Berant, Dagan, and Goldberger (2012) proved that this optimization problem
(which we term Max-Trans-Graph) is NP-hard, and provided an ILP formulation for
it. Let xij be an indicator for whether i ⇒ j; then x = {xij : i ≠ j} are the variables of the
following ILP:

$$\max_x \sum_{i \neq j} w_{ij} x_{ij} \quad \text{s.t.} \quad \forall i, j, k \in V: \; x_{ij} + x_{jk} - x_{ik} \leq 1; \qquad \forall i, j \in V: \; x_{ij} \in \{0, 1\} \qquad (3)$$


The objective function is the sum of weights over the edges of G, and the constraint
xij + xjk − xik ≤ 1 on the binary variables enforces transitivity (es decir., xij = xjk = 1 implies
that xik = 1). The weighting function w is trained separately using supervised learning
methods and we describe the details of training for each one of our experiments in
Sections 4 and 5.
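
To make the optimization problem concrete, here is a minimal brute-force sketch (an illustration only, feasible for three or four nodes) that enumerates all edge subsets and keeps the highest-scoring transitive one; `w` is assumed to be a dictionary from ordered node pairs to weights.

```python
import itertools

def max_trans_graph_brute_force(V, w):
    # Exact solution of problem (3) by exhaustive enumeration: exponential in
    # |V|^2, so usable only to sanity-check solvers on tiny graphs.
    pairs = [(i, j) for i in V for j in V if i != j]
    best_score, best_edges = 0.0, set()  # the empty edge set is transitive
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        E = {p for p, b in zip(pairs, bits) if b}
        # Transitivity: (i,j) and (j,k) in E must imply (i,k) in E.
        if any((i, j) in E and (j, k) in E and (i, k) not in E
               for i, j, k in itertools.permutations(V, 3)):
            continue
        score = sum(w[e] for e in E)
        if score > best_score:
            best_score, best_edges = score, E
    return best_edges, best_score
```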

There is a simple probabilistic modeling that motivates the score (3) that we
optimize. Assume that for each pair of nodes (i, j), we are given a probability pij(1) =
pag(xij = 1) that i ⇒ j (the probability that i (cid:59) j is denoted by pij(0) = 1 − pij(1)). Assum-
ing a uniform probability over graphs, the posterior probability (which can also be
viewed as the likelihood) of a graph, represented by an edge-set x, es:

pag(X)

(cid:89)

i(cid:54)=j

pij(xij)

It can be easily verified that:1

Por eso,

iniciar sesión p(xij) = log

pij(1)
pij(0)

xij + log pij(0)

iniciar sesión p(X) =

(cid:88)

i(cid:54)=j

iniciar sesión p(xij) =

(cid:88)

i(cid:54)=j

wijxij + const

such that ‘const’ is a scalar that does not depend on x and

wij = log

pij(1)
pij(0)

(4)

(5)

(6)

(7)

The optimal entailment graph, therefore, is $\arg\max_x p(x) = \arg\max_x \sum_{i \neq j} w_{ij} x_{ij}$, where
the maximization is over all the transitive graphs. Hence the most likely transitive
entailment graph is obtained as the solution of the ILP maximization problem (3).
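
A one-line consequence of Equation (7), assuming a hypothetical local classifier that outputs the probability `p_entail` that i ⇒ j, is that the edge weight is simply the log-odds of entailment:

```python
import math

def edge_weight(p_entail, eps=1e-9):
    # Equation (7): w_ij = log(p_ij(1) / p_ij(0)); clipping keeps the
    # probabilities in (0, 1) so the log-odds stay finite (see footnote 1).
    p = min(max(p_entail, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))
```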

Berant, Dagan, and Goldberger solved this optimization problem with an ILP
solver, but because ILP is NP-hard, this does not scale well; the number of variables
is O(|V|²) and the number of constraints is O(|V|³). Thus, even a graph with 80 nodes
(predicates) has more than half a million constraints. In this section, we describe a
heuristic algorithm that empirically provides high-quality solutions for this optimiza-
tion problem in graphs with tens of thousands of nodes.

Our algorithm contains two main steps. The first step (Section 3.1) is based on
a structural assumption that entailment graphs are relatively sparse—that is, most
predicates do not entail one another. This allows us to decompose the graph into smaller
components in a way that guarantees that an exact solution for each one of the compo-
nents results in an exact solution for Max-Trans-Graph. However, often even after de-
composition, components are too large and finding an exact solution is still intractable.
The second step (Section 3.2) proposes a heuristic algorithm for finding a good
solution for each one of the components. This step is based on an observation that

1 We assume that pij(1), pij(0) ∈ (0, 1) and thus log p(xij ) is well defined.


entailment graphs exhibit a “tree-like” property and are very similar to a novel type
of graph, which we term forest-reducible graph. We utilize this property to develop
an iterative efficient approximation algorithm for learning the graph edges, where each
iteration takes linear time. We also prove that finding the optimal forest-reducible graph
is NP-hard.

In Sections 4 and 5 we apply our algorithm on two different data sets and show
that using transitivity substantially improves performance and that our methods dra-
matically improve scalability, allowing us to increase the scope of global learning of
entailment graphs.

3.1 Entailment Graph Decomposition

The first step of our algorithm takes advantage of graph sparsity: Most predicates in
language do not entail one another. Thus, it might be possible to decompose entailment
graphs into small components and solve each component separately.

Let V1, V2 be a partitioning of the nodes V into two disjoint non-empty subsets. We
term any directed edge whose two nodes belong to different subsets a crossing edge.

Proposition 1
If we can partition a set of nodes V into disjoint sets V1, V2 such that no crossing edge
has a positive weight, then the optimal set of edges that is the solution of the ILP
problema (3) does not contain any crossing edge.


Proof. Assume by contradiction that the edge-set of the optimal solution Eopt con-
tains a non-empty set of crossing edges Ecross. We can construct Enew = Eopt \ Ecross.
Clearly $\sum_{(i,j) \in E_{new}} w_{ij} \geq \sum_{(i,j) \in E_{opt}} w_{ij}$, as wij ≤ 0 for any crossing edge.

Next, we show that Enew does not violate transitivity constraints. Assuming it does,
then the violation is caused by omitting the edges in Ecross. Thus, there must be, without
loss of generality, nodes i ∈ V1 and k ∈ V2 such that for some node j, (i, j) and (j, k) are in
Enew, but (i, k) is not. However, this means either (i, j) or (j, k) is a crossing edge, which
is impossible because we omitted all crossing edges. Thus, Enew is a better solution than
Eopt, a contradiction. Hence, Ecross is empty. ∎

This proposition suggests a simple algorithm (ver algoritmo 1): Construct an
undirected graph with the node set V and with an edge connecting i and j if either
wij > 0 or wji > 0, then find its connected components, and finally solve each component
separately. If an ILP solver is used to find the edges of each component separately,


then we obtain an optimal (not approximate) solution to the optimization problem (3)
for the whole graph. Finding the undirected edges (Line 1) and computing connected
components (Line 2) can be performed in O(|V|²). Thus, in this case the efficiency of the
algorithm is dominated by the application of an ILP solver (Line 4).
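
A minimal sketch of the decomposition step of Algorithm 1 (under the assumption that `w(i, j)` returns the local weight of the ordered pair): build the undirected graph over positive-weight pairs and collect its connected components with a BFS. Each component can then be handed separately to the exact or approximate solver.

```python
from collections import deque

def decompose(V, w):
    # Proposition 1: nodes i, j belong to the same component iff they are
    # connected in the undirected graph that has an edge whenever
    # w(i, j) > 0 or w(j, i) > 0.
    nodes = list(V)
    neighbors = {v: [] for v in nodes}
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            i, j = nodes[a], nodes[b]
            if w(i, j) > 0 or w(j, i) > 0:
                neighbors[i].append(j)
                neighbors[j].append(i)
    components, seen = [], set()
    for v in nodes:
        if v in seen:
            continue
        comp, queue = [], deque([v])
        seen.add(v)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for nb in neighbors[u]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components
```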

If the entailment graph decomposes into small components, one could obtain an
exact solution with an ILP solver, applied on each component separately, without
resorting to any approximation. To further extend scalability in this setting we use a
cutting-plane method (Kelley 1960). Cutting-plane methods have been used in the past,
for example, in dependency parsing (Riedel and Clarke 2006). The idea is that even if
we omit all transitivity constraints, we still expect most transitivity constraints to be
satisfied, given a good weighting function w. Thus, it makes sense to avoid specifying
the constraints ahead of time, but rather add them when they are violated. This is
formalized in Algorithm 2.

Line 1 initializes an active set of constraints (ACT). Line 3 applies the ILP solver with
the active constraints. Lines 4 y 5 find the violated constraints and add them to the
active constraints. The algorithm halts when no constraints are violated. The solution
is clearly optimal because we obtain a maximal solution for a less-constrained problem
that does not violate any transitivity constraint.

A pre-condition for using cutting-plane methods is that computing the violated con-
straints (Line 4) is efficient, as it occurs in every iteration. We do that in a straightforward
manner: For all edges (i, j) and (j, k) that are in the current solution E∗, if (i, k) ∉ E∗
we add xij + xjk − xik ≤ 1 to the violated constraints. This is cubic in the worst case and
performs very fast in practice.
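
The cutting-plane loop of Algorithm 2 can be sketched as follows; `solve_ilp(V, active)` stands in for a call to an off-the-shelf ILP solver that maximizes the objective under only the constraints in `active`, and is assumed rather than implemented here.

```python
def cutting_plane(V, solve_ilp):
    # Start with no transitivity constraints and add violated ones until
    # the solution of the relaxed problem is fully transitive.
    active = set()
    while True:
        E = solve_ilp(V, active)  # edge set maximizing the objective under `active`
        violated = {(i, j, k)
                    for (i, j) in E for (j2, k) in E
                    if j == j2 and i != k and (i, k) not in E}
        if not violated:
            return E  # optimal: maximal for a relaxation, yet transitive
        active |= violated
```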

To conclude, an exact algorithm for Max-Trans-Graph decomposes the graph into
components and uses Algorithm 2 to find an exact solution for each component. How-
ever, because the problem is NP-hard, this will still fail once components become large.
Next, we describe the second step of our method, which replaces Algorithm 2 with an
efficient approximation that can scale to much larger graphs. We will show in Section 4
that the solutions obtained by the approximation algorithm are almost as good as the
exact solution on a data set where finding an exact solution is feasible.

3.2 Efficient Tree-Based Approximation

The approximation we present in this section is based on a conjecture that entailment
graphs exhibit a “tree-like” property, that is, they can be reduced into a structure similar
to a directed forest, which we term forest-reducible graph (FRG). Although FRGs are
a more constrained class of directed graphs, we prove that restricting our optimization


problem to FRGs does not make the problem fundamentally easier, that is, the problem
remains NP-hard (see Appendix). Then, we present in Section 3.2.2 our iterative approx-
imation algorithm, where in each iteration a node is removed and re-attached back to
the graph in a locally optimal way. Combining this scheme with our conjecture about
the graph structure yields a linear algorithm for node re-attachment.

Thus, the contribution of this section is twofold: First, we define a novel mod-
eling assumption about the tree-like structure of entailment graphs and empirically
demonstrate its validity. Second, we exploit this assumption to develop a polynomial
approximation algorithm for learning entailment graphs that can scale to much larger
graphs than in the past.

3.2.1 Forest-Reducible Graph. The predicate entailment relation, described by our entail-
ment graphs, is typically from a “semantically specific” predicate to a more “general”
one. Thus, intuitively, the topology of an entailment graph is expected to be “tree-like.”
In this section we first formalize this intuition and then empirically analyze its validity.
This property of entailment graphs is an interesting topological observation on its own,
but also enables the efficient approximation algorithm of Section 3.2.2.

For a directed edge i ⇒ j in a directed acyclic graph (DAG), we term the node i
a child of node j, and j a parent of i.2 A directed forest is a DAG where all nodes
have no more than one parent. A strongly connected component in a directed graph
is a set of nodes such that there is a path from every node in the set to any other
nodo. In entailment graphs, strongly connected components correspond to semantically
equivalent predicates.

The entailment graph in Figure 2a (subgraph from the data set described in
Section 4) is clearly not a directed forest—it contains a cycle of size two comprising
the nodes X common in Y and X frequent in Y, and in addition the node X be epidemic
in Y has three parents. However, we can convert it to a directed forest by applying
the following operations. Any directed graph G can be converted into a Strongly-
Connected-Component (SCC) graph in the following way: Every strongly connected
component is contracted into a single node, and an edge is added from SCC S1 to SCC
S2 if there is an edge in G from some node in S1 to some node in S2. The SCC graph is
always a DAG (Cormen et al. 2002), and if G is transitive, then the SCC graph is also
transitive. The graph in Figure 2b is the SCC graph of the one in Figure 2a, but is still
not a directed forest because the node X be epidemic in Y has two parents.

The transitive closure of a directed graph G is obtained by adding an edge from node i
to node j if there is a path in G from i to j. The transitive reduction of G is obtained by
removing all edges whose absence does not affect its transitive closure. In DAGs, el
result of transitive reduction is unique (Aho, Garey, and Ullman 1972). We thus define
the reduced graph Gred = (Vred, Ered) of a directed graph G as the transitive reduction of
its SCC graph. The graph in Figure 2c is the reduced graph of the one in Figure 2a and
is a directed forest. We say a graph is a forest-reducible graph if its reduced graph is a
directed forest.
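
Given these definitions, the forest-reducibility test can be sketched directly (a minimal illustration assuming the graph is stored as a networkx DiGraph; `nx.condensation` and `nx.transitive_reduction` are the library's SCC-contraction and transitive-reduction routines):

```python
import networkx as nx

def is_forest_reducible(G):
    # G: a transitive nx.DiGraph whose edge i -> j means i => j.
    scc = nx.condensation(G)                # contract strongly connected components
    reduced = nx.transitive_reduction(scc)  # drop edges implied by transitivity
    # Here an edge points from child (specific) to parent (general), so
    # "every node has at most one parent" means out-degree <= 1 in Gred.
    return all(reduced.out_degree(n) <= 1 for n in reduced.nodes)
```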

We now hypothesize that entailment graphs are approximately FRGs. The intuition
is that the predicate on the left-hand-side of an entailment rule has a more specific
meaning than the one on the right-hand-side. Por ejemplo, in Figure 2a X be epidemic in Y
(where X is a type of disease and Y is a country) is more specific than X common in Y and

2 In standard graph terminology an edge is from a parent to a child. We choose the opposite definition to

conflate edge direction with the direction of the entailment operator ‘⇒’.


Figure 2
A fragment of an entailment graph (a), its SCC graph (b), and its reduced graph (c). Nodes are
predicates with typed variables, which are omitted in (b) and (c) for compactness.

X frequent in Y, which are equivalent, while X occur in Y is even more general. Accord-
ingly, the reduced graph in Figure 2c is an FRG. We note that this is not always the case:
Por ejemplo, the entailment graph in Figure 3 is not an FRG, because X annex Y entails
both Y be part of X and X invade Y, while the latter two do not entail one another. How-
ever, we hypothesize that this scenario is rather uncommon. Consequently, a natural
variant of the Max-Trans-Graph problem is to restrict the required output graph of the
optimization problem (3) to an FRG. We term this problem Max-Trans-Forest.

To test whether our hypothesis holds empirically, we performed the following anal-
ysis. We sampled seven gold standard typed entailment graphs (that is, graphs where

Figure 3
A fragment of an entailment graph that is not an FRG.


predicates specify the semantic type of their arguments, such as Xdrug prevents Ysymptom)
from the data set described in Section 4, manually transformed them into FRGs by
deleting a minimal number of edges, and measured recall over the set of edges in each
graph (precision is naturally 1.0, as we only delete gold-standard edges). The lowest
recall value obtained was 0.95, illustrating that deleting a very small proportion of
edges converts a typed entailment graph into an FRG. Further support for the practical
validity of this hypothesis is obtained from our experiments on typed entailment graphs
en la sección 4. In these experiments we show that exactly solving Max-Trans-Graph
and Max-Trans-Forest (with an ILP solver) results in nearly identical performance. En
Section 5.4 we qualitatively analyze the validity of the FRG assumption in untyped
entailment graphs (where predicates are of the form X prevents Y) and find that this
assumption does not always hold, because of predicate ambiguity.

An ILP formulation for Max-Trans-Forest is simple—a transitive graph is an FRG
if all nodes in its reduced graph have no more than one parent. It can be verified that
this is equivalent to the following statement: For every triplet of nodes i, j, k, if i ⇒ j and
i ⇒ k, then either j ⇒ k or k ⇒ j (or both). This constraint can be stated in a linear form:
xij + xik − xjk − xkj ≤ 1. Therefore, adding this new type of constraint to the ILP given in
Equation (3) results in a formulation for Max-Trans-Forest:

$$\max_x \sum_{i \neq j} w_{ij} x_{ij} \quad \text{s.t.} \quad \forall i, j, k \in V: \; x_{ij} + x_{jk} - x_{ik} \leq 1; \quad \forall i, j, k \in V: \; x_{ij} + x_{ik} - x_{jk} - x_{kj} \leq 1; \quad \forall i, j \in V: \; x_{ij} \in \{0, 1\} \qquad (8)$$

A natural question is whether there is a simpler (polynomial) solution for Max-Trans-
Forest that avoids the need for an ILP solver. In the Appendix we prove that Max-Trans-
Forest is also an NP-hard problem by a polynomial reduction from the X3C problem
(Garey and Johnson 1979). Readers who are not interested in the proof can safely skip
the Appendix.

3.2.2 Approximation Algorithm. In this section we present Tree-Node-Fix and Tree-Node-
And-Component-Fix, which are efficient approximation algorithms for Max-Trans-
Forest, as well as Graph-Node-Fix, an approximation for Max-Trans-Graph.

Tree-Node-Fix. The scheme of Tree-Node-Fix (TNF) is the following. First, an initial FRG
is constructed, using some initialization procedure. Then, at each iteration a single node
v is re-attached (see below) to the FRG in a way that improves the objective function.
This is repeated until the value of the objective function can no longer be improved by
re-attaching a node.

Re-attaching a node v is performed by removing v from the graph (deleting v and all
its adjacent edges) and connecting it back with a better set of edges, while maintaining
the constraint that it is an FRG. This is done by considering all possible edges from/to
the other graph nodes and choosing the optimal subset, while the rest of the graph edges
remain fixed. For example, in Figure 2, one way of re-attaching the node X common in
Y is to add it as a direct child of X occur in Y that is not a synonym of X frequent in Y.
This will result in deletion of the edges X common in Y ⇒ X frequent in Y, X frequent in
Y ⇒ X common in Y, and also X be epidemic in Y ⇒ X common in Y, because otherwise the


resulting graph will not be an FRG. We will show that re-attachment can be efficiently
performed in linear time using dynamic programming.

Formalmente, let Sv−in = (cid:80)

and Sv−out = (cid:80)
amounts to optimizing a linear objective:

i(cid:54)=v wivxiv be the sum of scores over v’s incoming edges
k(cid:54)=v wvkxvk be the sum of scores over v’s outgoing edges. Re-attachment

argmax
X(v)

(Sv-in + Sv-out)

(9)

while maintaining the FRG constraint, where the variables x(v) ⊆ x are indicators for all
pairs of nodes involving v. We approximate a solution for Equation (3) by iteratively
optimizing the simpler objective (9). Claramente, at each re-attachment the value of the
objective function cannot decrease, because the optimization algorithm considers the
previous graph as one of its candidate solutions.

We now show that re-attaching a node v is linear. To analyze v’s re-attachment, we
consider the structure of the directed forest Gred just before v is re-inserted, and examine
the possibilities for v’s insertion relative to that structure. We start by defining some
helpful notations. Every node c ∈ Vred is a strongly connected component in G. Let vc ∈ c
be an arbitrary representative node in c (we can choose any node vc ∈ c, because the
graph G is transitive, and c is a strongly connected component). We denote by Sv-in(C)
the sum of weights from all nodes in c and their descendants to v, and by Sv-out(C) el
sum of weights from v to all nodes in c and their ancestors:

Sv-in(C) =

Sv-out(C) =

(cid:88)

i∈c
(cid:88)

i∈c

wiv +

wvi +

(cid:88)

k /∈c
(cid:88)

k /∈c

wkvxkvc

wvkxvck

(10)

(11)

Tenga en cuenta que {xvck, xkvc} are edge indicators in G and not Gred.

There are several cases for re-attaching v that we need to consider:

1. Inserting v into an existing component c ∈ Vred (case 1, see Figure 4a).

2. Inserting v as a new component, which breaks into two subcases:

(a) Forming a new component that contains only v, where v is a child
of a component c (case 2a, see Figure 4b).

(b) Forming a new component that contains only v, where v is not a
child of any component c, and so is a new root in Gred (case 2b, see
Figure 4c).

Figure 4
(a) Inserting v into a component c ∈ Vred. (b) Inserting v as a child of c and a parent of a subset of
c’s children in Gred. (b’) A node d that is a descendant but not a child of c cannot choose v as a
parent, as v becomes its second parent. (C) Inserting v as a new root.


Note that v cannot form a new component that contains other nodes as well, because
the rest of the graph is fixed.

To find the optimal way of re-attaching v, we need to compute the score of Equa-
tion (9) for each of the three cases, and choose the best one. We now describe how
this score is computed in each of the three cases.

Case 1: Inserting v into a component c ∈ Vred. In this case we add in G edges from
all nodes in c and their descendants to v and from v to all nodes in c and their ancestors.
The score (9) in this case is

s1(C) (cid:44) Sv-in(C) + Sv-out(C)

(12)

Case 2a: Inserting v as a child of some c ∈ Vred. Once c is chosen as the parent of v,
choosing v’s children in Gred is substantially constrained. A node that is not a descendant
of c cannot become a child of v, because this would create a new path from that node
to c and would require by transitivity to add a corresponding directed edge to c (but all
graph edges not connecting v are fixed). Moreover, only a direct child of c can choose v as
a parent instead of c (Figure 4b), because for any other descendant of c, v would become
a second parent, and Gred would no longer be a directed forest (Figure 4b’). Thus, this case
requires adding in G edges from v to all nodes in c and their ancestors; in addition,
for each new child of v, denoted by d ∈ Vred, we add edges from all nodes in d and
their descendants to v. Crucially, although the number of possible subsets of c’s children
in Gred is exponential, the fact that they are independent trees in Gred allows us to go
over them one by one, and decide for each one whether it will be a child of v or not,
depending on whether Sv-in(d) is positive. Therefore, the score (9) in this case is:

s2(C) (cid:44) Sv-out(C)+

(cid:88)

máximo(0, Sv-in(d))

d∈child(C)

(13)

where child(C) are the children of c.

Case 2b: Inserting v as a new root in Gred. Similar to case 2a, only roots of Gred can
become children of v. In this case for each chosen root r we add in G edges from the
nodes in r and their descendants to v. Again, each root can be examined independently.
Therefore, the score (9) of re-attaching v is:

$$s_3 \triangleq \sum_{r} \max(0,\, S_{v\text{-}in}(r)) \qquad (14)$$

where the summation is over the roots of Gred.

It can be easily verified that Sv-in(C) and Sv-out(C) satisfy the recursive definitions:

Sv-in(C) =

Sv-out(C) =

(cid:88)

i∈c
(cid:88)

i∈c

(cid:88)

wiv +

Sv-in(d), c ∈ Vred

d∈child(C)

wvi + Sv-out(pag), c ∈ Vred

(15)

(16)


where p is the parent of c in Gred. These recursive definitions allow computing in linear
time Sv-in(C) and Sv-out(C) for all c (given Gred) using dynamic programming, before going
over the cases for re-attaching v. Sv-in(C) is computed going over Vred leaves-to-root
(post-order), and Sv-out(C) is computed going over Vred root-to-leaves (pre-order).
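
A sketch of these two passes, under simplifying assumptions: the reduced forest is given by `roots`, a `children` map, and a `parent` map over components, while the hypothetical helpers `in_weight(c)` and `out_weight(c)` return the within-component sums of wiv and wvi, respectively.

```python
def compute_reattachment_sums(roots, children, parent, in_weight, out_weight):
    s_in, s_out = {}, {}

    def post_order(c):
        # Equation (15): weights into v from c, plus the children's totals.
        s_in[c] = in_weight(c) + sum(post_order(d) for d in children[c])
        return s_in[c]

    def pre_order(c):
        # Equation (16): weights from v into c, plus the parent's total.
        p = parent.get(c)
        s_out[c] = out_weight(c) + (s_out[p] if p is not None else 0.0)
        for d in children[c]:
            pre_order(d)

    for r in roots:
        post_order(r)
        pre_order(r)
    return s_in, s_out
```

Each component is visited a constant number of times, so both passes together are linear in the size of the reduced forest, as claimed.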

Re-attachment is summarized in Algorithm 3. Computing an SCC graph is linear
(Cormen et al. 2002), and it is easy to verify that transitive reduction in FRGs is also
linear (Line 1). Computing Sv-in(C) and Sv-out(C) (Lines 2–3) is also linear, as explained.
Cases 1 and 2b are trivially linear and in case 2a we go over the children of all nodes in
Vred. As the reduced graph is a forest, this simply means going over all nodes of Vred,
and so the entire algorithm is linear.

Because re-attachment is linear, re-attaching all nodes is quadratic. Thus if we
bound the number of iterations over all nodes, the overall complexity is quadratic. This
is dramatically more efficient and scalable than applying an ILP solver. Empirically, we
found that TNF converges after 5–10 iterations.

Tree-Node-and-Component-Fix. Assuming that entailment graphs are FRGs allows us to
use the node re-attachment operation in linear time. However, this assumption also en-
ables performing other graph operations efficiently. We now suggest a natural extension
to the TNF algorithm that better explores the space of FRGs.

A disadvantage of TNF is that it re-attaches a single node in every iteration. Con-
sider, for example, the reduced graph in Figure 5. In this graph, nodes l, m, and n are
direct children of node k, but suppose that in the optimal solution they are all children
of node j. Reaching the optimal solution would require three independent re-attachment
operations, and it is not clear that each of the three alone would improve the objective
function value. However, if we allow re-attachment operations over components in
the SCC graph, then we would be able to re-attach the strongly connected component
containing the nodes l, m, and n in a single operation. Thus, the idea of our extended
TNF algorithm is to allow re-attachment of both nodes and components. We term this
algorithm Tree-Node-and-Component-Fix (TNCF).

There are many ways in which this intuition can be implemented and our TNCF
algorithm uses one possible variant. In TNCF we first perform node re-attachment
until convergence as in TNF (after initialization), but then compute the SCC graph
and perform component re-attachment until convergence. Component re-attachment is
identical to node re-attachment, except that we are guaranteed that the reduced graph
is a forest. After performing component re-attachment until convergence, we again


Figure 5
An example graph where single node re-attachment is a disadvantage.

perform node re-attachments and then component re-attachments and so on, until the
entire process converges.

Graph-Node-Fix. The re-attachment strategy described above can be applied without
assuming that the graph is an FRG. Next, we show Graph-Node-Fix (GNF), a similar
algorithm that uses the re-attachment strategy to obtain an approximate solution for
the ILP problem Max-Trans-Graph (3). In this more general case, finding the optimal re-
attachment of a node v is done with an ILP solver. However, this ILP is simpler than
Equation (3), because we consider only candidate edges involving v. Figure 6 illustrates
the three types of possible transitivity constraint violations when re-attaching v. The
left side depicts a violation when (i, k) ∉ E, expressed by the constraint in Equation (18),
and the middle and right depict two violations when the edge (i, k) ∈ E, expressed by
the constraints in Equation (19). Thus, the ILP is formulated by adding the following
constraints to the objective function (9):

$$\max_{x(v)} \; S_{v\text{-}in} + S_{v\text{-}out} \qquad (17)$$

$$\text{s.t.} \quad \forall i, k \in V \setminus \{v\}: \; \text{if } (i, k) \notin E, \quad x_{iv} + x_{vk} \leq 1 \qquad (18)$$

$$\forall i, k \in V \setminus \{v\}: \; \text{if } (i, k) \in E, \quad x_{vi} \leq x_{vk}, \; x_{kv} \leq x_{iv} \qquad (19)$$

$$\forall i, k \in V \setminus \{v\}: \quad x_{iv}, x_{vk} \in \{0, 1\}$$

Figure 6
Three types of transitivity constraint violations, when re-attaching the node v to a graph
containing the nodes i and k.


Complexity is exponential because of the ILP solver; however, the ILP size is
reduced by an order of magnitude to O(|V|) variables and O(|V|²) constraints. To
summarize, the complexity of GNF is higher than the complexity of TNF, but it does
not require the assumption that entailment graphs are FRGs.

4. Experimental Evaluation: Typed Entailment Graphs

In this and the next section we empirically evaluate the algorithms presented in Sec-
tion 3 on two different data sets. Our first data set, presented in this section, com-
prises medium-sized graphs for which obtaining an exact solution is possible, and we
demonstrate that our approximation methods substantially improve runtime while
causing only a small degradation in performance compared with the optimal solution.
The graphs are also particularly suited for global optimization because graph predicates
are typed, which substantially reduces their ambiguity. The second data set (Section 5)
contains a graph with tens of thousands of untyped nodes, where exact inference
is completely impossible. We show that our methods scale to this graph and that
transitivity improves performance even when predicates are untyped. The resulting
graph contains more than 100,000 entailment rules that can be utilized in downstream
semantic applications.

The input to the algorithms presented in Section 3 is a set of nodes V and a weight-
ing function w : V × V → R. We describe how those are constructed before presenting
an experimental evaluation.

4.1 Typed Entailment Graphs

As mentioned in Section 2, one of the challenges in global optimization is that tran-
sitivity does not always hold when predicates are ambiguous. Schoenmackers et al.
(2010) proposed an algorithm for learning inference rules between typed predicates, a
representation that substantially reduces ambiguity. We adopt their representation and
learn typed entailment graphs. A typed entailment graph (see Figure 7) is a directed
graph whose nodes are typed predicates. A typed predicate is a triple (t1, p, t2), or
simply p(t1, t2), representing a predicate in natural language: p is the lexical realization
of the predicate, and the typed variables t1, t2 indicate that the arguments of the predicate
belong to the semantic types t1, t2. Semantic types are taken from a set of types T, where
each type t ∈ T is a bag of natural language words or phrases. Examples of typed
predicates are conquer(COUNTRY,CITY) and contain(PRODUCT,MATERIAL). An instance
of a typed predicate is a triple (a1, p, a2), or simply p(a1, a2), where a1 ∈ t1 and a2 ∈ t2

Figure 7
A fragment of a typed entailment graph. [Nodes shown include province of(PLACE,COUNTRY),
be part of(PLACE,COUNTRY), annex(COUNTRY,PLACE), and invade(COUNTRY,PLACE) in one
fragment, and be relate to(DRUG,DRUG), be derive from(DRUG,DRUG),
be process from(DRUG,DRUG), and be convert into(DRUG,DRUG) in another.]



are termed arguments. For example, be common in(asthma,australia) is an instance of be
common in(DISEASE,PLACE).

Given a set of typed predicates, entailment rules can only exist between predicates
that share the same (unordered) pair of types (such as PLACE and COUNTRY), as other-
wise the rule would contain unbound variables. Hence, given a set of typed predicates
we can immediately decompose them into disjoint subsets—all typed predicates shar-
ing the same pair of types define a separate graph that describes the entailment relations
between those predicates (see Figure 7).
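As a minimal sketch (assuming typed predicates are given as (t1, p, t2) triples), this
decomposition can be written as:

from collections import defaultdict

# Group typed predicates into disjoint entailment graphs, one per unordered
# pair of argument types.
def decompose_by_types(typed_predicates):
    graphs = defaultdict(list)
    for (t1, p, t2) in typed_predicates:
        graphs[frozenset((t1, t2))].append((t1, p, t2))
    return graphs

For example, conquer(COUNTRY,CITY) and invade(COUNTRY,CITY) fall into the same
graph, whereas contain(PRODUCT,MATERIAL) defines a separate one.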

Edges in typed entailment graphs represent entailment rules in the usual way. If
the type t1 is different from the type t2, mapping of arguments is straightforward, as
in the rule be find in(MATERIAL,PRODUCT) ⇒ contain(PRODUCT,MATERIAL). If t1 and t2
are equal, we need to specify how arguments are mapped. This is done by splitting each
node into two: for example, the node beat(TEAM,TEAM) is split into two typed predicates
beat(Xteam,Yteam) and beat(Yteam,Xteam). This allows us to specify a rule where argument
order is reversed, such as beat(Xteam,Yteam) ⇒ lose to(Yteam,Xteam). This also accommodates
rules such as play(Xteam,Yteam) ⇒ play(Yteam,Xteam): if team A plays team B, then team B
plays team A.

To create typed entailment graphs, we used a data set that was generously provided
by Schoenmackers et al. (2010). Schoenmackers et al. produced a mapping of 1.1 million
arguments into 156 types (examples of (argument, TYPE) pairs are (exodus, BOOK), (china,
COUNTRY), and (asthma, DISEASE)), and then utilized the types, the mapped arguments,
and 1 million TextRunner tuples (Banko et al. 2007) to generate a set of 10,672 typed
predicates.3 Because entailment can only occur between predicates that share the same
types, we decomposed the 10,672 typed predicates into 2,303 typed entailment graphs.
The largest graph contains 188 nodes and the total number of potential rules is 263,756.4
The advantage of typing predicates is that it substantially reduces ambiguity, while still
maintaining rules of wide applicability.

4.2 Training a Local Entailment Classifier

The weighting function w is derived from an entailment score provided by a local
classifier. Given a local classifier that provides an entailment score sij for a pair of pred-
icates (i, j), we define wij = sij − λ, where λ is a prior that controls graph sparseness: as
λ increases, wij decreases and becomes negative for more pairs of predicates, rendering
the graph more sparse. The probabilistic interpretation of λ is as follows. For each two
nodes let q be a prior probability that an edge exists. For large values of q the graph
tends to be dense, and vice versa. Defining λ as log((1 − q)/q), the modified weight
function is:

wij = log [p(xij = 1) / p(xij = 0)] − log [(1 − q)/q] = sij − λ         (20)
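As a small illustration of Equation (20), the following sketch converts a prior edge
probability q into the sparseness parameter λ; the value q = 0.3 is an assumption of
the example, not a value used in the paper.

import math

# lambda = log((1 - q) / q); w_ij = s_ij - lambda (Equation (20)).
def edge_weight(s_ij, q=0.3):
    lam = math.log((1 - q) / q)
    return s_ij - lam

With q = 0.3, λ ≈ 0.85, so a classifier score sij = 0.5 yields a negative weight and the
corresponding edge is discouraged.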

Given the weight function w, the task is to find the maximum a posteriori global graph
that satisfies predefined constraints such as transitivity.

3 Readers are referred to their paper for details on mapping of tuples to typed predicates. The mapping

of arguments into types can be downloaded from http://www.cs.washington.edu/research/
sherlock-hornclauses/.

4 In more detail, 1,714 graphs have fewer than 5 nodes, 326 graphs have 5–10 nodes, 145 graphs have 10–20
nodes, 92 graphs have 20–50 nodes, 18 graphs have 50–100 nodes, and 7 graphs have at least 100 nodes.


Training is similar to the method proposed by Berant, Dagan, and Goldberger
(2012), and we briefly describe it here. The input for training is a lexicographic resource,
for which we use WordNet, and a set of tuples, for which we use the 1 million typed
TextRunner tuples provided by Schoenmackers et al. We perform the following steps:

1.

Training set generation Positive examples are generated using WordNet
synonyms and hypernyms. Negative pairs are generated using WordNet
direct co-hyponyms (sister terms), but we also utilize WordNet hyponyms
at distance 2. In addition, we generate negative examples by randomly
sampling pairs of typed predicates that share the same types. Table 1
provides an example for each type of automatically generated training
example. It has been noted in the past that the WordNet verb hierarchy
contains a certain amount of noise (Richens 2008; Roth and Frank
2012). However, we use WordNet only to generate examples for training
a statistical classifier, and thus we can tolerate some noise in the generated
examples. In fact, we have noticed that simple variants in training set
generation do not result in substantial differences in classifier performance.

2.

Feature representation Each example pair of typed predicates (p1, p2)
is represented by a feature vector, where each feature is a distributional
similarity score estimating whether p1 entails p2.

We compute 11 distributional similarity scores for each pair of typed
predicates, based on their arguments in the input set of tuples. The first six
scores are computed by trying all combinations of two similarity functions,
Lin and BInc, with three types of feature representations (see Section 2):

(a) A feature is a pair of arguments. For example, a feature for the
typed predicate invade(COUNTRY,CITY) might be (germany,leningrad)
or (england,paris).

(b) Predicate representation is binary, and each typed predicate has two
feature vectors, one for the X slot and one for the Y slot. Similarities
are computed for each vector separately and are then combined
by a geometric average, as in DIRT (Lin and Pantel 2001); a short
sketch of this combination follows this item. For example, the
predicate invade(COUNTRY,CITY) will have a feature vector for its
X slot with features such as germany and england, and a feature
vector for its Y slot with features such as leningrad and paris.

(c) Binary typed predicates are decomposed into two unary typed
predicates, and similarity is computed separately for each unary

Table 1
Automatically generated training set examples.

Type                   Example

direct hypernym        beat(TEAM,TEAM) ⇒ play(TEAM,TEAM)
direct synonym         reach(TEAM,GAME) ⇒ arrive at(TEAM,GAME)
direct cohyponym       invade(COUNTRY,CITY) ⇏ bomb(COUNTRY,CITY)
hyponym (distance=2)   defeat(CITY,CITY) ⇏ eliminate(CITY,CITY)
random                 hold(PLACE,EVENT) ⇏ win(PLACE,EVENT)



predicate. Then, similarity scores are combined by a geometric
average. For example, the binary typed predicate invade(COUNTRY,
CITY) will be decomposed into two unary predicates, one where
the first argument of invade has the type COUNTRY (and the type of
the second argument is unspecified), and another where the second
argument of invade has the type CITY (and the first argument is
unspecified).

The other five scores were provided by Schoenmackers et al.
(2010) and include SR (Schoenmackers et al. 2010),
LIME (McCreath and Sharma 1997), M-estimate (Dzeroski and Bratko 1992),
the standard G-test, and a simple implementation of Cover (Weeds and Weir
2003). Overall, the rationale behind this representation is that combining
various scores will yield a better classifier than each single measure.
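The following minimal sketch shows the geometric-average combination used in
representations (b) and (c); the per-slot similarity values are assumed inputs of the
example:

import math

# Combine per-slot (or per-unary-predicate) similarities by a geometric
# average, in the style of DIRT.
def combine_similarities(sim_x, sim_y):
    return math.sqrt(sim_x * sim_y)

For example, an X-slot similarity of 0.4 and a Y-slot similarity of 0.9 combine to a
score of 0.6.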

3.

Training We sub-sample negative examples and train over an equal
number of positive and negative examples. We used SVMperf
(Joachims 2005) to train a Gaussian kernel classifier that provides
an output score, sij. We tuned the two SVM parameters using 5-fold
cross validation on a development set of two typed entailment graphs.

4.

Prior knowledge For a small number of pairs of predicates, we might have
prior knowledge of whether one entails the other. Berant, Dagan, and
Goldberger (2012) integrated prior knowledge by adding hard constraints
to the ILP. Because not all of our algorithms use an ILP solver, we integrate
prior knowledge by modifying the local classifier score. For pairs of
predicates i, j for which we have prior knowledge that i entails j (termed
positive local constraints), we set sij = ∞. For pairs of predicates i, j for
which we have prior knowledge that i does not entail j (termed negative
local constraints), we set sij = −∞.

To generate prior knowledge constraints we normalized each predicate by
omitting the first word if it is a modal and turned passives into actives. If
two normalized predicates are equal, they form a positive local constraint.
Negative local constraints were constructed from three sources:
(1) predicates differing by a single pair of words that are WordNet
antonyms, (2) predicates differing by a single word of negation, and (3)
predicates p(t1, t2) and p(t2, t1) where p is a transitive verb (for example,
beat) in VerbNet (Kipper, Dang, and Palmer 2000).

As reported by Berant, Dagan, and Goldberger (2012), we find that
injecting prior knowledge into the graph improves performance, as it
provides a good starting point for the inference procedure.
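A minimal sketch of this integration (the dictionary and set names are assumptions
of the example):

# Positive local constraints force an edge; negative local constraints forbid it.
def apply_prior_knowledge(s, positive_pairs, negative_pairs):
    # s is a dict of local scores s[(i, j)]
    for pair in positive_pairs:
        s[pair] = float("inf")
    for pair in negative_pairs:
        s[pair] = float("-inf")
    return s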

4.3 Experimental Evaluation

To evaluate performance, we manually annotated all edges in 10 typed entailment
graphs containing 14, 14, 22, 30, 53, 62, 76, 86, 118, and 118 nodes. This annotation
yielded 3,427 edges and 35,585 non-edges, resulting in an empirical edge density of
9%. The data set is publicly available and can be downloaded from
http://www-nlp.stanford.edu/joberant/homepage_files/resources/Acl2011Exp.rar.


We implemented the following algorithms for learning graph edges, where in all of

them the graph is first decomposed into components as described in Section 3.1.

No-trans Use local scores without transitivity constraints—an edge (i, j) is inserted

iff wij > 0, or in other words iff sij > λ.
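In code, this baseline amounts to simple thresholding (a sketch with assumed
inputs):

# Insert edge (i, j) iff w_ij = s_ij - lambda > 0.
def no_trans(scores, lam):
    return {(i, j) for (i, j), s_ij in scores.items() if s_ij - lam > 0}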

Exact-graph Find the optimal solution for Max-Trans-Graph in each component by

applying an ILP solver in a cutting-plane method, as described in Section 3.1.

Exact-forest Find the exact solution for Max-Trans-Forest (see Equation 8) in each

component by applying an ILP solver in a cutting-plane method.

LP-relax Solve Max-Trans-Graph approximately by applying an LP-relaxation (see
Section 2) on each graph component. We apply the LP solver within the same cutting-
plane method to allow for a direct comparison. As mentioned, our goal is to present a
method for learning transitive graphs, whereas LP-relax produces solutions that violate
transitivity. Nevertheless, we run it on our data set to obtain empirical results and to
compare runtimes against TNF.

Graph-Node-Fix (GNF) Initialize each component in the following way: if the
graph is very sparse, that is, λ ≥ C for some constant C (set to 1 in our experiments),
then solving the graph exactly is not an issue and we use Exact-graph. Otherwise, we
initialize by applying Exact-graph in a sparse configuration, that is, λ = C.

Tree-Node-Fix (TNF) Initialize as in GNF, except that if it generates a graph that is
not an FRG, it is corrected by a simple heuristic: For every node in the reduced graph
Gred that has more than one parent, we choose from its current parents the single one
whose SCC is composed of the largest number of nodes in G.
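A minimal sketch of this correction heuristic (all names are assumptions of the
example):

# In the reduced graph, a node may keep only one parent: the parent whose
# strongly connected component covers the most nodes of the original graph G.
def enforce_frg(reduced_nodes, parents, scc_size):
    # parents[n] is the set of parents of node n; scc_size[p] is |SCC(p)| in G.
    for n in reduced_nodes:
        if len(parents[n]) > 1:
            parents[n] = {max(parents[n], key=lambda p: scc_size[p])}
    return parents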

We do not present the results of TNCF in this experiment, because for medium-sized

graphs it provides results that are almost identical to TNF.

The No-trans baseline is a state-of-the-art local learning algorithm. It uses state-of-
the-art local scores such as DIRT (Lin and Pantel 2001), BInc (Szpektor and Dagan 2008),
Cover (Weeds and Weir 2003), and SR (Schoenmackers et al. 2010) and trains a classifier
using these scores. Berant, Dagan, and Goldberger (2010) have demonstrated empiri-
cally that training a classifier over multiple similarity scores improves performance
compared with using just a single similarity score.

The Exact-graph algorithm is the state-of-the-art global method presented by
Berant, Dagan, and Goldberger (2012). It finds the optimal solution for Max-Trans-
Graph, and our goal is to show that we can obtain good performance more efficiently
using our approximation methods. Last, LP-relax is a standard approximation method
for solving ILPs (Martins, Smith, and Xing 2009).

We note that a trivial baseline would be to initialize edges based on whether sij > λ
and then compute the transitive closure in each component. We empirically found that
this adds edges too aggressively and results in very low precision.

We use the Gurobi optimization package5 as our ILP solver in all experiments. The
experiments were run on a multi-core 2.5GHz server with 32GB of RAM.

We evaluate algorithms by comparing the set of gold-standard edges with the set
of edges learned by each algorithm. We measure recall, precisión, and F1 for various
values of the sparseness parameter λ, and compute the area under the precision-recall
curve (AUC) generated. Efficiency is evaluated by comparing runtimes.

We first focus on runtimes and show that TNF is substantially more efficient than
other baselines that use transitivity. Figure 8 compares runtimes of Exact-graph, GNF,
TNF, and LP-relax as −λ increases and the graph becomes denser. Note that the y-axis

5 www.gurobi.com


Figure 8
Runtime in seconds of the global algorithms for various −λ values.

is in logarithmic scale. Clearly, Exact-graph is extremely slow and runtime increases
quickly. For λ = 0.3, runtime was already 12 hours and we were unable to obtain re-
sults for λ < 0.3, whereas with TNF we easily obtained a solution for any λ. When λ = 0.6,
where both Exact-graph and TNF achieve their best F1, TNF is 10 times faster than
Exact-graph. When λ = 0.5, TNF is 50 times faster than Exact-graph, and so on. Most
importantly, runtime for GNF and TNF increases much more slowly than for Exact-graph.
Comparing runtimes for TNF and GNF, we see that the gap between the algorithms
decreases as −λ increases. However, for reasonable values of λ, TNF is about four to
seven times faster than GNF, and we were unable to run GNF on large graphs, as we
report in Section 5.

Runtime of LP-relax is also poor compared with TNF and GNF: it increases more
slowly than for Exact-graph, but still very fast compared with TNF. When λ = 0.6,
LP-relax is almost 10 times slower than TNF, and when λ = −0.1, LP-relax is 200 times
slower than TNF. This points to the difficulty of scaling LP-relax to large graphs. Last,
Exact-forest is the slowest algorithm, and because it is an approximation of Exact-graph
we omit it from the figure for clarity.

We now examine the quality of the learned graphs and the validity of our modeling
assumptions. Figure 9 (left) shows a pair-wise comparison of Exact-graph and Exact-
forest. As is evident, the two curves are very similar, and the maximal F1 on the curve
and AUC are almost identical. This provides further support for our modeling
assumption that entailment graphs are roughly forest-reducible. Figure 9 (right) shows
a similar comparison for TNF and GNF. We observe again that the curves are similar
and performance is almost identical (maximal F1: 0.41, AUC: 0.31), illustrating that
using the FRG assumption in our approximation algorithm is empirically effective.

Figure 9
Pair-wise comparison of the precision-recall curves of Exact-graph vs. Exact-forest, and GNF
vs. TNF.

In Figure 10, we compare the performance of Exact-graph, which exactly solves
Max-Trans-Graph, to our most efficient approximation algorithm, TNF, and to No-trans,
a baseline that does not use transitivity at all (GNF and LP-relax are omitted from the
figure to improve readability). We observe that both Exact-graph and TNF substantially
outperform No-trans and that TNF's graph quality is only slightly lower than
Exact-graph (which is extremely slow). We report in the caption the maximal F1 on the
curve and AUC in the recall range 0–0.5 (the widest range for which we have results
for all algorithms). Note that compared with Exact-graph, TNF reduces AUC by merely
a point and the maximal F1 score by two points only.

Figure 10
Precision (y-axis) vs. recall (x-axis) curve. Maximal F1 on the curve is 0.43 for Exact-graph,
0.41 for TNF, and 0.34 for No-trans. AUC in the recall range 0–0.5 is 0.32 for Exact-graph,
0.31 for TNF, and 0.26 for No-trans.
As for LP-relax, results are just slightly lower than Exact-graph (maximal F1: 0.43,
AUC: 0.32), but its output is not a transitive graph, and as shown above its runtime is
quite slow.

To summarize, in this section we have empirically evaluated the algorithms pre-
sented in Section 3 on medium-sized graphs for which we can find an optimal solution.
Our main findings are as follows:

1. Finding the optimal solution for Exact-graph and Exact-forest results in
very similar graphs. This supports our assumption that entailment graphs
are approximately forest-reducible.

2. Running GNF and TNF also yields very similar graphs, illustrating that
using the FRG assumption in our re-attachment approximation scheme is
effective.

3. Our efficient approximation algorithm, TNF, is substantially faster than
running an ILP solver that exactly solves Max-Trans-Graph.

4. TNF results in only a slight decrease in the quality of learned graphs
compared with Exact-graph.

5. TNF, which respects the transitivity constraint, learns graphs of higher
quality than the local baseline, No-trans.

These findings lead us to believe that the algorithms presented in Section 3 can
scale to large entailment graphs. In the next section, we empirically evaluate our
methods on a much larger data set for which using ILP solvers is impractical.

5. Experimental Evaluation: Untyped Entailment Graphs

In this section we evaluate our methods on a graph with 20,000 nodes. Again, we
describe how the nodes V and the weighting function w are constructed.

5.1 Untyped Entailment Graphs

The nodes of the entailment graph we learn in this section are not typed. Although
ambiguity is a problem in this setting, we will show that transitivity constraints can
nevertheless improve results compared with a state-of-the-art local entailment classifier.

The nodes of the graph were generated from a set of two billion tuples of the form
(arg1, predicate, arg2) that were extracted by the Reverb open information extraction
system (Fader, Soderland, and Etzioni 2011) over the ClueWeb09 data set6 and were
generously provided by the authors of Reverb.

We performed some simple preprocessing over the extracted tuples. Each tuple is
accompanied by a confidence value, and we discarded tuples with confidence smaller
than 0.5. Next, we normalized predicates using a procedure that omits unnecessary
content such as modals and adverbs. The entire normalization procedure is publicly
available as part of the Reverb project.7 Last, we normalized arguments by replacing
pronouns and determiners by the tokens PRONOUN and DET. We also ran the BIU
number normalizer8 and replaced numbers larger than one by the token NUM.

6 http://lemurproject.org/clueweb09.php/.
7 https://github.com/knowitall/reverb-core/blob/master/core/src/main/java/edu/washington/cs/knowitall/normalization/RelationString.java.
After these three steps, we are left with a total of 960 million tuples of which
291 million are distinct. The number of distinct predicates in the set of extracted tuples
is 103,315. Because this is still a large number, we restrict the predicate set to the 10,000
predicates that appear with the highest number of pairs of arguments. As previously
explained, each predicate is split into two (for example, defeat is split into X defeat Y
and Y defeat X), and so the final number of entailment graph nodes is 20,000.

5.2 Training a Local Entailment Classifier

As with typed entailment graphs, the weighting function w is obtained by training a
classifier that provides a score sij for all pairs of predicates9 and defining wij = sij − λ.
Computing the scores sij involves the following steps:

1. Training set generation. We use the data set released by Zeichner, Berant, and
Dagan (2012), which contains 6,567 entailment rule applications annotated for their
validity by crowdsourcing. For example, the data set marks that The exercises alleviate
pain ⇒ The exercises help ease pain is a valid rule application, whereas Obama want to
boost the defense budget ⇒ Obama increase the defense budget is an invalid rule
application. We extract a single rule from each rule application; for example, from the
rule application The exercises alleviate pain ⇒ The exercises help ease pain we extract
the rule X alleviate Y ⇒ X help ease Y. We use half of the data set for training,
resulting in 1,224 positive examples and 2,060 negative examples. Another two
training examples are X unable to pay Y ⇒ X owe Y and X own Y ⇏ Y be sold to X.

2. Feature representation. Each pair of predicates (p1, p2) is represented by a feature
vector where the first six features are distributional similarity features identical to the
first six features described in Section 4.2. In addition, for pairs of predicates for which
at least one distributional similarity feature is non-zero, we add lexicographic features
computed from WordNet (Fellbaum 1998), VerbOcean (Chklovski and Pantel 2004),
and CatVar (Habash and Dorr 2003), as well as string-similarity features. Table 2
provides the exact details of these features. A feature is computed for a pair of
predicates (p1, p2), where in this context a predicate is a pair (pred, rev): pred is the
lexical realization of the predicate, and rev is a Boolean indicating whether arg1 is X
and arg2 is Y, or vice versa. Overall, each pair of predicates is represented by
27 features.

3. Training. After obtaining a feature representation for every pair of predicates, we
train a Gaussian kernel SVM classifier that optimizes F1 (SVMperf implementation
[Joachims 2005]), and tune the parameters C and γ by a grid search combined with
5-fold cross validation. We use the trained local classifier to compute a score sij for all
pairs of predicates.

8 http://u.cs.biu.ac.il/~nlp/downloads/normalizer.html.
9 We do not need to run the classifier on 10,000^2 pairs of predicates, because for the majority of
predicate pairs the features are all zeros, and for this set of pairs we can run the classifier only once.

Table 2
Definition of lexicographic and string-similarity features.
We denote by the string pred the lexical realization of the predicate, and by the Boolean
indicator rev whether arg1 is X and arg2 is Y, or vice versa. Normalized edit-distance is
edit-distance divided by the sum of the lengths of pred1 and pred2.

Name | Type | Source | Description
synonym | binary | WordNet | rev1 = rev2 ∧ pred1 is a synonym of pred2.
loose synonym | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is a synonym of w2.
hypernym | binary | WordNet | rev1 = rev2 ∧ pred1 is a hypernym of pred2 at distance ≤ 2.
loose hypernym | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is a hypernym of w2 at distance ≤ 2.
hyponym | binary | WordNet | rev1 = rev2 ∧ pred1 is a hyponym of pred2 at distance ≤ 2.
loose hyponym | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is a hyponym of w2 at distance ≤ 2.
co-hyponym | binary | WordNet | rev1 = rev2 ∧ pred1 is a co-hyponym of pred2 at distance ≤ 2.
loose co-hyponym | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is a co-hyponym of w2 at distance ≤ 2.
entailment | binary | WordNet | rev1 = rev2 ∧ pred1 verb-entails pred2 (distance ≤ 1).
loose entailment | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 verb-entails w2 (distance ≤ 1).
stronger | binary | VerbOcean | rev1 = rev2 ∧ pred1 is stronger-than pred2.
loose stronger | binary | VerbOcean | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is stronger-than w2.
rev-stronger | binary | VerbOcean | rev1 = rev2 ∧ pred2 is stronger-than pred1.
loose rev-stronger | binary | VerbOcean | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w2 is stronger-than w1.
CatVar | binary | CatVar | pred1 and pred2 contain a pair of content words (w1, w2) that are either identical or derivationally-related in CatVar.
remove word | binary | String | rev1 = rev2 ∧ removing a single word from pred1 will result in pred2.
add word | binary | String | rev1 = rev2 ∧ adding a single word to pred1 will result in pred2.
remove adj. | binary | String | rev1 = rev2 ∧ removing a single adjective from pred1 will result in pred2.
add adj. | binary | String | rev1 = rev2 ∧ adding a single adjective to pred1 will result in pred2.
Edit | real | String | if rev1 = rev2, the normalized edit-distance between pred1 and pred2, otherwise 1.
Reverse | binary | String | rev1 = rev2.

4. Prior knowledge. We automatically generate local constraints for pairs of predicates
for which we know with high certainty whether the first entails the second or not. We
define and compute constraints over pairs of predicates (p1, p2), where again a
predicate p is a pair (pred, rev). We start with the negative local constraints (examples
in Table 3):

(a) cousin: (p1, p2) are cousins if rev1 = rev2 and pred1 = pred2, except for a single
pair of words w1 and w2, which are cousins in WordNet—that is, they have a common
hypernym at distance 2.

(b) indirect hypernym: (p1, p2) are indirect hypernyms if rev1 = rev2 and
pred1 = pred2, except for a single pair of words w1 and w2, and w1 is a hypernym at
distance 2 of w2 in WordNet.
Table 3
Examples for negative local constraints. The column ⇒/⇔ indicates whether the constraint is
directional or symmetric.

Name | ⇒/⇔ | Example
cousin | ⇔ | X beat Y ⇏ X fix Y
indirect hypernym | ⇒ | X give all kind of Y ⇏ X sell all kind of Y
antonym | ⇔ | X announce death of Y ⇏ X announce birth of Y
uncertainty implication | ⇒ | X want to analyze Y ⇏ X analyze Y
negation | ⇔ | X do no harm to Y ⇏ X do harm to Y
transitive opposite | ⇔ | X abandon Y ⇏ Y abandon X

(c) antonym: (p1, p2) are antonyms if rev1 = rev2 and pred1 = pred2, except for a
single pair of words w1 and w2, and w1 is an antonym of w2 in WordNet.

(d) uncertainty implication: holds for (p1, p2) if rev1 = rev2 and concatenating the
words want to to pred2 results in pred1.

(e) negation: negation holds for (p1, p2) and (p2, p1) if rev1 = rev2 and, in addition,
removing a single negation word (not, no, never, or nt) from pred1 results in pred2.
Negation also holds for (p1, p2) and (p2, p1) if rev1 = rev2 and, in addition, replacing
the word no in pred1 with the word a results in pred2.

(f) transitive opposites: (p1, p2) are transitive opposites if pred1 = pred2 and
rev1 ≠ rev2 and pred1 is a transitive verb in VerbNet.

Next, we define the positive local constraints (examples in Table 4):

(a) determiner: the determiner constraint holds for (p1, p2) and (p2, p1) if
rev1 = rev2 and omitting a determiner (a or the) from pred1 results in pred2.

(b) positive implication: positive implication holds for (p1, p2) if rev1 = rev2 and
concatenating the words manage to, start to, start, decide to, or begin to to pred2
results in pred1.

Table 4
Examples for positive local constraints. The column ⇒/⇔ indicates whether the constraint is
directional or symmetric.

Name | ⇒/⇔ | Example
determiner | ⇔ | X be the answer to Y ⇒ X be answer to Y
positive implication | ⇒ | X start to doubt Y ⇒ X doubt Y
passive-active | ⇔ | X command Y ⇒ Y be commanded by X
In this experiment the graph is too large to use an ILP solver and so we initialize TNF and TNCF with a simple heuristic we describe next. We begin with an empty graph and first sort all pairs of predicates (i, j) for which wij > 0, according to their weight w. Entonces, we go over predicate pairs one-by-one and
perform two operations. First, we verify that inserting (i, j) into the graph does not
violate the FRG assumption. This is done by going over all edges (i, k) ∈ E and checking
that for every k either (j, k) ∈ E or (k, j) ∈ E. If this is the case, then the edge (i, j) is a valid
candidate edge, as the resulting reduced graph will remain a directed forest; otherwise,
adding it would result in i having two parents in the reduced graph, and we do not insert
this edge. Second, we compute the transitive closure Tij, which contains all node pairs
that must become edges in the graph if we insert (i, j), due to transitivity. We compute
the change in the value of the objective function if we were to insert Tij into the graph.
If this change improves the objective function value, we insert Tij into the graph (and
so the value of the objective function increases monotonically). We keep going over
the candidate edges from highest score to lowest until the objective function can no
longer be improved by inserting a candidate edge. We term this initialization HTL-FRG,
because we scan edges from high-score to low-score and maintain an FRG. We add this
initialization procedure as another baseline.
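A compact sketch of HTL-FRG follows; keeps_frg and forced_closure are assumed
helper functions standing for the FRG check and the transitivity-forced edge set
described above:

# Scan candidate edges (w_ij > 0) from highest to lowest weight; insert an edge
# only if the reduced graph stays a forest and the edges forced by transitivity
# improve the objective.
def htl_frg(candidates, w, graph):
    for (i, j) in sorted(candidates, key=lambda e: w[e], reverse=True):
        if not keeps_frg(graph, i, j):
            continue  # insertion would give i two parents in the reduced graph
        closure = forced_closure(graph, i, j)  # node pairs forced by transitivity
        gain = sum(w[e] for e in closure if e not in graph.edges)
        if gain > 0:
            graph.edges.update(closure)  # the objective increases monotonically
    return graph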

Figure 11 presents the precision-recall curve of all global algorithms compared with
the local classifier for 0.05 ≤ λ ≤ 2.5. One evident property is that No-trans reaches
higher recall values than the global algorithms, which means that adding a global
transitivity constraint prevents adding correct edges that have positive weight, since
this would cause the addition of many other edges that have negative weight. A possible
reason for this is predicate ambiguity, where positive-weight edges connect connected
components through an ambiguous predicate. This suggests that a natural direction for
future research is to develop algorithms that do not impose transitivity constraints for
rules i ⇒ j and j ⇒ k if each rule refers to a different meaning of j. One possible direction
is to learn latent “sense” or “topic” variables for each rule (as suggested by Melamud
et al. [2013]). Then, we can use the methods presented in this article


Figure 11
Precision-recall curve comparing global algorithms to the local algorithm.

to impose transitivity constraints for each sense separately, and easily allow violations
of transitivity across different senses. We provide concrete examples for such cases of
ambiguity in our qualitative analysis in Section 5.4.

Although global algorithms are limited in their recall, for recall values of 0.1–
0.2 they substantially improve precision over No-trans. To better observe this result,
Table 5 presents the results for the recall range 0.1–0.25. Comparing TNCF and No-
trans shows that the TNCF algorithm improves over No-trans by 5–10 precision points:

Table 5
Recall, precision, F1, and the number of learned rules for No-trans and several global algorithms
for values of the sparseness parameter λ for which recall is in the range 0.1–0.25.

method   | λ    | recall | precision | F1   | # of rules
No-trans | 1    | 0.1    | 0.75      | 0.18 | 39,872
No-trans | 0.8  | 0.14   | 0.7       | 0.24 | 57,276
No-trans | 0.6  | 0.17   | 0.68      | 0.27 | 65,232
No-trans | 0.4  | 0.19   | 0.7       | 0.3  | 77,936
No-trans | 0.2  | 0.24   | 0.65      | 0.35 | 116,832
HTL-FRG  | 0.2  | 0.09   | 0.78      | 0.17 | 48,986
HTL-FRG  | 0.15 | 0.1    | 0.78      | 0.18 | 55,756
HTL-FRG  | 0.1  | 0.11   | 0.77      | 0.19 | 65,802
HTL-FRG  | 0.05 | 0.12   | 0.76      | 0.21 | 84,380
TNF      | 0.2  | 0.11   | 0.77      | 0.19 | 61,186
TNF      | 0.15 | 0.11   | 0.72      | 0.19 | 71,674
TNF      | 0.1  | 0.14   | 0.72      | 0.23 | 88,304
TNF      | 0.05 | 0.15   | 0.72      | 0.25 | 127,826
TNCF     | 0.2  | 0.12   | 0.79      | 0.21 | 66,240
TNCF     | 0.15 | 0.13   | 0.75      | 0.23 | 80,450
TNCF     | 0.1  | 0.16   | 0.8       | 0.27 | 102,565
TNCF     | 0.05 | 0.18   | 0.74      | 0.29 | 156,208

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
1
2
2
4
9
1
8
0
4
8
7
2
/
C
oh

yo
i

_
a
_
0
0
2
2
0
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

0.00.10.20.30.40.60.70.80.91.0recallprecisionlllllllllllllllNo−transHTL−FRGTNFTNCF

Berant et al.

Efficient Global Learning of Entailment Graphs

In the recall range of 0.1–0.2, the precision of No-trans is 0.68–0.75, whereas the precision
of TNCF is 0.74–0.8. The number of rules learned by TNCF in this recall range is
about 100,000.

We use HTL-FRG as an initialization method for TNCF; on its own, HTL-FRG is not a
very good approximation algorithm. Comparing HTL-FRG with No-trans, we observe
that it is unable to reach high recall values and it only marginally improves precision
compared with No-trans. Applying TNF over HTL-FRG also provides disappointing
results—it increases recall compared with HTL-FRG, but precision is not better than
No-trans. Indeed, it seems that although performing node re-attachments monotoni-
cally improves the objective function value, it is not powerful enough to learn a graph
that is better with respect to our gold standard. On the other hand, adding component
re-attachments (TNCF) allows the algorithm to better explore the space of graphs and
learn a graph with higher recall and precision than HTL-FRG.

Figure 12 presents the runtime for HTL-FRG, TNF, and TNCF (y-axis in loga-
rithmic scale), all of which yield an FRG and can be run on a large untyped graph.
Again, we were unable to obtain results with Exact-graph, Exact-forest, LP-relax, and
GNF, because using an ILP solver on a graph with 20,000 nodes was impossible. As
expected, HTL-FRG is much faster than both TNF and TNCF, and TNCF is somewhat
slower than TNF. However, as mentioned earlier, TNCF is able to improve both pre-
cision and recall compared with TNF and HTL-FRG. Note that when λ = 1.2, runtime
increases suddenly for both TNF and TNCF. This is because at this point a large con-
nected component is formed, as we explain next.

Figure 13 presents the results of graph decomposition as described in Section 3.1.
Evidently, when λ = 0 the largest component constitutes almost all of the graph, which
contains 20,000 nodes. Note that when λ = 1.2, the size of the largest component in-
creases suddenly to more than half of the graph nodes. Moreover, TNCF obtained
recall of 0.1–0.2 when λ ≤ 0.2, and in this case the number of nodes in the largest

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
1
2
2
4
9
1
8
0
4
8
7
2
/
C
oh

yo
i

_
a
_
0
0
2
2
0
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

279

Figure 12
Runtime in seconds for various −λ values for global algorithms.



Figure 13
The number of nodes in the largest component as a function of the sparseness prior −λ.

component is 90% of the total number of graph nodes. Thus, contrary to the typed
graphs, untyped graphs that contain ambiguous predicates do not decompose well into
small components.

To summarize, we demonstrated that our efficient global methods scale to a graph
with 20,000 nodes, and observed that even when predicates are ambiguous we are able
to improve precision for moderate values of recall. However, there are indications
that our modeling assumptions are violated when applied over a large graph with
ambiguous predicates. Transitivity and FRG constraints limit the recall we are able to
reach, and the graph does not decompose very well into components. Thus, in future
work we would like to apply our algorithms over graphs where predicate ambiguity is
modeled and so transitivity constraints can be properly applied.

5.4 Qualitative Analysis

In the previous section we quantitatively evaluated global methods for learning entail-
ment graphs over a large set of untyped predicates. Our results revealed interesting
issues that we would like to further investigate:

1. Why do methods that assume entailment graphs are transitive and
forest-reducible have limited recall?

2. How much does node re-attachment affect graph structure?

In this section, we try to answer these questions by performing a qualitative analysis
that allows us to gain better insight into the behavior of our algorithms.

Question 1: Transitivity and FRG assumptions. We would like to understand why using
global algorithms such as TNCF results in limited recall. Our hypothesis is that be-
cause predicates in the graph are untyped and potentially ambiguous, violations of the


assumptions are possible, which results in recall reduction. To examine this hypothesis,
we perform two qualitative analyses:

1. We analyze graphs containing only gold-standard edges, and manually

identify cases where the modeling assumptions are violated.

2. We compare the local algorithm, No-trans, to the global algorithm, TNCF.
If our hypothesis is correct, then we expect TNCF to remove correct edges
that are inserted by No-trans due to ambiguous predicates.

We start by constructing a graph with all edges from our gold-standard data set
and search for transitivity and FRG violations caused by ambiguous predicates. Note
that the manually annotated gold-standard edges are only a subset of the full set of
correct edges, which is actually much larger. Thus, we cannot automatically identify
violations, because the graph is missing many true edges. Instead, we manually go
over candidate violations and check whether each case is indeed a result of a violation
of our modeling assumptions; if so, then our global algorithm cannot predict the correct
set of edges.

One example is the pair of rules seek ⇒ apply for and apply for ⇒ file for. The first
rule was annotated in the rule application Others seek medical care ⇒ Others apply for
medical care (recall that crowdsourcing was used to annotate rule applications, and rules
were extracted from these applications). The second rule was extracted from the rule
application Students apply for more than one award ⇒ Students file for more than one award.
This causes a transitivity violation because Students seek more than one award ⇏ Students
file for more than one award. Evidently, the meaning of the predicate apply for is context-
dependent. Another example is the pair of rules contain ⇒ supply and supply ⇒ serve,
annotated in the applications The page contains links ⇒ The page supplies links and Users
supply information or material ⇒ Users serve information or material. Clearly, contain ⇏ serve,
and the transitivity violation is caused by the fact that the predicate supply is context-
dependent—in the first context the subject is inanimate, whereas in the second context
the subject is human.

A good example of an FRG violation is the predicate come from. The following
three entailment rules are annotated by the turkers: come from ⇒ be raised in, come
from ⇒ be derived from, and come from ⇒ come out of. These correspond to the following
rule applications: The Messiah comes from Judah ⇒ The Messiah was raised in Judah, The
colors come from the sun light ⇒ The colors are derived from the sun light, and The truth
comes from the book ⇒ The truth comes out of the book. Clearly, be raised in ⇏ be derived
from and be derived from ⇏ be raised in, so this is an FRG violation. Indeed, the three
applications correspond to different meanings of the predicate come from, which depend
on context. A second example is the pair of rules be dedicated to ⇒ be committed to and be
dedicated to ⇒ contain information on. Again, this is an FRG violation because be committed
to ⇏ contain information on and contain information on ⇏ be committed to. This violation
occurs because of the ambiguity of the predicate be dedicated to, which can be resolved
by knowing whether the subject is human or is an object that carries information (for
example, The web site is dedicated to the life of lizards).

This first analysis revealed particular cases where the transitivity and FRG as-
sumptions do not hold. In the next analysis, we would like to directly compare cases
where global and local methods disagree with one another. We hypothesize that because
predicates are not disambiguated, global algorithms will delete correct edges due


to an ambiguous predicate. We set the sparseness parameter λ = 0.2 and compare the
set of edges learned by No-trans to the set of edges learned by TNCF.

Examining the disagreements, we observe that, as expected, in 90% of the cases
TNCF deletes an edge that was inserted by No-trans. We divide these deleted edges
into two categories in the following manner: we compute the weakly connected com-
ponents of the graph learned by TNCF,10 and then for each edge inserted by No-trans
and deleted by TNCF we check whether it connects two different weakly connected
components or not. There are two reasons for focusing on this property of deleted edges.
First, we hypothesize that most deleted edges will connect two different connected com-
ponents, because such edges result in many violations of transitivity. Second, weakly
connected components usually correspond to separate semantic meanings, and therefore
a correct edge that connects two weakly connected components probably involves an
ambiguous predicate.
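A sketch of this analysis using the networkx library (the input names are assumptions
of the example):

import networkx as nx

# Fraction of the edges inserted by No-trans and deleted by TNCF that cross
# weakly connected components of the graph learned by TNCF (a DiGraph).
def cross_component_fraction(tncf_graph, deleted_edges):
    comp_of = {}
    for cid, comp in enumerate(nx.weakly_connected_components(tncf_graph)):
        for node in comp:
            comp_of[node] = cid
    crossing = sum(1 for (i, j) in deleted_edges if comp_of[i] != comp_of[j])
    return crossing / len(deleted_edges)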

Indeed, of all edges where No-trans inserts an edge and TNCF does not, 76.4%
connect weakly connected components in the global graph. This shows that deleting
such edges is the main cause of disagreement between TNCF and No-trans. Now, we
can take a closer look at these edges, and check whether indeed when a correct edge is
deleted, it is because of an ambiguous predicate.

We randomly sample five cases where TNCF erroneously deleted an edge connect-
ing weakly connected components—these are cases where ambiguity is likely to occur.
For comparison, we also sample five cases where TNCF was right and No-trans erred.
Table 6 shows the samples where No-trans is correct. The first column describes the rule
and the second column specifies the score sij provided by the local classifier No-trans.
The last two columns detail two examples of predicates that are in the same weakly
connected component as the rule's LHS and RHS in the graph learned by TNCF. These
two columns provide a flavor of the overall meaning of predicates that belong to each
component. Table 7 is analogous and shows the samples where TNCF is correct.

Example 1 in Table 6 already demonstrates the problem of ambiguity. The LHS for
this rule application in the crowdsourcing experiment was your parents turn off comments,
and indeed in this case the RHS your parents cut off comments is inferred. However,
we can see that the LHS component revolves around the turning off or shutting off of
appliances, for example, whereas the RHS component deals with a more explicit and
physical meaning of cutting. This is because the predicate X cut off Y is ambiguous,
and consequently TNCF omits this correct edge. Example 5 also demonstrates well the
problem of ambiguity—the LHS of this rule application is This site was put by fans, which
in this context entails the RHS This site was run by fans. However, the more common
sense of be put by does not entail the predicate be run by; this sense is captured by the
predicates in the LHS component Y help put X and X be placed by Y. Example 4 is another
good example where the ambiguous predicate is X be open to Y—the RHS component is
concerned with the physical state of being opened, whereas the LHS component has a
more abstract sense of availability. The LHS of the rule application in this case was The
services are available to museums. Examples 2 and 3 are cases where the problem is not in
predicate ambiguity but in the connected component itself. Thus, of five randomly
sampled examples where TNCF deleted an edge that connects two weakly connected
components, three can be attributed to predicate ambiguity.

10 A weakly connected component in a directed graph is a set of nodes S such that for every pair of nodes
u, v ∈ S there is a path from u to v and from v to u when edge directions are ignored.


Table 6
Correct edges inserted by No-trans and omitted by TNCF that connect weakly connected
components in TNCF. An explanation of each column is provided in the body of the section.

Example | Rule                                    | sij  | LHS component                         | RHS component
1       | X turn off Y ⇒ X cut off Y              | 1.32 | X shut off Y, X cut on Y              | X chop Y, X slice Y
2       | X be arrested in Y ⇒ X be captured in Y | 0.27 | X be killed on Y, X be arrested on Y  | X be delivered in Y, X be provided in Y
3       | X die of Y ⇒ X be diagnosed with Y      | 0.88 | X gain on Y, X run on Y               | X be treated for Y, X have symptom of Y
4       | X be available to Y ⇒ X be open to Y    | 0.27 | X be offered to Y, X be provided to Y | X open for Y, X open up for Y
5       | X be put by Y ⇒ X be run by Y           | 0.41 | Y help put X, X be placed by Y        | Y operate X, Y control X

Table 7
Erroneous edges inserted by No-trans and omitted by TNCF that connect weakly connected
components in TNCF. An explanation of each column is provided in the body of the section.

Example | Rule                                | sij  | LHS component                          | RHS component
1       | X want to see Y ⇒ X want to help Y  | 0.83 | X need to visit Y, X need to see Y     | X be a supporter of Y, X need to support Y
2       | X cut Y ⇒ X fix Y                   | 0.88 | X chop Y, X slice Y                    | X cure Y, X mend Y
3       | X share Y ⇒ X understand Y          | 0.62 | X partake in Y, Y be shared by X       | Y be recognized by X, X realize Y
4       | X build Y ⇒ X rebuild Y             | 1.42 | X construct Y, Y be manufactured by X  | X regenerate Y, X restore Y
5       | X measure Y ⇒ X weigh Y             | 0.65 | X quantify Y, X be a measure of Y      | X count for Y, X appear Y

Table 7 demonstrates cases where TNCF correctly omitted an edge and conse-
quently decomposed a weakly connected component into two. These are classical cases
where the global constraint of transitivity helps the algorithm avoid errors. In Exam-
ple 1, although sij = 0.83 (which is higher than λ = 0.2), the edge is not inserted because
this would cause the addition of wrong edges from the LHS component, which deals
with seeing and visiting, to the RHS component, which is concerned with help and
support. Example 2 is a case of a pair of predicates that are co-hyponyms or perhaps
even antonyms. Transitivity helps TNCF avoid this erroneous rule.

To conclude, our analysis reveals that applying structural constraints encourages
global algorithms to omit edges. Although this is often desirable, it can also prevent cor-
rect rules from being discovered because of problems of ambiguity. Thus, as discussed
in Section 5.3, we believe that an important direction for future research is to apply our
global graph learning methods over context-sensitive representations (such as the one
presented by Melamud et al. [2013]).

Question 2: Node re-attachment. We would like to understand the effect of applying TNF
over the entailment graph, or more precisely, check whether TNF substantially alters the
set of graph edges compared with an initialization with HTL-FRG. First, we run both


Figure 14
A fragment from the graphs learned by HTL-FRG (black solid edges) and TNF (red
dashed edges).

HTL-FRG and TNF with sparseness parameter λ = 0.2 and compute the proportion
of edges in their intersection. HTL-FRG learns 48,986 edges and TNF learns
61,186 edges. The number of edges in the intersection is 35,636. This means that 27%
of the edges learned by HTL-FRG were removed (13,350 edges) and, in addition, 25,550
edges were added to the graph. Clearly, this indicates that TNF changes the learned
graph substantially.
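The comparison itself is straightforward set arithmetic, as in this sketch:

# Given the two learned edge sets, count shared, removed, and added edges.
def overlap(htl_frg_edges, tnf_edges):
    shared = htl_frg_edges & tnf_edges
    removed = len(htl_frg_edges) - len(shared)  # HTL-FRG edges dropped by TNF
    added = len(tnf_edges) - len(shared)        # edges newly introduced by TNF
    return len(shared), removed, added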

We exemplify this with a few fragments of graphs learned by HTL-FRG and TNF.
Figure 14 shows 11 predicates,11 where the black solid edges correspond to HTL-FRG
edges and red dashed edges correspond to TNF edges. Nodes that contain more than a
single predicate represent an agreement between TNF and HTL-FRG that all predicates
in the node entail one another. Clearly, TNF substantially changes the initialization
generated by HTL-FRG. HTL-FRG creates an erroneous component in which give a lot
of and generate a lot of are synonymous and mean a lot of entails them. TNF disassembles
this component and determines that give a lot of is equivalent to offer plenty of, whereas
generate a lot of entails cause of. Moreover, TNF disconnects the predicate mean a lot of
completely. TNF also identifies that the predicates offer a wealth of, provide a wealth of,
provide a lot of, and provide plenty of entail offer plenty of. Of course, TNF sometimes causes
errors—for example, HTL-FRG decided that cause and cause of are equivalent, but TNF
deleted the correct entailment rule cause of ⇒ cause.

Figure 15 provides another interesting example of an improvement in the graph
due to the application of TNF. In the initialization, HTL-FRG determined that the
predicate encrypt entails the predicates convert and convince rather than the predicates
code and encode. This is because the local score provided by the classifier for (encrypt,
convert) is 0.997—slightly higher than the local score for (encrypt, encode), which is 0.995.
Therefore, HTL-FRG inserts the wrong edge encrypt ⇒ convert and is then unable to
insert the correct rule encrypt ⇒ encode because of the FRG assumption. After HTL-FRG
terminates, TNF goes over the nodes one by one and is able to determine that overall
the predicate encrypt fits better as a synonym of the predicates code and encode, and re-
attaches it in the correct location. As an aside, notice that both HTL-FRG and TNF are
able to identify in this case the correct directionality of the rule compress ⇒ encode. The

11 We do not explicitly write the X and Y variables for brevity, and thus we focus only on predicates where

the X variable is a subject.


Figure 15
A fragment from the graphs learned by HTL-FRG (black solid edges) and TNF (red dashed edges).

Figure 16
A fragment from the graphs learned by HTL-FRG (black solid edges) and TNF (red dashed edges).

local score given by the classifier for (compress, encode) es 0.65, whereas the local score for
(encode, compress) is –0.11.

As a last example, Figure 16 demonstrates again the problem of predicate ambiguity. The predicate depress can occur in two contexts that, though related, give rise to different entailment relations. The first context is when the object of the predicate (the Y argument) is a person, and then the meaning of depress is similar to discourage. The second context is when the object is some phenomenon, for example, The serious economic situation depresses consumption levels. In the initialization, HTL-FRG generates a set of edges that is compatible with the latter context, determining that the predicate depress entails
the predicates lower and bring down. TNF re-attaches depress as a synonym of discourage
and deter (because this improves the objective function), resulting in a set of edges
that corresponds to the first meaning of depress mentioned above. The methods we
use in this article do not allow learning the rules for both contexts because this would
result in violations of the transitivity and FRG assumptions. This again emphasizes the
importance of learning graphs where predicates are marked by their various senses,
which will result in a model that can directly benefit from the methods suggested in
this article.

6. Conclusions

The problem of language variability is at the heart of many semantic applications such
as Information Extraction, Question Answering, Semantic Parsing, and more. Consequently, learning broad coverage knowledge bases of entailment rules and paraphrases
(Ganitkevitch, Van Durme, and Callison-Burch 2013) has proved crucial for systems
that perform inference over textual representations (Stern and Dagan 2012; Angeli and
Manning 2014).


In this article, we have presented algorithms for learning entailment rules between
predicates that can scale to large sets of predicates. Our work builds on prior work by
Berant, Dagan, and Goldberger (2012), who defined the concept of entailment graphs and formulated entailment rule learning as a graph optimization problem, where the
graph has to obey the structural constraint of transitivity.

Our main contribution is a heuristic polynomial approximation algorithm that can
learn entailment graphs containing tens of thousands of nodes. The algorithm is based
on two main observations: first, that entailment graphs are sparse, and second, that
entailment graphs are forest-reducible. This leads to an efficient algorithm in which at
the first step the graph is decomposed into smaller components, and in the second step
we find a set of edges for each component. The second step is an iterative approximation
algorithm in which at each iteration a node or component of the graph is deleted and
then re-attached in a way that improves the overall objective function.
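
The following Python sketch is our schematic paraphrase of this iterative step, not the authors' implementation; objective and reattachments are hypothetical helpers standing in for the paper's objective function and for the FRG-preserving re-attachment moves:

    def iterative_improvement(graph, nodes, objective, reattachments):
        improved = True
        while improved:                # repeat until no move improves the objective
            improved = False
            for v in nodes:
                best, best_score = None, objective(graph)
                for candidate in reattachments(graph, v):  # delete v, re-attach elsewhere
                    score = objective(candidate)
                    if score > best_score:
                        best, best_score = candidate, score
                if best is not None:   # greedily keep the best improving re-attachment
                    graph, improved = best, True
        return graph

Because every accepted move strictly increases the objective, the loop terminates; this mirrors the hill-climbing character of the re-attachment procedure described above.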

We performed a thorough empirical evaluation, in which we ran our algorithm on
two separate data sets. On the first data set we witnessed a substantial improvement
in runtime at a relatively small cost in performance. On the second data set, we showed that our method scales to an untyped entailment graph containing 20,000 nodes, much larger than was previously possible using state-of-the-art global methods. We saw that using a global algorithm improves precision for modest values of recall, but that local methods can achieve higher recall than global methods. We performed an analysis and concluded that the main reason for this limitation is predicate ambiguity, which results
in graphs that do not adhere to the structural constraints of transitivity and forest
reducibility.

In future work, we hope to address the problem of predicate ambiguity. A natural
direction is to combine methods that model the context-sensitivity of predicates with
global structural assumptions. In this way, one can apply structural constraints for
a set of predicates, given that they appear in similar contexts. A second important
direction for future work is to demonstrate that using entailment rules learned by our
method improves performance in downstream semantic applications such as QA, IE,
and Recognizing Textual Entailment.

Apéndice A. Max-Trans-Forest Is NP-hard

In this appendix we show that the Max-Trans-Forest problem (8) is NP-hard. We prove that the following decision problem (which is clearly easier than Max-Trans-Forest) is NP-hard: given a set of nodes V, a function w : V × V → R, and an integer α, is there a transitive forest-reducible (FRG) subgraph G = (V, E) such that ∑e∈E w(e) ≥ α? In this appendix we use the standard terminology that for a directed edge i ⇒ j, node i is termed the parent of node j, and node j is termed the child of i. We start with a simple polynomial reduction from the following decision problem.

Max-Sub-FRG: Given a directed graph G = (V, E), a function w : E → Z+, and a positive integer α, is there a transitive FRG subgraph G′ = (V, E′) such that ∑e∈E′ w(e) ≥ α?

Max-Sub-FRG ≤p Max-Trans-Forest: Given an instance (G = (V, E), w, α) of Max-Sub-FRG, we construct the instance (G′ = (V, E′), w′, α) of Max-Trans-Forest such that w′(u, v) = w(u, v) if (u, v) ∈ E and −∞ otherwise. We need to show that (G = (V, E), w, α) ∈ Max-Sub-FRG iff (G′ = (V, E′), w′, α) ∈ Max-Trans-Forest. This is trivial: if there is a transitive FRG subgraph of G whose sum of edge weights is ≥ α, then choosing the same edges yields a transitive FRG subgraph of G′ whose sum of edge weights is ≥ α. Similarly, any


transitive FRG subgraph of G′ whose sum of edge weights is ≥ α cannot use any −∞ edges, and therefore the edges of this FRG are in E and define a subgraph of G whose sum of edge weights is ≥ α.
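
A minimal sketch of this weight extension in Python (our illustration; the paper defines w′ mathematically rather than in code):

    import math

    def extend_weights(nodes, w):
        # w maps the edges of E to their weights; every other ordered pair
        # gets -infinity, so no maximizing solution can afford to include it.
        return {(u, v): w.get((u, v), -math.inf)
                for u in nodes for v in nodes if u != v}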

Note that for Max-Sub-FRG, maximizing the sum of weights of edges in the sub-
graph is equivalent to minimizing the sum of weights of edges not in the subgraph. Nosotros
now denote by z the sum of weights of the edges deleted from the graph.

Next we show a polynomial reduction from the NP-hard Exact Cover by 3-sets
(X3C) problema (Garey and Johnson 1979) to Max-Sub-FRG. The X3C is defined as
follows. Given a set X of size 3n, and m subsets, S1, S2, .., Sm of X, each of size 3, decide
if there is a collection of n Si’s whose union covers X.

X3C ≤p Max-Sub-FRG: Given an instance (X, S) of the X3C problem, we construct an instance (G = (V, E), w, z) of the Max-Sub-FRG problem as follows (an illustration of the construction is given in Figure A.1, and a code sketch follows the list below). First, we construct the vertices V: vertices x1, . . . , x3n corresponding to the points of X, m vertices s1, . . . , sm corresponding to the subsets in S, m additional vertices t1, t2, . . . , tm, and one more vertex a. Next we construct the edges E and the weight function w : E → Z+.

1. For each 1 ≤ i ≤ m, we add an edge (ti, si) of weight 2.

2. For each 1 ≤ i ≤ m, we add an edge (a, si) of weight 1.

3. For each 1 ≤ j ≤ 3n, we add an edge (a, xj) of weight M = 4m.

4. For each 1 ≤ i ≤ m, if si = {xp, xq, xr}, we add 3 edges of weight 1: (si, xp), (si, xq), and (si, xr).

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
1
2
2
4
9
1
8
0
4
8
7
2
/
C
oh

yo
i

_
a
_
0
0
2
2
0
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Figure A.1
The graph constructed given an input X of size 3n and S of size m. Each s ∈ S is a set of size 3.
In this example, s1 = {x1, x2, x3}, s2 = {x3, x5, x3n−1}, sm = {x4, x3n−1, x3n}.


To complete the construction of the Max-Sub-FRG problem, we define z = 4m − 2n. Nosotros
next prove that S has an exact 3-cover of X ⇔ there is a transitive FRG subgraph of G
such that the sum of weights deleted is no more than 4m − 2n.

(⇒) Assume there is an exact 3-cover of X by S. The FRG subgraph will consist of: n edges (a, si) for the n vertices si that cover X, the 3n edges (si, xj), and m − n edges (tf, sf) for the m − n vertices sf that do not participate in the cover. The transitive closure contains all edges (a, xj), and the rest of the edges are deleted: for every sf that is not part of the cover, the three edges (sf, xj) and the edge (a, sf) are deleted. In addition, the weight-2 edges (ti, si) for the si's that cover X are deleted. The total weight deleted is thus 3(m − n) + (m − n) + 2n = 4m − 2n. It is easy to verify that the subgraph is a transitive FRG: there are no strongly connected components of size > 1, and in the transitive reduction of G there is no node with more than one parent.

(⇐) Assume there is no exact 3-cover of X by S; we will show that any FRG subgraph must delete more than 4m − 2n weight. We cannot omit any edge (a, xj), as the weight of each such edge is too large. Thus all these edges are in the FRG, and each is either deleted during transitive reduction or not.

Assume first that all these edges are deleted in the transitive reduction, that is, for every xj there exists an si such that (a, si) and (si, xj) are in the forest. Since there is no collection of n subsets Si that cover X, there must be at least k > n such si's. Consequently, for these si's, the forest must not contain the edges (ti, si) (otherwise, si would have two parents and violate both the forest and the transitivity properties). For each of the m − k nodes sf with no edge (sf, xj) we can add either the edge (a, sf) or (tf, sf), but not both (otherwise, sf would have more than one parent). Since w(tf, sf) > w(a, sf), it is better to delete the (a, sf) edges. Hence, the total weight of deleted edges is 3m − 3n for the edges between si's and xj's, 2k for the edges (ti, si) of the si's that cover the xj's, and m − k for the edges (a, sf). The total weight deleted is 4m − 3n + k > 4m − 2n, since k > n.

Assume now that r > 0 edges (a, xj) are not deleted in the transitive reduction. This means that for these xj's there is no edge (si, xj) for any i (otherwise, xj would have more than one parent after transitive reduction). Hence, 3n − r of the xj's are covered by si's, and to cover them we need at least k ≥ ⌈n − r/3⌉ si's. As before, for these si's we also have the edges (a, si) and we delete the edges (ti, si), and for the m − k nodes sf that do not cover any xj it is best to add the edges (tf, sf) and to delete the edges (a, sf). So the weight deleted is 3m − (3n − r) for edges between si's and xj's, 2k for the edges (ti, si), and m − k for the edges (a, sf). Thus, the weight deleted is 4m − 3n + k + r ≥ 4m − 3n + ⌈n − r/3⌉ + r ≥ 4m − 2n + r − ⌊r/3⌋ > 4m − 2n. ∎

Because it is known that X3C is NP-hard, this polynomial reduction shows that

Max-Trans-Forest is also NP-hard.

References
Aho, Alfred V., Michael R. Garey, and

Jeffrey D. Ullman. 1972. The transitive
reduction of a directed graph. SIAM
Journal on Computing, 1(2):131–137.

Angeli, Gabor and Christopher D. Manning.
2014. NaturalLI: Natural logic inference for common sense reasoning. In Proceedings of
EMNLP, pages 534–545, Doha.
Banko, Michele, Michael Cafarella,

Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of IJCAI, pages 68–74, Hyderabad.
Ben Aharon, Roni, Idan Szpektor, and Ido Dagan. 2010. Generating entailment rules from FrameNet. In Proceedings of ACL, pages 241–246, Uppsala.

Berant, Jonathan, Ido Dagan, Meni Adler,
and Jacob Goldberger. 2012. Efficient
tree-based approximation for entailment

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
1
2
2
4
9
1
8
0
4
8
7
2
/
C
oh

yo
i

_
a
_
0
0
2
2
0
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Berant et al.

Efficient Global Learning of Entailment Graphs

graph learning. In Proceedings of ACL,
pages 117–125, Jeju Island.

Berant, Jonathan, Ido Dagan, and Jacob Goldberger. 2010. Global learning of focused entailment graphs. In Proceedings of ACL, pages 1,220–1,229, Uppsala.

Berant, Jonathan, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In Proceedings of ACL, pages 610–619, Portland, OR.

Berant, Jonathan, Ido Dagan, and Jacob Goldberger. 2012. Learning entailment relations by global graph structure optimization. Computational Linguistics, 38(1):73–111.

Berant, Jonathan and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of ACL, pages 1,415–1,425, Baltimore, MD.

Bhagat, Rahul, Patrick Pantel, and Eduard Hovy. 2007. LEDIR: An unsupervised algorithm for learning directionality of inference rules. In Proceedings of EMNLP-CoNLL, pages 161–170, Prague.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Budanitsky, Alexander and Graeme Hirst.

2006. Evaluating WordNet-based measures
of lexical semantic relatedness.
Ligüística computacional, 32(1):13–47.
Chambers, Nathanael and Dan Jurafsky. 2011. Template-based information extraction without the templates. In Proceedings of ACL, pages 976–986, Portland, OR.

Chang, Yin-Wen and Michael Collins. 2011. Exact decoding of phrase-based translation models through Lagrangian relaxation. In Proceedings of EMNLP, pages 26–37, Edinburgh.

Chklovski, Timothy and Patrick Pantel. 2004. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of EMNLP, pages 33–40, Barcelona.

Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2002. Introduction to Algorithms. The MIT Press.

Coyne, Bob and Owen Rambow. 2009. LexPar: A freely available English paraphrase lexicon automatically extracted from FrameNet. In Proceedings of IEEE International Conference on Semantic Computing, pages 53–58, Berkeley, CA.

Dagan, Ido, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Morgan and Claypool Publishers.

Dinu, Georgiana and Mirella Lapata. 2010.
Topic models for meaning similarity in
contexto. In Proceedings of COLING,
pages 250–258, Beijing.

Do, Quang and Dan Roth. 2010. Constraints based taxonomic relation classification. In Proceedings of EMNLP, pages 1,099–1,109, Cambridge, MA.

Dzeroski, Saso and Ivan Bratko. 1992. Handling noise in inductive logic programming. In Proceedings of the International Workshop on Inductive Logic Programming, pages 48–64, Tokyo.
Fader, Anthony, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of EMNLP, pages 1,535–1,545, Portland, OR.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.
Ganitkevitch, Juri, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of HLT-NAACL, pages 758–764, Atlanta, GA.

Garey, Michael R. and David S. Johnson.

1979. Computers and Intractability: A Guide
to the Theory of NP-Completeness. W. H. Freeman.

Habash, Nizar and Bonnie Dorr. 2003. A

categorial variation database for English.
In Proceedings of NAACL-HLT, pages 17–23,
Edmonton.

Harris, Zellig. 1954. Distributional structure. Word, 10(2–3):146–162.

Hearst, Marti. 1992. Automatic acquisition of

hyponyms from large text corpora. In
Proceedings of COLING, pages 23–28,
Nantes.

Joachims, Thorsten. 2005. A support vector
method for multivariate performance
measures. In Proceedings of ICML,
pages 377–384, Bonn.

Kelley, J. E. 1960. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8:703–712.

Kipper, Karin, Hoa T. Dang, and Martha Palmer. 2000. Class-based construction of a verb lexicon. In Proceedings of AAAI, pages 691–696, Austin, TX.

Lin, Dekang. 1998. Automatic retrieval and clustering of similar words. In
Proceedings of COLING-ACL,
pages 768–774, Montréal.

Lin, Dekang and Patrick Pantel. 2001.

Discovery of inference rules for question


answering. Natural Language Engineering,
7(4):343–360.

Martins, Andre, Noah Smith, and Eric Xing.
2009. Concise integer linear programming
formulations for dependency parsing.
In Proceedings of ACL, pages 342–350,
Singapore.

McCreath, Eric and Arun Sharma. 1997. ILP

with noise and fixed example size: a
Bayesian approach. In Proceedings of IJCAI,
pages 1,310–1,315, Nagoya.

Melamud, Oren, Jonathan Berant, Ido Dagan,
Jacob Goldberger, and Idan Szpektor. 2013.
A two level model for context sensitive
inference rules. In Proceedings of ACL,
pages 1,331–1,340, Sofia.
Meyers, Adam, Ruth Reeves,

Catherine Macleod, Rachel Szekeley,
Veronika Zielinska, and Brian Young. 2004.
The cross-breeding of dictionaries. In
Proceedings of LREC, pages 1,095–1,098,
Lisbon.

Nakashole, Ndapandula, Gerhard Weikum, and Fabian M. Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In Proceedings of EMNLP-CoNLL, pages 1,135–1,145, Jeju Island.

Ó Séaghdha, Diarmuid. 2010. Latent variable models of selectional preference. In Proceedings of ACL, pages 435–444, Uppsala.

Pantel, Patrick, Rahul Bhagat, Bonaventura Coppola, Timothy Chklovski, and Eduard H. Hovy. 2007. ISP: Learning inferential selectional preferences. In Proceedings of NAACL-HLT, pages 564–571, Rochester, NY.

Pekar, Viktor. 2008. Discovery of event
entailment knowledge from text
corpus. Computer Speech & Idioma,
22(1):1–16.

Ravichandran, Deepak and Eduard H. Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of ACL, pages 41–47, Philadelphia, PA.

Reichart, Roi and Regina Barzilay. 2012.

Multi-event extraction guided by global
constraints. In Proceedings of NAACL-HLT,
pages 70–79, Montréal.

Richens, Tom. 2008. Anomalies in the

WordNet verb hierarchy. In Proceedings of COLING, pages 729–736, Manchester.
Riedel, Sebastian and James Clarke. 2006.

Incremental integer linear programming
for non-projective dependency parsing. In
Proceedings of EMNLP, pages 129–137,
Sydney.


Ritter, Alan, Mausam, and Oren Etzioni.

2010. A latent Dirichlet allocation method
for selectional preferences. In Proceedings of ACL, pages 424–434, Uppsala.

Roth, Michael and Anette Frank. 2012.

Aligning predicates across monolingual
comparable texts using graph-based
clustering. In Proceedings of
EMNLP-CoNLL, pages 171–182, Jeju Island.

Rush, Alexander M., Roi Reichart, Michael Collins, and Amir Globerson. 2012. Improved parsing and POS tagging using inter-sentence consistency constraints. In Proceedings of EMNLP, pages 1,434–1,444,
Jeju Island.

Schoenmackers, Stefan, Jesse Davis, Oren Etzioni, and Daniel S. Weld. 2010. Learning first-order Horn clauses from web text. In Proceedings of EMNLP,
pages 1,088–1,098, Cambridge, MAMÁ.

Sekine, Satoshi. 2005. Automatic paraphrase
discovery based on context and keywords
between NE pairs. In Proceedings of IWP,
pages 80–87, Jeju Island.

Shinyama, Yusuke and Satoshi Sekine. 2006.
Preemptive information extraction using
unrestricted relation discovery. In Proceedings of HLT-NAACL, pages 304–311, New York, NY.

Snow, Rion, Daniel Jurafsky, y

Andrew Y. Ng. 2006. Semantic taxonomy
induction from heterogenous evidence. In Proceedings of ACL, pages 801–808, Sydney.

Sontag, David, Amir Globerson, and

Tommi Jaakkola. 2011. Introduction to
dual decomposition for inference. In Suvrit
Sra, Sebastian Nowozin, and Stephen J.
Wright, editors, Optimization for Machine Learning. MIT Press.

Stern, Asher and Ido Dagan. 2012. BIUTEE:

A modular open-source system
for recognizing textual entailment.
In Proceedings of ACL System Demonstrations, pages 73–78, Jeju Island.

Szpektor, Idan and Ido Dagan. 2008.

Learning entailment rules for unary
templates. In Proceedings of COLING,
pages 849–856, Manchester.

Szpektor, Idan and Ido Dagan. 2009.

Augmenting WordNet-based inference
with argument mapping. In Proceedings of TextInfer, pages 27–35, Singapore.

Szpektor, Idan, Hristo Tanev, Ido Dagan, y

Bonaventura Coppola. 2004. Scaling
web-based acquisition of entailment
relations. In Proceedings of EMNLP,
pages 41–48, Barcelona.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
1
2
2
4
9
1
8
0
4
8
7
2
/
C
oh

yo
i

_
a
_
0
0
2
2
0
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Berant et al.

Efficient Global Learning of Entailment Graphs

Wang, Mengqiu, Wanxiang Che, y

Christopher D. Manning. 2013. Effective
bilingual constraints for semi-supervised
learning of named entity recognizers. In
Proceedings of AAAI, pages 919–925,
Bellevue, WA.

Weeds, Julie and David Weir. 2003. A general
framework for distributional similarity. In
Proceedings of EMNLP, pages 81–88,
Sapporo.

Weisman, Hila, Jonathan Berant,

Idan Szpektor, and Ido Dagan. 2012.
Learning verb inference rules from

linguistically-motivated evidence. In
Proceedings of EMNLP, pages 194–204,
Jeju Island.

Yates, Alexander and Oren Etzioni. 2009.

Unsupervised methods for determining
object and relation synonyms on the web.
Journal of Artificial Intelligence Research,
34:255–296.

Zeichner, Naomi, Jonathan Berant, y
Ido Dagan. 2012. Crowdsourcing
inference-rule evaluation. In Proceedings
of ACL (short papers), pages 156–160,
Jeju Island.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
1
2
2
4
9
1
8
0
4
8
7
2
/
C
oh

yo
i

_
a
_
0
0
2
2
0
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

291
Descargar PDF