AMR Similarity Metrics from Principles
Juri Opitz and Letitia Parcalabescu and Anette Frank
Department for Computational Linguistics
Heidelberg University
69120 Heidelberg
opitz,parcalabescu,frank
@cl.uni-heidelberg.de
}
{
Abstrait
Different metrics have been proposed to com-
pare Abstract Meaning Representation (AMR)
graphs. The canonical SMATCH metric (Cai
and Knight, 2013) aligns the variables of two
graphs and assesses triple matches. The recent
SEMBLEU metric (Song and Gildea, 2019) est
based on the machine-translation metric BLEU
(Papineni et al., 2002) and increases compu-
tational efficiency by ablating the variable-
alignment. In this paper, je) we establish criteria
that enable researchers to perform a principled
assessment of metrics comparing meaning rep-
resentations like AMR; ii) we undertake a
thorough analysis of SMATCH and SEMBLEU
where we show that the latter exhibits some
undesirable properties. Par exemple, it does
not conform to the identity of indiscernibles
rule and introduces biases that are hard to
control; and iii) we propose a novel metric
S2
to only
very slight meaning deviations and targets the
fulfilment of all established criteria. We assess
its suitability and show its advantages over
SMATCH and SEMBLEU.
is more benevolent
MATCH that
1 Introduction
Proposed in 2013, the aim of Abstract Meaning
Representation (AMR) is to represent a sentence’s
meaning in a machine-readable graph format
(Banarescu et al., 2013). AMR graphs are rooted,
acyclic, directed, and edge-labeled. Entities, events,
properties, and states are represented as variables
that are linked to corresponding concepts (encoded
as leaf nodes) via is-instance relations (cf. Chiffre 1,
gauche). This structure allows us to capture complex
linguistic phenomena such as coreference, seman-
tic roles, or polarity.
When measuring the similarity between two
AMR graphs A and B, for instance for the pur-
pose of AMR parse quality evaluation, the metric
of choice is usually SMATCH (Cai and Knight,
2013). Its backbone is an alignment-search be-
522
tween the graphs’ variables. Recently, the SEMBLEU
metric (Song and Gildea, 2019) has been proposed
that operates on the basis of a variable-free AMR
(Chiffre 1, droite),1 converting it to a bag of k-grams.
Circumventing a variable alignment search redu-
ces computational cost and ensures full determi-
nacy. Aussi, grounding the metric in BLEU (Papineni
et coll., 2002) has a certain appeal, since BLEU is
quite popular in machine translation.
Cependant, we find that we are lacking a
principled in-depth comparison of the properties of
different AMR metrics that would help informing
researchers to answer questions such as: Which
metric should I use to assess the similarity of
two AMR graphs, par exemple., in AMR parser evaluation?
What are the trade-offs when choosing one metric
over the other? Besides providing criteria for such
a principled comparison, we discuss a property
that none of the existing AMR metrics currently
satisfies: They do not measure graded meaning
differences. Such differences may emerge because
of near-synonyms such as ruin – annihilate; skinny
– thin – slim; enemy – foe (Inkpen and Hirst,
2006; Edmonds and Hirst, 2002) or paraphrases
such as be able to – can; unclear – not clear. Dans
a classical syntactic parsing task, metrics do not
need to address this issue because input tokens are
typically projected to lexical concepts by lemma-
tization, hence two graphs for the same sentence
tend not to disagree on the concepts projected
in semantic
from the input. This is different
parsing where the projected concepts are often
more abstract.
This article is structured as follows: We first
establish seven principles that one may expect a
metric for comparing meaning representations to
1Most research papers on AMR display the graphs in this
‘‘shallow’’ form. This increases simplicity and readability.
(Lyu and Titov, 2018; Konstas et al., 2017; Zhang et al.,
2019; Damonte and Cohen, 2019; Song et al., 2016).
Transactions of the Association for Computational Linguistics, vol. 8, pp. 522–538, 2020. https://doi.org/10.1162/tacl a 00329
Action Editor: Adam Lopez. Submission batch: 11/2019; Revision batch: 3/2020; Published 9/2020.
2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
c
(cid:13)
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
je
un
c
_
un
_
0
0
3
2
9
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
the following constraint on metric:
[0, 1].2
II.
indiscernibles
identity of
D × D →
This focal
principle is formalized by metric(UN, B) = 1
⇔
A = B. It is violated if a metric assigns a value
indicating equivalence to inputs that are not
equivalent or if it considers equivalent inputs as
different.
III. symmetry In many cases, we want a metric
to be symmetric: metric(UN, B) = metric(B, UN).
A metric violates this principle if it assigns a
pair of objects different scores when argument
order is inverted. Together with principles I and
II, it extends the scope of the metric to usages
beyond parser evaluation, as it also enables sound
IAA calculation, clustering, and classification
of AMR graphs when we use the metric as a
kernel (par exemple., SVM). In parser evaluation, one may
dispense with any (fort)
requirements for
symmetry—however, the metric must then be
applied in a standardized way, with a fixed order
of arguments.
In cases where there is no defined reference,
the asymmetry could be handled by aggregating
metric(UN, B) and metric(B, UN), Par exemple,
using the mean. Cependant, it is open what aggre-
gation is best suited and how to interpret re-
sults, Par exemple, for metric(UN, B) = 0.1 et
metric(B, UN) = 0.9.
IV. determinacy Repeated calculation over the
same inputs should yield the same score. Ce
principle is clearly desirable as it ensures reprodu-
cibility (a very small deviation may be tolerable).
The next three principles we believe to be
desirable specifically when comparing meaning
representation graphs such as AMR (Banarescu
et coll., 2013). The first two of the following prin-
ciples are motivated by computer science and
linguistics, whereas the last one is motivated by a
linguistic and an engineering perspective.
V. no bias Meaning representations consist of
nodes and edges encoding specific information
les types. Unless explicitly justified, a metric should
not unjustifiably or in unintended ways favor
correctness or penalize errors for specific substruc-
photos (par exemple., leaf nodes). In case a metric favors or
penalizes certain substructures more than others,
in the interest of transparency, this should be made
clear and explicit, and should be easily verifiable
2At some places in this paper, due to conventions, nous
project this score onto [0,100] and speak of points.
Chiffre 1: A cat drinks water. Simplified AMR graph
and underlying deep form with is-instance relations
(—-je) from variables (solid) to concepts (dashed).
§
§
3). We then develop S2
satisfy, in order to obtain meaningful and appro-
2). Based on
priate scores for the given purpose (
these principles we provide an in-depth analysis
of the properties of the AMR metrics SMATCH
MATCH,
and SEMBLEU (
an extension of SMATCH that abstracts away from
a purely symbolic level, allowing for a graded
semantic comparison of atomic graph-elements
4). By this move, we enable SMATCH to take into
(
§
account fine-grained meaning differences. Nous
show that our proposed metric retains valuable
benefits of SMATCH, but at the same time is more
to slight meaning deviations. Notre
benevolent
code is available online at https://github.
com/Heidelberg-NLP/amr-metric-suite.
2 From Principles to AMR Metrics
The problem of comparing AMR graphs A, B
with respect
∈
to the meaning they express
D
occurs in several scenarios, Par exemple, parser
evaluation or inter-annotator agreement calcula-
tion (IAA). To measure the extent to which A
and B agree with each other, we need a metric:
D × D → R that returns a score reflecting meaning
distance or meaning similarity (for convenience,
we use similarity). Below we establish seven
principles that seem desirable for this metric.
2.1 Seven Metric Principles
The first four metric principles are mathemati-
cally motivated:
je. continuity, non-negativity and upper-bound
A similarity function should be continuous, avec
two natural edge cases: UN, B are equivalent (maxi-
mum similarity) or unrelated (minimum simi-
larity). By choosing 1 as upper bound, we obtain
523
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
je
un
c
_
un
_
0
0
3
2
9
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
and consistent. Par exemple, if we wish to give
negation of the main predicate of a sentence a
two-times higher weight compared with negation
in an embedded sentence, we want this to be made
transparent. A concrete example for a transparent
bias is found in Cai and Lam (2019). They analyze
the impact of their novel top–down AMR parsing
strategy by integrating a root-distance bias into
SMATCH to focus on structures situated at the top
of a graph.
We now turn to properties that focus on the
nature of the objects we aim to compare: graph-
based compositional meaning representations.
These graphs consist of atomic conditions that
determine the circumstances under which a
sentence is true. Ainsi, our metric score should
increase with increasing overlap of A and B,
which we denote f (UN, B), the number of matching
conditions. This overlap can be viewed from
a symbolic or/and a graded perspective (cf.,
par exemple., Schenker et al. [2005], who denote these
perspectives as ‘‘syntactic’’ vs. ‘‘semantic’’).
From the symbolic perspective, we compare the
nodes and edges of two graphs on a symbolic
level, while from the graded perspective, nous
take into account the degree to which nodes and
edges differ. Both types of matching involve a
precondition: If A and B contain variables, nous
need a variable-mapping in order to match con-
ditions from A and B.3
VI. matching (graph-based) meaning repre-
sentations – symbolic match A natural symbolic
overlap-objective can be found in the Jaccard
index J (Jaccard, 1912; Real and Vargas, 1996;
Papadimitriou et al., 2010): Let t(G) be the set
of triples of graph G, F (UN, B) =
t(B)
|
the size of the overlap of A, B, and z(UN, B) =
the size of their union. Alors, nous
t(UN)
|
wish that A and B are considered more simi-
lar to each other than A and C iff A and B
exhibit a greater relative agreement in their (sym-
bolic) conditions: metric(UN, B) > metric(UN, C)
z(UN,C) = J(UN, C). Un
⇔
allowed exception to this monotonic relationship
z(UN,B) = J(UN, B) > f (UN,C)
t(UN)
|
t(B)
F (UN,B)
∩
∪
|
x2, instance, cat
3Par exemple, consider graph A in Figure 1 and its
x1, instance, drink-1
set of triples t(UN):
,
je
{h
ih
. Quand
x3, instance, eau
x1, arg1, x3i
x1, arg0, x2i
,
,
h
h
h
comparing A against graph B we need to judge whether
a triple t
t(UN) is also contained in B: t
t(B). Pour
ce, we need a mapping map: vars(UN)
vars(B) où
such that f
vars(UN) =
x1, .., xn}
{
is maximized.
∈
→
y1, .., ym}
, vars(B) =
je}
∈
{
can occur if we want to take into account a
graded semantic match of atomic graph elements
or sub-structures, which we will now elaborate on.
VII. matching (graph-based) meaning repre-
sentations – graded semantic match: Un
motivation for this principle can be found in
engineering, Par exemple, when assessing the
quality of produced parts. Ici, small deviations
from a reference may be tolerable within certain
limits. De la même manière, two AMR graphs may match
almost perfectly—except for two small divergent
components. The extent of divergence can be
measured by the degree of similarity of the
two divergent components. In our case, we need
linguistic knowledge to judge what degree of
divergence we are dealing with and whether it is
tolerable.
h
X, instance, conceptA
je
Par exemple, consider that graph A contains a
and graph B a triple
triple
oui, instance, conceptB
, while otherwise the graphs
je
h
are equivalent, and the alignment has set x =
oui. Then f (UN, B) should be higher when conceptA
is similar to conceptB compared to the case where
conceptA is dissimilar to conceptB. In AMR,
concepts are often abstract, so near-synonyms may
even be fully admissible (enemy–foe). Although
tel (near-)synonyms are bound to occur fre-
quently when we compare AMR graphs of dif-
ferent sentences that may contain paraphrases, nous
will see, in Section
4, that this can also occur
in parser evaluation, where two different graphs
represent the same sentence. By defining metric
to map to a range [0,1] we already defined it to
be globally graded. Ici, we desire that graded
similarity may also hold of minimal units of AMR
graphs, such as atomic concepts or even sub-
graphs, Par exemple, to reflect that injustice(X)
is very similar to justice(X)
polarity(X,
−
2.2 AMR Metrics: SMATCH and SEMBLEU
∧
).
§
With our seven principles for AMR similarity
metrics in place, we now introduce SMATCH and
SEMBLEU, two metrics that differ in their design
and assumptions. We describe each of them in
detail and summarize their differences, setting the
stage for our in-depth metric analysis (
3).
§
Align and match – SMATCH The SMATCH metric
operates in two steps. D'abord, (je) we align the varia-
bles in A and B in the best possible way, by finding
a mapping map⋆: vars(UN)
vars(B) que
yields a maximal set of matching triples between
→
524
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
je
un
c
_
un
_
0
0
3
2
9
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
h
h
=
t(UN)
xi, rel, xj
i ∈
yk, rel, ym
h
map⋆(xi) rel, map⋆(xj)
je
A and B. Par exemple, si
et
i ∈
t(B), we obtain one triple match. (ii) We compute
Precision, Recall, and F1 score based on the set of
triples returned by the alignment search. The NP-
hard alignment search problem of step (je) is solved
with a greedy hill-climber: Let fmap(UN, B) être
the count of matching triples under any mapping
function map. Alors,
map⋆ = argmax
map
fmap(UN, B)
(1)
Multiple restarts with different seeds increase
the likelihood of finding better optima.
Simplify and match – SEMBLEU The SEMBLEU
metric in Song and Gildea (2019) can also be
described as a two-step procedure. But unlike
SMATCH it operates on a variable-free reduction
of an AMR graph G, which we denote by Gvf
(vf : variable-free, Chiffre 1, right-hand side).
In a first step, (je) SEMBLEU performs k-gram
extraction from Avf and Bvf in a breadth-first
traversal (path extraction). It then (ii) adopts the
BLEU score from MT (Papineni et al., 2002) à
calculate an overlap score based on the extracted
bags of k-grams:
n
SEMBLEU = BP
exp
·
min
BP = e
1
(cid:26)
−
Bvf
Avf
|
|
k=1
X
,0
(cid:27)
|
|
wk log pk
!
(2)
(3)
kgram(Avf )
kgram(Bvf )
∩
kgram(Avf )
where pk is BLEU’s modified k-gram precision that
measures k-gram overlap of a candidate against a
. wk is the
reference: pk = |
(typically uniform) weight over chosen k-gram
sizes. SEMBLEU uses NIST geometric probability
smoothing (Chen and Cherry, 2014). The recall-
focused ‘‘brevity penalty’’ BP returns a value
smaller than 1 when the candidate length
est
smaller than the reference length
Avf
|
|
|
|
|
The graph traversal performed in SEMBLEU
starts at the root node. During this traversal it
simplifies the graph by replacing variables with
their corresponding concepts (voir la figure 1: le
node c becomes DRINK-01) and collects visited
nodes and edges in uni-, bi- and tri grams (k = 3
is recommended). Ici, a source node together
with a relation and its target node counts as a
bi-gram. For the graph in Figure 1, the extracted
le
unigrams are
cat, eau, drink-01
;
Bvf
|
.
|
{
}
extracted bi grams are
drink-01arg2 water
.
}
drink-01 arg1 cat,
{
SMATCH vs. SEMBLEU in a nutshell SEMBLEU
differs significantly from SMATCH. A key dif-
ference is that SEMBLEU operates on reduced
variable-free AMR graphs (Gvf )—instead of full-
fledged AMR graphs. By eliminating variables,
SEMBLEU bypasses an alignment search. Ce
makes the calculation faster and alleviates a
weakness of SMATCH: The hill-climbing search
is slightly imprecise. Cependant, SEMBLEU is not
guided by aligned variables as anchors. Plutôt,
SEMBLEU uses an n-gram statistic (BLEU)
à
compute an overlap score for graphs, based on
k-hop paths extracted from Gvf , using the root
node as the start for the extraction process.
SMATCH, by contrast, acts directly on variable-
bound graphs matching triples based on a selected
alignment. If in some application we wanted it,
both metrics allow the capturing of more ‘‘global’’
graph properties: SEMBLEU can increase its k-
parameter and SMATCH may match conjunctions
de
Dans ce qui suit
triples.
analyse, cependant, we will adhere to their default
configurations because this is how they are used
in most applications.
(interconnected)
3 Assessing AMR Metrics with Principles
This section evaluates SMATCH and SEMBLEU
against the seven principles we established above
by asking: Why does a metric satisfy or violate a
given principle? and What does this imply? Nous
start with principles from mathematics.
je. Continuity, non-negativity, and upper-bound
This principle is fulfilled by both metrics as they
[0, 1].
are functions of the form metric :
D×D →
II. Identity of indiscernibles This principle is
fundamental: An AMR metric must return maxi-
mum score if and only if the graphs are equivalent
in meaning. Encore, there are cases where SEMBLEU, dans
contrast to SMATCH, does not satisfy this principle.
Chiffre 2 shows an example.
Ici, SEMBLEU yields a perfect score for two
AMRs that differ in a single but crucial aspect:
Two of its ARGx roles are filled with arguments that
are meant to refer to distinct individuals that share
the same concept. The graph on the left is an ab-
straction of, Par exemple, The man1 sees the other
man2 in the other man2, while the graph on the
right is an abstraction of The man1 sees himself 1
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
je
un
c
_
un
_
0
0
3
2
9
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
525
Chiffre 2: Two AMRs with semantic roles filled
differently, SEMBLEU considers them as equivalent.
in the other man2. SEMBLEU does not recognize
the difference in meaning between a reflexive
and a non-reflexive relation, assigning maximum
similarity score, whereas SMATCH reflects such
differences appropriately because it accounts for
variables.
Chiffre 3: Symmetry violation for two parses of Things
are so heated between us, I don’t know what to do.
symmetry violation
Graph banks
svr (%, ∆>0.0001)
SEMBLEU
SMATCH
msv (in points)
SMATCH
SEMBLEU
(
+
Ô
Ô
).
|
E
|
UN, B
∈ D
(SMATCH) plus
In sum, SEMBLEU does not satisfy principle II
because it operates on a variable-free reduction
of AMRs (Gvf ). One could address this prob-
lem by reverting to canonical AMR graphs and
adopting variable alignment in SEMBLEU. But this
would adversely affect the advertised efficiency
advantages over SMATCH. Re-integrating the align-
ment step would make SEMBLEU less efficient than
SMATCH because it would add the complexity of
breadth-first traversal, yielding a total complexity
de
V
|
|
III. Symmetry This principle is fulfilled if
: metric(UN, B) = metric(B, UN).
∀
Chiffre 3 shows an example where SEMBLEU does
not comply with this principle, to a significant
extent: When comparing AMR graph A against
B, it yields a score greater than 0.8, yet, quand
comparing B to A the score is smaller than 0.5.
We perform an experiment that quantifies this
effect on a larger scale by assessing the frequency
and the extent of such divergences. To this
end, we parse 1,368 development sentences from
an AMR corpus (LDC2017T10) with an AMR
) and evaluate
parser (obtaining graph bank
it against another graph bank
(gold graphs or
another parser-output). We quantify the symmetry
violation by the symmetry violation ratio (Eq. 4)
and the mean symmetry violation (Eq. 5) given
some metric m:
UN
B
svr =
|UN|i=1 I[m(
je,
UN
= m(
je,
B
je)]
UN
msv =
P.
P.
m(
|UN|je = 1 |
je,
UN
je)
B
|UN|
je)
B
|UN|
m(
je,
B
je)
UN
|
−
(4)
(5)
526
GPLA
CAMR
JAMR
GPLA
GPLA
JAMR
↔
↔
↔
↔
↔
↔
Gold
Gold
Gold
JAMR
CAMR
CAMR
avg.
2.7
7.8
5.0
4.2
7.4
7.9
5.8
81.8
92.8
87.0
86.0
93.4
91.6
88.8
0.1
0.2
0.1
0.1
0.1
0.2
0.1
3.2
3.1
3.2
3.0
3.4
3.3
3.2
Tableau 1: svr (Eq. 4), msv (Eq. 5) of AMR metrics.
BLEU symmetry violation, MT
data: newstest2018
worst-case
avg-case
(
↔
) svr (%, ∆ > 0.0001) msv (in points)
·
81.3
72.7
0.2
0.2
Tableau 2: svr (Eq. 4), msv (Eq. 5) of BLEU, MT
setting.
We conduct the experiment with three AMR
systèmes, CAMR (Wang et al., 2016), GPLA (Lyu
and Titov, 2018), and JAMR (Flanigan et al.,
2014), and the gold graphs. De plus, to provide
a baseline that allows us to better put the results
into perspective, we also estimate the symmetry
violation of BLEU (SEMBLEU’s MT ancestor) dans
an MT setting. Spécifiquement, we fetch 16 système
outputs of the WMT 2018 EN-DE metrics task
(Ma et al., 2018) and calculate BLEU(UN,B) et
BLEU(B,UN) of each sentence-pair (UN,B) from the
MT system’s output and the reference (using the
same smoothing method as SEMBLEU). As worst-
case/avg.-case, we use the outputs from the team
where BLEU exhibits maximum/median msv.4
Tableau 1 shows that more than 80% of the
evaluated AMR graph pairs lead to a symmetry
violation with SEMBLEU (as opposed to less than
4worst: LMU uns.; avg.: LMU NMT (Huck et al., 2017).
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
je
un
c
_
un
_
0
0
3
2
9
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
6
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
je
un
c
_
un
_
0
0
3
2
9
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Chiffre 4: Symmetry evaluations of metrics. SEMBLEU (left column) and SMATCH (middle column) and BLEU as
a ‘baseline’ in an MT task setting on newstest2018. SEMBLEU: large divergence, strong outliers. SMATCH: few
divergences, few outliers; BLEU: many small divergences, zero outliers. (un) marks the case in Figure 3.
10% for SMATCH). The msv of SMATCH is consi-
derably smaller compared to SEMBLEU: 0.1 vs. 3.2
points F1 score. Even though the BLEU metric is
inherently asymmetric, most of the symmetry
violations are negligible when applied in MT (haut
svr, low msv, Tableau 2). Cependant, when applied
to AMR graphs ‘‘via’’ SEMBLEU the asymmetry
is amplified by a factor of approximately 16 (0.2
vs. 3.2 points). Chiffre 4 visualizes the symme-
try violations of SEMBLEU (gauche), SMATCH (middle),
and BLEU (droite). The SEMBLEU-plots show that
the effect is widespread, some cases are extreme,
many others are less extreme but still considerable.
This stands in contrast to SMATCH but also to BLEU,
which itself appears well calibrated and does not
suffer from any major asymmetry.
In sum, symmetry violations with SMATCH are
much fewer and less pronounced than those
observed with SEMBLEU. En théorie, SMATCH is
fully symmetric, cependant, violations can occur
due to alignment errors from the greedy variable-
alignment search (we discuss this issue in the next
paragraph). Par contre, the symmetry violation
of SEMBLEU is intrinsic to the method because
the underlying overlap measure BLEU is inherently
asymmetric, cependant, this asymmetry is amplified
in SEMBLEU compared to BLEU.5
IV. Determinacy This principle states that
repeated calculations of a metric should yield iden-
1
2
# restarts
3
5
7
corpus vs. corpus
graph vs. graph
2.6e−
1.3e−
4
3
1.7e−
1.0e−
4
3
8.1e−
8.5e−
5
4
5.7e−
5.3e−
5
4
5.6e−
4.0e−
5
4
Tableau 3: Expected determinacy error ǫ in SMATCH
F1.
tical results. Because there is no randomness in
SEMBLEU, it fully complies with this principle.
The reference implementation of SMATCH does not
fully guarantee deterministic variable alignment
résultats, because it aligns the variables by means of
greedy hill-climbing. Cependant, multiple random
initializations together with the small set of AMR
variables imply that the deviation will be
ǫ
(a small number close to 0).6 In Table 3 nous
measure the expected ǫ: it displays the SMATCH F1
standard deviation with respect to 10 independent
runs, on a corpus level and on a graph-pair level
(arithmetic mean).7 We see that ǫ is small, même
when only one random start is performed (corpus
level: ǫ = 0.0003, graph level: ǫ = 0.0013).
We conclude that the hill-climbing in SMATCH is
unlikely to have any significant effects on the final
score.
≤
V. No bias A similarity metric of (UN)MRs
should not unjustifiably or unintentionally favor
6En plus, ǫ = 0 is guaranteed when resorting to a
5As we show below (principle V), this is due to the way in
which k-grams are extracted from variable-free AMR graphs.
(costly) ILP calculation (Cai and Knight, 2013).
7Données: dev set of LDC2017T10, parses by GPLA.
527
SEMBLEU
SMATCH
(3d)
(d)
Ô
Ô
√√√
Ô
(d2 + d)
(d)
Ô
Ô
(d2 + 2d)
(d)
Ô
Tableau 4: Error impact depending on error location
in a tree with node degree d.
Chiffre 5: Gauche: In April, a woman rides a car from
Rome to Pisa. root nodes A: travel-01 vs. B: drive-01.
Droite: In Apr., a sailor travels with a ship from P.
à N.
the correctness or penalize errors pertaining to
any (sub-)structures of the graphs. Cependant, nous
find that SEMBLEU is affected by a bias that affects
(quelques) leaf nodes attached to high-degree nodes.
The bias arises from two related factors: (je) quand
transforming G to Gvf , SEMBLEU replaces variable
nodes with concept nodes. Ainsi, nodes that were
leaf nodes in G can be raised to highly connected
nodes in Gvf . (ii) breadth-first k-gram extraction
starts from the root node. During graph traversal,
concept leaves—now occupying the position of
(former) variable nodes with a high number of
outgoing (and incoming) edges—will be visited
and extracted more frequently than others.
The two factors in combination make SEMBLEU
penalize a wrong concept node harshly when it
is attached to a high-degree variable node (le
leaf is raised to high-degree when transforming G
to Gvf ). Inversement, correct or wrongly assigned
concepts attached to nodes with low degree are
only weakly considered.8 For example, consider
Chiffre 5. SEMBLEU considers two graphs that
express quite distinct meanings (left and right) comme
more similar than graphs that are almost equivalent
in meaning (gauche, variant A vs. B). This is because
the leaf that is attached to the root is raised to a
highly connected node in Gvf and thus is over-
frequently contained in the extracted k-grams,
whereas the other leaves will remain leaves in
Gvf .
Analyzing and quantifying SEMBLEU’s bias To
better understand the bias, we study three limiting
8This may have severe consequences, par exemple., for negation,
since negation always occurs as a leaf in G and Gvf .
Donc, SEMBLEU, by-design, is benevolent to polarity
errors.
528
Chiffre 6: # of k-grams entered by a node in SEMBLEU.
cases: (je) the root is wrong (√√√ ) (ii) d leaf nodes are
wrong (
) et (iii) one branching node is wrong
( ). Depending on a specific node and its position
in the graph, we would like to know onto how
many k-grams (SEMBLEU) or triples (SMATCH) le
errors are projected. For the sake of simplicity,
we assume that the graph always comes in its
simplified form Gvf , that it is a tree, et ça
every non-leaf node has the same out-degree d.
The result of our analysis is given in Table 49
and exemplified in Figure 6. Both show that the
number of times k-gram extraction visits a node
heavily depends on its position and that with
growing d, the bias gets amplified (Tableau 4).10
Par exemple, when d = 3, 3 wrong leaves
yield 9 wrong k-grams, et 1 wrong branching
node can already yield 18 wrong k-grams. Par
contraste, in SMATCH the weight of d leaves always
approximates the weight of 1 branching node of
degree d.
In sum, in SMATCH the impact of a wrong node
is constant for all node types and rises linearly
with d. In SEMBLEU the impact of a node rises
approximately quadratically with d and it also
depends on the node type, because it raises some
(but not all) leaves in G to connected nodes in
Gvf .
9Proof sketch, SMATCH, d leaves: d triples, a root: d triples,
a branching node: d+1 triples. SEMBLEU
, d leaves: 3d
k-grams (d tri, d bi, d uni). A root: d2 tri, d bi, 1 uni. UN
branching node: d2+d+1 tri, d+1 bi, 1 uni.
wk=1/3
k=3
10Consider that in AMR, d can be quite high, par exemple., un
predicate with multiple arguments and additional modifiers.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
je
un
c
_
un
_
0
0
3
2
9
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Eliminating biases A possible approach to
reduce SEMBLEU’s biases could be to weigh the
extracted k-gram matches according to the degree
of the contained nodes. Cependant, this would imply
that we assume some k-grams (and thus also some
nodes and edges) to be of greater importance than
others—in other words, we would eliminate one
bias by introducing another. Because the breadth-
first traversal is the metric’s backbone, ce problème
may be hard to address well. When BLEU is used
for MT evaluation, there is no such bias because
the k-grams in a sentence appear linearly.
VI. Graph matching: Symbolic perspective
This principle requires that a metric’s score grows
with increasing overlap of the conditions that are
simultaneously contained in A and B. SMATCH
fulfills this principle since it matches two AMR
graphs inexactly (Yan et al., 2016; Riesen et al.,
2010) by aligning variables such that the triple
matches are maximized. Ainsi, SMATCH can be
seen as a graph-matching algorithm that works
on any pair of graphs that contain (quelques) nodes
that are variables. It fulfills the Jaccard-based
overlap objective, which symmetrically measures
the amount of triples on which two graphs agree,
normalized by their respective sizes (since SMATCH
F1 = 2J/(1 + J.) is a monotonic relation).
Because SEMBLEU does not satisfy principles
II and III (id. of indescernibles and symmetry),
it is a corollary that it cannot fulfill the overlap
objective.11 Generally, SEMBLEU does not com-
pare and match two AMR graphs per se, instead
it matches the results of a graph-to-bag-of-paths
2.2) and the input may not
projection function (
be recoverable from the output (surjective-only).
Ainsi, matching the outputs of this function cannot
be equated to matching the inputs on a graph-level.
§
4 Towards a More Semantic Metric for
Semantic Graphs: S2
MATCH
This section focuses on principle VII, semantically
graded graph matching, a principle that none of
the AMR metrics considered so far satisfies. UN
11Proof by symmetry violation:
∃
UN, B: metric(UN, B) > metric(B, UN)
w.l.o.g.
→ (cid:18) , since f (UN, B) =
> f (B, UN)
t(UN)
t(UN)
t(B)
|
|
∩
indiscernibles:
w.l.o.g.
F (UN, B)/z(UN, B) = 1 > f (UN, C)/z(UN, C)(cid:18)
= f (B, UN)
F (UN, B)
=
t(B)
|
/// Proof by identity of
⇒
∩
UN, B, C: metric(UN, B) = metric(UN, C) = 1
∃
|
∧
529
Chiffre 7: Three different AMR graphs representing
The cat sprints; The kitten runs; The giraffe sleeps and
pairwise similarity scores from SEMBLEU, SMATCH, et
S2MATCH (voir (
4) for S2Match).
§
fulfilment of this principle also increases the capa-
city of a metric to assess the semantic similarity
of two AMR graphs from different sentences.
Par exemple, when clustering AMR graphs or
detecting paraphrases in AMR-parsed texts, le
ability to abstract away from concrete lexica-
lizations is clearly desirable. Consider Figure 7,
with three different graphs. Two of them (UN, B)
are similar in meaning and differ significantly
from C. Cependant, both SMATCH and SEMBLEU yield
the same result in the sense that metric(UN, B) =
metric(UN, C). Put differently, neither metric
takes into account that giraffe and kitten are two
quite different concepts, while cat and kitten are
more similar. Cependant, we would like this to be
reflected by our metric and obtain metric (UN,
B) > metric(UN, C) in such a case.
MATCH We propose the S2
S2
MATCH metric (Soft
Semantic match, pronounced: [estu:mætS ℄) que
in one
builds on SMATCH but differs from it
important aspect:
Instead of maximizing the
number of (hard) triple matches between two
graphs during alignment search, we maximize
le (soft) triple matches by taking into account
the semantic similarity of concepts. Recall that
an AMR graph contains two types of triples:
instance and relation triples (par exemple., Chiffre 7, gauche:
). In SMATCH,
un, instance, cat
h
h
je
two triples can only be matched if they are
identical. In S2
MATCH, we relax this constraint,
which has also the potential to yield a different,
and possibly, a better variable alignment. More
precisely, in SMATCH we match two instance triples
c, arg 0, un
et
je
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
je
un
c
_
un
_
0
0
3
2
9
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
avg. msv (Eq. 5)
1 restart
determinacy error
2 restarts
4 restarts
SMATCH
S2
MATCH
relative change
0.0011
0.0005
54.6%
−
1.3e−
9.0e−
3
4
1.0e−
6.1e−
3
4
5.3e−
2.1e−
4
4
30.7%
−
39.0%
−
60.3%
−
Tableau 5: S2
reducing the extent of its non-determinacy.
MATCH improves upon SMATCH by
following pilot experiments, we use cosine (Eq. 8)
and τ = 0.5 over 100-dimensional GloVe vectors
(Pennington et al., 2014).
To summarize, S2
MATCH is designed to either
yield the same score as SMATCH—or a slightly
increased score when it aligns concepts that are
symbolically distinct but semantically similar.
An example, from parser evaluation, is shown in
Chiffre 8. Ici, S2
MATCH increases the score to
63 F1 (+10 points) by detecting a more adequate
alignment that accounts for the graded similarity
of two related AMR concepts pairs. We believe
that this is justified: The two graphs are very
similar and an F1 of 53 is too low, doing the
parser injustice.
On a technical note, the changes in alignments
also have the outcome that S2
MATCH mends some
of SMATCH’s flaws: It better addresses principles
III and IV, reducing the symmetry violation and
determinacy error (Tableau 5).
Qualitative study: Probing S2
MATCH’s choices
This study randomly samples 100 graph pairs from
those where S2
MATCH assigned higher scores than
SMATCH.12 Two annotators were asked to judge the
similarity of all aligned concepts with similarity
score <1.0: Are the concepts dissimilar, similar, or
extremely similar? For concepts judged dissimilar,
we conclude that S2
MATCH erroneously increased
the score; if judged as (extremely) similar, we
conclude that
the decision was justified. We
calculate three agreement statistics that all show
large consensus among our annotators (Cohen’s
kappa: 0.79, squared kappa: 0.87, Pearson’s ρ:
0.91) According to the annotations, the decision
to increase the score is mostly justified: in 56%
and 12% of cases both annotators voted that the
newly aligned concepts are extremely similar and
similar, respectively, while the agreed dissimilar
label makes up 25% of cases.
12Automatic graphs by GPLA, on LDC2017T10, dev set.
530
Figure 8: ‘6 Abu Sayyaf suspects were captured last
week in a raid in Metro Manila.’ gold (top) vs. parsed
AMR (bottom). SMATCH aligns criminal-organization
to city (red); S2
MATCH aligns criminal-organization to
suspect-01, city to country-region (blue).
a, instance, x
h
B as follows:
i ∈
A and
h
map(a), instance, y
hardM atch = I[x = y]
i ∈
(6)
where I(c) equals 1 if c is true and 0 otherwise.
S2
MATCH relaxes this condition:
sof tM atch = 1
d(x, y),
−
(7)
X
where d is an arbitrary distance function d :
[0, 1]. For example, in practice, if
X
n ,
we represent the concepts as vectors x, y
we can use
∈ R
→
×
d(x, y) = min
1, 1
(cid:26)
yT x
y
k2 k
−
x
k
k2 (cid:27)
.
(8)
When plugged into Eq. 7, this results in the
cosine similarity
[0, 1]. It may be suitable to
set a threshold τ (e.g., τ = 0.5), to only consider
the similarity between two concepts if it is above
d(x, y) < τ ). In the
τ (sof tM atch = 0 if 1
∈
−
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
l
a
c
_
a
_
0
0
3
2
9
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
input span region (excerpt)
amr region gold (excerpt)
amr region parser (excerpt)
cos
points F1
↑
40 km southwest of
:quant 40:unit ( k2 / kilometer )
( k22 / km :unit-of (d23 / distance-quantity
improving agricultural prod.
(i2 / improve-01 . . . :mod ( f2 / farming )
(i31 / improve-01:mod ( a23 / agriculture )
other deadly bacteria
op3 ( b / bacterium . . . :mod (o / other)))
op3 ( b13 / bacteria :ARG0-of:mod (o12 / other)))
drug and law enforcement aid
(a / and:op2 ( a3 / aid-01
:ARG1 (a9 / and:op1 ( d8 / drug ) :op2 (l10 / law)))
Get a lawyer and get a divorce.
:op1 (g / get-01:mode imp. :ARG0 ( y / you )
:op1 ( g0 / get-01 :ARG1 (l2 / lawyer):mode imp.)
The unusual development.
ARG0 (d / develop-01:mod ( u / usual :polarity -))
:ARG0 (d1 / develop-02:mod ( u0 / unusual ))
0.72
0.73
0.80
0.67
0.80
0.60
annotation
ex. similar
ex. similar
ex. similar
similar
dissimilar
1.2
3.0
5.1
1.8
4.8
14.0
dissimilar
Table 6: Examples where S2
MATCH assigns a higher score, accounting for the similarity of aligned concepts .
Table 6 lists examples of good or ill-founded
score increases. We observe, for example, that
S2
MATCH accounts for the similarity of two con-
cepts of different number: bacterium (gold) vs.
bacteria (parser) (line 3). It also captures abbre-
viations (km – kilometer) and closely related
concepts (farming – agriculture). SEMBLEU and
SMATCH would penalize the corresponding triples
in exactly the same way as predicting a truly
dissimilar concept.
An interesting case is seen in line 7. Here,
usual and unusual are correctly annotated as
dissimilar, since they are opposite concepts.
S2
MATCH, equipped with GloVe embeddings,
measures a cosine of 0.6, above the chosen
threshold, which results in an increase of the score
by 14 points (the increase is large as these two
graphs are tiny). It is well known that synonyms
and antonyms are difficult to distinguish with
distributional word representations, because they
often share similar contexts. However, the case
at hand is orthogonal to this problem: usual in
’,
the gold graph is modified with the polarity ‘
whereas the predicted graph assigned the (non-
negated) opposite concept unusual. Hence, given
the context in the gold graph, the prediction is
semantically almost equivalent. This points to an
aspect of principle VII that is not yet covered
by S2
MATCH: It assesses graded similarity at the
lexical, but not at the phrasal level, and hence
cannot account for compositional phenomena.
In future work, we aim to alleviate this issue
by developing extensions that measure semantic
similarity for larger graph contexts, in order to
fully satisfy all seven principles.13
−
Quantitative study: Metrics vs. human raters
This study investigates to what extent the judg-
ments of the three metrics under discussion resem-
ble human judgements, based on the following
two expectations. First, the more a human rates
13As we have seen, this requires much care. We therefore
consider this next step to be out of scope of the present paper.
two sentences to be semantically similar in their
meaning, the higher the metric should rate the cor-
responding AMR graphs (meaning similarity).
Second, the more a human rates two sentences
to be related in their meaning (maximum: equi-
valence), the higher the score of our metric of
the corresponding AMR graphs should tend to be
(meaning relatedness). Albeit not the exact same
(Budanitsky and Hirst, 2006), the tasks are closely
related and both in conjunction should allow us to
better assess the performance of our AMR metrics.
As ground truth for the meaning similarity
rating task we use test data of the Semantic
Textual Similarity (STS) shared task (Cer et al.,
2017), with 1,379 sentence pairs annotated for
meaning similarity. For the meaning-relatedness
task we use SICK (Marelli et al., 2014) with
9,840 sentence pairs that have been additionally
annotated for semantic relatedness.14 We proceed
as follows: We normalize the human ratings to
[0,1]. Then we apply GPLA to parse the sen-
tence tuples (si, s′i), obtaining tuples (parse(si),
parse(s′i)) and score the graph pairs with the
metrics: SMATCH(i), S2
MATCH(i), SEMBLEU(i), and
H(i), where H(i) is the human score. For both
tasks SMATCH and S2
MATCH yield better or equal
correlations with human raters than SEMBLEU
(Table 7). When considering the RMS error
metric(i))2. the difference
1
n−
n
i=1(H(i)
is even more pronounced.
p
−
P
This deviation in the absolute scores is also
reflected by the score density distributions plotted
in Figure 9: SEMBLEU underrates a good proportion
of graph pairs whose input sentences were rated as
highly semantically similar or related by humans.
This may well relate to the biases of different node
MATCH appears to provide
types (cf.
3). Overall, S2
§
14An example from SICK. Max. score: A man is cooking
pancakes–The man is cooking pancakes. Min. score: Two
girls are playing outdoors near a woman.–The elephant is
being ridden by the man. To further enhance the soundness
of the SICK experiment we discard pairs with a contradiction
relation and retain 8,416 pairs with neutral or entailment.
531
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
l
a
c
_
a
_
0
0
3
2
9
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
RMSE
RMSE (quant)
Pearson’s ρ
Spearman’s ρ
task
SB
SM
STS
0.34
SICK 0.38
0.25
0.25
2
S
M
0.25
0.24
SB
SM
0.25
0.32
0.11
0.14
2
S
M
0.10
0.13
SB
SM
0.52
0.62
0.55
0.64
2
S
M
0.55
0.64
SB
SM
0.51
0.66
0.53
0.66
2
S
M
0.53
0.66
Table 7: RMSE (lower is better) and correlation results of our metrics in our STS and
SICK investigations. RMSE (quant): RMSE on empirical quantile distribution with quantiles
0.1,0.2,. . . ,0.9.
Figure 9: Sentence meaning similarity distributions.
a better fit with the score-distribution of the human
rater when measuring semantic similarity and
relatedness, the latter being notably closer to
the human reference in some regions than the
otherwise similar SMATCH. A concrete example
from the STS data is given in Figure 10. Here,
S2
MATCH detects the similarity between the abstract
anaphors it and this and assigns a score that better
reflects the human score compared to SMATCH and
SEMBLEU, the latter being far too low. However, in
total, we conclude that S2
MATCH’s improvements
seem rather small and no metric is perfectly
aligned with human scores, possibly because
gradedness of semantic similarity that arises in
combination with constructional variation is not
yet captured—more research is needed to extend
S2
MATCH’s scope to such cases.
5 Metrics’ effects on parser evaluation
We have seen that different metrics can assign
different scores to the same pair of graphs. We
now want to assess to what extent this affects
rankings: Does one metric rank a graph higher
or lower than the other? And can this affect the
ranking of parsers on benchmark datasets?
Quantitative study: Graph rankings
In this
experiment, we assess whether our metrics rank
graphs differently. We use LDC2017T10 (dev)
parses by CAMR [c1. . .cn], JAMR [j1. . .jn] and
gold graphs [y1. . .yn]. Given metrics
we obtain results
(c1, y1) . . .
and analogously
and
G
F
(cn, yn)]
F
J . We calculate two
F
C and
C := [
F
J ,
F
G
G
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
2
9
1
9
2
3
3
1
2
/
/
t
l
a
c
_
a
_
0
0
3
2
9
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Figure 10: An example from STS, where S2MATCH
yields a score that better reflects the human judgement,
due to detecting a similarity between the abstract
anaphora it and this .
·
J
(
G
F
C
i )
J
i − F
statistics: (i) the ratio of cases i where the metrics
differ in their preference for one parse over the
J
C
other (
i ) < 0, and, to assess
i − G
significance, (ii) a t-test for paired samples on the
differences assigned by the metrics between the
C and
C . Table 8 shows that
J
parsers:
−F
F
SMATCH and S2
MATCH both differ (significantly)
from SEMBLEU in 15% – 20% of cases. SMATCH
and S2
MATCH differ on individual rankings in appr.
4% of cases. Furthermore, we note a considerable
amount of cases (8.1%) where SEMBLEU disagrees
with itself in the preference for one parse over the
other.15
−G
G
The differing preferences of S2
MATCH for
either candidate parse can be the outcome of
small divergences due to the alignment search
or because S2
MATCH accounts for the lexical
similarity of concepts, perhaps supported by a
new variable alignment. Figure 11 shows two
examples where S2
MATCH prefers a different
candidate parse compared to SMATCH. In the first
example (Figure 11a), S2
MATCH prefers the parse
15That is, SB(A,G)>SB(B,G) albeit SB(G,UN)