Weisfeiler-Leman in the BAMBOO : Novel AMR Graph Metrics
and a Benchmark for AMR Graph Similarity

Juri Opitz1 Angel Daza2 Anette Frank1
1Dept. of Computational Linguistics, Heidelberg University, Germany
2CLTL, Vrije Universiteit Amsterdam, The Netherlands
{opitz, frank}@cl.uni-heidelberg.de, j.a.dazaarevalo@vu.nl

Abstract

Several metrics have been proposed for as-
sessing the similarity of (abstract) meaning
representations (AMRs), but little is known
about how they relate to human similarity rat-
ings. Moreover, the current metrics have com-
plementary strengths and weaknesses: Some
emphasize speed, while others make the align-
ment of graph structures explicit, at the price
of a costly alignment step.
In this work we propose new Weisfeiler-Leman
AMR similarity metrics that unify the strengths
of previous metrics, while mitigating their weak-
nesses. Specifically, our new metrics are able
to match contextualized substructures and in-
duce n:m alignments between their nodes. Fur-
thermore, we introduce a Benchmark for AMR
Metrics based on Overt Objectives (BAMBOO ),
the first benchmark to support empirical as-
sessment of graph-based MR similarity met-
rics. BAMBOO maximizes the interpretability
of results by defining multiple overt objectives
that range from sentence similarity objectives
to stress tests that probe a metric’s robust-
ness against meaning-altering and meaning-
preserving graph transformations. We show
the benefits of BAMBOO by profiling previous
metrics and our own metrics. Results indicate
that our novel metrics may serve as a strong
baseline for future work.

1 Introduction

Meaning representations aim at capturing the
meaning of text in an explicit graph format. A
prominent framework is abstract meaning repre-
sentation (AMR), proposed by Banarescu et al.
(2013). AMR views sentences as rooted, directed,
acyclic, labeled graphs. Their nodes are variables,
attributes, or (open-class) concepts and are con-
nected with edges that express semantic relations.
There are many use cases in which we need to
compare or relate two AMR graphs. A common
situation is found in parser evaluation, where

AMR metrics are widely applied (May, 2016;
May and Priyadarshi, 2017).1 Yet, there are more
situations where we need to measure similarity
of meaning as expressed in AMR graphs. For
example, Bonial et al. (2020) leverage AMR met-
rics in a semantic search engine for COVID-19
queries, Naseem et al. (2019) use metric feedback
to reinforce AMR parsers, Opitz (2020) emulates
metrics for referenceless AMR ranking and rating,
and Opitz and Frank (2021) use AMR metrics for
NLG evaluation.

So far, multiple AMR metrics (Cai and Knight,
2013; Cai and Lam, 2019; Song and Gildea, 2019;
Anchiˆeta et al., 2019; Opitz et al., 2020) have
been proposed to assess AMR similarity. How-
ever, due to a lack of an appropriate evaluation
benchmark, we have no empirical evidence that
could tell us more about their strengths and weak-
nesses or offer insight about which metrics may
be preferable over others in specific use cases.

In addition, we would like to move beyond the
aforementioned metrics and develop new metrics
that account for graded similarity of graph sub-
structures, which is not an easy task. However, it
is crucial when we need to compare AMR graphs
in a deeper way. Consider Figure 1, which shows
two AMRs that convey very similar meanings.
All aforementioned metrics assign this pair a low
similarity score, and—if alignment-based, as is
SMATCH (Cai and Knight, 2013)—find only sub-
par alignments.2 In this case, we want a metric
that provides us with a high similarity score and,
ideally, an explanatory alignment.

The structure of this paper is as follows. In §2
we discuss related work. In §3 we describe our

1With minor adaptions, AMR metrics are also used in
other MR parsing tasks (van Noord et al., 2018; Zhang et al.,
2018; Oepen et al., 2020).

2For example, in Figure 1, SMATCH aligns drink-01 to
slurp-01 and kitten to cat, resulting in a single matching triple
(x, arg0, y).

Transactions of the Association for Computational Linguistics, vol. 9, pp. 1425–1441, 2021. https://doi.org/10.1162/tacl_a_00435
Action Editor: Yue Zhang. Submission batch: 4/2021; Revision batch: 8/2021; Published 12/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 License.

late the variable alignment of (W)S(2)MATCH and
only consider their attached concepts, which in-
creases computation speed. Apart from this, the
metrics differ significantly: SEMBLEU extracts bags
of k-hop paths (k≤3) from the AMR graphs and
thereupon calculates BLEU (Papineni et al., 2002).
SEMA, on the other hand, is somewhat simpler and
provides us with an F1 score that it achieves by
comparing extracted triples.

From Measuring Structure Overlap to
Measuring Meaning Similarity Most AMR
metrics have been designed for semantic parser
evaluation, and therefore determine a score for
structure overlap. While this is legitimate, with
extended use cases for AMR metrics arising, there
is increased awareness that structural matching of
labeled nodes and edges of an AMR graph is not
sufficient for assessing the meaning similarity ex-
pressed by two AMRs (Kapanipathi et al., 2021).
This insufficiency has also been observed in cross-
lingual AMR parsing evaluation (Blloshmi et al.,
2020; Sheth et al., 2021; Uhrig et al., 2021), but
is most prominent when attempting to compare
the meaning of AMRs that represent different sen-
tences (Opitz et al., 2020; Opitz and Frank, 2021).
This work argues that in cases like Figure 1,
the available metrics do not sufficiently reflect
the similarity of the two AMRs and their underly-
ing sentences.

How Do Humans Rate Similarity of Sentence
Meaning? STS (Baudiš et al., 2016a,b; Cer
et al., 2017) and SICK (Marelli et al., 2014)
elicited human ratings of sentence similarity on a
Likert scale. While STS annotates semantic simi-
larity, SICK annotates semantic relatedness. These
two aspects are highly related, but not the exact
same (Budanitsky and Hirst, 2006; Kolb, 2009).
Only the highest scores on the Likert scales of
SICK and STS can be seen as reflecting the equiv-
alence of meaning of two sentences. Other data
sets contain binary annotations of paraphrases
(Dolan and Brockett, 2005), that cover a wide
spectrum of semantic phenomena.

Benchmarking Metrics Metric benchmarking
is an active topic in NLP research and led to the
emergence of metric benchmarks in various areas,
most prominently MT and NLG (Gardent et al.,
2017; Zhu et al., 2018; Ma et al., 2019). These
benchmarks are useful since they help to assess

Figure 1: Similar AMRs, with sketched alignments.

first contribution: new AMR metrics that aim at
unifying the strengths of previous metrics while
mitigating their weaknesses. Specifically, our new
metrics are capable of matching larger substruc-
tures and provide valuable n:m alignments in
polynomial time. In §4 we introduce BAMBOO ,
our second contribution: It is the first bench-
mark data set for AMR metrics and includes
novel robustness objectives that probe the behav-
ior of AMR metrics under meaning-preserving
and meaning-altering transformations of the in-
puts (§5). In §6 we use BAMBOO for a detailed,
multi-faceted empirical study of previous and our
proposed AMR metrics.

We release BAMBOO and our new metrics.3

2 Related Work

The Classical AMR Metric and its Adaptions
The ‘canonical’ and widely applied AMR metric is
SMATCH (Semantic match) (Cai and Knight, 2013).
It solves an NP-hard graph alignment problem ap-
proximately with a hill-climber and scores match-
ing triples. SMATCH has been adapted to S2
MATCH
(Soft Semantic match), by Opitz et al. (2020) to
account for graded similarity of concept nodes
(e.g., cat—kitten), using word embeddings. SMATCH
has also been adapted by Cai and Lam (2019)
in W(eighted)Smatch (WSMATCH), which penal-
izes errors relative to their distance to the root.
This is motivated by the hypothesis that ‘‘core
semantics’’ tend to be located near a graph’s root.

BFS-based and Alignment-free AMR Metrics
Recently, two new AMR metrics have been pro-
posed: SEMA by Anchiêta et al. (2019) and
SEMBLEU by Song and Gildea (2019). Common
to both is a mechanism that traverses the graph.
Both start from the root, and collect structures
with a breadth-first traversal (BFS). Also, both ab-

3https://git.io/J0J7V.

and select metrics and encourage their further de-
velopment (Gehrmann et al., 2021). However,
there is currently no established benchmark that
defines a ground truth of graded semantic similar-
ity between pairs of AMRs, and how to measure it
in terms of their structural representations. Also,
we do not have an established ground truth to
assess what alternative AMR metrics such as
(W|S2)MATCH or SEMBLEU really measure, and how
their scores correlate with human judgments of
the semantic similarity of sentences represented
by AMRs.

3 Grounding Novel AMR Metrics in the
Weisfeiler-Leman Graph Kernel

Previous AMR metrics have complementary
strengths and weaknesses. Therefore, we aim to
propose new AMR metrics that are able to mitigate
these weaknesses, while unifying their strengths,
aiming at the best of all worlds. We want:

i) an interpretable alignment (SMATCH);

ii) a fast metric (SEMA, SEMBLEU);

iii) matching larger substructures (SEMBLEU);

iv) and assessment of graded similarity of
AMR subgraphs (extending S2MATCH).

This section proposes to make use of the
Weisfeiler-Leman graph kernel (WLK) (Weisfeiler
and Leman, 1968; Shervashidze et al., 2011) to
assess AMR similarity. The idea is that WLK
provides us with SEMBLEU-like matches of larger
sub-structures, while bypassing potential biases
induced by the BFS-traversal (Opitz et al., 2020).
We then describe the Wasserstein Weisfeiler Leman
kernel (WWLK) (Togninalli et al., 2019) that is
similar to WLK but provides (je) an alignment
of atomic and non-atomic substructures (going
beyond SMATCH) et (ii) a graded match of sub-
structures (going beyond S2
MATCH). Enfin, nous
further adapt WWLK to WWLKΘ, a variant that
we tailor to learn semantic edge parameters, à
better assess AMR graphs.

Figure 2: WLK example based on one iteration.

power in many tasks, ranging from protein clas-
sification to movie recommendation (Togninalli
et coll., 2019; Yanardag and Vishwanathan, 2015).
However, so far, it has not been applied to
(A)MR graphs. In the following, we describe the
WLK method.

Generally, a kernel can be viewed as a similar-
ity measurement between two objects (Hofmann
et al., 2008), in our case, two AMR graphs G, G′.
It is stated as k(G, G′) = ⟨Φ(G), Φ(G′)⟩, where
⟨·, ·⟩ : R^d × R^d → R^+ is an inner product and
Φ maps an input to a feature vector that is built
incrementally over K iterations. For our AMR
graphs, one such iteration k works as follows: (a)
every node receives the labels of its neighbors and
the labels of the edges connecting it to their neigh-
bors, and stores them in a list (cf. Contextualize
in Figure 2). (b) The lists are alphabetically sorted
and the string elements of the lists are concate-
nated to form new aggregate labels (cf. Compress
in Figure 2). (c) Two count vectors x^k_G and x^k_{G′}
are created where each dimension corresponds to
a node label that is found in any of the two graphs
and contains its count (cf. Features in Figure 2).
Since every iteration yields two vectors (one for
each input), we can concatenate the vectors over
iterations and calculate the kernel (cf. Similarity
in Figure 2):

k(·, ·) = ⟨Φ_WL(G), Φ_WL(G′)⟩
        = ⟨concat(x^0_G, . . . , x^K_G), concat(x^0_{G′}, . . . , x^K_{G′})⟩    (1)

3.1 Basic Weisfeiler-Leman Kernel (WLK)

The Weisfeiler-Leman kernel (WLK) method
(Shervashidze et al., 2011) derives sub-graph fea-
tures from two input graphs. WLK has shown its

Specifically, we use the cosine similarity kernel
and two iterations (K = 2), which implies that
every node receives information from its neigh-
bors and their immediate neighbors. For simplicity


d'abord

we will
treat edges as undirected, mais
later will experiment with various directionality
parameterizations.
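To make the procedure above concrete, the following minimal sketch (our illustration, not the released implementation) runs the contextualize/compress/count steps and the cosine kernel on two toy AMR-like graphs; the dictionary-based graph format and helper names are assumptions made for this example.

```python
from collections import Counter
import math

def wl_features(graph, K=2):
    """graph: dict node -> list of (edge_label, neighbor). Returns label counts over K iterations."""
    labels = {v: v.split("/")[-1] for v in graph}          # initial label = concept name (illustrative)
    counts = Counter(labels.values())                      # iteration-0 features
    for _ in range(K):
        new_labels = {}
        for v, neighbors in graph.items():
            # (a) contextualize: gather neighbor labels plus connecting edge labels
            context = sorted(f"{e}_{labels[u]}" for e, u in neighbors)
            # (b) compress: concatenate into a new aggregate label
            new_labels[v] = labels[v] + "|" + "|".join(context)
        labels = new_labels
        counts.update(labels.values())                     # (c) add this iteration's counts
    return counts

def wl_kernel(g1, g2, K=2):
    """Cosine similarity between the stacked WL count vectors of two graphs."""
    c1, c2 = wl_features(g1, K), wl_features(g2, K)
    dot = sum(c1[f] * c2[f] for f in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# toy AMRs for "the cat drinks" vs. "the kitten slurps" (undirected adjacency, as in Section 3.1)
g1 = {"d/drink-01": [(":arg0", "c/cat")], "c/cat": [(":arg0", "d/drink-01")]}
g2 = {"s/slurp-01": [(":arg0", "k/kitten")], "k/kitten": [(":arg0", "s/slurp-01")]}
print(wl_kernel(g1, g2))   # 0.0 here: symbolic WLK offers no graded concept similarity
```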

3.2 Wasserstein Weisfeiler-Leman (WWLK)
S2MATCH differs from all other AMR metrics in
that it accepts close concept synonyms for align-
ment (up to a similarity threshold). But it comes
with a restriction and a downside: i) it cannot
assess graded similarity of (non-atomic) AMR
subgraphs, which is crucial for assessing par-
tial meaning agreement between AMRs (as illus-
trated in Figure 1), and ii) the alignment is costly
to compute.

We hence propose to adopt a variant of
WLK, the Wasserstein-Weisfeiler Leman kernel
(WWLK) (Togninalli et al., 2019), for the follow-
ing two reasons: (i) WWLK can assess non-atomic
subgraphs on a finer level, and (ii) it provides
graph alignments that are faster to compute than
any of the existing SMATCH metrics: (W)S(2)MATCH.
WWLK works in two steps: (1) Given its ini-
tial node embeddings, we use WL to project the
graph into a latent space, in which the final node
embeddings describe varying degrees of contextu-
alization. (2) Given a pair of such (WL) embedded
graphs, a transportation plan is found that de-
scribes the minimum cost of transforming one
graph into the other. In the top graph of Figure 3,
f indicates the first step, while Wasserstein dis-
tance indicates the second. Now, we describe the
steps in closer detail.

Step 1: WL Graph Projection into Latent Space
Let v = 1 . . . n be the nodes of AMR G. This graph
is projected onto a matrix in R^{n × (K+1)d} with

f(G) = hstack(X^0_G, . . . , X^K_G),  where                     (2)

X^k_G = [x^k(1), . . . , x^k(n)]^T ∈ R^{n × d}.                 (3)

hstack concatenates matrices column-wise, such that
hstack([a b; c d], [x y; w z]) = [a b x y; c d w z]. This means that, in
the output space, every node is associated with
a vector that is itself a concatenation of K + 1
vectors with d dimensions each, where k indicates
the degree of contextualization (cf. Figure 3).
The embedding x(v)^k ∈ R^d for a node v in a
certain iteration k is computed as follows:

x(v)^{k+1} = 1/2 ( x(v)^k + 1/d(v) · Σ_{u ∈ N_v} w(u, v) · x(u)^k ).    (4)

Figure 3: Wasserstein WLK example w/o learned edge
parameters (top, §3.2) and w/ learnt edge parameters
(bottom, §3.3), which allow us to adjust the embedded
graphs such that they better take the (impact of) AMR
edges into account. Red: the distance increases because
of a negation contrast between the two AMRs that
otherwise convey similar meaning.

d(v) is the degree of a node, N returns the neigh-
bors for a node, w(toi, v) can assign a weight to a
node pair. The initial node embeddings, namely,
x(·)^0, can be set up by looking up the node labels
in a set of pre-trained word embeddings, or using
random initialization. To distinguish between the
discrete edge labels, we sample random weights.
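The weighted WL update of Eq. 4 can be sketched as follows; the graph and embedding containers are illustrative assumptions, not the paper's code.

```python
import numpy as np

def wl_embed(node_vecs, adjacency, edge_weights, K=2):
    """node_vecs: dict node -> np.array(d); adjacency: node -> list of (edge_label, neighbor);
    edge_weights: edge_label -> float. Returns node -> concatenated (K+1)*d embedding (Eqs. 2-4)."""
    layers = [dict(node_vecs)]
    for _ in range(K):
        prev, new = layers[-1], {}
        for v, nbrs in adjacency.items():
            # neighbor aggregation, weighted by the (possibly random or learned) edge weights
            agg = sum(edge_weights[e] * prev[u] for e, u in nbrs) / max(len(nbrs), 1)
            new[v] = 0.5 * (prev[v] + agg)                     # Eq. 4
        layers.append(new)
    # hstack over iterations: each node gets a (K+1)*d vector (Eqs. 2-3)
    return {v: np.concatenate([layer[v] for layer in layers]) for v in node_vecs}
```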

Step 2: Computing the Wasserstein Distance
Between two WL-embedded Graphs The
Wasserstein distance describes the minimum
amount of work that is necessary to transform
le (contextualized) nodes of one graph into the
(contextualized) nodes of the other. It is computed
based on pairwise Euclidean distances from f (G)
with n nodes, and f (G(cid:4)) with m nodes:

distance = Σ_{i=1}^{n} Σ_{j=1}^{m} T_{i,j} · D_{i,j}    (5)

Here, the ‘cost matrix’ D ∈ R^{n×m} contains the
Euclidean distances between the n WL-embedded

nodes from G and m WL-embedded nodes from
G′. That is, D_{i,j} = ||f(G)_i − f(G′)_j||_2. The flow
matrix T describes a transportation plan between
the two graphs, namely, T_{i,j} ≥ 0 states how
much of node i from G flows to node j from
G′; the corresponding ‘local work’ can be stated
as flow(i, j) · cost(i, j) := T_{i,j} · D_{i,j}. To find
the best T, that is, the transportation plan that
minimizes the cumulative work needed (Eq. 5),
we solve a constrained linear problem:4

min_T  Σ_{i=1}^{n} Σ_{j=1}^{m} T_{i,j} D_{i,j}            (6)

s.t.:  T_{i,j} ≥ 0,  1 ≤ i ≤ n, 1 ≤ j ≤ m                  (7)

       Σ_{j=1}^{m} T_{i,j} = 1/n,  1 ≤ i ≤ n               (8)

       Σ_{i=1}^{n} T_{i,j} = 1/m,  1 ≤ j ≤ m               (9)

Note that (i) the transportation plan T describes
an n:m alignment between the nodes of the two
graphs, and that (ii) solving Eq. 6 has polynomial
time complexity, while the (W)S(2)MATCH problem
is NP-complete.
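For illustration, the sketch below builds the cost matrix D and solves the transport problem of Eqs. 6–9 with scipy's LP solver; the paper itself relies on the pyemd package (footnote 4), so this is only a hedged stand-in that makes the constraints explicit.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wasserstein_distance(emb_g, emb_g2):
    """emb_g: (n, (K+1)*d) WL node embeddings of G; emb_g2: (m, (K+1)*d) embeddings of G'."""
    n, m = len(emb_g), len(emb_g2)
    D = cdist(emb_g, emb_g2)                    # pairwise Euclidean distances (cost matrix)
    c = D.ravel()                               # objective: sum_ij T_ij * D_ij   (Eq. 6)
    A_eq, b_eq = [], []
    for i in range(n):                          # row marginals: sum_j T_ij = 1/n  (Eq. 8)
        row = np.zeros((n, m)); row[i, :] = 1
        A_eq.append(row.ravel()); b_eq.append(1.0 / n)
    for j in range(m):                          # column marginals: sum_i T_ij = 1/m (Eq. 9)
        col = np.zeros((n, m)); col[:, j] = 1
        A_eq.append(col.ravel()); b_eq.append(1.0 / m)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")           # T_ij >= 0 (Eq. 7)
    T = res.x.reshape(n, m)                     # flow matrix: an n:m node alignment
    return float(res.fun), T

# usage: two tiny 'graphs' with 2 and 3 nodes embedded in 4 dimensions
dist, T = wasserstein_distance(np.random.rand(2, 4), np.random.rand(3, 4))
```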

3.3 From WWLK to WWLKθ with
zeroth-order Optimization

Motivation: AMR Edge Labels Have Meaning
The WL-embedding mechanism of WWLK (Eq. 4)
associates a weight w(toi, v) ∈ R with each edge.
For unlabeled graphs, w(toi, v) is simply set to
un. To distinguish between the discrete AMR
edge labels, in WWLK we have used random
weights. Cependant, AMR edge labels encode com-
plex relations between nodes, and simply choosing
random weights may not be enough. In fact, we
hypothesize that different edge labels may im-
pact the meaning similarity of AMR graphs in
different ways. Whereas a modifier relation in an
AMR graph configuration may or may not have
a significant influence on the overall AMR graph
similarité, an edge representing negation is bound
to have a significant influence on the similarity
of different AMR graphs. Consider the example
in Figure 3: In the top figure, we embed AMRs
for The pretty warbler sings and The bird sings
gently, which have similar meanings. In the bot-
tom figure, the second AMR has been changed
to express the meaning of The bird doesn’t sing,

4We use https://pypi.org/project/pyemd.

which clearly reduces the meaning similarity of
the two AMRs. Thus, we hypothesize that learn-
ing edge parameters for different AMR relation
types may help to better adjust the graph em-
beddings, such that the Wasserstein distance may
increase or decrease, depending on the specific
meaning of AMR relation labels, and thus to bet-
ter capture global meaning differences between
AMRs (as outlined in Figure 3).

Formally, to make the Wasserstein Weisfeiler-
Leman kernel better account for edge-labeled
AMR graphs, we learn a parameter set Θ that con-
sists of parameters θedgeLabel, where edgeLabel
indicates the semantic relation, i.e., edgeLabel ∈
L = {:arg0, :arg1, . . . , :polarity, . . .}. Thus, in
Eq. 4, we can set w(u, v) = θ_{label(u,v)} and ap-
ply multiplication θ_{label(u,v)} · x(u)^k. To facilitate
the multiplication, we either may learn a matrix
Θ ∈ R^{|L|×d} or a parameter vector Θ ∈ R^{|L|}. In
this paper, we constrain ourselves to the latter
setting, that is, our goal is to learn a parameter
vector Θ ∈ R^{|L|}.

Learning Edge Labels with Direct Feedback
To find suitable edge parameters Θ, we propose
a zeroth order (gradient-free [Conn et al., 2009])
optimization setup, which has the advantage that
we can explicitly teach our metric to better cor-
relate with human ratings, optimizing the desired
correlation objective without detours. In our case,
we apply a simultaneous perturbation stochas-
tic approximation (SPSA) procedure to estimate
gradients (Spall, 1987, 1998; Wang, 2020).5

Let sim(B, Ème) = −W W LKΘ(B) be the sim-
ilarity scores obtained from a (mini-)batch of
graph pairs (B = [(Gj, G(cid:4)
j), . . .]) as provided by
(parametrized) W W LK. Now, let Y be the human
reference scores. Then we design the loss function
as J(Oui, Ème) := 1 − correlation(sim(B, Ème), Oui ).
Further, let μ be coefficients that are sampled
from a Bernoulli distribution. Then the gradient
is estimated as follows:

∇̂Θ = ( J(Y, Θ + cμ) − J(Y, Θ − cμ) ) / (2cμ).    (10)

Finally, we can apply the common SGD learning
rule: Θ_{t+1} = Θ_t − γ∇̂Θ. The learning rate γ and
c decrease proportionally to t.

5It improves upon a classic Kiefer-Wolfowitz approxima-
tion (Kiefer et al., 1952) by requiring, per gradient estimate,
only 2 objective function evaluations instead of 2n.
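A compact sketch of the SPSA loop described above; the batching interface and the way WWLKΘ similarity scores are obtained are simplifying assumptions, not the released training code.

```python
import numpy as np
from scipy.stats import pearsonr

def spsa_train_edge_weights(batches, wwlk_scores, num_labels, epochs=10,
                            gamma0=0.5, c0=0.1):
    """batches: list of (graph_pairs, human_scores); wwlk_scores(graph_pairs, theta) -> similarity array."""
    theta = np.ones(num_labels)                      # one weight per AMR edge label
    t = 1
    for _ in range(epochs):
        for graph_pairs, y in batches:
            gamma, c = gamma0 / t, c0 / t            # step size and perturbation decay with t
            mu = 2 * np.random.binomial(1, 0.5, num_labels) - 1   # Bernoulli +/-1 perturbation
            loss = lambda th: 1.0 - pearsonr(wwlk_scores(graph_pairs, th), y)[0]
            grad = (loss(theta + c * mu) - loss(theta - c * mu)) / (2 * c * mu)  # Eq. 10
            theta = theta - gamma * grad             # SGD update
            t += 1
    return theta
```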

4 BAMBOO: Creating the First Benchmark for AMR Similarity Metrics

We now describe the creation of BAMBOO , lequel
aims to provide the first benchmark that allows
researchers to empirically (je) assess AMR metrics,
(ii) compare AMR metrics, and possibly (iii) train
AMR metrics.

Grounding AMR Similarity Metrics in Human
Ratings of Semantic Sentence Similarity As
the main criterion for assessing AMR similarity
metrics, we use human judgments of the meaning
similarity of sentences underlying pairs of AMRs.
A corresponding principle has been proposed by
Opitz et al. (2020): A metric of pairs of AMR
graphs G and G′ that represent sentences s and
s′ should reflect human judgments of semantic
sentence similarity and relatedness:

amrMetric(G, G′) ≈ humanScore(s, s′)    (11)

Similarity Objectives Accordingly, we select,
as evaluation targets for AMR metrics, three no-
tions of sentence similarity, which have previously
been operationalized in terms of human-rated eval-
uation datasets: (je) the semantic textual similarity
(STS) objective from Baudiš et al. (2016a,b); (ii)
the sentence relatedness objective (SICK) depuis
Marelli et al. (2014); (iii) the paraphrase detection
objective (PARA) by Dolan and Brockett (2005).
Each of these three evaluation data sets can be
seen as a set of pairs of sentences (s_i, s′_i) with an
associated score humanScore(·) that provides
the human sentence relation assessment score
reflecting semantic similarity (STS), semantic
relatedness (SICK) and whether sentences are
paraphrastic (PARA). Thus, each of these data
sets can be described as {(s_i, s′_i, humanScore(s_i, s′_i) = y_i)}_{i=1}^n.
Both STS and SICK offer
scores on Likert scales, ranging from equivalence
(maximum) to unrelated (minimum), while PARA scores
are binary, judging sentence pairs as being para-
phrases (1), or not (0). We min-max normalize the
Likert scale scores to the range [0, 1] to facilitate
standardized evaluation.

For BAMBOO, we replace each pair (s_i, s′_i)
with their AMR parses: (p_i = parse(s_i), p′_i = parse(s′_i)),
transforming the data into {(p_i, p′_i, y_i)}_{i=1}^n.
This provides the main partition of the

source   data instances      s. length        # nodes          density
         train/dev/test      avg.   50th      avg.   50th      avg.   50th
STS      5749/1500/1379      9.9    8         14.1   12        0.10   0.08
SICK     4500/500/4927       9.6    9         10.7   10        0.11   0.1
PARA     3576/500/1275       18.9   19        30.6   30        0.04   0.04

Table 1: BAMBOO data set statistics of the Main
partition. Sentence length (s. length, displayed for
reference only) and graph statistics (average and
median) are calculated on the training sets.

benchmarking data for BAMBOO , henceforth de-
noted as Main.6 Statistics of Main are shown in
Tableau 1). The sentences in PARA are longer com-
pared to STS and SICK. The corresponding AMR
graphs are, on average, much larger in number
of nodes, but less complex with respect to the
average density.7
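For illustration, assembling such a Main partition from a sentence-pair data set can be sketched as follows; the parser interface and the Likert range are placeholders, not the exact BAMBOO pipeline.

```python
def build_main_partition(sentence_pairs, scores, parse, likert_min=0.0, likert_max=5.0):
    """sentence_pairs: list of (s, s2); scores: human ratings; parse: sentence -> AMR graph/string."""
    data = []
    for (s, s2), y in zip(sentence_pairs, scores):
        y_norm = (y - likert_min) / (likert_max - likert_min)   # min-max normalize to [0, 1]
        data.append((parse(s), parse(s2), y_norm))              # (p_i, p'_i, y_i)
    return data
```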

AMR Construction We choose a strong parser
that achieves high scores in the range of human-
human inter-annotator agreement estimates in
AMR banking: The parser yields 0.80–0.83 Smatch
F1 on AMR2 and AMR3. The parser, henceforth
denoted as T5S2S, is based on an AMR fine-tuned
T5 language model (Raffel et al., 2019) and pro-
duces AMRs in a sequence-to-sequence fashion.8
It is on par with the current state-of-the-art that
similarly relies on seq-to-seq (Xu et al., 2020),
but the T5 backbone alleviates the need for mas-
sive MT pre-training. To obtain a better picture
of the graph quality we perform manual quality
inspections.

Manual Data Quality Assessment: Three-way
Graph Quality Ratings From each data set
(SICK, STS, PARA) we randomly select 100
sentences and create their parses with T5S2S. Ad-
ditionally, to establish a baseline, we also parse
the same sentences with the GPLA parser of
Lyu and Titov (2018), a neural graph prediction
system that uses latent alignments (which reports
74.4 Smatch score on AMR2). This results in 300
GPLA parses and 300 T5S2S parses. A human

6The other partitions, which are largely based on this data,

will be introduced in §5.

7The lower average density could be caused, par exemple., by the
fact that the PARA data is sampled from news sources, which
means that the AMRs contain more named entity structures
that usually have more terminal nodes.

8https://github.com/bjascob/amrlib.

        Parser   %gold↑         %silver        %flawed↓

STS     GPLA     43 [33,53]     37 [28,46]     20 [12,27]
        T5S2S    54 [44,64]†‡   41 [31,50]      5 [0,9]†‡

SICK    GPLA     38 [28,47]     49 [39,59]     13 [6,19]
        T5S2S    48 [38,58]     47 [37,57]      5 [0,9]†‡

PARA    GPLA      9 [3,14]      52 [43,62]     39 [29,48]
        T5S2S    21 [13,29]†‡   63 [54,73]†‡   16 [8,23]†‡

ALL     GPLA     30 [25,35]     46 [40,52]     24 [19,29]
        T5S2S    41 [35,46]†‡   50 [45,56]      9 [5,12]†‡

Table 2: Three-way graph assessment. [x,y]:
95% confidence intervals estimated with bootstrap.
†(‡): significant improvement of T5S2S over
GPLA with p < 0.05 (p < 0.005).

annotator9 inspects the (shuffled) sample and as-
signs three-way labels: flawed—an AMR contains
critical errors that distort the meaning signif-
icantly; silver—an AMR contains small errors
that can potentially be neglected; gold—an AMR
is acceptable.

Results in Table 2 show that the quality of
T5S2S parses is substantially better than the
baseline in all three data sets. The percentage of
excellent parses increases considerably (STS: +11pp,
SICK: +10pp, PARA: +11pp) while the percent-
age of flawed parses drops notably (STS: −15pp,
SICK: −8pp, PARA: −23pp). The increases in
gold parses and decreases in flawed parses are sig-
nificant in all data sets (p < 0.05, 10,000 bootstrap
samples of the sample means).10

9The human annotator is a proficient English speaker and
has worked several years with AMR.

10H0(gold): amount of gold graphs T5S2S ≤ amount of
gold graphs GPLA; H0(silver): amount of silver graphs
T5S2S ≤ amount of silver graphs GPLA; H0(flawed): amount
of flawed graphs T5S2S ≥ amount of flawed graphs GPLA.

5 BAMBOO: Robustness Challenges

Besides benchmarking AMR metric scores against
human ratings, we are also interested in assessing
a metric's robustness under meaning-preserving
and -altering graph transformations. Assume we
are given any pair of AMRs from paraphrases.
A small change in structure or node content can
lead to two outcomes: The graphs still repre-
sent paraphrases, or they do not. We consider
a metric to be robust if its ratings correctly reflect
such changes.

Specifically, we propose three transformation
strategies: (i) Reification (Reify), which changes
the graph's surface structure, but not its mean-
ing; (ii) Concept synonym replacement (Syno),
which also preserves meaning and may or may not
change the graph surface structure; (iii) Role con-
fusion (Arg), which applies small changes to the
graph structure that do not preserve its meaning.

Figure 4: Examples for f and g graph transforms.

5.1 Meaning-preserving Transforms

Generally, given a meaning-preserving function f
of a graph, namely,

G ≡ f(G),    (12)

it is natural to expect that a semantic similar-
ity function over the pair of transformed AMRs
nevertheless stays stable, and thus satisfies:

metric(G, G′) ≈ metric(f(G), f(G′)).    (13)

Reification Transform (Reify) Reification is an
established way to rephrase AMRs (Goodman,
2020). Formally, a reification is induced by a rule

edge(x, y) →reify instance(z, h(edge)_0)    (14)
                  ∧ h(edge)_1(z, x)          (15)
                  ∧ h(edge)_2(z, y),         (16)

where h returns, for a given edge, a new concept
and corresponding edges from a dictionary, where
the edges are either :ARGi or :opi. An example is
displayed in Figure 4 (top, left). Besides reifica-
tion for location, other known types are polarity-,
modifier-, or time-reification.11 Processing statis-
tics of the applied reification operations are shown
in Table 3.

            STS                SICK               PARA
            mean  percentiles  mean  percentiles  mean  percentiles
Reify-OPS   2.74  [1, 2, 4]    1.17  [0, 1, 2]    5.14  [3, 5, 7]
Syno-OPS    0.80  [0, 1, 2]    1.31  [0, 1, 2]    1.30  [0, 1, 2]
Arg-OPS     1.33  [1, 1, 2]    1.11  [1, 1, 1]    1.80  [1, 2, 2]

Table 3: Statistics about the amount of transform
operations that were conducted, on average, on
one graph. [x, y, z]: 25th, 50th (median), and 75th
percentile of the amount of operations.

11A complete list of reifications is given in the offi-
cial AMR guidelines: https://github.com/amrisi
/amr-guidelines/blob/master/amr.md.
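For illustration, a reification step in the spirit of Eqs. 14–16 can be sketched over a simple triple representation; the dictionary h below lists a few reifications taken from the AMR guidelines, but the triple format and function are our simplifications, not the benchmark code.

```python
# Hypothetical reification dictionary h: edge label -> (new concept, edge to x, edge to y)
REIFY = {
    ":location": ("be-located-at-91", ":ARG1", ":ARG2"),
    ":polarity": ("have-polarity-91", ":ARG1", ":ARG2"),
    ":time":     ("be-temporally-at-91", ":ARG1", ":ARG2"),
}

def reify(triples):
    """triples: list of (source, edge_label, target). Returns a reified triple list (Eqs. 14-16)."""
    out, fresh = [], 0
    for x, edge, y in triples:
        if edge in REIFY:
            concept, e1, e2 = REIFY[edge]
            z = f"z{fresh}"; fresh += 1
            out += [(z, ":instance", concept), (z, e1, x), (z, e2, y)]
        else:
            out.append((x, edge, y))
    return out

# "The cat sleeps in the garden": sleep-01 :location garden
triples = [("s", ":instance", "sleep-01"), ("s", ":arg0", "c"),
           ("c", ":instance", "cat"), ("s", ":location", "g"), ("g", ":instance", "garden")]
print(reify(triples))
```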
Synonym Concept Node Transform (Syno)
Here, we iterate over AMR concept nodes. For
any node that involves a predicate from PropBank,
we consult a manually created database of (near-)
synonyms that are also contained in PropBank,
and sample one for replacement. For example,
some sense of fall is near-equivalent to a sense
of decrease (car prices fell/decreased). For con-
cepts that are not predicates we run an ensemble
of four WSD solvers12 (based on the concept and
the sentence underlying the AMR) to identify its
WordNet synset. From this synset we sample an
alternative lemma.13 If an alternative lemma con-
sists of multiple tokens where modifiers precede
the noun, we replace the node with a graph
substructure. So, if the concept is man and we sam-
ple adult male, we expand 'instance(x, man)' with
'mod(x, y) ∧ instance(y, adult) ∧ instance(x, male)'.
Data processing statistics are shown in Table 3.

12'Adapted lesk', 'Simple Lesk', 'Cosine Lesk', 'max sim'
(Banerjee and Pedersen, 2002; Lesk, 1986; Pedersen, 2007):
https://github.com/alvations/pywsd.

13To increase precision, we only perform this step if all
solvers agree on the predicted synset.

5.2 Meaning-altering Graph Transforms

Role Confusion (Arg) A naïve AMR metric
could be one that treats an AMR as a bag-of-nodes,
omitting structural information, such as edges and
edge-labels. Such metrics could exhibit mislead-
ingly high correlation scores with human ratings,
solely due to a high overlap in concept content.
Hence, we design adversarial instances that can
probe an AMR metric when confronted with cases
of opposing factuality (e.g., polarity, modality, or
relation inverses), while concept overlap is largely
preserved. We design a function

G ≠ g(G),    (17)

that confuses role labels (see Arg in Figure 4).
We make use of this function to turn two para-
phrastic AMRs (G, G′) into non-paraphrastic
AMRs, by applying g to either G, or G′, but not
both. In some cases g may create a meaning that
still makes sense (The tiger bites the snake. →
The snake bites the tiger.), while in others, g may
induce a non-sensical meaning (The tiger jumps
on the rock. → The rock jumps on the tiger.).
However, this is not our primary concern, since
in all cases, applying g achieves our main goal:
It returns a different meaning that turns a
paraphrase relation between two AMRs into a
non-paraphrastic one.

Figure 5: Metric objective example for Arg.

To implement Arg, for each data set (PARA,
STS, SICK) we create one new data subset. First,
(i) we collect all paraphrases from the initial data
(in SICK and STS these are pairs with maximum
human score).14 (ii) We iterate over the AMR pairs
(G, G′) and randomly select the first or second
AMR from the tuple. We then collect all n nodes
with more than one outgoing edge. If n = 0, we
skip this AMR pair (the pair will not be contained
in the data). If n > 0, we apply the
meaning-altering function g and randomly flip
edge labels. Finally, we add the original (G, G′) to
our data with the label paraphrase, and the altered
pair (G, g(G′)) with the label non-paraphrase (cf.
Figure 5). Per graph, we allow a maximum
of 3 role confusion operations (see Table 3 for
processing statistics).
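A minimal sketch of the role-confusion transform g over the same illustrative triple format as before (again a simplification, not the benchmark code):

```python
import random

def role_confuse(triples, max_ops=3):
    """Swap the labels of two outgoing edges of a node with >1 outgoing edge (meaning-altering g)."""
    triples = list(triples)
    # group outgoing (non-instance) edge indices by source node
    outgoing = {}
    for idx, (src, edge, _) in enumerate(triples):
        if edge != ":instance":
            outgoing.setdefault(src, []).append(idx)
    candidates = [ids for ids in outgoing.values() if len(ids) > 1]
    for ids in random.sample(candidates, min(max_ops, len(candidates))):
        i, j = random.sample(ids, 2)                         # pick two edges of the same node
        (s1, e1, t1), (s2, e2, t2) = triples[i], triples[j]
        triples[i], triples[j] = (s1, e2, t1), (s2, e1, t2)  # flip their role labels
    return triples

# The tiger bites the snake -> (possibly) The snake bites the tiger
triples = [("b", ":instance", "bite-01"), ("b", ":arg0", "t"), ("b", ":arg1", "s"),
           ("t", ":instance", "tiger"), ("s", ":instance", "snake")]
print(role_confuse(triples))
```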

14This shrinks the train/dev/test size of STS (now:
474/106/158) and SICK (now: 246/50/238).

5.3 Discussion

Safety of Robustness Objectives We have pro-
posed three challenging robustness objectives.
Reify changes the graph structure, but preserves
the meaning. Arg
keeps the graph structure
(modulo edge labels) while changing the meaning.
Syno changes node labels and possibly the graph
structure and aims at preserving the meaning.

Reify and Arg are fully safe: they are well
defined and are guaranteed to fulfill our goal (Eq.
12 and 17): meaning-preserving or -altering graph
transforms. Syno is more experimental and has
(at least) three failure modes. In the first mode, de-
pending on context, a human similarity judgment
could change when near-synonyms are chosen
(sleep → doze, a young cat → kitten, etc.). The
second mode occurs when WSD commits an error
(e.g., minister (political sense) → priest). A third
mode concerns societal biases found in WordNet (e.g.,
the node girl may be mapped onto its ‘synonym’
missy). The third mode may not really be a failure,
since it may not change the human rating, but,
nevertheless, it may be undesirable.

In conclusion, Reify and Arg confusion con-
stitute safe robustness challenges, while results on
Syno have to be taken with a grain of salt.

Status of the Challenges in BAMBOO and Out-
look We believe that a key benefit of the
robustness challenges lies in their potential to
provide complementary performance indicators,
in addition to evaluation on the Main partition
of BAMBOO (cf. §4). In particular, the challenges
may serve to assess metrics more deeply, uncover
potential weak spots, and help select among met-
rics, for example, when performance differences
on Main are small. In this work, however, the
complementary nature of Reify , Syno or Arg
versus Main is only reflected in the name of the
partitions, and in our experiments, we consider all
partitions equally. Future work may deviate from
this setup.

Our proposed robustness challenges are also by
no means exhaustive, and we believe that there is
ample room for developing more challenges (ex-
tending BAMBOO) or experimenting with different
setups of our challenges (varying BAMBOO15). For
these reasons, it is possible that future work may

15For example, we may reify only selected relations,
or create more data, setting Eq. 13 to metric(G, G′) ≈
metric(G, f(G′)), only applying f to one graph.

justify alternative or enhanced setups, extensions
and variations of BAMBOO .

6 Experimental Insights

Questions Posed to BAMBOO
BAMBOO allows
us to address several open questions: The first set
of questions aims to gain more knowledge about
previously released metrics. That is, we would like
to know: What semantic aspects of AMR does a
metric measure? If a metric has hyper-parameters
(e.g., SEMBLEU), which hyper-parameters are suit-
able (for a specific objective)? Does the costly
alignment of SMATCH pay off, by yielding better
predictions, or do the faster alignment-free met-
rics offer a ‘free-lunch’? A second set of questions
aims to evaluate our proposed novel AMR similar-
ity metrics, and to assess their potential advantages.

Experimental Setup We evaluate all metrics on
the test set of BAMBOO. The two hyper-parameters
of S2MATCH, that determine when concepts are
similar, are set with a small search on the devel-
opment set (by contrast, S2MATCHdefault denotes
the default setup). WWLKθ is trained with batch
size 16 on the training data. S2MATCH, WWLK
and WWLKθ all make use of GloVe embeddings
(Pennington et al., 2014).

Our main evaluation metric is Pearson’s ρ be-
tween a metric’s output and the human ratings.
In addition, we consider two global performance
measures to better rank AMR metrics: the arithme-
tic mean (amean) and the harmonic mean (hmean)
over a metric’s results achieved in all tasks. Hmean
is always ≤ amean and is driven by low outliers.
Thus, a large difference between amean and
hmean serves as a warning light for a metric that
is extremely vulnerable in a specific task.
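A small sketch of how these summary statistics can be computed from per-task metric outputs (assuming all per-task correlations are positive, since the harmonic mean is only defined for positive values):

```python
import numpy as np
from scipy.stats import pearsonr, hmean

def benchmark_summary(metric_scores, human_scores):
    """metric_scores, human_scores: dicts task -> array of per-pair scores for that task."""
    rhos = {task: pearsonr(metric_scores[task], human_scores[task])[0]
            for task in human_scores}
    vals = np.array(list(rhos.values()))
    return rhos, vals.mean(), hmean(vals)      # per-task Pearson's rho, amean, hmean
```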

6.1 BAMBOO Studies Previous Metrics

Tableau 4 shows AMR metric results on BAMBOO
across all
three human similarity rating types
(STS,SICK, PARA) and our four challenges: Main
represents the standard setup (cf. §4), alors que
Reify , Syno , and Arg
the metric
robustness (cf. §5).

test

SMATCH and S2MATCH Rank 1st and 2nd of Pre-
vious Metrics SMATCH, our baseline metric, pro-
vides strong results across all tasks (Tableau 4, amean:
51.28). With default parameters, S2MATCHdefault
performs slightly worse on the main data for STS

                 speed align |       Main        |       Reify       |       Syno        |        Arg        | amean  hmean
                             | STS   SICK  PARA  | STS   SICK  PARA  | STS   SICK  PARA  | STS   SICK  PARA  |
SMATCH             −     ✓   | 58.45 59.72 41.25 | 57.98 61.81 39.66 | 56.14 57.39 39.58 | 48.05 70.53 24.75 | 51.28  47.50
WSMATCH            −     ✓   | 53.06 59.24 38.64 | 53.39 61.17 37.49 | 51.41 57.56 37.85 | 42.47 66.79 22.68 | 48.48  44.58
S2MATCHdefault     −     ✓   | 56.38 58.15 42.16 | 55.65 60.04 40.41 | 56.05 57.17 40.92 | 46.51 70.90 26.58 | 50.91  47.80
S2MATCH            −     ✓   | 58.82 60.42 42.55 | 58.08 62.25 40.60 | 56.70 57.92 41.22 | 48.79 71.41 27.83 | 52.22  49.07
SEMA               ++        | 55.90 53.32 33.43 | 55.51 56.16 32.33 | 50.16 48.87 29.11 | 49.73 68.18 22.79 | 46.29  41.85
SEMBLEUk=1         ++        | 66.03 62.88 39.72 | 61.76 62.10 38.17 | 61.83 58.83 37.10 |  1.40  1.99  1.47 | 41.11   5.78
SEMBLEUk=2         ++        | 60.62 59.86 36.88 | 57.68 59.64 36.24 | 57.34 56.18 33.26 | 44.54 67.54 16.60 | 48.87  42.13
SEMBLEUk=3         ++        | 56.49 57.76 32.47 | 54.84 57.70 33.25 | 52.82 53.47 28.44 | 49.06 69.49 24.27 | 47.50  42.82
SEMBLEUk=4         ++        | 53.19 56.69 29.61 | 52.28 56.12 30.11 | 49.31 52.11 25.56 | 49.75 69.58 29.44 | 46.15  41.75
WLK (ours)         ++        | 64.86 61.52 37.35 | 62.69 62.55 36.49 | 59.41 56.60 33.71 | 45.89 64.70 19.47 | 50.44  44.35
WWLK (ours)        +     ✓   | 63.15 65.58 37.55 | 59.78 65.53 35.81 | 59.40 59.98 32.86 | 13.98 42.79  7.16 | 45.30  28.83
WWLKΘ (ours)       +     ✓   | 66.94 67.64 37.91 | 64.34 65.49 39.23 | 60.11 62.29 35.15 | 55.03 75.06 29.64 | 54.90  50.26

Table 4: BAMBOO benchmark result of AMR metrics. All numbers are Pearson's ρ × 100. ++: linear
time complexity; +: polynomial time complexity; −: NP complete.

and SICK, but improves upon SMATCH on PARA,
achieving a slight overall improvement with re-
spect to hmean (+0.30), but not amean (−0.37).
S2MATCH is more robust against Syno (e.g., +4.6
on Syno STS vs. SMATCH), and when confronted
with reified graphs (Reify STS +3.3 vs. SMATCH).
Finally, S2MATCH, after setting its two hyper-
parameters with a small search on the development
data16, consistently improves upon SMATCH over
all tasks (amean: +0.94, hmean: +1.57).

WSMATCH: Are Nodes Near the Root More Im-
portant? The hypothesis underlying WSMATCH
is that concepts that are located near the top of an
AMR have a higher impact on AMR similarity rat-
ings. Fait intéressant, WSMATCH mostly falls short of
SMATCH, offering substantially lower performance
on all main tasks and all robustness checks, result-
ing in reduced overall amean and hmean scores
(e.g., main STS: −5.39 vs. SMATCH, amean: −2.8
vs. SMATCH, hmean: −2.9 vs. SMATCH). This contra-
dicts the ‘core-semantics’ hypothesis and provides
novel evidence that semantic concepts that influ-
ence human similarity ratings are not necessarily
located close to AMR roots.17

16STS/SICK: τ = 0.90, τ′ = 0.95; PARA: τ = 0.0,
τ′ = 0.95

17Manual inspection of examples shows that low similarity
can frequently be explained with differences in concrete
concepts that tend to be distant to the root. For example,
the low similarity (0.16) of Morsi supporters clash with riot
police in Cairo vs. Protesters clash with riot police in Kiev
arises mostly from Kiev and Cairo and Morsi; however, these
names (as are names in general in AMR) are distant to the root
region, which is similar in both graphs (clash, riot, protesters,
supporters).

BFS-based Metrics I: SEMA Increases Speed but
Pays a Price Next, we find that SEMA achieves
lower scores in almost all categories, when com-
pared with SMATCH (amean: −4.99, hmean −5.65),
ending up at rank 7 (according to hmean and
amean) among prior metrics. It is similar to
SMATCH in that it extracts triples from graphs, but
differs by not providing an alignment. Therefore,
it can only loosely model some phenomena, and
we conclude that the increase in speed comes at
the cost of a substantial drop in modeling capacity.

BFS-based Metrics II: SEMBLEU is Fast, but is
Sensitive to k Results for SEMBLEU show that it
is very sensitive to parameterizations of k. Notably,
k = 1, which means that the method only extracts
bags of nodes, achieves strong results on SICK
and STS. On PARA, however, SEMBLEU is out-
performed by S2MATCH, for all settings of k (best k
(k = 2): −2.8 amean, −4.7 hmean). Moreover, all
variants of SEMBLEU are vulnerable to robustness
checks. E.g., k = 2, and, naturally, k = 1 are
easily fooled by Arg , where performance drops
massively. k = 4, on the other hand, is most robust
against Arg , but overall it falls behind k = 2.

Since SEMBLEU is asymmetric, we also re-
compute the metric in a ‘symmetric’ way by av-
eraging the metric result over different argument
orders. We find that this can slightly increase its
performance ([k, amean, hmean]: [1, +0.8, +0.6];
[2, +0.5, +0.4]; [3, +0.2, +0.2]; [4, +0.1, +0.0]).

In sum, our conclusions concerning SEMBLEU
are: (i) SEMBLEUk=1 (but not SEMBLEUk=3) per-
forms well when measuring similarity and relat-
edness. However, SEMBLEUk=1 is naïve and easily

fooled (Arg). (ii) Thus, we recommend k = 2
as a good tradeoff between robustness and per-
formance, with overall rank 4 (amean) and 6
(hmean).18

6.2 BAMBOO Assesses Novel Metrics

We now discuss results of our proposed metrics
based on the Weisfeiler-Leman Kernel.

Standard Weisfeiler-Leman (WLK) is Fast and
a Strong Baseline for AMR Similarity First,
we visit the classic Weisfeiler-Leman kernel. Like
SEMBLEU and SEMA, the (alignment-free) method
is very fast. However, it outperforms these metrics
in almost all tasks (score difference against sec-
ond best alignment-free metric: ([a|h]mean: +1.6,
+1.5) but falls behind alignment-based SMATCH
([a|h]mean: −0.8, −3.2). Specifically, WLK
proves robust against Reify but appears more
(−5 points on STS
vulnerable against Syno
and SICK) and Arg (notably PARA, with −10
points).19

The better performance, compared to SEMBLEU
and SEMA, may be due to the fact that WLK (et-
like SEMBLEU and SEMA) does not perform BFS
traversal from the root, which may reduce biases.

WWLK and WWLKθ Obtain First Ranks
Basic WWLK exhibits strong performance on
SICK (ranking second on main and first on
Reify). However, it has large vulnerabilities, as
exposed by Arg , where only SEMBLEUk=1 ranks
lower. This can be explained by the fact that
WWLK (7.2 Pearson’s ρ on PARA Arg ) only
weakly considers the semantic relations (whereas
SEMBLEUk=1 does not consider semantic relations
in the first place).

WWLKΘ, our proposed algorithm for edge label
learning, mitigates this vulnerability (29.6 Pear-
son’s ρ on PARA Arg , 1st rank). Learning edge
labels also helps assessing similarity (STS) and re-
latedness (SICK), with substantial improvements
over standard WWLK and SMATCH (STS: 66.94,
+3.9 vs. WWLK and +10.6 vs. SMATCH; SICK
+2.1 vs. WWLK and +8.4 vs. SMATCH).

18Setting k = 2 stands in contrast to the original paper that
recommended k = 3, the common setting in MT. However,
lower k in SEMBLEU reduces biases (Opitz et al., 2020), lequel
may explain the better result on BAMBOO .

19Similar to SEMBLEU, we can mitigate this performance
drop on Arg PARA by increasing the amount of passes K
in WLK; however, this decreases overall amean and hmean.

K (#WL iters)   basic (K=2)       K=1              K=3              K=4
                amean   hmean     amean   hmean    amean   hmean    amean   hmean
WLK             50.4    44.4      49.8    44.2     47.6    42.4     46.4    41.5
WWLK            45.3    28.8      43.4    15.3     45.7    31.4     42.3    24.0
WWLKθ           54.9    50.3      52.2    35.4     55.2    51.1     50.8    47.3

Table 5: WLK variants with different K.

In sum, WWLKθ occupies rank 1 of all consi-
dered metrics (amean and hmean), outperforming
all non-alignment based metrics by large margins
(amean +4.5 vs. WLK and +6.0 vs. SEMBLEUk=2;
hmean +5.9 vs. WLK and +8.1 vs. SEMBLEUk=2),
but also the alignment-based ones, albeit by lower
margins (amean +2.7 vs. S2MATCH; hmean +1.2
vs. S2MATCH).

6.3 Analyzing Hyper-parameters of (W)WLK

Setting K in (W)WLK How does setting the
number of iterations in Weisfeiler-Leman affect
predictions? Table 5 shows K = 2 is a good choice
for all WLK variants. K = 3 slightly increases
performance in the latent variants (WWLK: +0.4
amean; WWLKθ: +0.3 amean), but lowers perfor-
mance for the fast symbolic matching WLK (−2.8
amean). This drop is somewhat expected: K > 2
introduces much sparsity in the symbolic WLK
feature space.

WL Message Passing Direction Even though
AMR defines directional edges, for optimal sim-
ilarity ratings, it was not a-priori clear in which
directions the node contextualization should be
restricted when attempting to model human sim-
ilarity. Therefore, so far, our WLK variants have
treated AMR graphs as undirected graphs.
In this experiment, we study three alternate sce-
narios: ‘TOP-DOWN’ (forward), where infor-
mation is only passed in the direction that AMR
edges point at, ‘BOTTOM-UP’ (backwards),
where information is exclusively passed in
the opposite direction, and 2WAY, where
information is passed forwards, but for every
edge edge(x, y) we insert an edge−1(y, x). 2WAY
facilitates more node interactions than either TOP-
DOWN or BOTTOM-UP, while preserving direc-
tional information.
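A sketch of the 2WAY construction on the dictionary-based graph format used in the earlier examples (the inverse-edge naming is our assumption):

```python
def make_2way(graph):
    """graph: dict node -> list of (edge_label, neighbor). Add an inverse edge for every edge (2WAY)."""
    out = {v: list(nbrs) for v, nbrs in graph.items()}
    for v, nbrs in graph.items():
        for edge, u in nbrs:
            out.setdefault(u, []).append((edge + "-inv", v))   # edge^{-1}(y, x)
    return out
```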

Our findings in Table 6 show a clear trend:
treating AMR graphs as graphs with undirected
edges offers better results than TOP–DOWN (par exemple.,

            undirected        TOP-DOWN         BOTTOM-UP        2WAYS
            amean   hmean     amean   hmean    amean   hmean    amean   hmean
WLK         50.4    44.4      50.3    44.3     50.2    43.8     49.5    41.8
WWLK        45.3    28.8      43.7    22.0     41.6     9.9     44.8    24.1
WWLKθ       54.9    50.3      53.8    46.1     50.2    18.7     55.3    51.0

Table 6: (W)WLK: message passing directions.

WWLK −1.6 amean; −6.6 hmean) and consid-
erably better results when compared to WLK in
BOTTOM–UP mode (e.g., WWLK −3.7 amean;
−18.9 hmean). Overall, 2WAY behaves similarly
to the standard setup, with a slight improvement
for WWLKθ. Notably, the symbolic WLK variant,
that does not use word embeddings, appears more
robust in this experiment and differences between
the three directional setups are small.

6.4 Revisiting the Data Quality in BAMBOO
Initial quality analyses (§4) suggested that the
quality of BAMBOO is high, with a large proportion
of AMR graphs that are of gold or silver quality.
In this experiment, we study how metric rankings
and predictions could change when confronted
with AMRs corrected by humans. From every
data set, we randomly sample 50 AMR graph pairs
(300 AMRs in total). In each AMR, the human
annotator searched for mistakes, and corrected
them.20

We study two settings. (i) Intra metric agree-
ment (IMA): For every metric, we calculate the
correlation of its predictions for the initial graph
pairs versus the predictions for the graph pairs that
are ensured to be correct. Note that, on one hand,
a high IMA for all metrics would further corrobo-
rate the trustworthiness of BAMBOO results. How-
ever, on the other hand, a high IMA for a single
metric cannot be interpreted as a marker for this
metric's quality. That is, a maximum IMA (1.0)
could also indicate that a metric is completely in-
sensitive to the human corrections. Furthermore,
we study (ii) Metric human agreement (MHA):
Here, we correlate the metric scores against hu-
man ratings, once when fed the fully gold-ensured
graph pairs and once when fed the standard graph
pairs. Both measures, IMA and MHA, can provide
us with an indicator of how much metric ratings
would change if BAMBOO were fully hu-
man corrected.

20Overall, few corrections were necessary, as reflected in a
high SMATCH between corrected and uncorrected graphs: 95.1
(STS), 96.8 (SICK), 97.9 (PARA).

          STS              SICK             PARA             AVERAGE
          MHA      IMA     MHA      IMA     MHA      IMA     MHA      IMA
SM        [71, 73] 97.9    [66, 66] 99.9    [44, 44] 97.9    [60, 61] 98.6
WSM       [64, 65] 99.2    [67, 67] 99.8    [47, 49] 98.7    [59, 60] 99.2
S2Mdef    [69, 70] 97.7    [62, 63] 99.3    [44, 47] 97.7    [58, 60] 98.2
S2M       [71, 73] 97.8    [69, 70] 98.6    [41, 46] 98.0    [60, 63] 98.1
SE        [66, 66] 97.7    [55, 55] 100     [42, 46] 99.0    [55, 56] 98.9
SB2       [68, 68] 97.2    [62, 62] 99.8    [41, 42] 98.8    [57, 58] 98.6
SB3       [66, 66] 98.4    [63, 63] 99.7    [33, 34] 99.3    [54, 54] 99.1
WLK       [72, 72] 98.2    [65, 65] 99.8    [43, 46] 97.9    [60, 61] 98.6
WWLK      [77, 78] 97.8    [65, 67] 98.1    [42, 46] 97.8    [61, 63] 97.9
WWLKθ     [78, 78] 96.8    [67, 68] 98.1    [48, 48] 96.7    [64, 65] 97.2

Table 7: Retrospective sub-sample quality analy-
sis of BAMBOO graph quality and sensitivity of
metrics. All values are Pearson's ρ × 100. Metric
Human Agreement (MHA): [x, y], where x is the
correlation (to human ratings) when the metric is
executed on the uncorrected sample and y is the
same assessment on the manually post-processed
sample.

Results are shown in Table 7. All metrics exhi-
bit high IMA, suggesting that potential changes in
their ratings, when fed gold-ensured graphs, are
quite small. Furthermore, on average, all metrics
tend to exhibit slightly better correlation with the
human when computed on the gold-ensured graph
pairs. However, supporting the assessment of
IMA, the increments in MHA appear small, rang-
ing from a minimum increment of +0.3 (SEMBLEU)
to a maximum increment of +2.8 (S2MATCH),
whereas WWLK yields an increment of +1.8.
Generally, while this assessment has to be taken
with a grain of salt due to the small sample size, it
overall supports the validity of BAMBOO results.

6.5 Discussion

Align or not Align? We can group metrics for
graph-based meaning representations into whether
they compute an alignment between AMRs or not
(Liu et al., 2020). A computed alignment, as in
SMATCH, has the advantage that it lets us assess
finer-grained AMR graph similarities and diver-
gences, by creating and exploiting a mapping that
shows which specific substructures of two graphs
are more or less similar to each other. On the other
hand, it was still an open question whether such
an alignment is worth its computational cost and
enhances similarity judgments.

Experiments on BAMBOO provide novel evi-
dence on this matter: alignment-based metrics
may be preferred for better accuracy. Non-
alignment based metrics may be preferred if

of 4972), all metrics considered in this paper, in-
cluding ours, assign relative ranks that are too low
(WWLK: 2624). Future work may incorporate ex-
ternal PropBank (Palmer et al., 2005) knowledge
into AMR metrics. In PropBank, sense 11 of play
is defined as equivalent to making music.

7 Conclusion

Our contributions in this work are three-fold: (i)
We propose a suite of novel Weisfeiler-Leman
AMR similarity metrics that are able to recon-
cile a performance conflict between precision of
AMR similarity ratings and the efficiency of com-
puting alignments. (ii) We release BAMBOO, the
first benchmark that allows researchers to assess
AMR metrics empirically, setting the stage for
future work on graph-based meaning represen-
tation metrics. (iii) We showcase the utility of
BAMBOO , by applying it to profile existing AMR
metrics, uncovering hitherto unknown strengths
or weaknesses, and to assess the strengths of our
newly proposed metrics that we derive and fur-
ther develop from the classic Weisfeiler-Leman
Kernel. We show that through BAMBOO we are
able to gain novel insight regarding suitable hy-
perparameters of different metric types, and to
gain novel perspectives on how to further improve
AMR similarity metrics to achieve better corre-
lation with the degree of meaning similarity of
paired sentences, as perceived by humans.

Acknowledgments

We are grateful to three anonymous reviewers
and Action Editor Yue Zhang for their valuable
comments that have helped to improve this paper.
We are also thankful to Philipp Wiesenbach for
giving helpful feedback on a draft of this paper.
This work has been partially funded by the DFG
through the project ACCEPT as part of the Prior-
ity Program ‘‘Robust Argumentation Machines’’
(SPP1999).

References

Rafael Torres Anchiêta, Marco Antonio Sobrevilla
Cabezudo, and Thiago Alexandre Salgueiro
Pardo. 2019. SEMA: An extended semantic
evaluation for AMR. In (To appear) Proceed-
ings of the 20th Computational Linguistics and
Intelligent Text Processing. Springer Interna-
tional Publishing.

Figure 6: WWLK alignments and metric scores for
dissimilar (top, STS) and similar (bottom, SICK)
AMRs. Excavators indicate heavy Wasserstein work
flow · cost.

speed matters most. The latter situation may
occur, for example, when AMR metrics must be
executed over a large cross-product of parses (for
instance, to semantically cluster sentences from a
corpus). For a balanced approach, WWLKΘ offers
a good trade-off: polynomial-time alignment and
high accuracy.

Example Discussion I: Wasserstein Trans-
portation Analysis Explains Disagreement
Figure 6 (top) shows an example where the
human-assigned similarity score is relatively low
(rank 1164 of 1379). Due to the graphs having the
same structure (x arg0 y; x arg1 z), the previous
metrics (except SEMA) tend to assign similarities
that are relatively too high. In particular, S2MATCH
finds the exact same alignments in this case, but
cannot assess the concept-relations more deeply.
WWLK yields more informative alignments since
they explain its decision to assign a more appro-
priate lower rank (1253 of 1379): Substantial work
is needed to transport, for example, carry-01 to
slice-01.

Example Discussion II: TheVvalue of n:m
Alignments Figure 6 (bottom) shows that WWLK
produces valuable n:m alignments (play-11 vs.
make-01 and music), which are needed to prop-
erly reflect similarity (note that SMATCH, WSMATCH,
and S2
MATCH only provide 1-1 alignments). Encore,
the example also shows that there is still a way
to go. While humans assess this near-equivalence
facilement, providing a relatively high score (rank 331

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria. Association for Computational Linguistics.

Satanjeev Banerjee and Ted Pedersen. 2002. Un
adapted lesk algorithm for word sense dis-
ambiguation using wordnet. In International
Conference on Intelligent Text Processing and
Computational Linguistics, pages 136–145.
Springer. https://doi.org/10.1007/3
-540-45715-1_11

Petr Baudiš, Jan Pichl, Tomáš Vyskočil, and Jan Šedivý. 2016a. Sentence pair scoring: Towards
unified framework for text comprehension.
arXiv preprint arXiv:1603.06127. https://
doi.org/10.18653/v1/W16-1602

Petr Baudiš, Silvestr Stanko, and Jan Šedivý.
2016b. Joint learning of sentence embeddings
for relevance and entailment. In Proceedings
of the 1st Workshop on Representation Learn-
ing for NLP, pages 8–17, Berlin, Allemagne.
Association for Computational Linguistics.

Rexhina Blloshmi, Rocco Tripodi, and Roberto
Navigli. 2020. XL-AMR: Enabling cross-
lingual AMR parsing with transfer learning
techniques. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 2487–2500,
En ligne. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.emnlp-main.195

Claire Bonial, Stephanie M. Lukin, David
Doughty, Steven Hill, and Clare Voss. 2020.
InfoForager: Leveraging semantic search with
AMR for COVID-19 research. In Proceedings
of the Second International Workshop on De-
signing Meaning Representations, pages 67–77,
Barcelona Spain (online). Association for
Computational Linguistics.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47. https://doi.org/10.1162/coli.2006.32.1.13

Deng Cai and Wai Lam. 2019. Core semantic first:
A top-down approach for AMR parsing. En Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 3799–3809, Hong Kong, Chine. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1393

Shu Cai and Kevin Knight. 2013. Smatch: An eval-
uation metric for semantic feature structures.
In Proceedings of the 51st Annual Meeting
of the Association for Computational Linguis-
tics (Volume 2: Short Papers), pages 748–752,
Sofia, Bulgaria. Association for Computational
Linguistics.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017.
SemEval-2017 task 1: Semantic textual similar-
ity multilingual and crosslingual focused evalu-
ation. In Proceedings of the 11th International
Workshop on Semantic Evaluation (SemEval-
2017), pages 1–14, Vancouver, Canada. Asso-
ciation for Computational Linguistics.

Andrew R. Connecticut, Katya Scheinberg, and Luis N.
Vicente. 2009. Introduction to Derivative-Free
Optimization, SIAM. https://doi.org
/10.1137/1.9780898718768

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Claire Gardent, Anastasia Shimorina, Shashi
Narayan, and Laura Perez-Beltrachini. 2017.
The WebNLG challenge: Generating text from
RDF data. In Proceedings of the 10th Interna-
tional Conference on Natural Language Gener-
ation, pages 124–133, Santiago de Compostela,
Espagne. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1/W17
-3518

Sebastian Gehrmann, Tosin Adewumi, Karmanya
Aggarwal, Pawan Sasanka Ammanamanchi,

Aremu Anuoluwapo, Antoine Bosselut,
Khyathi Raghavi Chandu, Miruna Clinciu,
Dipanjan Das, Kaustubh D. Dhole, Wanyu Du,
Esin Durmus, Ondřej Dušek, Chris Emezue,
Varun Gangal, Cristina Garbacea, Tatsunori
Hashimoto, Yufang Hou, Yacine Jernite, Harsh
Jhamtani, Yangfeng Ji, Shailza Jolly, Dhruv
Kumar, Faisal Ladhak, Aman Madaan,
Mounica Maddela, Khyati Mahajan, Saad
Mahamood, Bodhisattwa Prasad Majumder,
Pedro Henrique Martins, Angelina McMillan-
Major, Simon Mille, Emiel van Miltenburg,
Moin Nadeem, Shashi Narayan, Vitaly Nikolaev,
Rubungo Andre Niyongabo, Salomey Osei,
Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672. https://doi.org/10.18653/v1/2021.gem-1.10

Michael Wayne Goodman. 2020. Penman: Un
open-source library and tool for AMR graphs. Dans
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics: Sys-
tem Demonstrations, pages 312–319, En ligne.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-demos.35

Thomas Hofmann, Bernhard Schölkopf, and
Alexander J. Smola. 2008. Kernel methods
in machine learning. The Annals of Statistics,
pages 1171–1220. https://est ce que je.org/10
.1214/009053607000000677

Pavan Kapanipathi, Ibrahim Abdelaziz, Srinivas
Ravishankar, Salim Roukos, Alexander Gray,
Ramon Astudillo, Maria Chang, Cristina
Cornelio, Saswati Dana, Achille Fokoue,
Dinesh Garg, Alfio Gliozzo, Sairam Gurajada,
Hima Karanam, Naweed Khan, Dinesh
Khandelwal, Young-Suk Lee, Yunyao Li,
Francois Luus, Ndivhuwo Makondo, Nandana
Mihindukulasooriya, Tahira Naseem, Sumit
Neelam, Lucian Popa, Revanth Reddy, Ryan
Riegel, Gaetano Rossiello, Udit Sharma, G. P. Shrivatsa Bhargav, and Mo Yu. 2021. Leverag-
ing abstract meaning representation for knowl-
edge base question answering. Findings of
the Association for Computational Linguistics:
ACL. https://doi.org/10.18653/v1
/2021.findings-acl.339

Jack Kiefer and Jacob Wolfowitz. 1952. Stochas-
tic estimation of the maximum of a regression
fonction. The Annals of Mathematical Statis-
tics, 23(3):462–466. https://est ce que je.org/10
.1214/aoms/1177729392

Peter Kolb. 2009. Experiments on the difference
between semantic similarity and relatedness. Dans
Proceedings of the 17th Nordic Conference of
Computational Linguistics (NODALIDA 2009),
pages 81–88, Odense, Denmark. Northern Eu-
ropean Association for Language Technology
(NEALT).

Michael Lesk. 1986. Automatic sense disam-
biguation using machine readable dictionaries:
How to tell a pine cone from an ice cream
cone. In Proceedings of the 5th Annual Interna-
tional Conference on Systems Documentation,
pages 24–26. https://doi.org/10.1145
/318723.318728

Jiangming Liu, Shay B. Cohen, and Mirella
Lapata. 2020. Dscorer: A fast evaluation metric
for discourse representation structure parsing.
In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguis-
tics, pages 4547–4554, En ligne. Association for
Computational Linguistics.

Chunchuan Lyu and Ivan Titov. 2018. AMR pars-
ing as graph prediction with latent alignment.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 397–407,
Melbourne, Australia. Association for Compu-
tational Linguistics.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and
Yvette Graham. 2019. Results of the WMT19
metrics shared task: Segment-level and strong
MT systems pose big challenges. In Proceed-
ings of the Fourth Conference on Machine
Translation (Volume 2: Shared Task Papers,
Day 1), pages 62–90, Florence, Italy. Associa-
tion for Computational Linguistics.

Marco Marelli, Stefano Menini, Marco Baroni,
Luisa Bentivogli, Raffaella Bernardi, et
Roberto Zamparelli. 2014. A SICK cure for the
evaluation of compositional distributional se-
mantic models. In Proceedings of the Ninth In-
ternational Conference on Language Resources
and Evaluation (LREC-2014), pages 216–223,
Reykjavik, Iceland. European Languages Re-
sources Association (ELRA).

Jonathan May. 2016. Semeval-2016 task 8: Mean-
ing representation parsing. In Proceedings of
the 10th International Workshop on Semantic
Evaluation (semeval-2016), pages 1063–1073.

Jonathan May and Jay Priyadarshi. 2017. Semeval-
2017 task 9: Abstract meaning representation
parsing and generation. In Proceedings of the
11th International Workshop on Semantic Eval-
uation (SemEval-2017), pages 536–545.

Tahira Naseem, Abhishek Shah, Hui Wan, Radu
Florian, Salim Roukos, and Miguel Ballesteros.
2019. Rewarding Smatch: Transition-based
AMR parsing with reinforcement learning. Dans
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 4586–4592, Florence, Italy. Association
for Computational Linguistics. https://est ce que je
.org/10.18653/v1/P19-1451

Rik van Noord, Lasha Abzianidze, Hessel
Haagsma, and Johan Bos. 2018. Evaluating
scoped meaning representations. In Proceedings of the Eleventh International Confer-
ence on Language Resources and Evaluation
(LREC-2018). Miyazaki, Japan. European Lan-
guages Resources Association (ELRA).

Stephan Oepen, Omri Abend, Lasha Abzianidze,
Johan Bos, Jan Hajic, Daniel Hershcovich, Bin
Li, Tim O’Gorman, Nianwen Xue, and Daniel
Zeman. 2020. MRP 2020: The second shared
task on cross-framework and cross-lingual
meaning representation parsing. In Proceed-
ings of the CoNLL 2020 Shared Task: Cross-
Framework Meaning Representation Parsing,
pages 1–22. https://doi.org/10.18653
/v1/2020.conll-shared.1

Juri Opitz. 2020. AMR quality rating with a
lightweight CNN. In Proceedings of the 1st
Conference of the Asia-Pacific Chapter of the
Association for Computational Linguistics and

the 10th International Joint Conference on
Natural Language Processing, pages 235–247,
Suzhou, Chine. Association for Computational
Linguistics.

Juri Opitz and Anette Frank. 2021. Towards a de-
composable metric for explainable evaluation
of text generation from AMR. In Proceedings
of the 16th Conference of the European Chapter
of the Association for Computational Linguis-
tics: Main Volume, pages 1504–1518, En ligne.
Association for Computational Linguistics.

Juri Opitz, Letitia Parcalabescu, and Anette Frank.
2020. AMR similarity metrics from principles.
Transactions of the Association for Computa-
tional Linguistics, 8:522–538. https://est ce que je
.org/10.1162/tacl_a_00329

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106. https://doi.org/10.1162/0891201053630264

Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting
of the Association for Computational Linguis-
tics, pages 311–318, Philadelphia, Pennsyl-
vania, Etats-Unis. Association for Computational
Linguistics. https://doi.org/10.3115
/1073083.1073135

Ted Pedersen. 2007. Unsupervised corpus-based
methods for WSD. Word Sense Disambiguation,
pages 133–166. https://est ce que je.org/10
.1007/978-1-4020-4809-8_6

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global
vectors for word representation. In Proceedings
of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 1532–1543, Doha, Qatar. Association for
Computational Linguistics. https://est ce que je
.org/10.3115/v1/D14-1162

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2019. Exploring the limits of transfer learning
with a unified text-to-text transformer. CoRR,
abs/1910.10683.

Nino Shervashidze, Pascal Schweitzer, Erik Jan
Van Leeuwen, Kurt Mehlhorn, and Karsten M.
Borgwardt. 2011. Weisfeiler-lehman graph ker-
nels. Journal of Machine Learning Research,
12(9).

Janaki Sheth, Young-Suk Lee, Ramon Fernandez
Astudillo, Tahira Naseem, Radu Florian, Salim
Roukos, and Todd Ward. 2021. Bootstrap-
ping multilingual AMR with contextual word
alignments. arXiv preprint arXiv:2102.02189.

Linfeng Song and Daniel Gildea. 2019. SemBleu:
A robust metric for AMR parsing evaluation.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 4547–4552, Florence, Italy. Association
for Computational Linguistics. https://
doi.org/10.18653/v1/P19-1446

James C. Spall. 1987. A stochastic approximation
technique for generating maximum likelihood
parameter estimates. Dans 1987 American Control
Conference, pages 1161–1167. IEEE.

James C. Spall. 1998. An overview of the simulta-
neous perturbation method for efficient optimi-
zation. Johns Hopkins APL Technical Digest,
19(4):482–492.

Matteo Togninalli, Elisabetta Ghisu, Felipe
Llinares-López, Bastian Rieck, and Karsten
Borgwardt. 2019. Wasserstein weisfeiler-
lehman graph kernels. In Advances in Neural
Information Processing Systems, volume 32,
pages 6436–6446. Curran Associates, Inc.

Sarah Uhrig, Yoalli Garcia, Juri Opitz, and Anette Frank. 2021. Translate, then parse! A strong baseline for cross-lingual AMR parsing. In Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021), pages 58–64, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.iwpt-1.6

Chen Wang. 2020. An overview of SPSA: Recent
development and applications. arXiv preprint
arXiv:2012.06952.

Boris Weisfeiler and Andrei Leman. 1968. Le
reduction of a graph to canonical form and
the algebra which appears therein. NTI, Series,
2(9):12–16.

Dongqin Xu, Junhui Li, Muhua Zhu, Min Zhang,
and Guodong Zhou. 2020. Improving AMR
parsing with sequence-to-sequence pre-training.
In Proceedings of
le 2020 Conference on
Empirical Methods in Natural Language Pro-
cessation (EMNLP), pages 2501–2511, En ligne.
Association for Computational Linguistics.

Pinar Yanardag and S. V. N. Vishwanathan. 2015.
Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining,
pages 1365–1374. https://est ce que je.org/10
.1145/2783258.2783417

Sheng Zhang, Xutai Ma, Rachel Rudinger,
Kevin Duh, and Benjamin Van Durme. 2018.
Cross-lingual decompositional semantic pars-
ing. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Pro-
cessation, pages 1664–1675, Brussels, Belgium.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D18
-1194

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo,
Weinan Zhang, Jun Wang, and Yong Yu.
2018. Texygen: A benchmarking platform for
text generation models. In The 41st Interna-
tional ACM SIGIR Conference on Research
& Développement
in Information Retrieval,
pages 1097–1100.
