Weisfeiler-Leman in the BAMBOO : Novel AMR Graph Metrics
and a Benchmark for AMR Graph Similarity
Juri Opitz1 Angel Daza2 Anette Frank1
1Dept. of Computational Linguistics, Heidelberg University, Germany
2CLTL, Vrije Universiteit Amsterdam, the Netherlands
{opitz, frank}@cl.uni-heidelberg.de, j.a.dazaarevalo@vu.nl
Abstract
Several metrics have been proposed for assessing the similarity of (Abstract) Meaning Representations (AMRs), but little is known about how they relate to human similarity ratings. Moreover, the current metrics have complementary strengths and weaknesses: Some emphasize speed, while others make the alignment of graph structures explicit, at the price of a costly alignment step.
In this work we propose new Weisfeiler-Leman AMR similarity metrics that unify the strengths of previous metrics, while mitigating their weaknesses. Specifically, our new metrics are able to match contextualized substructures and induce n:m alignments between their nodes. Furthermore, we introduce a Benchmark for AMR Metrics based on Overt Objectives (BAMBOO), the first benchmark to support empirical assessment of graph-based MR similarity metrics. BAMBOO maximizes the interpretability of results by defining multiple overt objectives that range from sentence similarity objectives to stress tests that probe a metric's robustness against meaning-altering and meaning-preserving graph transformations. We show the benefits of BAMBOO by profiling previous metrics and our own metrics. Results indicate that our novel metrics may serve as a strong baseline for future work.
1 Introduction
Meaning representations aim at capturing the meaning of text in an explicit graph format. A prominent framework is abstract meaning representation (AMR), proposed by Banarescu et al. (2013). AMR views sentences as rooted, directed, acyclic, labeled graphs. Their nodes are variables, attributes, or (open-class) concepts and are connected with edges that express semantic relations.
There are many use cases in which we need to compare or relate two AMR graphs. A common situation is found in parser evaluation, where AMR metrics are widely applied (May, 2016; May and Priyadarshi, 2017).1 However, there are more situations where we need to measure similarity of meaning as expressed in AMR graphs. For example, Bonial et al. (2020) leverage AMR metrics in a semantic search engine for COVID-19 queries, Naseem et al. (2019) use metric feedback to reinforce AMR parsers, Opitz (2020) emulates metrics for referenceless AMR ranking and rating, and Opitz and Frank (2021) use AMR metrics for NLG evaluation.
To date, multiple AMR metrics (Cai and Knight, 2013; Cai and Lam, 2019; Song and Gildea, 2019; Anchiêta et al., 2019; Opitz et al., 2020) have been proposed to assess AMR similarity. However, due to a lack of an appropriate evaluation benchmark, we have no empirical evidence that could tell us more about their strengths and weaknesses or offer insight about which metrics may be preferable over others in specific use cases.
Furthermore, we would like to move beyond the aforementioned metrics and develop new metrics that account for graded similarity of graph substructures, which is not an easy task. However, it is crucial when we need to compare AMR graphs in a deeper way. Consider Figure 1, which shows two AMRs that convey very similar meanings. All aforementioned metrics assign this pair a low similarity score, and—if alignment-based, as is SMATCH (Cai and Knight, 2013)—find only subpar alignments.2 In this case, we want a metric that provides us with a high similarity score and, ideally, an explanatory alignment.
The structure of this paper is as follows. In §2
we discuss related work. In §3 we describe our
1With minor adaptions, AMR metrics are also used in
other MR parsing tasks (van Noord et al., 2018; Zhang et al., 2018; Oepen et al., 2020).
2For example, in Figure 1, SMATCH aligns drink-01 to slurp-01 and kitten to cat, resulting in a single matching triple (x, arg0, y).
Transactions of the Association for Computational Linguistics, vol. 9, pp. 1425–1441, 2021. https://doi.org/10.1162/tacl_a_00435
Action Editor: Yue Zhang. Submission batch: 4/2021; Revision batch: 8/2021; Published 12/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
late the variable alignment of (W)S2MATCH and only consider their attached concepts, which increases computation speed. Apart from this, the metrics differ significantly: SEMBLEU extracts bags of k-hop paths (k≤3) from the AMR graphs and thereupon calculates BLEU (Papineni et al., 2002). SEMA, on the other hand, is somewhat simpler and provides us with an F1 score that it achieves by comparing extracted triples.
From Measuring Structure Overlap to
Measuring Meaning Similarity Most AMR
metrics have been designed for semantic parser
evaluation, and therefore determine a score for structure overlap. While this is legitimate, with extended use cases for AMR metrics arising, there
is increased awareness that structural matching of
labeled nodes and edges of an AMR graph is not
sufficient for assessing the meaning similarity ex-
pressed by two AMRs (Kapanipathi et al., 2021).
This insufficiency has also been observed in cross-
lingual AMR parsing evaluation (Blloshmi et al.,
2020; Sheth et al., 2021; Uhrig et al., 2021), 但
is most prominent when attempting to compare
the meaning of AMRs that represent different sentences (Opitz et al., 2020; Opitz and Frank, 2021).
This work argues that in cases like Figure 1,
the available metrics do not sufficiently reflect
the similarity of the two AMRs and their underly-
ing sentences.
How Do Humans Rate Similarity of Sentence
Meaning? STS (Baudiš et al., 2016a,b; Cer et al., 2017) and SICK (Marelli et al., 2014) elicited human ratings of sentence similarity on a Likert scale. While STS annotates semantic similarity, SICK annotates semantic relatedness. These two aspects are highly related, but not the exact same (Budanitsky and Hirst, 2006; Kolb, 2009).
Only the highest scores on the Likert scales of
SICK and STS can be seen as reflecting the equiv-
alence of meaning of two sentences. Other data
sets contain binary annotations of paraphrases
(Dolan and Brockett, 2005), that cover a wide
spectrum of semantic phenomena.
Benchmarking Metrics Metric benchmarking
is an active topic in NLP research and led to the
emergence of metric benchmarks in various areas,
most prominently MT and NLG (Gardent et al.,
2017; Zhu et al., 2018; Ma et al., 2019). These
benchmarks are useful since they help to assess
Figure 1: Similar AMRs, with sketched alignments.
first contribution: new AMR metrics that aim at
unifying the strengths of previous metrics while
mitigating their weaknesses. 具体来说, our new
metrics are capable of matching larger substruc-
tures and provide valuable n:m alignments in
polynomial time. In §4 we introduce BAMBOO ,
our second contribution: It is the first bench-
mark data set for AMR metrics and includes
novel robustness objectives that probe the behav-
ior of AMR metrics under meaning-preserving and meaning-altering transformations of the inputs (§5). In §6 we use BAMBOO for a detailed, multi-faceted empirical study of previous and our proposed AMR metrics.
We release BAMBOO and our new metrics.3
2 Related Work
The Classical AMR Metric and its Adaptions
The ‘canonical’ and widely applied AMR metric is
SMATCH (Semantic match) (Cai and Knight, 2013).
It solves an NP-hard graph alignment problem ap-
proximately with a hill-climber and scores match-
ing triples. SMATCH has been adapted to S2MATCH (Soft Semantic match) by Opitz et al. (2020) to account for graded similarity of concept nodes (e.g., cat—kitten), using word embeddings. SMATCH
has also been adapted by Cai and Lam (2019)
in W(eighted)Smatch (WSMATCH), which penal-
izes errors relative to their distance to the root.
This is motivated by the hypothesis that ‘‘core
semantics’’ tend to be located near a graph’s root.
BFS-based and Alignment-free AMR Metrics
Recently, two new AMR metrics have been proposed: SEMA by Anchiêta et al. (2019) and SEMBLEU by Song and Gildea (2019). Common to both is a mechanism that traverses the graph. Both start from the root, and collect structures with a breadth-first traversal (BFS). Also, both ab-
3https://git.io/J0J7V.
and select metrics and encourage their further development (Gehrmann et al., 2021). However,
there is currently no established benchmark that
defines a ground truth of graded semantic similar-
ity between pairs of AMRs, and how to measure it
in terms of their structural representations. Also, we do not have an established ground truth to assess what alternative AMR metrics such as (W|S2)MATCH or SEMBLEU really measure, and how
their scores correlate with human judgments of
the semantic similarity of sentences represented
by AMRs.
3 Grounding Novel AMR Metrics in the Weisfeiler-Leman Graph Kernel
Previous AMR metrics have complementary
strengths and weaknesses. Therefore, we aim to
propose new AMR metrics that are able to mitigate
these weaknesses, while unifying their strengths,
aiming at the best of all worlds. We want:
i) an interpretable alignment (SMATCH);
ii) a fast metric (SEMA, SEMBLEU);
iii) matching larger substructures (SEMBLEU);
iv) and assessment of graded similarity of AMR subgraphs (extending S2MATCH).
This section proposes to make use of the Weisfeiler-Leman graph kernel (WLK) (Weisfeiler and Leman, 1968; Shervashidze et al., 2011) to assess AMR similarity. The idea is that WLK provides us with SEMBLEU-like matches of larger sub-structures, while bypassing potential biases induced by the BFS-traversal (Opitz et al., 2020). We then describe the Wasserstein Weisfeiler-Leman kernel (WWLK) (Togninalli et al., 2019) that is similar to WLK but provides (i) an alignment of atomic and non-atomic substructures (going beyond SMATCH) and (ii) a graded match of substructures (going beyond S2MATCH). Finally, we further adapt WWLK to WWLKΘ, a variant that we tailor to learn semantic edge parameters, to better assess AMR graphs.
Figure 2: WLK example based on one iteration.
power in many tasks, ranging from protein clas-
sification to movie recommendation (Togninalli
et al., 2019; Yanardag and Vishwanathan, 2015). However, so far, it has not been applied to (A)MR graphs. In the following, we describe the
WLK method.
Generally, a kernel can be viewed as a similarity measurement between two objects (Hofmann et al., 2008), in our case, two AMR graphs G, G′. It is stated as k(G, G′) = ⟨φ(G), φ(G′)⟩, where ⟨·, ·⟩ : R^d × R^d → R^+ is an inner product and φ maps an input to a feature vector that is built incrementally over K iterations. For our AMR graphs, one such iteration k works as follows: (a) every node receives the labels of its neighbors and the labels of the edges connecting it to their neighbors, and stores them in a list (cf. Contextualize in Figure 2). (b) The lists are alphabetically sorted and the string elements of the lists are concatenated to form new aggregate labels (cf. Compress in Figure 2). (c) Two count vectors x^k_G and x^k_{G′} are created where each dimension corresponds to a node label that is found in any of the two graphs and contains its count (cf. Features in Figure 2). Since every iteration yields two vectors (one for each input), we can concatenate the vectors over iterations and calculate the kernel (cf. Similarity in Figure 2):
k(·, ·) = ⟨Φ_WL(G), Φ_WL(G′)⟩ = ⟨concat(x^0_G, . . . , x^K_G), concat(x^0_{G′}, . . . , x^K_{G′})⟩   (1)
3.1 Basic Weisfeiler-Leman Kernel (WLK)
The Weisfeiler-Leman kernel (WLK) method
(Shervashidze et al., 2011) derives sub-graph fea-
tures from two input graphs. WLK has shown its
Specifically, we use the cosine similarity kernel
and two iterations (K = 2), which implies that
every node receives information from its neigh-
bors and their immediate neighbors. For simplicity
we will first treat edges as undirected, but later will experiment with various directionality parameterizations.
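To make this procedure concrete, here is a minimal runnable sketch of the symbolic WLK described above (our illustration, not the released implementation; the graph encoding and the toy pair are assumptions):

```python
# A minimal sketch of symbolic WLK over labeled AMR-like graphs.
# Graphs: (node -> label dict, list of (u, v, edge_label)); edges undirected.
from collections import Counter
import math

def wl_features(nodes, edges, K=2):
    """K rounds of contextualize/compress; return label count features."""
    adj = {v: [] for v in nodes}
    for u, v, rel in edges:
        adj[u].append((rel, v))
        adj[v].append((rel, u))
    labels = dict(nodes)
    feats = Counter(labels.values())              # iteration-0 features
    for _ in range(K):
        new_labels = {}
        for v in labels:
            # (a) contextualize: neighbor labels plus connecting edge labels
            ctx = sorted(f"{rel}_{labels[u]}" for rel, u in adj[v])
            # (b) compress: sort and concatenate into an aggregate label
            new_labels[v] = labels[v] + "|" + "|".join(ctx)
        labels = new_labels
        feats.update(labels.values())             # (c) count vector entries
    return feats

def cosine_kernel(f, g):
    dot = sum(f[k] * g[k] for k in f)
    norm = math.sqrt(sum(x * x for x in f.values())) \
         * math.sqrt(sum(x * x for x in g.values()))
    return dot / norm

# toy AMRs for "the cat drinks" vs. "the kitten slurps"
G  = ({"x": "drink-01", "y": "cat"},    [("x", "y", "arg0")])
Gp = ({"x": "slurp-01", "y": "kitten"}, [("x", "y", "arg0")])
print(cosine_kernel(wl_features(*G), wl_features(*Gp)))  # 0.0
```

On the toy pair, the symbolic kernel returns 0, since drink-01/slurp-01 and cat/kitten share no labels; this is exactly the limitation that motivates the latent variant in §3.2.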
3.2 Wasserstein Weisfeiler-Leman (WWLK)
S2MATCH differs from all other AMR metrics in that it accepts close concept synonyms for alignment (up to a similarity threshold). But it comes with a restriction and a downside: i) it cannot assess graded similarity of (non-atomic) AMR subgraphs, which is crucial for assessing partial meaning agreement between AMRs (as illustrated in Figure 1), and ii) the alignment is costly to compute.
We hence propose to adopt a variant of WLK, the Wasserstein-Weisfeiler Leman kernel (WWLK) (Togninalli et al., 2019), for the following two reasons: (i) WWLK can assess non-atomic subgraphs on a finer level, and (ii) it provides graph alignments that are faster to compute than any of the existing SMATCH metrics: (W)S2MATCH.
WWLK works in two steps: (1) Given its initial node embeddings, we use WL to project the graph into a latent space, in which the final node embeddings describe varying degrees of contextualization. (2) Given a pair of such (WL) embedded graphs, a transportation plan is found that describes the minimum cost of transforming one graph into the other. In the top graph of Figure 3, f indicates the first step, while Wasserstein distance indicates the second. Now, we describe the steps in closer detail.
Step 1: WL Graph Projection into Latent Space
Let v = 1 . . . n be the nodes of AMR G. This graph
is projected onto a matrix R^n × R^((K+1)d) with

f(G) = hstack(X^0_G, . . . , X^K_G),   (2)

where

X^k_G = [x^k(1), . . . , x^k(n)]^T ∈ R^n × R^d,   (3)
such that hstack concatenates matrices, e.g.,

hstack( [[a, b], [c, d]], [[x, y], [w, z]] ) = [[a, b, x, y], [c, d, w, z]].

This means, in the output space, every node is associated with a vector that is itself a concatenation of K + 1 vectors with d dimensions each, where k indicates the degree of contextualization (cf. Figure 3).
The embedding x(v)^k ∈ R^d for a node v in a certain iteration k is computed as follows:

x(v)^(k+1) = 1/2 · ( x(v)^k + (1/d(v)) · Σ_{u ∈ N_v} w(u, v) · x(u)^k ).   (4)
Figure 3: Wasserstein WLK example w/o learned edge parameters (top, §3.2) and w/ learned edge parameters (bottom, §3.3), which allow us to adjust the embedded graphs such that they better take the (impact of) AMR edges into account. Red: the distance increases because of a negation contrast between the two AMRs that otherwise convey similar meaning.
d(v) is the degree of a node, N returns the neighbors for a node, and w(u, v) can assign a weight to a node pair. The initial node embeddings, namely, x(·)^0, can be set up by looking up the node labels in a set of pre-trained word embeddings, or using random initialization. To distinguish between the discrete edge labels, we sample random weights.
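A compact numpy sketch of this embedding step (Eq. 4); the toy graph, the dimensionality, and the random initialization standing in for GloVe look-ups are assumptions:

```python
# A minimal sketch of the WL embedding of Eq. 4.
import numpy as np

def wl_embed(X0, edges, weights, K=2):
    """Return f(G) = hstack(X^0, ..., X^K) from initial embeddings X0 (n x d)."""
    n = X0.shape[0]
    deg = np.zeros(n)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    Xs, X = [X0], X0
    for _ in range(K):
        agg = np.zeros_like(X)
        for u, v in edges:                        # undirected message passing
            agg[u] += weights[(u, v)] * X[v]
            agg[v] += weights[(u, v)] * X[u]
        X = 0.5 * (X + agg / np.maximum(deg, 1)[:, None])   # Eq. 4
        Xs.append(X)
    return np.hstack(Xs)                          # n x (K+1)d matrix f(G)

X0 = np.random.randn(2, 50)        # stands in for GloVe vectors of 2 node labels
emb = wl_embed(X0, edges=[(0, 1)], weights={(0, 1): 1.0})
print(emb.shape)                   # (2, 150) for K=2, d=50
```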
Step 2: Computing the Wasserstein Distance Between two WL-embedded Graphs The Wasserstein distance describes the minimum amount of work that is necessary to transform the (contextualized) nodes of one graph into the (contextualized) nodes of the other. It is computed based on pairwise Euclidean distances from f(G) with n nodes, and f(G′) with m nodes:

distance = Σ_{i=1}^{n} Σ_{j=1}^{m} T_{i,j} D_{i,j}   (5)

Here, the 'cost matrix' D ∈ R^{n×m} contains the Euclidean distances between the n WL-embedded
nodes from G and m WL-embedded nodes from G′. That is, D_{i,j} = ||f(G)_i − f(G′)_j||_2. The flow matrix T describes a transportation plan between the two graphs, namely, T_{i,j} ≥ 0 states how much of node i from G flows to node j from G′; the corresponding 'local work' can be stated as flow(i, j) · cost(i, j) := T_{i,j} · D_{i,j}. To find the best T, that is, the transportation plan that minimizes the cumulative work needed (Eq. 5), we solve a constrained linear problem:4
min_T Σ_{i=1}^{n} Σ_{j=1}^{m} T_{i,j} D_{i,j}   (6)

s.t.: T_{i,j} ≥ 0, 1 ≤ i ≤ n, 1 ≤ j ≤ m   (7)

Σ_{j=1}^{m} T_{i,j} = 1/n, 1 ≤ i ≤ n   (8)

Σ_{i=1}^{n} T_{i,j} = 1/m, 1 ≤ j ≤ m   (9)
Note that (i) the transportation plan T describes an n:m alignment between the nodes of the two graphs, and (ii) solving Eq. 6 has polynomial time complexity, while the (W)S2MATCH problem is NP-complete.
3.3 From WWLK to WWLKΘ with Zeroth-order Optimization
Motivation: AMR Edge Labels Have Meaning The WL-embedding mechanism of WWLK (Eq. 4) associates a weight w(u, v) ∈ R with each edge. For unlabeled graphs, w(u, v) is simply set to one. To distinguish between the discrete AMR edge labels, in WWLK we have used random weights. However, AMR edge labels encode complex relations between nodes, and simply choosing random weights may not be enough. Indeed, we
hypothesize that different edge labels may impact the meaning similarity of AMR graphs in different ways. Whereas a modifier relation in an AMR graph configuration may or may not have a significant influence on the overall AMR graph similarity, an edge representing negation is bound to have a significant influence on the similarity of different AMR graphs. Consider the example in Figure 3: In the top figure, we embed AMRs for The pretty warbler sings and The bird sings gently, which have similar meanings. In the bottom figure, the second AMR has been changed to express the meaning of The bird doesn't sing, which clearly reduces the meaning similarity of the two AMRs. Thus, we hypothesize that learning edge parameters for different AMR relation types may help to better adjust the graph embeddings, such that the Wasserstein distance may increase or decrease, depending on the specific meaning of AMR relation labels, and thus better capture global meaning differences between AMRs (as outlined in Figure 3: fθ).
4We use https://pypi.org/project/pyemd.
Formally, to make the Wasserstein Weisfeiler-Leman kernel better account for edge-labeled AMR graphs, we learn a parameter set Θ that consists of parameters θ_edgeLabel, where edgeLabel indicates the semantic relation, i.e., edgeLabel ∈ L = {:arg0, :arg1, . . . , :polarity, . . .}. Then, in Eq. 4, we can set w(u, v) = θ_label(u,v) and apply the multiplication θ_label(u,v) · x(u)^k. To facilitate the multiplication, we either may learn a matrix Θ ∈ R^{|L|×d} or a parameter vector Θ ∈ R^{|L|}. In this paper, we constrain ourselves to the latter setting, that is, our goal is to learn a parameter vector Θ ∈ R^{|L|}.
Learning Edge Labels with Direct Feedback To find suitable edge parameters Θ, we propose a zeroth order (gradient-free [Conn et al., 2009]) optimization setup, which has the advantage that we can explicitly teach our metric to better correlate with human ratings, optimizing the desired correlation objective without detours. In our case, we apply a simultaneous perturbation stochastic approximation (SPSA) procedure to estimate gradients (Spall, 1987, 1998; Wang, 2020).5
Let sim(B, Θ) = −WWLKΘ(B) be the similarity scores obtained from a (mini-)batch of graph pairs (B = [(G_j, G′_j), . . .]) as provided by (parametrized) WWLK. Now, let Y be the human reference scores. Then we design the loss function as J(Y, Θ) := 1 − correlation(sim(B, Θ), Y).
Further, let μ be coefficients that are sampled from a Bernoulli distribution. Then the gradient is estimated as follows:

∇̂_Θ = ( J(Y, Θ + cμ) − J(Y, Θ − cμ) ) / (2cμ).   (10)
Finally, we can apply the common SGD learning rule: Θ_{t+1} = Θ_t − γ ∇̂_Θ. The learning rate γ and c decrease proportionally to t.
5It improves upon a classic Kiefer-Wolfowitz approximation (Kiefer et al., 1952) by requiring, per gradient estimate, only 2 objective function evaluations instead of 2n.
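A minimal sketch of this SPSA loop, with a stand-in loss in place of the full WWLKΘ pipeline (the coefficient schedules and the toy features are assumptions):

```python
# A minimal SPSA sketch for learning edge parameters Theta (Eq. 10).
import numpy as np
from scipy.stats import pearsonr

def spsa_step(theta, loss, t, c0=0.1, gamma0=0.05):
    c = c0 / (t + 1) ** 0.25                      # c decreases with t
    gamma = gamma0 / (t + 1)                      # learning rate decreases with t
    mu = np.random.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli coefficients
    grad = (loss(theta + c * mu) - loss(theta - c * mu)) / (2 * c * mu)  # Eq. 10
    return theta - gamma * grad                   # SGD rule

# stand-in objective: J = 1 - correlation(sim(B, theta), Y)
Y = np.array([0.1, 0.5, 0.9, 0.3])                # human reference scores
feats = np.random.randn(4, 3)                     # one feature row per graph pair
loss = lambda th: 1.0 - pearsonr(feats @ th, Y)[0]
theta = np.ones(3)
for t in range(50):
    theta = spsa_step(theta, loss, t)
print(loss(theta))
```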
4 BAMBOO : Creating the First
Benchmark for AMR
Similarity Metrics
We now describe the creation of BAMBOO, which
aims to provide the first benchmark that allows
researchers to empirically (i) assess AMR metrics, (ii) compare AMR metrics, and possibly (iii) train
AMR metrics.
Grounding AMR Similarity Metrics in Human
Ratings of Semantic Sentence Similarity As
the main criterion for assessing AMR similarity
metrics, we use human judgments of the meaning
similarity of sentences underlying pairs of AMRs.
A corresponding principle has been proposed by
Opitz et al. (2020): A metric of pairs of AMR
graphs G and G′ that represent sentences s and s′ should reflect human judgments of semantic sentence similarity and relatedness:

amrMetric(G, G′) ≈ humanScore(s, s′)   (11)
Similarity Objectives Accordingly, we select, as evaluation targets for AMR metrics, three notions of sentence similarity, which have previously been operationalized in terms of human-rated evaluation datasets: (i) the semantic textual similarity (STS) objective from Baudiš et al. (2016a,b); (ii) the sentence relatedness objective (SICK) from Marelli et al. (2014); (iii) the paraphrase detection objective (PARA) by Dolan and Brockett (2005).
Each of these three evaluation data sets can be seen as a set of pairs of sentences (s_i, s′_i) with an associated score humanScore(·) that provides the human sentence relation assessment score reflecting semantic similarity (STS), semantic relatedness (SICK), and whether sentences are paraphrastic (PARA). Hence, each of these data sets can be described as {(s_i, s′_i, humanScore(s_i, s′_i) = y_i)}_{i=1}^{n}. Both STS and SICK offer scores on Likert scales, ranging from equivalence (max) to unrelated (min), while PARA scores are binary, judging sentence pairs as being paraphrases (1), or not (0). We min-max normalize the Likert scale scores to the range [0, 1] to facilitate standardized evaluation.
For BAMBOO, we replace each pair (s_i, s′_i) with their AMR parses (p_i = parse(s_i), p′_i = parse(s′_i)), transforming the data into {(p_i, p′_i, y_i)}_{i=1}^{n}. This provides the main partition of the
| source | data instances train/dev/test | s. length avg / 50th | # nodes avg / 50th | density avg / 50th |
|---|---|---|---|---|
| STS  | 5749/1500/1379 | 9.9 / 8   | 14.1 / 12 | 0.10 / 0.08 |
| SICK | 4500/500/4927  | 9.6 / 9   | 10.7 / 10 | 0.11 / 0.1  |
| PARA | 3576/500/1275  | 18.9 / 19 | 30.6 / 30 | 0.04 / 0.04 |

Table 1: BAMBOO data set statistics of the Main partition. Sentence length (s. length, displayed for reference only) and graph statistics (average and median) are calculated on the training sets.
benchmarking data for BAMBOO, henceforth denoted as Main.6 Statistics of Main are shown in Table 1. The sentences in PARA are longer com-
pared to STS and SICK. The corresponding AMR
graphs are, 一般, much larger in number
of nodes, but less complex with respect to the
average density.7
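As a sketch, the construction of Main then amounts to the following (parse is a placeholder for the T5-based parser described next):

```python
# A minimal sketch of building the Main partition: min-max normalize the
# human ratings and pair each sentence pair's parses with its score.
def build_main(pairs, scores, parse):
    lo, hi = min(scores), max(scores)
    norm = [(y - lo) / (hi - lo) for y in scores]   # Likert scale -> [0, 1]
    return [(parse(s), parse(sp), y)
            for (s, sp), y in zip(pairs, norm)]

data = build_main([("A cat drinks.", "A kitten slurps."),
                   ("He sleeps.", "She runs.")],
                  scores=[4.8, 0.6],
                  parse=lambda s: s)               # identity stands in for T5S2S
```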
AMR Construction We choose a strong parser
that achieves high scores in the range of human-
human inter-annotator agreement estimates in
AMR banking: The parser yields 0.80–0.83 Smatch
F1 on AMR2 and AMR3. The parser, henceforth
denoted as T5S2S, is based on an AMR fine-tuned
T5 language model (Raffel et al., 2019) and produces AMRs in a sequence-to-sequence fashion.8 It is on par with the current state-of-the-art that similarly relies on seq-to-seq (Xu et al., 2020),
but the T5 backbone alleviates the need for mas-
sive MT pre-training. To obtain a better picture
of the graph quality we perform manual quality
inspections.
Manual Data Quality Assessment: Three-way
Graph Quality Ratings From each data set
(SICK, STS, PARA) we randomly select 100
sentences and create their parses with T5S2S. Ad-
ditionally, to establish a baseline, we also parse
the same sentences with the GPLA parser of
Lyu and Titov (2018), a neural graph prediction
system that uses latent alignments (which reports
74.4 Smatch score on AMR2). This results in 300
GPLA parses and 300 T5S2S parses. A human
6The other partitions, which are largely based on this data,
will be introduced in §5.
7The lower average density could be caused, e.g., by the fact that the PARA data is sampled from news sources, which
means that the AMRs contain more named entity structures
that usually have more terminal nodes.
8https://github.com/bjascob/amrlib.
|      | Parser | %gold↑ | %silver | %flawed↓ |
|---|---|---|---|---|
| STS  | GPLA  | 43[33,53]   | 37[28,46]   | 20[12,27]  |
| STS  | T5S2S | 54[44,64]†‡ | 41[31,50]   | 5[0,9]†‡   |
| SICK | GPLA  | 38[28,47]   | 49[39,59]   | 13[6,19]   |
| SICK | T5S2S | 48[38,58]†  | 47[37,57]   | 5[0,9]†‡   |
| PARA | GPLA  | 9[3,14]     | 52[43,62]   | 39[29,48]  |
| PARA | T5S2S | 21[13,29]†‡ | 63[54,73]†‡ | 16[8,23]†‡ |
| ALL  | GPLA  | 30[25,35]   | 46[40,52]   | 24[19,29]  |
| ALL  | T5S2S | 41[35,46]†‡ | 50[45,56]   | 9[5,12]†‡  |

Table 2: Three-way graph assessment. [x,y]: 95% confidence intervals estimated with bootstrap. † (‡): significant improvement of T5S2S over GPLA with p < 0.05 (p < 0.005).
annotator9 inspects the (shuffled) sample and as-
signs three-way labels: flawed—an AMR contains
critical errors that distort
the meaning signif-
icantly; silver—an AMR contains small errors
that can potentially be neglected; gold—an AMR
is acceptable.
Results in Table 2 show that the quality of
T5S2S parses is substantially better than the
baseline in all three data sets. The percentage of ex-
cellent parses increases considerably (STS: +11pp,
SICK: +10pp, PARA: +11pp) while the percent-
age of flawed parses drops notably (STS: −15pp,
SICK: −8pp, PARA: −23pp). The increases in
gold parses and decreases in flawed parses are sig-
nificant in all data sets (p < 0.05, 10,000 bootstrap
samples of the sample means).10
5 BAMBOO : Robustness Challenges
Besides benchmarking AMR metric scores against
human ratings, we are also interested in assessing
a metric’s robustness under meaning-preserving
and -altering graph transformations. Assume we
are given any pair of AMRs from paraphrases.
A small change in structure or node content can
lead to two outcomes: The graphs still repre-
sent paraphrases, or they do not. We consider a
metric to be robust if its ratings correctly reflect
such changes.
Specifically, we propose three transformation
strategies. (i) Reification (Reify ), which changes
Figure 4: Examples for f and g graph transforms.
the graph’s surface structure, but not its mean-
ing; (ii) Concept synonym replacement (Syno ),
which also preserves meaning and may or may not
change the graph surface structure; (iii) Role con-
fusion (Arg ), which applies small changes to the
graph structure that do not preserve its meaning.
5.1 Meaning-preserving Transforms
Generally, given a meaning-preserving function f
of a graph, namely,
G ≡ f (G),
(12)
it is natural to expect that a semantic similar-
ity function over the pair of transformed AMRs
nevertheless stays stable, and thus satisfies:
metric(G, G′) ≈ metric(f(G), f(G′)).   (13)
(13)
Reification Transform (Reify ) Reification is
an established way to rephrase AMRs (Goodman,
2020). Formally, a reification is induced by a rule
edge(x, y) →_reify instance(z, h(edge)_0)   (14)
∧ h(edge)_1(z, x)   (15)
∧ h(edge)_2(z, y),   (16)
9The human annotator is a proficient English speaker and
has worked several years with AMR.
10H0(gold): amount of gold graphs T5S2S ≤ amount
of gold graphs GPLA; H0(silver): amount of silver graphs
T5S2S ≤ amount of gold graphs GPLA; H0(flawed): amount
of gold graphs T5S2S ≥ amount of gold graphs GPLA.
where h returns, for a given edge, a new concept
and corresponding edges from a dictionary, where
the edges are either :ARGi or :opi. An example
is displayed in Figure 4 (top, left). Besides reifica-
tion for location, other known types are polarity-,
STS
SICK
mean
th
mean
th
PARA
th
mean
Reify -OPS 2.74 [1, 2, 4] 1.17 [0, 1, 2] 5.14 [3, 5, 7]
Syno -OPS 0.80 [0, 1, 2] 1.31 [0, 1, 2] 1.30 [0, 1, 2]
Arg -OPS
1.33 [1, 1, 2] 1.11 [1, 1, 1] 1.80 [1, 2, 2]
Table 3: Statistics about the amount of transform
operations that were conducted, on average, on
one graph. [x,y,z]: 25th, 50th (median), and 75th
percentile of the amount of operations.
modifier-, or time-reification.11 Processing statis-
tics of the applied reification operations are shown
in Table 3.
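A sketch of this transform over Penman-style triples is given below; the three dictionary entries follow the reification mappings in the AMR guidelines, while the triple encoding itself is our illustration:

```python
# A minimal sketch of the reification rule of Eqs. 14-16.
H = {  # edge -> (new concept h(edge)_0, role h(edge)_1, role h(edge)_2)
    ":location": ("be-located-at-91", ":ARG1", ":ARG2"),
    ":polarity": ("have-polarity-91", ":ARG1", ":ARG2"),
    ":time":     ("be-temporally-at-91", ":ARG1", ":ARG2"),
}

def reify(triples):
    out, i = [], 0
    for s, rel, t in triples:
        if rel in H:                       # edge(x, y) -> instance(z, h(edge)_0)
            concept, r1, r2 = H[rel]       # ... plus h(edge)_1(z, x), h(edge)_2(z, y)
            z = f"z{i}"; i += 1
            out += [(z, ":instance", concept), (z, r1, s), (z, r2, t)]
        else:
            out.append((s, rel, t))
    return out

# "cat in the kitchen": (c / cat :location (k / kitchen))
print(reify([("c", ":instance", "cat"), ("k", ":instance", "kitchen"),
             ("c", ":location", "k")]))
```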
Synonym Concept Node Transform (Syno )
Here, we iterate over AMR concept nodes. For
any node that involves a predicate from PropBank,
we consult a manually created database of (near-)
synonyms that are also contained in PropBank,
and sample one for replacement. For example,
some sense of fall is near-equivalent to a sense
of decrease (car prices fell/decreased). For con-
cepts that are not predicates we run an ensemble
of four WSD solvers12 (based on the concept and
the sentence underlying the AMR) to identify its
WordNet synset. From this synset we sample an
alternative lemma.13 If an alternative lemma con-
sists of multiple tokens where modifiers precede
the noun, we replace the node with a graph-
substructure. So, if the concept is man and we sample adult male, we expand 'instance(x, man)' with 'mod(x, y) ∧ instance(y, adult) ∧ instance(x, male)'. Data processing statistics are shown
in Table 3.
5.2 Meaning-altering Graph Transforms
Role Confusion (Arg) A naïve AMR metric
could be one that treats an AMR as a bag-of-nodes,
omitting structural information, such as edges and
edge-labels. Such metrics could exhibit mislead-
ingly high correlation scores with human ratings,
solely due to a high overlap in concept content.
Hence, we design adversarial instances that can
probe an AMR metric when confronted with cases
11A complete list of reifications are given in the offi-
cial AMR guidelines: https://github.com/amrisi
/amr-guidelines/blob/master/amr.md.
12‘Adapted lesk’, ‘Simple Lesk’, ‘Cosine Lesk’, ‘max sim’
(Banerjee and Pedersen, 2002; Lesk, 1986; Pedersen, 2007):
https://github.com/alvations/pywsd.
13To increase precision, we only perform this step if all solvers agree on the predicted synset.
Figure 5: Metric objective example for Arg .
of opposing factuality (e.g., polarity, modality, or
relation inverses), while concept overlap is largely
preserved. We design a function
G ≢ g(G),   (17)
that confuses role labels (see Arg in Figure 4).
We make use of this function to turn two para-
phrastic AMRs (G, G′) into non-paraphrastic AMRs, by applying g to either G or G′, but not both.
In some cases g may create a meaning that
still makes sense (The tiger bites the snake. →
The snake bites the tiger.), while in others, g
may induce a non-sensical meaning (The tiger
jumps on the rock. → The rock jumps on the
tiger.). However, this is not our primary concern,
since in all cases, applying g achieves our main
goal: It returns a different meaning that turns
a paraphrase-relation between two AMRs into a
non-paraphrastic one.
To implement Arg , for each data set (PARA,
STS, SICK) we create one new data subset. First,
(i) we collect all paraphrases from the initial data
(in SICK and STS these are pairs with maximum
human score).14 (ii) We iterate over the AMR pairs (G, G′) and randomly select the first or second
AMR from the tuple. We then collect all n nodes
with more than one outgoing edge. If n = 0,
we skip this AMR pair (the pair will not be
contained in the data). If n > 0, we apply the
meaning altering function g and randomly flip
edge labels. Finally, we add the original (G, G′) to our data with the label paraphrase, and the altered pair (G, g(G′)) with the label non-paraphrase (cf. Figure 5). Per graph, we allow a maximum of 3 role confusion operations (see Table 3 for processing statistics).
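The following sketch illustrates g over triples (a simplified encoding; the actual data creation may differ in details such as operation sampling):

```python
# A minimal sketch of the meaning-altering role confusion function g:
# for nodes with more than one outgoing edge, swap two of their edge labels.
import random

def role_confuse(triples, max_ops=3):
    by_src = {}
    for idx, (s, rel, t) in enumerate(triples):
        if rel != ":instance":
            by_src.setdefault(s, []).append(idx)
    nodes = [s for s, idxs in by_src.items() if len(idxs) > 1]
    random.shuffle(nodes)
    out = list(triples)
    for s in nodes[:max_ops]:
        i, j = random.sample(by_src[s], 2)    # e.g., flip :arg0 and :arg1
        (si, ri, ti), (sj, rj, tj) = out[i], out[j]
        out[i], out[j] = (si, rj, ti), (sj, ri, tj)
    return out

g_in = [("b", ":instance", "bite-01"), ("t", ":instance", "tiger"),
        ("s", ":instance", "snake"), ("b", ":arg0", "t"), ("b", ":arg1", "s")]
print(role_confuse(g_in))  # the tiger bites the snake -> the snake bites the tiger
```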
14This shrinks the train/dev/test size of STS (now: 474/106/158) and SICK (now: 246/50/238).
5.3 Discussion
Safety of Robustness Objectives We have pro-
posed three challenging robustness objectives.
Reify changes the graph structure, but preserves the meaning. Arg keeps the graph structure (modulo edge labels) while changing the meaning. Syno changes node labels and possibly the graph structure and aims at preserving the meaning.
Reify and Arg are fully safe: they are well defined and are guaranteed to fulfill our goal (Eq. 12 and 17): meaning-preserving or -altering graph transforms. Syno is more experimental and has (at least) three failure modes. In the first mode, depending on context, a human similarity judgment could change when near-synonyms are chosen (sleep → doze, a young cat → kitten, etc.). The second mode occurs when WSD commits an error (e.g., minister (political sense) → priest). A third mode are societal biases found in WordNet (e.g., the node girl may be mapped onto its 'synonym' missy). The third mode may not really be a failure, since it may not change the human rating, but, nevertheless, it may be undesirable.
In summary, Reify and Arg constitute safe robustness challenges, while results on Syno have to be taken with a grain of salt.
Status of the Challenges in BAMBOO and Out-
look We believe that a key benefit of the
robustness challenges lies in their potential to
provide complementary performance indicators,
in addition to evaluation on the Main partition
of BAMBOO (cf. §4). In particular, the challenges may serve to assess metrics more deeply, uncover potential weak spots, and help select among metrics, for example, when performance differences on Main are small. In this work, however, the
complementary nature of Reify , Syno or Arg
versus Main is only reflected in the name of the
partitions, and in our experiments, we consider all
partitions equally. Future work may deviate from
this setup.
Our proposed robustness challenges are also by no means exhaustive, and we believe that there is ample room for developing more challenges (extending BAMBOO) or experimenting with different setups of our challenges (varying BAMBOO15). For these reasons, it is possible that future work may justify alternative or enhanced setups, extensions and variations of BAMBOO.
15E.g., we may reify only selected relations, or create more data, setting Eq. 13 to metric(G, G′) ≈ metric(G, f(G′)), only applying f to one graph.
6 Experimental Insights
Questions Posed to BAMBOO
BAMBOO allows
us to address several open questions: The first set
of questions aims to gain more knowledge about
previously released metrics. That is, we would like
to know: What semantic aspects of AMR does a
metric measure? If a metric has hyper-parameters
(e.g., SEMBLEU), which hyper-parameters are suitable (for a specific objective)? Does the costly alignment of SMATCH pay off, by yielding better predictions, or do the faster alignment-free metrics offer a 'free-lunch'? A second set of questions
aims to evaluate our proposed novel AMR similar-
ity metrics, and to assess their potential advantages.
Experimental Setup We evaluate all metrics on the test set of BAMBOO. The two hyper-parameters of S2MATCH, that determine when concepts are similar, are set with a small search on the development set (in contrast, S2MATCHdefault denotes the default setup). WWLKΘ is trained with batch size 16 on the training data. S2MATCH, WWLK and WWLKΘ all make use of GloVe embeddings (Pennington et al., 2014).
Our main evaluation metric is Pearson’s ρ be-
tween a metric’s output and the human ratings.
Additionally, we consider two global performance measures to better rank AMR metrics: the arithmetic mean (amean) and the harmonic mean (hmean) over a metric's results achieved in all tasks. Hmean is always ≤ amean and is driven by low outliers. Hence, a large difference between amean and hmean serves as a warning light for a metric that is extremely vulnerable in a specific task.
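For reference, this evaluation protocol can be sketched as follows (illustrative data; ρ values scaled by 100 as in the tables below):

```python
# A minimal sketch of the evaluation: Pearson's rho per task, then the
# arithmetic and harmonic means over all task scores.
from statistics import harmonic_mean
from scipy.stats import pearsonr
import numpy as np

def profile(metric_scores, human_scores):
    """Both arguments: dicts mapping task name -> array of per-pair scores."""
    rhos = {task: 100 * pearsonr(metric_scores[task], human_scores[task])[0]
            for task in human_scores}
    amean = sum(rhos.values()) / len(rhos)
    hmean = harmonic_mean(rhos.values())   # dragged down by low outliers
    return rhos, amean, hmean

h = {"STS": np.random.rand(50), "SICK": np.random.rand(50)}
m = {k: v + 0.1 * np.random.randn(50) for k, v in h.items()}
print(profile(m, h)[1:])
```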
6.1 BAMBOO Studies Previous Metrics
Table 4 shows AMR metric results on BAMBOO across all three human similarity rating types (STS, SICK, PARA) and our four challenges: Main represents the standard setup (cf. §4), whereas Reify, Syno, and Arg test the metric robustness (cf. §5).
SMATCH and S2MATCH Rank 1st and 2nd of Previous Metrics SMATCH, our baseline metric, provides strong results across all tasks (Table 4, amean: 51.28). Using default parameters, S2MATCHdefault performs slightly worse on the main data for STS
| metric | speed | align | Main STS/SICK/PARA | Reify STS/SICK/PARA | Syno STS/SICK/PARA | Arg STS/SICK/PARA | amean | hmean |
|---|---|---|---|---|---|---|---|---|
| SMATCH | – | ✓ | 58.45/59.72/41.25 | 57.98/61.81/39.66 | 56.14/57.39/39.58 | 48.05/70.53/24.75 | 51.28 | 47.50 |
| WSMATCH | – | ✓ | 53.06/59.24/38.64 | 53.39/61.17/37.49 | 51.41/57.56/37.85 | 42.47/66.79/22.68 | 48.48 | 44.58 |
| S2MATCHdefault | – | ✓ | 56.38/58.15/42.16 | 55.65/60.04/40.41 | 56.05/57.17/40.92 | 46.51/70.90/26.58 | 50.91 | 47.80 |
| S2MATCH | – | ✓ | 58.82/60.42/42.55 | 58.08/62.25/40.60 | 56.70/57.92/41.22 | 48.79/71.41/27.83 | 52.22 | 49.07 |
| SEMA | ++ | ✗ | 55.90/53.32/33.43 | 55.51/56.16/32.33 | 50.16/48.87/29.11 | 49.73/68.18/22.79 | 46.29 | 41.85 |
| SEMBLEUk=1 | ++ | ✗ | 66.03/62.88/39.72 | 61.76/62.10/38.17 | 61.83/58.83/37.10 | 1.40/1.99/1.47 | 41.11 | 5.78 |
| SEMBLEUk=2 | ++ | ✗ | 60.62/59.86/36.88 | 57.68/59.64/36.24 | 57.34/56.18/33.26 | 44.54/67.54/16.60 | 48.87 | 42.13 |
| SEMBLEUk=3 | ++ | ✗ | 56.49/57.76/32.47 | 54.84/57.70/33.25 | 52.82/53.47/28.44 | 49.06/69.49/24.27 | 47.50 | 42.82 |
| SEMBLEUk=4 | ++ | ✗ | 53.19/56.69/29.61 | 52.28/56.12/30.11 | 49.31/52.11/25.56 | 49.75/69.58/29.44 | 46.15 | 41.75 |
| WLK (ours) | ++ | ✗ | 64.86/61.52/37.35 | 62.69/62.55/36.49 | 59.41/56.60/33.71 | 45.89/64.70/19.47 | 50.44 | 44.35 |
| WWLK (ours) | + | ✓ | 63.15/65.58/37.55 | 59.78/65.53/35.81 | 59.40/59.98/32.86 | 13.98/42.79/7.16 | 45.30 | 28.83 |
| WWLKΘ (ours) | + | ✓ | 66.94/67.64/37.91 | 64.34/65.49/39.23 | 60.11/62.29/35.15 | 55.03/75.06/29.64 | 54.90 | 50.26 |

Table 4: BAMBOO benchmark result of AMR metrics. All numbers are Pearson's ρ × 100. ++: linear time complexity; +: polynomial time complexity; –: NP complete.
and SICK, but improves upon SMATCH on PARA,
achieving a slight overall improvement with re-
spect to hmean (+0.30), but not amean (−0.37).
S2MATCH is more robust against Syno (e.g., +4.6 on Syno STS vs. SMATCH), and when confronted with reified graphs (Reify STS +3.3 vs. SMATCH). Finally, S2MATCH, after setting its two hyper-parameters with a small search on the development data16, consistently improves upon SMATCH over all tasks (amean: +0.94, hmean: +1.57).
WSMATCH: Are Nodes Near the Root More Im-
重要的? The hypothesis underlying WSMATCH
is that concepts that are located near the top of an
AMR have a higher impact on AMR similarity rat-
ings. Interestingly, WSMATCH mostly falls short of SMATCH, offering substantially lower performance on all main tasks and all robustness checks, resulting in reduced overall amean and hmean scores (e.g., main STS: −5.39 vs. SMATCH, amean: −2.8 vs. SMATCH, hmean: −2.9 vs. SMATCH). This contra-
dicts the ‘core-semantics’ hypothesis and provides
novel evidence that semantic concepts that influ-
ence human similarity ratings are not necessarily
located close to AMR roots.17
16STS/SICK: τ = 0.90, t′ = 0.95; PARA: τ = 0.0, t′ = 0.95.
17Manual inspection of examples shows that low similarity
can frequently be explained with differences in concrete
concepts that tend to be distant to the root. For example,
the low similarity (0.16) of Morsi supporters clash with riot
police in Cairo vs. Protesters clash with riot police in Kiev
arises mostly from Kiev and Cairo and Morsi, 然而, 这些
名字 (as are names in general in AMR) are distant to the root
地区, which is similar in both graphs (clash, riot, protesters,
supporters).
BFS-based Metrics I: SEMA Increases Speed but
Pays a Price Next, we find that SEMA achieves
lower scores in almost all categories, when com-
pared with SMATCH (amean: −4.99, hmean −5.65),
ending up at rank 7 (according to hmean and
amean) among prior metrics. It is similar to SMATCH in that it extracts triples from graphs, but differs by not providing an alignment. Therefore, it can only loosely model some phenomena, and
we conclude that the increase in speed comes at
the cost of a substantial drop in modeling capacity.
BFS-based Metrics II: SEMBLEU is Fast, but is
Sensitive to k Results for SEMBLEU show that it
is very sensitive to parameterizations of k. Notably, k = 1, which means that the method only extracts bags of nodes, achieves strong results on SICK and STS. On PARA, however, SEMBLEU is outperformed by S2MATCH, for all settings of k (best k (k = 2): −2.8 amean, −4.7 hmean). Moreover, all variants of SEMBLEU are vulnerable to robustness checks. E.g., k = 2, and, naturally, k = 1 are easily fooled by Arg, where performance drops massively. k = 4, on the other hand, is most robust against Arg, but overall it falls behind k = 2.
Since SEMBLEU is asymmetric, we also recompute the metric in a 'symmetric' way by averaging the metric result over different argument orders. We find that this can slightly increase its performance ([k, amean, hmean]: [1, +0.8, +0.6]; [2, +0.5, +0.4]; [3, +0.2, +0.2]; [4, +0.1, +0.0]).
In total, our conclusions concerning SEMBLEU are: (i) SEMBLEUk=1 (but not SEMBLEUk=3) performs well when measuring similarity and relatedness. However, SEMBLEUk=1 is naïve and easily
fooled (Arg). (ii) Hence, we recommend k = 2 as a good tradeoff between robustness and performance, with overall rank 4 (amean) and 6 (hmean).18
6.2 BAMBOO Assesses Novel Metrics
We now discuss results of our proposed metrics
based on the Weisfeiler-Leman Kernel.
Standard Weisfeiler-Leman (WLK) is Fast and
a Strong Baseline for AMR Similarity First,
we visit the classic Weisfeiler-Leman kernel. Like
SEMBLEU and SEMA, this (alignment-free) method is very fast. However, it outperforms these metrics in almost all tasks (score difference against second best alignment-free metric: [a|h]mean: +1.6, +1.5) but falls behind alignment-based SMATCH ([a|h]mean: −0.8, −3.2). Specifically, WLK proves robust against Reify but appears more vulnerable against Syno (−5 points on STS and SICK) and Arg (notably PARA, with −10 points).19
The better performance, compared to SEMBLEU
and SEMA, may be due to the fact that WLK (un-
like SEMBLEU and SEMA) does not perform BFS
traversal from the root, which may reduce biases.
WWLK and WWLKθ Obtain First Ranks
Basic WWLK exhibits strong performance on
SICK (ranking second on main and first on Reify). However, it has large vulnerabilities, as exposed by Arg, where only SEMBLEUk=1 ranks lower. This can be explained by the fact that WWLK (7.2 Pearson's ρ on PARA Arg) only weakly considers the semantic relations (whereas SEMBLEUk=1 does not consider semantic relations in the first place).
WWLKΘ, our proposed algorithm for edge label learning, mitigates this vulnerability (29.6 Pearson's ρ on PARA Arg, 1st rank). Learning edge labels also helps assessing similarity (STS) and relatedness (SICK), with substantial improvements over standard WWLK and SMATCH (STS: 66.94, +3.9 vs. WWLK and +10.6 vs. SMATCH; SICK +2.1 vs. WWLK and +8.4 vs. SMATCH).
18Setting k = 2 stands in contrast to the original paper that
recommended k = 3, the common setting in MT. However, lower k in SEMBLEU reduces biases (Opitz et al., 2020), which may explain the better result on BAMBOO.
19Similar to SEMBLEU, we can mitigate this performance drop on Arg PARA by increasing the amount of passes K in WLK; however, this decreases overall amean and hmean.
| K (#WL iters) | basic (K=2) amean/hmean | K=1 amean/hmean | K=3 amean/hmean | K=4 amean/hmean |
|---|---|---|---|---|
| WLK   | 50.4/44.4 | 49.8/44.2 | 47.6/42.4 | 46.4/41.5 |
| WWLK  | 45.3/28.8 | 43.4/15.3 | 45.7/31.4 | 42.3/24.0 |
| WWLKΘ | 54.9/50.3 | 52.2/35.4 | 55.2/51.1 | 50.8/47.3 |

Table 5: WLK variants with different K.
Overall, WWLKΘ occupies rank 1 of all considered metrics (amean and hmean), outperforming all non-alignment based metrics by large margins (amean +4.5 vs. WLK and +6.0 vs. SEMBLEUk=2; hmean +5.9 vs. WLK and +8.1 vs. SEMBLEUk=2), but also the alignment-based ones, albeit by lower margins (amean +2.7 vs. S2MATCH; hmean +1.2 vs. S2MATCH).
6.3 Analyzing Hyper-parameters of
(W)WLK
Setting K in (W)WLK How does setting the number of iterations in Weisfeiler-Leman affect predictions? Table 5 shows K = 2 is a good choice
for all WLK variants. K = 3 slightly increases
performance in the latent variants (WWLK: +0.4
amean; WWLKθ: +0.3 amean), but lowers perfor-
mance for the fast symbolic matching WLK (−2.8
amean). This drop is somewhat expected: K > 2
introduces much sparsity in the symbolic WLK
feature space.
WL Message Passing Direction Even though
AMR defines directional edges, for optimal sim-
ilarity ratings, it was not a-priori clear in which
directions the node contextualization should be
restricted when attempting to model human sim-
ilarity. Therefore, so far, our WLK variants have
treated AMR graphs as undirected graphs (↔).
In this experiment, we study three alternate scenarios: 'TOP-DOWN' (forward, →), where information is only passed in the direction that AMR edges point at, and 'BOTTOM-UP' (backwards, ←), where information is exclusively passed in the opposite direction, and 2WAY (⇄), where information is passed forwards, but for every edge edge(x, y) we insert an edge⁻¹(y, x). 2WAY
facilitates more node interactions than either TOP-
DOWN or BOTTOM-UP, while preserving direc-
tional information.
Our findings in Table 6 show a clear trend:
treating AMR graphs as graphs with undirected
edges offers better results than TOP-DOWN (e.g.,
|       | undirected amean/hmean | TOP-DOWN amean/hmean | BOTTOM-UP amean/hmean | 2WAYS amean/hmean |
|---|---|---|---|---|
| WLK   | 50.4/44.4 | 50.3/44.3 | 50.2/43.8 | 49.5/41.8 |
| WWLK  | 45.3/28.8 | 43.7/22.0 | 41.6/9.9  | 44.8/24.1 |
| WWLKΘ | 54.9/50.3 | 53.8/46.1 | 50.2/18.7 | 55.3/51.0 |

Table 6: (W)WLK: message passing directions.
WWLK −1.6 amean; −6.6 hmean) and considerably better results when compared to WLK in BOTTOM-UP mode (e.g., WWLK −3.7 amean; −18.9 hmean). Overall, 2WAY behaves similarly to the standard setup, with a slight improvement for WWLKΘ. Notably, the symbolic WLK variant, which does not use word embeddings, appears more robust in this experiment and differences between
6.4 Revisiting the Data Quality in BAMBOO
Initial quality analyses (§4) suggested that the
quality of BAMBOO is high, with a large proportion
of AMR graphs that are of gold or silver quality.
In this experiment, we study how metric rankings
and predictions could change when confronted
with AMRs corrected by humans. From every
data set, we randomly sample 50 AMR graph pairs
(300 AMRs in total). In each AMR, the human
annotator searched for mistakes, and corrected
them.20
We study two settings. (i) Intra metric agreement (IMA): For every metric, we calculate the correlation of its predictions for the initial graph pairs versus the predictions for the graph pairs that are ensured to be correct. Note that, on one hand, a high IMA for all metrics would further corroborate the trustworthiness of BAMBOO results. However, on the other hand, a high IMA for a single metric cannot be interpreted as a marker for this metric's quality. That is, a maximum IMA (1.0) could also indicate that a metric is completely insensitive to the human corrections. Additionally, we study (ii) Metric human agreement (MHA): Here, we correlate the metric scores against human ratings, once when fed the fully gold-ensured graph pairs and once when fed the standard graph pairs. Both measures, IMA and MHA, can provide us with an indicator of how much metric ratings would change if BAMBOO would be fully human corrected.
20Overall, few corrections were necessary, as reflected in a high SMATCH between corrected and uncorrected graphs: 95.1 (STS), 96.8 (SICK), 97.9 (PARA).
|        | STS MHA / IMA | SICK MHA / IMA | PARA MHA / IMA | AVERAGE MHA / IMA |
|---|---|---|---|---|
| SM     | [71, 73] / 97.9 | [66, 66] / 99.9 | [44, 44] / 97.9 | [60, 61] / 98.6 |
| WSM    | [64, 65] / 99.2 | [67, 67] / 99.8 | [47, 49] / 98.7 | [59, 60] / 99.2 |
| S2Mdef | [69, 70] / 97.7 | [62, 63] / 99.3 | [44, 47] / 97.7 | [58, 60] / 98.2 |
| S2M    | [71, 73] / 97.8 | [69, 70] / 98.6 | [41, 46] / 98.0 | [60, 63] / 98.1 |
| SE     | [66, 66] / 97.7 | [55, 55] / 100  | [42, 46] / 99.0 | [55, 56] / 98.9 |
| SB2    | [68, 68] / 97.2 | [62, 62] / 99.8 | [41, 42] / 98.8 | [57, 58] / 98.6 |
| SB3    | [66, 66] / 98.4 | [63, 63] / 99.7 | [33, 34] / 99.3 | [54, 54] / 99.1 |
| WLK    | [72, 72] / 98.2 | [65, 65] / 99.8 | [43, 46] / 97.9 | [60, 61] / 98.6 |
| WWLK   | [77, 78] / 97.8 | [65, 67] / 98.1 | [42, 46] / 97.8 | [61, 63] / 97.9 |
| WWLKΘ  | [78, 78] / 96.8 | [67, 68] / 98.1 | [48, 48] / 96.7 | [64, 65] / 97.2 |

Table 7: Retrospective sub-sample quality analysis of BAMBOO graph quality and sensitivity of metrics. All values are Pearson's ρ × 100. Metric Human Agreement (MHA): [x, y], where x is the correlation (to human ratings) when the metric is executed on the uncorrected sample and y is the same assessment on the manually post-processed sample.
Results are shown in Table 7. All metrics exhi-
bit high IMA, suggesting that potential changes in
their ratings, when fed gold-ensured graphs, are
quite small. Furthermore, in general, all metrics tend to exhibit slightly better correlation with the human when computed on the gold-ensured graph pairs. However, supporting the assessment of IMA, the increments in MHA appear small, ranging from a minimum increment of +0.3 (SEMBLEU) to a maximum increment of +2.8 (S2MATCH), whereas WWLK yields an increment of +1.8.
Generally, while this assessment has to be taken with a grain of salt due to the small sample size, it overall supports the validity of BAMBOO results.
6.5 Discussion
Align or not Align? We can group metrics for graph-based meaning representations into whether they compute an alignment between AMRs or not (Liu et al., 2020). A computed alignment, as in
SMATCH, has the advantage that it lets us assess
finer-grained AMR graph similarities and diver-
gences, by creating and exploiting a mapping that
shows which specific substructures of two graphs
are more or less similar to each other. On the other
hand, it was still an open question whether such
an alignment is worth its computational cost and
enhances similarity judgments.
Experiments on BAMBOO provide novel evi-
dence on this matter: alignment-based metrics
may be preferred for better accuracy. Non-
alignment based metrics may be preferred if
of 4972), all metrics considered in this paper, including ours, assign relative ranks that are too low (WWLK: 2624). Future work may incorporate external PropBank (Palmer et al., 2005) knowledge into AMR metrics. In PropBank, sense 11 of play is defined as equivalent to making music.
7 Conclusion
Our contributions in this work are three-fold: (i) We propose a suite of novel Weisfeiler-Leman AMR similarity metrics that are able to reconcile a performance conflict between precision of AMR similarity ratings and the efficiency of computing alignments. (ii) We release BAMBOO, the first benchmark that allows researchers to assess AMR metrics empirically, setting the stage for future work on graph-based meaning representation metrics. (iii) We showcase the utility of
BAMBOO, by applying it to profile existing AMR metrics, uncovering hitherto unknown strengths or weaknesses, and to assess the strengths of our newly proposed metrics that we derive and further develop from the classic Weisfeiler-Leman Kernel. We show that through BAMBOO we are able to gain novel insight regarding suitable hyperparameters of different metric types, and gain novel perspectives on how to further improve AMR similarity metrics to achieve better correlation with the degree of meaning similarity of paired sentences, as perceived by humans.
Acknowledgments
We are grateful to three anonymous reviewers
and Action Editor Yue Zhang for their valuable
comments that have helped to improve this paper.
We are also thankful to Philipp Wiesenbach for
giving helpful feedback on a draft of this paper.
This work has been partially funded by the DFG
through the project ACCEPT as part of the Prior-
ity Program ‘‘Robust Argumentation Machines’’
(SPP1999).
References
Rafael Torres Anchiêta, Marco Antonio Sobrevilla Cabezudo, and Thiago Alexandre Salgueiro Pardo. 2019. SEMA: An extended semantic evaluation for AMR. In (To appear) Proceedings of the 20th Computational Linguistics and Intelligent Text Processing. Springer International Publishing.
Figure 6: WWLK alignments and metric scores for dissimilar (top, STS) and similar (bottom, SICK) AMRs. Excavators indicate heavy Wasserstein work flow · cost.
speed matters most. The latter situation may
occur, for example, when AMR metrics must be executed over a large cross-product of parses (for instance, to semantically cluster sentences from a corpus). For a balanced approach, WWLKΘ offers
a good trade-off: polynomial-time alignment and
high accuracy.
Example Discussion I: Wasserstein Trans-
portation Analysis Explains Disagreement
Figure 6 (top) shows an example where the human-assigned similarity score is relatively low (rank 1164 of 1379). Due to the graphs having the same structure (x arg0 y; x arg1 z), the previous metrics (except SEMA) tend to assign similarities that are relatively too high. Notably, S2MATCH finds the exact same alignments in this case, but cannot assess the concept-relations more deeply. WWLK yields more informative alignments since they explain its decision to assign a more appropriate lower rank (1253 of 1379): Substantial work is needed to transport, for example, carry-01 to slice-01.
Example Discussion II: The Value of n:m Alignments Figure 6 (bottom) shows that WWLK produces valuable n:m alignments (play-11 vs. make-01 and music), which are needed to properly reflect similarity (note that SMATCH, WSMATCH, and S2MATCH only provide 1-1 alignments). However, the example also shows that there is still a way to go. While humans assess this near-equivalence easily, providing a relatively high score (rank 331
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria. Association for Computational Linguistics.
Satanjeev Banerjee and Ted Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 136–145. Springer. https://doi.org/10.1007/3-540-45715-1_11
Petr Baudiš, Jan Pichl, Tomáš Vyskočil, and Jan Šedivý. 2016a. Sentence pair scoring: Towards unified framework for text comprehension. arXiv preprint arXiv:1603.06127. https://doi.org/10.18653/v1/W16-1602
Petr Baudiš, Silvestr Stanko, and Jan Šedivý. 2016b. Joint learning of sentence embeddings for relevance and entailment. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 8–17, Berlin, Germany. Association for Computational Linguistics.
Rexhina Blloshmi, Rocco Tripodi, and Roberto Navigli. 2020. XL-AMR: Enabling cross-lingual AMR parsing with transfer learning techniques. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2487–2500, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.195
Claire Bonial, Stephanie M. Lukin, David Doughty, Steven Hill, and Clare Voss. 2020. InfoForager: Leveraging semantic search with AMR for COVID-19 research. In Proceedings of the Second International Workshop on Designing Meaning Representations, pages 67–77, Barcelona, Spain (Online). Association for Computational Linguistics.
semantic relatedness. Computational Linguistics, 32(1):13–47. https://doi.org/10.1162/coli.2006.32.1.13
Deng Cai and Wai Lam. 2019. Core semantic first:
A top-down approach for AMR parsing. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 3799–3809, Hong Kong, China. Association for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1393
Shu Cai and Kevin Knight. 2013. Smatch: An evaluation metric for semantic feature structures. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 748–752, Sofia, Bulgaria. Association for Computational Linguistics.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.
Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. 2009. Introduction to Derivative-Free Optimization. SIAM. https://doi.org/10.1137/1.9780898718768
William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Santiago de Compostela, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-3518
Sebastian Gehrmann, Tosin Adewumi, Karmanya
Aggarwal, Pawan Sasanka Ammanamanchi,
Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672. https://doi.org/10.18653/v1/2021.gem-1.10
Michael Wayne Goodman. 2020. Penman: An open-source library and tool for AMR graphs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 312–319, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-demos.35
Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. 2008. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220. https://doi.org/10.1214/009053607000000677
Pavan Kapanipathi, Ibrahim Abdelaziz, Srinivas Ravishankar, Salim Roukos, Alexander Gray, Ramon Astudillo, Maria Chang, Cristina Cornelio, Saswati Dana, Achille Fokoue, Dinesh Garg, Alfio Gliozzo, Sairam Gurajada, Hima Karanam, Naweed Khan, Dinesh Khandelwal, Young-Suk Lee, Yunyao Li, Francois Luus, Ndivhuwo Makondo, Nandana Mihindukulasooriya, Tahira Naseem, Sumit Neelam, Lucian Popa, Revanth Reddy, Ryan Riegel, Gaetano Rossiello, Udit Sharma, G. P. Shrivatsa Bhargav, and Mo Yu. 2021. Leveraging abstract meaning representation for knowledge base question answering. In Findings of the Association for Computational Linguistics: ACL. https://doi.org/10.18653/v1/2021.findings-acl.339
Jack Kiefer and Jacob Wolfowitz. 1952. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466. https://doi.org/10.1214/aoms/1177729392
Peter Kolb. 2009. Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009), pages 81–88, Odense, Denmark. Northern European Association for Language Technology (NEALT).
Michael Lesk. 1986. Automatic sense disam-
biguation using machine readable dictionaries:
How to tell a pine cone from an ice cream
cone. In Proceedings of the 5th Annual Interna-
tional Conference on Systems Documentation,
pages 24–26. https://doi.org/10.1145
/318723.318728
Jiangming Liu, Shay B. Cohen, and Mirella Lapata. 2020. Dscorer: A fast evaluation metric for discourse representation structure parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4547–4554, Online. Association for Computational Linguistics.
Chunchuan Lyu and Ivan Titov. 2018. AMR parsing as graph prediction with latent alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 397–407, Melbourne, Australia. Association for Computational Linguistics.
Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.
Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 216–223, Reykjavik, Iceland. European Languages Resources Association (ELRA).
Jonathan May. 2016. SemEval-2016 task 8: Meaning representation parsing. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1063–1073.
Jonathan May and Jay Priyadarshi. 2017. SemEval-2017 task 9: Abstract meaning representation parsing and generation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 536–545.
Tahira Naseem, Abhishek Shah, Hui Wan, Radu Florian, Salim Roukos, and Miguel Ballesteros. 2019. Rewarding Smatch: Transition-based AMR parsing with reinforcement learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4586–4592, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1451
Rik van Noord, Lasha Abzianidze, Hessel Haagsma, and Johan Bos. 2018. Evaluating scoped meaning representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. European Languages Resources Association (ELRA).
Stephan Oepen, Omri Abend, Lasha Abzianidze, Johan Bos, Jan Hajic, Daniel Hershcovich, Bin Li, Tim O’Gorman, Nianwen Xue, and Daniel Zeman. 2020. MRP 2020: The second shared task on cross-framework and cross-lingual meaning representation parsing. In Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 1–22. https://doi.org/10.18653/v1/2020.conll-shared.1
Juri Opitz. 2020. AMR quality rating with a lightweight CNN. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 235–247, Suzhou, China. Association for Computational Linguistics.
Juri Opitz and Anette Frank. 2021. Towards a decomposable metric for explainable evaluation of text generation from AMR. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1504–1518, Online. Association for Computational Linguistics.
Juri Opitz, Letitia Parcalabescu, and Anette Frank. 2020. AMR similarity metrics from principles. Transactions of the Association for Computational Linguistics, 8:522–538. https://doi.org/10.1162/tacl_a_00329
Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106. https://doi.org/10.1162/0891201053630264
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135
Ted Pedersen. 2007. Unsupervised corpus-based methods for WSD. Word Sense Disambiguation, pages 133–166. https://doi.org/10.1007/978-1-4020-4809-8_6
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.
Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. 2011. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(9).
Janaki Sheth, Young-Suk Lee, Ramon Fernandez Astudillo, Tahira Naseem, Radu Florian, Salim Roukos, and Todd Ward. 2021. Bootstrapping multilingual AMR with contextual word alignments. arXiv preprint arXiv:2102.02189.
Linfeng Song and Daniel Gildea. 2019. SemBleu: A robust metric for AMR parsing evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4547–4552, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1446
James C. Spall. 1987. A stochastic approximation technique for generating maximum likelihood parameter estimates. In 1987 American Control Conference, pages 1161–1167. IEEE.
James C. Spall. 1998. An overview of the simultaneous perturbation method for efficient optimization. Johns Hopkins APL Technical Digest, 19(4):482–492.
Matteo Togninalli, Elisabetta Ghisu, Felipe Llinares-López, Bastian Rieck, and Karsten Borgwardt. 2019. Wasserstein Weisfeiler-Lehman graph kernels. In Advances in Neural Information Processing Systems, volume 32, pages 6436–6446. Curran Associates, Inc.
Sarah Uhrig, Yoalli Garcia, Juri Opitz, and Anette Frank. 2021. Translate, then parse! A strong baseline for cross-lingual AMR parsing. In Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021), pages 58–64, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.iwpt-1.6
Chen Wang. 2020. An overview of SPSA: Recent
development and applications. arXiv 预印本
arXiv:2012.06952.
Boris Weisfeiler and Andrei Leman. 1968. The reduction of a graph to canonical form and the algebra which appears therein. NTI, Series, 2(9):12–16.
Dongqin Xu, Junhui Li, Muhua Zhu, Min Zhang, and Guodong Zhou. 2020. Improving AMR parsing with sequence-to-sequence pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2501–2511, Online. Association for Computational Linguistics.
Pinar Yanardag and S. V. N. Vishwanathan. 2015. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. https://doi.org/10.1145/2783258.2783417
Sheng Zhang, Xutai Ma, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2018. Cross-lingual decompositional semantic parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1664–1675, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1194
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1097–1100.