AMR-To-Text Generation with Graph Transformer
Tianming Wang, Xiaojun Wan, Hanqi Jin
Wangxuan Institute of Computer Technology, Peking University
The MOE Key Laboratory of Computational Linguistics, Peking University
{wangtm, wanxiaojun, jinhanqi}@pku.edu.cn
Abstract
Abstract meaning representation (AMR)-to-text generation is the challenging task of generating natural language texts from AMR graphs, where nodes represent concepts and edges denote relations. The current state-of-the-art methods use graph-to-sequence models; however, they still cannot significantly outperform the previous sequence-to-sequence models or statistical approaches. In this paper, we propose a novel graph-to-sequence model (Graph Transformer) to address this task. The model directly encodes the AMR graphs and learns the node representations. A pairwise interaction function is used for computing the semantic relations between the concepts. Moreover, attention mechanisms are used for aggregating the information from the incoming and outgoing neighbors, which helps the model to capture the semantic information effectively. Our model outperforms the state-of-the-art neural approach by 1.5 BLEU points on LDC2015E86 and 4.8 BLEU points on LDC2017T10 and achieves new state-of-the-art performances.
1 Introduction
Abstract meaning representation (AMR) is a semantic formalism that abstracts away from the syntactic realization of a sentence, and encodes its meaning as a rooted, directed, and acyclic graph. In the graph, the nodes represent concepts, and edges denote the relations between the concepts. The root of an AMR binds its contents to a single traversable graph and serves as a rudimentary representation of the overall focus. The existence of co-references and control structures results in nodes with multiple incoming edges, called reentrancies, and causes an AMR to possess a graph structure, instead of a tree structure. Numerous natural language processing (NLP) tasks can benefit from using AMR, such as machine translation (Jones et al., 2012; Song et al., 2019), question answering (Mitra and Baral, 2016), summarization (Liu et al., 2015; Takase et al., 2016), and event extraction (Huang et al., 2016).
AMR-to-text generation is the task of recovering a text representing the same meaning as a given AMR graph. Because the function words and structures are abstracted away, the AMR graph can correspond to multiple realizations. Numerous important details are underspecified, including tense, number, and definiteness, which makes this task extremely challenging (Flanigan et al., 2016). Figure 1 shows an example AMR graph and its corresponding sentence.
Early works relied on grammar-based or statistical approaches (Flanigan et al., 2016; Pourdamghani et al., 2016; Lampouras and Vlachos, 2017; Gruzitis et al., 2017). Such approaches generally require alignments between the graph nodes and surface tokens, which are automatically generated and can lead to error accumulation. In recent research, the graphs are first transformed into linear sequences, and then the text is generated from the inputs (Konstas et al., 2017). Such a method may lose information from the graph structure. The current state-of-the-art neural methods are graph-to-sequence models and hybrid variants (Beck et al., 2018; Song et al., 2018; Damonte and Cohen, 2019). These methods use a graph state long short-term memory (LSTM) network, gated graph neural network (GGNN), or graph convolution network (GCN) to encode AMR graphs directly, and they can explicitly utilize the information provided by the graph structure. However, these graph encoders still cannot significantly outperform sequence encoders. The AMR-to-text generation task can be regarded as a distinct translation task, and basing it on the concepts of off-the-shelf methods
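Transactions of the Association for Computational Linguistics, vol. 8, pp. 19–33, 2020. https://doi.org/10.1162/tacl_a_00297
Action Editor: Trevor Cohn. Submission batch: 7/2019; Revision batch: 9/2019; Published 2/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.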
for neural machine translation can be helpful. The Transformer model (Vaswani et al., 2017) is a stacked attention architecture and has shown its effectiveness in translation tasks; however, applying it to AMR-to-text generation has a major problem: It can only deal with sequential inputs.
To address these issues, we propose a novel graph network (Graph Transformer) for AMR-to-text generation. Graph Transformer is an adaptation of the Transformer model, and it has a stacked attention-based encoder-decoder architecture. The encoder considers the AMR graph as the input and learns the node representations from the node attributes by the aggregation of the neighborhood information. The global semantic information is captured by stacked graph attention layers, which allow a node to attend to the hidden states of the neighbor nodes and their corresponding relations. Multiple stacked graph attention layers enable the nodes to utilize the information of those nodes that are not directly adjacent, allowing the global information to propagate. We consider that the AMR graph is a directed graph in which the directions hold extremely important information. Therefore, for encoding the information from the incoming and outgoing edges, we use two individual graph attentions in each layer. Then we utilize a fusion layer to incorporate the information from the incoming and outgoing relations, followed by a feed-forward network. Residual connections are used for connecting adjacent layers. The final node representations are formed by concatenating the two individual representations encoded by multiple layers. The decoder is similar to the original decoder in Transformer, performing multi-head attentions and self-attentions over the representations of the nodes in the encoder and over the hidden states of the decoder, respectively. For the decoder stack, we adopt a copy mechanism to generate the texts, which can help copy low-frequency tokens, such as named entities and numbers.
We perform experiments on two benchmark datasets (LDC2015E86 and LDC2017T10). Our model significantly outperforms the prior methods and achieves a new state-of-the-art performance. Without external data, our model improves the BLEU scores of the state-of-the-art and most recently proposed neural model (i.e., g-GCNSEQ [Damonte and Cohen, 2019]) by 1.5 points on LDC2015E86 and 4.8 points on LDC2017T10. When using the Gigaword corpus as the additional training data, which is automatically labeled by a pre-trained AMR parser, our model achieves a BLEU score of 36.4 on LDC2015E86, which is the highest result on the dataset. The experimental results also show that the improved structural representation encoding by our proposed graph encoder is most useful when the amount of training data is small. The variations in our model are evaluated to verify its robustness as well as the importance of the proposed modules. Moreover, we study the performances of our model and baselines under different structures of the input graphs.
Our contributions can be summarized as follows:
• For AMR-to-text generation, we propose Graph Transformer, a novel graph-to-sequence model based on the attention mechanism. Our model uses a pairwise interaction function to compute the semantic relations and uses separate graph attentions on the incoming and outgoing neighbors, which help to better capture the semantic information provided in the graph. The code is available at https://github.com/sodawater/GraphTransformer.
• The experimental results show that our model achieves a new state-of-the-art performance on benchmark datasets.
Figure 1: An example AMR graph and its corresponding sentence. The graph is rooted by "expect-01", which means the AMR is about the expecting. The node "create" is a reentrancy and it plays two roles simultaneously (i.e., "ARG1" of "accelerate" and "ARG1" of "slow-down").
2 Related Work
2.1 AMR-to-Text Generation
Early work on AMR-to-text generation focused on statistical methods. Flanigan et al. (2016) transformed AMR graphs to appropriate spanning trees and applied tree-to-string transducers to generate texts. Song et al. (2016) partitioned an AMR graph into small fragments and generated the translations for all the fragments, whose order was finally decided by solving an asymmetric generalized traveling salesman problem. Song et al. (2017) used synchronous node replacement grammar to parse AMR graphs and generate output sentences. Pourdamghani et al. (2016) adopted a phrase-based machine translation model on the input of a linearized graph. Recent works propose using neural networks for generation. Konstas et al. (2017) used a sequence-to-sequence model to generate texts, leveraging an LSTM for encoding a linearized AMR structure. Graph-to-sequence models outperform sequence-to-sequence models, including a graph state LSTM (Song et al., 2018) and GGNN (Beck et al., 2018). The most recently developed hybrid neural model achieved state-of-the-art performance by applying a BiLSTM on the output of a GCN graph encoder, to utilize both structural and sequential information (Damonte and Cohen, 2019).
2.2 Neural Networks for Graphs
Neural network methods for processing the data represented in graph domains have been studied for several years. Graph neural networks (GNNs) have also been proposed, which are an extension of recursive neural networks and can be applied to most of the practically useful types of graphs (Gori et al., 2005; Scarselli et al., 2009). GCNs are the main alternatives for neural-based graph representations, and are widely used to address various problems (Bruna et al., 2014; Duvenaud et al., 2015; Kipf and Welling, 2017). Li et al. (2015) further extended a GNN and modified it to use gated recurrent units for processing the data represented in graphs; this method is known as a GGNN. Beck et al. (2018) followed their concept and applied a GGNN to string generation. Another neural architecture based on gated units is the graph state LSTM (Song et al., 2018), which uses an LSTM structure for encoding graph-level semantics. Our model is most similar to graph attention networks (GATs) (Velickovic et al., 2018); it incorporates the attention mechanism in the information aggregation.
2.3 Transformer Network
Recurrent neural networks (RNNs) and convolution neural networks (CNNs) have been widely used in NLP tasks because of their advantages of capturing long-term and local dependencies, respectively. Compared with these networks, models based solely on the attention mechanism show superiority in terms of parallelism and flexibility in modeling dependencies. Recently, RNN/CNN-free networks have attracted increasing interest. Vaswani et al. (2017) proposed a stacked attention architecture, the Transformer model, for neural machine translation. Gu et al. (2018) introduced a non-autoregressive translation model based on the Transformer. Zhang et al. (2018) integrated paraphrase rules and the Transformer model for sentence simplification. Devlin et al. (2018) proposed a language representation model called BERT, which achieved new state-of-the-art results on 11 NLP tasks.
3 Graph Transformer
The overall architecture of Graph Transformer is shown in Figure 2, with an example AMR graph and its corresponding sentence. We begin by providing the formal definition of AMR-to-text generation and the notations we use, and then review the Transformer model. Then we introduce the graph encoder and sentence decoder used in our model. Finally, we describe the training and decoding procedures.
3.1 Problem Formulation and Notations
Given an AMR graph, G, our goal is to generate a natural language sentence that represents the same meaning as G. Our model is trained to maximize the probability, P(S|G), where S is the gold sentence.
In the following, we define the notations used in this study. We assume a directed graph, G = (V, E), where V is a set of N nodes, E is a set of M edges, and N and M are the numbers of nodes and edges, respectively. Each edge in E can be represented as (i, j, l), where i and j are the indices of the source and target nodes, respectively, and l is the edge label. We further denote the incoming neighborhoods (i.e., reached by an incoming edge) of node vi ∈ V
Figure 2: Left: Graph attention mechanism. We take the node "accelerate" in Figure 1 as an example. The head representation is marked in yellow and the tail representation is marked in blue. The node "accelerate" has one incoming relation and two outgoing relations to attend to; Right: The overall architecture of our proposed Graph Transformer.
as $N_i^{in}$ and outgoing neighborhoods (i.e., reached by an outgoing edge) as $N_i^{out}$. The corresponding sentence is S = {s1, s2, …, sT}, where si is the i-th token of the sentence and T is the number of tokens.
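The notation above maps naturally onto a pair of adjacency lists. The following is a minimal sketch, not taken from the released code, that builds the incoming and outgoing neighborhoods from a list of (i, j, l) edges. The function name `build_neighborhoods`, the node indices, and the toy fragment (loosely based on the reentrant node described in the Figure 1 caption) are illustrative assumptions.

```python
def build_neighborhoods(num_nodes, edges):
    """Return (n_in, n_out) adjacency lists for a directed, labeled graph.

    edges holds (i, j, l) triples: i is the source index, j the target
    index, and l the edge label, following the notation of Section 3.1.
    """
    n_in = [[] for _ in range(num_nodes)]   # n_in[i]: neighbors reached by an incoming edge of v_i
    n_out = [[] for _ in range(num_nodes)]  # n_out[i]: neighbors reached by an outgoing edge of v_i
    for src, tgt, label in edges:
        if not (0 <= src < num_nodes and 0 <= tgt < num_nodes):
            raise ValueError(f"edge ({src}, {tgt}, {label}) references an unknown node")
        n_out[src].append((tgt, label))
        n_in[tgt].append((src, label))
    return n_in, n_out


# Toy fragment (hypothetical indices): "create" is ARG1 of both
# "accelerate" and "slow-down", so it has two incoming edges.
nodes = ["accelerate", "slow-down", "create"]
edges = [(0, 2, "ARG1"), (1, 2, "ARG1")]
n_in, n_out = build_neighborhoods(len(nodes), edges)
```

A reentrant node such as "create" shows up as an entry of `n_in` with more than one element, which is exactly what makes the structure a graph rather than a tree.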
3.2 Transformer
Our model is adapted from the Transformer model,
and here, we briefly review this model. The
original Transformer network uses an encoder-
decoder architecture, with each layer consisting
of a multi-head attention mechanism and a feed-
forward network. Both the components are de-
scribed here.
The multi-head attention mechanism builds on
scaled dot-product attention, which operates on a
package of queries Q and keys K of dimension
dk and values V of dimension dv,
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \tag{1} \]
The multi-head attention linearly projects
dmodel-dimensional queries, keys, and values dh
times with different projections, and it performs
scaled dot-product attention on each projected
pair. The outputs of the attention are concatenated
and again projected, resulting in the final output,
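\[ \mathrm{head}_x = \mathrm{Attention}(Q W_q^x, K W_k^x, V W_v^x) \]
\[ \mathrm{MultiHead}(Q, K, V) = \left( \big\Vert_{x=1}^{d_h} \mathrm{head}_x \right) W_o \tag{2} \]
where $\Vert$ denotes the concatenation of the $d_h$ attention heads. Projection matrices $W_q^x \in \mathbb{R}^{d_k \times d_{model}}$, $W_k^x \in \mathbb{R}^{d_k \times d_{model}}$, $W_v^x \in \mathbb{R}^{d_v \times d_{model}}$, and $W_o \in \mathbb{R}^{d_h * d_v \times d_{model}}$. $d_k = d_v = d_{model}/d_h$.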
The other component of each layer is a feed-forward network. It consists of two linear transformations, with a ReLU activation in between.
\[ \mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \tag{3} \]
For constructing a deep network and regularization, a residual connection and layer normalization are used to connect adjacent layers.
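To make Eqs. (1) and (3) concrete, the following is a small dependency-free sketch of scaled dot-product attention and the position-wise feed-forward network. Matrices are plain lists of rows, and the function names and toy inputs are illustrative assumptions rather than part of any released implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Eq. (1): softmax(Q K^T / sqrt(d_k)) V, with matrices as lists of rows."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values)) for c in range(d_v)])
    return outputs


def ffn(x, w1, b1, w2, b2):
    """Eq. (3): max(0, x W1 + b1) W2 + b2 for a single row vector x."""
    hidden = [max(0.0, sum(x[r] * w1[r][c] for r in range(len(x))) + b1[c])
              for c in range(len(b1))]
    return [sum(hidden[r] * w2[r][c] for r in range(len(hidden))) + b2[c]
            for c in range(len(b2))]


# A zero query scores every key equally, so the output is the mean of the value rows.
attended = scaled_dot_product_attention([[0.0, 0.0]],
                                        [[1.0, 0.0], [0.0, 1.0]],
                                        [[1.0, 0.0], [0.0, 1.0]])
# ReLU zeroes the negative component before the second linear map.
ffn_out = ffn([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[2.0], [3.0]], [1.0])
```

The multi-head variant of Eq. (2) would run `scaled_dot_product_attention` once per head on separately projected inputs and concatenate the results before the output projection.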
3.3 Graph Encoder
Our model also has an encoder-decoder architecture. In our model, the graph encoder is composed of a stack of L1 identical graph layers that use different parameters from layer to layer. Each layer has three sub-layers: a graph attention mechanism, fusion layer, and feed-forward network.
The encoder takes the nodes as the input and learns the node representations by aggregating the neighborhood information. Considering that an AMR graph is a directed graph, our model learns two distinct representations for each node. The first is a head representation, which represents a node when it works as a head node (i.e., a source node) in a semantic relation and only aggregates the information from the outgoing edges and corresponding nodes. The second is a tail representation, which represents a node when it works as a tail node (i.e., a target node) and only aggregates the information from the incoming edges and corresponding nodes. Specifically, we denote $\overrightarrow{h}_i^t$ and $\overleftarrow{h}_i^t$ as the head representation and tail representation of each node $v_i$ at the t-th layer, respectively. The embedding of each node (i.e., the word embedding of the concept) is fed to the graph encoder as the initial hidden state of the node,
\[ \overrightarrow{h}_i^0 = \overleftarrow{h}_i^0 = e_i W_e + b_e \tag{4} \]
where ei is the embedding of node vi, We ∈
Rdemb×dmodel and be ∈ Rdmodel are the parameters,
and demb is the dimension of the embedding.
Different from previous methods, we propose using graph attention as the aggregator, instead of a gated unit or pooling layer. In an AMR graph, the semantic representation of a node is determined by its own concept meaning and relations to other concepts. Graph attention is used for capturing such global semantic information in a graph. Specifically, it allows each node to attend to the triples that are composed of the embeddings of the neighbor nodes, embeddings of the corresponding edges, and its own embedding. We represent the triple of two adjacent nodes connected by edge (i, j, l) as
\[ r_{ij}^t = \left[ \overrightarrow{h}_i^{t-1} \,\Vert\, e_l \,\Vert\, \overleftarrow{h}_j^{t-1} \right] W_r + b_r \tag{5} \]
where $e_l \in \mathbb{R}^{d_{model}}$ is the embedding of edge label $l$ and $\left[ \overrightarrow{h}_i^{t-1} \,\Vert\, e_l \,\Vert\, \overleftarrow{h}_j^{t-1} \right]$ is the concatenation of these three representations. $W_r \in \mathbb{R}^{3 d_{model} \times d_{model}}$ and $b_r \in \mathbb{R}^{d_{model}}$ are the parameters. $r_{ij}^t$ is the representation of the triple, which will be attended to by both source node $v_i$ and target node $v_j$.
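As a worked illustration of the pairwise interaction in Eq. (5), the sketch below concatenates a head representation, an edge-label embedding, and a tail representation and applies one shared linear map. The function name and the hand-picked toy weights are assumptions for illustration only.

```python
def triple_representation(head_i, edge_emb, tail_j, w_r, b_r):
    """Eq. (5): r_ij = [head_i || e_l || tail_j] W_r + b_r.

    head_i, edge_emb, tail_j: d_model-dimensional lists.
    w_r: (3 * d_model) x d_model matrix as a list of rows; b_r: d_model list.
    """
    concat = head_i + edge_emb + tail_j  # list concatenation plays the role of ||
    if len(w_r) != len(concat):
        raise ValueError("W_r must have 3 * d_model rows")
    return [sum(concat[r] * w_r[r][c] for r in range(len(concat))) + b_r[c]
            for c in range(len(b_r))]


# d_model = 1 toy case with hand-picked weights: 1*1 + 2*10 + 3*100 + 0.5
r_ij = triple_representation([1.0], [2.0], [3.0], [[1.0], [10.0], [100.0]], [0.5])
```

Because the same `w_r` is used for every edge, the number of parameters does not grow with the number of edge labels; the label enters only through its embedding.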
Using such a pairwise interaction function to compute a relation has three advantages: 1) it does not encounter the parameter explosion problem (Beck et al., 2018) because the linear transformation for the triple is independent of the edge label, 2) the edge information is encoded by the edge embedding so that there is no loss of information, and 3) the representation incorporates the context information of the nodes. Then we perform graph attentions over the incoming and outgoing relations (i.e., incoming and outgoing edges and the corresponding nodes). The multi-head graph attentions for node vi are computed as
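\[ \overrightarrow{g}_i^t = \left[ \big\Vert_{x=1}^{d_h} \left( \sum_{j \in N_i^{out}} \alpha_{ij}^x \, r_{ij}^t W_v^x \right) \right] W_o \]
\[ \alpha_{ij}^x = \frac{\exp\left( \dfrac{\overrightarrow{h}_i^{t-1} W_q^x \cdot (r_{ij}^t W_k^x)^{\top}}{\sqrt{d_k}} \right)}{\sum_{z \in N_i^{out}} \exp\left( \dfrac{\overrightarrow{h}_i^{t-1} W_q^x \cdot (r_{iz}^t W_k^x)^{\top}}{\sqrt{d_k}} \right)} \tag{7} \]
where $\overrightarrow{g}_i^t$ is the output of the graph attention on the outgoing relations for node $v_i$. Similarly, $\overleftarrow{g}_i^t$ is computed over all the incoming relations.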
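The attention over outgoing relations in Eq. (7) can be sketched for a single head as follows. This is an illustrative reading of the equation, not the authors' implementation: the helper names are hypothetical, the concatenation over heads and the output projection are omitted, and returning a zero vector for a node without outgoing edges is an assumption, since the text does not specify that case.

```python
import math


def project(x, w):
    """Row vector x times matrix w (given as a list of rows)."""
    return [sum(x[r] * w[r][c] for r in range(len(x))) for c in range(len(w[0]))]


def graph_attention_head(h_i, triples, w_q, w_k, w_v):
    """One head of Eq. (7) for node v_i.

    h_i: previous-layer head representation of v_i.
    triples: list of r_ij vectors, one per outgoing neighbor j.
    Returns (aggregated vector, attention weights).
    """
    d_v = len(w_v[0])
    if not triples:  # assumption: a node without outgoing edges aggregates nothing
        return [0.0] * d_v, []
    query = project(h_i, w_q)
    keys = [project(r, w_k) for r in triples]
    d_k = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d_k) for k in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    vals = [project(r, w_v) for r in triples]
    out = [sum(a * v[c] for a, v in zip(alphas, vals)) for c in range(d_v)]
    return out, alphas


identity = [[1.0, 0.0], [0.0, 1.0]]
out, alphas = graph_attention_head([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]],
                                   identity, identity, identity)
empty_out, empty_alphas = graph_attention_head([1.0, 0.0], [], identity, identity, identity)
```

In the toy call, the first triple is aligned with the query and therefore receives the larger weight. The attention over incoming relations would follow the same pattern over the incoming neighborhood.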
Following the graph attention sub-layer, we use a fusion layer to incorporate the information aggregated from the incoming and outgoing relations.
\[ s_i^t = \mathrm{sigmoid}\left( \left[ \overrightarrow{g}_i^t \,\Vert\, \overleftarrow{g}_i^t \right] W_s + b_s \right) \]
\[ g_i^t = s_i^t * \overrightarrow{g}_i^t + (1 - s_i^t) * \overleftarrow{g}_i^t \tag{8} \]
where $W_s \in \mathbb{R}^{2 * d_{model} \times 1}$ and $b_s \in \mathbb{R}^{1}$ are the parameters.
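The fusion gate of Eq. (8) reduces to a scalar sigmoid followed by a convex combination. A minimal sketch, with a hypothetical function name and toy values, is:

```python
import math


def fuse(g_out, g_in, w_s, b_s):
    """Eq. (8): gate s = sigmoid([g_out || g_in] W_s + b_s); return (g, s).

    w_s is the (2 * d_model) x 1 matrix stored as a flat list; b_s is a float.
    """
    concat = g_out + g_in
    if len(w_s) != len(concat):
        raise ValueError("W_s must have 2 * d_model entries")
    logit = sum(x * w for x, w in zip(concat, w_s)) + b_s
    s = 1.0 / (1.0 + math.exp(-logit))
    fused = [s * a + (1.0 - s) * b for a, b in zip(g_out, g_in)]
    return fused, s


# Zero weights give s = 0.5, i.e. a plain average of the two directions.
fused, gate = fuse([2.0, 0.0], [0.0, 2.0], [0.0, 0.0, 0.0, 0.0], 0.0)
```

Since the gate is a single scalar per node, it weighs the whole outgoing summary against the whole incoming summary rather than gating individual dimensions.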
The last sub-layer is a fully connected feed-forward network, which is applied to each node separately and identically. We use a GeLU activation function instead of the standard ReLU activation. The dimensions of the input, inner layer, and output are dmodel, 4 ∗ dmodel, and 2 ∗ dmodel, respectively. The output is divided into two parts to obtain the head and tail representations, respectively. In addition, a residual connection is used to connect adjacent layers.
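\[ \overrightarrow{O}^t \,\Vert\, \overleftarrow{O}^t = \mathrm{FFN}(G^t) \]
\[ \overrightarrow{H}^t = \mathrm{LayerNorm}(\overrightarrow{O}^t + \overrightarrow{H}^{t-1}) \]
\[ \overleftarrow{H}^t = \mathrm{LayerNorm}(\overleftarrow{O}^t + \overleftarrow{H}^{t-1}) \tag{9} \]
where $G^t$ is the package of outputs $g_i^t$. $\overrightarrow{H}^t$ and $\overleftarrow{H}^t$ are the packages of head representation $\overrightarrow{h}_i^t$ and tail representation $\overleftarrow{h}_i^t$, respectively. LayerNorm is the layer normalization. Note that using a residual connection and layer normalization around each layer in the graph encoder is more effective than using them around each of the three sub-layers for our model.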
The final node representation is obtained
by concatenating the forward and backward
representations. A linear transformation layer is also used for compressing the dimension. For convenience, we denote $h_i$ as the final representation of node $v_i$,
\[ h_i = \left[ \overrightarrow{h}_i^{L_1} \,\Vert\, \overleftarrow{h}_i^{L_1} \right] W_h \tag{10} \]
where $W_h \in \mathbb{R}^{2 d_{model} \times d_{emb}}$ is a parameter and $L_1$ is the number of layers of the encoder stack.
3.4 Sentence Decoder
In our model, the decoder has an architecture similar to that in the original Transformer model, which is composed of L2 identical layers. Each layer has three sub-layers: a multi-head self-attention mechanism, multi-head attention mechanism over the output of the encoder stack, and position-wise feed-forward network. A residual connection is used for connecting adjacent sub-layers. The decoder generates the natural language sentence, and we denote the hidden state at position i of the t-th layer in the decoder stack as $\hat{h}_i^t$. Different from the input representation of the encoder, the position information is added and the sum of the embedding and position encoding is fed as the input,
\[ \hat{h}_i^0 = e_i W_e + b_e + pe_i \tag{11} \]
where $e_i$ and $pe_i \in \mathbb{R}^{d_{model}}$ are the embedding and positional encoding of the token at position i, respectively.
The self-attention sub-layer is used for encoding
the information of the decoded subsequences.
We use masking to ensure that the attention and
prediction for position i depend only on the known
words at positions preceding i,
\[ A^t = \mathrm{MultiHead}(\hat{H}^{t-1}, \hat{H}^{t-1}, \hat{H}^{t-1}) \]
\[ B^t = \mathrm{LayerNorm}(A^t + \hat{H}^{t-1}) \tag{12} \]
where $\hat{H}^{t-1}$ is the package of hidden states $\hat{h}_i^{t-1}$ in the decoder.
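The masking used in the self-attention sub-layer can be illustrated with a small sketch. It assumes the common convention that decoder inputs are shifted right, so that the input at position i is an already generated word and position i may attend to positions j ≤ i; this convention and the function names are assumptions, not details stated in the text.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        raise ValueError("at least one position must be visible")
    peak = max(kept)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
# Position 1 sees positions 0 and 1 but not position 2.
weights = masked_softmax([1.0, 1.0, 1.0], mask[1])
```

Applying `masked_softmax` row by row in place of the plain softmax of Eq. (1) yields attention weights that never look at future positions.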
Next, the output of the self-attention is further fed into the multi-head attention and feed-forward network, expressed as follows:
\[ \hat{A}^t = \mathrm{MultiHead}(B^t, H, H) \]
\[ \hat{B}^t = \mathrm{LayerNorm}(\hat{A}^t + B^t) \]
\[ \hat{O}^t = \mathrm{FFN}(\hat{B}^t) \]
\[ \hat{H}^t = \mathrm{LayerNorm}(\hat{O}^t + \hat{B}^t) \tag{13} \]
where H is the package of final node representations $h_i$ encoded by the graph encoder.
For convenience, we denote the final hidden state of the decoder at position i as $\hat{h}_i$. Considering that numerous low-frequency open-class tokens such as named entities and numbers in an AMR graph appear in the corresponding sentence, we adopt the copy mechanism (Gu et al., 2016) to solve the problem. A gate is used over the decoder stack for controlling the generation of words from the vocabulary or directly copying them from the graph, expressed as
\[ \theta_i = \sigma(\hat{h}_i W_\theta + b_\theta) \tag{14} \]
where $W_\theta \in \mathbb{R}^{d_{model} \times 1}$ and $b_\theta \in \mathbb{R}^{1}$ are the parameters.
Probability distribution $p_i^g$ of the words to be directly generated at time-step i is computed as
\[ p_i^g = \mathrm{softmax}(\hat{h}_i W_g + b_g) \tag{15} \]
where $W_g \in \mathbb{R}^{d_{model} \times d_{vocab}}$ and $b_g \in \mathbb{R}^{d_{vocab}}$ are the parameters and $d_{vocab}$ is the vocabulary size.
Probability distribution $p_i^c$ of the words to be copied at time-step i is computed as
\[ p_i^c = \sum_{i^*=1}^{N} \frac{\exp(\hat{h}_i \cdot h_{i^*})}{\sum_{j^*=1}^{N} \exp(\hat{h}_i \cdot h_{j^*})} \, z_{i^*} \tag{16} \]
where $z_{i^*}$ is the one-hot vector of node $v_{i^*}$.
The final probability distribution of the words at time-step i is the interpolation of two probabilities,
\[ p_i = \theta_i * p_i^g + (1 - \theta_i) * p_i^c \tag{17} \]
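The copy distribution and the final interpolation of Eqs. (16) and (17) can be sketched as below. The sketch assumes that the one-hot vector of a node indexes the vocabulary entry of its concept, so that the copy distribution lives in the same space as the generation distribution; the function names and toy numbers are illustrative.

```python
import math


def copy_distribution(scores, node_token_ids, vocab_size):
    """Eq. (16): softmax over node scores, scattered onto vocabulary ids.

    scores[k] is the dot product between the decoder state and node k.
    node_token_ids[k] is the (assumed) vocabulary id of node k's concept.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    p_copy = [0.0] * vocab_size
    for e, token_id in zip(exps, node_token_ids):
        p_copy[token_id] += e / total  # nodes sharing a token pool their mass
    return p_copy


def final_distribution(theta, p_gen, p_copy):
    """Eq. (17): p = theta * p_gen + (1 - theta) * p_copy."""
    return [theta * g + (1.0 - theta) * c for g, c in zip(p_gen, p_copy)]


# Two nodes with equal scores that both map to vocabulary id 2.
p_copy = copy_distribution([0.0, 0.0], [2, 2], vocab_size=3)
p_final = final_distribution(0.5, [0.5, 0.5, 0.0], p_copy)
```

In the toy call, the generator assigns no probability to word 2, yet the final distribution does, which is the situation the copy mechanism is meant to handle for rare tokens.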
3.5 Training and Decoding
For the training, we aim to maximize the likelihood of each gold-standard output sequence, S, given the graph, G.
\[ l(S|G) = \sum_{i=1}^{T} \log P(s_i \mid s_{i-1}, \ldots, s_1, G, \theta) \tag{18} \]
where θ is the model parameter. $P(s_i \mid s_{i-1}, \ldots, s_1, G, \theta)$ corresponds to the probability score of word $s_i$ in $p_i$ computed by Eq. (17).
We use beam search to generate the target sentence during the decoding stage.
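The training objective of Eq. (18) is a sum of per-step log probabilities of the gold tokens. A minimal sketch, with a hypothetical function name and made-up probabilities, is:

```python
import math


def sequence_log_likelihood(step_distributions, gold_ids):
    """Eq. (18): sum over time-steps of log P(s_i | s_<i, G, theta).

    step_distributions[i] is the final word distribution at time-step i;
    gold_ids[i] is the vocabulary id of the gold token s_i.
    """
    if len(step_distributions) != len(gold_ids):
        raise ValueError("need one distribution per gold token")
    return sum(math.log(dist[tok]) for dist, tok in zip(step_distributions, gold_ids))


# Two time-steps over a three-word vocabulary (made-up probabilities).
ll = sequence_log_likelihood([[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]], [0, 1])
```

In practice, maximizing this quantity is usually implemented as minimizing its negative, averaged over a batch.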
4 Comparison to Prior Graph Encoders
In this section, we compare our proposed graph encoder with the existing ones presented in prior works.
Most models, including a GCN (Damonte and Cohen, 2019), GGNN (Beck et al., 2018), and GraphLSTM (Song et al., 2018), use a non-pairwise interaction function to represent the information to be aggregated from the neighborhoods. Specifically, they ignore the receiver node (i.e., the node to be updated), operating only on the sender node (i.e., the neighbor node) and edge attribute (Battaglia et al., 2018). They add a self-loop edge for each node so that its own information can be considered. In our model, we compute the pairwise interactions using Eq. (5); therefore, no self-loop edge is required.
In our model, the graph attention mechanism is similar to GAT (Velickovic et al., 2018). The main differences are that GAT is designed for undirected graphs and neither directions nor labels of edges are considered. We propose using two distinct representations (i.e., head representation and tail representation) for each node and utilizing graph attentions on the incoming and outgoing relations. Therefore, the model can consider the differences in the incoming and outgoing relations, and the results presented in the next section verify the effectiveness of this proposed modification. In addition, GAT adopts additive attention and uses averages of the outputs of the multi-head attention in the final layer. In our model, we use a scaled dot-product attention for all the attention layers.
5 Experiments
5.1 Data and Preprocessing
We used two standard AMR corpora (LDC2015E86 and LDC2017T10) as our experiment datasets. The LDC2015E86 dataset contains 16,833 instances for the training, 1,368 for the development, and 1,371 for the test. The LDC2017T10 dataset is the latest AMR corpus release, which contains 36,521 instances for the training and the same instances for the development and test as in LDC2015E86. Most prior works evaluate their models on the former dataset. Because prior approaches during the same period achieve the state-of-the-art performances on LDC2015E86 and LDC2017T10, respectively, we performed experiments on both the datasets.
Following Konstas et al. (2017), we supplemented the gold data with large-scale external data. We used the Gigaword corpus1 released by Song et al. (2018) as the external data, which was automatically parsed by JAMR. For training on both the gold data and automatically labeled data, the same training strategy as that of Konstas et al. (2017) was adopted, which is fine-tuning the model on the gold data after each epoch of the pre-training on the Gigaword data.
5.2 Parameter Settings and Training Details
We set our model parameters based on preliminary experiments on the development set. dmodel is set to 256 and demb is set to 300. The head number of attention is set to 2. The numbers (L1 and L2) of layers of the encoder and decoder are set to 8 and 6, respectively. The batch size is set to 64. We extract a vocabulary from the training set, which is shared by both the encoder and the decoder. The word embeddings are initialized from GloVe word embeddings (Pennington et al., 2014). We use the Adam optimizer (Kingma and Ba, 2015) with lr = 0.0002, β1 = 0.9, β2 = 0.98, and ε = 10^−9. The learning rate is halved every time perplexity on the development set does not improve for two epochs. We apply dropout to the output of each attention sub-layer and the input embeddings, and use a rate of Pdrop = 0.3. Beam search with a beam size of 6 is used for decoding. During training, we filter out instances with more than 100 nodes in the graph or 100 words in the sentence to speed up training. Note that dmodel is set to 512, the head number is set to 4, and the learning rate is set to 0.0001 when training on both gold data and automatically labeled data.
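The learning-rate rule described above can be sketched as a small scheduler. The class name is hypothetical, and two details are assumptions because the text does not state them: "two epochs" is read as two consecutive epochs without improvement over the best development perplexity so far, and the counter is reset after each halving.

```python
class HalvingSchedule:
    """Halve the learning rate when dev perplexity stalls for `patience` epochs."""

    def __init__(self, lr=0.0002, patience=2):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, dev_perplexity):
        """Record one epoch's development perplexity and return the new rate."""
        if dev_perplexity < self.best:
            self.best = dev_perplexity
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= 2.0
                self.bad_epochs = 0  # assumption: start counting afresh after a halving
        return self.lr


schedule = HalvingSchedule(lr=0.0002, patience=2)
# Made-up perplexities: two improvements, two stalls, then another improvement.
history = [schedule.step(p) for p in [10.0, 9.0, 9.5, 9.2, 8.0]]
```

With the made-up perplexities above, the rate stays at 0.0002 for three epochs and drops to 0.0001 after the second stalled epoch.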
5.3 Metrics and Baselines
Following existing works, we evaluate the results with the BLEU metric (Papineni et al., 2002). We also report the results using CHRF++ (Popović, 2017), similar to Beck et al. (2018).
Our direct baseline is the original Transformer, which takes a linearized graph as the input. We use the same linearization as that by Konstas et al. (2017). We also compare our model with prior statistical approaches (PBMT, Tree2Str, and TSP), sequence-to-sequence approaches (S2S+Anon and S2S+Copy), current state-of-the-art
1 https://www.cs.rochester.edu/~lsong10/downloads/2m.json.gz.
Method                                      BLEU
PBMT (Pourdamghani et al., 2016)            26.9
Tree2Str (Flanigan et al., 2016)            23.0
TSP (Song et al., 2016)                     22.4
S2S+Anon (Konstas et al., 2017)             22.0
GraphLSTM (Song et al., 2018)               23.3
t-GCNSEQ (Damonte and Cohen, 2019)          23.9
g-GCNSEQ (Damonte and Cohen, 2019)          24.4
Transformer                                 17.7
Graph Transformer                           25.9
S2S+Anon (2M) (Konstas et al., 2017)        32.3
S2S+Anon (20M) (Konstas et al., 2017)       33.8
S2S+Copy (2M) (Song et al., 2018)           31.7
GraphLSTM (2M) (Song et al., 2018)          33.6
Transformer (2M)                            35.1
Graph Transformer (2M)                      36.4
Table 1: Test results of models. "(2M)" / "(20M)" denotes using the corresponding number of automatically labeled Gigaword data instances as additional training data.
graph-to-sequence approaches (GraphLSTM and GGNN), and hybrid approaches (t-GCNSEQ and g-GCNSEQ). PBMT (Pourdamghani et al., 2016) adopts a phrase-based machine translation model with the input of a linearized AMR graph. Tree2Str (Flanigan et al., 2016) converts AMR graphs into trees by splitting the reentrancies and applies a tree-to-string transducer to generate text. TSP (Song et al., 2016) solves the generation problem as a traveling salesman problem. S2S+Anon (Konstas et al., 2017) is a multi-layer attention-based bidirectional LSTM model, which is trained with anonymized data. S2S+Copy (Song et al., 2018) is also an attention-based LSTM model, but it instead uses the copy mechanism. GGNN (Beck et al., 2018) uses a gated graph neural network to encode the AMR graph and an RNN-based decoder to generate the text. GraphLSTM (Song et al., 2018) utilizes a graph state LSTM as the graph encoder and uses the copy mechanism instead of anonymization. t-GCNSEQ (Damonte and Cohen, 2019) also splits the reentrancies and applies stacking of the encoders to encode the tree, in which BiLSTM networks are used on top of the GCN for utilizing both the structural and sequential information. g-GCNSEQ has the same architecture as t-GCNSEQ, but it directly encodes the graph rather than the tree. Tree2Str, TSP, S2S+Anon, S2S+Copy, and GraphLSTM have been trained on LDC2015E86. PBMT has been trained on a previous release of the corpus (LDC2014T12).2 Note that PBMT, Tree2Str, and TSP also train and use a language model based on an additional Gigaword corpus. GGNN has been trained on LDC2017T10. t-GCNSEQ and g-GCNSEQ have been trained on both LDC2015E86 and LDC2017T10.
5.4 Comparison Results
桌子 1 summarizes the results of the models using
LDC2015E86 as the gold training data. 什么时候
trained only on the gold training data, our model
achieves the best BLEU score of 25.9 among all
the neural models and outperforms S2S+Anon by
3.9 BLEU points. Compared with the graph-to-
sequence model, GraphLSTM, our model is 2.6
BLEU points higher, which shows the superior-
ity of our proposed architecture. Our model also
outperforms hybrid models
t-GCNSEQ and
g-GCNSEQ by 2.0 points and 1.5 点, 重新指定-
主动地. Comparing the two sequence-to-sequence
neural models, Transformer underperforms the
RNN-based model (S2S+Anon). This is in plain
contrast to their performances in machine trans-
关系. The reason is attributed to the possible
extreme length of the linearized AMR graph and
difficulty in performing self-attention to obtain a
good context representation of each token with a
small training data. Our proposed Graph Trans-
former does not encounter this problem, 哪个
is significantly better than Transformer, and im-
proves the BLEU score by more than 8 点.
It also shows that our proposed deep architecture
is even effective with a small training data. 这
results of statistical approaches PBMT, Tree2Str,
and TSP are not strictly comparable because
they use an additional Gigaword corpus to train
the language model. Our model still outperforms
the Tree2Str and TSP and performs close to the
PBMT.
Following the approach of Konstas et al. (2017), we also evaluate our model using automatically labeled Gigaword data as additional training data. When using external data, the performance of our model improves significantly. Utilizing 2M Gigaword data, the performance of our model improves by 10.5 BLEU points. With the 2M additional data, our model achieves a new state-of-the-art BLEU score of 36.4 points, which is 4.7 and 2.8 points
2 The LDC2014T12 dataset contains 10,313 training instances and the same development and test instances as LDC2015E86.
| Methods | BLEU | CHRF++ |
|---|---|---|
| GGNN (Beck et al., 2018) | 23.3 | 50.4 |
| GGNN(ensemble) (Beck et al., 2018) | 27.5 | 53.5 |
| t-GCNSEQ (Damonte and Cohen, 2019) | 24.1 | - |
| g-GCNSEQ (Damonte and Cohen, 2019) | 24.5 | - |
| Transformer | 19.4 | 48.1 |
| Graph Transformer | 29.3 | 59.0 |

Table 2: Test results of models trained on LDC2017T10.
higher than those of S2S+Copy and GraphLSTM using the same training data, respectively. Transformer achieves a BLEU score of 35.1, which is much higher than that of the same model trained only on the gold data. This verifies the effectiveness of a deep neural model when the training dataset is sufficiently large. With 20M external data, S2S+Anon obtains a BLEU score of 33.8, which is much worse than the score of our model. We speculate that the performance can be further improved with a larger amount of external data; however, we do not attempt this owing to hardware limitations. Note that the CHRF++ score is not reported for these approaches in previous works; therefore, we do not compare it in this experiment.
Table 2 lists the results of the models trained on LDC2017T10. Our model strongly outperforms GGNN, and improves the BLEU score by 6.0 points and the CHRF++ score by 8.6 points. The hybrid models t-GCNSEQ and g-GCNSEQ achieve BLEU scores of 24.1 and 24.5, which are 5.2 and 4.8 points lower than that of our model, respectively. Compared with the same model trained on the smaller gold training data in Table 1, the BLEU score of our model improves by 3.4 points, whereas the scores of t-GCNSEQ and g-GCNSEQ improve by only 0.2 and 0.1 points, respectively. This indicates that the performance of our model can easily benefit from more gold training data. Beck et al. (2018) also reported the scores of the GGNN ensemble, which achieves a BLEU score of 27.5 and a CHRF++ score of 53.5; these scores are still much worse than those of our single model.
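The margins quoted for the LDC2017T10 comparison can be rechecked directly from the Table 2 scores. The following is a minimal sketch of that arithmetic only; the dictionary layout and the helper name `margin` are illustrative assumptions, not part of the released system.

```python
# Test-set scores copied from Table 2 (models trained on LDC2017T10).
# Each value is (BLEU, CHRF++); None marks a score the table does not report.
TABLE2 = {
    "GGNN": (23.3, 50.4),
    "GGNN (ensemble)": (27.5, 53.5),
    "t-GCNSEQ": (24.1, None),
    "g-GCNSEQ": (24.5, None),
    "Transformer": (19.4, 48.1),
    "Graph Transformer": (29.3, 59.0),
}


def margin(system, baseline, metric=0):
    """Return system minus baseline for BLEU (metric=0) or CHRF++ (metric=1)."""
    a, b = TABLE2[system][metric], TABLE2[baseline][metric]
    if a is None or b is None:
        return None  # nothing to compare when a score is unreported
    return round(a - b, 1)


for name in TABLE2:
    if name != "Graph Transformer":
        print(name,
              margin("Graph Transformer", name),
              margin("Graph Transformer", name, metric=1))
```

Rounding to one decimal place mirrors the precision of the table and avoids floating-point noise in the differences.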
5.5 Model Variations
To evaluate the importance of the different components of our proposed Graph Transformer, we vary our model and perform both hyper-parameter and ablation studies. We train the models on both LDC2015E86 and LDC2017T10 and measure the performance changes on the development set; the results are listed in Table 3.

Figure 3: Development results of Graph Transformer and GraphLSTM against transition steps in the graph encoder.
5.5.1 Hyper-Parameter Tuning
In Table 3 (A), we vary the number of transition steps (i.e., number of layers), L1, in the graph encoder. The performance of our model increases as L1 increases; however, it starts decreasing gradually when L1 becomes larger than 8. Our model achieves the best performance when L1 equals 8. This shows that incorporating information from distant nodes helps capture the global semantic information. The performance drop when L1 is larger than 8 may be attributed to over-fitting, because the amount of training data is not large. In addition, we compare the BLEU scores of our model and GraphLSTM with the same number of transition steps. These models are trained only on LDC2015E86. Results on the development set are shown in Figure 3. Compared with GraphLSTM, our model performs consistently and significantly better as L1 varies from 1 to 10. This indicates that our proposed graph encoder has a stronger ability to utilize both local and global semantic information.
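One way to read the effect of L1 is through the receptive field of a node. As a minimal sketch, assume each transition step lets a node aggregate only from its direct incoming and outgoing neighbors; then after a given number of steps it can have received information from nodes at most that many hops away. The function name `reachable_within` and the toy chain are hypothetical and are not taken from the paper or the corpus.

```python
from collections import defaultdict


def reachable_within(edges, start, steps):
    """Nodes whose information can reach `start` after `steps` rounds of
    neighbor aggregation, treating incoming and outgoing edges alike."""
    neighbors = defaultdict(set)
    for head, tail in edges:
        neighbors[head].add(tail)
        neighbors[tail].add(head)
    seen = {start}
    frontier = {start}
    for _ in range(steps):
        # Expand by one hop and keep only nodes not seen before.
        frontier = {n for f in frontier for n in neighbors[f]} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


# Toy chain a -> b -> c -> d -> e (hypothetical example graph).
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
for steps in (1, 2, 4):
    print(steps, sorted(reachable_within(chain, "a", steps)))
```

Under this assumption, a small L1 leaves distant concepts invisible to each other, which is consistent with the low scores at L1 = 2 in Table 3 (A); it does not by itself explain the drop beyond L1 = 8.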
In Table 3 (B), we vary the number of layers in the decoder, L2. Our model achieves the best performance when L2 equals 6, and its performance drops significantly when L2 decreases. With few layers, the decoder might not be able to utilize the information provided by the graph encoder and generate fluent sentences. An extremely large
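| | L1 | L2 | dmodel | dh | Pdrop | BLEU (LDC2015E86) | BLEU (LDC2017T10) |
|---|---|---|---|---|---|---|---|
| base | 8 | 6 | 256 | 2 | 0.3 | 25.5 | 28.8 |
| (A) | 2 | | | | | 20.4 | 24.6 |
| | 4 | | | | | 23.7 | 27.6 |
| | 6 | | | | | 24.3 | 28.3 |
| | 10 | | | | | 24.6 | 28.4 |
| (B) | | 4 | | | | 23.4 | 27.1 |
| | | 5 | | | | 24.6 | 28.0 |
| | | 7 | | | | 24.8 | 28.7 |
| (C) | | | 512 | | | 25.1 | 28.5 |
| (D) | | | | 1 | | 23.6 | 28.4 |
| | | | | 4 | | 23.0 | 28.7 |
| (E) | | | | | 0.1 | 22.7 | 26.9 |
| | | | | | 0.2 | 25.3 | 28.5 |
| | | | | | 0.4 | 24.7 | 27.9 |
| (F) | single representation for each node | | | | | 25.1 | 28.3 |
| | single representation, inseparate graph attention | | | | | 23.1 | 27.6 |

Table 3: Development results of the variations on Graph Transformer. Unlisted values are identical to those of the base model. Both models trained on LDC2015E86 and LDC2017T10 are evaluated.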
L2 also adversely affects the performance, particularly when training on the smaller dataset.
In Table 3 (C), we observe that larger models do not lead to better performance. We attribute this to the number of training pairs being quite small. In Table 3 (D), we observe that the models, trained on a small dataset, are extremely sensitive to the number of heads, dh. Single-head attention is 1.9 BLEU points worse than the best setting. The performance also decreases with too many heads. Using more training data, our model becomes more stable and insensitive to dh. In Table 3 (E), we can see that a suitable rate of dropout is extremely helpful for avoiding over-fitting.
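The hyper-parameter groups (A) to (E) each change one value of the base model and leave the rest untouched. A minimal sketch of that one-factor-at-a-time grid is given below, with values as read from Table 3; the class name `Config`, the field names, and the helper `changed_fields` are hypothetical, and treating dmodel as the model dimension is our assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    l1: int = 8          # transition steps (layers) in the graph encoder
    l2: int = 6          # layers in the decoder
    d_model: int = 256   # dmodel in Table 3 (assumed to be the model dimension)
    d_h: int = 2         # number of attention heads
    p_drop: float = 0.3  # dropout rate


BASE = Config()

# One entry per group in Table 3; every variation changes a single field.
GRID = {
    "A": [replace(BASE, l1=v) for v in (2, 4, 6, 10)],
    "B": [replace(BASE, l2=v) for v in (4, 5, 7)],
    "C": [replace(BASE, d_model=512)],
    "D": [replace(BASE, d_h=v) for v in (1, 4)],
    "E": [replace(BASE, p_drop=v) for v in (0.1, 0.2, 0.4)],
}


def changed_fields(cfg, base=BASE):
    """Return the fields in which cfg differs from the base configuration."""
    return {k: v for k, v in vars(cfg).items() if getattr(base, k) != v}


for group, configs in GRID.items():
    print(group, [changed_fields(c) for c in configs])
```

Keeping the base configuration frozen and deriving each variation with `replace` makes it hard to change two settings by accident, which is what allows each row of the table to be attributed to a single factor.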
5.5.2 Ablation Study
We further perform an ablation study on the two datasets to investigate the influence of the modules in the graph encoder. We fix the sentence decoder of our model because it is similar to the original one in Transformer. The modules in the graph encoder are tested by two methods: using a single representation for each node (i.e., the head representation and tail representation are updated with shared parameters), and using a single representation and performing the inseparate graph attention over the incoming and outgoing relations simultaneously (i.e., the output of the attention in Eq. (7) is $g_i^t = \left( \big\Vert_{x=1}^{d_h} \sum_{j \in N_i^{in} \cup N_i^{out}} \alpha_{ij}^{x} W_v^{x} r_{ij}^{t} \right) W_o$ and the fusion layer is discarded). These modifications test the effectiveness of the separate graph attentions.
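To make the difference between the two attention scopes concrete, the sketch below builds the incoming and outgoing neighbor sets of each node and returns the groups over which attention would be computed. It is only an illustration of the neighbor sets: attention weights, projections, and the fusion layer are omitted, the function names are hypothetical, and we assume that each returned group is normalized on its own. The edges are transcribed from the first example in Table 9, with inverse roles kept in the direction in which they are written.

```python
from collections import defaultdict


def neighbor_sets(edges):
    """Build N_in and N_out for every node from (head, label, tail) triples,
    where an edge points from head to tail."""
    n_in, n_out = defaultdict(set), defaultdict(set)
    for head, _label, tail in edges:
        n_out[head].add(tail)
        n_in[tail].add(head)
    return n_in, n_out


def attention_scopes(node, n_in, n_out, separate=True):
    """Return the neighbor groups attended over for `node`.

    separate=True  -> two groups: incoming neighbors, outgoing neighbors.
    separate=False -> one group: the union (the inseparate ablation)."""
    if separate:
        return [sorted(n_in[node]), sorted(n_out[node])]
    return [sorted(n_in[node] | n_out[node])]


# Edges of the first AMR in Table 9 (variables only).
EDGES = [
    ("h", "ARG1", "w"), ("w", "ARG0", "y"), ("w", "ARG1-of", "h2"),
    ("h", "ARG2", "w2"), ("w2", "ARG0", "y"), ("w2", "ARG2", "o"),
    ("w2", "mod", "r"),
]

n_in, n_out = neighbor_sets(EDGES)
print(attention_scopes("w", n_in, n_out))                  # two groups
print(attention_scopes("w", n_in, n_out, separate=False))  # one merged group
```

In the merged setting the parent of a node competes with its children inside a single group, which is one plausible reading of the confusion discussed for the inseparate variant.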
The results are presented in Table 3 (F). We can see that using a single representation for each node results in a loss of 0.4 and 0.5 BLEU points on the two datasets, respectively. This indicates that learning the head representation and tail representation for each node is helpful. We further observe that without separate graph attentions (i.e., with the inseparate graph attention), the performance of our model drops, suffering a loss of 2.4 BLEU points on the LDC2015E86 dataset and 1.2 on LDC2017T10. We consider that the relations represented by the incoming edges and outgoing edges are different. Moreover, projecting them into the same space for the graph attentions might cause confusion, particularly when the amount of training data is small. Separate graph attentions can help the model better capture the semantics.
Compared with the prior methods, there are two
changes in our model: the graph encoder and the
Transformer decoder. To study the influences of
the different encoders and decoders, we imple-
ment three encoders (RNN encoder, Transformer
encoder, and our graph encoder) and two decoders (RNN decoder and Transformer decoder). We also perform a study of their combinations. Table 4 presents the results. We find an interesting phenomenon: simply mixing Transformer-based networks with RNNs can lead to a large decrease in performance. Whether we replace the RNN decoder with the Transformer decoder in S2S, or replace the Transformer decoder with the RNN decoder in Transformer and Graph Transformer, the modified models perform much worse than the original ones. This indicates that there is a mismatch in using an RNN (or Transformer) to decode a sentence from the representations encoded by Transformer-based networks (or RNNs). The superior performance of our model is owing to the interplay of the Transformer-based graph encoder and the Transformer decoder.

| Methods | BLEU (LDC2015E86) | BLEU (LDC2017T10) |
|---|---|---|
| RNN+RNN (S2S) | 21.2 | 22.3 |
| RNN+TFM | 17.9 | 18.7 |
| TFM+RNN | 16.0 | 17.9 |
| TFM+TFM (Transformer) | 17.4 | 19.0 |
| Graph+RNN | 21.1 | 25.2 |
| Graph+TFM (Ours) | 25.5 | 28.8 |

Table 4: Development results of models with three different encoders and two different decoders. A+B represents the model with A encoder and B decoder. RNN encoder and decoder are abbreviated to RNN, Transformer encoder and decoder are abbreviated to TFM, and our Graph encoder is abbreviated to Graph.

5.6 Performance Against Size and Structure of AMR Graph
To study the advantages of our proposed model against prior sequence-to-sequence models and graph models further, we compare the results of the models on different sizes and structures of the AMR graphs. All the models are trained on LDC2015E86. We consider the size and structure of a graph in three respects: depth, number of edges, and number of reentrancies.

| Model | Depth 1−5 | Depth 6−10 | Depth 11− |
|---|---|---|---|
| #count | 278 | 828 | 265 |
| S2S | 37.2 | 21.2 | 19.3 |
| GraphLSTM | +3.9 | +2.1 | +0.7 |
| Graph Transformer | +6.0 | +4.5 | +3.7 |

Table 5: Counts of AMR graphs with different depth for the test split and the BLEU scores of different models on these graphs.

| Model | Number of edges 1−10 | Number of edges 11−20 | Number of edges 21− |
|---|---|---|---|
| #count | 425 | 452 | 494 |
| S2S | 30.8 | 21.4 | 19.7 |
| GraphLSTM | +4.3 | +2.9 | +0.8 |
| Graph Transformer | +6.2 | +4.6 | +3.8 |

Table 6: Counts of AMR graphs with different number of edges for the test split and the BLEU scores of different models on these graphs.

| Model | Number of reentrancies 0 | Number of reentrancies 1−3 | Number of reentrancies 4− |
|---|---|---|---|
| #count | 624 | 566 | 181 |
| S2S | 26.6 | 21.2 | 15.3 |
| GraphLSTM | +2.9 | +2.4 | +0.1 |
| Graph Transformer | +7.1 | +4.0 | +2.9 |

Table 7: Counts of AMR graphs with different number of reentrancies for the test split and the BLEU scores of different models on these graphs.
The depth of an AMR graph is defined as the longest distance between an AMR node and the root. The deeper the graph, the longer the dependency. Table 5 lists the counts of the AMR graphs with different depths for the test split and the results of different models on these graphs. We can see that the graph models outperform the sequence-to-sequence model, but the gap narrows when the depth increases. GraphLSTM outperforms S2S by 3.9 points when the depth is less than 6, and the gap is only 0.7 points when the depth is larger than 10. Compared with GraphLSTM, Graph Transformer returns better performance on deeper graphs, which shows that our model is more powerful for capturing long-term dependencies.
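The depth statistic can be sketched as a breadth-first search from the root. The definition above does not say whether "distance" means the shortest path, so the sketch assumes it does and takes the maximum over all reachable nodes; `amr_depth` and `depth_bucket` are hypothetical names, and the edges are transcribed from the first example in Table 9.

```python
from collections import defaultdict, deque


def amr_depth(edges, root):
    """Largest shortest-path distance (in edges) from root to any reachable
    node, following edges from head to tail."""
    children = defaultdict(list)
    for head, tail in edges:
        children[head].append(tail)
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in dist:  # first visit gives the shortest distance
                dist[child] = dist[node] + 1
                queue.append(child)
    return max(dist.values())


def depth_bucket(depth):
    """Map a depth to the column labels used in Table 5."""
    if depth <= 5:
        return "1-5"
    if depth <= 10:
        return "6-10"
    return "11-"


# (head, tail) pairs of the first AMR in Table 9, rooted at h.
EXAMPLE_EDGES = [("h", "w"), ("w", "y"), ("w", "h2"), ("h", "w2"),
                 ("w2", "y"), ("w2", "o"), ("w2", "r")]

depth = amr_depth(EXAMPLE_EDGES, "h")
print(depth, depth_bucket(depth))
```

With the longest-path reading instead, the result would differ only for graphs in which a reentrant node can be reached by paths of different lengths.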
The edges in an AMR graph represent the semantic relations of the concepts. The more edges in the graph, the more semantic information is represented and usually the larger the graph. Table 6 lists the counts of the AMR graphs with different numbers of edges for the test split and
| Model | Unnatural Language % | Missing Information % | Node Mistranslation % | Edge Mistranslation % |
|---|---|---|---|---|
| S2S | 62.0 | 60.0 | 34.0 | 52.0 |
| GraphLSTM | 54.0 | 54.0 | 28.0 | 46.0 |
| Graph Transformer | 48.0 | 44.0 | 28.0 | 40.0 |

Table 8: The ratio of outputs with each error type for each compared system. Lower percentage is better.
the corresponding BLEU scores of the different models. We observe that all the models perform much better on small graphs than on large ones. Similar to the phenomena based on Table 4, our model shows a stronger ability in dealing with more semantic relations than GraphLSTM.
Following Damonte and Cohen (2019), we also study the influence of the reentrancies in the graph. Reentrancies represent the co-references and control structures in AMR and make it a graph rather than a tree. A graph with more reentrancies is typically more complex. From Table 7, we can see that the performance of all the models drops significantly when the number of reentrancies increases. With more reentrancies, the lead of the graph-to-sequence models over the sequence-to-sequence model also narrows. We consider that this is because the reentrancies increase the complexity of the graph structure and make it difficult for the graph models to learn the semantic representation. Our model exhibits an extremely strong performance when the input degrades into a tree. This is because we use two graph attentions over the incoming and outgoing edges, respectively, and having only one incoming edge makes the model easy to learn and train. In addition, our model outperforms S2S by 2.9 points when the input graphs have more than 3 reentrancies. In comparison, GraphLSTM achieves a result nearly identical to that of S2S, which indicates that our proposed encoder is also better at dealing with complex graphs than GraphLSTM.
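Counting reentrancies reduces to inspecting in-degrees. The sketch below assumes that a node with k incoming edges (k > 1) contributes k − 1 reentrancies; the text does not fix this convention, and counting each such node once would be an equally simple variant. `count_reentrancies` and `reentrancy_bucket` are hypothetical names, and the edges are transcribed from the first example in Table 9.

```python
from collections import Counter


def count_reentrancies(edges):
    """Number of extra incoming edges over all nodes (in-degree minus one,
    summed over nodes whose in-degree exceeds one)."""
    in_degree = Counter(tail for _head, tail in edges)
    return sum(d - 1 for d in in_degree.values() if d > 1)


def reentrancy_bucket(count):
    """Map a reentrancy count to the column labels used in Table 7."""
    if count == 0:
        return "0"
    if count <= 3:
        return "1-3"
    return "4-"


# (head, tail) pairs of the first AMR in Table 9; y is reached from w and w2.
EXAMPLE_EDGES = [("h", "w"), ("w", "y"), ("w", "h2"), ("h", "w2"),
                 ("w2", "y"), ("w2", "o"), ("w2", "r")]

n = count_reentrancies(EXAMPLE_EDGES)
print(len(EXAMPLE_EDGES), n, reentrancy_bucket(n))
```

A graph with a count of zero is a tree, which is the case in which every node has at most one incoming edge.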
5.7 Case Study
We perform case studies to provide a better understanding of the model performance. We compare the outputs of S2S, GraphLSTM, and our Graph Transformer trained on the gold data of LDC2015E86. We observe that there are several common error types in the outputs of these systems: 1) generating unnatural language or unreadable sentences; 2) missing information from the input graph; 3) generating words or tokens inconsistent with the given semantic representation (i.e., mistranslation of the nodes in the graph); 4) mixing the semantic relations between the entities (i.e., mistranslation of the edges in the graph).
To show explicitly how systematic these errors are, we manually evaluate 50 randomly sampled outputs from each compared system and count the ratio of the outputs with each error type. Note that these four types of errors are not mutually exclusive. Table 8 lists the results. We can clearly see that these four types of errors occur in all three systems, and Graph Transformer performs the best by comparison. Compared with S2S and GraphLSTM, our Graph Transformer covers significantly more information from the input graph. All the models make more mistakes on the fluency aspect than on the other three aspects. This is because both the missing information from the input graph and the mistranslation of the nodes and edges can cause a generated sentence to be unnatural or unreadable.
In addition, we present several example outputs in Table 9. AMR denotes the input graph and Ref denotes the reference output sentence.
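The per-type ratios in Table 8 can be computed by a simple tally over annotated outputs. The sketch below assumes each evaluated output is annotated with a set of error types; because the types are not mutually exclusive, the percentages may sum to more than 100. The annotations shown are invented for illustration and are not the actual judgments behind Table 8.

```python
ERROR_TYPES = ("unnatural", "missing", "node", "edge")


def error_ratios(annotations):
    """Percentage of outputs exhibiting each error type.

    annotations: one set of error types per evaluated output; an empty set
    means the output shows none of the four errors."""
    total = len(annotations)
    return {t: 100.0 * sum(t in a for a in annotations) / total
            for t in ERROR_TYPES}


# Four hypothetical annotated outputs.
sample = [
    {"unnatural", "missing"},
    {"unnatural"},
    set(),
    {"unnatural", "edge"},
]
print(error_ratios(sample))
```

With 50 outputs per system, each annotated output contributes 2 percentage points, which is why all values in Table 8 are even.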
In the first case, we can see that S2S fails to generate a fluent sentence. It also omits the concept, work, and therefore adversely affects the semantic relation between you and hard. GraphLSTM omits the adverb hard and generates an adverb really for the verb work, which is only supposed to modify the verb want. Graph Transformer generates a basically correct answer.
In the second case, the AMR graph is more complex. S2S mistranslates the concept, stand, as take away and omits the adjective passive. The verb, plant, is also omitted, which might be caused by the long distance between plant and pressure in the linearized input. Moreover, the entire sentence is unreadable owing to numerous grammar mistakes. GraphLSTM also omits the concept, passive, and fails to generate the clause headed by stand. In addition, it makes a mistake at the conjunction between the pressure and
AMR: (h / have-condition-91
    :ARG1 (w / work-01
        :ARG0 (y / you)
        :ARG1-of (h2 / hard-02))
    :ARG2 (w2 / want-01
        :ARG0 y
        :ARG2 (o / out)
        :mod (r / really)))
Ref: Work hard if you really want out.
S2S: If you really want to want out you are hard.
GraphLSTM: If you really want out, you really work.
Graph Transformer: If you really want out, then you 'll work hard.

AMR: (a / and
    :op1 (s / stand-11
        :ARG0 (c / criminal-organization :wiki "Taliban"
            :name (n / name :op1 "Taliban"))
        :ARG1 (p / passive)
        :ARG2 (c2 / cultivate-01)
        :time (y / year
            :mod (t2 / this)))
    :op2 (p2 / pressure-01
        :ARG0 c
        :ARG1 (p5 / person
            :mod (c3 / country :wiki "Afghanistan"
                :name (n2 / name :op1 "Afghanistan"))
            :ARG0-of (f / farm-01))
        :ARG2 (p3 / plant-01
            :ARG0 p5
            :ARG1 (p4 / poppy
                :mod (o / opium)))
        :degree (l / less)))
Ref: The Taliban this year are taking a passive stance toward cultivation and putting less pressure on Afghan farmers to plant opium poppy.
S2S: the Taliban will take away with cultivation of cultivate this year and pressure on Afghan farmers less with opium poppy in them.
GraphLSTM: The Taliban has been standing in the cultivation this year and less pressure from the Afghan farmers to plant opium poppies.
Graph Transformer: The Taliban has stood passive in cultivation this year and has less pressured Afghan farmers to plant opium poppies.

AMR: (p / participate-01
    :ARG0 (p2 / person :quant 40
        :ARG1-of (e / employ-01
            :ARG0 (c / company
                :mod (o / oil)
                :mod (s / state)))
        :accompanier (s2 / soldier :quant 1200))
    :ARG1 (e2 / exercise
        :ARG1-of (r / resemble-01))
    :time (d / date-entity :year 2005 :month 6))
Ref: 40 employees of the state oil company participated in a similar exercise with 1200 soldiers in 050600.
S2S: In June 2005 40 people's state company with 1200 soldiers were part of a similar exercise.
GraphLSTM: 40 state of oil companies using 1200 soldiers have participated in similar exercises in 6.
Graph Transformer: In June 2005 40 oil companies have participated in similar exercises with 1200 soldiers.

Table 9: Example outputs of different systems are compared, including S2S, GraphLSTM, and Graph Transformer.
farmers. Our model does not omit any concept
in the graph, but the generated sentence is not
highly fluent. It treats the stand and pressure
as predicates and fails to generate take and put
because they are explicitly given in the graph.
In the last case, all three models fail to capture the concept, employ, and disturb the relations between person, employ, and company. S2S omits the adjective, oil, and mistranslates the concept, participate, as part of. GraphLSTM is completely confused in this case and even fails to generate the word, June, from the relation, :month 6. Our Graph Transformer correctly generates the sentence constituents other than the subject.
Overall, the four types of errors occur in all three models, particularly when the input graph is complex. Compared with S2S and GraphLSTM, our model is less likely to miss information from the input, and it can generate sentences of high quality in terms of fluency and fidelity to the input semantics.
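The AMR inputs in Table 9 are written in PENMAN notation, in which a node is `(variable / concept ...)` and a bare variable such as `y` refers back to an earlier node. As a minimal sketch of how such a string maps to the nodes and edges discussed in the case study, the following recursive-descent reader collects concepts and labeled edges. It is an illustrative toy, not the preprocessing used in our experiments: `parse_amr` is a hypothetical name, the input is assumed to be well-formed, and constants such as `:quant 40` are recorded as edges to literal targets.

```python
import re

# Tokens: parentheses, double-quoted strings, or runs of other characters.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()]+')


def parse_amr(text):
    """Parse a PENMAN-style AMR string into (concepts, edges).

    concepts maps variable -> concept; edges holds (source, role, target)
    triples, where target is a variable or a constant."""
    tokens = TOKEN.findall(text)
    pos = 0
    concepts, edges = {}, []

    def node():
        nonlocal pos
        pos += 1                         # skip "("
        var = tokens[pos]
        concepts[var] = tokens[pos + 2]  # tokens[pos + 1] is "/"
        pos += 3
        while tokens[pos] != ")":
            role = tokens[pos]
            pos += 1
            if tokens[pos] == "(":
                target = node()          # nested node: recurse
            else:
                target = tokens[pos]     # re-used variable or constant
                pos += 1
            edges.append((var, role, target))
        pos += 1                         # skip ")"
        return var

    node()
    return concepts, edges


FIRST_CASE = """
(h / have-condition-91
    :ARG1 (w / work-01 :ARG0 (y / you) :ARG1-of (h2 / hard-02))
    :ARG2 (w2 / want-01 :ARG0 y :ARG2 (o / out) :mod (r / really)))
"""

concepts, edges = parse_amr(FIRST_CASE)
print(concepts["y"], [e for e in edges if e[2] == "y"])
```

In the first case of Table 9, the variable `y` (you) is the target of two `:ARG0` edges, one from work-01 and one from want-01; this shared node is the reentrancy that a linearized sequence model has to recover from position alone.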
6 Conclusion
In this study, we present a novel graph network (Graph Transformer) for AMR-to-text generation. Our model is solely based on the attention mechanism. Our proposed graph attentions over the neighbor nodes and the corresponding edges are used for learning the representations of the nodes and capturing global information. The experimental results show that our model significantly outperforms the prior neural models and achieves a new state-of-the-art performance on benchmark datasets.
In future work, we will incorporate BERT embeddings and multi-task learning to improve the performance further. We will also apply Graph Transformer to other related text generation tasks like MRS-to-text generation, data-to-text generation, and image captioning.
Acknowledgments
This work was supported by National Natural Science Foundation of China (61772036) and Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank the anonymous reviewers for their helpful comments. Xiaojun Wan is the corresponding author.
References
Peter Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchezgonzalez, Vinicius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
Daniel Beck, Gholamreza Haffari, and Trevor Cohn. 2018. Graph-to-sequence learning using gated graph neural networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 273–283.
Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. 2014. Spectral networks and locally connected networks on graphs. International Conference on Learning Representations.
Marco Damonte and Shay B. Cohen. 2019. Structural neural encoders for AMR-to-text generation. arXiv preprint arXiv:1903.11410v1.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232.
Jeffrey Flanigan, Chris Dyer, Noah A. Smith, and Jaime Carbonell. 2016. Generation from abstract meaning representation using tree transducers. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 731–739.
Marco Gori, Gabriele Monfardini, and Franco Scarselli. 2005. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, pages 729–734. IEEE.
Normunds Gruzitis, Didzis Gosko, and Guntis Barzdins. 2017. Rigotrio at SemEval-2017 task 9: Combining machine learning and grammar engineering for AMR parsing and generation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics.
Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. International Conference on Learning Representations.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
Lifu Huang, Taylor Cassidy, Xiaocheng Feng, Heng Ji, Clare R. Voss, Jiawei Han, and Avirup Sil. 2016. Liberal event extraction and event schema induction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 258–268.
Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-based machine translation with hyperedge replacement grammars. Proceedings of COLING 2012, pages 1359–1376.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations.
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations.
Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-sequence models for parsing and generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 146–157.
Gerasimos Lampouras and Andreas Vlachos. 2017. Sheffield at SemEval-2017 task 9: Transition-based language generation from AMR. In Proceedings of the 11th International
Workshop on Semantic Evaluation (SemEval-2017), pages 586–591.
Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman M. Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. North American Chapter of the Association for Computational Linguistics, pages 1077–1086.
Arindam Mitra and Chitta Baral. 2016. Addressing a question answering challenge by combining statistical methods with inductive rule learning and reasoning. In Thirtieth AAAI Conference on Artificial Intelligence.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Maja Popović. 2017. CHRF++: Words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618.
Nima Pourdamghani, Kevin Knight, and Ulf Hermjakob. 2016. Generating English from abstract meaning representations. In Proceedings of the 9th International Natural Language Generation Conference, pages 21–25.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.
Linfeng Song, Daniel Gildea, Yue Zhang, Zhiguo Wang, and Jinsong Su. 2019. Semantic neural machine translation using AMR. Transactions of the Association for Computational Linguistics, 7:19–31.
Linfeng Song, Xiaochang Peng, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2017. AMR-to-text generation with synchronous node replacement grammar. Meeting of the Association for Computational Linguistics, 2:7–13.
Linfeng Song, Yue Zhang, Xiaochang Peng, Zhiguo Wang, and Daniel Gildea. 2016. AMR-to-text generation as a traveling salesman problem. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2084–2089.
Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. A graph-to-sequence model for AMR-to-text generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1616–1626.
Sho Takase, Jun Suzuki, Naoaki Okazaki, Tsutomu Hirao, and Masaaki Nagata. 2016. Neural headline generation on abstract meaning representation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1054–1059.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. International Conference on Learning Representations.
Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018. Improving the transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 533–542.