Attentive Convolution:
Equipping CNNs with RNN-style Attention Mechanisms
Wenpeng Yin
Department of Computer and Information
Science, University of Pennsylvania
wenpeng@seas.upenn.edu
Hinrich Schütze
Center for Information and Language
Processing, LMU Munich, Germany
inquiries@cislmu.org
Abstract
In NLP, convolutional neural networks
(CNNs) have benefited less than recur-
rent neural networks (RNNs) from attention
mechanisms. We hypothesize that this is be-
cause the attention in CNNs has been mainly
implemented as attentive pooling (i.e., it is
applied to pooling) rather than as attentive
convolution (i.e., it is integrated into
convolution). Convolution is the differentiator
of CNNs in that it can powerfully model
the higher-level representation of a word by
taking into account its local fixed-size
context in the input text tx. In this work, we
propose an attentive convolution network,
ATTCONV. It extends the context scope of
the convolution operation, deriving higher-
level features for a word not only from
local context, but also from information ex-
tracted from nonlocal context by the atten-
tion mechanism commonly used in RNNs.
This nonlocal context can come (i) from
parts of the input text tx that are distant
or (ii) from extra (i.e., external) contexts
ty. Experiments on sentence modeling with
zero-context (sentiment analysis), single-context
(textual entailment), and multiple-context
(claim verification) demonstrate the
effectiveness of ATTCONV in sentence rep-
resentation learning with the incorporation
of context. In particular, attentive
convolution outperforms attentive pooling and
is a strong competitor to popular attentive
RNNs.¹
¹ https://github.com/yinwenpeng/Attentive_Convolution.
1 Introduction
Natural language processing (NLP) has benefited greatly from the
resurgence of deep neural networks (DNNs), thanks to their high
performance with less need of engineered features. A DNN typically is
composed of a stack of non-linear transformation layers, each generating
a hidden representation for the input by projecting the output of a
preceding layer into a new space. To date, building a single and static
representation to express an input across diverse problems is far from
satisfactory. Instead, it is preferable that the representation of the
input vary in different application scenarios. In response, attention
mechanisms (Graves, 2013; Graves et al., 2014) have been proposed to
dynamically focus on parts of the input that are expected to be more
specific to the problem. They are mostly implemented based on
fine-grained alignments between two pieces of objects, each emitting a
dynamic soft-selection to the components of the other, so that the
selected elements dominate in the output hidden representation.
Attention-based DNNs have demonstrated good performance on many tasks.
Convolutional neural networks (CNNs; LeCun et al., 1998) and recurrent
neural networks (RNNs; Elman, 1990) are two important types of DNNs.
Most work on attention has been done for RNNs. Attention-based RNNs
typically take three types of inputs to make a decision at the current
step: (i) the current input state, (ii) a representation of local
context (computed unidirectionally or bidirectionally; Rocktäschel et al.
[2016]), and (iii) the attention-weighted sum of hidden states
corresponding to nonlocal context (e.g., the hidden states of the
encoder in neural machine translation; Bahdanau et al. [2015]). An
important question, therefore, is whether CNNs can benefit from such an
attention mechanism as well, and how. This is our technical motivation.
Our second motivation is natural language understanding.
In generic sentence modeling without
extra context (Collobert et al., 2011; Kalchbrenner
et al., 2014; Kim, 2014), CNNs learn sentence
representations by composing word representations
that are conditioned on a local context window.
We believe that attentive convolution is needed
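Transactions of the Association for Computational Linguistics, vol. 6, pp. 687–702, 2018. Action Editor: Slav Petrov.
Submission batch: 6/2018; Revision batch: 10/2018; Published 12/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.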
Premise, modeled as context ty                          Label
Plant cells have structures that animal cells lack.     0
Animal cells do not have cell walls.                    1
The cell wall is not a freestanding structure.          0
Plant cells possess a cell wall, animals never.         1
Table 1: Examples of four premises for the hypothesis
tx = “A cell wall is not present in animal cells.” in the
SCITAIL data set. Right column (hypothesis’s label):
“1” means true, “0” otherwise.
for some natural language understanding tasks that
are essentially sentence modeling within contexts.
Examples: textual entailment (is a hypothesis true given a premise as
the single context?; Dagan et al. [2013]) and claim verification (is a
claim correct given extracted evidence snippets from a text corpus as
the context?; Thorne et al. [2018]). Consider the SCITAIL (Khot et al.,
2018) textual entailment examples in Table 1; here, the input text tx is
the hypothesis and each premise is a context text ty. And consider the
illustration of claim verification in Figure 1; here, the input text tx
is the claim and ty can consist of multiple pieces of context. In both
cases, we would like the representation of tx to be context-specific.
In this work, we propose attentive convolution networks, ATTCONV, to
model a sentence (i.e., tx) either in intra-context (where ty = tx) or
extra-context (where ty ≠ tx and ty can have many pieces) scenarios. In
the intra-context case (sentiment analysis, for example), ATTCONV
extends the local context window of standard CNNs to cover the entire
input text tx. In the extra-context case, ATTCONV extends the local
context window to cover accompanying contexts ty.
For a convolution operation over a window in tx such as (left_context,
word, right_context), we first compare the representation of word with
all hidden states in the context ty to obtain an attentive context
representation att_context; then convolution filters derive a
higher-level representation for word, denoted as word_new, by
integrating word with three pieces of context: left_context,
right_context, and att_context. We interpret this attentive convolution
from two perspectives. (i) For intra-context, a higher-level word
representation word_new is learned by considering the local (i.e.,
left_context and right_context) as well as nonlocal (i.e., att_context)
context. (ii) For extra-context, word_new is generated to represent
word, together with its cross-text alignment att_context, in the context
left_context and right_context. In other words, the decision for the
word is made based on the connected hidden states of cross-text aligned
terms and local context.
Figure 1: Verify claims in contexts.
We apply ATTCONV to three sentence modeling tasks with variable-size
context: a large-scale Yelp sentiment classification task (Lin et al.,
2017) (intra-context, i.e., no additional context), SCITAIL textual
entailment (Khot et al., 2018) (single extra-context), and claim
verification (Thorne et al., 2018) (multiple extra-contexts). ATTCONV
outperforms competitive DNNs with and without attention and achieves
state-of-the-art performance on the three tasks.
Overall, we make the following contributions:
• This is the first work that equips convolution
filters with the attention mechanism com-
monly used in RNNs.
• We distinguish and build flexible modules—
attention source, attention focus, and atten-
tion beneficiary—to greatly advance the ex-
pressivity of attention mechanisms in CNNs.
• ATTCONV provides a new way to broaden
the originally constrained scope of filters in
conventional CNNs. Broader and richer context
comes from either external context (i.e.,
ty) or the sentence itself (i.e., tx).
• ATTCONV shows its flexibility and effec-
tiveness in sentence modeling with variable-
size context.
2 Related Work
In this section we discuss attention-related DNNs
in NLP, the most relevant work for our paper.
2.1 RNNs with Attention
Graves (2013) and Graves et al. (2014) first introduced a differentiable
attention mechanism that allows RNNs to focus on different parts of the
input. This idea has been broadly explored in RNNs, shown in Figure 2,
to deal with text generation, such as neural machine translation
(Bahdanau et al., 2015; Luong et al., 2015; Kim et al., 2017; Libovický
and Helcl, 2017), response generation in social media (Shang et al.,
2015), document reconstruction (Li et al., 2015), and document
summarization (Nallapati et al., 2016); machine comprehension (Hermann
et al., 2015; Kumar et al., 2016; Xiong et al., 2016; Seo et al., 2017;
Wang and Jiang, 2017; Xiong et al., 2017; Wang et al., 2017a); and
sentence relation classification, such as textual entailment (Cheng
et al., 2016; Rocktäschel et al., 2016; Wang and Jiang, 2016; Wang
et al., 2017b; Chen et al., 2017b) and answer sentence selection (Miao
et al., 2016). We try to explore the RNN-style attention mechanisms in
CNNs, more specifically, in convolution.
Figure 2: A simplified illustration of attention mechanism in RNNs.
2.2 CNNs with Attention
In NLP, there is little work on attention-based CNNs. Gehring et al.
(2017) propose an attention-based convolutional seq-to-seq model for
machine translation. Both the encoder and decoder are hierarchical
convolution layers. At the nth layer of the decoder, the output hidden
state of a convolution queries each of the encoder-side hidden states,
then a weighted sum of all encoder hidden states is added to the decoder
hidden state, and finally this updated hidden state is fed to the
convolution at layer n + 1. Their attention implementation relies on the
existence of a multi-layer convolution structure; otherwise the weighted
context from the encoder side could not play a role in the decoder. So
essentially their attention is achieved after convolution. In contrast,
we aim to modify the vanilla convolution, so that CNNs with attentive
convolution can be applied more broadly.
We discuss two systems that are representative of CNNs that implement
the attention in pooling (i.e., the convolution is still not affected):
Yin et al. (2016) and dos Santos et al. (2016), illustrated in Figure 3.
Specifically, these two systems work on two input sentences, each with a
set of hidden states generated by a convolution layer; then, each
sentence will learn a weight for every hidden state by comparing this
hidden state with all hidden states in the other sentence; finally, each
input sentence obtains a representation by a weighted mean pooling over
all its hidden states. The core component, weighted mean pooling, was
referred to as “attentive pooling,” aiming to yield the sentence
representation.
Figure 3: Attentive pooling, summarized from ABCNN (Yin et al., 2016)
and APCNN (dos Santos et al., 2016).
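The attentive pooling scheme described above can be sketched in a few lines of NumPy. This is only an illustrative reading, not the ABCNN or APCNN implementation: the dot-product matching, the row-wise and column-wise max used to turn the score matrix into one weight per hidden state, and the function names are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attentive_pooling(Hx, Hy):
    """Weighted mean pooling of two convolution feature maps.

    Hx: (d, n_x) hidden states of sentence tx
    Hy: (d, n_y) hidden states of sentence ty
    Returns one d-dimensional representation per sentence.
    """
    scores = Hx.T @ Hy                # (n_x, n_y) matching scores (assumed: dot product)
    wx = softmax(scores.max(axis=1))  # one weight per hidden state of tx (assumed: row-wise max)
    wy = softmax(scores.max(axis=0))  # one weight per hidden state of ty (assumed: column-wise max)
    # Only the scalar weights carry cross-text information into the output.
    return Hx @ wx, Hy @ wy

rng = np.random.default_rng(0)
Hx = rng.standard_normal((4, 5))
Hy = rng.standard_normal((4, 6))
rx, ry = attentive_pooling(Hx, Hy)
```

Note that each output is a convex combination of that sentence's own hidden states; the hidden states of the other sentence never enter the representation directly.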
In contrast to attentive convolution, attentive
pooling does not directly connect the hidden states
of cross-text aligned phrases in a fine-grained
manner to the final decision making; only the
matching scores contribute to the final weighting
in mean pooling. This important distinction be-
tween attentive convolution and attentive pooling
is further discussed in Section 3.3.
Inspired by the attention mechanisms in RNNs,
we assume that it is the hidden states of aligned
phrases rather than their matching scores that can
better contribute to representation learning and decision
making. Thus, our attentive convolution differs
from attentive pooling in that it uses attended hidden
states from extra context (i.e., ty) or broader-range
context within tx to participate in the convolution.
In experiments, we will show its superiority.
3 ATTCONV Model
We use bold uppercase (e.g., H) for matrices;
bold lowercase (e.g., h) for vectors; bold lowercase
with index (e.g., h_i) for columns of H; and
non-bold lowercase for scalars.
(a) Light attentive convolution layer
(b) Advanced attentive convolution layer
Figure 4: ATTCONV models sentence tx with context ty.
To start, we assume that a piece of text t (t ∈ {tx, ty}) is
represented as a sequence of hidden states h_i ∈ R^d (i = 1, 2, ...,
|t|), forming feature map H ∈ R^{d×|t|}, where d is the dimensionality
of hidden states. Each hidden state h_i has its left context l_i and
right context r_i. In concrete CNN systems, contexts l_i and r_i can
cover multiple adjacent hidden states; we set l_i = h_{i-1} and
r_i = h_{i+1} for simplicity in the following description.
We now describe light and advanced versions
of ATTCONV. Recall that ATTCONV aims to compute
a representation for tx in a way that convolution
filters encode not only local context, but also
attentive context over ty.
3.1 Light ATTCONV
Figure 4(a) shows the light version of ATTCONV. It differs in two key
points, (i) and (ii), both from the basic convolution layer that models
a single piece of text and from the Siamese CNN that models two text
pieces in parallel. (i) A matching function determines how relevant each
hidden state in the context ty is to the current hidden state h_i^x in
sentence tx. We then compute an average of the hidden states in the
context ty, weighted by the matching scores, to get the attentive
context c_i^x for h_i^x. (ii) The convolution for position i in tx
integrates hidden state h_i^x with three sources of context: left
context h_{i-1}^x, right context h_{i+1}^x, and attentive context c_i^x.
Attentive Context. First, a function generates a matching score e_{i,j}
between a hidden state in tx and a hidden state in ty by (i) dot product:
    e_{i,j} = (h_i^x)^T \cdot h_j^y                                  (1)
or (ii) bilinear form:
    e_{i,j} = (h_i^x)^T W_e h_j^y                                    (2)
(where W_e ∈ R^{d×d}), or (iii) additive projection:
    e_{i,j} = (v_e)^T \cdot \tanh(W_e \cdot h_i^x + U_e \cdot h_j^y)  (3)
where W_e, U_e ∈ R^{d×d} and v_e ∈ R^d.
Given the matching scores, the attentive context c_i^x for hidden state
h_i^x is the weighted average of all hidden states in ty:
    c_i^x = \sum_j \mathrm{softmax}(e_i)_j \cdot h_j^y               (4)
We refer to the concatenation of attentive contexts
[c_1^x; ...; c_i^x; ...; c_{|tx|}^x] as the feature map
C^x ∈ R^{d×|tx|} for tx.
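The three matching functions and the attentive context of Equations (1) to (4) can be sketched with NumPy as follows. Feature maps are stored column-wise as in the text (H^x is d × |tx|); the function names and the random toy inputs are assumptions made for illustration, not part of the released code.

```python
import numpy as np

def softmax_rows(E):
    """Row-wise, numerically stable softmax."""
    E = E - E.max(axis=1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=1, keepdims=True)

def match_dot(Hx, Hy):
    """Equation (1): e_{i,j} = (h_i^x)^T h_j^y."""
    return Hx.T @ Hy

def match_bilinear(Hx, Hy, We):
    """Equation (2): e_{i,j} = (h_i^x)^T W_e h_j^y."""
    return Hx.T @ We @ Hy

def match_additive(Hx, Hy, We, Ue, ve):
    """Equation (3): e_{i,j} = v_e^T tanh(W_e h_i^x + U_e h_j^y)."""
    A = (We @ Hx)[:, :, None] + (Ue @ Hy)[:, None, :]  # (d, n_x, n_y)
    return np.einsum("d,dij->ij", ve, np.tanh(A))

def attentive_context(E, Hy):
    """Equation (4): column i of the result is c_i^x = sum_j softmax(e_i)_j h_j^y."""
    return Hy @ softmax_rows(E).T                       # (d, n_x)

rng = np.random.default_rng(0)
d, n_x, n_y = 4, 5, 6
Hx, Hy = rng.standard_normal((d, n_x)), rng.standard_normal((d, n_y))
We, Ue = rng.standard_normal((d, d)), rng.standard_normal((d, d))
ve = rng.standard_normal(d)
E_add = match_additive(Hx, Hy, We, Ue, ve)
Cx = attentive_context(E_add, Hy)   # feature map C^x, one column per position of tx
```

Whichever matching function is used, C^x has the same shape as H^x, which is what later allows the two feature maps to be convolved in parallel.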
Attentive Convolution. After attentive context has been computed, a
position i in the sentence tx has a hidden state h_i^x, the left context
h_{i-1}^x, the right context h_{i+1}^x, and the attentive context c_i^x.
Attentive convolution then generates the higher-level hidden state at
position i:
    h_{i,new}^x = \tanh(W \cdot [h_{i-1}^x, h_i^x, h_{i+1}^x, c_i^x] + b)              (5)
                = \tanh(W_1 \cdot [h_{i-1}^x, h_i^x, h_{i+1}^x] + W_2 \cdot c_i^x + b)  (6)
where W ∈ R^{d×4d} is the concatenation of W_1 ∈ R^{d×3d} and
W_2 ∈ R^{d×d}, and b ∈ R^d.
As Equation (6) shows, Equation (5) can be achieved by summing up the
results of two separate and parallel convolution steps before the
non-linearity. The first is still a standard convolution-without-attention
over feature map H^x by filter width 3 over the window (h_{i-1}^x,
h_i^x, h_{i+1}^x). The second is a convolution on the feature map C^x
(i.e., the attentive context) with filter width 1 (i.e., over each
c_i^x); then we sum up the results element-wise and add a bias term and
the non-linearity. This divide-then-compose strategy makes the attentive
convolution easy to implement in practice, with no need to create a new
feature map, as required in Equation (5), to integrate H^x and C^x.
Role        Text
Premise     Three firefighters come out of subway station
Hypothesis  Three firefighters putting out a fire inside of a subway station
Table 2: Multi-granular alignments required in textual entailment.
It is worth mentioning that W_1 ∈ R^{d×3d} corresponds
to the filter parameters of a vanilla CNN
and the only added parameter here is W_2 ∈ R^{d×d},
which only depends on the hidden size.
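The equivalence of Equations (5) and (6) can be checked numerically. The sketch below is a minimal NumPy illustration; zero padding at the sentence boundaries and the function names are assumptions, and the filter W is simply split column-wise into W_1 (first 3d columns) and W_2 (last d columns).

```python
import numpy as np

def attentive_conv_concat(Hx, Cx, W, b):
    """Equation (5): one filter bank over [h_{i-1}, h_i, h_{i+1}, c_i]."""
    Hp = np.pad(Hx, ((0, 0), (1, 1)))  # zero padding at both ends (assumption)
    X = np.vstack([Hp[:, :-2], Hp[:, 1:-1], Hp[:, 2:], Cx])  # (4d, n)
    return np.tanh(W @ X + b[:, None])

def attentive_conv_split(Hx, Cx, W1, W2, b):
    """Equation (6): width-3 convolution over Hx plus width-1 convolution over Cx."""
    Hp = np.pad(Hx, ((0, 0), (1, 1)))
    local = W1 @ np.vstack([Hp[:, :-2], Hp[:, 1:-1], Hp[:, 2:]])  # (d, n)
    attentive = W2 @ Cx                                           # (d, n)
    return np.tanh(local + attentive + b[:, None])

rng = np.random.default_rng(0)
d, n = 4, 7
Hx, Cx = rng.standard_normal((d, n)), rng.standard_normal((d, n))
W, b = rng.standard_normal((d, 4 * d)), rng.standard_normal(d)
W1, W2 = W[:, :3 * d], W[:, 3 * d:]
out5 = attentive_conv_concat(Hx, Cx, W, b)
out6 = attentive_conv_split(Hx, Cx, W1, W2, b)
same = np.allclose(out5, out6)
```

The split form never materializes the 4d-dimensional stacked feature map, which is the practical benefit the divide-then-compose strategy refers to.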
This light ATTCONV shows the basic principles of using RNN-style
attention mechanisms in convolution. Our experiments show that this
light version of ATTCONV, even though it incurs only a limited increase
of parameters (i.e., W_2), works much better than the vanilla Siamese
CNN and some of the pioneering attentive RNNs. The following two
considerations show that there is room to improve its expressivity.
(i) Higher-level or more abstract representations
are required in subsequent layers. We find
that directly forwarding the hidden states in tx or
ty to the matching process does not work well in
some tasks. Pre-learning some more high-level or
abstract representations helps in subsequent learn-
ing phases.
(ii) Multi-granular alignments are preferred in the interaction
modeling between tx and ty. Table 2 shows another example of textual
entailment. On the unigram level, “out” in the premise matches with
“out” in the hypothesis perfectly, whereas “out” in the premise is
contradictory to “inside” in the hypothesis. But their context snippets,
“come out” in the premise and “putting out a fire” in the hypothesis,
clearly indicate that they are not semantically equivalent. And the gold
conclusion for this pair is “neutral” (i.e., the hypothesis is possibly
true). Therefore, matching should be conducted across phrase
granularities.
We now present advanced ATTCONV. It is more
expressive and modular, based on the two foregoing
considerations (i) and (ii).
3.2 Advanced ATTCONV
Adel and Schütze (2017) distinguish between
focus and source of attention. The focus of atten-
tion is the layer of the network that is reweighted
by attention weights. The source of attention is the
information source that is used to compute the
attention weights. Adel and Schütze showed that
increasing the scope of the attention source is
beneficial. It possesses some preliminary principles of the
query/key/value distinction by Vaswani et al. (2017). Here, we further
extend this principle to define the beneficiary of attention: the
feature map (labeled “beneficiary” in Figure 4(b)) that is
contextualized by the attentive context (labeled “attentive context” in
Figure 4(b)). In the light attentive convolutional layer (Figure 4(a)),
the source of attention is the hidden states in sentence tx, the focus
of attention is the hidden states of the context ty, and the beneficiary
of attention is again the hidden states of tx; that is, it is identical
to the source of attention.
We now try to distinguish these three concepts further to promote the
expressivity of an attentive convolutional layer. We call it “advanced
ATTCONV”; see Figure 4(b). It differs from the light version in three
ways: (i) attention source is learned by function f_mgran(H^x), feature
map H^x of tx acting as input; (ii) attention focus is learned by
function f_mgran(H^y), feature map H^y of context ty acting as input;
and (iii) attention beneficiary is learned by function f_bene(H^x), H^x
acting as input. Both functions f_mgran() and f_bene() are based on a
gated convolutional function f_gconv():
    o_i = \tanh(W_h \cdot i_i + b_h)                          (7)
    g_i = \mathrm{sigmoid}(W_g \cdot i_i + b_g)               (8)
    f_{gconv}(i_i) = g_i \cdot u_i + (1 - g_i) \cdot o_i      (9)
where i_i is a composed representation, denoting a generally defined
input phrase [···, u_i, ···] of arbitrary length with u_i as the central
unigram-level hidden state, and the gate g_i sets a trade-off between
the unigram-level input u_i and the temporary output o_i at the phrase
level. We elaborate on these modules in the remainder of this subsection.
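A minimal NumPy sketch of the gated convolution in Equations (7) to (9) is given below. It assumes that the products with g_i are element-wise (the gate is a d-dimensional vector) and processes all positions at once; the function and variable names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_gconv(I, U, Wh, bh, Wg, bg):
    """Gated convolution applied to every position at once.

    I: (k*d, n) composed phrase inputs i_i (k hidden states stacked per position)
    U: (d, n)   central unigram-level hidden states u_i
    """
    O = np.tanh(Wh @ I + bh[:, None])   # Eq. (7): temporary phrase-level output
    G = sigmoid(Wg @ I + bg[:, None])   # Eq. (8): gate in (0, 1)
    return G * U + (1.0 - G) * O        # Eq. (9): element-wise trade-off (assumption)

rng = np.random.default_rng(0)
d, n = 4, 6
U = rng.standard_normal((d, n))
Wh, Wg = rng.standard_normal((d, d)), rng.standard_normal((d, d))
bh, bg = np.zeros(d), np.zeros(d)
out = f_gconv(U, U, Wh, bh, Wg, bg)     # unigram case: the phrase is just u_i
```

When the gate saturates towards 1, the output reduces to the unigram input u_i, which is the highway-style behaviour the text attributes to the gate.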
Attention Source. First, we present a general instance of generating
source of attention by function f_mgran(H), learning word
representations in multi-granular context. In our system, we consider
granularities 1 and 3, corresponding to unigram hidden state and trigram
hidden state. For the uni-hidden state case, it is a gated convolution
layer:
    h_{uni,i}^x = f_{gconv}(h_i^x)                            (10)
For the tri-hidden state case:
    h_{tri,i}^x = f_{gconv}([h_{i-1}^x, h_i^x, h_{i+1}^x])    (11)
Finally, the overall hidden state at position i is the
concatenation of h_{uni,i}^x and h_{tri,i}^x:
    h_{mgran,i}^x = [h_{uni,i}^x, h_{tri,i}^x]                (12)
That is, f_mgran(H^x) = H_{mgran}^x.
Such a comprehensive hidden state can
encode the semantics of multigranular spans at
a position, such as “out” and “come out of.”
Gating here implicitly enables cross-granular
alignments in the subsequent attention mechanism as
it sets highway connections (Srivastava et al.,
2015) between the input granularity and the output
granularity.
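Equations (10) to (12) can be sketched as follows, reusing the gated convolution of Equations (7) to (9). Zero padding for the trigram windows, separate parameter sets for the two granularities, and all names are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_gconv(I, U, Wh, bh, Wg, bg):
    """Gated convolution, Equations (7)-(9), for all positions at once."""
    O = np.tanh(Wh @ I + bh[:, None])
    G = sigmoid(Wg @ I + bg[:, None])
    return G * U + (1.0 - G) * O

def trigram_windows(H):
    """Stack (h_{i-1}, h_i, h_{i+1}) per position; zero padding at the ends (assumption)."""
    Hp = np.pad(H, ((0, 0), (1, 1)))
    return np.vstack([Hp[:, :-2], Hp[:, 1:-1], Hp[:, 2:]])  # (3d, n)

def f_mgran(H, uni_params, tri_params):
    """Multi-granular hidden states: unigram and trigram views concatenated."""
    H_uni = f_gconv(H, H, *uni_params)                   # Eq. (10)
    H_tri = f_gconv(trigram_windows(H), H, *tri_params)  # Eq. (11)
    return np.vstack([H_uni, H_tri])                     # Eq. (12), shape (2d, n)

rng = np.random.default_rng(0)
d, n = 4, 6
H = rng.standard_normal((d, n))
uni = (rng.standard_normal((d, d)), np.zeros(d), rng.standard_normal((d, d)), np.zeros(d))
tri = (rng.standard_normal((d, 3 * d)), np.zeros(d), rng.standard_normal((d, 3 * d)), np.zeros(d))
H_mgran = f_mgran(H, uni, tri)
```

In both granularities the gate mixes the central unigram state h_i with the phrase-level output, so each half of the concatenated vector can fall back to the unigram view.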
Attention Focus. For simplicity, we use the same architecture for the
attention source (just introduced) and for the attention focus, ty
(i.e., for the attention focus: f_mgran(H^y) = H_{mgran}^y; see
Figure 4(b)). Thus, the focus of attention will participate in the
matching process as well as be reweighted to form an attentive context
vector. We leave exploring different architectures for attention source
and focus for future work.
Another benefit of multi-granular hidden states
in attention focus is to keep structure information in
the context vector. In standard attention mechanisms
in RNNs, all hidden states are average-weighted
as a context vector, and the order information is
lost. By introducing hidden states of larger
granularity into CNNs that keep the local order or
structure, we boost the attentive effect.
Attention Beneficiary. In our system, we simply use f_gconv() over
uni-granularity to learn a more abstract representation over the current
hidden representations in H^x, so that
    f_{bene}(h_i^x) = f_{gconv}(h_i^x)                        (13)
Subsequently, the attentive context vector c_i^x is generated based on
attention source feature map f_mgran(H^x) and attention focus feature
map f_mgran(H^y), according to the description of the light ATTCONV.
Then attentive convolution is conducted over attention beneficiary
feature map f_bene(H^x) and the attentive context vectors C^x to get a
higher-layer feature map for the sentence tx.
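Putting the three modules together, one advanced attentive convolution layer might be sketched as below. This is a reading of the description above, not the released implementation: dot-product matching, zero padding, separate parameters for source and focus, a 2d-dimensional attentive context (because the focus is the concatenated multi-granular map), and all parameter names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def windows(H):
    """Stack (h_{i-1}, h_i, h_{i+1}) per position; zero padding at the ends (assumption)."""
    Hp = np.pad(H, ((0, 0), (1, 1)))
    return np.vstack([Hp[:, :-2], Hp[:, 1:-1], Hp[:, 2:]])

def f_gconv(I, U, p):
    """Gated convolution, Equations (7)-(9)."""
    O = np.tanh(p["Wh"] @ I + p["bh"][:, None])
    G = sigmoid(p["Wg"] @ I + p["bg"][:, None])
    return G * U + (1.0 - G) * O

def f_mgran(H, p_uni, p_tri):
    """Equations (10)-(12): unigram and trigram views, concatenated."""
    return np.vstack([f_gconv(H, H, p_uni), f_gconv(windows(H), H, p_tri)])

def softmax_rows(E):
    E = E - E.max(axis=1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=1, keepdims=True)

def advanced_attconv(Hx, Hy, P):
    S = f_mgran(Hx, P["src_uni"], P["src_tri"])  # attention source, (2d, n_x)
    F = f_mgran(Hy, P["foc_uni"], P["foc_tri"])  # attention focus, (2d, n_y)
    B = f_gconv(Hx, Hx, P["bene"])               # attention beneficiary, Eq. (13), (d, n_x)
    Cx = F @ softmax_rows(S.T @ F).T             # attentive context, Eqs. (1) and (4), (2d, n_x)
    return np.tanh(P["W1"] @ windows(B) + P["W2"] @ Cx + P["b"][:, None])  # Eq. (6)

def gate_params(rng, d_in, d_out):
    """Small random parameters for one gated convolution (illustrative initialization)."""
    return {"Wh": 0.1 * rng.standard_normal((d_out, d_in)), "bh": np.zeros(d_out),
            "Wg": 0.1 * rng.standard_normal((d_out, d_in)), "bg": np.zeros(d_out)}

rng = np.random.default_rng(0)
d, n_x, n_y = 4, 5, 7
P = {"src_uni": gate_params(rng, d, d), "src_tri": gate_params(rng, 3 * d, d),
     "foc_uni": gate_params(rng, d, d), "foc_tri": gate_params(rng, 3 * d, d),
     "bene": gate_params(rng, d, d),
     "W1": 0.1 * rng.standard_normal((d, 3 * d)),
     "W2": 0.1 * rng.standard_normal((d, 2 * d)),
     "b": np.zeros(d)}
Hx, Hy = rng.standard_normal((d, n_x)), rng.standard_normal((d, n_y))
H_new = advanced_attconv(Hx, Hy, P)   # higher-layer feature map for tx
```

The output keeps one column per position of tx, so the layer can be followed by max-pooling exactly like a standard convolution layer.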
3.3 Analysis
Compared with the standard attention mechanism in RNNs, ATTCONV has a
similar matching function and a similar process of computing context
vectors, but differs in three ways. (i) The distinction between
attention source, focus, and beneficiary improves expressivity. (ii) In
CNNs, the surrounding hidden states for a concrete position are
available, so the attention matching is able to encode the left context
as well as the right context. In RNNs, however, we need bidirectional
RNNs to yield both left and right context representations. (iii) As
attentive convolution can be implemented by summing up two separate
convolution steps (Equations 5 and 6), this architecture provides both
attentive representations and representations computed without the use
of attention. This strategy is helpful in practice to use richer
representations for some NLP problems. In contrast, such a clean modular
separation of representations computed with and without attention is
harder to realize in attention-based RNNs.
Prior attention mechanisms explored in CNNs mostly involve attentive
pooling (dos Santos et al., 2016; Yin et al., 2016); namely, the weights
of the post-convolution pooling layer are determined by attention. These
weights come from the matching process between hidden states of two text
pieces. However, a weight value is not informative enough to tell the
relationships between aligned terms. Consider a textual entailment
sentence pair for which we need to determine whether “inside → outside”
holds. The matching degree (take cosine similarity as example) of these
two words is high: for example, ≈ 0.7 in Word2Vec (Mikolov et al., 2013)
and GloVe (Pennington et al., 2014). On the other hand, the matching
score between “inside” and “in” is lower: 0.31 in Word2Vec, 0.46 in
GloVe. Clearly, the higher number 0.7 does not mean that “outside” is
more likely than “in” to be entailed by “inside.” Instead, joint
representations for aligned phrases [h_inside, h_outside],
[h_inside, h_in] are more informative and enable finer-grained reasoning
than a mechanism that can only transmit information downstream by
matching scores. We modify the conventional CNN filters so that “inside”
can make the entailment decision by looking at the representation of the
counterpart term (“outside” or “in”) rather than a matching score.
A more damaging property of attentive pooling
is the following. Even if matching scores could
convey the phrase-level entailment degree to some
extent, matching weights, in fact, are not leveraged
to make the entailment decision directly;
instead, they are used to weight the sum of
the output hidden states of a convolution as the
global sentence representation. In other words,
fine-grained entailment degrees are likely to be
lost in the summation of many vectors. This
illustrates why attentive context vectors participating
in the convolution operation are expected
to be more effective than post-convolution attentive
pooling (more explanations in §4.3, paragraph
“Visualization”).
Intra-context attention and extra-context attention. Figures 4(a) and
4(b) depict the modeling of a sentence tx with its context ty. This is a
common application of attention mechanism in the literature; we call it
extra-context attention. But ATTCONV can also be applied to model a
single text input, that is, intra-context attention. Consider a
sentiment analysis example: “With the 2017 NBA All-Star game in the
books I think we can all agree that this was definitely one to remember.
Not because of the three-point shootout, the dunk contest, or the game
itself but because of the ludicrous trade that occurred after the
festivities.” This example contains informative points at different
locations (“remember” and “ludicrous”); conventional CNNs’ ability to
model nonlocal dependency is limited because of fixed-size filter
widths. In ATTCONV, we can set ty = tx. The attentive context vector
then accumulates all related parts together for a given position. In
other words, our intra-context attentive convolution is able to connect
all related spans together to form a comprehensive decision. This is a
new way to broaden the scope of conventional filter widths: A filter now
covers not only the local window, but also those spans that are related,
but are beyond the scope of the window.
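The intra-context case can be illustrated by reusing the sentence as its own context. The sketch below assumes dot-product matching (Equation 1) and random toy hidden states; it only shows that every position receives attention mass from outside its width-3 window, not how a trained model distributes it.

```python
import numpy as np

def intra_context(Hx):
    """Attentive context with ty = tx: every position attends over the whole sentence."""
    E = Hx.T @ Hx                         # (n, n) dot-product matching scores
    E = E - E.max(axis=1, keepdims=True)  # stabilize the softmax
    A = np.exp(E)
    A = A / A.sum(axis=1, keepdims=True)  # row i: weights of position i over all positions
    return Hx @ A.T, A                    # attentive contexts Cx, shape (d, n)

rng = np.random.default_rng(0)
Hx = rng.standard_normal((8, 40))         # a 40-token sentence with d = 8 (toy values)
Cx, A = intra_context(Hx)

# Attention mass that falls outside the local window (i-1, i, i+1) of each position.
positions = np.arange(40)
far = np.abs(positions[:, None] - positions[None, :]) > 1
nonlocal_mass = (A * far).sum(axis=1)
```

Feeding C^x into the width-1 branch of Equation (6) is what lets a width-3 filter take distant but related positions into account.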
Comparison to Transformer.² The “focus” in ATTCONV corresponds to “key”
and “value” in Transformer; that is, our versions of “key” and “value”
are the same, coming from the context sentence. The “query” in
Transformer corresponds to the “source” and “beneficiary” of ATTCONV;
namely, our model has two perspectives to utilize the context: one acts
as a real query (i.e., “source”) to attend the context, the other (i.e.,
“beneficiary”) takes the attentive context back to improve the learned
representation of itself. If we reduce ATTCONV to unigram convolutional
filters, it is pretty much a single Transformer layer (if we neglect the
positional encoding in Transformer and unify the “query-key-value” and
“source-focus-beneficiary” mechanisms).
² Our “source-focus-beneficiary” mechanism was inspired by Adel and
Schütze (2017). Vaswani et al. (2017) later published the Transformer
model, which has a similar “query-key-value” mechanism.
4 Experiments
We evaluate ATTCONV on sentence modeling in three scenarios: (i)
Zero-context, that is, intra-context; the same input sentence acts as tx
as well as ty; (ii) Single-context, that is, textual entailment:
hypothesis modeling with a single premise as the extra-context; and
(iii) Multiple-context, namely, claim verification: claim modeling with
multiple extra-contexts.
4.1 Common Set-up and Common Baselines
All experiments share a common set-up. The input
is represented using 300-dimensional publicly
available Word2Vec (Mikolov et al., 2013) embeddings;
out-of-vocabulary embeddings are randomly
initialized. The architecture consists of the
following four layers in sequence: embedding,
attentive convolution, max-pooling, and logistic
regression. The context-aware representation of
tx is forwarded to the logistic regression layer.
We use AdaGrad (Duchi et al., 2011) for training.
Embeddings are fine-tuned during training. Hyperparameter
values include: learning rate 0.01, hidden
size 300, batch size 50, filter width 3.
All experiments are designed to explore comparisons
in three aspects: (i) within ATTCONV,
“light” vs. “advanced”; (ii) “attentive convolution”
vs. “attentive pooling”/“attention only”; and (iii)
“attentive convolution” vs. “attentive RNN”.
To this end, we always report “light” and “advanced” ATTCONV performance
and compare against five types of common baselines: (i) w/o context;
(ii) w/o attention; (iii) w/o convolution: Similar to the Transformer’s
principle (Vaswani et al., 2017), we discard the convolution operation
in Equation (5) and forward the addition of the attentive context c_i^x
and the h_i^x into a fully connected layer. To keep enough parameters,
we stack in total four layers so that “w/o convolution” has the same
size of parameters as light-ATTCONV; (iv) with attention: RNNs with
attention and CNNs with attentive pooling; and (v) prior state of the
art, typeset in italics.
Group            System                      acc
w/o attention    Paragraph Vector            58.43
                 Lin et al. Bi-LSTM          61.99
                 Lin et al. CNN              62.05
                 MultichannelCNN (Kim)       64.62
with attention   CNN+internal attention      61.43
                 ABCNN                       61.36
                 APCNN                       61.98
                 Attentive-LSTM              63.11
                 Lin et al. RNN Self-Att.    64.21
ATTCONV          light                       66.75
                 w/o convolution             61.34
                 advanced                    67.36∗
Table 3: System comparison of sentiment analysis on
Yelp. Significant improvements over state of the art are
marked with ∗ (test of equal proportions, p < 0.05).
4.2 Sentence Modeling with Zero-context:
Sentiment Analysis
We evaluate sentiment analysis on a Yelp bench-
mark released by Lin et al. (2017): review-star
pairs in sizes 500K (train), 2,000 (dev), and 2,000
(test). Most text instances in this data set are
long: the 25th, 50th, and 75th percentiles are 46, 81,
and 125 words, respectively. The task is five-way
classification: 1 to 5 stars. The measure is accuracy.
We use this benchmark because the predominance
of long texts lets us evaluate the system perfor-
mance of encoding long-range context, and the
system by Lin et al. is directly related to ATTCONV
in the intra-context scenario.
Baselines. (i) w/o attention. Three baselines
from Lin et al. (2017): Paragraph Vector (Le
and Mikolov, 2014) (unsupervised sentence rep-
resentation learning), BiLSTM, and CNN. We
also reimplement MultichannelCNN (Kim, 2014),
recognized as a simple but surprisingly strong
sentence modeler. (ii) with attention. A vanilla
“Attentive-LSTM” by Rocktäschel et al. (2016).
“RNN Self-Attention” (Lin et al., 2017) is di-
rectly comparable to ATTCONV: it also uses intra-
context attention. “CNN+internal attention” (Adel
and Schütze, 2017), an intra-context attention idea
similar to, but less complicated than, Lin et al.
(2017). ABCNN & APCNN – CNNs with atten-
tive pooling.
Results and Analysis. Table 3 shows that advanced-ATTCONV surpasses its “light” counterpart, and obtains significant improvement over the state of the art.
Figure 5: ATTCONV vs. MultichannelCNN for groups of Yelp text with ascending text lengths. ATTCONV performs more robustly across different lengths of text.
In addition, ATTCONV surpasses attentive pooling (ABCNN & APCNN) by a large margin (>5%) and outperforms the representative attentive-LSTM (>4%).
Furthermore, it outperforms the two self-attentive models: CNN+internal attention (Adel and Schütze, 2017) and RNN Self-Attention (Lin et al., 2017), which are specifically designed for single-sentence modeling. Adel and Schütze (2017) generate an attention weight for each CNN hidden state by a linear transformation of the same hidden state, then compute a weighted average over all hidden states as the text representation. Lin et al. (2017) extend that idea by generating a group of attention weight vectors; the RNN hidden states are then averaged under each of these weight vectors, which extracts different aspects of the text into multiple vector representations. Both works are essentially weighted mean pooling, similar to the attentive pooling in Yin et al. (2016) and dos Santos et al. (2016).
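The "weighted mean pooling" view of self-attention described above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the softmax normalization, the parameter vector `w`, and the function name `weighted_mean_pooling` are assumptions made for the example.

```python
import math
from typing import List, Sequence


def weighted_mean_pooling(hidden: Sequence[Sequence[float]],
                          w: Sequence[float]) -> List[float]:
    """Score each hidden state with a linear map of that same state,
    normalize the scores, and return the weighted average."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden[0])
    return [sum(a * h[k] for a, h in zip(weights, hidden)) for k in range(dim)]


# Toy hidden states (made up for illustration).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = weighted_mean_pooling(hidden_states, [0.0, 0.0])   # plain mean
focused = weighted_mean_pooling(hidden_states, [10.0, 0.0])  # favors states 1 and 3
```

With a zero scoring vector the result reduces to ordinary mean pooling, which makes the point of the paragraph concrete: the attention only reweights the pooling step and never enters the convolution itself.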
Next, we compare ATTCONV with MultichannelCNN, the strongest baseline system (“w/o attention”), for different length ranges to check whether ATTCONV can really encode long-range context effectively. We sort the 2,000 test instances by length, then split them into 10 groups, each consisting of 200 instances. Figure 5 shows the performance of ATTCONV vs. MultichannelCNN. We observe that ATTCONV consistently outperforms MultichannelCNN for all lengths. Furthermore, the improvement over MultichannelCNN generally increases with length. This is evidence that ATTCONV more effectively models long text.
[Figure 5 plot: x-axis, indices of sorted text groups (1–10); y-axis, acc (0.58–0.70); curves: MultichannelCNN, ATTCONV, and 0.6 + diff of two curves.]
        #instances   #entail   #neutral
train   23,596       8,602     14,994
dev     1,304        657       647
test    2,126        842       1,284
total   27,026       10,101    16,925
Table 4: Statistics of SCITAIL data set.
Group            System             acc
w/o attention    Majority Class     60.4
                 w/o Context        65.1
                 Bi-LSTM            69.5
                 NGram model        70.6
                 Bi-CNN             74.4
with attention   Enhanced LSTM      70.6
                 Attentive-LSTM     71.5
                 Decomp-Att         72.3
                 DGEM               77.3
                 APCNN              75.2
                 ABCNN              75.8

                 ATTCONV-light      78.1
                 w/o convolution    75.1
                 ATTCONV-advanced   79.2
Table 5: ATTCONV vs. baselines on SCITAIL.
This is likely because of ATTCONV’s capability to
encode broader context in its filters.
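The length-bucketed evaluation described above (sort the test instances by length, split them into equally sized groups, and score each group) can be sketched as follows. The function name `accuracy_by_length_group`, the whitespace tokenization, and the toy reviews are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple


def accuracy_by_length_group(
    examples: Sequence[Tuple[str, int, int]], num_groups: int = 10
) -> List[float]:
    """Sort (text, gold, predicted) triples by word count, split them into
    equally sized groups, and return the accuracy of each group.
    Assumes len(examples) >= num_groups; leftover examples are dropped."""
    ordered = sorted(examples, key=lambda ex: len(ex[0].split()))
    size = len(ordered) // num_groups
    accuracies = []
    for g in range(num_groups):
        group = ordered[g * size:(g + 1) * size]
        correct = sum(1 for _, gold, pred in group if gold == pred)
        accuracies.append(correct / len(group))
    return accuracies


# Toy data (made up): four reviews, split into two groups of two.
toy = [
    ("good", 5, 5),
    ("bad food", 1, 2),
    ("really nice place overall", 4, 4),
    ("slow service but the food was worth the wait", 3, 3),
]
group_acc = accuracy_by_length_group(toy, num_groups=2)
```

With 2,000 test instances and `num_groups=10`, each group holds 200 instances, matching the setup in the text. Computing the list once per system and subtracting gives the per-group difference plotted in Figure 5.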
4.3 Sentence Modeling with a Single Context:
Textual Entailment
Data Set. SCITAIL (Khot et al., 2018) is a textual entailment benchmark designed specifically for a real-world task: multi-choice question answering. All hypotheses tx were obtained by rephrasing (question, correct answer) pairs into single sentences, and premises ty are relevant Web sentences retrieved by an information retrieval method. Then the task is to determine whether a hypothesis is true or not, given a premise as context. All (tx, ty) pairs are annotated via crowdsourcing. Accuracy is reported. Table 1 shows examples and Table 4 gives statistics.
By this construction, a substantial performance improvement on SCITAIL is equivalent to a better QA performance (Khot et al., 2018). The hypothesis tx is the target sentence, and the premise ty acts as its context.
Baselines. Apart from the common baselines (see Section 4.1), we include systems covered by Khot et al. (2018): (i) n-gram Overlap: An overlap baseline, considering lexical granularity such as unigrams, one-skip bigrams, and one-skip trigrams. (ii) Decomposable Attention Model (Decomp-Att) (Parikh et al., 2016): Explores attention mechanisms to decompose the task into subtasks to solve in parallel. (iii) Enhanced LSTM (Chen et al., 2017b): Enhances LSTM by taking into account syntax and semantics from parsing information. (iv) DGEM (Khot et al., 2018): A decomposed graph entailment model, the current state of the art.
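As a quick sanity check on the simplest common baseline, the Majority Class accuracy reported in Table 5 can be recomputed from the test-split counts in Table 4. The variable names are ours; the counts are copied from Table 4.

```python
# Test-split counts copied from Table 4 (SCITAIL).
test_entail = 842
test_neutral = 1284
test_total = test_entail + test_neutral

# A majority-class predictor always outputs the more frequent label.
majority_accuracy = 100.0 * max(test_entail, test_neutral) / test_total
print(f"{majority_accuracy:.1f}")
```

Always predicting "neutral" on the test split yields the 60.4 listed for Majority Class in Table 5.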
Table 5 presents results on SCITAIL. (i) Within ATTCONV, “advanced” beats “light” by 1.1%; (ii) “w/o convolution” and attentive pooling (i.e., ABCNN & APCNN) get lower performances by 3%–4%; (iii) more complicated attention mechanisms equipped into LSTM (e.g., “attentive-LSTM” and “enhanced-LSTM”) perform even worse.
Error Analysis. To better understand the behavior of ATTCONV on SCITAIL, we study some error cases listed in Table 6.
Language conventions. Pair #1 uses sequential commas (i.e., in “the egg, larva, pupa, and adult”) or a special symbol sequence (i.e., in “egg −> larva −> pupa −> adult”) to form a set or sequence; pair #2 has “A (or B)” to express the equivalence of A and B. This challenge is expected to be handled by DNNs with specific training signals.
Knowledge beyond the text ty. In #3, “because smaller amounts of water evaporate in the cool morning” cannot be inferred from the premise ty directly. The main challenge in #4 is to distinguish “weight” from “force,” which requires background physical knowledge that is beyond the presented text here and beyond the expressivity of word embeddings.
Complex discourse relation. The premise in #5 has an “or” structure. In #6, the inserted phrase “with about 16,000 species” makes the connection between “nonvascular plants” and “the mosses, liverworts, and hornworts” hard to detect. Both instances require the model to decode the discourse relation.
#  G/P  Challenge             (Premise ty, Hypothesis tx) Pair
1  1/0  language conventions  (ty) These insects have 4 life stages, the egg, larva, pupa, and adult.
                              (tx) The sequence egg −> larva −> pupa −> adult shows the life cycle of some insects.
2  1/0  language conventions  (ty) . . . the notochord forms the backbone (or vertebral column).
                              (tx) Backbone is another name for the vertebral column.
3  1/0  beyond text           (ty) Water lawns early in the morning . . . prevent evaporation.
                              (tx) Watering plants and grass in the early morning is a way to conserve water because smaller amounts of water evaporate in the cool morning.
4  0/1  beyond text           (ty) . . . the SI unit . . . for force is the Newton (N) and is defined as (kg·m/s−2 ).
                              (tx) Newton (N) is the SI unit for weight.
5  0/1  discourse relation    (ty) Heterotrophs get energy and carbon from living plants or animals (consumers) or from dead organic matter (decomposers).
                              (tx) Mushrooms get their energy from decomposing dead organisms.
6  1/0  discourse relation    (ty) . . . are a diverse assemblage of three phyla of nonvascular plants, with about 16,000 species, that includes the mosses, liverworts, and hornworts.
                              (tx) Moss is best classified as a nonvascular plant.
Table 6: Error cases of ATTCONV in SCITAIL. “. . . ”: truncated text. “G/P”: gold/predicted label.

ATTCONV on SNLI. Table 7 shows the comparison. We observe that: (i) classifying hypotheses without looking at premises, that is, the “w/o context” baseline, results in a large improvement over the “majority baseline.” This verifies the strong bias in the hypothesis construction of the SNLI data set (Gururangan et al., 2018; Poliak et al., 2018). (ii) ATTCONV (advanced) surpasses all “w/o attention” baselines and “with attention” CNN baselines (i.e., attentive pooling), obtaining a performance (87.8%) that is close to the state of the art (88.7%).
We also report the parameter size in SNLI as most baseline systems did. Table 7 shows that, in comparison to these baselines, our ATTCONV (light and advanced) has a more limited number of parameters, yet its performance is competitive.

Group            System                                  #para   acc
w/o attention    majority class                          0       34.3
                 w/o context (i.e., hypothesis only)     270K    68.7
                 Bi-LSTM (Bowman et al., 2015)           220K    77.6
                 Bi-CNN                                  270K    80.3
                 Tree-CNN (Mou et al., 2016)             3.5M    82.1
                 NES (Munkhdalai and Yu, 2017)           6.3M    84.8
with attention   Attentive-LSTM (Rocktäschel)            250K    83.5
                 Self-Attentive (Lin et al., 2017)       95M     84.4
                 Match-LSTM (Wang and Jiang)             1.9M    86.1
                 LSTMN (Cheng et al., 2016)              3.4M    86.3
                 Decomp-Att (Parikh)                     580K    86.8
                 Enhanced LSTM (Chen et al., 2017b)      7.7M    88.6
                 ABCNN (Yin et al., 2016)                834K    83.7
                 APCNN (dos Santos et al., 2016)         360K    83.9

                 ATTCONV – light                         360K    86.3
                 w/o convolution                         360K    84.9
                 ATTCONV – advanced                      900K    87.8
                 State-of-the-art (Peters et al., 2018)  8M      88.7
Table 7: Performance comparison on SNLI test. Ensemble systems are not included.

Visualization. In Figure 6, we visualize the attention mechanisms explored in attentive convolution (Figure 6(a)) and attentive pooling (Figure 6(b)).
Figure 6(a) explores the visualization of two kinds of features learned by light ATTCONV in the SNLI data set (most are short sentences with rich phrase-level reasoning): (i) e_{i,j} in Equation (1) (after softmax), which shows the attention distribution over context ty by the hidden state h^x_i in sentence tx; (ii) h^x_{i,new} in Equation (5) for i = 1, 2, ··· , |tx|; it shows the context-aware word features in tx. By the two visualized features, we can identify which parts of the context ty are more important for a word in sentence tx, and a max-pooling, over those context-driven word representations, selects and forwards dominant (word, leftcontext, rightcontext, attcontext) combinations to the final decision maker.
(a) Visualization for features generated by ATTCONV’s filters on sentence tx and ty. A max-pooling, over filter_1, locates the phrase (A, dog, jumping, c^x_dog), and locates the phrase (a, Frisbee, in, c^x_Frisbee) via filter_2. “c^x_dog” (resp. c^x_Fris.), the attentive context of “dog” (resp. “Frisbee”) in tx, mainly comes from “animal” (resp. “toy” and “playing”) in ty.
(b) Attention visualization for attentive pooling (ABCNN). Based on the words in tx and ty, first, a convolution layer with filter width 3 outputs hidden states for each sentence, then each hidden state obtains an attention weight for how well it matches all the hidden states in the other sentence, and finally all hidden states in each sentence are weighted and summed up as the sentence representation. This visualization shows that the spans “dog jumping for” and “in the snow” in tx and the spans “animal is outside” and “in the cold” in ty are most indicative to the entailment reasoning.
Figure 6: Attention visualization for attentive convolution (top) and attentive pooling (bottom) between sentence tx = “A dog jumping for a Frisbee in the snow” (left) and sentence ty = “An animal is outside in the cold weather, playing with a plastic toy” (right).

Figure 6(a) shows the features3 of sentence tx = “A dog jumping for a Frisbee in the snow” conditioned on the context ty = “An animal is outside in the cold weather, playing with a plastic toy.” Observations: (i) The right figure shows that the attention mechanism successfully aligns some cross-sentence phrases that are informative to the textual entailment problem, such as “dog” to “animal” (i.e., c^x_dog ≈ “animal”), “Frisbee” to “plastic toy” and “playing” (i.e., c^x_Frisbee ≈ “plastic toy” + “playing”); (ii) The left figure shows a max-pooling over the generated features of filter_1 and filter_2 will focus on the context-aware phrases (A, dog, jumping, c^x_dog) and (a, Frisbee, in, c^x_Frisbee), respectively; the two phrases are crucial to the entailment reasoning for this (ty, tx) pair.
3For simplicity, we show 2 out of 300 ATTCONV filters.
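The two quantities discussed for Figure 6(a), namely the attention distribution of a tx word over ty and the max-pooled filter response over (left, word, right, attentive context) windows, can be sketched as follows. This is an illustrative simplification rather than the paper's exact equations: the dot-product matching function, the tanh activation, zero padding, and all names (`attentive_context`, `filter_with_max_pooling`, the toy vectors) are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_context(word: Vector, context: Sequence[Vector]) -> Tuple[Vector, List[float]]:
    """Attention weights of one tx word over ty, and the weighted sum of ty states."""
    weights = softmax([dot(word, h_y) for h_y in context])
    summary = [sum(w * h_y[k] for w, h_y in zip(weights, context))
               for k in range(len(context[0]))]
    return summary, weights


def filter_with_max_pooling(h_x, h_y, filt, bias=0.0):
    """Apply one filter to every (left, word, right, attcontext) window, then max-pool."""
    pad = [0.0] * len(h_x[0])
    padded = [pad] + [list(h) for h in h_x] + [pad]
    features = []
    for i in range(1, len(padded) - 1):
        att, _ = attentive_context(padded[i], h_y)
        window = padded[i - 1] + padded[i] + padded[i + 1] + att
        features.append(math.tanh(dot(filt, window) + bias))
    return max(features), features


# Toy 2-d hidden states (made up): the first tx word matches the first ty word.
h_x = [[1.0, 0.0], [0.0, 1.0]]
h_y = [[5.0, 0.0], [0.0, 5.0]]
context_vec, weights = attentive_context(h_x[0], h_y)
pooled, features = filter_with_max_pooling(h_x, h_y, filt=[0.1] * 8)
```

In the toy example nearly all attention mass of the first tx word falls on the first ty word, which mirrors the “dog” to “animal” alignment described in the text. The position of the maximal feature identifies the window that a filter forwards to the classifier.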
Figure 6(b) shows the phrase-level (i.e., each consecutive trigram) attentions after the convolution operation. As Figure 3 shows, a subsequent pooling step will weight and sum up those phrase-level hidden states as an overall sentence representation. So, even though some phrases such as “in the snow” in tx and “in the cold” in ty show importance in this pair instance, the final sentence representation still (i) lacks a fine-grained phrase-to-phrase reasoning, and (ii) underestimates some indicative phrases such as “A dog” in tx and “An animal” in ty.
In brief, attentive convolution first performs phrase-to-phrase, inter-sentence reasoning, then composes features; attentive pooling composes phrase features as sentence representations, then performs reasoning. Intuitively, attentive convolution better fits the way humans conduct entailment reasoning, and our experiments validate its superiority: it is the hidden states of the aligned phrases rather than their matching scores that support better representation learning and decision-making.
The comparisons in both SCITAIL and SNLI show:
• CNNs with attentive convolution (i.e., ATTCONV) outperform the CNNs with attentive pooling (i.e., ABCNN and APCNN);
• Some competitors got over-tuned on SNLI while demonstrating mediocre performance in SCITAIL, a real-world NLP task. Our system ATTCONV shows its robustness in both benchmark data sets.
4.4 Sentence Modeling with Multiple Contexts:
Claim Verification
Data Set. For this task, we use FEVER (Thorne et al., 2018); it infers the truthfulness of claims by extracted evidence. The claims in FEVER were manually constructed from the introductory sections of about 50K popular Wikipedia articles in the June 2017 dump. Claims have 9.4 tokens on average. Table 8 lists the claim statistics.

        #SUPPORTED   #REFUTED   #NEI
train   80,035       29,775     35,639
dev     3,333        3,333      3,333
test    3,333        3,333      3,333
Table 8: Statistics of claims in the FEVER data set.
In addition to claims, FEVER also provides a Wikipedia corpus of approximately 5.4 million articles, from which gold evidence is gathered and provided. Figure 7 shows the distribution of sentence sizes in FEVER’s ground truth evidence set (i.e., the context size in our experimental set-up).
We can see that roughly 28% of evidence instances
cover more than one sentence and roughly 16%
cover more than two sentences.
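The two percentages above can be checked against the bar labels of Figure 7. In the sketch below the values are the labels recoverable from the figure; assigning 71.88 to single-sentence evidence and 1.85 to the ">10" bin is our reading of the plot, not something the text states explicitly.

```python
# Percentages read off Figure 7, keyed by number of evidence sentences.
# Key 11 stands for the ">10" bin; the bin assignment is an assumption.
evidence_share = {
    1: 71.88, 2: 12.13, 3: 4.07, 4: 2.85, 5: 2.90, 6: 1.76,
    7: 0.98, 8: 0.68, 9: 0.49, 10: 0.40, 11: 1.85,
}

more_than_one = sum(v for k, v in evidence_share.items() if k > 1)
more_than_two = sum(v for k, v in evidence_share.items() if k > 2)
```

Under this reading, the shares round to the "roughly 28%" and "roughly 16%" quoted in the text.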
Each claim is labeled as SUPPORTED, REFUTED, or NOTENOUGHINFO (NEI) given the gold evidence. The standard FEVER task also explores the performance of evidence extraction, evaluated by F1 between extracted evidence and gold evidence. This work focuses on the claim entailment part, assuming the evidence is provided (extracted or gold). More specifically, we treat a claim as tx, and its evidence sentences as context ty.
Figure 7: Distribution of #sentence in FEVER evidence.
This task has two evaluations: (i) ALL: accuracy of claim verification regardless of the validness of evidence; (ii) SUBSET: verification accuracy of a subset of claims, in which the gold evidence for SUPPORTED and REFUTED claims must be fully retrieved. We use the official evaluation toolkit.4
Set-ups. (i) We adopt the same retrieved evidence set (i.e., contexts ty) as Thorne et al. (2018): top-5 most relevant sentences from top-5 retrieved wiki pages by a document retriever (Chen et al., 2017a). The quality of this evidence set against the ground truth is: 44.22 (recall), 10.44 (precision), 16.89 (F1) on dev, and 45.89 (recall), 10.79 (precision), 17.47 (F1) on test. This set-up challenges our system with potentially unrelated or even misleading context. (ii) We use the ground truth evidence as context. This lets us determine how far our ATTCONV can go for this claim verification problem once the accurate evidence is given.
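The reported F1 values follow from the quoted precision and recall via the usual harmonic mean, F1 = 2PR / (P + R). A small check, with the helper name `f1_score` chosen here for illustration:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall of the retrieved evidence set, as quoted in the text.
dev_f1 = f1_score(precision=10.44, recall=44.22)
test_f1 = f1_score(precision=10.79, recall=45.89)
```

Rounded to two decimals, the results agree with the 16.89 (dev) and 17.47 (test) given above. The low precision shows how much unrelated context the retrieved-evidence set-up feeds to the model.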
Baselines. We first include the two systems explored by Thorne et al. (2018): (i) MLP: A multi-layer perceptron baseline with a single hidden layer, based on tf-idf cosine similarity between the claim and the evidence (Riedel et al., 2017); (ii) Decomp-Att (Parikh et al., 2016): A decomposable attention model that was tested on SCITAIL and SNLI above. Note that both baselines first relied on an information retrieval system to extract the top-5 relevant sentences from the retrieved top-5 wiki pages as evidence for claims, then concatenated all evidence sentences as a longer context for a claim.
4https://github.com/sheffieldnlp/fever-scorer.
[Figure 7 plot: x-axis, #context for each claim (1–10, >10); y-axis, % of claims; bar labels: 71.88, 12.13, 4.07, 2.85, 2.90, 1.76, 0.98, 0.68, 0.49, 0.40, 1.85.]
                                         retrie. evi.      gold evi.
       System                            ALL      SUB
dev    MLP                               41.86    19.04    65.13
       Bi-CNN                            47.82    26.99    75.02
       APCNN                             50.75    30.24    78.91
       ABCNN                             51.39    32.44    77.13
       Attentive-LSTM                    52.47    33.19    78.44
       Decomp-Att                        52.09    32.57    80.82
       ATTCONV light, context-wise       57.78    34.29    83.20
               w/o conv.                 47.29    25.94    73.18
       ATTCONV light, context-conc       59.31    37.75    84.74
               w/o conv.                 48.02    26.67    73.44
       ATTCONV advan., context-wise      60.20    37.94    84.99
       ATTCONV advan., context-conc      62.26    39.44    86.02
test   (Thorne et al., 2018)             50.91    31.87    –
       ATTCONV                           61.03    38.77    84.61
Table 9: Performance on dev and test of FEVER. In the “gold evi.” scenario, ALL and SUBSET are the same.
We then consider two variants of our ATTCONV in dealing with modeling of tx with variable-size context ty. (i) Context-wise: we first use all evidence sentences one by one as context ty to guide the representation learning of the claim tx, generating a group of context-aware representation vectors for the claim, then we do element-wise max-pooling over this vector group as the final representation of the claim. (ii) Context-conc: concatenate all evidence sentences as a single piece of context, then model the claim based on this context. This is the same preprocessing step as Thorne et al. (2018) did.
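The difference between the context-wise and context-conc variants can be sketched as follows. The encoder below is a deliberately trivial, hypothetical stand-in for ATTCONV (two hand-made features), and the toy claim and evidence sentences are made up; only the pooling and concatenation logic reflects the description above.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[str], Sequence[str]], Vector]


def toy_encoder(claim: Sequence[str], context: Sequence[str]) -> Vector:
    """Hypothetical stand-in for ATTCONV: word overlap and inverse context length."""
    overlap = len(set(claim) & set(context))
    return [overlap / max(len(claim), 1), 1.0 / max(len(context), 1)]


def context_wise(claim, evidence, encode: Encoder) -> Vector:
    """Encode the claim once per evidence sentence, then element-wise max-pool."""
    vectors = [encode(claim, sentence) for sentence in evidence]
    return [max(column) for column in zip(*vectors)]


def context_conc(claim, evidence, encode: Encoder) -> Vector:
    """Concatenate all evidence sentences into one context, then encode once."""
    merged = [token for sentence in evidence for token in sentence]
    return encode(claim, merged)


claim = "corsica belongs to italy".split()
evidence = [
    "corsica is an island".split(),
    "it belongs to france".split(),
]
wise = context_wise(claim, evidence, toy_encoder)
conc = context_conc(claim, evidence, toy_encoder)
```

Context-wise keeps each evidence sentence separate and merges only at the representation level, whereas context-conc lets a single attention pass range over all evidence tokens at once.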
Results. Table 9 compares our ATTCONV in different set-ups against the baselines. First, ATTCONV surpasses the top competitor “Decomp-Att,” reported in Thorne et al. (2018), with big margins in dev (ALL: 62.26 vs. 52.09) and test (ALL: 61.03 vs. 50.91). In addition, “advanced-ATTCONV” consistently outperforms its “light” counterpart. Moreover, ATTCONV surpasses attentive pooling (i.e., ABCNN & APCNN) and “attentive-LSTM” by >10% in ALL, >6% in SUB and >8% in “gold evi.”
Figure 8 further explores the fine-grained performance of ATTCONV for different sizes of gold evidence (i.e., different sizes of context ty). The system shows comparable performances for sizes 1 and 2. Even for context sizes larger than 5, it only drops by 5%.
Figure 8: Fine-grained ATTCONV performance given variable-size golden FEVER evidence as claim’s context.
These experiments on claim verification clearly show the effectiveness of ATTCONV in sentence modeling with variable-size context. This should be attributed to the attention mechanism in ATTCONV, which enables a word or a phrase in the claim tx to “see” and accumulate all related clues even if those clues are scattered across multiple contexts ty.
Error Analysis. We do error analysis for the “retrieved evidence” scenario.
Error case #1 is due to the failure of fully retrieving all evidence. For example, a successful support of the claim “Weekly Idol has a host born in the year 1978” requires the information composition from three evidence sentences, two from the wiki article “Weekly Idol,” and one from “Jeong Hyeong-don.” However, only one of them is retrieved in the top-5 candidates. Our system predicts REFUTED. This error is more common in instances for which no evidence is retrieved.
Error case #2 is due to the insufficiency of representation learning. Consider the wrong claim “Corsica belongs to Italy” (i.e., in the REFUTED class). Even though good evidence is retrieved, the system is misled by noise evidence: “It is located . . . west of the Italian Peninsula, with the nearest land mass being the Italian island . . . ”.
Error case #3 is due to the lack of advanced data preprocessing. For a human, it is very easy to “refute” the claim “Telemundo is an English-language television network” by the evidence “Telemundo is an American Spanish-language terrestrial television . . . ” (from the “Telemundo” wikipage), by checking the keyphrases: “Spanish-language” vs. “English-language.” Unfortunately, both tokens are unknown words in our system; as a result, they do not have informative embeddings. A more careful data preprocessing is expected to help.
[Figure 8 plot: x-axis, gold #context for each claim (1–5, >5); y-axis, acc (%), 81.5–85.0.]
5 Summary
We presented ATTCONV, the first work that enables CNNs to acquire the attention mechanism commonly used in RNNs. ATTCONV combines the strengths of CNNs with the strengths of the RNN attention mechanism. On the one hand, it makes broad and rich context available for prediction, either context from external inputs (extra-context) or internal inputs (intra-context). On the other hand, it can take full advantage of the strengths of convolution: It is more order-sensitive than attention in RNNs and local-context information can be powerfully and efficiently modeled through convolution filters. Our experiments demonstrate the effectiveness and flexibility of ATTCONV when modeling sentences with variable-size context.
Acknowledgments
We gratefully acknowledge funding for this work by the European Research Council (ERC #740516). We would like to thank the anonymous reviewers for their helpful comments.
References
Heike Adel and Hinrich Schütze. 2017. Exploring different dimensions of attention for uncertainty detection. In Proceedings of EACL, pages 22–34, Valencia, Spain.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, San Diego, USA.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642, Lisbon, Portugal.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In Proceedings of ACL, pages 1870–1879, Vancouver, Canada.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Enhanced LSTM for natural language inference. In Proceedings of ACL, pages 1657–1668, Vancouver, Canada.
Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of EMNLP, pages 551–561, Austin, USA.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Synthesis Lectures on Human Language Technologies. Morgan & Claypool.
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of ICML, pages 1243–1252, Sydney, Australia.
Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.
Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. CoRR, abs/1410.5401.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of NAACL-HLT, pages 107–112, New Orleans, USA.
Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of NIPS, pages 1693–1701, Montreal, Canada.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL, pages 655–665, Baltimore, USA.
Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTaiL: A textual entailment dataset from science question answering. In Proceedings of AAAI, pages 5189–5197, New Orleans, USA.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pages 1746–1751, Doha, Qatar.
Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In Proceedings of ICLR, Toulon, France.
Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In Proceedings of ICML, pages 1378–1387, New York City, USA.
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML, pages 1188–1196, Beijing, China.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, pages 1106–1115, Beijing, China.
Jindrich Libovický and Jindrich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of ACL, pages 196–202, Vancouver, Canada.
Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In Proceedings of ICLR, Toulon, France.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pages 1412–1421, Lisbon, Portugal.
Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of ICML, pages 1727–1736, New York City, USA.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119, Lake Tahoe, USA.
Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of ACL, pages 130–136, Berlin, Germany.
Tsendsuren Munkhdalai and Hong Yu. 2017. Neural semantic encoders. In Proceedings of EACL, pages 397–407, Valencia, Spain.
Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of CoNLL, pages 280–290, Berlin, Germany.
Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of EMNLP, pages 2249–2255, Austin, USA.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543, Doha, Qatar.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, New Orleans, USA.
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of *SEM, pages 180–191, New Orleans, USA.
Benjamin Riedel, Isabelle Augenstein, Georgios P. Spithourakis, and Sebastian Riedel. 2017. A simple but tough-to-beat baseline for the fake news challenge stance detection task. CoRR, abs/1707.03264.
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proceedings of ICLR, San Juan, Puerto Rico.
Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. CoRR, abs/1602.03609.
Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR, Toulon, France.
Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of ACL, pages 1577–1586, Beijing, China.
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Proceedings of NIPS, pages 2377–2385, Montreal, Canada.
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of NAACL-HLT, pages 809–819, New Orleans, USA.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 6000–6010, Long Beach, USA.
Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of NAACL-HLT, pages 1442–1451, San Diego, USA.
Shuohang Wang and Jing Jiang. 2017. Machine comprehension using match-LSTM and answer pointer. In Proceedings of ICLR, Toulon, France.
Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017a. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, pages 189–198, Vancouver, Canada.
Zhiguo Wang, Wael Hamza, and Radu Florian. 2017b. Bilateral multi-perspective matching for natural language sentences. In Proceedings of IJCAI, pages 4144–4150, Melbourne, Australia.
Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proceedings of ICML, pages 2397–2406, New York City, USA.
Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. In Proceedings of ICLR, Toulon, France.
Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. TACL, 4:259–272.