Learning Neural Sequence-to-Sequence Models from
Weak Feedback with Bipolar Ramp Loss
Laura Jehl∗
Carolin Lawrence∗
Computational Linguistics
Heidelberg University
69120 Heidelberg, Germany
{jehl, lawrence}@cl.uni-heidelberg.de
Stefan Riezler
Computational Linguistics & IWR
Heidelberg University
69120 Heidelberg, Germany
riezler@cl.uni-heidelberg.de

∗Both authors contributed equally to this publication.
Abstract
In many machine learning scenarios, supervi-
sion by gold labels is not available and conse-
quently neural models cannot be trained directly
by maximum likelihood estimation. In a weak
supervision scenario, metric-augmented objec-
tives can be employed to assign feedback to
model outputs, which can be used to extract a
supervision signal for training. We present sev-
eral objectives for two separate weakly super-
vised tasks, machine translation and semantic
parsing. We show that objectives should ac-
tively discourage negative outputs in addition
to promoting a surrogate gold structure. This
notion of bipolarity is naturally present in
ramp loss objectives, which we adapt to neu-
ral models. We show that bipolar ramp loss
objectives outperform other non-bipolar ramp
loss objectives and minimum risk training on
both weakly supervised tasks, as well as on a
supervised machine translation task. Addition-
ally, we introduce a novel token-level ramp
loss objective, which is able to outperform
even the best sequence-level ramp loss on both
weakly supervised tasks.
1 Introduction
Sequence-to-sequence neural models are stan-
dardly trained using a maximum likelihood esti-
mation (MLE) objective. However, MLE training
requires full supervision by gold target structures,
which in many scenarios are too difficult or
expensive to obtain. For example, in semantic
parsing for question-answering it is often easier
to collect gold answers rather than gold parses
(Clarke et al., 2010; Berant et al., 2013; Pasupat
and Liang, 2015; Rajpurkar et al., 2016, inter alia).
In machine translation, there are many domains
for which no gold references exist, although cross-
lingual document-level links are present for many
multilingual data collections.
In this paper we investigate methods where a
supervision signal for output structures can be
extracted from weak feedback. In the following,
we use learning from weak feedback, or weakly
supervised learning, to refer to a scenario where
output structures generated by the model are
judged according to an external metric, and this
feedback is used to extract a supervision signal that
guides the learning process. Metric-augmented
sequence-level objectives from reinforcement
learning (Williams, 1992; Ranzato et al., 2016),
minimum risk training (MRT) (Smith and Eisner,
2006; Shen et al., 2016) or margin-based struc-
tured prediction objectives (Taskar et al., 2005;
Edunov et al., 2018) can be seen as instances of
such algorithms.
In natural language processing applications,
such algorithms have mostly been used in com-
bination with full supervision tasks, allowing to
compute a feedback score from metrics such as
BLEU or F-score that measure the similarity
of output structures against gold structures. Our
main interest is in weak supervision tasks where
the calculation of a feedback score cannot fall
back onto gold structures. For example, matching
proposed answers to a gold answer can guide a
semantic parser towards correct parses, and matching proposed translations against linked documents can guide learning in machine translation.
In such scenarios the judgments by the external
metric may be unreliable and thus unable to select
a good update direction. It is our intuition that
a more reliable signal can be produced by not
just encouraging outputs that are good according
to weak positive feedback, but also by actively
discouraging bad structures. In this way, a system
can more effectively learn what distinguishes good
outputs from bad ones. We call an objective that
incorporates this idea a bipolar objective. The bi-
polar idea is naturally captured by the structured
ramp loss objective (Chapelle et al., 2009), espe-
cially in the formulation by Gimpel and Smith
(2012) and Chiang (2012), who use ramp loss to
separate a hope from a fear output in a linear
structured prediction model. We employ several
ramp loss objectives for two weak supervision
tasks, and adapt them to neural models.
First, we turn to the task of semantic parsing
in a setup where only question-answer pairs, but
no gold semantic parses, are given. We assume
a baseline system has been trained using a small
supervised data set of question-parse pairs under
the MLE objective. The goal is to improve this
system by leveraging a larger data set of question-
answer pairs. During learning, the semantic parser
suggests parses for which corresponding answers
are retrieved. These answers are then compared to
the gold answer and the resulting weak supervision
signal guides the semantic parser towards finding
correct parses. We can show that a bipolar ramp
loss objective can improve upon the baseline by
over 12 percentage points in F1 score.
Second, we use ramp losses on a machine
translation task where only weak supervision in
the form of cross-lingual document-level links
is available. We assume a translation system
has been trained using MLE on out-of-domain
data. We then investigate whether document-
level links can be used as a weak supervision
signal to adapt the translation system to the target
domain. We formulate ramp loss objectives that
incorporate bipolar supervision from relevant and
irrelevant documents. We also present a metric
that allows us to include bipolar supervision in
an MRT objective. Experiments show that bipolar
supervision is crucial for obtaining gains over the
baseline. Even with this very weak supervision,
we are able to achieve an improvement of over 0.4%
BLEU over the baseline using a bipolar ramp loss.
Finally, we turn to a fully supervised machine
translation task. MLE training in a fully supervised scenario has also been associated with two issues. First, it can cause
exposure bias (Ranzato et al., 2016) because
during training the model receives its context
from the gold structures of the training data, but
at test time the context is drawn from the model
distribution instead. Second, the MLE objective
is agnostic to the final evaluation metric, causing
a loss-evaluation mismatch (Wiseman and Rush,
2016). Our experiments use a similar setup as
Edunov et al. (2018), who apply structured pre-
diction losses to two fully supervised sequence-
to-sequence tasks, but do not consider structured
ramp loss objectives. Like our predecessors, we
want to understand whether training a pre-trained
machine translation model further with a metric-
informed sequence-level objective will improve
translation performance by alleviating the above-
mentioned issues. Gauging the potential of bipolar ramp loss in a full supervision scenario, we again achieve the best results with this objective, improving the baseline by over 0.4% BLEU.
In sum, we show that bipolar ramp loss is
superior to other sequence-level objectives for all
investigated tasks, supporting our intuition that a
bipolar approach is crucial where strong positive
supervision is not available. In addition to adapting the ramp loss objective to weak supervision, we also adapt it to operate at the token level, which makes it particularly
suitable for neural models as they produce their
outputs token by token. A token-level objective
also better emulates the behavior of the ramp loss
for linear models, which only update the weights of
features that differ between hope and fear. Finally,
the token-level objective allows us to capture
token-level errors in a setup where MLE training
is not available. Using this objective, we obtain
additional gains on top of the sequence-level ramp
loss for weakly supervised tasks.
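The formal definition of the token-level ramp loss appears later in the paper (not in this excerpt); the following is only a hedged Python illustration of the idea just described, in which positions where hope and fear agree contribute nothing to the update. All names are ours, not the paper's.

def token_ramp_targets(hope, fear, pad="<pad>"):
    """One plausible reading of the token-level ramp loss idea (RAMP-T):
    compare hope and fear position by position and keep only positions
    where they differ, so that the hope token there is encouraged and the
    fear token discouraged. This mirrors linear ramp loss, which only
    updates features that differ between hope and fear. Alignment and
    padding details are simplified."""
    length = max(len(hope), len(fear))
    h = hope + [pad] * (length - len(hope))
    f = fear + [pad] * (length - len(fear))
    return [(j, hj, fj) for j, (hj, fj) in enumerate(zip(h, f)) if hj != fj]

# Toy usage: only positions 1 and 3 would receive updates.
hope = ["the", "cat", "sat", "down"]
fear = ["the", "dog", "sat", "up"]
print(token_ramp_targets(hope, fear))  # [(1, 'cat', 'dog'), (3, 'down', 'up')]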
2 Related Work
Training neural models with metric-augmented
objectives has been explored for various NLP tasks
in supervised and weakly supervised scenarios.
MRT for neural models has previously been
used for machine translation (Shen et al., 2016)
and semantic parsing (Liang et al., 2017; Guu
et al., 2017).1

1Note that Liang et al. (2017) refer to their objective as an instantiation of REINFORCE; however, they average over several outputs for one input, so the objective more accurately falls under the heading of MRT.
Other objectives based on classical structured prediction losses have been used for both
machine translation and summarization (Edunov
et al., 2018), as well as semantic parsing (Iyyer
et al., 2017; Misra et al., 2018). Objectives inspired
by REINFORCE have, for example, been ap-
plied to machine translation (Ranzato et al., 2016;
Norouzi et al., 2016), semantic parsing (Liang
et al., 2017; Mou et al., 2017; Guu et al., 2017),
and reading comprehension (Choi et al., 2017;
Yang et al., 2017).2
Misra et al. (2018) are the first to compare several objectives for neural semantic parsing, finding that objectives employing structured prediction losses perform best. Edunov et al. (2018) compare different classical structured prediction objectives, including MRT, on a fully supervised machine translation
task. They find MRT to perform best. However,
they only obtain larger gains by interpolating
MRT with the MLE loss. Neither Misra et al.
(2018) nor Edunov et al. (2018) investigate objec-
tives that correspond to the bipolar ramp loss that
is central in our work.
The ramp loss objective (Chapelle et al., 2009) has been applied to supervised phrase-based machine translation (Gimpel and Smith, 2012; Chiang, 2012). We adapt these objectives to neural models, extend them to incorporate bipolar weak supervision, and introduce a novel token-level ramp loss objective.
3 Neural Sequence-to-Sequence Learning

Our neural sequence-to-sequence models utilize an encoder-decoder setup (Cho et al., 2014; Sutskever et al., 2014) with an attention mechanism (Bahdanau et al., 2015). Specifically, we employ the framework NEMATUS (Sennrich et al., 2017). Given an input sequence x = x_1, x_2, . . . , x_|x|, the probability that a model with parameters w assigns to an output sequence y = y_1, y_2, . . . , y_|y| is given by

π_w(y|x) = ∏_{j=1}^{|y|} π_w(y_j | y_{<j}, x).
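To make the factorization concrete, the following is a minimal sketch in plain Python (not NEMATUS code) of how per-position distributions combine into a sequence probability; step_logprobs stands in for the decoder's softmax outputs and is an illustrative name.

import math

def sequence_logprob(step_logprobs, y):
    """Log-probability of an output sequence y under an autoregressive
    model: log pi_w(y|x) = sum_j log pi_w(y_j | y_<j, x).

    step_logprobs: one dict per position j, mapping candidate tokens to
        log pi_w(y_j | y_<j, x); in a real system these come from the
        decoder at each step, conditioned on x and the previous tokens.
    """
    return sum(step_logprobs[j][tok] for j, tok in enumerate(y))

# Toy usage: a two-token output over a three-word vocabulary.
steps = [
    {"a": math.log(0.7), "b": math.log(0.2), "</s>": math.log(0.1)},
    {"a": math.log(0.1), "b": math.log(0.3), "</s>": math.log(0.6)},
]
print(sequence_logprob(steps, ["a", "</s>"]))  # log(0.7 * 0.6)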
we also see the biggest gap between MLE and
RAMP−-T of 0.52 and 0.67% BLEU for the two
runs. This increase is mitigated by much weaker
increases in the bottom brackets.

[Figure 4: Improvements in BLEU scores by Wikipedia article for the RAMP−-T runs.]

A possible explanation for the weaker performance of MLE in the top bracket is the observation that hypotheses produced by the MLE system are longer than for RAMP−-T. For the top bracket, hypothesis lengths exceed reference lengths for all systems. However, for MLE this over-generation is more severe, at 106% of the reference length compared with 102% for RAMP−-T, potentially causing a higher loss in precision.
As our test set consists of parallel sentences
extracted from four Wikipedia articles, we can
examine the performance for each article separately. Figure 3 shows the results. We observe large differences in performance according to article ID. These are probably caused by some articles
being more similar to the out-of-domain training
data than others. Comparing RAMP−-T and MLE,
we see that RAMP−-T outperforms MLE for each
article by a small margin. Figure 4 shows the
size of the improvements by article. We observe
that margins are bigger on articles with better
baseline performance. This suggests that there are
challenges arising from domain mismatch that
are not addressed by our method.
Lastly, we present an examination of example
outputs. Table 6 shows an example of a long
sentence from Article 2, which describes the
German town of Schüttorf. This article is originally in German, meaning that our model is
Source: Towards the end of the 19th century, a strong textile industry was developing itself in Schüttorf with several large local businesses (Schlikker & Söhne, Gathmann & Gerdemann, G. Schümer & Co. and ten Wolde, later Carl Remy; today's RoFa is not one of the original textile companies, but was founded by H. Lammering and later taken over by Gerhard Schlikker jun., Levert Rost and Wilhelm Edel;

MLE: Ende des 19. Jahrhunderts, eine starke Textilindustrie, die sich in Ettorf mit mehreren großen lokalen Unternehmen (Schlikker & Söhne, Gathmann & Geréann, G. Schal & Co. und zehn Wolde, später Carl Remy) entwickelt hat; die heutige RoFa ist nicht einer der ursprünglichen Textilunternehmen, sondern wurde von H. Lammering [gegründet] und später von Gerhard Schaloker Junge, Levert Rost und Wilhelm Edel übernommen.

RAMP−-T: Ende des 19. Jahrhunderts entwickelte sich [in Schüttorf] eine starke Textilindustrie mit mehreren großen lokalen Unternehmen (Schlikker & Söhne, Gathmann & Gerdemann, G. Schal & Co. und zehn Wolde, später Carl Remy; die heutige RoFa ist nicht eines der ursprünglichen Textilunternehmen, sondern wurde von H. Lammering [gegründet] und später von Gerhard Schaloker Junge, Levert Rost und Wilhelm Edel übernommen.

Reference: gegen Ende des 19. Jahrhunderts entwickelte sich in Schüttorf eine starke Textilindustrie mit mehreren großen lokalen Unternehmen (Schlikker & Söhne, Gathmann & Gerdemann, G. Schümer & Co. und ten Wolde, später Carl Remy, die heutige RoFa ist keine ursprüngliche Textilfirma, sondern wurde von H. Lammering gegründet und später von Gerhard Schlikker jun., Levert Rost und Wilhelm Edel übernommen.)

Table 6: MT example from Article 2 in the test set. All translation errors are underlined. Incorrect proper names are also set in italics. Omissions are inserted in brackets and set in italics [like this]. Improvements by RAMP−-T over MLE are marked in boldface.
back-translating from English into German. The
reference contains some awkward or even ungrammatical phrases such as ‘‘was developing
itself’’, a literal translation from German. The
example also illustrates that translating Wikipedia
involves handling frequent proper names (there
are 11 proper names in the example). Both mod-
els struggle with translating proper names, but
RAMP−-T produces the correct phrase ‘‘Gathmann
& Gerdemann’’, while MLE fails to do so. The
RAMP−-T translation is also fully grammatical,
whereas MLE incorrectly translates the main verb
phrase ‘‘was developing itself’’ into a relative clause,
and contains an agreement error in the translation
of the noun phrase ‘‘one of the original textile
companies’’. Although it makes fewer errors in grammar and proper name translation, RAMP−-T contains two deletion errors, whereas MLE contains only one. This could be caused by the active optimization of sentence length in the ramp loss model.
6 Fully Supervised Machine Translation
Our work focuses on weakly supervised tasks,
but we also conduct experiments using a fully
supervised MT task. These experiments are
motivated on the one hand by adapting the findings
of Gimpel and Smith (2012) to the neural MT
paradigm, and on the other hand by expanding the
work by Edunov et al. (2018) on applying classical
structured prediction losses to neural MT.
Ramp Loss Objectives. For fully supervised MT we assume access to one or more reference translations ȳ for each input x. The reward BLEU+1(y, ȳ) is a per-sentence approximation of the BLEU score.11 Table 7 shows the different definitions of y+ and y−, which give rise to different ramp losses. RAMP, RAMP1, and
RAMP2 are defined analogously to the other
tasks. We again include a hyperparameter α > 0
interpolating cost function and model score when
searching for y+ and y−. Gimpel and Smith
(2012) also include the perceptron loss in their
analysis. PERC1 is a re-formulation of the Collins
perceptron (Collins, 2002) where the reference is used as y+ and ŷ is used as y−. A comparison with
PERC1 is not possible for the weakly supervised
tasks in the previous sections, as gold structures
are not available for these tasks. With neural MT
and subword methods we are able to compute
this loss for any reference without running into
the problem of reachability that was faced by
phrase-based MT (Liang et al., 2006).
11We use the BLEU score with add-1 smoothing for n > 1,
as proposed by Chen and Cherry (2014).
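To illustrate the footnote, here is a hedged Python sketch of BLEU+1: per-sentence BLEU where, for n > 1, one is added to both the clipped match count and the total count (Chen and Cherry, 2014). Multiple references and other implementation details are simplified; this is not the evaluation code used in the experiments.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_plus_one(hyp, ref, max_n=4):
    """Smoothed per-sentence BLEU: add-1 smoothing for orders n > 1."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())  # clipped matches
        total = max(len(hyp) - n + 1, 0)
        if n > 1:
            match, total = match + 1, total + 1
        if match == 0 or total == 0:  # only possible for unigrams
            return 0.0
        log_prec += math.log(match / total)
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

print(bleu_plus_one("the cat sat on the mat".split(),
                    "the cat is on the mat".split()))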
Loss     y+                                                y−
RAMP     arg max_y π_w(y|x) − α(1 − BLEU+1(y, ȳ))          arg max_y π_w(y|x) + α(1 − BLEU+1(y, ȳ))
RAMP1    ŷ                                                 arg max_y π_w(y|x) + α(1 − BLEU+1(y, ȳ))
RAMP2    arg max_y π_w(y|x) − α(1 − BLEU+1(y, ȳ))          ŷ
PERC1    ȳ                                                 ŷ
PERC2    arg max_y BLEU+1(y, ȳ)                            ŷ

Table 7: Configurations for y+ and y− for fully supervised MT. ŷ is the highest-probability model output, ȳ is a gold standard reference. π_w(y|x) is the probability of y according to the model. The arg max_y is taken over the k-best list K(x). BLEU+1 is smoothed per-sentence BLEU and α is a scaling factor.
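As a concrete reading of Table 7, the following Python sketch selects y+ and y− from a k-best list for the bipolar RAMP loss. Whether the model score enters as a probability or a log-probability, and how ties are broken, are implementation choices; the names below are illustrative.

def select_hope_fear(kbest, cost_fn, alpha=10.0):
    """kbest: list of (hypothesis, model_score) pairs from K(x).
    cost_fn: maps a hypothesis to its cost, e.g. 1 - BLEU+1(y, ref).
    alpha: scaling factor interpolating model score and cost."""
    hope = max(kbest, key=lambda h: h[1] - alpha * cost_fn(h[0]))[0]  # y+
    fear = max(kbest, key=lambda h: h[1] + alpha * cost_fn(h[0]))[0]  # y-
    return hope, fear

# Toy usage with precomputed costs. RAMP1 would instead take the single
# highest-scoring hypothesis as y+; RAMP2 would use it as y-.
kbest = [("hyp_a", -1.0), ("hyp_b", -1.2), ("hyp_c", -1.4)]
costs = {"hyp_a": 0.9, "hyp_b": 0.1, "hyp_c": 0.5}
print(select_hope_fear(kbest, lambda h: costs[h], alpha=1.0))
# ('hyp_b', 'hyp_a'): the hope has low cost, the fear combines a high
# model score with a high cost.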
However, using sequence-level training towards a reference
can lead to degenerate solutions where the model
gives low probability to all its predictions (Shen
et al., 2016). PERC2 addresses this problem by
replacing ȳ with a surrogate translation that achieves
the highest BLEU+1 score in K(x). This approach
is also used by Edunov et al. (2018) for the
loss functions which require an oracle. PERC1
corresponds to equation (9), PERC2 to equation
(10) of Gimpel and Smith (2012).
Experimental Setup. We conduct experiments
on the IWSLT 2014 German–English task, which
is based on Cettolo et al. (2012) in the same way
as Edunov et al. (2018). The training set contains
160K sentence pairs. We set the maximum sen-
tence length to 50 and use BPE with 14,000
merge operations. Edunov et al. (2018) sample 7K
sentences from the training set as heldout data.
We do the same, but only use one tenth of the data
as heldout set to be able to validate often.
Our baseline system (MLE) is a BiLSTM
encoder-decoder with attention, which is trained
using the MLE objective. Word embedding and
hidden layer dimensions are set to 256. We use
batches of 64 sentences for baseline training and
batches of 40 inputs for training RAMP and
PERC variants. MRT makes an update after each input using all sampled outputs, resulting in a batch size of 1. All experiments use dropout for
regularization, with dropout probability set to 0.2
for embedding and hidden layers and to 0.1 for
source and target layers. During MLE-training,
the model is validated every 2500 updates and
training is stopped if the MLE loss on the heldout
set worsens for 10 consecutive validations.
For metric-augmented training, we use SGD
for optimization with learning rates optimized on
the development set. Ramp losses and PERC2 use
a k-best list of size 16. For ramp loss training,
we set α = 10. RAMP and PERC variants both
use a learning rate of 0.001. A new k-best list is
generated for each input using the current model
parameters. We compare ramp loss to MRT as
described above. For MRT, we use SGD with a
learning rate of 0.01 and set S = 16 and S′ = 10.
As Edunov et al. (2018) observe beam search
to work better than sampling for MRT, we also
run an experiment in this configuration, but find
no difference between results. As beam search
runs significantly slower, we only report sampling
experiments.
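For reference, a minimal sketch of the sampled MRT objective we compare against (Shen et al., 2016): the expected cost under the model distribution renormalized over the sampled outputs. In a real system the log-probabilities are differentiable model outputs and the risk is minimized by SGD; here we only compute the scalar value, and the hyperparameter name sharpness is illustrative. A bipolar variant, as discussed in the conclusion, would simply supply costs that are negative for good outputs.

import math

def mrt_risk(sample_logprobs, costs, sharpness=0.005):
    """Expected risk over sampled outputs.

    sample_logprobs: log pi_w(y|x) for each distinct sampled output y.
    costs: cost per output, e.g. 1 - BLEU+1(y, reference).
    sharpness: exponent on the renormalized distribution Q, with
        Q(y|x) proportional to pi_w(y|x) ** sharpness.
    """
    weights = [math.exp(sharpness * lp) for lp in sample_logprobs]
    z = sum(weights)
    return sum(w / z * c for w, c in zip(weights, costs))

# Toy usage: three sampled translations with costs in [0, 1].
print(mrt_risk([-2.0, -3.5, -5.0], [0.2, 0.6, 0.9], sharpness=1.0))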
The model is validated on the development
set after every 200 updates for experiments with
batch size 40 and after 8,000 updates for MRT
experiments with batch size 1. The stopping
point is determined by the BLEU score on the
heldout set after 25 validations. As we are training
on the same data as the MLE baseline, we
also apply dropout during ramp loss training to
prevent overfitting. BLEU scores are computed
with Moses’ multi-bleu.perl on tokenized,
truecased output. Each experiment is run 3 times
and results are averaged over the runs.
Experimental Results. As shown in Table 8,
all experiments except for PERC1 yield improve-
ments over MLE, confirming that sequence-
level losses that update towards the reference
can lead to degenerate solutions. For MRT, our
findings show similar performance to the initial
experiments reported by Edunov et al. (2018),
who gain 0.24 BLEU points on the same test
set.12
12See their Table 2. Using interpolation with the MLE
objective, Edunov et al. (2018) achieve +0.7 BLEU points.
As we are only interested in the effect of sequence-level
objectives, we do not add MLE interpolation. The best model
by Edunov et al. (2018) achieved a BLEU score of 32.91%.
It is possible that these scores are not directly comparable to
ours due to different pre- and post-processing. They also use
a multi-layer CNN architecture (Gehring et al., 2017), which
has been shown to outperform a simple RNN architecture
such as ours.
     Loss     M    % BLEU             Δ
1    MLE      64   31.99
2    MRT      1    32.17 ± 0.02       +0.18
3    PERC1    40   31.91 ± 0.02       −0.08
4    PERC2    40   32.22 ± 0.03       +0.23
5    RAMP1    40   32.36∗ ± 0.05      +0.37
6    RAMP2    40   32.19 ± 0.01       +0.20
7    RAMP     40   32.44∗∗ ± 0.00     +0.45
8    RAMP-T   40   32.33∗ ± 0.00      +0.34

Table 8: BLEU scores for fully supervised MT experiments. M is the batch size. Boldfaced results are significantly better than MLE at p < 0.01 according to multeval (Clark et al., 2011). ∗ marks a significant difference to MRT and PERC2, and ∗∗ marks a difference to RAMP1.
PERC2 and RAMP2 improve over the MLE baseline and PERC1, but perform on a
par with MRT and each other. Both RAMP and
RAMP1 are able to outperform MRT, PERC2,
and RAMP2, with the bipolar objective RAMP
also outperforming RAMP1 by a narrow margin.
The main difference between RAMP and RAMP1,
compared to PERC2 and RAMP2, is the fact that
the latter objectives use ŷ as y−, whereas the former use a fear translation with high probability and low BLEU+1. We surmise that for
this fully supervised task, selecting a y− which
has some known negative characteristics is more
important for success than finding a good y+.
RAMP, which fulfills both criteria, still out-
performs RAMP2. This result re-confirms the
superiority of bipolar objectives compared to non-
bipolar ones. Although still improving over MLE,
token-level ramp loss RAMP-T is outperformed
by RAMP by a small margin. This result suggests
that when using a metric-augmented objective on
top of an MLE-trained model in a full super-
vision scenario without domain shift, there is
little room for improvement from token-level
supervision, while gains can still be obtained from
additional sequence-level information captured by
the external metric, such as information about the
sequence length.
To summarize, our findings on a fully super-
vised task show the same small margin for
improvement as Edunov et al. (2018), without
any further tuning of performance (e.g., by inter-
polation with the MLE objective). Bipolar RAMP
is found to outperform the other losses. This
observation is also consistent with the results
by Gimpel and Smith (2012) for phrase-based
MT. We conclude that for fully supervised MT,
deliberately selecting a hope and fear translation
is beneficial.
7 Conclusion
We presented a study of weakly supervised
learning objectives for three neural sequence-to-
sequence learning tasks. In our first task of seman-
tic parsing, question-answer pairs provide a weak
supervision signal to find parses that execute to the
correct answer. We show that ramp loss can out-
perform MRT if it incorporates bipolar supervision
where parses that receive negative feedback are
actively discouraged. The best overall objective is
constituted by the token-level ramp loss. Next, we
turn to weak supervision for machine translation
in the form of cross-lingual document-level links. We
present two ramp loss objectives that combine
bipolar weak supervision from a linked document
d+ and an irrelevant document d−. Again, the
bipolar ramp loss objectives outperform MRT,
and the best overall result is obtained using token-
level ramp loss. Finally, to tie our work to previous
work on supervised machine translation, we con-
duct experiments in a fully supervised scenario
where gold references are available and a metric-
augmented loss is desired to reduce the exposure
bias and the loss-evaluation mismatch. Again, the
bipolar ramp loss objective performs best, but we
find that the overall margin for improvement is
small without any additional engineering. We con-
clude that ramp loss objectives show promise for
neural sequence-to-sequence learning, especially
when it comes to weakly supervised tasks where
the MLE objective cannot be applied. In contrast to
ramp losses that either operate only in the undesir-
able region of the search space (‘‘cost-augmented
decoding’’ as in RAMP1) or only in the desir-
able region of the search space (‘‘cost-diminished
decoding’’ as in RAMP2), bipolar RAMP operates
in both regions of the search space when extract-
ing supervision signals from weak feedback. We
showed that MRT can be turned into a bipolar
objective by defining a metric that assigns neg-
ative values to bad outputs. This improves the
performance of MRT objectives. However, the
ramp loss objective is still superior as it is easy to
implement and efficient to compute. Furthermore,
on weakly supervised tasks our novel token-level
ramp loss objective RAMP-T can obtain further
improvements over its sequence-level counterpart
because it can more directly assess which tokens
in a sequence are crucial to its success or failure.
Acknowledgments
The research reported in this paper was supported
in part by DFG grants RI-2221/4-1 and RI 2221/
2-1. We would like to thank the reviewers for their
helpful comments.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In Inter-
national Conference on Learning Representa-
tions (ICLR). San Diego, CA.
Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo,
and Marcello Federico. 2016. Neural versus
phrase-based machine translation quality: A
case study. In Proceedings of the 2016 Con-
ference on Empirical Methods in Natural
Language Processing (EMNLP). Austin, TX.
Jonathan Berant, Andrew Chou, Roy Frostig,
and Percy Liang. 2013. Semantic parsing on
freebase from question-answer pairs. In Pro-
ceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing
(EMNLP). Seattle, WA.
Mauro Cettolo, Christian Girardi, and Marcello
Federico. 2012. WIT3: Web Inventory of
Transcribed and Translated Talks. In Proceed-
ings of the 16th Conference of the European
Association for Machine Translation (EAMT).
Trento, Italy.
Olivier Chapelle, Chuong B. Do, Choon H. Teo,
Quoc V. Le, and Alex J. Smola. 2009. Tighter
bounds for structured estimation. In Advances
in Neural
Information Processing Systems
(NIPS). Vancouver, Canada.
Boxing Chen and Colin Cherry. 2014. A sys-
tematic comparison of smoothing techniques
for sentence-level BLEU. In Proceedings of the
9th Workshop on Statistical Machine Trans-
lation. Baltimore, MD.
David Chiang. 2012. Hope and fear for discrim-
inative training of statistical translation models.
Journal of Machine Learning Research,
13(1):1159–1187.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar.

Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). Vancouver, Canada.
Jonathan H. Clark, Chris Dyer, Alon Lavie, and
Noah A. Smith. 2011. Better hypothesis testing
for statistical machine translation: Controlling
for optimizer instability. In Proceedings of
the 2011 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies
(HLT-NAACL). Portland, OR.
James Clarke, Dan Goldwasser, Ming-Wei Chang,
and Dan Roth. 2010. Driving Semantic Parsing
from the World’s Response. In Proceedings of
the 14th Conference on Computational Natural
Language Learning. Uppsala, Sweden.
Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP). Philadelphia, PA.
Long Duong, Hadi Afshar, Dominique Estival,
Glen Pink, Philip Cohen, and Mark Johnson.
2018. Active learning for deep semantic
parsing. In Proceedings of the 56th Annual
Meeting of the Association for Computational
Linguistics (ACL), Melbourne, Australia.
Sergey Edunov, Myle Ott, Michael Auli, David
Grangier, and Marc’Aurelio Ranzato. 2018.
Classical structured prediction losses for se-
quence to sequence learning. In Proceedings
of the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies
(HLT-NAACL). New Orleans, LA.
Andreas Eisele and Yu Chen. 2010. MultiUN: A multilingual corpus from United Nations documents. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC). Valletta, Malta.
Jonas Gehring, Michael Auli, David Grangier,
Denis Yarats, and Yann N. Dauphin. 2017.
Convolutional sequence to sequence learning.
In Proceedings of the 34th International Con-
ference on Machine Learning (ICML). Sydney,
Australia.
Kevin Gimpel and Noah A. Smith. 2012. Struc-
tured ramp loss minimization for machine
translation. In Proceedings of the 2012 Con-
ference of
the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies (HLT-NAACL).
Montreal, Canada.
Evan Greensmith, Peter L. Bartlett, and Jonathan
Baxter. 2004. Variance reduction techniques
for gradient estimation in reinforcement learn-
ing. Journal of Machine Learning Research,
5:1471–1530.
Kelvin Guu, Panupong Pasupat, Evan Liu, and Percy Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). Vancouver, Canada.
Carolin Haas and Stefan Riezler. 2016. A corpus and semantic parser for multilingual natural language querying of OpenStreetMap. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL). San Diego, CA.
Tamir Hazan, Joseph Keshet, and David A. McAllester. 2010. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems (NIPS). Vancouver, Canada.
Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). Vancouver, Canada.
Laura Jehl and Stefan Riezler. 2016. Learning to translate from graded and negative relevance information. In Proceedings of the 26th International Conference on Computational Linguistics (COLING). Osaka, Japan.
Philipp Koehn. 2005. Europarl: A parallel corpus
for statistical machine translation. In Proceed-
ings of
the Machine Translation Summit,
volume 5. Phuket, Thailand.
Tomáš Kočiský, Gábor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, TX.
Carolin Lawrence and Stefan Riezler. 2018. Im-
proving a neural semantic parser by counter-
factual learning from human bandit feedback.
In Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(ACL). Melbourne, Australia.
Chen Liang, Jonathan Berant, Quoc V. Le,
Kenneth D. Forbus, and Ni Lao. 2017.
Neural symbolic machines: learning semantic
parsers on freebase with weak supervision. In
Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(ACL), Vancouver, Canada.
Percy Liang, Alexandre Bouchard-Côté, Dan
Klein, and Ben Taskar. 2006. An end-to-end
discriminative approach to machine translation.
In Proceedings of the 21st International Con-
ference on Computational Linguistics and
the 44th Annual Meeting of the Association
for Computational Linguistics (ACL). Sydney,
Australia.
Dipendra Misra, Ming-Wei Chang, Xiaodong
He, and Wen-tau Yih. 2018. Policy shaping
and generalized update equations for semantic
parsing from denotations. In Proceedings of
the 2018 Conference on Empirical Methods
in Natural Language Processing (EMNLP).
Brussels, Belgium.
Lili Mou, Zhengdong Lu, Hang Li, and Zhi Jin. 2017. Coupling distributed and symbolic execution for natural language queries. In Proceedings of the 34th International Conference on Machine Learning (ICML). Sydney, Australia.
Eric W. Noreen. 1989. Computer Intensive Methods for Testing Hypotheses: An Introduction. Wiley, New York.
Mohammad Norouzi, Samy Bengio, Zhifeng
Chen, Navdeep Jaitly, Mike Schuster, Yonghui
Wu, and Dale Schuurmans. 2016. Reward
augmented maximum likelihood for neural struc-
tured prediction. In Advances in Neural Infor-
mation Processing Systems (NIPS). Barcelona,
Spain.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: A method for automatic evaluation of machine translation. Technical Report RC22176 (W0190-022), IBM Research Division, Yorktown Heights, NY.
Panupong Pasupat and Percy Liang. 2015. Com-
positional semantic parsing on semi-structured
tables. In Proceedings of
the 53rd Annual
Meeting of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing
(ACL-IJCNLP). Beijing, China.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, TX.
Marc’Aurelio Ranzato, Sumit Chopra, Michael
Auli, and Wojciech Zaremba. 2016. Sequence
level training with recurrent neural networks.
In International Conference on Learning
Representations (ICLR). San Juan, Puerto Rico.
Shigehiko Schamoni, Felix Hieber, Artem
Sokolov, and Stefan Riezler. 2014. Learning
translational and knowledge-based similarities
from relevance rankings for cross-language
retrieval. In Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics (ACL). Baltimore, MD.
Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: A toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Valencia, Spain.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. In Proceedings of
the 54th Annual Meeting of the Association
for Computational Linguistics (ACL). Berlin,
Germany.
Shiqi Shen, Yong Cheng, Zhongjun He, Wei He,
Hua Wu, Maosong Sun, and Yang Liu. 2016.
Minimum Risk Training for neural machine
translation. In Proceedings of the 54th Annual
Meeting of the Association for Computational
Linguistics (ACL). Berlin, Germany.
David A. Smith and Jason Eisner. 2006. Minimum
risk annealing for training log-linear models. In
Proceedings of the COLING/ACL 2006 Main
Conference Poster Sessions. Sydney, Australia.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.
2014. Sequence to sequence learning with
neural networks. In Advances in Neural Infor-
mation Processing Systems (NIPS). Montreal,
Canada.
Ben Taskar, Vassil Chatalbashev, Daphne Koller,
and Carlos Guestrin. 2005. Learning structured
prediction models: A large margin approach.
In Proceedings of the 22nd International Con-
ference on Machine Learning (ICML), Bonn,
Germany.
Ben Taskar, Carlos Guestrin, and Daphne
Koller. 2004. Max-margin Markov networks.
In Advances in Neural Information Processing
Systems (NIPS). Vancouver, Canada.
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 20:229–256.
Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, TX.
Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov,
and William Cohen. 2017. Semi-supervised
QA with generative domain-adaptive nets. In
Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(ACL). Vancouver, Canada.
Matthew D. Zeiler. 2012. ADADELTA: An
adaptive learning rate method. ArXiv e-prints,
cs.LG/1212.5701v1.