Revisiting Multi-Domain Machine Translation
MinhQuang Pham† ‡, Josep Maria Crego†, François Yvon‡
‡Université Paris-Saclay, CNRS, LIMSI, 91400, Orsay, France
francois.yvon@limsi.fr
†SYSTRAN, 5 rue Feydeau, 75002 Paris, France
{minhquang.pham,josep.crego}@systrangroup.com
Abstract
When building machine translation systems,
one often needs to make the best out of hetero-
geneous sets of parallel data in training, and
to robustly handle inputs from unexpected
domains in testing. This multi-domain scenario
has attracted a lot of recent work that fall under
the general umbrella of transfer learning. In
this study, we revisit multi-domain machine
translation, with the aim to formulate the moti-
vations for developing such systems and the
associated expectations with respect to perfor-
mance. Our experiments with a large sample
of multi-domain systems show that most of
these expectations are hardly met and suggest
that further work is needed to better analyze
the current behaviour of multi-domain systems
and to make them fully deliver on their promises.
1 Introduction
Data-based Machine Translation (MT), whether
statistical or neural, rests on well-understood ma-
chine learning principles. Given a training sample
of matched source-target sentence pairs (f, e)
drawn from an underlying distribution Ds, a
model parameterized by θ (here, a translation
function hθ) is trained by minimizing the empirical
expectation of a loss function ℓ(hθ(f), e). This
approach ensures that the translation loss remains
low when translating more sentences drawn from
the same distribution.
Owing to the great variability of language data,
this ideal situation is rarely met
in practice,
warranting the study of an alternative scenario,
where the test distribution Dt differs from Ds.
In this setting, domain adaptation (DA) methods
are in order. DA has a long history in Machine
Learning in general (e.g., Shimodaira, 2000; Ben-
David et al., 2010; Quionero-Candela and
Lawrence, 2008; Pan and Yang, 2010) and in NLP
in particular (e.g., Daumé III and Marcu, 2006;
Blitzer, 2007; Jiang and Zhai, 2007). Various
techniques thus exist to handle both the situations
where a (small) training sample drawn from Dt
is available in training, or where only samples
of source-side (or target-side) sentences are
available (see Foster and Kuhn [2007]; Bertoldi
and Federico [2009]; Axelrod et al. [2011]; for
proposals from the statistical MT era, or Chu and
Wang [2018] for a recent survey of DA for Neural
MT).
A seemingly related problem is multi-domain
(MD) machine translation (Sajjad et al., 2017;
Farajian et al., 2017b; Kobus et al., 2017; Zeng
et al., 2018; Pham et al., 2019), where one single
system is trained and tested with data from mul-
tiple domains. MD machine translation (MDMT)
corresponds to a very common situation, where
all available data, no matter its origin, is used
to train a robust system that performs well for
any kind of new input. If the intuitions behind
MDMT are quite simple, the exact specifications
of MDMT systems are rarely spelled out: For
instance, should MDMT perform well when the
test data is distributed like the training data, when
it is equally distributed across domains, or when
the test distribution is unknown? Should MDMT
also be robust to new domains? How should it
handle domain labeling errors?
A related question concerns the relationship
between supervised domain adaptation and multi-
domain translation. The latter task seems more
challenging as it tries to optimize MT performance
for a more diverse set of potential inputs, with an
additional uncertainty regarding the distribution
of test data. Are there still situations where MD
systems can surpass single domain adaptation, comme
is sometimes expected?
In this paper, we formulate in a more pre-
cise fashion the requirements that an effective
MDMT system should meet (Section 2). Our first
Transactions of the Association for Computational Linguistics, vol. 9, pp. 17–35, 2021. https://doi.org/10.1162/tacl_a_00351
Action Editor: George Foster. Submission batch: 5/2020; Revision batch: 9/2020; Published 02/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
5
1
1
9
2
4
0
5
9
/
/
t
je
un
c
_
un
_
0
0
3
5
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
contribution is thus of methodological nature and
consists of lists of expected properties of MDMT
systems and associated measurements to evaluate
them (Section 3). In so doing, we also shed light on
new problems that arise in this context, regarding,
par exemple, the accommodation of new domains
in the course of training, or the computation
of automatic domain tags. Our second main
contribution is experimental and consists in a
thorough reanalysis of eight recent multi-domain
approaches from the literature, including a variant
of a model initially introduced for DA. We show in
Section 4 that existing approaches still fall short
of matching many of these requirements, notably
with respect to the handling of a large number
of heterogeneous domains and to dynamically
integrating new domains in training.
2 Requirements of Multi-Domain MT
In this section, we recap the main reasons
for considering a multi-domain scenario and
discuss their implications in terms of performance
evaluation.
2.1 Formalizing Multi-Domain Translation
We conventionally define a domain d as a dis-
tribution Dd(X) over some feature space X that is
shared across domains (Pan and Yang, 2010): In
machine translation, X is the representation space
for source sentences; each domain corresponds
to a specific source of data, and differs from
the other data sources in terms of textual genre,
thematic content (Chen et al., 2016; Zhang et al.,
2016), register (Sennrich et al., 2016a), style
(Niu et al., 2018), and so forth. Translation in
domain d is formalized by a translation function
hd(y|x) pairing sentences in a source language
with sentences in a target language y ∈ Y. hd
is usually assumed to be deterministic (hence
y = hd(x)), but can differ from one domain to the
other.
A typical learning scenario in MT is to have
access to samples from nd domains, which means
that the training distribution Ds is a mixture
Ds(X) = Σ_d λs_d Dd(X), with {λs_d, d = 1 . . . nd}
the corresponding mixture weights (Σ_d λs_d = 1).
Multi-domain learning, as defined in Dredze and
Crammer (2008), further assumes that domain
tags are also available in testing; the implication
being that the test distribution is also a mix-
ture Dt(X) = Σ_d λt_d Dd(X) of several domains,
making the problem distinct from mere domain
adaptation. A multi-domain learner is then expected
to use these tags effectively (Joshi et al., 2012)
when computing the combined translation func-
tion h(X, d), and to perform well in all domains
(Finkel and Manning, 2009). This setting is
closely related to the multi-source adaptation
problem formalized in Mansour et al. (2009a,b)
and Hoffman et al. (2018).
This definition seems to be the most accepted
view of a multi-domain MT1 and one that we
also adopt here. Note that in the absence of
further specification, the naive answer to the
MD setting should be to estimate one translation
function ĥd(X) separately for each domain, then
to translate using ĥ(X, d) = Σ_d' ĥd'(X) I(d' = d),
where I(·) is the indicator function. We now
discuss the arguments that are put forward to
proceed differently.
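As a minimal illustration of this naive per-domain strategy, the dispatch on gold domain tags can be sketched as follows (toy stand-ins for trained models, not an actual MT system):

```python
# Sketch of the naive multi-domain strategy of Section 2.1:
# train one translation function per domain and dispatch on the
# (gold) domain tag d at test time. All names are illustrative.

def make_naive_mdmt(per_domain_models):
    """per_domain_models: dict mapping a domain tag d to a function x -> y."""
    def h(x, d):
        # h(x, d) = sum_d' h_d'(x) * I(d' == d), i.e. plain dispatch
        return per_domain_models[d](x)
    return h

# Toy "translation functions" standing in for trained models.
models = {
    "MED": lambda x: f"med:{x}",
    "LAW": lambda x: f"law:{x}",
}
h = make_naive_mdmt(models)
print(h("a sentence", "LAW"))  # -> law:a sentence
```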
2.2 Reasons for Building MDMT Systems
A first motivation for moving away from the
one-domain / one-system solution is practical
(Sennrich et al., 2013; Farajian et al., 2017a):
When faced with inputs that are potentially from
multiple domains, it is easier and computationally
cheaper to develop one single system instead
of having to optimize and maintain multiple
engines. The underlying assumption here is that
the number of domains of interest can be large, a
limiting scenario being fully personalized machine
translation (Michel and Neubig, 2018).
A second line of reasoning rests on linguistic
properties of the translation function and contends
that domain specificities are mostly expressed
lexically and will primarily affect content words
or multi-word expressions; function words, on
the other hand, are domain agnostic and tend
to remain semantically stable across domains,
motivating some cross-domain parameter sharing.
An MDMT system should simultaneously learn
lexical domain peculiarities, and leverage cross-
domain similarities to improve the translation of
generic contexts and words (Zeng et al., 2018;
Pham et al., 2019). It is here expected that the
MDMT scenario should be more profitable when
the domain mix includes domains that are closely
related and can share more information.
1An exception is Farajian et al. (2017b), where test
translations rely on similarity scores between test and train
phrases, rather than on domain labels.
A third series of motivations is of statistical na-
ture. The training data available for each domain is
usually unevenly distributed, and domain-specific
systems trained or adapted on small datasets are
likely to have a high variance and generalize
poorly. For some test domains, there may even be
no data at all (Farajian et al., 2017a). Training mix-
domain systems is likely to reduce this variance,
at the expense of a larger statistical bias (Clark
et al., 2012). Under this view, MDMT would be es-
pecially beneficial for domains with little training
data. This is observed for multilingual MT from
English: an improvement for under-resourced
languages due to positive transfer, at the cost
of a decrease in performance for well-resourced
languages (Arivazhagan et al., 2019).
Combining multiple domain-specific MTs can
also be justified for the sake of distributional robust-
ness (Mansour et al., 2009a,b), for example, when
the test mixture differs from the train mixture, ou
when it includes new domains unseen in training.
An even more challenging case is when the
MT would need to perform well for any test
distribution, as studied for statistical MT in Huck
et al. (2015). In all these cases, mixing domains
in training and/or testing is likely to improve
robustness against unexpected or adversarial test
distribution (Oren et al., 2019).
A distinct line of reasoning is that mixing
domains can have a positive regularization effect
for all domains. By introducing variability in
training, it prevents DA from overfitting the
available adaptation data and could help improve
generalization even for well-resourced domains.
A related case is made in Joshi et al. (2012), which
shows that part of the benefits of MD training is
due to an ensembling effect, where systems from
multiple domains are simultaneously used in the
prediction phase; this effect may subsist even in
the absence of clear domain separations.
To recap, there are multiple arguments for
adopting MDMT, some already used in DA
settings, and some original. These arguments are
not mutually exclusive; cependant, each yields
specific expectations with respect to the perfor-
mance of this approach, and should also yield
appropriate evaluation procedures. If the motiva-
tion is primarily computational, then a drop in
MT quality with respect to multiple individual
domains might be acceptable if compensated by
the computational savings. If it is to improve
statistical estimation, then the hope will be that
MDMT will improve, at least for some under-
resourced domains, over individually trained
systems. If, finally, it is to make the system more
robust to unexpected or adversarial test distribu-
tion, then this is the setting that should be used to
evaluate MDMT. The next section discusses ways
in which these requirements of MDMT systems
could be challenged.
3 Challenging Multi-Domain Systems
In this section, we propose seven operational
requirements that can be expected from an effec-
tive multi-domain system, and discuss ways to
evaluate whether these requirements are actually
met. All these evaluations will rest on comparison
of translation performance, and do not depend on
the choice of a particular metric. To make our
results comparable with the literature, we will
only use the BLEU score (Papineni et al., 2002) in
Section 4, noting it may not be the best yardstick to
assess subtle improvements of lexical choices that
are often associated with domain adapted systems
(Irvine et al., 2013). Other important figures of
merit for MDMT systems are the computational
training cost and the total number of parameters.
3.1 Multi-Domain Systems Should
Be Effective
A first expectation is that MDMT systems should
perform well in the face of mixed-domain test
data. We thus derive the following requirements.
[P1-LAB] An MDMT should perform better than
the baseline, which disregards domain labels, ou
reassigns them in a random fashion (Joshi et al.,
2012). Evaluating this requirement is a matter of
a mere comparison, assuming the test distribution
of domains is known: If all domains are equally
important, performance averages can be reported;
if they are not, weighted averages should be used
instead.
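The two aggregation schemes mentioned here are simple to state in code (the per-domain scores and weights below are made up for illustration):

```python
# Aggregating per-domain BLEU scores, as discussed for [P1-LAB]:
# an unweighted average treats all domains as equally important,
# a weighted average follows a known test (or train) distribution.

def unweighted_avg(scores):
    return sum(scores.values()) / len(scores)

def weighted_avg(scores, weights):
    # weights is a domain -> probability mapping summing to 1
    return sum(weights[d] * s for d, s in scores.items())

scores = {"MED": 37.3, "LAW": 54.6, "REL": 77.5}   # made-up example
weights = {"MED": 0.8, "LAW": 0.15, "REL": 0.05}
print(round(unweighted_avg(scores), 2))  # -> 56.47
print(round(weighted_avg(scores, weights), 2))
```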
[P2-TUN] In addition, one can expect that
MDMT will improve over fine-tuning (Luong and
Manning, 2015; Freitag and Al-Onaizan, 2016),
at least in domains where data is scarce, or in situ-
ations where several domains are close. To evalu-
ate this, we perform two measurements, using a
real as well as an artificial scenario. In the real
scenario, we simply compare the performance of
MDMT and fine-tuning for domains of varying
sizes, expecting a larger gain for smaller domains.
In the artificial scenario, we split a single do-
main in two parts which are considered as distinct
in training. The expectation here is that an MDMT
should yield a clear gain for both pseudo sub-
domains, which should benefit from the supple-
mentary amount of relevant training. In this
situation, MDMT should even outperform fine-
tuning on either of the pseudo sub-domains.
3.2 Robustness to Fuzzy Domain Separation
A second set of requirements is related to the
definition of a domain. As repeatedly pointed out
in the literature, parallel corpora in MT are often
collected opportunistically and the view that each
corpus constitutes a single domain is often a gross
approximation.2 MDMT should aim to make the
best of the available data and be robust to domain
assignments. To challenge these expectations, we
propose evaluating the following requirements.
[P3-HET] The notion of a domain being a
fragile one, an effective MDMT system should
be able to discover not only when cross-domain
sharing is useful (cf. requirement [P2-TUN]),
but also when intra-domain heterogeneity is
hurting. This requirement is tested by artificially
conjoining separate domains into one during
training, hoping that the loss in performance with
respect to the baseline (using correct domain tags)
will remain small.
[P4-ERR] MDMTs should perform best when
the true domain tag is known, but deteriorate
gracefully in the face of tag errors; in this situation,
catastrophic drops in performance are often ob-
served. This requirement can be assessed by trans-
lating test texts with erroneous domain tags and
reporting the subsequent loss in performance.
[P5-UNK] A related situation occurs when the
domain of a test document is unknown. Several
situations need to be considered: For domains seen
in training, using automatically predicted domain
labels should not be much worse than using the
correct one. For test documents from unknown
domains (zero-shot transfer), a good MD system
should ideally outperform the default baseline that
merges all available data.
[P6-DYN] Another requirement, more of an
operational nature, is that an MDMT system
2Two of our own ‘‘domains’’ actually comprise several
subcorpora (IT and MED), see details in Section 4.1.
should smoothly evolve to handle a growing
number of domains, without having to retrain
the full system each time new data is available.
This is a requirement [P6-DYN] that we challenge
by dynamically changing the number of training
and test domains.
3.3 Scaling to a Large Number of Domains
[P7-NUM] As mentioned above, MDMT sys-
tems have often been motivated by computational
arguments. This argument is all the more sensible
as the number of domains increases, making the
optimization of many individual systems both in-
effective and undesirable. Lacking ac-
cess to corpora containing very large sets (e.g., on
the order of 100–1,000) of domains, we experiment
with automatically learned domains.
4 Experimental Settings
4.1 Data and Metrics
We experiment with translation from English
into French and use texts initially originating
from six domains, corresponding to the following
data sources: the UFAL Medical corpus V1.0
(MED);3 the European Central Bank corpus (BANK)
(Tiedemann, 2012); the JRC-Acquis Commu-
nautaire corpus (LAW) (Steinberger et al., 2006),
documentation for KDE, Ubuntu, GNOME, and
PHP from the Opus collection (Tiedemann, 2009),
collectively merged into an IT domain; TED Talks
(TALK) (Cettolo et al., 2012); and the Koran (REL).
Complementary experiments also use v12 of the
News Commentary corpus (NEWS). Most corpora
are available from the Opus Web site.4 These
corpora were deduplicated and tokenized with in-
house tools; statistics are in Table 1. To reduce
the number of types and build open-vocabulary
systèmes, we use Byte-Pair Encoding (Sennrich
et al., 2016b) with 30,000 merge operations on a
corpus containing all sentences in both languages.
We randomly select in each corpus a devel-
opment and a test set of 1,000 lines and keep the
rest for training.5 Validation sets are used to choose
the best model according to the average BLEU
3https://ufal.mff.cuni.cz/ufal_medical_corpus.
We only use the in-domain (medical) subcorpora:
PATR, EMEA, CESTA, ECDC.
4http://opus.nlpl.eu.
5The code for reproducing our train, dev, and test
datasets is available at https://github.com/qmpham/experiments.
              MED           LAW           BANK          IT            TALK          REL           NEWS
# lines       2609 (0.68)   501 (0.13)    190 (0.05)    270 (0.07)    160 (0.04)    130 (0.03)    260 (0)
# tokens      133 / 154     17.1 / 19.6   6.3 / 7.3     3.6 / 4.6     3.6 / 4.0     3.2 / 3.4     7.8 / 9.2
# types       771 / 720     52.7 / 63.1   92.3 / 94.7   75.8 / 91.4   61.5 / 73.3   22.4 / 10.5   −
# uniq        700 / 640     20.2 / 23.7   42.9 / 40.1   44.7 / 55.7   20.7 / 25.6   7.1 / 2.1     −

Table 1: Corpora statistics: number of parallel lines (×10³) and proportion in the basic domain mixture
(which does not include the NEWS domain), number of tokens in English and French (×10⁶), number
of types in English and French (×10³), number of types that only appear in a given domain (×10³).
MED is the largest domain, containing almost 70% of the sentences, while REL is the smallest, with
only 3% of the data.
        LAW    BANK   TALK   IT     REL
MED     1.93   1.97   1.90   1.93   1.97
LAW            1.94   1.97   1.93   1.99
BANK                  1.98   1.94   1.99
TALK                         1.92   1.93
IT                                  1.99

Table 2: The H-divergence between domains.
score (Papineni et al., 2002).6 Significance testing
is performed using bootstrap resampling (Koehn,
2004), implemented in compare-mt7 (Neubig
et al., 2019). We report significant differences
at the level of p = 0.05.
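Paired bootstrap resampling can be sketched in a few lines; for simplicity this toy version compares mean sentence-level scores rather than corpus-level BLEU, so it only illustrates the resampling protocol, not the compare-mt implementation:

```python
# Sketch of paired bootstrap resampling (Koehn, 2004) for comparing
# two systems evaluated on the same test sentences; data is synthetic.
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples  # fraction of resamples where A beats B

sys_a = [0.40 + 0.01 * (i % 7) for i in range(200)]  # synthetic scores
sys_b = [0.38 + 0.01 * (i % 7) for i in range(200)]  # uniformly worse
p_better = paired_bootstrap(sys_a, sys_b)
print(p_better)  # A is strictly better on every sentence -> 1.0
```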
We measure the distance between domains
using the H-Divergence (Ben-David et al., 2010),
which relates domain similarity to the test error of
a domain discriminator: the larger the error, le
closer the domains. Our discriminator is a SVM
independently trained for each pair of domains,
with sentence representations derived via mean
pooling from the source side representation of the
generic Transformer model. We used the scikit-
learn8 implementation with default values. Results
in Table 2 show that all domains are well separated
from all others, with REL being the furthest apart,
while TALK is slightly more central.
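A rough sketch of this discriminator protocol with scikit-learn, substituting bag-of-words features for the mean-pooled Transformer states used above; the test error err is converted here with one common proxy, 2(1 − 2·err), whose maximum of 2 is reached when the domains are perfectly separable:

```python
# Proxy for the H-divergence of Ben-David et al. (2010): train a
# classifier to separate two domains and turn its test error into a
# distance. Bag-of-words features stand in for encoder states.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def domain_distance(sents_a, sents_b, seed=0):
    texts = sents_a + sents_b
    labels = [0] * len(sents_a) + [1] * len(sents_b)
    x = CountVectorizer().fit_transform(texts)
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, labels, test_size=0.5, random_state=seed, stratify=labels)
    err = 1.0 - LinearSVC().fit(x_tr, y_tr).score(x_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)  # 2 when separable, 0 at chance

med = ["the patient received ten mg daily"] * 20
law = ["the council regulation shall apply"] * 20
print(domain_distance(med, law))  # trivially separable toy domains -> 2.0
```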
4.2 Baselines
Our baselines are standard for multi-domain sys-
tems.9 Using Transformers (Vaswani et al., 2017)
6We use truecasing and the multibleu script.
7https://github.com/neulab/compare-mt.
8https://scikit-learn.org.
9We omit domain-specific systems trained only with the
corresponding subset of the data, as these are always inferior
to the mix-domain strategy (Britz et al., 2017).
implemented in OpenNMT-tf10 (Klein et al.,
2017), we build the following systems:
• a generic model trained on a concatenation
of all corpora (Mixed). We develop two
versions11 of this system, one where the
domain unbalance reflects the distribution
of our training data given in Table 1
(Mixed-Nat) and one where all domains
are equally represented in training (Mixed-
Bal). The former is the best option when
the train mixture Ds is also expected in
testing; the latter should be used when the
test distribution is uniform across domains.
Accordingly, we report two aggregate scores:
a weighted average reflecting the training
distribution, and an unweighted average,
meaning that
test domains are equally
important.
• fine-tuned models (Luong and Manning,
2015; Freitag and Al-Onaizan, 2016), based
on the Mixed-Nat system, further trained
on each domain for at most 20,000 iterations,
with early stopping when the dev BLEU stops
increasing. The full fine-tuning (FT-Full)
procedure may update all the parameters of
the initial generic model, resulting in six
systems adapted for one domain, with no
parameter-sharing across domains.
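The fine-tuning-with-early-stopping recipe described above can be sketched as follows; train_step, dev_bleu, and save are hypothetical callbacks standing in for the actual toolkit, not OpenNMT-tf functions:

```python
# Sketch of the FT-Full recipe: starting from a generic checkpoint,
# continue training on one domain and keep the checkpoint with the
# best dev BLEU, stopping early when it no longer improves.

def fine_tune(train_step, dev_bleu, save, max_steps=20000,
              eval_every=1000, patience=3):
    best, bad_evals = float("-inf"), 0
    for step in range(1, max_steps + 1):
        train_step()                     # one update on in-domain data
        if step % eval_every == 0:
            score = dev_bleu()
            if score > best:
                best, bad_evals = score, 0
                save()                   # keep the best checkpoint
            else:
                bad_evals += 1
                if bad_evals >= patience:
                    break                # dev BLEU stopped increasing
    return best

# Toy run: a "dev BLEU" trace that rises, then plateaus.
trace = iter([10.0, 12.0, 12.5, 12.4, 12.3, 12.2, 12.1])
best = fine_tune(lambda: None, lambda: next(trace), lambda: None,
                 max_steps=7000)
print(best)  # -> 12.5
```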
All models use embeddings and hidden
layers of dimension 512. Transformers
contain 8 attention heads in each of the 6+6
layers; the inner feedforward layer contains 2,048
cells. The adapter-based systems (see below)
10https://github.com/OpenNMT/OpenNMT-tf.
11In fact three: to enable a fair comparison with WDCMT,
an RNN-based variant is also trained and evaluated. This
system appears as Mixed-Nat-RNN in Table 3.
additionally use an adaptation block in each
layer, composed of a two-layer perceptron, with
an inner ReLU activation function operating on
normalized entries of dimension 1,024. Training
uses batches of 12,288 tokens, Adam with
parameters β1 = 0.9, β2 = 0.98, Noam decay
(warmup steps = 4,000), and a dropout rate of
0.1 in all layers.
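The adapter block described above can be sketched in NumPy (an illustration of the architecture with made-up random weights, not the actual OpenNMT-tf implementation):

```python
# Sketch of a residual adapter block in the style of Bapna and Firat
# (2019): layer-normalize the input, apply a two-layer perceptron with
# an inner ReLU (here 512 -> 1024 -> 512), and add a residual
# connection so the adapter can be short-cut.
import numpy as np

def adapter(x, w1, b1, w2, b2, eps=1e-6):
    # layer normalization over the feature dimension
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    z = (x - mu) / (sd + eps)
    h = np.maximum(z @ w1 + b1, 0.0)     # inner ReLU, dimension 1,024
    return x + h @ w2 + b2               # residual connection

rng = np.random.default_rng(0)
d, inner = 512, 1024
x = rng.normal(size=(3, d))
out = adapter(x,
              rng.normal(scale=0.02, size=(d, inner)), np.zeros(inner),
              rng.normal(scale=0.02, size=(inner, d)), np.zeros(d))
print(out.shape)  # -> (3, 512)
```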
4.3 Multi-Domain Systems
Our comparison of multi-domain systems includes
our own reimplementations of recent proposals
from the literature:12
• a system using domain control as in Kobus
et al. (2017): domain information is intro-
duced either as an additional token for
each source sentence (DC-Tag), or as
a supplementary feature for each word
(DC-Feat).
• a system using lexicalized domain represen-
tations (Pham et al., 2019): word embeddings
are composed of a generic and a domain
specific part (LDR);
• the three proposals of Britz et al. (2017).
TTM is a feature-based approach where the
domain tag is introduced as an extra word
on the target side. Training uses reference
tags and inference is usually performed with
predicted tags, just like for regular target
words. DM is a multi-task learner where a
domain classifier is trained on top of the MT
encoder, so as to make it aware of domain
differences; ADM is the adversarial version
of DM, pushing the encoder towards learning
domain-independent source representations.
These methods thus only use domain tags in
training.
• the multi-domain model of Zeng et al. (2018)
(WDCMT), where a domain-agnostic and
a domain-specialized representation of the
input are simultaneously processed; super-
vised classification and adversarial training
are used to compute these representations.
Again, inference does not use domain tags.13
12Further implementation details are in Appendix A.
13For this system, we use the available RNN-based system
from the authors (https://github.com/DeepLearnXMU
/WDCNMT), which does not directly compare to the
other, Transformer-based, systèmes; the improved version of
• two multi-domain versions of the approach
of Bapna and Firat (2019), denoted FT-
Res and MDL-Res, where a domain-specific
adaptation module is added to all the Trans-
former layers; within each layer, residual
connections make it possible to short-cut this adapter.
The former variant corresponds to the orig-
inal proposal of Bapna and Firat (2019) (see
also Sharaf et al., 2020). It fine-tunes the
adapter modules of a Mixed-Nat system
independently for each domain, keeping all
the other parameters frozen. The latter uses
the same architecture, but a different training
procedure and learns all parameters jointly
from scratch with a mix-domain corpus.
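On the data side, tag-based approaches such as DC-Tag reduce to a simple preprocessing step (the tag format below is illustrative, not necessarily the one used in the experiments):

```python
# Sketch of the DC-Tag idea (Kobus et al., 2017): prepend an artificial
# domain token to each source sentence, so that a standard
# encoder-decoder can condition its translation on the domain.

def add_domain_tag(source_sentence, domain):
    return f"<{domain}> {source_sentence}"

corpus = [("the patient received the drug", "MED"),
          ("the council shall adopt the act", "LAW")]
tagged = [add_domain_tag(s, d) for s, d in corpus]
print(tagged[0])  # -> <MED> the patient received the drug
```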
This list includes systems that slightly depart
from our definition of MDMT: Standard imple-
mentations of TTM and WDCMT rely on inferred,
rather than on gold, domain tags, which may
somewhat affect their predictions; DM and ADM
make no use of domain tags at all. We did not
consider the proposal of Farajian et al. (2017b),
cependant, which performs on-the-fly tuning for
each test sentence and diverges more strongly
from our notion of MDMT.
5 Results and Discussion
5.1 Performance of MDMT Systems
In this section, we discuss the basic performance of
MDMT systems trained and tested on six domains.
Results are in Table 3. As expected, balancing data
in the generic setting makes a great difference
(the unweighted average is 2 BLEU points better,
notably owing to the much better results for REL).
As explained above, this setting should be the
baseline when the test distribution is assumed to
be balanced across domains. As all other systems
are trained with an unbalanced data distribution,
we use the weighted average to perform global
comparisons.
Fine-tuning each domain separately yields a
better baseline, outperforming Mixed-Nat for
all domains, with significant gains for domains
that are distant from MED: REL, IT, BANK, LAW.
All MDMTs (except DM and ADM) slightly
improve over Mixed-Nat (for most domains),
but these gains are rarely significant. Among
systems using an extra domain feature, DC-
Tag has a small edge over DC-Feat and also
Su et al. (2019) seems to produce comparable, albeit slightly
improved, résultats.
Model / Domain            MED    LAW    BANK   TALK   IT     REL    wAVG   AVG
Mixed-Nat      [65m]      37.3   54.6   50.1   33.5   43.2   77.5   41.1   49.4
Mixed-Bal      [65m]      35.3   54.1   52.5   31.9   44.9   89.5   40.3   51.4
FT-Full        [6×65m]    37.7   59.2   54.5   34.0   46.8   90.8   42.7   53.8
DC-Tag         [+4k]      38.1   55.3   49.9   33.2   43.5   80.5   41.6   50.1
DC-Feat        [+140k]    37.7   54.9   49.5   32.9   43.6   79.9   41.4   49.9
LDR            [+1.4m]    37.0   54.7   49.9   33.9   43.6   79.9   40.9   49.8
TTM            [+4k]      37.3   54.9   49.5   32.9   43.6   79.9   41.0   49.7
DM             [+0]       35.6   49.5   45.6   29.9   37.1   62.4   38.1   43.4
ADM            [+0]       36.4   53.5   48.3   32.0   41.5   73.4   38.9   47.5
FT-Res         [+12.4m]   37.3   57.9   53.9   33.8   46.7   90.2   42.3   53.3
MDL-Res        [+12.4m]   37.9   56.0   51.2   33.5   44.4   88.3   42.0   51.9
Mixed-Nat-RNN  [51m]      36.8   53.8   47.2   30.0   35.7   60.2   39.2   44.0
WDCMT          [73m]      36.0   53.3   48.8   31.1   38.8   58.5   39.0   44.4

Table 3: Translation performance of MDMT systems based on the same Transformer (top) or RNN
(bottom) architecture. The former contains 65m parameters, the latter has 51m. For each system,
we report the number of additional domain-specific parameters, BLEU scores for each domain,
domain-weighted (wAVG) and unweighted (AVG) averages. For weighted averages, we take the
domain proportions from Table 1. Boldface denotes significant gains with respect to Mix-Nat
(or Mix-Nat-RNN, for WDCMT), underline denotes significant losses.
requires fewer parameters; it also outperforms
TTM, which, however, uses predicted rather than
gold domain tags. TTM is also the best choice
among the systems that do not use domain tags
in inference. The best contenders overall are FT-
Res and MDL-Res, which significantly improve
over Mixed-Nat for a majority of domains,
and are the only ones to clearly fulfill [P1-
LAB]; WDCMT also improves on three domains,
but regresses on one. The use of a dedicated
adaptation module thus seems better than feature-
based strategies, but yields a large increase of the
number of parameters. The effect of the adaptation
layer is especially significant for small domains
(BANK, IT, and REL).
All systems fail
to outperform fine-tuning,
sometimes by a wide margin, especially for an
‘‘isolated’’ domain like REL. This might be due
to the fact that domains are well separated (cf.
Section 4.1) and are hardly helping each other. In
this situation, MDMT systems should dedicate a
sufficient number of parameters to each domain,
so as to close the gap with fine-tuning.
5.2 Redefining Domains
Table 4 summarizes the results of four experiments
where we artificially redefine the boundaries of
domains, with the aim to challenge requirements
[P2-TUN], [P3-HET], and [P4-ERR]. In the first
three, we randomly split one corpus in two
parts and proceed as if this corresponded to two
actual domains. An MD system should detect that
these two pseudo-domains are mutually beneficial
and should hardly be affected by this change
with respect to the baseline scenario (no split).
In this situation, we expect MDMT to even
surpass fine-tuning separately on each of these
dummy domains, as MDMT exploits all data,
while fine-tuning focuses only on a subpart.
In testing, we decode the test set twice, once
with each pseudo-domain tag. This makes no
difference for TTM, DM, ADM, and WDCMT, which
do not use domain tags in testing. In the merge
experiment, we merge two corpora in training,
in order to assess the robustness with respect
to heterogeneous domains [P3-HET]. We then
translate the two corresponding tests with the
same (merged) system.
Our findings can be summarized as follows.
For the split experiments, we see small variations
that can be positive or negative compared to
the baseline situation, but these are hardly sig-
nificant. All systems show some robustness with
respect to fuzzy domain boundaries; this is mostly
notable for ADM, suggesting that when domains are
Set-up        Split MED       Split MED       Split LAW       Merge           Wrong
              (0.5 / 0.5)     (0.25 / 0.75)   (0.5 / 0.5)     BANK+LAW        rnd     NEW
Model         MED1    MED2    MED1    MED2    LAW1    LAW2    LAW     BANK    ALL     NEWS
FT-Full       −0.1    −0.6    −1.5    −0.2    −2.3    −5.1    −1.6    −1.4    −19.6   −3.3
DC-Tag        −0.2    −0.3    +0.1    +0.2    −0.4    −0.4    −0.5    −0.4    −13.4   −1.7
DC-Feat       −0.5    +0.3     0.0    +0.3    +0.3    +0.3    +0.3    +0.1    −14.2   −1.8
LDR           +0.1    +0.4    +0.1    +0.4     0.0     0.0     0.0    +0.1    −12.0   −1.4
TTM (*)       −0.2    −0.2    −0.2    −0.2    −0.3    −0.3     0.0    −0.1     0.0    −0.3
DM (*)        −0.3    −0.3    +0.4    +0.4    +0.3    +0.3     0.0    −0.9    +0.1    +0.9
ADM (*)       +0.6    +0.6    +0.4    +0.4    +0.4    +0.4    +0.1    −0.4     0.0    −0.2
FT-Res        −0.1    −0.4    −0.3    −0.3    −2.2    −2.9    −2.4    −3.2    −13.3   −3.0
MDL-Res       −0.2    −0.1    +0.2    +0.0    −0.9    −0.9    +0.7    −0.3    −18.6   −1.3
WDCMT (*)     −0.0    −0.0    +0.2    +0.2    +0.8    +0.8    −0.4    −0.8    +0.2     0.0

Table 4: Translation performance with variable domain definitions. In the Split/Merge experiments, we
report BLEU differences for the related test set(s). Underline denotes significant loss when domains
are changed wrt. the baseline situation; bold for a significant improvement over FT-Full; (*) tags
systems ignoring test domains.
close, ignoring domain differences is effective.
By contrast, FT-Full incurs clear losses across
the board, especially for the small data condition
(Miceli Barone et al., 2017). Even in this very
favourable case, cependant, very few MDMT sys-
tems are able to significantly outperform FT-
Full and this is only observed for the smaller
part of the MED domain. The merge condition
is hardly different, with again large losses for
FT-full and FT-Res, and small variations
for all systems. We even observe some rare
improvements with respect to the situation where
we use actual domains.
5.2.1 Handling Wrong or Unknown Domains
In the last two columns of Table 4, we report the
drop in performance when the domain information
is not correct. Dans le premier (RND), we use test data
from the domains seen in training, presented with a
random domain tag. In this situation, the loss with
respect to using the correct tag is generally large
(plus que 10 BLEU points), showing an overall
failure to meet requirement [P4-ERR], except for
systems that ignore domain tags in testing.
In the second (NEW), we assess [P5-UNK] par
translating sentences from a domain unseen in
entraînement (NEWS). For each sentence, we auto-
matically predict the domain tag and use it for
decoding.14 In this configuration, again, systèmes
using domain tags during inference perform
poorly, significantly worse than the Mixed-Nat
baseline (BLEU=23.5).

5.2.2 Handling Growing Numbers of Domains

Another set of experiments evaluates the ability
to dynamically handle supplementary domains
(requirement [P6-DYN]) as follows. Starting with
the existing MD systems of Section 5.1, nous
introduce an extra domain (NEWS) and resume
training with this new mixture of data15 for 50,000
additional iterations. We contrast this approach
with training all systems from scratch and report
differences in performance in Figure 1 (see also
Tableau 7 in Appendix B).16 We expect that MDMT
systems should not be too significantly impacted
by the addition of a new domain and should reach
about the same performance as when training
with this domain from scratch. From a practical
viewpoint, dynamically integrating new domains
is straightforward for DC-Tag, DC-Feat, ou

14Domain tags are assigned as follows: we train a language
model for each domain and assign tags on a per-sentence
basis based on the language model log-probability (assuming
uniform domain priors). The domain classifier has an average
prediction error of 16.4% for in-domain data.
15The design of a proper balance between domains in
training is critical for achieving optimal performance: as our
goal is to evaluate all systems in the same conditions, nous
consider a basic mixing policy based on the new training
distribution. This is detrimental to the small domains, pour
which the ‘‘negative transfer’’ effect is stronger than for
larger domains.
16WDCMT results are excluded from this table, comme
resuming training proved difficult to implement.
Chiffre 1: Ability to handle a new domain. We report BLEU scores for a complete training session with seven
domains, as well as differences (in blue) with training with six domains (from Table 3); et (in red) differences
with continual training.
TTM, for which new domains merely add new
labels. It is less easy for DM, ADM, and WDCMT,
which include a built-in domain classifier whose
outputs have to be pre-specified, ou, for LDR, FT-
Res, and MDL-Res, for which the number of
possible domains is built in the architecture and
has to be anticipated from the start. This makes a
difference between domain-bounded systems, pour
which the number of domains is limited, and truly
open-domain systems.
We can first compare the results of coldstart
training with six or seven domains in Table 7:
A first observation is that
the extra training
data is hardly helping for most domains, except
for NEWS, where we see a large gain, et pour
TALK. The picture is the same when one looks
at MDMTs, where only the weakest systems
(DM, ADM) seem to benefit from more (out-of-
domain) data. Comparing now the coldstart with
the warmstart scenario, we see that the former is
always significantly better for NEWS, as expected,
and that resuming training also negatively impacts
the performance for other domains. This happens
notably for DC-Tag, TTM, and ADM. In this setting
MDL-Res and DM show the smallest average loss,
with the former achieving the best balance of
training cost and average BLEU score.
5.3 Automatic Domains
Dans cette section, we experiment with automatic
domains, obtained by clustering sentences into
k = 30 classes using the k-means algorithm based
on generic sentence representations obtained via
mean pooling (cf. Section 4.1). This allows us
to evaluate requirement [P7-scale], entraînement, et
testing our systems as if these domains were
fully separated. Many of these clusters are mere
splits of the large MED, while a smaller number of
classes are mixtures of two (ou plus) existing
domains (full details are in Appendix C). Nous
are thus in a position to reiterate, at a larger
scale, the measurements of Section 5.2 and test
whether multi-domain systems can effectively
take advantage of the cross-domain similarities
and eventually perform better than fine-tuning.
The results in Table 5 also suggest that MDMT
can surpass fine-tuning for the smaller clusters;
for the large clusters, this is no longer true. Le
complete table (in Appendix C) shows that this
Model/Clusters  Train size  Mixed-Nat  FT-Full  FT-Res  MDL-Res  DC-Feat  DC-Tag  TTM   ADM   DM    LDR
10 petit        29.3k       68.3       70.0     70.7    71.2     70.6     53.1    67.3  69.8  67.0  70.2
10 mid          104.7k      44.8       48.0     46.0    45.7     44.8     44.3    44.5  43.7  41.6  44.5
10 grand        251.1k      50.4       52.9     52.0    51.3     49.6     43.2    49.1  48.5  44.3  49.5
Avg             128.4k      54.5       57.0     56.2    56.1     55.0     46.9    53.6  54.0  51.0  54.7
Tableau 5: BLEU scores computed by merging the 10 smaller, moyen, and larger cluster test sets.
Best score for each group is in boldface. For the small clusters, full-fine tuning is outperformed by
several MDMT systems – see details in Appendix C.
Domain/Model  MED   LAW   BANK  TALK  IT    REL   wAVG  AVG
DC-Tag        38.5  54.0  49.0  33.6  42.2  76.7  41.6  49.0
DC-Feat       37.3  54.2  49.3  33.6  41.9  75.8  40.8  48.7
LDR           37.4  54.1  48.7  32.5  41.4  75.9  39.1  48.3
TTM           37.4  53.7  48.9  32.8  41.3  75.8  40.7  48.3
DM            35.4  49.3  45.2  29.7  37.1  60.0  37.8  42.8
ADM           36.1  53.5  48.0  32.0  41.1  72.1  39.5  47.1
FT-Res        37.5  55.7  51.1  33.1  44.1  86.7  41.6  51.4
MDL-Res       37.3  55.5  50.2  32.2  42.1  86.7  41.2  50.7
WDCMT         35.6  53.1  48.4  30.5  37.7  56.0  38.5  43.6
Tableau 6: Translation performance with automatic domains, computed with the original test sets.
Significance tests are for comparisons with the six-domain scenario (Tableau 3).
effect is more visible for small subsets of the
medical domain.
Enfin, Tableau 6 reports the effect of using
automatic domains for each of the six test sets:
Each sentence was first assigned to an automatic
class, translated with the corresponding multi-
domain system with 30 classes; aggregate numbers
were then computed, and contrasted with the six-
domain scenario. Results are clear and confirm
previous observations: Even though some clusters
are very close, the net effect is a loss in perfor-
mance for almost all systems and conditions. Dans
this setting, the best MDMT in our pool (MDL-
Res) is no longer able to surpass the Mix-Nat
baseline.
6 Related Work
The multi-domain training regime is more the
norm than the exception for natural
langue
traitement (Dredze and Crammer, 2008; Finkel
and Manning, 2009), and the design of multi-
domain systems has been proposed for many
language processing tasks. We focus here ex-
clusively on MD machine translation, keeping
in mind that similar problems and solutions
(parameter sharing, instance selection / weighting,
adversarial training, etc.) have been studied in
d'autres contextes.
Multi-domain translation was already proposed
for statistical MT, either considering as we do
multiple sources of training data (par exemple., Banerjee
et coll., 2010; Clark et al., 2012; Sennrich et al.,
2013; Huck et al., 2015), or domains made of
multiple topics (Eidelman et al., 2012; Hasler
et coll., 2014). Two main strategies were considered:
instance-based, involving a measure of similarities
between train and test domains; feature-based,
where domain/topic labels give rise to additional
features.
The latter strategy has been widely used in
NMT: Kobus et al. (2017) inject an additional
domain feature in their seq2seq model, either
in the form of an extra (initial) domain-token
or in the form of an additional domain-feature
associated to each word. These results are
reproduced by Tars and Fishel
(2018), OMS
also consider automatically induced domain tags.
This technique also helps control the style of
MT outputs in Sennrich et al. (2016un) et
Niu et al. (2018), and to encode the source or
target languages in multilingual MT (Firat et al.,
2016; Johnson et al., 2017). Domain control can
also be performed on the target side, as in Chen
et autres. (2016), where a topic vector describing
the whole document serves as an extra context
in the softmax layer of the decoder. Such ideas
are further developed in Chu and Dabre (2018)
and Pham et al. (2019), where domain differences
and commonalities are encoded in the network
architecture: Some parameters are shared across
domains, while others are domain-specific.
Techniques proposed by Britz et al. (2017)
aim to ensure that domain information is actually
used in a mix-domain system. Three methods are
considered, using either domain classification (ou
domain normalization, via adversarial training)
on the source or target side. There is no clear
winner in any of the three language pairs
considered. One contribution of this work is
the idea of normalizing representations through
adversarial training, so as to make the mixture of
heterogeneous data more effective; representation
normalization has since proven a key ingredient
in multilingual transfer learning. The same basic
techniques (parameter sharing, automatic domain
identification / normalization) are simultaneously
at play in Zeng et al. (2018) and Su et al. (2019):
Dans cette approche, the lower layers of the MT
use auxiliary classification tasks to disentangle
domain specific representations on the one hand
from domain-agnostic representations on the other
main. These representations are then processed as
two separate inputs, then recombined to compute
the translation.
Another parameter-sharing scheme is in Jiang
et autres. (2019), which augments a Transformer
model with domain-specific heads, whose con-
tributions are regulated at the word/position
level: Some words have ‘‘generic’’ use and rely
on mixed-domain heads, whereas for some
other words it is preferable to use domain-
specific heads, thereby reintroducing the idea of
ensembling at the core of Huck et al. (2015)
and Saunders et al. (2019). The results for
three language pairs outperform several standard
baselines for two-domain systems (in fr:en and
de:dans) and a four-domain system (zh:dans).
Enfin, Farajian et al.
(2017b), Li et al.
(2018), and Xu et al. (2019) adopt a different
strategy. Each test sentence triggers the selection
of a small set of related instances; using these,
a generic NMT is tuned for some iterations,
before delivering its output. This approach entirely
dispenses with the notion of domain and relies
on data selection techniques to handle data
heterogeneity.
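The retrieval step of this strategy can be sketched as follows. Word-overlap Jaccard similarity is used here as a deliberately simple stand-in for the similarity measures in the cited work; the retrieved pairs would then serve as a mini-batch for a few fine-tuning iterations of a generic model before translating the query.

```python
def retrieve_neighbors(query, corpus, n=4):
    """Select the n training pairs whose source side is most similar to
    the test sentence; `corpus` is a list of (source, cible) pairs."""
    def jaccard(un, b):
        sa, sb = set(un.split()), set(b.split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    return sorted(corpus, key=lambda pair: jaccard(query, pair[0]),
                  reverse=True)[:n]
```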
7 Conclusion and Outlook
Dans cette étude, we have carefully reconsidered
the idea of multi-domain machine translation,
which seems to be taken for granted in many
recent studies. We have spelled out the various
motivations for building such systems and the
associated expectations in terms of system per-
formance. We have then designed a series of
requirements that MDMT systems should meet,
and proposed a series of associated test pro-
cedures. In our experiments with a representative
sample of MDMTs, we have found that most
requirements were hardly met in our experimental
conditions. While MDMT systems are able to
outperform the mixed-domain baseline, at least
for some domains, they all fall short of matching
the performance of fine-tuning on each individual
domain, which remains the best choice in multi-
source single domain adaptation. As expected
cependant, MDMTs are less brittle than fine-tuning
when domain frontiers are uncertain, and can,
to a certain extent, dynamically accommodate
additional domains, this being especially easy
for feature-based approaches. Our experiments
finally suggest that all methods show decreasing
performance when the number of domains or the
diversity of the domain mixture increases.
Two other main conclusions can be drawn
from this study: D'abord, it seems that more work
is needed to make MDMT systems make the best
out of the variety of the available data, both to
effectively share what needs to be shared while at
the same time separating what needs to be kept
separated. We notably see two areas worthy of
further exploration: the development of parameter
sharing strategies when the number of domains
is large; and the design of training strategies that
can effectively handle a change of the training
mixture, including an increase in the number of
domains. Both problems are of practical relevance
in industrial settings. Deuxième, and maybe more
importantly, there is a general need for better
methodologies for evaluating MDMT systems,
which require system developers to
clearly spell out the testing conditions and the
associated expected distribution of testing in-
stances, and to report more than comparisons with
simple baselines on a fixed and known handful of
domains.
Remerciements
The work presented in this paper was partially
supported by the European Commission under
contract H2020-787061 ANITA.
This work was granted access to the HPC
resources of [TGCC/CINES/IDRIS] under the
allocation 2020-[AD011011270] made by GENCI
(Grand Equipement National de Calcul Intensif).
Les références
Naveen Arivazhagan, Ankur Bapna, Orhan
Firat, Dmitry Lepikhin, Melvin Johnson,
Maxim Krikun, Mia Xu Chen, Yuan Cao,
George Foster, Colin Cherry, Wolfgang
Macherey, Zhifeng Chen, and Yonghui Wu.
2019. Massively multilingual neural machine
translation in the wild: Findings and challenges.
arXiv e-prints, abs/1907.05019.
Amittai Axelrod, Xiaodong He, and Jianfeng
Gao. 2011. Domain adaptation via pseudo
in-domain data selection. In Proceedings of
the Conference on Empirical Methods in
Natural Language Processing, EMNLP ’11,
pages 355–362. Édimbourg, United Kingdom.
Pratyush Banerjee, Jinhua Du, Baoli Li, Sudip
Kumar Naskar, Andy Way, and Josef van
Genabith. 2010. Combining multi-domain
statistical machine translation models using
automatic classifiers. In Proceedings of the
9th Conference of the Association for Machine
Translation in the Americas, AMTA 2010.
Denver, CO, Etats-Unis.
Ankur Bapna and Orhan Firat. 2019. Sim-
ple, scalable adaptation for neural machine
translation. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessation, EMNLP-IJCNLP, pages 1538–1548,
Hong Kong, Chine. Association for Computa-
tional Linguistics. EST CE QUE JE: https://doi.org
/10.18653/v1/D19-1165
Shai Ben-David, John Blitzer, Koby Crammer,
Alex Kulesza, Fernando Pereira, and Jenn
Wortman. 2010. A theory of learning from
different domains. Machine Learning, 79(1):
151–175. EST CE QUE JE: https://est ce que je.org/10
.1007/s10994-009-5152-4
Nicola Bertoldi and Marcello Federico. 2009.
Domain adaptation for statistical machine
translation with monolingual resources. Dans
Proceedings of the Fourth Workshop on Sta-
tistical Machine Translation, pages 182–189,
Athens, Grèce. Association for Computa-
tional Linguistics. EST CE QUE JE: https://doi.org
/10.3115/1626431.1626468
John Blitzer. 2007. Domain Adaptation of Natu-
ral Language Processing Systems. Ph.D. thesis,
School of Computer Science, Université de
Pennsylvania.
Denny Britz, Quoc Le, and Reid Pryzant.
2017. Effective domain mixing for neural
machine translation. In Proceedings of
le
Second Conference on Machine Translation,
pages 118–126, Copenhagen, Denmark. Asso-
ciation for Computational Linguistics. EST CE QUE JE:
https://doi.org/10.18653/v1/W17
-4712
Mauro Cettolo, Christian Girardi, and Marcello
Federico. 2012. Wit3: Web inventory of
transcribed and translated talks. In Proceedings
of the 16th Conference of the European
Association for Machine Translation (EAMT),
pages 261–268. Trento, Italy.
Wenhu Chen, Evgeny Matusov, Shahram
Khadivi, and Jan-Thorsten Peter. 2016. Guided
alignment training for topic-aware neural
machine translation. In Proceedings of the
Twelfth Biennial Conference of the Association
for Machine Translation in the Americas,
AMTA 2016. Austin, Texas.
Chenhui Chu and Raj Dabre. 2018. Multilingual
and multi-domain adaptation for neural
machine translation. In Proceedings of the
24th Annual Meeting of the Association for
Natural Language Processing, NLP 2018,
pages 909–912, Okayama, Japan.
Chenhui Chu and Rui Wang. 2018. A survey
of domain adaptation for neural machine
translation. In Proceedings of the 27th Inter-
national Conference on Computational Lin-
guistics, COLING 2018, pages 1304–1319,
Santa Fe, New Mexico, Etats-Unis.
Jonathan H. Clark, Alon Lavie, and Chris Dyer.
2012. One system, many domains: Open-
domain statistical machine translation via
feature augmentation. In Proceedings of the
Tenth Biennial Conference of the Association
for Machine Translation in the Americas,
(AMTA 2012). San Diego, Californie.
Hal Daumé III and Daniel Marcu. 2006. Domain
adaptation for statistical classifiers. Journal
of Artificial
Intelligence Research (JAIR),
26:101–126. EST CE QUE JE: https://est ce que je.org/10
.1613/jair.1872
Mark Dredze and Koby Crammer. 2008. En ligne
methods for multi-domain learning and adap-
tation. In Proceedings of the 2008 Conference
on Empirical Methods in Natural Language
Processing, EMNLP, pages 689–697, Hon-
olulu, Hawaii. EST CE QUE JE: https://doi.org
/10.3115/1613715.1613801
Vladimir Eidelman, Jordan Boyd-Graber, et
Philip Resnik. 2012. Topic models for dynamic
translation model adaptation. In Proceedings of
the 50th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 115–119, Jeju Island, Korea.
Association for Computational Linguistics.
M.. Amin Farajian, Marco Turchi, Matteo Negri,
Nicola Bertoldi, and Marcello Federico.
2017un. Neural vs. phrase-based machine trans-
lation in a multi-domain scenario. En Pro-
ceedings of the 15th Conference of the
European Chapter of the Association for
Computational Linguistics: Volume 2, Short
Papers, pages 280–284, Valencia, Espagne. Asso-
ciation for Computational Linguistics. EST CE QUE JE:
https://doi.org/10.18653/v1/E17
-2045
M.. Amin Farajian, Marco Turchi, Matteo Negri,
and Marcello Federico. 2017b. Multi-domain
neural machine translation through unsu-
pervised adaptation. In Proceedings of the
Second Conference on Machine Translation,
pages 127–137, Copenhagen, Denmark. EST CE QUE JE:
https://doi.org/10.18653/v1/W17
-4713
Jenny Rose Finkel and Christopher D. Manning.
2009. Hierarchical Bayesian domain adap-
tation. In Proceedings of Human Language
Technologies: Le 2009 Annual Conference
de
le
the North American Chapter of
Association for Computational Linguistics,
pages 602–610, Boulder, Colorado. EST CE QUE JE:
https://doi.org/10.3115/1620754
.1620842
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio.
2016. Multi-way, multilingual neural machine
translation with a shared attention mechanism.
In Proceedings of the 2016 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 866–875. Association for
Computational Linguistics. EST CE QUE JE: https://
doi.org/10.18653/v1/N16-1101
George Foster and Roland Kuhn. 2007. Mixture-
model adaptation for SMT. In Proceedings of
the Second Workshop on Statistical Machine
Translation, pages 128–135, Prague, Czech
Republic.
Markus Freitag and Yaser Al-Onaizan. 2016.
Fast domain adaptation for neural machine
translation. CoRR, abs/1612.06897.
Eva Hasler, Phil Blunsom, Philipp Koehn, et
Barry Haddow. 2014. Dynamic topic adapta-
tion for phrase-based MT. In Proceedings of the
14th Conference of the European Chapter of
the Association for Computational Linguistics,
pages 328–337, Gothenburg, Sweden, Asso-
ciation for Computational Linguistics. EST CE QUE JE:
https://doi.org/10.3115/v1/E14
-1035
Judy Hoffman, Mehryar Mohri, and Ningshan
Zhang. 2018. Algorithms and theory for
multiple-source adaptation. In S. Bengio, H.
Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, et R. Garnett, editors, Advances in
Neural Information Processing Systems 31,
pages 8246–8256, Curran Associates, Inc.
Matthias Huck, Alexandra Birch, and Barry
Haddow. 2015. Mixed domain vs. multi-domain
statistical machine translation. In Proceedings
of the Machine Translation Summit, AVEC
Summit XV, pages 240–255. Miami, Florida.
Ann Irvine, John Morgan, Marine Carpuat,
Hal Daumé III, and Dragos Munteanu. 2013.
Measuring machine translation errors in new
domains. Transactions of the Association for
Computational Linguistics, 1:429–440. EST CE QUE JE:
https://doi.org/10.1162/tacl_a_00239
Haoming Jiang, Chen Liang, Chong Wang, et
Tuo Zhao. 2019. Multi-domain neural machine
translation with word-level adaptive layer-
wise domain mixing. CoRR, abs/1911.02692.
EST CE QUE JE: https://doi.org/10.18653/v1
/2020.acl-main.165, PMID: 31986961
Jing Jiang and ChengXiang Zhai. 2007. Instance
weighting for domain adaptation in NLP. Dans
Proceedings of the 45th Annual Meeting of
the Association of Computational Linguistics,
pages 264–271, Prague, Czech Republic.
Association for Computational Linguistics.
Joaquin Quionero-Candela, Masashi Sugiyama,
Anton Schwaighofer, and Neil D. Lawrence, edi-
tors. 2008. Dataset Shift in Machine Learning,
Neural Information Processing series. AVEC
Presse. EST CE QUE JE: https://doi.org/10.7551
/mitpress/9780262170055.001.0001
Melvin Johnson, Mike Schuster, Quoc Le, Maxim
Krikun, Yonghui Wu, Zhifeng Chen, Nikhil
Thorat, Fernanda Viégas, Martin Wattenberg,
Greg Corrado, Macduff Hughes, and Jeffrey
Dean. 2017. Google’s multilingual neural ma-
chine translation system: Enabling zero-shot
translation. Transactions of the Association for
Computational Linguistics, 5:339–351. EST CE QUE JE:
https://doi.org/10.1162/tacl_a_00065
Mahesh Joshi, Mark Dredze, William W.
Cohen, and Carolyn P. Rose. 2012. Multi-
domain learning: When do domains matter?
In Empirical Methods in Natural Language
Processing (EMNLP), pages 1302–1312.
Guillaume Klein, Yoon Kim, Yuntian Deng,
Jean Senellart, and Alexander Rush. 2017.
OpenNMT: Open-source toolkit for neural
machine translation. In Proceedings of ACL
2017, System Demonstrations, pages 67–72,
Vancouver, Canada. Association for Compu-
tational Linguistics. EST CE QUE JE: https://est ce que je
.org/10.18653/v1/P17-4012
Catherine Kobus, Josep Crego, and Jean Senellart.
2017. Domain control for neural machine trans-
lation. In Proceedings of the International Con-
ference Recent Advances in Natural Language
Processing, RANLP 2017, pages 372–378,
Varna, Bulgaria. EST CE QUE JE: https://doi.org
/10.26615/978-954-452-049-6 049
Philipp Koehn. 2004. Statistical significance tests
for machine translation evaluation. En Pro-
ceedings of the 2004 Conference on Empirical
Methods in Natural Language Processing,
pages 388–395, Barcelona, Espagne. Association
for Computational Linguistics.
Xiaoqing Li,
Jiajun Zhang, and Chengqing
pour
Zong. 2018. One sentence one model
neural machine translation. In Proceedings
of the Eleventh International Conference on
Language Resources and Evaluation (LREC
2018). Miyazaki, Japan. European Language
Resources Association (ELRA).
Minh-Thang Luong and Christopher D. Manning.
2015. Stanford neural machine translation
systems for spoken language domain. Dans
Proceedings of the International Workshop
on Spoken Language Translation, IWSLT, Da
Nang, Vietnam.
Yishay Mansour, Mehryar Mohri, and Afshin
Rostamizadeh. 2009un. Domain adaptation with
multiple sources. In D. Koller, D. Schuurmans,
Oui. Bengio, and L. Bottou, editors, Advances
in Neural Information Processing Systems 21,
pages 1041–1048, Curran Associates, Inc.
Yishay Mansour, Mehryar Mohri, and Afshin
Rostamizadeh. 2009b. Multiple source adapta-
tion and the R´enyi divergence. In Proceedings
of the 25th Conference on Uncertainty in Arti-
ficial Intelligence, UAI 2009, pages 367–374.
Antonio Valerio Miceli Barone, Barry Haddow,
Ulrich Germann, and Rico Sennrich. 2017.
Regularization techniques for fine-tuning in
neural machine translation. In Proceed-
ings of the 2017 Conference on Empirical
Methods in Natural Language Processing,
pages 1489–1494, Copenhagen, Denmark.
Association for Computational Linguistics,
EST CE QUE JE: https://doi.org/10.18653/v1
/D17-1156
Paul Michel
and Graham Neubig.
2018.
Extreme adaptation for personalized neural
machine translation. In Proceedings of
le
56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 312–318, Melbourne, Australia.
Association for Computational Linguistics.
EST CE QUE JE: https://doi.org/10.18653/v1
/P18-2050
Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul
Michel, Danish Pruthi, Xinyi Wang, et Jean
Wieting. 2019. compare-mt: A tool for holistic
comparison of language generation systems.
CoRR, abs/1903.07926. EST CE QUE JE: https://est ce que je
.org/10.18653/v1/N19-4007
Xing Niu, Sudha Rao, and Marine Carpuat.
2018. Multi-task neural models for translating
between styles within and across languages.
In Emily M. Cintreuse, Leon Derczynski, et
Pierre Isabelle, editors, Proceedings of the 27th
International Conference on Computational
Linguistics, COLING, pages 1008–1021, Santa
Fe, New Mexico, Etats-Unis.
Yonatan Oren, Shiori Sagawa, Tatsunori
Hashimoto, and Percy Liang. 2019. Distribu-
tionally robust language modeling. En Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 4227–4237, Hong Kong, Chine. Asso-
ciation for Computational Linguistics. EST CE QUE JE:
https://doi.org/10.18653/v1/D19
-1432
Sinno Jialin Pan and Qiang Yang. 2010. UN
survey on transfer learning. IEEE Transactions
on Knowledge and Data Engineering, 22(10):
1345–1359. EST CE QUE JE: https://est ce que je.org/10
.1109/TKDE.2009.191
Etats-Unis. EST CE QUE JE: https://doi.org/10.3115
/1073083.1073135
Minh Quang Pham, Josep-Maria Crego, Jean
Senellart, and François Yvon. 2019. Generic
and specialized word embeddings for multi-
domain machine translation. In Proceedings of
the 16th International Workshop on Spoken
Language Translation,
IWSLT, page 9p,
Hong-Kong, CN.
Hassan Sajjad, Nadir Durrani, Fahim Dalvi,
Yonatan Belinkov, and Stephan Vogel. 2017.
Neural machine translation training in a multi-
domain scenario. In Proceedings of the 14th
International Workshop on Spoken Language
Translation, IWSLT 2017, Tokyo, Japan.
Danielle Saunders, Felix Stahlberg, and Bill
Byrne. 2019. UCAM biomedical
translation
at WMT19: Transfer learning multi-domain
ensembles. In Proceedings of the Fourth Con-
ference on Machine Translation (Volume 3:
Shared Task Papers, Day 2), pages 169–174,
Florence, Italy. Association for Computa-
tional Linguistics. EST CE QUE JE: https://doi.org
/10.18653/v1/W19-5421
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016un. Controlling politeness in neu-
ral machine translation via side constraints. Dans
Actes du 2016 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 35–40, San Diego, Cali-
fornia. Association for Computational Linguis-
tics. EST CE QUE JE: https://doi.org/10.18653
/v1/N16-1005
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016b. Neural machine translation
of rare words with subword units. Dans
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1715–1725,
Berlin, Allemagne. EST CE QUE JE: https://doi.org
/10.18653/v1/P16-1162
Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting
on Association for Computational Linguistics,
ACL ’02, pages 311–318, Stroudsburg, Pennsylvanie,
Rico Sennrich, Holger Schwenk, and Walid
Aransa. 2013. A multi-domain translation
model framework for statistical machine
translation. In Proceedings of the 51st An-
nual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 832–840, Sofia, Bulgaria. Association for
Computational Linguistics.
Amr Sharaf, Hany Hassan, and Hal Daumé III.
2020. Meta-learning for few-shot NMT adap-
tation. In Proceedings of the Fourth Work-
shop on Neural Generation and Translation,
pages 43–53, En ligne. Association for Computa-
tional Linguistics. EST CE QUE JE: https://doi.org
/10.18653/v1/2020.ngt-1.5
Hidetoshi Shimodaira. 2000. Improving predictive
inference under covariate shift by weighting the
log-likelihood function. Journal of Statistical
Planning and Inference, 90(2):227–244. EST CE QUE JE:
https://doi.org/10.1016/S0378
-3758(00)00115-4
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC'06, Genoa, Italy. European Language Resources Association (ELRA).
Jinsong Su, Jiali Zeng, Jun Xie, Huating Wen,
Yongjing Yin, and Yang Liu. 2019. Exploring
discriminative word-level domain contexts
for multi-domain neural machine translation.
IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI), pages 1–1. EST CE QUE JE:
https://doi.org/10.1109/TPAMI
.2019.2954406, PMID: 31751225
Sander Tars and Mark Fishel. 2018. Multi-domain
neural machine translation. In Proceedings of
the 21st Annual Conference of the European
Association for Machine Translation, EAMT,
pages 259–269, Alicante, Espagne. EAMT.
Jörg Tiedemann. 2009. News from OPUS – A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248, Borovets, Bulgaria. John Benjamins, Amsterdam/Philadelphia. DOI: https://doi.org/10.1075/cilt.309.19tie
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC'12, Istanbul, Turkey. European Language Resources Association (ELRA).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Jitao Xu, Josep Crego, and Jean Senellart.
2019. Lexical micro-adaptation for neural
machine translation. In Proceedings of the 16th
International Workshop on Spoken Language
Translation, IWSLT 2019, Hong Kong, Chine.
Jiali Zeng, Jinsong Su, Huating Wen, Yang Liu, Jun Xie, Yongjing Yin, and Jianqiang Zhao. 2018. Multi-domain neural machine translation with word-level domain context discrimination. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 447–457, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1041
Jian Zhang, Liangyou Li, Andy Way, and Qun Liu. 2016. Topic-informed neural machine translation. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, COLING 2016, pages 1807–1817, Osaka, Japan. The COLING 2016 Organizing Committee.
Appendices
UN. Description of Multi-Domain Systems
We use the following setups for MDMT systems.
• Mixed-Nat, FT-full, TTM, DC-Tag use a medium Transformer model (Vaswani et al., 2017) with the following settings: embedding size and hidden layer size are set to 512. Multi-head attention comprises
8 heads in each of the 6 layers; the inner
feedforward layer contains 2,048 cells.
Training uses a batch size of 12,288 tokens; optimization uses Adam with parameters β1 = 0.9, β2 = 0.98 and Noam decay (warmup steps = 4,000), and a dropout rate of 0.1 for all layers.
• FT-Res and MDL-res use the same
medium Transformer and add residual layers
with a bottleneck dimension of size 1,024.
• ADM, DM use the medium Transformer model and a domain classifier consisting of 3 dense layers of size 512 × 2,048, 2,048 × 2,048, and 2,048 × domain num. The first two layers of the classifier use ReLU() as activation function; the last layer uses tanh().
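With toy dimensions standing in for the 512 × 2,048, 2,048 × 2,048, and 2,048 × domain num layers above, the classifier stack can be sketched in pure Python (all names, the omitted biases, and the random initialization are illustrative):

```python
import math
import random

random.seed(0)

def dense(x, w, activation):
    """One dense layer: y_j = activation(sum_i x_i * w[i][j]); biases omitted."""
    return [activation(sum(xi * wij for xi, wij in zip(x, col)))
            for col in zip(*w)]

def rand_matrix(n_in, n_out):
    return [[random.uniform(-0.1, 0.1) for _ in range(n_out)]
            for _ in range(n_in)]

relu = lambda v: max(0.0, v)

# Toy sizes 8 -> 16 -> 16 -> num_domains stand in for the real dimensions.
num_domains = 6
w1, w2, w3 = rand_matrix(8, 16), rand_matrix(16, 16), rand_matrix(16, num_domains)

h = dense([0.3] * 8, w1, relu)          # first ReLU layer
h = dense(h, w2, relu)                  # second ReLU layer
logits = dense(h, w3, math.tanh)        # final tanh layer: one score per domain
```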
• DC-Feat uses the medium Transformer model and domain embeddings of size 4. Given a sentence of domain i in a training batch, the embedding of domain i is concatenated to the embedding of each token in the sentence.
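The concatenation step can be sketched as follows (pure Python with toy values; the function name is ours):

```python
def concat_domain_feature(token_embeddings, domain_embedding):
    """Append the same size-4 domain embedding to every token embedding."""
    return [emb + domain_embedding for emb in token_embeddings]

tokens = [[0.1] * 512 for _ in range(3)]   # 3 tokens, embedding size 512
dom = [0.5, -0.2, 0.0, 1.0]                # embedding of domain i, size 4
out = concat_domain_feature(tokens, dom)   # each token now has size 516
```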
• LDR uses the medium Transformer model and, for each token, we introduce an LDR feature of size 4 × domain num. Given a sentence of domain i ∈ [1, .., K] in the training batch, for each token of the sentence, the LDR units at indexes outside the range [4(i − 1), .., 4i − 1] are masked to 0, and the masked LDR feature is concatenated to the embedding of the token. Details are in Pham et al. (2019).
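A sketch of the LDR masking for a single token (the function name and values are ours; the feature size is 4 × domain num as above):

```python
def ldr_feature(domain_i, num_domains, values):
    """Build the masked LDR feature for a token of domain i (1-indexed):
    a vector of size 4 * num_domains where only the 4 units at indexes
    4*(i-1) .. 4*i-1 keep their values; all other units are set to 0."""
    feat = [0.0] * (4 * num_domains)
    feat[4 * (domain_i - 1): 4 * domain_i] = values
    return feat

# For domain 2 out of 3, only units 4..7 of the size-12 feature are non-zero.
f = ldr_feature(domain_i=2, num_domains=3, values=[1.0, 2.0, 3.0, 4.0])
```

The masked feature `f` would then be concatenated to the token embedding, as described above.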
• Mixed-Nat-RNN uses one bidirectional
LSTM layer in the encoder and one LSTM
layer in the decoder. The size of hidden layers
est 1,024, the size of word embeddings is 512.
• WDCNMT uses one bidirectional GRU layer in
the encoder and one GRU-conditional layer
in the decoder. The size of hidden layers is
1,024, the size of word embeddings is 512.
Training For each domain, we create train/dev/test sets by randomly splitting each corpus. We keep the validation and test sets at 1,000 lines for every domain. The learning rate is set as in Vaswani et al. (2017). For the fine-tuning procedures used for FT-full and FT-Res, we continue training with the same learning rate schedule, continuing to increment the step counter. All other MDMT systems reported in Tables 3 and 4 use a combined validation set comprising 6,000 lines, obtained by merging the six development sets. For the results in Table 7, we also append the validation set of NEWS to the multi-domain validation set. In any case, training stops when it reaches the maximum number of iterations (50,000) or when the score on the validation set does not increase for three consecutive evaluations. We average five checkpoints to obtain the final model.
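Checkpoint averaging takes the element-wise mean of the parameters of the saved checkpoints; a minimal sketch over flat parameter dictionaries (the data layout is illustrative):

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of several parameter dictionaries that share
    the same keys and shapes (parameters are flat lists here)."""
    n = len(checkpoints)
    return {k: [sum(vals) / n
                for vals in zip(*(c[k] for c in checkpoints))]
            for k in checkpoints[0]}

# Two toy checkpoints with a single parameter tensor "w".
ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
avg = average_checkpoints(ckpts)   # {"w": [2.0, 3.0]}
```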
B. Experiments with Continual Learning
Complete results for the experiments with continual learning are reported in Table 7.
C. Experiments with Automatic Domains
This experiment uses automatic domains to simulate a scenario where the number of ‘‘domains’’ is large and where some ‘‘domains’’ are close and can effectively share information. Full results are in Table 8. Cluster sizes vary from approximately 8k sentences (cluster 24) up to more than 350k sentences. More than two thirds of these clusters mostly comprise texts from a single domain (e.g., cluster 12 is predominantly MED); the remaining clusters typically mix two domains. Fine-tuning with small domains is often
outperformed by other MDMT techniques, un
issue that a better regularization strategy might
mitigate. Domain-control (DC-Feat) is very
effective for small domains, but again less so
in larger data conditions. Among the MD models,
approaches using residual adapters have the best
average performance.
Model     | MED            | LAW            | BANK            | TALK           | IT             | REL            | NEWS           | wAVG           | AVG
Mixed-Nat | 37.1 (– |+0.2) | 54.1 (– |+0.5) | 49.6 (– | – )   | 34.1 (– |−0.6) | 42.1 (– |+1.1) | 77.0 (– |+0.5) | 28.9 (– |−5.4) | 40.8 (– |+0.3) | 49.0 (– |+0.4)
DC-Tag    | 37.7 (+0.3|+0.3) | 54.5 (+0.8|−0.1) | 49.9 (−0.04|−0.6) | 34.8 (−1.6|−1.1) | 43.9 (−0.4|−1.3) | 78.8 (+1.7|−3.5) | 29.5 (−7.7|−1.4) | 41.4 (+0.2|−0.1) | 49.9 (+0.1|−1.1)
DC-Feat   | 37.4 (+0.3|−0.2) | 54.9 (−0.1|−0.1) | 50.0 (−0.3|−0.1) | 34.7 (−1.3|−0.6) | 43.9 (−0.1|−0.9) | 79.6 (+0.4|+0.3) | 28.9 (−7.3|−0.8) | 41.2 (+0.1|−0.2) | 50.1 (−0.2|−0.3)
LDR       | 37.0 (0.0|−0.6) | 54.6 (+0.1|+0.5) | 49.6 (+0.2|−0.4) | 34.3 (−0.4|−0.6) | 43.0 (+0.5|+0.5) | 77.0 (+2.9|+3.8) | 28.7 (−6.6|−0.9) | 40.8 (+0.6|+0.5) | 49.2 (+0.1|−0.4)
TTM       | 37.3 (0.0|−0.3) | 54.4 (+0.4|−0.3) | 49.6 (−0.1|−0.5) | 33.8 (−0.9|−1.1) | 42.9 (+0.6|−1.0) | 78.2 (+1.8|−4.0) | 29.1 (−5.7|−1.4) | 41.0 (0.0|−0.5) | 49.4 (+0.3|−1.2)
DM        | 36.0 (−0.4|+0.6) | 51.3 (−1.8|+0.4) | 46.8 (−1.2|+0.6) | 31.8 (−1.8|−0.1) | 39.8 (−2.6|+0.5) | 65.7 (−3.3|0.0) | 27.0 (−4.4|−1.2) | 38.9 (−0.8|+0.5) | 45.2 (−1.8|+0.3)
ADM       | 36.6 (−0.2|+0.3) | 54.2 (−0.7|−0.8) | 49.1 (−0.8|−0.8) | 32.9 (−0.9|−0.2) | 42.1 (−0.5|−0.4) | 75.7 (−2.3|−5.0) | 28.7 (−5.4|−1.9) | 40.2 (−0.5|−0.2) | 48.4 (−0.9|−1.1)
FT-Res    | 37.0 (+0.3|+0.3) | 57.6 (+0.4|+0.4) | 53.8 (+0.1|+0.1) | 34.5 (−0.7|−0.7) | 46.1 (+0.5|+0.5) | 91.1 (−0.9|−0.9) | 29.6 (−9.0|−0.6) | 42.2 (−0.1|−0.1) | 53.3 (+0.2|+0.2)
MDL-Res   | 37.7 (+0.2|−0.2) | 55.6 (+0.4|+0.5) | 51.1 (+0.1|0.0) | 34.4 (−0.9|−0.4) | 44.5 (−0.1|−0.2) | 87.5 (+0.9|−0.2) | 29.1 (−8.0|−0.8) | 41.9 (+0.1|−0.2) | 51.8 (+0.1|−0.1)

Table 7: Ability to handle a new domain. We report BLEU scores for a complete training session with seven domains, as well as differences with (left) training with six domains (from Table 3); (right) continuous training mode. Averages only take into account six domains (NEWS excluded). Underline denotes a significant loss, bold a significant gain.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
5
1
1
9
2
4
0
5
9
/
/
t
je
un
c
_
un
_
0
0
3
5
1
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Cluster   | size train/test | Mixed-Nat | FT-full | FT-Res | MDL-Res | DC-Feat | DC-Tag | TTM  | ADM  | DM   | LDR
24 [med]  | 8.1k / 3        | 90.4 | 90.4 | 90.4 | 90.4 | 100.0 | 65.6 | 100.0 | 90.4 | 100.0 | 100.0
13 [-]    | 17.3k / 52      | 67.6 | 75.4 | 76.9 | 74.3 | 74.3 | 75.0 | 54.7 | 74.7 | 75.9 | 65.9
28 [-]    | 25.6k / 54      | 71.6 | 68.7 | 72.6 | 68.1 | 70.2 | 71.0 | 42.5 | 72.0 | 71.3 | 65.6
19 [IT]   | 27.2k / 88      | 58.5 | 63.0 | 60.3 | 60.9 | 63.9 | 63.7 | 57.2 | 59.4 | 61.1 | 60.5
0 [-]     | 27.4k / 72      | 43.9 | 33.3 | 47.8 | 45.4 | 45.4 | 49.9 | 15.4 | 46.8 | 49.2 | 46.6
22 [-]    | 27.5k / 103     | 91.5 | 93.7 | 93.4 | 93.4 | 93.9 | 92.5 | 72.8 | 92.3 | 93.2 | 91.4
25 [-]    | 28.2k / 56      | 57.0 | 44.8 | 52.4 | 48.2 | 49.1 | 54.6 | 47.2 | 49.8 | 54.2 | 45.1
16 [med]  | 30.4k / 18      | 57.2 | 70.4 | 58.3 | 77.4 | 73.5 | 61.8 | 54.2 | 58.4 | 58.1 | 52.5
23 [med]  | 47.0k / 23      | 24.5 | 27.2 | 29.8 | 26.5 | 28.5 | 30.5 | 27.3 | 32.0 | 24.4 | 29.0
17 [med]  | 54.4k / 26      | 39.9 | 40.3 | 33.7 | 41.6 | 38.0 | 37.1 | 36.6 | 35.2 | 35.4 | 31.3
8 [IT]    | 61.4k / 214     | 46.9 | 53.1 | 46.7 | 55.8 | 53.6 | 48.9 | 45.1 | 48.8 | 50.9 | 43.0
1 [-]     | 68.1k / 122     | 47.2 | 47.5 | 44.9 | 48.7 | 45.1 | 46.8 | 39.1 | 45.4 | 44.2 | 40.7
7 [med]   | 91.5k / 30      | 41.3 | 35.5 | 41.8 | 41.4 | 39.9 | 41.4 | 36.5 | 37.3 | 37.1 | 40.7
11 [med]  | 93.0k / 38      | 31.6 | 42.6 | 36.6 | 31.8 | 35.4 | 36.0 | 29.6 | 36.7 | 32.7 | 26.5
29 [law]  | 109.2k / 242    | 65.9 | 69.2 | 65.9 | 67.6 | 67.7 | 66.0 | 63.8 | 65.1 | 64.7 | 62.4
27 [med]  | 109.3k / 49     | 11.0 | 9.6  | 10.6 | 9.2  | 8.7  | 10.0 | 19.4 | 7.9  | 9.4  | 10.7
5 [-]     | 109.9k / 267    | 46.3 | 47.4 | 45.7 | 46.9 | 45.4 | 44.0 | 42.9 | 43.7 | 44.3 | 40.9
6 [med]   | 133.4k / 73     | 37.2 | 38.9 | 35.9 | 38.7 | 36.8 | 37.5 | 27.5 | 38.0 | 37.2 | 31.3
26 [-]    | 134.8k / 428    | 31.8 | 30.8 | 31.2 | 31.8 | 31.2 | 31.9 | 32.6 | 32.2 | 30.5 | 29.6
15 [bank] | 136.9k / 674    | 46.5 | 51.5 | 46.0 | 47.9 | 48.0 | 46.6 | 46.0 | 45.8 | 45.7 | 42.9
4 [rel]   | 137.4k / 1016   | 77.1 | 85.3 | 75.9 | 83.5 | 83.3 | 75.8 | 46.1 | 74.2 | 73.3 | 63.2
2 [med]   | 182.6k / 85     | 70.6 | 75.8 | 68.2 | 71.7 | 69.4 | 68.2 | 67.3 | 67.3 | 68.6 | 65.6
20 [med]  | 183.0k / 71     | 47.4 | 47.2 | 46.8 | 46.8 | 47.2 | 48.4 | 47.5 | 48.8 | 47.3 | 47.1
21 [-]    | 222.8k / 868    | 38.7 | 38.8 | 37.0 | 39.0 | 37.2 | 37.5 | 35.9 | 36.9 | 37.1 | 33.4
10 [med]  | 225.4k / 115    | 40.0 | 42.6 | 40.7 | 40.0 | 38.2 | 39.9 | 35.8 | 39.5 | 39.1 | 36.3
18 [med]  | 245.0k / 106    | 57.7 | 60.3 | 55.9 | 58.7 | 58.6 | 58.4 | 56.3 | 57.3 | 56.1 | 54.9
9 [med]   | 301.6k / 145    | 37.2 | 37.3 | 37.0 | 36.5 | 36.1 | 36.4 | 37.7 | 36.4 | 35.2 | 34.2
3 [law]   | 323.5k / 680    | 50.1 | 52.0 | 49.1 | 50.8 | 50.1 | 49.1 | 48.3 | 49.0 | 48.2 | 44.4
14 [med]  | 334.0k / 146    | 31.6 | 31.4 | 31.8 | 31.9 | 33.0 | 32.5 | 34.1 | 31.4 | 32.1 | 30.5
12 [med]  | 356.4k / 148    | 36.3 | 36.6 | 36.3 | 35.9 | 35.9 | 35.8 | 37.0 | 36.4 | 35.4 | 34.2

Table 8: Complete results for the experiments with automatic domains. For each cluster, we report: the majority domain when one domain accounts for more than 75% of the class; training and test sizes; and BLEU scores obtained with the various systems used in this study. Most test sets are too small to report significance tests.