Revisiting Multi-Domain Machine Translation

MinhQuang Pham†‡, Josep Maria Crego†, François Yvon‡

‡Université Paris-Saclay, CNRS, LIMSI, 91400, Orsay, France
francois.yvon@limsi.fr
†SYSTRAN, 5 rue Feydeau, 75002 Paris, France
{minhquang.pham,josep.crego}@systrangroup.com

Abstract

When building machine translation systems, one often needs to make the best out of heterogeneous sets of parallel data in training, and to robustly handle inputs from unexpected domains in testing. This multi-domain scenario has attracted a lot of recent work that falls under the general umbrella of transfer learning. In this study, we revisit multi-domain machine translation, with the aim to formulate the motivations for developing such systems and the associated expectations with respect to performance. Our experiments with a large sample of multi-domain systems show that most of these expectations are hardly met and suggest that further work is needed to better analyze the current behaviour of multi-domain systems and to make them fully hold their promises.

1 Introduction

Data-based Machine Translation (MT), whether statistical or neural, rests on well-understood machine learning principles. Given a training sample of matched source-target sentence pairs (f, e) drawn from an underlying distribution Ds, a model parameterized by θ (here, a translation function hθ) is trained by minimizing the empirical expectation of a loss function ℓ(hθ(f), e). This approach ensures that the translation loss remains low when translating more sentences drawn from the same distribution.
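Spelled out (this display is our own restatement; the paper keeps the criterion implicit), the training objective is the usual empirical risk:

    \hat{\theta} = \operatorname*{argmin}_{\theta} \; \mathbb{E}_{(f,e) \sim D_s}\!\left[\ell\!\left(h_\theta(f), e\right)\right]
    \;\approx\; \operatorname*{argmin}_{\theta} \; \frac{1}{N}\sum_{n=1}^{N} \ell\!\left(h_\theta(f_n), e_n\right),

where the approximation replaces the expectation over Ds by an average over the N training pairs.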

Owing to the great variability of language data, this ideal situation is rarely met in practice, warranting the study of an alternative scenario, where the test distribution Dt differs from Ds. In this setting, domain adaptation (DA) methods are in order. DA has a long history in Machine Learning in general (e.g., Shimodaira, 2000; Ben-David et al., 2010; Joaquin Quionero-Candela and Lawrence, 2008; Pan and Yang, 2010) and in NLP in particular (e.g., Daumé III and Marcu, 2006; Blitzer, 2007; Jiang and Zhai, 2007). Several techniques thus exist to handle both the situations where a (small) training sample drawn from Dt is available in training, or where only samples of source-side (or target-side) sentences are available (see Foster and Kuhn [2007]; Bertoldi and Federico [2009]; Axelrod et al. [2011] for proposals from the statistical MT era, or Chu and Wang [2018] for a recent survey of DA for Neural MT).

A seemingly related problem is multi-domain (MD) machine translation (Sajjad et al., 2017; Farajian et al., 2017b; Kobus et al., 2017; Zeng et al., 2018; Pham et al., 2019), where one single system is trained and tested with data from multiple domains. MD machine translation (MDMT) corresponds to a very common situation, where all available data, no matter its origin, is used to train a robust system that performs well for any kind of new input. If the intuitions behind MDMT are quite simple, the exact specifications of MDMT systems are rarely spelled out: For instance, should MDMT perform well when the test data is distributed like the training data, when it is equally distributed across domains, or when the test distribution is unknown? Should MDMT also be robust to new domains? How should it handle domain labeling errors?

A related question concerns the relationship between supervised domain adaptation and multi-domain translation. The latter task seems more challenging as it tries to optimize MT performance for a more diverse set of potential inputs, with an additional uncertainty regarding the distribution of test data. Are there still situations where MD systems can surpass single domain adaptation, as is sometimes expected?

In this paper, we formulate in a more precise fashion the requirements that an effective MDMT system should meet (Section 2). Our first

Transactions of the Association for Computational Linguistics, vol. 9, pp. 17–35, 2021. https://doi.org/10.1162/tacl_a_00351
Action Editor: George Foster. Submission batch: 5/2020; Revision batch: 9/2020; Published 02/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

contribution is thus of methodological nature and consists of lists of expected properties of MDMT systems and associated measurements to evaluate them (Section 3). In doing so, we also shed light on new problems that arise in this context, regarding, for example, the accommodation of new domains in the course of training, or the computation of automatic domain tags. Our second main contribution is experimental and consists in a thorough reanalysis of eight recent multi-domain approaches from the literature, including a variant of a model initially introduced for DA. We show in Section 4 that existing approaches still fall short of matching many of these requirements, notably with respect to the handling of a large number of heterogeneous domains and to dynamically integrating new domains in training.

2 Requirements of Multi-Domain MT

In this section, we recap the main reasons for considering a multi-domain scenario and discuss their implications in terms of performance evaluation.

2.1 Formalizing Multi-Domain Translation

We conventionally define a domain d as a distribution Dd(x) over some feature space X that is shared across domains (Pan and Yang, 2010): In machine translation, X is the representation space for source sentences; each domain corresponds to a specific source of data, and differs from the other data sources in terms of textual genre, thematic content (Chen et al., 2016; Zhang et al., 2016), register (Sennrich et al., 2016a), style (Niu et al., 2018), and so forth. Translation in domain d is formalized by a translation function hd(y|x) pairing sentences in a source language with sentences in a target language y ∈ Y. hd is usually assumed to be deterministic (hence y = hd(x)), but can differ from one domain to the other.

A typical learning scenario in MT is to have access to samples from nd domains, which means that the training distribution Ds is a mixture Ds(x) = Σ_d λ^s_d Dd(x), with {λ^s_d, d = 1 . . . nd} the corresponding mixture weights (Σ_d λ^s_d = 1). Multi-domain learning, as defined in Dredze and Crammer (2008), further assumes that domain tags are also available in testing; the implication being that the test distribution is also a mixture Dt(x) = Σ_d λ^t_d Dd(x) of several domains, making the problem distinct from mere domain adaptation. A multi-domain learner is then expected to use these tags effectively (Joshi et al., 2012) when computing the combined translation function h(x, d), and to perform well in all domains (Finkel and Manning, 2009). This setting is closely related to the multi-source adaptation problem formalized in Mansour et al. (2009a,b) and Hoffman et al. (2018).

This definition seems to be the most accepted view of a multi-domain MT1 and one that we also adopt here. Note that in the absence of further specification, the naive answer to the MD setting should be to estimate one translation function ĥd(x) separately for each domain, then to translate using ĥ(x, d) = Σ_d' ĥd'(x) I(d' = d), where I(x) is the indicator function. We now discuss the arguments that are put forward to proceed differently.
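One way to make the objective implicit in this definition explicit (our own restatement, not spelled out in this form in the paper) is the risk under the test mixture:

    R_t(h) \;=\; \sum_{d} \lambda^t_d \; \mathbb{E}_{x \sim D_d}\!\left[\ell\!\left(h(x,d),\, h_d(x)\right)\right],

so that a good MDMT system has to balance per-domain accuracy against test mixture weights λ^t_d that are generally unknown at training time.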

2.2 Reasons for Building MDMT Systems

A first motivation for moving away from the one-domain / one-system solution is practical (Sennrich et al., 2013; Farajian et al., 2017a): When faced with inputs that are potentially from multiple domains, it is easier and computationally cheaper to develop one single system instead of having to optimize and maintain multiple engines. The underlying assumption here is that the number of domains of interest can be large, a limiting scenario being fully personalized machine translation (Michel and Neubig, 2018).

A second line of reasoning rests on linguistic properties of the translation function and contends that domain specificities are mostly expressed lexically and will primarily affect content words or multi-word expressions; function words, on the other hand, are domain agnostic and tend to remain semantically stable across domains, motivating some cross-domain parameter sharing. An MDMT system should simultaneously learn lexical domain peculiarities, and leverage cross-domain similarities to improve the translation of generic contexts and words (Zeng et al., 2018; Pham et al., 2019). It is here expected that the MDMT scenario should be more profitable when the domain mix includes domains that are closely related and can share more information.

1An exception is Farajian et al. (2017b), where test translations rely on similarity scores between test and train sentences, rather than on domain labels.

A third series of motivations is of statistical nature. The training data available for each domain is usually unevenly distributed, and domain-specific systems trained or adapted on small datasets are likely to have a high variance and generalize poorly. For some test domains, there may even be no data at all (Farajian et al., 2017a). Training mix-domain systems is likely to reduce this variance, at the expense of a larger statistical bias (Clark et al., 2012). Under this view, MDMT would be especially beneficial for domains with little training data. This is observed for multilingual MT from English: an improvement for under-resourced languages due to positive transfer, at the cost of a decrease in performance for well-resourced languages (Arivazhagan et al., 2019).

Combining multiple domain-specific MTs can also be justified for the sake of distributional robustness (Mansour et al., 2009a,b), for example, when the test mixture differs from the train mixture, or when it includes new domains unseen in training. An even more challenging case is when the MT would need to perform well for any test distribution, as studied for statistical MT in Huck et al. (2015). In all these cases, mixing domains in training and/or testing is likely to improve robustness against unexpected or adversarial test distributions (Oren et al., 2019).

A distinct line of reasoning is that mixing domains can have a positive regularization effect for all domains. By introducing variability in training, it prevents DA from overfitting the available adaptation data and could help improve generalization even for well-resourced domains. A related case is made in Joshi et al. (2012), which shows that part of the benefits of MD training is due to an ensembling effect, where systems from multiple domains are simultaneously used in the prediction phase; this effect may subsist even in the absence of clear domain separations.

To recap, there are multiple arguments for adopting MDMT, some already used in DA settings, and some original. These arguments are not mutually exclusive; however, each yields specific expectations with respect to the performance of this approach, and should also yield appropriate evaluation procedures. If the motivation is primarily computational, then a drop in MT quality with respect to multiple individual domains might be acceptable if compensated by the computational savings. If it is to improve statistical estimation, then the hope will be that MDMT will improve, at least for some under-resourced domains, over individually trained systems. If, finally, it is to make the system more robust to unexpected or adversarial test distributions, then this is the setting that should be used to evaluate MDMT. The next section discusses ways in which these requirements of MDMT systems could be challenged.

3 Challenging Multi-Domain Systems

In this section, we propose seven operational requirements that can be expected from an effective multi-domain system, and discuss ways to evaluate whether these requirements are actually met. All these evaluations rest on comparisons of translation performance, and do not depend on the choice of a particular metric. To make our results comparable with the literature, we will only use the BLEU score (Papineni et al., 2002) in Section 4, noting it may not be the best yardstick to assess subtle improvements of lexical choices that are often associated with domain-adapted systems (Irvine et al., 2013). Other important figures of merit for MDMT systems are the computational training cost and the total number of parameters.

3.1 Multi-Domain Systems Should Be Effective

A first expectation is that MDMT systems should perform well in the face of mixed-domain test data. We thus derive the following requirements.

[P1-LAB] An MDMT should perform better than the baseline, which disregards domain labels, or reassigns them in a random fashion (Joshi et al., 2012). Evaluating this requirement is a matter of a mere comparison, assuming the test distribution of domains is known: If all domains are equally important, performance averages can be reported; if they are not, weighted averages should be used instead.
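As a concrete illustration of the two reporting modes, a small helper of ours (the proportions below are the rounded values from Table 1, so the weighted result only approximates the wAVG column of Table 3):

    def average_bleu(scores, weights=None):
        """Unweighted average if weights is None, otherwise a domain-weighted average."""
        if weights is None:
            return sum(scores.values()) / len(scores)
        return sum(weights[d] * scores[d] for d in scores) / sum(weights.values())

    bleu = {"MED": 37.3, "LAW": 54.6, "BANK": 50.1, "TALK": 33.5, "IT": 43.2, "REL": 77.5}
    props = {"MED": .68, "LAW": .13, "BANK": .05, "TALK": .04, "IT": .07, "REL": .03}
    print(average_bleu(bleu), average_bleu(bleu, props))  # unweighted (AVG) vs domain-weighted (wAVG)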

[P2-TUN] Furthermore, one can expect that MDMT will improve over fine-tuning (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016), at least in domains where data is scarce, or in situations where several domains are close. To evaluate this, we perform two measurements, using a real as well as an artificial scenario. In the real scenario, we simply compare the performance of MDMT and fine-tuning for domains of varying sizes, expecting a larger gain for smaller domains.

In the artificial scenario, we split a single domain in two parts which are considered as distinct in training. The expectation here is that an MDMT should yield a clear gain for both pseudo sub-domains, which should benefit from the supplementary amount of relevant training. In this situation, MDMT should even outperform fine-tuning on either of the pseudo sub-domains.

3.2 Robustness to Fuzzy Domain Separation

A second set of requirements is related to the definition of a domain. As repeatedly pointed out in the literature, parallel corpora in MT are often collected opportunistically and the view that each corpus constitutes a single domain is often a gross approximation.2 MDMT should aim to make the best of the available data and be robust to domain assignments. To challenge this, we propose evaluating the following requirements.

[P3-HET] The notion of a domain being a fragile one, an effective MDMT system should be able to discover not only when cross-domain sharing is useful (cf. requirement [P2-TUN]), but also when intra-domain heterogeneity is hurting. This requirement is tested by artificially conjoining separate domains into one during training, hoping that the loss in performance with respect to the baseline (using correct domain tags) will remain small.

[P4-ERR] MDMTs should perform best when the true domain tag is known, but deteriorate gracefully in the face of tag errors; in this situation, catastrophic drops in performance are often observed. This requirement can be assessed by translating test texts with erroneous domain tags and reporting the subsequent loss in performance.

[P5-UNK] A related situation occurs when the domain of a test document is unknown. Several situations need to be considered: For domains seen in training, using automatically predicted domain labels should not be much worse than using the correct one. For test documents from unknown domains (zero-shot transfer), a good MD system should ideally outperform the default baseline that merges all available data.

[P6-DYN] Another requirement, more of an operational nature, is that an MDMT system should smoothly evolve to handle a growing number of domains, without having to retrain the full system each time new data is available. This is a requirement that we challenge by dynamically changing the number of training and test domains.

2Two of our own ''domains'' actually comprise several subcorpora (IT and MED), see details in Section 4.1.

3.3 Scaling to a Large Number of Domains

[P7-NUM] As mentioned above, MDMT systems have often been motivated by computational arguments. This argument is all the more sensible as the number of domains increases, making the optimization of many individual systems both ineffective and undesirable. For lack of having access to corpora containing very large sets (e.g., in the order of 100–1,000) of domains, we experiment with automatically learned domains.

4 Experimental Settings

4.1 Data and Metrics

We experiment with translation from English into French and use texts initially originating from six domains, corresponding to the following data sources: the UFAL Medical corpus V1.0 (MED);3 the European Central Bank corpus (BANK) (Tiedemann, 2012); the JRC-Acquis Communautaire corpus (LAW) (Steinberger et al., 2006); documentation for KDE, Ubuntu, GNOME, and PHP from the Opus collection (Tiedemann, 2009), collectively merged into an IT domain; TED Talks (TALK) (Cettolo et al., 2012); and the Koran (REL). Complementary experiments also use v12 of the News Commentary corpus (NEWS). Most corpora are available from the Opus Web site.4 These corpora were deduplicated and tokenized with in-house tools; statistics are in Table 1. To reduce the number of types and build open-vocabulary systems, we use Byte-Pair Encoding (Sennrich et al., 2016b) with 30,000 merge operations on a corpus containing all sentences in both languages.
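A minimal sketch of this preprocessing step with the subword-nmt package (the file names and the joint English-French training file are our own assumptions; the paper does not name the tool it used):

    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    # Learn 30,000 merge operations on a file holding all English and French sentences.
    with open("train.en-fr.tok", encoding="utf-8") as corpus, \
         open("bpe30k.codes", "w", encoding="utf-8") as codes:
        learn_bpe(corpus, codes, 30000)

    # Apply the learned merges to any tokenized sentence.
    with open("bpe30k.codes", encoding="utf-8") as codes:
        bpe = BPE(codes)
    print(bpe.process_line("the European Central Bank corpus"))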
We randomly select in each corpus a development and a test set of 1,000 lines and keep the rest for training.5 Validation sets are used to choose the best model according to the average BLEU

3https://ufal.mff.cuni.cz/ufal_medical_corpus. We only use the in-domain (medical) subcorpora: PATR, EMEA, CESTA, ECDC.
4http://opus.nlpl.eu.
5The code for reproducing our train, dev, and test datasets is available at https://github.com/qmpham/experiments.

          MED           LAW           BANK          IT            TALK          REL           NEWS

# lines   2609 (0.68)   501 (0.13)    190 (0.05)    270 (0.07)    160 (0.04)    130 (0.03)    260 (0)
# tokens  133 / 154     17.1 / 19.6   6.3 / 7.3     3.6 / 4.6     3.6 / 4.0     3.2 / 3.4     7.8 / 9.2
# types   771 / 720     52.7 / 63.1   92.3 / 94.7   75.8 / 91.4   61.5 / 73.3   22.4 / 10.5   -
# uniq    700 / 640     20.2 / 23.7   42.9 / 40.1   44.7 / 55.7   20.7 / 25.6   7.1 / 2.1     -

Table 1: Corpora statistics: number of parallel lines (×10^3) and proportion in the basic domain mixture (which does not include the NEWS domain), number of tokens in English and French (×10^6), number of types in English and French (×10^3), number of types that only appear in a given domain (×10^3). MED is the largest domain, containing almost 70% of the sentences, while REL is the smallest, with only 3% of the data.

        LAW    BANK   TALK   IT     REL

MED     1.93   1.97   1.9    1.93   1.97
LAW            1.94   1.97   1.93   1.99
BANK                  1.98   1.94   1.99
TALK                         1.92   1.93
IT                                  1.99

Table 2: The H-divergence between domains.

score (Papineni et al., 2002).6 Significance testing is performed using bootstrap resampling (Koehn, 2004), implemented in compare-mt7 (Neubig et al., 2019). We report significant differences at the level of p = 0.05.
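For illustration, a rough sketch of the paired bootstrap test; the paper relies on compare-mt and the multibleu script, so the sacrebleu-based stand-in below, with its function and variable names, is our own assumption:

    import random
    import sacrebleu

    def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=1):
        """Fraction of resampled test sets on which system A scores a higher BLEU than system B."""
        rng = random.Random(seed)
        idx = list(range(len(refs)))
        wins = 0
        for _ in range(n_samples):
            sample = rng.choices(idx, k=len(idx))  # resample sentences with replacement
            bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in sample],
                                           [[refs[i] for i in sample]]).score
            bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in sample],
                                           [[refs[i] for i in sample]]).score
            wins += bleu_a > bleu_b
        return wins / n_samples  # a value above 0.95 roughly matches significance at p = 0.05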

We measure the distance between domains using the H-divergence (Ben-David et al., 2010), which relates domain similarity to the test error of a domain discriminator: the larger the error, the closer the domains. Our discriminator is an SVM independently trained for each pair of domains, with sentence representations derived via mean pooling from the source-side representations of the generic Transformer model. We used the scikit-learn8 implementation with default values. Results in Table 2 show that all domains are well separated from all others, with REL being the furthest apart, while TALK is slightly more central.
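A minimal sketch of such a discriminator-based distance, assuming the mean-pooled encoder states are already available as numpy arrays; the paper does not give the exact estimator, so the common proxy 2(1 − 2ε), with ε the discriminator test error, is an assumption on our part:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def h_divergence(feats_a, feats_b):
        """Proxy H-divergence between two domains from the test error of an SVM discriminator."""
        X = np.vstack([feats_a, feats_b])
        y = np.concatenate([np.zeros(len(feats_a)), np.ones(len(feats_b))])
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        clf = SVC().fit(X_tr, y_tr)        # default scikit-learn hyper-parameters
        err = 1.0 - clf.score(X_te, y_te)  # discriminator test error
        return 2.0 * (1.0 - 2.0 * err)     # close to 2 when the domains are well separated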

4.2 Baselines

Our baselines are standard for multi-domain systems.9 Using Transformers (Vaswani et al., 2017) implemented in OpenNMT-tf10 (Klein et al., 2017), we build the following systems:

6We use truecasing and the multibleu script.
7https://github.com/neulab/compare-mt.
8https://scikit-learn.org.
9We omit domain-specific systems trained only with the corresponding subset of the data, as these are always inferior to the mix-domain strategy (Britz et al., 2017).

• a generic model trained on a concatenation of all corpora (Mixed). We develop two versions11 of this system, one where the domain unbalance reflects the distribution of our training data given in Table 1 (Mixed-Nat) and one where all domains are equally represented in training (Mixed-Bal); see the sampling sketch after this list. The former is the best option when the train mixture Ds is also expected in testing; the latter should be used when the test distribution is uniform across domains. Accordingly, we report two aggregate scores: a weighted average reflecting the training distribution, and an unweighted average, meaning that test domains are equally important.

• fine-tuned models (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016), based on the Mixed-Nat system, further trained on each domain for at most 20,000 iterations, with early stopping when the dev BLEU stops increasing. The full fine-tuning (FT-Full) procedure may update all the parameters of the initial generic model, resulting in six systems, each adapted to one domain, with no parameter-sharing across domains.
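A small sketch of the difference between the two mixing policies referred to above (Mixed-Nat vs. Mixed-Bal); the helper and the rounded proportions are ours:

    import random

    def next_domain(weights, rng=random.Random(0)):
        """Draw the domain of the next training example according to the chosen mixing policy."""
        domains = sorted(weights)
        return rng.choices(domains, weights=[weights[d] for d in domains], k=1)[0]

    natural  = {"MED": .68, "LAW": .13, "BANK": .05, "IT": .07, "TALK": .04, "REL": .03}  # Mixed-Nat
    balanced = {d: 1.0 / len(natural) for d in natural}                                   # Mixed-Bal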

All models use embeddings and hidden layers of dimension 512. Transformers contain 8 attention heads in each of the 6+6 layers; the inner feedforward layer contains 2,048 cells. The adapter-based systems (see below)

10https://github.com/OpenNMT/OpenNMT-tf.
11In fact three: to enable a fair comparison with WDCMT, an RNN-based variant is also trained and evaluated. This system appears as Mixed-Nat-RNN in Table 3.

additionally use an adaptation block in each layer, composed of a two-layer perceptron, with an inner ReLU activation function operating on normalized entries of dimension 1,024. Training uses batches of 12,288 tokens, Adam with parameters β1 = 0.9, β2 = 0.98, Noam decay (warmup steps = 4,000), and a dropout rate of 0.1 in all layers.
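A minimal PyTorch sketch of such a residual adapter block, following our reading of the description above (layer-normalized input, a 512-to-1,024-to-512 perceptron with ReLU, and a residual connection); the class name and exact placement are assumptions:

    import torch
    import torch.nn as nn

    class ResidualAdapter(nn.Module):
        """Domain-specific two-layer bottleneck added to a Transformer layer output."""
        def __init__(self, model_dim=512, inner_dim=1024):
            super().__init__()
            self.norm = nn.LayerNorm(model_dim)
            self.up = nn.Linear(model_dim, inner_dim)
            self.down = nn.Linear(inner_dim, model_dim)

        def forward(self, hidden):
            # The residual connection lets the network short-cut the adapter entirely.
            return hidden + self.down(torch.relu(self.up(self.norm(hidden))))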

4.3 Multi-Domain Systems

Our comparison of multi-domain systems includes our own reimplementations of recent proposals from the literature:12

• a system using domain control as in Kobus et al. (2017): domain information is introduced either as an additional token for each source sentence (DC-Tag), or as a supplementary feature for each word (DC-Feat); a sketch of the domain-token mechanism is given at the end of this subsection.

• a system using lexicalized domain representations (Pham et al., 2019): word embeddings are composed of a generic and a domain-specific part (LDR);

• the three proposals of Britz et al. (2017). TTM is a feature-based approach where the domain tag is introduced as an extra word on the target side. Training uses reference tags and inference is usually performed with predicted tags, just like for regular target words. DM is a multi-task learner where a domain classifier is trained on top of the MT encoder, so as to make it aware of domain differences; ADM is the adversarial version of DM, pushing the encoder towards learning domain-independent source representations. These methods thus only use domain tags in training.

• the multi-domain model of Zeng et al. (2018) (WDCMT), where a domain-agnostic and a domain-specialized representation of the input are simultaneously processed; supervised classification and adversarial training are used to compute these representations. Again, inference does not use domain tags.13

12Further implementation details are in Appendix A.
13For this system, we use the available RNN-based system from the authors (https://github.com/DeepLearnXMU/WDCNMT), which does not directly compare to the other, Transformer-based, systems; the improved version of Su et al. (2019) seems to produce comparable, albeit slightly improved, results.
• two multi-domain versions of the approach of Bapna and Firat (2019), denoted FT-Res and MDL-Res, where a domain-specific adaptation module is added to all the Transformer layers; within each layer, residual connections make it possible to short-cut this adapter. The former variant corresponds to the original proposal of Bapna and Firat (2019) (see also Sharaf et al., 2020). It fine-tunes the adapter modules of a Mixed-Nat system independently for each domain, keeping all the other parameters frozen. The latter uses the same architecture, but a different training procedure, and learns all parameters jointly from scratch with a mix-domain corpus.

This list includes systems that slightly depart from our definition of MDMT: Standard implementations of TTM and WDCMT rely on inferred, rather than on gold, domain tags, which must somewhat affect their predictions; DM and ADM make no use of domain tags at all. We did not consider the proposal of Farajian et al. (2017b), however, which performs on-the-fly tuning for each test sentence and diverges more strongly from our notion of MDMT.
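Before turning to the results, the domain-token mechanism used by DC-Tag is easy to illustrate in sketch form; the token format below is our own convention, not the one of Kobus et al. (2017):

    def add_domain_tag(source_tokens, domain):
        """DC-Tag style control: prepend a reserved pseudo-token encoding the domain."""
        return ["<dom:{}>".format(domain)] + list(source_tokens)

    print(add_domain_tag("any user can modify the kernel settings".split(), "IT"))
    # ['<dom:IT>', 'any', 'user', 'can', 'modify', 'the', 'kernel', 'settings']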

5 Results and Discussion

5.1 Performance of MDMT Systems

In this section, we discuss the basic performance of MDMT systems trained and tested on six domains. Results are in Table 3. As expected, balancing data in the generic setting makes a great difference (the unweighted average is 2 BLEU points better, notably owing to the much better results for REL). As explained above, this setting should be the baseline when the test distribution is assumed to be balanced across domains. As all other systems are trained with an unbalanced data distribution, we use the weighted average to perform global comparisons.

Fine-tuning each domain separately yields a better baseline, outperforming Mixed-Nat for all domains, with significant gains for domains that are distant from MED: REL, IT, BANK, LAW.

All MDMTs (except DM and ADM) slightly improve over Mixed-Nat (for most domains), but these gains are rarely significant. Among systems using an extra domain feature, DC-Tag has a small edge over DC-Feat and also

Model / Domain              MED   LAW   BANK  TALK  IT    REL   wAVG  AVG

Mixed-Nat       [65m]       37.3  54.6  50.1  33.5  43.2  77.5  41.1  49.4
Mixed-Bal       [65m]       35.3  54.1  52.5  31.9  44.9  89.5  40.3  51.4
FT-Full         [6×65m]     37.7  59.2  54.5  34.0  46.8  90.8  42.7  53.8

DC-Tag          [+4k]       38.1  55.3  49.9  33.2  43.5  80.5  41.6  50.1
DC-Feat         [+140k]     37.7  54.9  49.5  32.9  43.6  79.9  41.4  49.9
LDR             [+1.4m]     37.0  54.7  49.9  33.9  43.6  79.9  40.9  49.8
TTM             [+4k]       37.3  54.9  49.5  32.9  43.6  79.9  41.0  49.7
DM              [+0]        35.6  49.5  45.6  29.9  37.1  62.4  38.1  43.4
ADM             [+0]        36.4  53.5  48.3  32.0  41.5  73.4  38.9  47.5
FT-Res          [+12.4m]    37.3  57.9  53.9  33.8  46.7  90.2  42.3  53.3
MDL-Res         [+12.4m]    37.9  56.0  51.2  33.5  44.4  88.3  42.0  51.9

Mixed-Nat-RNN   [51m]       36.8  53.8  47.2  30.0  35.7  60.2  39.2  44.0
WDCMT           [73m]       36.0  53.3  48.8  31.1  38.8  58.5  39.0  44.4

Table 3: Translation performance of MDMT systems based on the same Transformer (top) or RNN (bottom) architecture. The former contains 65m parameters, the latter has 51m. For each system, we report the number of additional domain-specific parameters, BLEU scores for each domain, and domain-weighted (wAVG) and unweighted (AVG) averages. For weighted averages, we take the domain proportions from Table 1. Boldface denotes significant gains with respect to Mix-Nat (or Mix-Nat-RNN, for WDCMT), underline denotes significant losses.

requires fewer parameters; it also outperforms TTM, which, however, uses predicted rather than gold domain tags. TTM is also the best choice among the systems that do not use domain tags in inference. The best contenders overall are FT-Res and MDL-Res, which significantly improve over Mixed-Nat for a majority of domains, and are the only ones to clearly fulfill [P1-LAB]; WDCMT also improves on three domains, but regresses on one. The use of a dedicated adaptation module thus seems better than feature-based strategies, but yields a large increase of the number of parameters. The effect of the adaptation layer is especially significant for small domains (BANK, IT, and REL).

All systems fail to outperform fine-tuning, sometimes by a wide margin, especially for an ''isolated'' domain like REL. This might be due to the fact that domains are well separated (cf. Section 4.1) and are hardly helping each other. In this situation, MDMT systems should dedicate a sufficient number of parameters to each domain, so as to close the gap with fine-tuning.

5.2 Redefining Domains

Table 4 summarizes the results of four experiments where we artificially redefine the boundaries of domains, with the aim to challenge requirements [P2-TUN], [P3-HET], and [P4-ERR]. In the first three, we randomly split one corpus in two parts and proceed as if this corresponded to two actual domains. An MD system should detect that these two pseudo-domains are mutually beneficial and should hardly be affected by this change with respect to the baseline scenario (no split). In this situation, we expect MDMT to even surpass fine-tuning separately on each of these dummy domains, as MDMT exploits all data, while fine-tuning focuses only on a subpart. In testing, we decode the test set twice, once with each pseudo-domain tag. This makes no difference for TTM, DM, ADM, and WDCMT, which do not use domain tags in testing. In the merge experiment, we merge two corpora in training, in order to assess the robustness with respect to heterogeneous domains [P3-HET]. We then translate the two corresponding test sets with the same (merged) system.

Our findings can be summarized as follows. For the split experiments, we see small variations that can be positive or negative compared to the baseline situation, but these are hardly significant. All systems show some robustness with respect to fuzzy domain boundaries; this is mostly notable for ADM, suggesting that when domains are

Set-up / Model. Columns of Table 4 are grouped as follows: Split MED (0.5 / 0.5): MED1, MED2; Split MED (0.25 / 0.75): MED1, MED2; Split LAW (0.5 / 0.5): LAW1, LAW2; Merge BANK+LAW: LAW, BANK; Wrong: RND (ALL), NEW (NEWS). Rows are the systems listed below.
FT-Full −0.1 −0.6 −1.5
−0.2 −0.3
DC-Tag
+0.1
DC-Feat −0.5
+0.3
0.0
LDR
+0.1
+0.4
+0.1
−0.2 −0.2 −0.2
TTM (*)
−0.3 −0.3
DM (*)
+0.4
ADM (*)
+0.6
+0.4
+0.6
−0.1 −0.4 −0.3
FT-Res
MDL-Res −0.2 −0.1
+0.2
WDCMT (*) −0.0 −0.0
+0.2

MED2
−0.2
+0.2
+0.3
+0.4
−0.2
+0.4
+0.4
−0.3
+0.0
+0.2

LAW1
−2.3
−0.4
+0.3
0.0
−0.3
+0.3
+0.4
−2.2
−0.9
+0.8

LAW2
−5.1
−0.4
+0.3
0.0
−0.3
+0.3
+0.4
−2.9
−0.9
+0.8

LAW

ALL

BANK
NEWS
−1.6 −1.4 −19.6 −3.3
−0.5 −0.4 −13.4 −1.7
+0.1 −14.2 −1.8
+0.3
+0.1 −12.0 −1.4
0.0
0.0 −0.1
0.0 −0.3
0.0 −0.9
+0.1
+0.9
+0.1 −0.4
0.0 −0.2
−2.4 −3.2 −13.3 −3.0
+0.7 −0.3 −18.6 −1.3
−0.4 −0.8
+0.2

0.0

Table 4: Translation performance with variable domain definitions. In the Split/Merge experiments, we report BLEU differences for the related test set(s). Underline denotes a significant loss when domains are changed wrt. the baseline situation; bold a significant improvement over FT-Full; (*) tags systems ignoring test domains.

close, ignoring domain differences is effective. On the contrary, FT-Full incurs clear losses across the board, especially for the small data condition (Miceli Barone et al., 2017). Even in this very favourable case, however, very few MDMT systems are able to significantly outperform FT-Full, and this is only observed for the smaller part of the MED domain. The merge condition is hardly different, with again large losses for FT-Full and FT-Res, and small variations for all systems. We even observe some rare improvements with respect to the situation where we use actual domains.

5.2.1 Handling Wrong or Unknown Domains

In the last two columns of Table 4, we report the drop in performance when the domain information is not correct. In the first (RND), we use test data from the domains seen in training, presented with a random domain tag. In this situation, the loss with respect to using the correct tag is generally large (more than 10 BLEU points), showing an overall failure to meet requirement [P4-ERR], except for systems that ignore domain tags in testing.

In the second (NEW), we assess [P5-UNK] by translating sentences from a domain unseen in training (NEWS). For each sentence, we automatically predict the domain tag and use it for decoding.14 In this configuration, again, systems using domain tags during inference perform poorly, significantly worse than the Mixed-Nat baseline (BLEU = 23.5).

5.2.2 Handling Growing Numbers of Domains

Another set of experiments evaluates the ability to dynamically handle supplementary domains (requirement [P6-DYN]) as follows. Starting with the existing MD systems of Section 5.1, we introduce an extra domain (NEWS) and resume training with this new mixture of data15 for 50,000 additional iterations. We contrast this approach with training all systems from scratch and report differences in performance in Figure 1 (see also Table 7 in Appendix B).16 We expect that MDMT systems should not be too significantly impacted by the addition of a new domain and reach about the same performance as when training with this domain from scratch. From a practical viewpoint, dynamically integrating new domains is straightforward for DC-Tag, DC-Feat, or

14Domain tags are assigned as follows: we train a language model for each domain and assign tags on a per-sentence basis based on the language model log-probability (assuming uniform domain priors). The domain classifier has an average prediction error of 16.4% for in-domain data.
15The design of a proper balance between domains in training is critical for achieving optimal performance: As our goal is to evaluate all systems in the same conditions, we consider a basic mixing policy based on the new training distribution. This is detrimental to the small domains, for which the ''negative transfer'' effect is stronger than for larger domains.
16WDCMT results are excluded from this table, as resuming training proved difficult to implement.
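A small sketch of the tag-prediction rule described in footnote 14, assuming one KenLM model per domain has been trained beforehand; the toolkit and file names are our assumptions, since the paper only specifies per-domain language models and uniform priors:

    import kenlm

    DOMAINS = ["MED", "LAW", "BANK", "IT", "TALK", "REL"]
    lms = {d: kenlm.Model("lm.{}.arpa".format(d)) for d in DOMAINS}

    def predict_domain(sentence):
        """Assign the domain whose language model gives the highest log-probability."""
        scores = {d: lms[d].score(sentence, bos=True, eos=True) for d in DOMAINS}
        return max(scores, key=scores.get)  # uniform priors: the LM score alone decides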

Figure 1: Ability to handle a new domain. We report BLEU scores for a complete training session with seven domains, as well as differences (in blue) with training with six domains (from Table 3), and (in red) differences with continual training.

TTM, for which new domains merely add new labels. It is less easy for DM, ADM, and WDCMT, which include a built-in domain classifier whose outputs have to be pre-specified, or for LDR, FT-Res, and MDL-Res, for which the number of possible domains is built into the architecture and has to be anticipated from the start. This makes a difference between domain-bounded systems, for which the number of domains is limited, and truly open-domain systems.

We can first compare the results of coldstart training with six or seven domains in Table 7: A first observation is that the extra training data is hardly helping for most domains, except for NEWS, where we see a large gain, and for TALK. The picture is the same when one looks at MDMTs, where only the weakest systems (DM, ADM) seem to benefit from more (out-of-domain) data. Comparing now the coldstart with the warmstart scenario, we see that the former is always significantly better for NEWS, as expected, and that resuming training also negatively impacts the performance for other domains. This happens notably for DC-Tag, TTM, and ADM. In this setting, MDL-Res and DM show the smallest average loss, with the former achieving the best balance of training cost and average BLEU score.

5.3 Automatic Domains

In this section, we experiment with automatic domains, obtained by clustering sentences into k = 30 classes using the k-means algorithm based on generic sentence representations obtained via mean pooling (cf. Section 4.1). This allows us to evaluate requirement [P7-NUM], training and testing our systems as if these domains were fully separated. Many of these clusters are mere splits of the large MED, while a smaller number of classes are mixtures of two (or more) existing domains (full details are in Appendix C). We are thus in a position to reiterate, at a larger scale, the measurements of Section 5.2 and test whether multi-domain systems can effectively take advantage of the cross-domain similarities and eventually perform better than fine-tuning.
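A minimal sketch of this clustering step, assuming the mean-pooled source encodings have been dumped to disk (the file name and array shape are our assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    sent_vecs = np.load("source_encodings.npy")   # (n_sentences, 512) mean-pooled encoder states
    kmeans = KMeans(n_clusters=30, random_state=0).fit(sent_vecs)
    pseudo_domain = kmeans.labels_                # one automatic "domain" id per training sentence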
The results in Table 5 also suggest that MDMT can surpass fine-tuning for the smaller clusters; for the large clusters, this is no longer true. The complete table (in Appendix C) shows that this


Model/
Clusters

10 pequeño
10 mid
10 grande
Avg

Tren
tamaño

Mixed
Nat

29.3k
104.7k
251.1k
128.4k

68.3
44.8
50.4
54.5

FT

corriente continua
MDL
Full Res Res Feat Tag

FT

corriente continua

TTM ADM

DM

LDR

70.0
48.0
52.9
57.0

70.7
46.0
52.0
56.2

71.2
45.7
51.3
56.1

70.6
44.8
49.6
55.0

53.1
44.3
43.2
46.9

67.3
44.5
49.1
53.6

69.8
43.7
48.5
54.0

67.0
41.6
44.3
51.0

70.2
44.5
49.5
54.7

Table 5: BLEU scores computed by merging the 10 smaller, mid, and larger cluster test sets. The best score for each group is in boldface. For the small clusters, full fine-tuning is outperformed by several MDMT systems; see details in Appendix C.

Domain / Model    MED   LAW   BANK  TALK  IT    REL   wAVG  AVG

DC-Tag            38.5  54.0  49.0  33.6  42.2  76.7  41.6  49.0
DC-Feat           37.3  54.2  49.3  33.6  41.9  75.8  40.8  48.7
LDR               37.4  54.1  48.7  32.5  41.4  75.9  39.1  48.3
TTM               37.4  53.7  48.9  32.8  41.3  75.8  40.7  48.3
DM                35.4  49.3  45.2  29.7  37.1  60.0  37.8  42.8
ADM               36.1  53.5  48.0  32.0  41.1  72.1  39.5  47.1
FT-Res            37.5  55.7  51.1  33.1  44.1  86.7  41.6  51.4
MDL-Res           37.3  55.5  50.2  32.2  42.1  86.7  41.2  50.7
WDCMT             35.6  53.1  48.4  30.5  37.7  56.0  38.5  43.6

Table 6: Translation performance with automatic domains, computed with the original test sets. Significance tests are for comparisons with the six-domain scenario (Table 3).

effect is more visible for small subsets of the medical domain.

Finally, Table 6 reports the effect of using automatic domains for each of the six test sets: Each sentence was first assigned to an automatic class, then translated with the corresponding multi-domain system with 30 classes; aggregate numbers were then computed, and contrasted with the six-domain scenario. Results are clear and confirm previous observations: Even though some clusters are very close, the net effect is a loss in performance for almost all systems and conditions. In this setting, the best MDMT in our pool (MDL-Res) is no longer able to surpass the Mix-Nat baseline.

6 Related Work

The multi-domain training regime is more the norm than the exception for natural language processing (Dredze and Crammer, 2008; Finkel and Manning, 2009), and the design of multi-domain systems has been proposed for many language processing tasks. We focus here exclusively on MD machine translation, keeping in mind that similar problems and solutions (parameter sharing, instance selection / weighting, adversarial training, etc.) have been studied in other contexts.

Multi-domain translation was already proposed for statistical MT, either considering as we do multiple sources of training data (e.g., Banerjee et al., 2010; Clark et al., 2012; Sennrich et al., 2013; Huck et al., 2015), or domains made of multiple topics (Eidelman et al., 2012; Hasler et al., 2014). Two main strategies were considered: instance-based, involving a measure of similarities between train and test domains; and feature-based, where domain/topic labels give rise to additional features.

The latter strategy has been widely used in NMT: Kobus et al. (2017) inject an additional domain feature in their seq2seq model, either in the form of an extra (initial) domain-token or in the form of an additional domain-feature associated to each word. These results are reproduced by Tars and Fishel (2018), who also consider automatically induced domain tags. This technique also helps control the style of MT outputs in Sennrich et al. (2016a) and

Niu et al. (2018), and to encode the source or target languages in multilingual MT (Firat et al., 2016; Johnson et al., 2017). Domain control can also be performed on the target side, as in Chen et al. (2016), where a topic vector describing the whole document serves as an extra context in the softmax layer of the decoder. Such ideas are further developed in Chu and Dabre (2018) and Pham et al. (2019), where domain differences and commonalities are encoded in the network architecture: Some parameters are shared across domains, while others are domain-specific.

Techniques proposed by Britz et al. (2017) aim to ensure that domain information is actually used in a mix-domain system. Three methods are considered, using either domain classification (or domain normalization, via adversarial training) on the source or target side. There is no clear winner in either of the three language pairs considered. One contribution of this work is the idea of normalizing representations through adversarial training, so as to make the mixture of heterogeneous data more effective; representation normalization has since proven a key ingredient in multilingual transfer learning. The same basic techniques (parameter sharing, automatic domain identification / normalization) are simultaneously at play in Zeng et al. (2018) and Su et al. (2019): In this approach, the lower layers of the MT use auxiliary classification tasks to disentangle domain-specific representations on the one hand from domain-agnostic representations on the other hand. These representations are then processed as two separate inputs, then recombined to compute the translation.

Another parameter-sharing scheme is in Jiang et al. (2019), which augments a Transformer model with domain-specific heads, whose contributions are regulated at the word/position level: Some words have ''generic'' use and rely on mixed-domain heads, while for some other words it is preferable to use domain-specific heads, thereby reintroducing the idea of ensembling at the core of Huck et al. (2015) and Saunders et al. (2019). The results for three language pairs outperform several standard baselines for two-domain systems (in fr:en and de:en) and a four-domain system (zh:en).

Finally, Farajian et al. (2017b), Li et al. (2018), and Xu et al. (2019) adopt a different strategy. Each test sentence triggers the selection of a small set of related instances; using these, a generic NMT is tuned for some iterations, before delivering its output. This approach entirely dispenses with the notion of domain and relies on data selection techniques to handle data heterogeneity.

7 Conclusion and Outlook

In this study, we have carefully reconsidered the idea of multi-domain machine translation, which seems to be taken for granted in many recent studies. We have spelled out the various motivations for building such systems and the associated expectations in terms of system performance. We have then designed a series of requirements that MDMT systems should meet, and proposed a series of associated test procedures. In our experiments with a representative sample of MDMTs, we have found that most requirements were hardly met for our experimental conditions. If MDMT systems are able to outperform the mixed-domain baseline, at least for some domains, they all fall short of matching the performance of fine-tuning on each individual domain, which remains the best choice in multi-source single domain adaptation. As expected, however, MDMTs are less brittle than fine-tuning when domain frontiers are uncertain, and can, to a certain extent, dynamically accommodate additional domains, this being especially easy for feature-based approaches. Our experiments finally suggest that all methods show decreasing performance when the number of domains or the diversity of the domain mixture increases.

Two other main conclusions can be drawn from this study: First, it seems that more work is needed to make MDMT systems make the best out of the variety of the available data, both to effectively share what needs to be shared while at the same time separating what needs to be kept separated. We notably see two areas worthy of further exploration: the development of parameter-sharing strategies when the number of domains is large; and the design of training strategies that can effectively handle a change of the training mixture, including an increase in the number of domains. Both problems are of practical relevance in industrial settings. Second, and maybe more importantly, there is a general need to adopt better evaluation methodologies for evaluating MDMT systems, which require system developers to clearly spell out the testing conditions and the
associated expected distribution of testing instances, and to report more than comparisons with simple baselines on a fixed and known handful of domains.

Acknowledgments

The work presented in this paper was partially supported by the European Commission under contract H2020-787061 ANITA.

This work was granted access to the HPC resources of [TGCC/CINES/IDRIS] under the allocation 2020-[AD011011270] made by GENCI (Grand Equipement National de Calcul Intensif).

References

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv e-prints, abs/1907.05019.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 355–362. Edinburgh, United Kingdom.

Pratyush Banerjee, Jinhua Du, Baoli Li, Sudip Kumar Naskar, Andy Way, and Josef van Genabith. 2010. Combining multi-domain statistical machine translation models using automatic classifiers. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas, AMTA 2010. Denver, CO, USA.

Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, pages 1538–1548, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1165

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn Wortman. 2010. A theory of learning from different domains. Machine Learning, 79(1):151–175. DOI: https://doi.org/10.1007/s10994-009-5152-4

Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 182–189, Athens, Greece. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1626431.1626468

John Blitzer. 2007. Domain Adaptation of Natural Language Processing Systems. Ph.D. thesis, School of Computer Science, University of Pennsylvania.

Denny Britz, Quoc Le, and Reid Pryzant. 2017. Effective domain mixing for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 118–126, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W17-4712

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268. Trento, Italy.

Wenhu Chen, Evgeny Matusov, Shahram Khadivi, and Jan-Thorsten Peter. 2016. Guided alignment training for topic-aware neural machine translation. In Proceedings of the Twelfth Biennial Conference of the Association for Machine Translation in the Americas, AMTA 2016. Austin, Texas.

Chenhui Chu and Raj Dabre. 2018. Multilingual and multi-domain adaptation for neural machine translation. In Proceedings of the 24th Annual Meeting of the Association for Natural Language Processing, NLP 2018, pages 909–912, Okayama, Japan.

Chenhui Chu and Rui Wang. 2018. A survey of domain adaptation for neural machine
translation. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, pages 1304–1319, Santa Fe, New Mexico, USA.

Jonathan H. Clark, Alon Lavie, and Chris Dyer. 2012. One system, many domains: Open-domain statistical machine translation via feature augmentation. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA 2012). San Diego, California.

Hal Daumé III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research (JAIR), 26:101–126. DOI: https://doi.org/10.1613/jair.1872

Mark Dredze and Koby Crammer. 2008. Online methods for multi-domain learning and adaptation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 689–697, Honolulu, Hawaii. DOI: https://doi.org/10.3115/1613715.1613801

Vladimir Eidelman, Jordan Boyd-Graber, and Philip Resnik. 2012. Topic models for dynamic translation model adaptation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 115–119, Jeju Island, Korea. Association for Computational Linguistics.

M. Amin Farajian, Marco Turchi, Matteo Negri, Nicola Bertoldi, and Marcello Federico. 2017a. Neural vs. phrase-based machine translation in a multi-domain scenario. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 280–284, Valencia, Spain. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/E17-2045

M. Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. 2017b. Multi-domain neural machine translation through unsupervised adaptation. In Proceedings of the Second Conference on Machine Translation, pages 127–137, Copenhagen, Denmark. DOI: https://doi.org/10.18653/v1/W17-4713

Jenny Rose Finkel and Christopher D. Manning. 2009. Hierarchical Bayesian domain adaptation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 602–610, Boulder, Colorado. DOI: https://doi.org/10.3115/1620754.1620842

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N16-1101

George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 128–135, Prague, Czech Republic.

Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. CoRR, abs/1612.06897.

Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic topic adaptation for phrase-based MT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 328–337, Gothenburg, Sweden. Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/E14-1035

Judy Hoffman, Mehryar Mohri, and Ningshan Zhang. 2018. Algorithms and theory for multiple-source adaptation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8246–8256, Curran Associates, Inc.

Matthias Huck, Alexandra Birch, and Barry Haddow. 2015. Mixed domain vs. multi-domain statistical machine translation. In Proceedings
the Machine Translation Summit, MONTE

de
Summit XV, pages 240–255. Miami Florida.

Ann Irvine, John Morgan, Marine Carpuat,
Hal Daum, and Dragos Munteanu. 2013.
Measuring machine translation errors in new
dominios. Transactions of the Association for
Ligüística computacional, 1:429–440. DOI:
https://doi.org/10.1162/tacl a
00239

Haoming Jiang, Chen Liang, Chong Wang, and
Tuo Zhao. 2019. Multi-domain neural machine
translation with word-level adaptive layer-
wise domain mixing. CoRR, abs/1911.02692.
DOI: https://doi.org/10.18653/v1
/2020.acl-main.165, PMID: 31986961

Jing Jiang and ChengXiang Zhai. 2007. Instance
weighting for domain adaptation in NLP. In
Proceedings of the 45th Annual Meeting of
the Association of Computational Linguistics,
pages 264–271, Prague, Czech Republic.
Association for Computational Linguistics.

Joaquin Quionero-Candela, Masashi Sugiyama,
Anton Schwaighofer, and Neil D. Lawrence,
editors. 2008. Dataset Shift in Machine Learning,
Neural Information Processing series. MIT
Press. DOI: https://doi.org/10.7551
/mitpress/9780262170055.001.0001

Melvin Johnson, Mike Schuster, Quoc Le, Maxim
Krikun, Yonghui Wu, Zhifeng Chen, Nikhil
Thorat, Fernanda Vi´egas, Martin Wattenberg,
Greg Corrado, Macduff Hughes, and Jeffrey
Dean. 2017. Google's multilingual neural machine
translation system: Enabling zero-shot
translation. Transactions of the Association for
Computational Linguistics, 5:339–351. DOI:
https://doi.org/10.1162/tacl_a_00065

Mahesh Joshi, Mark Dredze, William W.
Cohen, and Carolyn P. Rose. 2012. Multi-domain
learning: When do domains matter?
In Empirical Methods in Natural Language
Processing (EMNLP), pages 1302–1312.

Guillaume Klein, Yoon Kim, Yuntian Deng,
Jean Senellart, and Alexander Rush. 2017.
OpenNMT: Open-source toolkit for neural
machine translation. In Proceedings of ACL
2017, System Demonstrations, pages 67–72,
Vancouver, Canada. Association for Computational
Linguistics. DOI: https://doi
.org/10.18653/v1/P17-4012

Catherine Kobus, Josep Crego, and Jean Senellart.
2017. Domain control for neural machine
translation. In Proceedings of the International
Conference Recent Advances in Natural Language
Processing, RANLP 2017, pages 372–378,
Varna, Bulgaria. DOI: https://doi.org
/10.26615/978-954-452-049-6_049

Philipp Koehn. 2004. Statistical significance tests
for machine translation evaluation. In Proceedings
of the 2004 Conference on Empirical Methods
in Natural Language Processing, pages 388–395,
Barcelona, Spain. Association for Computational
Linguistics.

Xiaoqing Li, Jiajun Zhang, and Chengqing
Zong. 2018. One sentence one model for
neural machine translation. In Proceedings
of the Eleventh International Conference on
Language Resources and Evaluation (LREC
2018), Miyazaki, Japan. European Language
Resources Association (ELRA).

Minh-Thang Luong and Christopher D. Manning.
2015. Stanford neural machine translation
systems for spoken language domains. In
Proceedings of the International Workshop
on Spoken Language Translation, IWSLT, Da
Nang, Vietnam.

Yishay Mansour, Mehryar Mohri, and Afshin
Rostamizadeh. 2009a. Domain adaptation with
multiple sources. In D. Koller, D. Schuurmans,
Y. Bengio, and L. Bottou, editors, Advances
in Neural Information Processing Systems 21,
pages 1041–1048, Curran Associates, Inc.

Yishay Mansour, Mehryar Mohri, and Afshin
Rostamizadeh. 2009b. Multiple source adaptation
and the R´enyi divergence. In Proceedings
of the 25th Conference on Uncertainty in
Artificial Intelligence, UAI 2009, pages 367–374.

Antonio Valerio Miceli Barone, Barry Haddow,
Ulrich Germann, and Rico Sennrich. 2017.
Regularization techniques for fine-tuning in
neural machine translation. In Proceedings
of the 2017 Conference on Empirical Methods
in Natural Language Processing, pages 1489–1494,
Copenhagen, Denmark. Association for
Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D17-1156

Paul Michel and Graham Neubig. 2018.
Extreme adaptation for personalized neural
machine translation. In Proceedings of the
56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 312–318, Melbourne, Australia.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/P18-2050

Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul
Michel, Danish Pruthi, Xinyi Wang, and John
Wieting. 2019. compare-mt: A tool for holistic
comparison of language generation systems.
CoRR, abs/1903.07926. DOI: https://doi
.org/10.18653/v1/N19-4007

Xing Niu, Sudha Rao, and Marine Carpuat.
2018. Multi-task neural models for translating
between styles within and across languages.
In Emily M. Bender, Leon Derczynski, and Pierre
Isabelle, editors, Proceedings of the 27th
International Conference on Computational
Linguistics, COLING, pages 1008–1021, Santa
Fe, New Mexico, USA.

Yonatan Oren, Shiori Sagawa, Tatsunori
Hashimoto, and Percy Liang. 2019. Distributionally
robust language modeling. In Proceedings
of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 4227–4237, Hong Kong, China. Association
for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D19-1432

Sinno Jialin Pan and Qiang Yang. 2010. A
survey on transfer learning. IEEE Transactions
on Knowledge and Data Engineering, 22(10):
1345–1359. DOI: https://doi.org/10
.1109/TKDE.2009.191

Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: a method for
automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting
on Association for Computational Linguistics,
ACL ’02, pages 311–318, Stroudsburg, PA,
USA. DOI: https://doi.org/10.3115
/1073083.1073135

Minh Quang Pham, Josep-Maria Crego, Jean
Senellart, and Franc¸ois Yvon. 2019. Generic
and specialized word embeddings for multi-domain
machine translation. In Proceedings of
the 16th International Workshop on Spoken
Language Translation, IWSLT, page 9p,
Hong-Kong, CN.

Hassan Sajjad, Nadir Durrani, Fahim Dalvi,
Yonatan Belinkov, and Stephan Vogel. 2017.
Neural machine translation training in a multi-
domain scenario. In Proceedings of the 14th
International Workshop on Spoken Language
Translation, IWSLT 2017, Tokyo, Japan.

Danielle Saunders, Felix Stahlberg, and Bill
Byrne. 2019. UCAM biomedical translation
at WMT19: Transfer learning multi-domain
ensembles. In Proceedings of the Fourth
Conference on Machine Translation (Volume 3:
Shared Task Papers, Day 2), pages 169–174,
Florence, Italy. Association for Computational
Linguistics. DOI: https://doi.org
/10.18653/v1/W19-5421

Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016a. Controlling politeness in neural
machine translation via side constraints. In
Proceedings of the 2016 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 35–40, San Diego, California.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/N16-1005

Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016b. Neural machine translation of
rare words with subword units. In Proceedings
of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1715–1725, Berlin, Germany.
DOI: https://doi.org/10.18653/v1/P16-1162

Rico Sennrich, Holger Schwenk, and Walid
Aransa. 2013. A multi-domain translation
model framework for statistical machine
translation. In Proceedings of the 51st Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers),
pages 832–840, Sofia, Bulgaria. Association
for Computational Linguistics.

Amr Sharaf, Hany Hassan, and Hal Daum´e III.
2020. Meta-learning for few-shot NMT adaptation.
In Proceedings of the Fourth Workshop
on Neural Generation and Translation,
pages 43–53, Online. Association for Computational
Linguistics. DOI: https://doi.org
/10.18653/v1/2020.ngt-1.5

Hidetoshi Shimodaira. 2000. Improving predictive
inference under covariate shift by weighting the
log-likelihood function. Journal of Statistical
Planning and Inference, 90(2):227–244. DOI:
https://doi.org/10.1016/S0378
-3758(00)00115-4

Ralf Steinberger, Bruno Pouliquen, Anna Widiger,
Camelia Ignat, Toma Erjavec, Dan Tufis, and
Daniel Varga. 2006. The JRC-Acquis: A multilingual
aligned parallel corpus with 20+
languages. In Proceedings of the Fifth International
Conference on Language Resources and
Evaluation, LREC’06, Genoa, Italy. European
Language Resources Association (ELRA).

Jinsong Su, Jiali Zeng, Jun Xie, Huating Wen,
Yongjing Yin, and Yang Liu. 2019. Exploring
discriminative word-level domain contexts
for multi-domain neural machine translation.
IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI), pages 1–1. DOI:
https://doi.org/10.1109/TPAMI
.2019.2954406, PMID: 31751225

Sander Tars and Mark Fishel. 2018. Multi-domain
neural machine translation. In Proceedings of
the 21st Annual Conference of the European
Association for Machine Translation, EAMT,
pages 259–269, Alicante, Spain. EAMT.

J¨org Tiedemann. 2009. News from OPUS – A
collection of multilingual parallel corpora
with tools and interfaces. In N. Nicolov, K.
Bontcheva, G. Angelova, and R. Mitkov,
editors, Recent Advances in Natural Language
Processing, volume V, pages 237–248.
John Benjamins, Amsterdam/Philadelphia,
Borovets, Bulgaria. DOI: https://doi
.org/10.1075/cilt.309.19tie

J¨org Tiedemann. 2012. Parallel data, tools
and interfaces in OPUS. In Nicoletta Calzolari
(Conference Chair), Khalid Choukri, Thierry
Declerck, Mehmet Ugur Dogan, Bente
Maegaard, Joseph Mariani, Jan Odijk, and
Stelios Piperidis, editors, Proceedings of the
Eighth International Conference on Language
Resources and Evaluation, LREC’12, Istanbul,
Turkey. European Language Resources
Association (ELRA).

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in Neural Information
Processing Systems 30, pages 5998–6008,
Curran Associates, Inc.

Jitao Xu, Josep Crego, and Jean Senellart.
2019. Lexical micro-adaptation for neural
machine translation. In Proceedings of the 16th
International Workshop on Spoken Language
Translation, IWSLT 2019, Hong Kong, China.

Jiali Zeng, Jinsong Su, Huating Wen, Yang
Liu, Jun Xie, Yongjing Yin, and Jianqiang
Zhao. 2018. Multi-domain neural machine
translation with word-level domain context
discrimination. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 447–457,
Brussels, Belgium. Association for Computational
Linguistics. DOI: https://doi.org
/10.18653/v1/D18-1041

Jian Zhang, Liangyou Li, Andy Way, and Qun
Liu. 2016. Topic-informed neural machine
translation. In Proceedings of the 26th
International Conference on Computational
Linguistics: Technical Papers, COLING 2016,
pages 1807–1817, Osaka, Japan. The COLING
2016 Organizing Committee.

Appendices

A. Description of Multi-Domain Systems

We use the following setups for MDMT systems.

• Mixed-Nat, FT-full, TTM, DC-Tag use
the medium Transformer model of Vaswani
et al. (2017) with the following settings:
embedding size and hidden layer size are
set to 512. Multi-head attention comprises

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
5
1
1
9
2
4
0
5
9

/

/
t

yo

a
C
_
a
_
0
0
3
5
1
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

8 heads in each of the 6 layers; the inner
feedforward layer contains 2,048 cells.
Training uses a batch size of 12,288 tokens;
optimization uses Adam with parameters
β1 = 0.9, β2 = 0.98 and Noam decay
(warmup steps = 4,000), and a dropout
rate of 0.1 for all layers.

• FT-Res and MDL-Res use the same
medium Transformer and add residual layers
with a bottleneck dimension of size 1,024.

• ADM, DM use the medium Transformer model
and a domain classifier composed of 3 dense
layers of size 512 × 2,048, 2,048 × 2,048,
and 2,048 × domain num. The first two
layers of the classifier use ReLU() as
activation function; the last layer uses tanh()
as activation function (see the classifier
sketch after this list).

• DC-Feat uses the medium Transformer model
and domain embeddings of size 4. Given a
sentence of domain i in a training batch, the
embedding of domain i is concatenated to the
embedding of each token in the sentence.

• LDR uses the medium Transformer model and,
for each token, we introduce an LDR feature of
size 4 × domain num. Given a sentence of
domain i ∈ [1, .., k] in the training batch,
for each token of the sentence, the LDR
units whose indexes fall outside of the range
[4(i − 1), .., 4i − 1] are masked to 0, and the
masked LDR feature is concatenated to the
embedding of the token (see the LDR sketch
after this list). Details are in Pham et al. (2019).

• Mixed-Nat-RNN uses one bidirectional
LSTM layer in the encoder and one LSTM
layer in the decoder. The size of hidden layers
es 1,024, the size of word embeddings is 512.

• WDCNMT uses one bidirectional GRU layer in
the encoder and one GRU-conditional layer

in the decoder. The size of hidden layers is
1,024, the size of word embeddings is 512.
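
To make the ADM/DM domain classifier concrete, here is a minimal PyTorch-style sketch. The layer sizes and activations follow the description above; the class name, the mean-pooling of the encoder states, and the default number of domains are our own assumptions rather than details taken from the paper.

    import torch
    import torch.nn as nn

    class DomainClassifier(nn.Module):
        """Domain classifier plugged on top of the encoder (sketch only).

        Sizes and activations follow the setup above: 512 x 2,048 (ReLU),
        2,048 x 2,048 (ReLU), 2,048 x domain_num (tanh).
        """

        def __init__(self, hidden_size: int = 512, domain_num: int = 6):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(hidden_size, 2048), nn.ReLU(),
                nn.Linear(2048, 2048), nn.ReLU(),
                nn.Linear(2048, domain_num), nn.Tanh(),
            )

        def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
            # encoder_states: (batch, length, hidden_size). Mean-pooling over
            # time is an assumption; the paper does not specify the pooling.
            return self.net(encoder_states.mean(dim=1))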
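
Similarly, the masked LDR features are easier to grasp as code. The following PyTorch-style sketch follows the sizes given above, but the class name, the per-token LDR lookup table, and the calling convention are assumptions on our part, not the exact implementation of Pham et al. (2019).

    import torch
    import torch.nn as nn

    class LDREmbedding(nn.Module):
        """Word embeddings extended with masked LDR features (sketch only)."""

        def __init__(self, vocab_size: int, emb_size: int = 512,
                     domain_num: int = 6, ldr_block: int = 4):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, emb_size)
            # One LDR vector of size 4 * domain_num per token.
            self.ldr_emb = nn.Embedding(vocab_size, ldr_block * domain_num)
            self.ldr_block = ldr_block
            self.domain_num = domain_num

        def forward(self, tokens: torch.Tensor, domain: int) -> torch.Tensor:
            # tokens: (batch, length) token ids, all from domain i in [1, .., k].
            ldr = self.ldr_emb(tokens)
            # Keep only the 4 units of block i, mask every other unit to 0.
            mask = torch.zeros(self.ldr_block * self.domain_num, device=ldr.device)
            mask[self.ldr_block * (domain - 1): self.ldr_block * domain] = 1.0
            # Concatenate the masked LDR feature to the word embedding.
            return torch.cat([self.word_emb(tokens), ldr * mask], dim=-1)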

Training For each domain, we create train/
dev/test sets by randomly splitting each corpus.
We keep the validation and test sets at 1,000 lines
each for every domain. The learning rate is set as
in Vaswani et al. (2017). For the fine-tuning
procedures used for FT-full and FT-Res, we
continue training with the same learning rate
schedule, carrying on the step count from the
initial training. All other MDMT systems reported
in Tables 3 and 4 use a combined validation set
comprising 6,000 lines, obtained by merging the
six development sets. For the results in Table 7
we also append the validation set of NEWS to the
multi-domain validation set. In all cases, training
stops when either training reaches the maximum
number of iterations (50,000) or the score on the
validation set does not increase for three consecutive
evaluations. We average five checkpoints to get
the final model.
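
As an illustration only, the stopping rule and the checkpoint averaging described above could be implemented along the following lines; the function names and the checkpoint format are assumed here, they are not taken from the actual training code.

    import torch

    MAX_STEPS = 50_000   # maximum number of training iterations
    PATIENCE = 3         # evaluations without improvement before stopping
    NUM_AVERAGED = 5     # checkpoints averaged into the final model

    def should_stop(step: int, valid_scores: list) -> bool:
        """Stop at MAX_STEPS or when the validation score has not increased
        over the last PATIENCE consecutive evaluations."""
        if step >= MAX_STEPS:
            return True
        if len(valid_scores) <= PATIENCE:
            return False
        best_before = max(valid_scores[:-PATIENCE])
        return all(score <= best_before for score in valid_scores[-PATIENCE:])

    def average_checkpoints(paths: list) -> dict:
        """Average the parameters of the last NUM_AVERAGED checkpoints."""
        states = [torch.load(p, map_location="cpu") for p in paths[-NUM_AVERAGED:]]
        return {name: sum(s[name].float() for s in states) / len(states)
                for name in states[0]}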

B. Experiments with Continual Learning

Complete results for the experiments with con-
tinual learning are reported in Table 7.

C. Experiments with Automatic Domains

This experiment aims to simulate, with automatic
domains, a scenario where the number of
‘‘domains’’ is large and where some ‘‘domains’’
are close and can effectively share information.
Full results are in Table 8. Cluster sizes vary from
approximately 8k sentences (cluster 24) up to more
than 350k sentences. More than two thirds of these
clusters mostly comprise texts from one single
domain, as for cluster 12, which is predominantly
MED; the remaining clusters typically mix two
domains. Fine-tuning with small domains is often
outperformed by other MDMT techniques, an
issue that a better regularization strategy might
mitigate. Domain control (DC-Feat) is very
effective for small domains, but again less so
in larger data conditions. Among the MD models,
approaches using residual adapters have the best
average performance.

Mixed-Nat:  MED 37.1 ( | +0.2); LAW 54.1 (+0.5 | +0.5); BANK 49.6 ( | ); TALK 34.1 ( | −0.6); IT 42.1 ( | +1.1); REL 77.0 ( | +0.5); NEWS 28.9 ( | −5.4); wAVG 40.8 ( | +0.3); AVG 49.0 ( | +0.4)
DC-Tag:     MED 37.7 (+0.3 | +0.3); LAW 54.5 (+0.8 | −0.1); BANK 49.9 (−0.04 | −0.6); TALK 34.8 (−1.6 | −1.1); IT 43.9 (−0.4 | −1.3); REL 78.8 (+1.7 | −3.5); NEWS 29.5 (−7.7 | −1.4); wAVG 41.4 (+0.2 | −0.1); AVG 49.9 (+0.1 | −1.1)
DC-Feat:    MED 37.4 (+0.3 | −0.2); LAW 54.9 (−0.1 | −0.1); BANK 50.0 (−0.3 | −0.1); TALK 34.7 (−1.3 | −0.6); IT 43.9 (−0.1 | −0.9); REL 79.6 (+0.4 | +0.3); NEWS 28.9 (−7.3 | −0.8); wAVG 41.2 (+0.1 | −0.2); AVG 50.1 (−0.2 | −0.3)
LDR:        MED 37.0 (0.0 | −0.6); LAW 54.6 (+0.1 | +0.5); BANK 49.6 (+0.2 | −0.4); TALK 34.3 (−0.4 | −0.6); IT 43.0 (+0.5 | +0.5); REL 77.0 (+2.9 | +3.8); NEWS 28.7 (−6.6 | −0.9); wAVG 40.8 (+0.6 | +0.5); AVG 49.2 (+0.1 | −0.4)
TTM:        MED 37.3 (0.0 | −0.3); LAW 54.4 (+0.4 | −0.3); BANK 49.6 (−0.1 | −0.5); TALK 33.8 (−0.9 | −1.1); IT 42.9 (+0.6 | −1.0); REL 78.2 (+1.8 | −4.0); NEWS 29.1 (−5.7 | −1.4); wAVG 41.0 (0.0 | −0.5); AVG 49.4 (+0.3 | −1.2)
DM:         MED 36.0 (−0.4 | +0.6); LAW 51.3 (−1.8 | +0.4); BANK 46.8 (−1.2 | +0.6); TALK 31.8 (−1.8 | −0.1); IT 39.8 (−2.6 | +0.5); REL 65.7 (−3.3 | 0.0); NEWS 27.0 (−4.4 | −1.2); wAVG 38.9 (−0.8 | +0.5); AVG 45.2 (−1.8 | +0.3)
ADM:        MED 36.6 (−0.2 | +0.3); LAW 54.2 (−0.7 | −0.8); BANK 49.1 (−0.8 | −0.8); TALK 32.9 (−0.9 | −0.2); IT 42.1 (−0.5 | −0.4); REL 75.7 (−2.3 | −5.0); NEWS 28.7 (−5.4 | −1.9); wAVG 40.2 (−0.5 | −0.2); AVG 48.4 (−0.9 | −1.1)
FT-Res:     MED 37.0 (+0.3 | +0.3); LAW 57.6 (+0.4 | +0.4); BANK 53.8 (+0.1 | +0.1); TALK 34.5 (−0.7 | −0.7); IT 46.1 (+0.5 | +0.5); REL 91.1 (−0.9 | −0.9); NEWS 29.6 (−9.0 | −0.6); wAVG 42.2 (−0.1 | −0.1); AVG 53.3 (+0.2 | +0.2)
MDL-Res:    MED 37.7 (+0.2 | −0.2); LAW 55.6 (+0.4 | +0.5); BANK 51.1 (+0.1 | 0.0); TALK 34.4 (−0.9 | −0.4); IT 44.5 (−0.1 | −0.2); REL 87.5 (+0.9 | −0.2); NEWS 29.1 (−8.0 | −0.8); wAVG 41.9 (+0.1 | −0.2); AVG 51.8 (+0.1 | −0.1)

Table 7: Ability to handle a new domain. We report BLEU scores for a complete training session with seven
domains, as well as differences with (left) training with six domains (from Table 3); (right) continuous training
mode. Averages only take into account six domains (NEWS excluded). Underline denotes a significant loss,
bold a significant gain.

Cluster [majority domain]    size (train / test)    Mixed-Nat

8.1k / 3
17.3k / 52
25.6k / 54
27.2k / 88
27.4k / 72
27.5k / 103
28.2k / 56
30.4k / 18
47.0k / 23
54.4k / 26
61.4k / 214
68.1k / 122
91.5k / 30
93.0k / 38

24 [med]
[-]
13
[-]
28
[IT]
19
[-]
0
[-]
22
[-]
25
16 [med]
23 [med]
17 [med]
[IT]
8
1
[-]
7 [med]
11 [med]
29 [law] 109.2k / 242
27 [med] 109.3k / 49
5
[-] 109.9k / 267
6 [med] 133.4k / 73
26
[-] 134.8k / 428
15 [bank] 136.9k / 674
4
[rel] 137.4k / 1016
2 [med] 182.6k / 85
20 [med] 183.0k / 71
21
[-] 222.8k / 868
10 [med] 225.4k / 115
18 [med] 245.0k / 106
9 [med] 301.6k / 145
[law] 323.5k / 680
3
334.0 / 146
14 [med]
12 [med] 356.4k / 148

90.4
67.6
71.6
58.5
43.9
91.5
57.0
57.2
24.5
39.9
46.9
47.2
41.3
31.6
65.9
11.0
46.3
37.2
31.8
46.5
77.1
70.6
47.4
38.7
40.0
57.7
37.2
50.1
31.6
36.3

FT-Full    FT-Res    MDL-Res    DC-Feat    DC-Tag    TTM    ADM    DM    LDR

90.4
75.4
68.7
63.0
33.3
93.7
44.8
70.4
27.2
40.3
53.1
47.5
35.5
42.6
69.2
9.6
47.4
38.9
30.8
51.5
85.3
75.8
47.2
38.8
42.6
60.3
37.3
52.0
31.4
36.6

90.4 90.4 100.0 65.6 100.0 90.4 100.0 100.0
76.9
74.3 74.3
72.6
68.1 70.2
60.3
60.9 63.9
47.8
45.4 45.4
93.4
93.4 93.9
52.4
48.2 49.1
58.3
77.4 73.5
29.8
26.5 28.5
33.7
41.6 38.0
46.7
55.8 53.6
44.9
48.7 45.1
41.8
41.4 39.9
36.6
31.8 35.4
65.9
67.6 67.7
10.6
9.2
45.7
46.9 45.4
35.9
38.7 36.8
31.2
31.8 31.2
46.0
47.9 48.0
75.9
83.5 83.3
68.2
71.7 69.4
46.8
46.8 47.2
37.0
39.0 37.2
40.7
40.0 38.2
55.9
58.7 58.6
37.0
36.5 36.1
49.1
50.8 50.1
31.8
31.9 33.0
36.3
35.9 35.9

75.0 54.7
71.0 42.5
63.7 57.2
49.9 15.4
92.5 72.8
54.6 47.2
61.8 54.2
30.5 27.3
37.1 36.6
48.9 45.1
46.8 39.1
41.4 36.5
36.0 29.6
66.0 63.8
10.0 19.4
44.0 42.9
37.5 27.5
31.9 32.6
46.6 46.0
75.8 46.1
68.2 67.3
48.4 47.5
37.5 35.9
39.9 35.8
58.4 56.3
36.4 37.7
49.1 48.3
32.5 34.1
35.8 37.0

74.7 75.9
72.0 71.3
59.4 61.1
46.8 49.2
92.3 93.2
49.8 54.2
58.4 58.1
32.0 24.4
35.2 35.4
48.8 50.9
45.4 44.2
37.3 37.1
36.7 32.7
65.1 64.7
7.9
43.7 44.3
38.0 37.2
32.2 30.5
45.8 45.7
74.2 73.3
67.3 68.6
48.8 47.3
36.9 37.1
39.5 39.1
57.3 56.1
36.4 35.2
49.0 48.2
31.4 32.1
36.4 35.4

65.9
65.6
60.5
46.6
91.4
45.1
52.5
29.0
31.3
43.0
40.7
40.7
26.5
62.4
10.7
40.9
31.3
29.6
42.9
63.2
65.6
47.1
33.4
36.3
54.9
34.2
44.4
30.5
34.2

8.7

9.4

Table 8: Complete results for the experiments with automatic domains. For each cluster, we report: the
majority domain when one domain accounts for more than 75% of the class; training and test sizes; and
BLEU scores obtained with the various systems used in this study. Most test sets are too small to report
significance tests.
