Revisiting Multi-Domain Machine Translation
MinhQuang Pham†‡, Josep Maria Crego†, François Yvon‡
‡Université Paris-Saclay, CNRS, LIMSI, 91400, Orsay, France
francois.yvon@limsi.fr
†SYSTRAN, 5 rue Feydeau, 75002 Paris, France
{minhquang.pham,josep.crego}@systrangroup.com
Abstract
When building machine translation systems, one often needs to make the best out of heterogeneous sets of parallel data in training, and to robustly handle inputs from unexpected domains in testing. This multi-domain scenario has attracted a lot of recent work that falls under the general umbrella of transfer learning. In this study, we revisit multi-domain machine translation, with the aim to formulate the motivations for developing such systems and the associated expectations with respect to performance. Our experiments with a large sample of multi-domain systems show that most of these expectations are hardly met and suggest that further work is needed to better analyze the current behaviour of multi-domain systems and to make them fully hold their promises.
1 Introduction
Data-based Machine Translation (MT), whether statistical or neural, rests on well-understood machine learning principles. Given a training sample of matched source-target sentence pairs (f, e) drawn from an underlying distribution Ds, a model parameterized by θ (here, a translation function hθ) is trained by minimizing the empirical expectation of a loss function ℓ(hθ(f), e). This approach ensures that the translation loss remains low when translating more sentences drawn from the same distribution.
Owing to the great variability of language data, this ideal situation is rarely met in practice, warranting the study of an alternative scenario, where the test distribution Dt differs from Ds. In this setting, domain adaptation (DA) methods are in order. DA has a long history in Machine Learning in general (e.g., Shimodaira, 2000; Ben-David et al., 2010; Joaquin Quionero-Candela and Lawrence, 2008; Pan and Yang, 2010) and in NLP
in particular (e.g., Daumé III and Marcu, 2006; Blitzer, 2007; Jiang and Zhai, 2007). A variety of techniques thus exist to handle both the situations where a (small) training sample drawn from Dt is available in training, or where only samples of source-side (or target-side) sentences are available (see Foster and Kuhn [2007]; Bertoldi and Federico [2009]; Axelrod et al. [2011] for proposals from the statistical MT era, or Chu and Wang [2018] for a recent survey of DA for Neural MT).
A seemingly related problem is multi-domain (MD) machine translation (Sajjad et al., 2017; Farajian et al., 2017b; Kobus et al., 2017; Zeng et al., 2018; Pham et al., 2019), where one single system is trained and tested with data from multiple domains. MD machine translation (MDMT) corresponds to a very common situation, where all available data, no matter its origin, is used to train a robust system that performs well for any kind of new input. If the intuitions behind MDMT are quite simple, the exact specifications of MDMT systems are rarely spelled out: for instance, should MDMT perform well when the test data is distributed like the training data, when it is equally distributed across domains, or when the test distribution is unknown? Should MDMT also be robust to new domains? How should it handle domain labeling errors?
A related question concerns the relationship between supervised domain adaptation and multi-domain translation. The latter task seems more challenging as it tries to optimize MT performance for a more diverse set of potential inputs, with an additional uncertainty regarding the distribution of test data. Are there still situations where MD systems can surpass single domain adaptation, as is sometimes expected?
In this paper, we formulate in a more precise fashion the requirements that an effective MDMT system should meet (Section 2). Our first
Transactions of the Association for Computational Linguistics, vol. 9, pp. 17–35, 2021. https://doi.org/10.1162/tacl_a_00351
Action Editor: George Foster. Submission batch: 5/2020; Revision batch: 9/2020; Published 02/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC BY 4.0 license.
contribution is thus of methodological nature and
consists of lists of expected properties of MDMT
systems and associated measurements to evaluate
them (Section 3). In doing so, we also shed light on new problems that arise in this context, regarding, for example, the accommodation of new domains in the course of training, or the computation of automatic domain tags. Our second main contribution is experimental and consists in a thorough reanalysis of eight recent multi-domain approaches from the literature, including a variant of a model initially introduced for DA. We show in Section 4 that existing approaches still fall short of matching many of these requirements, in particular with respect to the handling of a large number of heterogeneous domains and to dynamically integrating new domains in training.
2 Requirements of Multi-Domain MT
In this section, we recap the main reasons
for considering a multi-domain scenario and
discuss their implications in terms of performance
评估.
2.1 Formalizing Multi-Domain Translation
We conventionally define a domain d as a distribution Dd(x) over some feature space X that is shared across domains (Pan and Yang, 2010): In machine translation, X is the representation space for source sentences; each domain corresponds to a specific source of data, and differs from the other data sources in terms of textual genre, thematic content (Chen et al., 2016; Zhang et al., 2016), register (Sennrich et al., 2016a), style (Niu et al., 2018), and so forth. Translation in domain d is formalized by a translation function hd(y|x) pairing sentences in a source language with sentences in a target language y ∈ Y. hd is usually assumed to be deterministic (so that y = hd(x)), but can differ from one domain to the other.
A typical learning scenario in MT is to have access to samples from nd domains, meaning that the training distribution Ds is a mixture Ds(x) = Σ_d λs_d Dd(x), with {λs_d, d = 1 . . . nd} the corresponding mixture weights (Σ_d λs_d = 1). Multi-domain learning, as defined in Dredze and Crammer (2008), further assumes that domain tags are also available in testing; the implication being that the test distribution is also a mixture Dt(x) = Σ_d λt_d Dd(x) of several domains, making the problem distinct from mere domain adaptation. A multi-domain learner is then expected to use these tags effectively (Joshi et al., 2012) when computing the combined translation function h(x, d), and to perform well in all domains (Finkel and Manning, 2009). This setting is closely related to the multi-source adaptation problem formalized in Mansour et al. (2009a,b) and Hoffman et al. (2018).
This definition seems to be the most accepted view of multi-domain MT1 and one that we also adopt here. Note that in the absence of further specification, the naive answer to the MD setting should be to estimate one translation function ĥd(x) separately for each domain, and to translate using ĥ(x, d) = Σ_{d'} ĥ_{d'}(x) I(d' = d), where I(·) is the indicator function. We now discuss the arguments that are put forward to proceed differently.
2.2 Reasons for Building MDMT Systems
A first motivation for moving away from the
one-domain / one-system solution is practical (Sennrich et al., 2013; Farajian et al., 2017a):
When faced with inputs that are potentially from
multiple domains, it is easier and computationally
cheaper to develop one single system instead
of having to optimize and maintain multiple
engines. The underlying assumption here is that
the number of domains of interest can be large, a
limiting scenario being fully personalized machine
translation (Michel and Neubig, 2018).
A second line of reasoning rests on linguistic
properties of the translation function and contends
that domain specificities are mostly expressed
lexically and will primarily affect content words
or multi-word expressions; function words, on the other hand, are domain agnostic and tend
to remain semantically stable across domains,
motivating some cross-domain parameter sharing.
An MDMT system should simultaneously learn
lexical domain peculiarities, and leverage cross-
domain similarities to improve the translation of
generic contexts and words (Zeng et al., 2018;
Pham et al., 2019). It is here expected that the
MDMT scenario should be more profitable when
the domain mix includes domains that are closely
related and can share more information.
1An exception is Farajian et al. (2017b), where test
translations rely on similarity scores between test and train
句子, rather than on domain labels.
A third series of motivations is of statistical na-
ture. The training data available for each domain is
usually unevenly distributed, and domain-specific
systems trained or adapted on small datasets are
likely to have a high variance and generalize
poorly. For some test domains, there may even be
no data at all (Farajian et al., 2017a). Training mix-
domain systems is likely to reduce this variance,
at the expense of a larger statistical bias (克拉克
等人。, 2012). Under this view, MDMT would be es-
pecially beneficial for domains with little training
data. This is observed for multilingual MT from
English: an improvement for under-resourced
languages due to positive transfer, at the cost
of a decrease in performance for well-resourced
languages (Arivazhagan et al., 2019).
Combining multiple domain-specific MTs can
also be justified for the sake of distributional robustness (Mansour et al., 2009a,b), for example, when
the test mixture differs from the train mixture, 或者
when it includes new domains unseen in training.
An even more challenging case is when the
MT would need to perform well for any test
分配, as studied for statistical MT in Huck
et al. (2015). In all these cases, mixing domains
in training and/or testing is likely to improve
robustness against unexpected or adversarial test
distributions (Oren et al., 2019).
A distinct line of reasoning is that mixing domains can have a positive regularization effect for all domains. By introducing variability in training, it prevents DA from overfitting the
available adaptation data and could help improve
generalization even for well-resourced domains.
A related case is made in Joshi et al. (2012), which
shows that part of the benefits of MD training is
due to an ensembling effect, where systems from
multiple domains are simultaneously used in the
prediction phase; this effect may subsist even in
the absence of clear domain separations.
To recap,
there are multiple arguments for
adopting MDMT, some already used in DA
settings, and some original. These arguments are
not mutually exclusive; however, each yields
specific expectations with respect to the perfor-
mance of this approach, and should also yield
appropriate evaluation procedures. If the motiva-
tion is primarily computational, then a drop in
MT quality with respect to multiple individual
domains might be acceptable if compensated by
the computational savings. If it is to improve
statistical estimation, then the hope will be that
MDMT will improve, at least for some under-resourced domains, over individually trained systems. If, finally, it is to make the system more
robust to unexpected or adversarial test distribu-
tions, then this is the setting that should be used to
evaluate MDMT. The next section discusses ways
in which these requirements of MDMT systems
could be challenged.
3 Challenging Multi-Domain Systems
In this section, we propose seven operational
requirements that can be expected from an effec-
tive multi-domain system, and discuss ways to
evaluate whether these requirements are actually
met. All these evaluations will rest on comparison
of translation performance, and do not depend on
the choice of a particular metric. To make our
results comparable with the literature, 我们将
only use the BLEU score (Papineni et al., 2002) in Section 4, noting it may not be the best yardstick to
assess subtle improvements of lexical choices that
are often associated with domain adapted systems
(Irvine et al., 2013). Other important figures of
merit for MDMT systems are the computational
training cost and the total number of parameters.
3.1 Multi-Domain Systems Should
Be Effective
A first expectation is that MDMT systems should
perform well in the face of mixed-domain test
data. We thus derive the following requirements.
[P1-LAB] An MDMT should perform better than baselines that disregard domain labels, or reassign them in a random fashion (Joshi et al.,
2012). Evaluating this requirement is a matter of
a mere comparison, assuming the test distribution
of domains is known: If all domains are equally
important, performance averages can be reported;
if they are not, weighted averages should be used
instead.
[P2-TUN] Furthermore, one can expect that
MDMT will improve over fine-tuning (Luong and
Manning, 2015; Freitag and Al-Onaizan, 2016),
at least in domains where data is scarce, or in situ-
ations where several domains are close. To evalu-
ate this, we perform two measurements, using a
real as well as an artificial scenario. In the real
scenario, we simply compare the performance of
MDMT and fine-tuning for domains of varying
sizes, expecting a larger gain for smaller domains.
In the artificial scenario, we split a single do-
main in two parts which are considered as distinct
in training. The expectation here is that an MDMT should yield a clear gain for both pseudo sub-domains, which should benefit from the supplementary amount of relevant training. In this situation, MDMT should even outperform fine-tuning on either of the pseudo sub-domains.
3.2 Robustness to Fuzzy Domain Separation
A second set of requirements is related to the
definition of a domain. As repeatedly pointed out
in the literature, parallel corpora in MT are often
collected opportunistically and the view that each
corpus constitutes a single domain is often a gross
approximation.2 MDMT should aim to make the
best of the available data and be robust to domain
assignments. To probe this, we propose evaluating the following requirements.
[P3-HET] The notion of a domain being a
fragile one, an effective MDMT system should
be able to discover not only when cross-domain
sharing is useful (cf. requirement [P2-TUN]),
but also when intra-domain heterogeneity is
hurting. This requirement is tested by artificially
conjoining separate domains into one during
训练, hoping that the loss in performance with
respect to the baseline (using correct domain tags)
will remain small.
[P4-ERR] MDMTs should perform best when
the true domain tag is known, but deteriorate
gracefully in the face of tag errors; in this situation,
catastrophic drops in performance are often ob-
服务. This requirement can be assessed by trans-
lating test texts with erroneous domain tags and
reporting the subsequent loss in performance.
[P5-UNK] A related situation occurs when the
domain of a test document is unknown. Some
situations need to be considered: For domains seen
in training, using automatically predicted domain
labels should not be much worse than using the
correct one. For test documents from unknown
域 (zero-shot transfer), a good MD system
should ideally outperform the default baseline that
merges all available data.
[P6-DYN] Another requirement, more of an operational nature, is that an MDMT system
2Two of our own ‘‘domains’’ actually comprise several
subcorpora (IT and MED), see details in Section 4.1.
should smoothly evolve to handle a growing
number of domains, without having to retrain
the full system each time new data is available.
This is a requirement [P6-DYN] that we challenge
by dynamically changing the number of training
and test domains.
3.3 Scaling to a Large Number of Domains
[P7-NUM] As mentioned above, MDMT sys-
tems have often been motivated by computational
arguments. This argument is all the more sensible
as the number of domains increases, making the
optimization of many individual systems both in-
effective and undesirable. For lack of having ac-
cess to corpora containing very large sets (e.g., in the order of 100–1,000) of domains, we experiment
with automatically learned domains.
4 Experimental Settings
4.1 Data and Metrics
We experiment with translation from English
into French and use texts initially originating
from six domains, corresponding to the following
data sources: the UFAL Medical corpus V1.0
(MED);3 the European Central Bank corpus (BANK)
(Tiedemann, 2012); The JRC-Acquis Commu-
nautaire corpus (LAW) (Steinberger et al., 2006),
documentations for KDE, Ubuntu, GNOME, and PHP from the Opus collection (Tiedemann, 2009), collectively merged in an IT domain; TED Talks
(TALK) (Cettolo et al., 2012); and the Koran (REL).
Complementary experiments also use v12 of the
News Commentary corpus (NEWS). Most corpora
are available from the Opus Web site.4 These
corpora were deduplicated and tokenized with in-
house tools; statistics are in Table 1. To reduce
the number of types and build open-vocabulary
系统, we use Byte-Pair Encoding (Sennrich
等人。, 2016乙) 和 30,000 merge operations on a
corpus containing all sentences in both languages.
We randomly select in each corpus a devel-
opment and a test set of 1,000 lines and keep the
rest for training.5 Validation sets are used to choose
the best model according to the average BLEU
3https://ufal.mff.cuni.cz/ufal_medical_corpus. We only use the in-domain (medical) subcorpora:
PATR, EMEA, CESTA, ECDC.
4http://opus.nlpl.eu.
5The code for reproducing our train, dev and test data-
sets is available at https://github.com/qmpham
/experiments.
            MED           LAW          BANK         IT           TALK         REL          NEWS
# lines     2609 (0.68)   501 (0.13)   190 (0.05)   270 (0.07)   160 (0.04)   130 (0.03)   260 (0)
# tokens    133 / 154     17.1 / 19.6  6.3 / 7.3    3.6 / 4.6    3.6 / 4.0    3.2 / 3.4    7.8 / 9.2
# types     771 / 720     52.7 / 63.1  92.3 / 94.7  75.8 / 91.4  61.5 / 73.3  22.4 / 10.5  -
# uniq      700 / 640     20.2 / 23.7  42.9 / 40.1  44.7 / 55.7  20.7 / 25.6  7.1 / 2.1    -
Table 1: Corpora statistics: number of parallel lines (×10³) and proportion in the basic domain mixture (which does not include the NEWS domain), number of tokens in English and French (×10⁶), number of types in English and French (×10³), number of types that only appear in a given domain (×10³). MED is the largest domain, containing almost 70% of the sentences, while REL is the smallest, with only 3% of the data.
       LAW    BANK   TALK   IT     REL
MED    1.93   1.97   1.9    1.93   1.97
LAW           1.94   1.97   1.93   1.99
BANK                 1.98   1.94   1.99
TALK                        1.92   1.93
IT                                 1.99
Table 2: The H-divergence between domains.
scores (Papineni et al., 2002).6 Significance testing is performed using bootstrap resampling (Koehn, 2004), implemented in compare-mt7 (Neubig et al., 2019). We report significant differences
at the level of p = 0.05.
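For concreteness, the paired bootstrap test of Koehn (2004) can be sketched as below. This is only an illustrative re-implementation (the experiments themselves rely on the multibleu script and compare-mt); sacrebleu is used here purely as a convenient corpus-level BLEU implementation, and the function name and 1,000-sample default are ours.

```python
import random
import sacrebleu  # stand-in corpus-BLEU implementation for this sketch

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004): fraction of resampled
    test sets on which system A obtains a higher BLEU than system B."""
    rng = random.Random(seed)
    idx = list(range(len(refs)))
    wins_a = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]  # resample sentence indices with replacement
        bleu_a = sacrebleu.corpus_bleu([hyps_a[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        bleu_b = sacrebleu.corpus_bleu([hyps_b[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        wins_a += bleu_a > bleu_b
    return wins_a / n_samples  # A is significantly better at p = 0.05 if this is >= 0.95
```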
We measure the distance between domains
using the H-Divergence (Ben-David et al., 2010),
which relates domain similarity to the test error of
a domain discriminator: the larger the error, the
closer the domains. Our discriminator is a SVM
independently trained for each pair of domains,
with sentence representations derived via mean
pooling from the source side representation of the
generic Transformer model. We used the scikit-
learn8 implementation with default values. Results in Table 2 show that all domains are well separated
from all others, with REL being the furthest apart,
while TALK is slightly more central.
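As an illustration, the discriminator-based distance can be sketched as follows. This is a simplified stand-in: the actual experiments use mean-pooled source-side Transformer representations and scikit-learn defaults, whereas here the sentence vectors are assumed to be precomputed, and the mapping from classifier test error ε to a [0, 2]-bounded score via 2(1 − 2ε), the usual proxy A-distance formula, is our assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def domain_distance(feats_a, feats_b, seed=0):
    """Train an SVM to separate two domains from (precomputed) sentence
    vectors; a high test error means the domains are close."""
    X = np.vstack([feats_a, feats_b])
    y = np.concatenate([np.zeros(len(feats_a)), np.ones(len(feats_b))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = SVC().fit(X_tr, y_tr)          # scikit-learn defaults, as in the text
    err = 1.0 - clf.score(X_te, y_te)    # discriminator test error
    return 2.0 * (1.0 - 2.0 * err)       # proxy A-distance scaling (assumed)
```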
4.2 Baselines
Our baselines are standard for multi-domain sys-
tems.9 Using Transformers (Vaswani et al., 2017)
6We use truecasing and the multibleu script.
7https://github.com/neulab/compare-mt.
8https://scikit-learn.org.
9We omit domain-specific systems trained only with the
corresponding subset of the data, as these are always inferior
to the mix-domain strategy (Britz et al., 2017).
implemented in OpenNMT-tf10 (Klein et al.,
2017), we build the following systems:
• a generic model trained on a concatenation
of all corpora (Mixed). We develop two
versions11 of this system, one where the
domain unbalance reflects the distribution of our training data given in Table 1
(Mixed-Nat) and one where all domains
are equally represented in training (Mixed-
Bal). The former is the best option when
the train mixture Ds is also expected in
testing; the latter should be used when the
test distribution is uniform across domains.
Accordingly, we report two aggregate scores: a weighted average reflecting the training distribution, and an unweighted average, meaning that all test domains are equally important.
• fine-tuned models (Luong and Manning,
2015; Freitag and Al-Onaizan, 2016), based
on the Mixed-Nat system, further trained
on each domain for at most 20,000 iterations,
with early stopping when the dev BLEU stops
increasing. The full fine-tuning (FT-Full)
procedure may update all the parameters of
the initial generic model, resulting in six
systems adapted for one domain, with no
parameter-sharing across domains.
All models use embeddings and hidden layer sizes of dimension 512. Transformers contain 8 attention heads in each of the 6+6 layers; the inner feedforward layer contains 2,048 cells. The adapter-based systems (see below)
10https://github.com/OpenNMT/OpenNMT-tf.
11In fact three: to enable a fair comparison with WDCMT,
an RNN-based variant is also trained and evaluated. This
system appears as Mixed-Nat-RNN in Table 3.
additionally use an adaptation block in each layer, composed of a two-layer perceptron, with an inner ReLU activation function operating on normalized entries of dimension 1,024. Training uses batches of 12,288 tokens, Adam with parameters β1 = 0.9, β2 = 0.98, Noam decay (warmup steps = 4,000), and a dropout rate of 0.1 in all layers.
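As a rough sketch of the adaptation block described above (the systems themselves are built with OpenNMT-tf; this PyTorch-style fragment only mirrors the textual description, and the exact placement of the residual connection and normalization inside each layer is our assumption):

```python
import torch
import torch.nn as nn

class DomainAdapter(nn.Module):
    """Two-layer perceptron with an inner ReLU, applied to layer-normalized
    entries, and short-cut by a residual connection."""
    def __init__(self, d_model=512, d_inner=1024):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, d_inner)
        self.up = nn.Linear(d_inner, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))

class PerDomainAdapters(nn.Module):
    """One adapter per domain for a given Transformer layer, in the spirit of
    the FT-Res / MDL-Res variants introduced in Section 4.3."""
    def __init__(self, domains, d_model=512, d_inner=1024):
        super().__init__()
        self.adapters = nn.ModuleDict(
            {d: DomainAdapter(d_model, d_inner) for d in domains})

    def forward(self, x, domain):
        return self.adapters[domain](x)
```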
4.3 Multi-Domain Systems
Our comparison of multi-domain systems includes
our own reimplementations of recent proposals
from the literature:12
• a system using domain control as in Kobus et al. (2017): domain information is introduced either as an additional token for each source sentence (DC-Tag), or as a supplementary feature for each word (DC-Feat); a minimal sketch of both encodings is given after this list.
• a system using lexicalized domain represen-
tations (Pham et al., 2019): word embeddings
are composed of a generic and a domain
specific part (LDR);
• the three proposals of Britz et al. (2017).
TTM is a feature-based approach where the
domain tag is introduced as an extra word
on the target side. Training uses reference
tags and inference is usually performed with
predicted tags, just like for regular target
words. DM is a multi-task learner where a domain classifier is trained on top of the MT encoder, so as to make it aware of domain differences; ADM is the adversarial version
of DM, pushing the encoder towards learning
domain-independent source representations.
These methods thus only use domain tags in
训练.
• the multi-domain model of Zeng et al. (2018)
(WDCMT), where a domain-agnostic and
a domain-specialized representation of the
input are simultaneously processed; super-
vised classification and adversarial training
are used to compute these representations.
Again, inference does not use domain tags.13
12Further implementation details are in Appendix A.
13For this system, we use the available RNN-based system
from the authors (https://github.com/DeepLearnXMU
/WDCNMT), which does not directly compare to the
other, Transformer-based, systems; the improved version of
• two multi-domain versions of the approach
of Bapna and Firat (2019), denoted FT-
Res and MDL-Res, where a domain-specific
adaptation module is added to all the Trans-
former layers; within each layer, residual connections make it possible to short-cut this adapter. The former variant corresponds to the original proposal of Bapna and Firat (2019) (see
also Sharaf et al., 2020). It fine-tunes the
adapter modules of a Mixed-Nat system
independently for each domain, keeping all
the other parameters frozen. The latter uses
the same architecture, but a different training
procedure and learns all parameters jointly
from scratch with a mix-domain corpus.
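The domain-control sketch announced in the first item of this list is given below; the tag format and the token|feature rendering are illustrative choices, not the exact conventions used by the systems above.

```python
def dc_tag(source_tokens, domain):
    """DC-Tag: expose the domain as an extra leading token of the source sentence."""
    return [f"<{domain}>"] + source_tokens

def dc_feat(source_tokens, domain):
    """DC-Feat: attach the domain as a supplementary feature on every source word
    (rendered here as token|feature pairs, one possible encoding)."""
    return [f"{tok}|{domain}" for tok in source_tokens]

# Example with a MED sentence
tokens = "the patient was treated with aspirin".split()
print(dc_tag(tokens, "MED"))    # ['<MED>', 'the', 'patient', ...]
print(dc_feat(tokens, "MED"))   # ['the|MED', 'patient|MED', ...]
```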
This list includes systems that slightly depart from our definition of MDMT: Standard implementations of TTM and WDCMT rely on inferred,
rather than on gold, domain tags, which must
somewhat affect their predictions; DM and ADM
make no use of domain tags at all. We did not
consider the proposal of Farajian et al. (2017b),
然而, which performs on-the-fly tuning for
each test sentence and diverges more strongly
from our notion of MDMT.
5 Results and Discussion
5.1 Performance of MDMT Systems
In this section, we discuss the basic performance of
MDMT systems trained and tested on six domains.
Results are in Table 3. As expected, balancing data
in the generic setting makes a great difference
(the unweighted average is 2 BLEU points better,
notably owing to the much better results for REL).
As explained above, this setting should be the
baseline when the test distribution is assumed to
be balanced across domains. As all other systems
are trained with an unbalanced data distribution,
we use the weighted average to perform global
comparisons.
Fine-tuning each domain separately yields a
better baseline, outperforming Mixed-Nat for
all domains, with significant gains for domains
that are distant from MED: REL, IT, BANK, LAW.
All MDMTs (except DM and ADM) slightly improve over Mixed-Nat (for most domains), but
these gains are rarely significant. Among
systems using an extra domain feature, DC-
Tag has a small edge over DC-Feat and also
Su et al. (2019) seems to produce comparable, albeit slightly
improved, results.
Model / Domain             MED    LAW    BANK   TALK   IT     REL    wAVG   AVG
Mixed-Nat       [65m]      37.3   54.6   50.1   33.5   43.2   77.5   41.1   49.4
Mixed-Bal       [65m]      35.3   54.1   52.5   31.9   44.9   89.5   40.3   51.4
FT-Full         [6×65m]    37.7   59.2   54.5   34.0   46.8   90.8   42.7   53.8
DC-Tag          [+4k]      38.1   55.3   49.9   33.2   43.5   80.5   41.6   50.1
DC-Feat         [+140k]    37.7   54.9   49.5   32.9   43.6   79.9   41.4   49.9
LDR             [+1.4m]    37.0   54.7   49.9   33.9   43.6   79.9   40.9   49.8
TTM             [+4k]      37.3   54.9   49.5   32.9   43.6   79.9   41.0   49.7
DM              [+0]       35.6   49.5   45.6   29.9   37.1   62.4   38.1   43.4
ADM             [+0]       36.4   53.5   48.3   32.0   41.5   73.4   38.9   47.5
FT-Res          [+12.4m]   37.3   57.9   53.9   33.8   46.7   90.2   42.3   53.3
MDL-Res         [+12.4m]   37.9   56.0   51.2   33.5   44.4   88.3   42.0   51.9
Mixed-Nat-RNN   [51m]      36.8   53.8   47.2   30.0   35.7   60.2   39.2   44.0
WDCMT           [73m]      36.0   53.3   48.8   31.1   38.8   58.5   39.0   44.4
Table 3: Translation performance of MDMT systems based on the same Transformer (top) or RNN (bottom) architecture. The former contains 65m parameters, the latter has 51m. For each system,
we report the number of additional domain specific parameters, BLEU scores for each domain,
domain-weighted (WAVG) and unweighted (AVG) averages. For weighted-averages, we take the
domain proportions from Table 1. Boldface denotes significant gains with respect to Mix-Nat
(or Mix-Nat-RNN, for WDCMT), underline denotes significant losses.
requires fewer parameters; it also outperforms
TTM, which, however, uses predicted rather than
gold domain tags. TTM is also the best choice
among the systems that do not use domain tags
in inference. The best contenders overall are FT-
Res and MDL-Res, which significantly improve
over Mixed-Nat for a majority of domains,
and are the only ones to clearly fulfill [P1-
LAB]; WDCMT also improves on three domains,
but regresses on one. The use of a dedicated
adaptation module thus seems better than feature-
based strategies, but yields a large increase of the
number of parameters. The effect of the adaptation
layer is especially significant for small domains
(BANK, IT, and REL).
All systems fail
to outperform fine-tuning,
sometimes by a wide margin, especially for an
‘‘isolated’’ domain like REL. This might be due
to the fact that domains are well separated (cf. Section 4.1) and are hardly helping each other. In
this situation, MDMT systems should dedicate a
sufficient number of parameters to each domain,
so as to close the gap with fine-tuning.
5.2 Redefining Domains
桌子 4 summarizes the results of four experiments
where we artificially redefine the boundaries of
domains, with the aim to challenge requirements [P2-TUN], [P3-HET], and [P4-ERR]. In the first three, we randomly split one corpus in two
parts and proceed as if this corresponded to two
actual domains. An MD system should detect that
these two pseudo-domains are mutually beneficial
and should hardly be affected by this change
with respect to the baseline scenario (no split).
In this situation, we expect MDMT to even
surpass fine-tuning separately on each of these
dummy domains, as MDMT exploits all data,
while fine-tuning focuses only on a subpart.
In testing, we decode the test set twice, once with each pseudo-domain tag. This makes no difference for TTM, DM, ADM, and WDCMT, which do not use domain tags in testing. In the merge experiment, we merge two corpora in training, in order to assess the robustness with respect to heterogeneous domains [P3-HET]. We then translate the two corresponding tests with the same (merged) system.
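The split and merge conditions only manipulate the training-side domain labels; a minimal sketch (corpus contents and names are placeholders):

```python
import random

def split_domain(corpora, name, ratio=0.5, seed=0):
    """Split one corpus into two pseudo-domains (e.g., MED -> MED1 / MED2)."""
    rng = random.Random(seed)
    sentences = corpora[name][:]
    rng.shuffle(sentences)
    cut = int(len(sentences) * ratio)
    out = {k: v for k, v in corpora.items() if k != name}
    out[name + "1"], out[name + "2"] = sentences[:cut], sentences[cut:]
    return out

def merge_domains(corpora, names, merged_name):
    """Conjoin several domains into a single training domain (e.g., BANK+LAW)."""
    out = {k: v for k, v in corpora.items() if k not in names}
    out[merged_name] = [s for n in names for s in corpora[n]]
    return out
```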
Our findings can be summarized as follows.
For the split experiments, we see small variations
that can be positive or negative compared to
the baseline situation, but these are hardly sig-
nificant. All systems show some robustness with
respect to fuzzy domain boundaries; this is mostly
notable for ADM, suggesting that when domains are
Set-up       Split MED (0.5/0.5)    Split MED (0.25/0.75)    Split LAW (0.5/0.5)    Merge BANK+LAW      Wrong
Model        MED1     MED2          MED1     MED2            LAW1     LAW2          LAW      BANK       rnd (ALL)   NEW (NEWS)
FT-Full      −0.1     −0.6          −1.5     −0.2            −2.3     −5.1          −1.6     −1.4       −19.6       −3.3
DC-Tag       −0.2     −0.3          +0.1     +0.2            −0.4     −0.4          −0.5     −0.4       −13.4       −1.7
DC-Feat      −0.5     +0.3          0.0      +0.3            +0.3     +0.3          +0.3     +0.1       −14.2       −1.8
LDR          +0.1     +0.4          +0.1     +0.4            0.0      0.0           0.0      +0.1       −12.0       −1.4
TTM (*)      −0.2     −0.2          −0.2     −0.2            −0.3     −0.3          0.0      −0.1       0.0         −0.3
DM (*)       −0.3     −0.3          +0.4     +0.4            +0.3     +0.3          +0.1     +0.9       0.0         −0.9
ADM (*)      +0.6     +0.6          +0.4     +0.4            +0.4     +0.4          +0.1     −0.4       0.0         −0.2
FT-Res       −0.1     −0.4          −0.3     −0.3            −2.2     −2.9          −2.4     −3.2       −13.3       −3.0
MDL-Res      −0.2     −0.1          +0.2     +0.0            −0.9     −0.9          +0.7     −0.3       −18.6       −1.3
WDCMT (*)    −0.0     −0.0          +0.2     +0.2            +0.8     +0.8          −0.4     −0.8       0.0         +0.2
Table 4: Translation performance with variable domain definitions. In the Split/Merge experiments, we
report BLEU differences for the related test set(s). Underline denotes significant loss when domains
are changed wrt. the baseline situation; bold for a significant improvement over FT-Full; (*) tags
systems ignoring test domains.
close, ignoring domain differences is effective.
On the contrary, FT-Full incurs clear losses across
the board, especially for the small data condition
(Miceli Barone et al., 2017). Even in this very
favourable case however, very few MDMT sys-
tems are able to significantly outperform FT-
Full and this is only observed for the smaller
part of the MED domain. The merge condition
is hardly different, with again large losses for
FT-full and FT-Res, and small variations
for all systems. We even observe some rare
improvements with respect to the situation where
we use actual domains.
5.2.1 Handling Wrong or Unknown Domains
In the last two columns of Table 4, we report the
drop in performance when the domain information
is not correct. In the first (RND), we use test data
from the domains seen in training, presented with a
random domain tag. In this case, the loss with respect to using the correct tag is generally large (more than 10 BLEU points), showing an overall failure to meet requirement [P4-ERR], except for
systems that ignore domain tags in testing.
In the second (NEW), we assess [P5-UNK] by translating sentences from a domain unseen in training (NEWS). For each sentence, we automatically predict the domain tag and use it for decoding.14 In this configuration, again, systems
14Domain tags are assigned as follows: we train a language model for each domain and assign the tag on a per-sentence basis based on the language model log-probability (assuming uniform domain priors). The domain classifier has an average prediction error of 16.4% for in-domain data.
using domain tags during inference perform
poorly, significantly worse than the Mixed-Nat
基线 (BLEU=23.5).
5.2.2 Handling Growing Numbers of
Domains
Another set of experiments evaluates the ability
to dynamically handle supplementary domains
(requirement [P6-DYN]) as follows. Starting with
the existing MD systems of Section 5.1, we
introduce an extra domain (NEWS) and resume
training with this new mixture of data15 for 50,000
additional iterations. We contrast this approach
with training all systems from scratch and report
differences in performance in Figure 1 (see also
Table 7 in Appendix B).16 We expect that MDMT
systems should not be too significantly impacted
by the addition of a new domain and reach
about the same performance as when training
with this domain from scratch. From a practical
viewpoint, dynamically integrating new domains
is straightforward for DC-Tag, DC-Feat, or
15The design of a proper balance between domains in
training is critical for achieving optimal performance: As our
goal is to evaluate all systems in the same conditions, we consider a basic mixing policy based on the new training distribution. This is detrimental to the small domains, for
which the ‘‘negative transfer’’ effect is stronger than for
larger domains.
16WDCMT results are excluded from this table, as
resuming training proved difficult to implement.
Figure 1: Ability to handle a new domain. We report BLEU scores for a complete training session with seven domains, as well as differences (in blue) with training with six domains (from Table 3); and (in red) differences with continual training.
TTM, for which new domains merely add new
labels. It is less easy for DM, ADM, and WDCMT,
which include a built-in domain classifier whose
outputs have to be pre-specified, or, for LDR, FT-
Res, and MDL-Res, for which the number of
possible domains is built in the architecture and
has to be anticipated from the start. This makes a
difference between domain-bounded systems, for
which the number of domains is limited and truly
open-domain systems.
We can first compare the results of coldstart
training with six or seven domains in Table 7:
A first observation is that
the extra training
data is hardly helping for most domains, except for NEWS, where we see a large gain, and for
TALK. The picture is the same when one looks
at MDMTs, where only the weakest systems
(DM, ADM) seem to benefit from more (out-of-
domain) data. Comparing now the coldstart with
the warmstart scenario, we see that the former is
always significantly better for NEWS, as expected,
and that resuming training also negatively impacts
the performance for other domains. This happens
notably for DC-Tag, TTM, and ADM. In this setting, MDL-Res and DM show the smallest average loss,
with the former achieving the best balance of
training cost and average BLEU score.
5.3 Automatic Domains
In this section, we experiment with automatic domains, obtained by clustering sentences into
k = 30 classes using the k-means algorithm based
on generic sentence representations obtained via
mean pooling (cf. Section 4.1). This allows us
to evaluate requirement [P7-scale], training, and
testing our systems as if these domains were
fully separated. Many of these clusters are mere
splits of the large MED, while a smaller number of classes are mixtures of two (or more) existing domains (full details are in Appendix C). We
are thus in a position to reiterate, at a larger
scale, the measurements of Section 5.2 and test
whether multi-domain systems can effectively
take advantage of the cross-domain similarities and eventually perform better than fine-tuning.
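The construction of these automatic domains can be sketched as follows (an illustration only: the sentence vectors are assumed to be the mean-pooled source-side Transformer representations of Section 4.1, already computed):

```python
import numpy as np
from sklearn.cluster import KMeans

def automatic_domains(sentence_vectors, k=30, seed=0):
    """Cluster sentence representations into k pseudo-domains with k-means
    and return one cluster id per sentence."""
    km = KMeans(n_clusters=k, random_state=seed)
    return km.fit_predict(np.asarray(sentence_vectors))

# cluster_ids = automatic_domains(train_vectors, k=30)
# Each cluster id is then used exactly like a regular domain tag in training and testing.
```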
The results in Table 5 also suggest that MDMT
can surpass fine-tuning for the smaller clusters;
for the large clusters, this is no longer true. The
complete table (in Appendix C) shows that this
Model / Clusters   Train size   Mixed-Nat   FT-Full   FT-Res   MDL-Res   DC-Feat   DC-Tag   TTM    ADM    DM     LDR
10 small           29.3k        68.3        70.0      70.7     71.2      70.6      53.1     67.3   69.8   67.0   70.2
10 mid             104.7k       44.8        48.0      46.0     45.7      44.8      44.3     44.5   43.7   41.6   44.5
10 large           251.1k       50.4        52.9      52.0     51.3      49.6      43.2     49.1   48.5   44.3   49.5
Avg                128.4k       54.5        57.0      56.2     56.1      55.0      46.9     53.6   54.0   51.0   54.7
Table 5: BLEU scores computed by merging the 10 smaller, medium, and larger cluster test sets. Best score for each group is in boldface. For the small clusters, full fine-tuning is outperformed by
several MDMT systems – see details in Appendix C.
Domain / Model   MED    LAW    BANK   TALK   IT     REL    wAVG   AVG
DC-Tag           38.5   54.0   49.0   33.6   42.2   76.7   41.6   49.0
DC-Feat          37.3   54.2   49.3   33.6   41.9   75.8   40.8   48.7
LDR              37.4   54.1   48.7   32.5   41.4   75.9   39.1   48.3
TTM              37.4   53.7   48.9   32.8   41.3   75.8   40.7   48.3
DM               35.4   49.3   45.2   29.7   37.1   60.0   37.8   42.8
ADM              36.1   53.5   48.0   32.0   41.1   72.1   39.5   47.1
FT-Res           37.5   55.7   51.1   33.1   44.1   86.7   41.6   51.4
MDL-Res          37.3   55.5   50.2   32.2   42.1   86.7   41.2   50.7
WDCMT            35.6   53.1   48.4   30.5   37.7   56.0   38.5   43.6
Table 6: Translation performance with automatic domains, computed with the original test sets. Significance tests are for comparisons with the six-domain scenario (Table 3).
effect is more visible for small subsets of the
medical domain.
Finally, Table 6 reports the effect of using automatic domains for each of the six test sets: Each sentence was first assigned to an automatic class, translated with the corresponding multi-domain system with 30 classes; aggregate numbers
were then computed, and contrasted with the six-
domain scenario. Results are clear and confirm
previous observations: Even though some clusters
are very close, the net effect is a loss in perfor-
mance for almost all systems and conditions. In
this setting, the best MDMT in our pool (MDL-
Res) is no longer able to surpass the Mix-Nat
baseline.
6 Related Work
The multi-domain training regime is more the
norm than the exception for natural language processing (Dredze and Crammer, 2008; Finkel
and Manning, 2009), and the design of multi-
domain systems has been proposed for many
language processing tasks. We focus here ex-
clusively on MD machine translation, 保持
in mind that similar problems and solutions
(parameter sharing, instance selection / weighting,
adversarial training, etc.) have been studied in
other contexts.
Multi-domain translation was already proposed
for statistical MT, either considering as we do
multiple sources of training data (e.g., Banerjee
等人。, 2010; Clark et al., 2012; Sennrich et al.,
2013; Huck et al., 2015), or domains made of
multiple topics (Eidelman et al., 2012; Hasler
等人。, 2014). Two main strategies were considered:
instance-based, involving a measure of similarities
between train and test domains; feature-based,
where domain/topic labels give rise to additional
特征.
The latter strategy has been widely used in
NMT: Kobus et al. (2017) inject an additional
domain feature in their seq2seq model, either in the form of an extra (initial) domain token or in the form of an additional domain feature associated to each word. These results are reproduced by Tars and Fishel (2018), who
also consider automatically induced domain tags.
This technique also helps control the style of MT outputs in Sennrich et al. (2016a) and
Niu et al. (2018), and to encode the source or
target languages in multilingual MT (Firat et al.,
2016; Johnson et al., 2017). Domain control can
also be performed on the target side, as in Chen
et al. (2016), where a topic vector describing
the whole document serves as an extra context
in the softmax layer of the decoder. Such ideas
are further developed in Chu and Dabre (2018)
and Pham et al. (2019), where domain differences
and commonalities are encoded in the network architecture: Some parameters are shared across domains, while others are domain-specific.
Techniques proposed by Britz et al. (2017)
aim to ensure that domain information is actually
used in a mix-domain system. Three methods are
considered, using either domain classification (or domain normalization, via adversarial training) on the source or target side. There is no clear winner in either of the three language pairs considered. One contribution of this work is the idea of normalizing representations through adversarial training, so as to make the mixture of heterogeneous data more effective; representation
normalization has since proven a key ingredient
in multilingual transfer learning. The same basic
techniques (parameter sharing, automatic domain
identification / normalization) are simultaneously
at play in Zeng et al. (2018) and Su et al. (2019):
In this approach, the lower layers of the MT
use auxiliary classification tasks to disentangle
domain specific representations on the one hand
from domain-agnostic representations on the other
hand. These representations are then processed as
two separate inputs, then recombined to compute
the translation.
Another parameter-sharing scheme is in Jiang
et al. (2019), which augments a Transformer model with domain-specific heads, whose contributions are regulated at the word/position level: Some words have ''generic'' use and rely on mixed-domain heads, whereas for some other words it is preferable to use domain-specific heads, thereby reintroducing the idea of ensembling at the core of Huck et al. (2015) and Saunders et al. (2019). The results for three language pairs outperform several standard baselines for two-domain systems (in fr:en and de:en) and a four-domain system (zh:en).
Finally, Farajian et al. (2017b), Li et al. (2018), and Xu et al. (2019) adopt a different strategy. Each test sentence triggers the selection
of a small set of related instances; using these,
a generic NMT is tuned for some iterations,
before delivering its output. This approach entirely
dispenses with the notion of domain and relies
on data selection techniques to handle data
heterogeneity.
7 Conclusion and Outlook
In this study, we have carefully reconsidered
the idea of multi-domain machine translation,
which seems to be taken for granted in many
recent studies. We have spelled out the various
motivations for building such systems and the
associated expectations in terms of system per-
formance. We have then designed a series of
requirements that MDMT systems should meet,
and proposed a series of associated test pro-
cedures. In our experiments with a representative
sample of MDMTs, we have found that most
requirements were hardly met for our experimen-
tal conditions. If MDMT systems are able to
outperform the mixed-domain baseline, at least for some domains, they all fall short of matching the performance of fine-tuning on each individual domain, which remains the best choice in multi-source single domain adaptation. As expected however, MDMTs are less brittle than fine-tuning when domain frontiers are uncertain, and can, to a certain extent, dynamically accommodate
additional domains, this being especially easy
for feature-based approaches. Our experiments
finally suggest that all methods show decreasing
performance when the number of domains or the
diversity of the domain mixture increases.
Two other main conclusions can be drawn
from this study: First, it seems that more work
is needed to make MDMT systems make the best
out of the variety of the available data, both to
effectively share what needs to be shared while at
the same time separating what needs to be kept
separated. We notably see two areas worthy of
further exploration: the development of parameter
sharing strategies when the number of domains
is large; and the design of training strategies that
can effectively handle a change of the training
mixture, including an increase in the number of
域. Both problems are of practical relevance
in industrial settings. Second, and maybe more importantly, there is a general need to adopt better
evaluation methodologies for evaluating MDMT
系统, which require systems developers to
clearly spell out the testing conditions and the
associated expected distribution of testing instances, and to report more than comparisons with simple baselines on a fixed and known handful of domains.
Acknowledgments
The work presented in this paper was partially
supported by the European Commission under
contract H2020-787061 ANITA.
This work was granted access to the HPC
resources of [TGCC/CINES/IDRIS] under the
allocation 2020-[AD011011270] made by GENCI
(Grand Equipement National de Calcul Intensif).
References
Naveen Arivazhagan, Ankur Bapna, Orhan
Firat, Dmitry Lepikhin, Melvin Johnson,
Maxim Krikun, Mia Xu Chen, Yuan Cao,
George Foster, Colin Cherry, Wolfgang
Macherey, Zhifeng Chen, and Yonghui Wu.
2019. Massively multilingual neural machine
translation in the wild: Findings and challenges.
arXiv e-prints, abs/1907.05019.
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 355–362. Edinburgh, UK.
Pratyush Banerjee, Jinhua Du, Baoli Li, Sudip
Kumar Naskar, Andy Way, and Josef van
Genabith. 2010. Combining multi-domain
statistical machine translation models using
automatic classifiers. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas, AMTA 2010. Denver, CO, USA.
Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, pages 1538–1548, Hong Kong, China. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/D19-1165
Shai Ben-David, John Blitzer, Koby Crammer,
Alex Kulesza, Fernando Pereira, and Jenn
Wortman. 2010. A theory of learning from
different domains. Machine Learning, 79(1):
151–175. DOI: https://doi.org/10
.1007/s10994-009-5152-4
Nicola Bertoldi and Marcello Federico. 2009.
Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 182–189, Athens, Greece. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.3115/1626431.1626468
John Blitzer. 2007. Domain Adaptation of Natu-
ral Language Processing Systems. Ph.D. thesis, School of Computer Science, University of Pennsylvania.
Denny Britz, Quoc Le, and Reid Pryzant.
2017. Effective domain mixing for neural
machine translation. In Proceedings of the Second Conference on Machine Translation, pages 118–126, Copenhagen, Denmark. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/W17
-4712
Mauro Cettolo, Christian Girardi, and Marcello
Federico. 2012. Wit3: Web inventory of
transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268. Trento, Italy.
Wenhu Chen, Evgeny Matusov, Shahram Khadivi, and Jan-Thorsten Peter. 2016. Guided alignment training for topic-aware neural machine translation. In Proceedings of the Twelfth Biennial Conference of the Association for Machine Translation in the Americas, AMTA 2016. Austin, Texas.
Chenhui Chu and Raj Dabre. 2018. Multilingual and multi-domain adaptation for neural machine translation. In Proceedings of the 24th Annual Meeting of the Association for Natural Language Processing, NLP 2018, pages 909–912, Okayama, Japan.
Chenhui Chu and Rui Wang. 2018. A survey
of domain adaptation for neural machine
translation. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, pages 1304–1319, Santa Fe, New Mexico, USA.
Jonathan H. Clark, Alon Lavie, and Chris Dyer. 2012. One system, many domains: Open-domain statistical machine translation via feature augmentation. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA 2012). San Diego, CA.
Hal Daumé III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research (JAIR),
26:101–126. DOI: https://doi.org/10
.1613/jair.1872
Mark Dredze and Koby Crammer. 2008. Online methods for multi-domain learning and adaptation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 689–697, Honolulu, Hawaii. DOI: https://doi.org
/10.3115/1613715.1613801
Vladimir Eidelman, Jordan Boyd-Graber, and Philip Resnik. 2012. Topic models for dynamic translation model adaptation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 115–119, Jeju Island, Korea. Association for Computational Linguistics.
M. Amin Farajian, Marco Turchi, Matteo Negri, Nicola Bertoldi, and Marcello Federico. 2017a. Neural vs. phrase-based machine translation in a multi-domain scenario. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 280–284, Valencia, Spain. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/E17
-2045
M. Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. 2017b. Multi-domain neural machine translation through unsupervised adaptation. In Proceedings of the Second Conference on Machine Translation, pages 127–137, Copenhagen, Denmark. DOI:
https://doi.org/10.18653/v1/W17
-4713
Jenny Rose Finkel and Christopher D. Manning. 2009. Hierarchical Bayesian domain adaptation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 602–610, Boulder, Colorado. DOI:
https://doi.org/10.3115/1620754
.1620842
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio.
2016. Multi-way, multilingual neural machine
translation with a shared attention mechanism.
In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875. Association for Computational Linguistics. DOI: https://
doi.org/10.18653/v1/N16-1101
George Foster and Roland Kuhn. 2007. Mixture-
model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 128–135, Prague, Czech Republic.
Markus Freitag and Yaser Al-Onaizan. 2016.
Fast domain adaptation for neural machine
翻译. CoRR, abs/1612.06897.
Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic topic adaptation for phrase-based MT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 328–337, Gothenburg, Sweden. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.3115/v1/E14
-1035
Judy Hoffman, Mehryar Mohri, and Ningshan Zhang. 2018. Algorithms and theory for multiple-source adaptation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8246–8256, Curran Associates, Inc.
Matthias Huck, Alexandra Birch, and Barry
Haddow. 2015. Mixed domain vs. multi-domain
statistical machine translation. In Proceedings
of the Machine Translation Summit, MT Summit XV, pages 240–255. Miami, Florida.
Ann Irvine, John Morgan, Marine Carpuat,
Hal Daumé III, and Dragos Munteanu. 2013. Measuring machine translation errors in new domains. Transactions of the Association for Computational Linguistics, 1:429–440. DOI:
https://doi.org/10.1162/tacl
00239
Haoming Jiang, Chen Liang, Chong Wang, and
Tuo Zhao. 2019. Multi-domain neural machine
translation with word-level adaptive layer-
wise domain mixing. CoRR, abs/1911.02692.
DOI: https://doi.org/10.18653/v1
/2020.acl-main.165, PMID: 31986961
Jing Jiang and ChengXiang Zhai. 2007. Instance
weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 264–271, Prague, Czech Republic. Association for Computational Linguistics.
Anton Schwaighofer Joaquin Quionero-Candela,
Masashi Sugiyama and Neil D. Lawrence, editors. 2008. Dataset Shift in Machine Learning, Neural Information Processing series. The MIT Press. DOI: https://doi.org/10.7551
/mitpress/9780262170055.001.0001
Melvin Johnson, Mike Schuster, Quoc Le, Maxim
Krikun, Yonghui Wu, Zhifeng Chen, Nikhil
Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351. DOI:
https://doi.org/10.1162/tacl
00065
Mahesh Joshi, Mark Dredze, William W.
Cohen, and Carolyn P. Rose. 2012. Multi-domain learning: When do domains matter? In Empirical Methods in Natural Language Processing (EMNLP), pages 1302–1312.
Guillaume Klein, Yoon Kim, Yuntian Deng,
Jean Senellart, and Alexander Rush. 2017.
OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi
.org/10.18653/v1/P17-4012
Catherine Kobus, Josep Crego, and Jean Senellart.
2017. Domain control for neural machine trans-
lation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 372–378,
Varna, Bulgaria. DOI: https://doi.org
/10.26615/978-954-452-049-6 049
Philipp Koehn. 2004. Statistical significance tests
for machine translation evaluation. In Pro-
ceedings of the 2004 Conference on Empirical
Methods in Natural Language Processing,
pages 388–395, Barcelona, Spain. Association
for Computational Linguistics.
Xiaoqing Li, Jiajun Zhang, and Chengqing Zong. 2018. One sentence one model for neural machine translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan. European Language
Resources Association (ELRA).
Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domain. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT, Da Nang, Vietnam.
Yishay Mansour, Mehryar Mohri, and Afshin
Rostamizadeh. 2009a. Domain adaptation with multiple sources. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1041–1048, Curran Associates, Inc.
Yishay Mansour, Mehryar Mohri, and Afshin
Rostamizadeh. 2009b. Multiple source adaptation and the Rényi divergence. In Proceedings
of the 25th Conference on Uncertainty in Arti-
ficial Intelligence, UAI 2009, pages 367–374.
Antonio Valerio Miceli Barone, Barry Haddow,
Ulrich Germann, and Rico Sennrich. 2017.
Regularization techniques for fine-tuning in
neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1489–1494, Copenhagen, Denmark.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/D17-1156
Paul Michel and Graham Neubig. 2018. Extreme adaptation for personalized neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 312–318, Melbourne, Australia. Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/P18-2050
Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, Xinyi Wang, and John Wieting. 2019. compare-mt: A tool for holistic comparison of language generation systems. CoRR, abs/1903.07926. DOI: https://doi
.org/10.18653/v1/N19-4007
Xing Niu, Sudha Rao, and Marine Carpuat. 2018. Multi-task neural models for translating between styles within and across languages. In Emily M. Bender, Leon Derczynski, and Pierre Isabelle, editors, Proceedings of the 27th International Conference on Computational Linguistics, COLING, pages 1008–1021, Santa Fe, New Mexico, USA.
Yonatan Oren, Shiori Sagawa, Tatsunori Hashimoto, and Percy Liang. 2019. Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4227–4237, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1432
Sinno Jialin Pan and Qiang Yang. 2010. A
survey on transfer learning. IEEE Transactions
on Knowledge and Data Engineering, 22(10):
1345–1359. DOI: https://doi.org/10
.1109/TKDE.2009.191
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL'02, pages 311–318, Stroudsburg, PA, USA. DOI: https://doi.org/10.3115/1073083.1073135
Minh Quang Pham, Josep-Maria Crego, Jean Senellart, and François Yvon. 2019. Generic and specialized word embeddings for multi-domain machine translation. In Proceedings of the 16th International Workshop on Spoken Language Translation, IWSLT, 9 p., Hong Kong, China.
Hassan Sajjad, Nadir Durrani, Fahim Dalvi,
Yonatan Belinkov, and Stephan Vogel. 2017.
Neural machine translation training in a multi-
domain scenario. In Proceedings of the 14th
International Workshop on Spoken Language
Translation, IWSLT 2017, Tokyo, Japan.
Danielle Saunders, Felix Stahlberg, and Bill Byrne. 2019. UCAM biomedical translation at WMT19: Transfer learning multi-domain ensembles. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 169–174, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W19-5421
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N16-1005
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. DOI: https://doi.org/10.18653/v1/P16-1162
Rico Sennrich, Holger Schwenk, and Walid Aransa. 2013. A multi-domain translation model framework for statistical machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 832–840, Sofia, Bulgaria. Association for Computational Linguistics.
Amr Sharaf, Hany Hassan, and Hal Daumé III. 2020. Meta-learning for few-shot NMT adaptation. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 43–53, Online. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.ngt-1.5
Hidetoshi Shimodaira. 2000. Improving predictive
inference under covariate shift by weighting the
log-likelihood function. Journal of Statistical
Planning and Inference, 90(2):227–244. DOI:
https://doi.org/10.1016/S0378
-3758(00)00115-4
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufis, and Daniel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC'06, Genoa, Italy. European Language Resources Association (ELRA).
Jinsong Su, Jiali Zeng, Jun Xie, Huating Wen,
Yongjing Yin, and Yang Liu. 2019. Exploring
discriminative word-level domain contexts
for multi-domain neural machine translation.
IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI), pages 1–1. DOI:
https://doi.org/10.1109/TPAMI
.2019.2954406, PMID: 31751225
Sander Tars and Mark Fishel. 2018. Multi-domain neural machine translation. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, EAMT, pages 259–269, Alicante, Spain. EAMT.
Jörg Tiedemann. 2009. News from OPUS – A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria. DOI: https://doi.org/10.1075/cilt.309.19tie
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eight International Conference on Language Resources and Evaluation, LREC'12, Istanbul, Turkey. European Language Resources Association (ELRA).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008, Curran Associates, Inc.
Jitao Xu, Josep Crego, and Jean Senellart.
2019. Lexical micro-adaptation for neural
machine translation. In Proceedings of the 16th
International Workshop on Spoken Language
Translation, IWSLT 2019, Hong Kong, China.
Jiali Zeng, Jinsong Su, Huating Wen, Yang Liu, Jun Xie, Yongjing Yin, and Jianqiang Zhao. 2018. Multi-domain neural machine translation with word-level domain context discrimination. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 447–457, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1041
Jian Zhang, Liangyou Li, Andy Way, and Qun Liu. 2016. Topic-informed neural machine translation. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, COLING 2016, pages 1807–1817, Osaka, Japan. The COLING 2016 Organizing Committee.
Appendices
A. Description of Multi-Domain Systems
We use the following setups for MDMT systems.
• Mixed-Nat, FT-full, TTM, and DC-Tag use the medium Transformer model of Vaswani et al. (2017) with the following settings: the embedding size and the hidden layer size are set to 512; multi-head attention comprises
8 heads in each of the 6 layers; the inner feedforward layer contains 2,048 cells. Training uses a batch size of 12,288 tokens; optimization uses Adam with parameters β1 = 0.9, β2 = 0.98 and the Noam decay schedule (warmup steps = 4,000), and a dropout rate of 0.1 for all layers.
• FT-Res and MDL-Res use the same medium Transformer and add residual layers with a bottleneck dimension of 1,024.
• ADM and DM use the medium Transformer model and a domain classifier composed of 3 dense layers of sizes 512 × 2,048, 2,048 × 2,048, and 2,048 × domain num. The first two layers of the classifier use ReLU as their activation function; the last layer uses tanh.
• DC-Feat uses the medium Transformer model and domain embeddings of size 4. Given a sentence of domain i in a training batch, the embedding of domain i is concatenated to the embedding of each token in the sentence.
• LDR uses the medium Transformer model and introduces, for each token, an LDR feature of size 4 × domain num. Given a sentence of domain i ∈ [1, .., K] in the training batch, for each token of the sentence, the LDR units whose indexes fall outside the range [4(i − 1), .., 4i − 1] are masked to 0, and the masked LDR feature is concatenated to the embedding of the token (a schematic sketch is given after this list). Details are in Pham et al. (2019).
• Mixed-Nat-RNN uses one bidirectional
LSTM layer in the encoder and one LSTM
layer in the decoder. The size of hidden layers
是 1,024, the size of word embeddings is 512.
• WDCNMT uses one bidirectional GRU layer in
the encoder and one GRU-conditional layer
in the decoder. The size of hidden layers is
1,024, the size of word embeddings is 512.
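To make the LDR mechanism concrete, the following is a minimal NumPy sketch of the masking and concatenation step described in the LDR item above. The helper masked_ldr, the toy dimensions, and the random features are ours for illustration only; in the actual systems the LDR features are parameters learned jointly with the rest of the network.

    import numpy as np

    def masked_ldr(ldr_feature, domain, num_domains, units_per_domain=4):
        # Keep only the LDR units reserved for `domain` (1-indexed), zeroing the rest.
        assert ldr_feature.shape[0] == units_per_domain * num_domains
        mask = np.zeros_like(ldr_feature)
        start = units_per_domain * (domain - 1)        # block [4(i-1), ..., 4i-1]
        mask[start:start + units_per_domain] = 1.0
        return ldr_feature * mask

    # Toy example with K = 6 domains and a reduced embedding size of 8 (512 in our setups).
    K, emb_dim = 6, 8
    rng = np.random.default_rng(0)
    token_embedding = rng.normal(size=emb_dim)         # embedding of one token
    ldr_feature = rng.normal(size=4 * K)               # LDR feature attached to this token
    token_input = np.concatenate(
        [token_embedding, masked_ldr(ldr_feature, domain=3, num_domains=K)]
    )
    print(token_input.shape)                           # (32,) = embedding (8) + LDR (24)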
Training For each domain, we create train/dev/test sets by randomly splitting each corpus. The validation and test sets contain 1,000 lines for every domain. The learning rate is set as in Vaswani et al. (2017). For the fine-tuning procedures used for FT-full and FT-Res, we continue training with the same learning rate schedule, continuing to increment the step counter. All other MDMT systems reported in Tables 3 and 4 use a combined validation set comprising 6,000 lines, obtained by merging the six development sets. For the results in Table 7 we also append the validation set of NEWS to the multi-domain validation set. In any case, training stops either when it reaches the maximum number of iterations (50,000) or when the score on the validation set has not improved for three consecutive evaluations. We average five checkpoints to get the final model.
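For illustration, the sketch below mirrors the stopping rule and checkpoint averaging just described. The callables train_some_steps and evaluate are hypothetical stand-ins for the actual training and validation routines, and plain NumPy arrays stand in for model parameters.

    import numpy as np

    def average_checkpoints(checkpoints):
        # Element-wise average of parameter dictionaries sharing keys and shapes.
        keys = checkpoints[0].keys()
        return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}

    def train_loop(train_some_steps, evaluate, max_steps=50_000, eval_every=1_000, patience=3):
        # Stop at max_steps or after `patience` evaluations without improvement,
        # then average the last five checkpoints to obtain the final model.
        best, bad_evals, step = -np.inf, 0, 0
        recent = []                                # most recent checkpoints (parameter dicts)
        while step < max_steps and bad_evals < patience:
            params = train_some_steps(eval_every)  # assumed to return the current parameters
            step += eval_every
            recent = (recent + [params])[-5:]
            score = evaluate(params)               # e.g., BLEU on the merged validation set
            if score > best:
                best, bad_evals = score, 0
            else:
                bad_evals += 1
        return average_checkpoints(recent)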
乙. Experiments with Continual Learning
Complete results for the experiments with con-
tinual learning are reported in Table 7.
C. Experiments with Automatic Domains
This experiment uses automatic domains to simulate a scenario where the number of "domains" is large and where some "domains" are close and can effectively share information. Full results are in Table 8. Cluster sizes vary from approximately 8k sentences (cluster 24) up to more than 350k sentences. More than two thirds of these clusters mostly comprise texts from one single domain (e.g., cluster 12, which is predominantly MED); the remaining clusters typically mix two domains. Fine-tuning with small domains is often outperformed by other MDMT techniques, an issue that a better regularization strategy might mitigate. Domain control (DC-Feat) is very effective for small domains, but again less so in larger data conditions. Among the MD models, approaches using residual adapters have the best average performance.
Mixed-Nat: MED 37.1 (+0.2 | –); LAW 54.1 (+0.5 | –); BANK 49.6 (+0.5 | –); TALK 34.1 (−0.6 | –); IT 42.1 (+1.1 | –); REL 77.0 (+0.5 | –); NEWS 28.9 (−5.4 | –); wAVG 40.8 (+0.3 | –); AVG 49.0 (+0.4 | –)
DC-Tag: MED 37.7 (+0.3 | +0.3); LAW 54.5 (+0.8 | −0.1); BANK 49.9 (−0.04 | −0.6); TALK 34.8 (−1.6 | −1.1); IT 43.9 (−0.4 | −1.3); REL 78.8 (+1.7 | −3.5); NEWS 29.5 (−7.7 | −1.4); wAVG 41.4 (+0.2 | −0.1); AVG 49.9 (+0.1 | −1.1)
DC-Feat: MED 37.4 (+0.3 | −0.2); LAW 54.9 (−0.1 | −0.1); BANK 50.0 (−0.3 | −0.1); TALK 34.7 (−1.3 | −0.6); IT 43.9 (−0.1 | −0.9); REL 79.6 (+0.4 | +0.3); NEWS 28.9 (−7.3 | −0.8); wAVG 41.2 (+0.1 | −0.2); AVG 50.1 (−0.2 | −0.3)
LDR: MED 37.0 (0.0 | −0.6); LAW 54.6 (+0.1 | +0.5); BANK 49.6 (+0.2 | −0.4); TALK 34.3 (−0.4 | −0.6); IT 43.0 (+0.5 | +0.5); REL 77.0 (+2.9 | +3.8); NEWS 28.7 (−6.6 | −0.9); wAVG 40.8 (+0.6 | +0.5); AVG 49.2 (+0.1 | −0.4)
TTM: MED 37.3 (0.0 | −0.3); LAW 54.4 (+0.4 | −0.3); BANK 49.6 (−0.1 | −0.5); TALK 33.8 (−0.9 | −1.1); IT 42.9 (+0.6 | −1.0); REL 78.2 (+1.8 | −4.0); NEWS 29.1 (−5.7 | −1.4); wAVG 41.0 (0.0 | −0.5); AVG 49.4 (+0.3 | −1.2)
DM: MED 36.0 (−0.4 | +0.6); LAW 51.3 (−1.8 | +0.4); BANK 46.8 (−1.2 | +0.6); TALK 31.8 (−1.8 | −0.1); IT 39.8 (−2.6 | +0.5); REL 65.7 (−3.3 | 0.0); NEWS 27.0 (−4.4 | −1.2); wAVG 38.9 (−0.8 | +0.5); AVG 45.2 (−1.8 | +0.3)
ADM: MED 36.6 (−0.2 | +0.3); LAW 54.2 (−0.7 | −0.8); BANK 49.1 (−0.8 | −0.8); TALK 32.9 (−0.9 | −0.2); IT 42.1 (−0.5 | −0.4); REL 75.7 (−2.3 | −5.0); NEWS 28.7 (−5.4 | −1.9); wAVG 40.2 (−0.5 | −0.2); AVG 48.4 (−0.9 | −1.1)
FT-Res: MED 37.0 (+0.3 | +0.3); LAW 57.6 (+0.4 | +0.4); BANK 53.8 (+0.1 | +0.1); TALK 34.5 (−0.7 | −0.7); IT 46.1 (+0.5 | +0.5); REL 91.1 (−0.9 | −0.9); NEWS 29.6 (−9.0 | −0.6); wAVG 42.2 (−0.1 | −0.1); AVG 53.3 (+0.2 | +0.2)
MDL-Res: MED 37.7 (+0.2 | −0.2); LAW 55.6 (+0.4 | +0.5); BANK 51.1 (+0.1 | 0.0); TALK 34.4 (−0.9 | −0.4); IT 44.5 (−0.1 | −0.2); REL 87.5 (+0.9 | −0.2); NEWS 29.1 (−8.0 | −0.8); wAVG 41.9 (+0.1 | −0.2); AVG 51.8 (+0.1 | −0.1)
Table 7: Ability to handle a new domain. We report BLEU scores for a complete training session with seven domains, as well as, in parentheses, differences with (left) training with six domains (from Table 3) and (right) the continuous training mode. Averages only take into account six domains (NEWS excluded). Underline denotes a significant loss, bold a significant gain.
Table 8: Complete results for the experiments with automatic domains. For each cluster, we report: the majority domain when one domain accounts for more than 75% of the class; training and test sizes; and BLEU scores obtained with the various systems used in this study (Mixed-Nat, FT-full, FT-Res, MDL-Res, DC-Feat, DC-Tag, TTM, ADM, DM, and LDR). Most test sets are too small to report significance tests.