PERL: Pivot-based Domain Adaptation for Pre-trained Deep
Contextualized Embedding Models

Eyal Ben-David∗

Carmel Rabinovitz∗

Roi Reichart

Technion, Israel Institute of Technology
{eyalbd12@campus.|carmelrab@campus.|roiri@}technion.ac.il

Abstract

Pivot-based neural representation models have
led to significant progress in domain adapta-
tion for NLP. However, previous research
following this approach utilizes only labeled
data from the source domain and unlabeled
data from the source and target domains,
but neglects to incorporate massive unlabeled
corpora that are not necessarily drawn from
these domains. To alleviate this, we propose
PERL: A representation learning model that
extends contextualized word embedding mod-
els such as BERT (Devlin et al., 2019) with
pivot-based fine-tuning. PERL outperforms
strong baselines across 22 sentiment classifica-
tion domain adaptation setups, improves in-
domain model performance, yields effective
reduced-size models, and increases model
stability.1

1 Introduction

Natural Language Processing (NLP) algorithms
are constantly improving, gradually approaching
human-level performance (Dozat and Manning,
2017; Edunov et al., 2018; Radford et al., 2018).
However, those algorithms often depend on the
availability of large amounts of manually annota-
ted data from the domain in which the task is
performed. Unfortunately, collecting such anno-
tated data is often costly and laborious, which
substantially limits the applicability of NLP
technology.

Domain Adaptation (DA), training an algorithm
on annotated data from a source domain so that it
can be effectively applied to other target domains,
is one of the ways to solve the above bottleneck.

∗Both authors contributed equally to this work.
1Our code is at https://github.com/eyalbd2/PERL.


Indeed, over the years substantial efforts have been
devoted to the DA challenge (Roark and Bacchiani,
2003; Daumé III and Marcu, 2006; Ben-David
et al., 2010; Jiang and Zhai, 2007; McClosky et al.,
2010; Rush et al., 2012; Schnabel and Schütze,
2014). Our focus in this paper is on unsupervised
DA, the setup we consider most realistic. In this
setup labeled data is available only from the source
domain and unlabeled data is available from both
the source and the target domains.

While various approaches for DA have been
proposed (§2), with the prominence of deep
neural network (DNN) modeling, attention has
been recently focused on representation learn-
ing approaches. Within representation learning
for unsupervised DA, two approaches have been
shown particularly useful. In one line of work,
DNN-based methods that use compression-based
noise reduction to learn cross-domain features
have been developed (Glorot et al., 2011; Chen
et al., 2012). In another line of work, methods
based on the distinction between pivot and non-
pivot features (Blitzer et al., 2006, 2007) learn
a joint feature representation for the source and
the target domains. Later on, Ziser and Reichart
(2017, 2018), and Li et al. (2018) married the two
approaches and achieved substantial improve-
ments on a variety of DA setups.

Despite their success, pivot-based DNN models
still only utilize labeled data from the source
domain and unlabeled data from both the source
and the target domains, but neglect to incorporate
massive unlabeled corpora that are not necessarily
drawn from these domains. With the recent
game-changing success of contextualized word
embedding models trained on such massive
corpora (Devlin et al., 2019; Peters et al., 2018),
it is natural to ask whether information from such
corpora can enhance these DA methods, partic-
ularly since background knowledge from non-
contextualized embeddings has shown useful for


DA (Plank and Moschitti, 2013; Nguyen et al.,
2015).

In this paper we hence propose an unsupervised
DA approach that extends leading approaches
based on DNNs and pivot-based ideas, so that they
can incorporate information encoded in massive
corpora (§3). Our model, named PERL: Pivot-
based Encoder Representation of Language, builds
on massively pre-trained contextualized word
embedding models such as BERT (Devlin et al.,
2019). To adjust the representations learned by
these models so that they close the gap between
the source and target domains, we fine-tune their
parameters using a pivot-based variant of the
Masked Language Modeling (MLM) objective,
optimized on unlabeled data from both the source
and the target domains. We further present R-PERL
(regularized PERL), which facilitates parameter
sharing for pivots with similar meaning.

We perform extensive experimentation in vari-
ous unsupervised DA setups of the task of
binary sentiment classification (§4, 5). First, for
compatibility with previous work, we experiment
with the legacy product review domains of Blitzer
et al. (2007) (12 setups). We then experiment with
more challenging setups, adapting between the
above domains and the airline review domain
(Nguyen, 2015) used in Ziser and Reichart (2018)
(4 setups), as well as the IMDb movie review
domain (Maas et al., 2011) (6 setups). We compare
PERL to the best performing pivot-based methods
(Ziser and Reichart, 2018; Li et al., 2018) and
to DA approaches that fine-tune a massively pre-
trained BERT model by optimizing its standard
MLM objective using target-domain unlabeled
data (Lee et al., 2020; Han and Eisenstein, 2019).
PERL and R-PERL substantially outperform these
baselines, emphasizing the additive effect of mas-
sive pre-training and pivot-based fine-tuning.

As an additional contribution, we show that
pivot-based learning is effective beyond improv-
ing domain adaptation accuracy. Particularly, we
show that an in-domain variant of PERL sub-
stantially improves the in-domain performance
of a BERT-based sentiment classifier, for vary-
ing training set sizes (from 100 to 20K labeled
examples). We also show that PERL facilitates
the generation of effective reduced-size DA mod-
els. Finally, we perform an extensive ablation
study (§6) that uncovers PERL's crucial design
choices and demonstrates the stability of PERL
to hyper-parameter selection compared to other
DA methods.

2 Background and Previous Work

There are several approaches to DA, including
instance re-weighting (Sugiyama et al., 2007;
Huang et al., 2006; Mansour et al., 2008), sub-
sampling from the participating domains (Chen
et al., 2011) and DA through representation learn-
ing, where a joint representation is learned based
on texts from the source and target domains
(Blitzer et al., 2007; Xue et al., 2008; Ziser
and Reichart, 2017, 2018). We first describe the
unsupervised DA pipeline, continue with repre-
sentation learning methods for DA with a focus
on pivot-based methods, and, finally, describe
contextualized embedding models.

Unsupervised Domain Adaptation through
Representation Learning As noted in §1 our
focus in this work is on unsupervised DA through
representation learning. A common pipeline for
this setup consists of two steps: (a) Learning
a representation model (often referred to as the
encoder) using the source and target unlabeled
data; and (b) Training a supervised classifier
on the source domain labeled data. To facilitate
domain adaptation, every text fed to the classifier
in the second step is first represented by the pre-
trained encoder. This is performed both when the
classifier is trained in the source domain and when
it is applied to new text from the target domain.
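The following toy sketch (ours, not the authors' code) illustrates this two-step pipeline with stand-in components: a TF-IDF plus SVD "encoder" fit on unlabeled text from both domains, and a linear classifier trained only on encoded source labeled data. All example texts and variable names are illustrative placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Step (a): fit the "encoder" on unlabeled data from both domains.
src_unlabeled = ["the plot was thin", "battery life is great"]       # toy placeholders
trg_unlabeled = ["the flight was delayed", "friendly cabin crew"]
vec = TfidfVectorizer().fit(src_unlabeled + trg_unlabeled)
svd = TruncatedSVD(n_components=2).fit(vec.transform(src_unlabeled + trg_unlabeled))

def encode(texts):
    # every text, source or target, is represented by the same pre-trained encoder
    return svd.transform(vec.transform(texts))

# Step (b): train the task classifier on encoded source labeled data only.
src_texts, src_labels = ["the plot was thin", "battery life is great"], [0, 1]
clf = LogisticRegression().fit(encode(src_texts), src_labels)

# At test time, target texts pass through the same frozen encoder.
print(clf.predict(encode(["the flight was delayed"])))

The deep models discussed below follow the same protocol, with a DNN encoder in place of the TF-IDF/SVD stand-in.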

Exceptions to this pipeline are end-to-end mod-
els that jointly learn to perform the cross-domain
text representation and the classification task. This
is achieved by training a unified objective on the
source domain labeled data and the unlabeled data
from both the source and the target. Among these
models are domain adversarial networks (Ganin
et al., 2016), which were strongly outperformed by
Ziser and Reichart (2018) to which we compare
our methods, and the hierarchical attention transfer
network (HATN; Li et al., 2018), which is one of
our baselines (see below).

Unsupervised DA through representation learn-
ing has followed two main avenues. The first
avenue consists of works that aim to explicitly
build a feature representation that bridges the gap
between the domains. A seminal framework in this
line is structural correspondence learning (SCL;
Blitzer et al., 2006, 2007), that splits the feature
space into pivot and non-pivot features. A large


number of works have followed this idea (e.g.,
Pan et al., 2010; Gouws et al., 2012; Bollegala
et al., 2015; Yu and Jiang, 2016; Li et al., 2017,
2018; Tu and Wang, 2019; Ziser and Reichart,
2017, 2018) and we discuss it below.

Works in the second avenue learn cross-domain
representations by training autoencoders (AEs) on
the unlabeled data from the source and target do-
mains. This way they hope to obtain a more robust
representation, which is hopefully better suited for
DA. Examples for such models include the stacked
denoising AE (SDA; Vincent et al., 2008; Glorot
et al., 2011), the marginalized SDA and its variants
(MSDA; Chen et al., 2012; Yang and Eisenstein,
2014; Clinchant et al., 2016) and variational AE
based models (Louizos et al., 2016).

Recently, Ziser and Reichart (2017, 2018)
and Li et al. (2018) married these approaches
and presented pivot-based approaches where the
representation model is based on DNN encoders
(AE, long short-term memory [LSTM], or hierar-
chical attention networks). Because their methods
outperformed the above models, we aim to extend
them to models that can also exploit massive out
of (source and target) domain corpora. We next
elaborate on pivot-based approaches.

Pivot-based Domain Adaptation Proposed by
Blitzer et al. (2006, 2007) through their SCL
framework, the main idea of pivot-based DA is to
divide the shared feature space of the source and
the target domains into two complementary subsets:
one of pivots and one of non-pivots. Pivot features
are defined based on two criteria: (a) They are
frequent in the unlabeled data of both domains;
and (b) They are prominent for the classification
task defined by the source domain labeled data.
Non-pivot features are those features that do not
meet at least one of the above criteria. While
SCL is based on linear models, there have been
some very successful recent efforts to extend this
framework so that non-linear encoders (DNNs)
are utilized. Here we focus on the latter line of
work, which produces much better results, and do
not elaborate on SCL any further.

Ziser and Reichart (2018) have presented
the Pivot Based Language Model (PBLM),
which incorporates pre-training and pivot-based
learning. PBLM is a variant of an LSTM-based
language model, but instead of predicting at each
point the most likely next input word, it predicts
the next input unigram or bigram if one of these
is a pivot (if both are, it predicts the bigram),
and NONE otherwise. In the unsupervised DA
pipeline PBLM is trained on the source and target
unlabeled data. Then, when the task classifier is
trained and applied to the target domain, PBLM is
used as a contextualized word embedding layer.
Notice that PBLM is not pre-trained on massive
out of (source and target) domain corpora, and its
single-layer, unidirectional LSTM architecture is
probably not ideal for knowledge encoding from
such corpora.

Another work in this line is HATN (Li et al.,
2018). This model automatically learns the pivot/
non-pivot distinction, rather than following the
SCL definition as Ziser and Reichart (2017,
2018) do. HATN consists of two hierarchical
attention networks, P-net and NP-net. First, it
trains the P-net on the source labeled data. Then,
it decodes the most prominent tokens of P-net
(i.e., tokens that received the highest attention
values), and considers them as its pivots. Finally,
it simultaneously trains the P-net and the NP-net
on both the labeled and the unlabeled data, such
that P-net is adversarially trained to predict the
domain of the input example (Ganin et al., 2016)
and NP-net is trained to predict its pivots, and
hidden representations from both networks serve
for the task label (sentiment) prediction.

Both HATN and PBLM strongly outperform a
large variety of previous DA models on various
cross-domain sentiment classification setups. Thus,
they are our major baselines in this work. Like
PBLM, we use the same definition of the pivot
and non-pivot subsets as in Blitzer et al. (2007).
Like HATN, we also use an attention-based DNN.
Unlike both models, we design our model so
that it incorporates pivot-based learning with pre-
training on massive out of (source and target) do-
main corpora. We next discuss this pre-training
process, which is also known as training models
for contextualized word embeddings.

Contextualized Word Embedding Models
Contextualized word embedding (CWE) models
are trained on massive corpora (Peters et al., 2018;
Radford et al., 2019). They typically utilize a lan-
guage modeling objective or a closely related
variant (Peters et al., 2018; Ziser and Reichart,
2018; Devlin et al., 2019; Yang et al., 2019), al-
though in some recent papers the model is trained
on a mixture of basic NLP tasks (Zhang et al.,
2019; Rotman and Reichart, 2019). The contribution

of such models to the state-of-the-art in a variety
of NLP tasks is already well-established.

CWE models typically follow three steps: (1)
Pre-training: Where a DNN (referred to as the
encoder of the model) is first trained on massive
unlabeled corpora which represent a broad domain
(such as English Wikipedia); (2) Fine-tuning: An
optional step, where the encoder is refined on un-
labeled text of interest. As noted above, Lee et
al. (2020) and Han and Eisenstein (2019) tuned
BERT on unlabeled target domain data to facilitate
domain adaptation; and (3) Supervised task train-
ing: Where task specific layers are trained on label-
ed data for a downstream task of interest.

PERL uses a pre-trained encoder, BERT in this
paper. BERT's architecture is based on multi-head
attention layers, trained with a two-component
objective: (a) MLM and (b) Is-next-sentence pre-
diction (NSP). For Step 2, PERL modifies only
the MLM objective and it can hence be imple-
mented within any CWE framework that uses this
objective (Liu et al., 2019; Lan et al., 2020; Yang
et al., 2019).

MLM is a modified language modeling objec-
tive, adjusted to self-attention models. When
building the pre-training task, all input tokens
have the same probability to be masked.2 After
the masking process, the model has to predict a
distribution over the vocabulary for each masked
token given the non-masked tokens. The input text
may have more than one masked token, but when
predicting one masked token, information from the
other masked tokens is not utilized.

In the next section we describe our PERL
domain adaptation model. The novel component
of this model is a pivot-based MLM objective,
optimized at the fine-tuning step (Step 2) of the
CWE pipeline, using source and target unlabeled
data.

3 Domain Adaptation with PERL

PERL uses pivot features in order to learn a
representation that bridges the gap between two
domains. Contrary to previous pivot-based DA
representation models, it exploits unlabeled data
from the source and target domains, and also from
massive out of source and target domain corpora.

2We use the huggingface BERT code (Wolf et al., 2019):
https://github.com/huggingface/transformers,
where the masking probability is 0.15.

PERL consists of three steps that correspond to
the three steps of CWE models, as described in §2:
(1) Pre-training (Figure 1a): in which it utilizes
a pre-trained CWE model (encoder, BERT in this
work) that was trained on massive corpora; (2)
Fine-tuning (Figure 1b): where it refines some of
the pre-trained encoder weights, based on a pivot-
based objective that is optimized on unlabeled
data from the source and target domains; and
(3) Supervised task training (Figure 1c): where
task specific layers are trained on source domain
labeled data for the downstream task of interest.

Our pivot selection method is identical to that
of Blitzer et al. (2007) and Ziser and Reichart
(2017, 2018). That is, the pivots are selected inde-
pendently of the above three steps protocol.

We further present a variant of PERL, denoted
with R-PERL, where the non-contextualized em-
bedding matrix of the BERT model trained at Step
(1) is used in order to regularize PERL during its
fine-tuning stage (Step 2). We elaborate on this
model towards the end of this section. We next
provide a detailed description.

Pivot Selection Being a pivot-based language
representation model, PERL is based on high
quality pivot extraction. Since the representation
learning is based on a masked language modeling
task, the feature set we address consists of the
unigrams and bigrams of the vocabulary. We base
the division of this feature set into pivots and non-
pivots on unlabeled data from the source and target
domains. Pivot features are: (a) Frequent in the
unlabeled data from the source and target domains;
and (b) Among those frequent features, pivot
features are the ones whose mutual information
with the task label according to source domain
labeled data crosses a pre-defined threshold.
Features that do not meet the above two criteria
form the non-pivot feature subset.
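A minimal sketch of this selection criterion is given below. It assumes scikit-learn for counting and mutual information; the variable names, thresholds, and helper function are illustrative rather than the authors' implementation (here the top-|P| candidates by mutual information are kept, as in the hyper-parameter setup of §4).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def ngram_counts(texts):
    vec = CountVectorizer(ngram_range=(1, 2))           # unigrams and bigrams
    counts = vec.fit_transform(texts).sum(axis=0).A1    # total frequency per feature
    return dict(zip(vec.get_feature_names_out(), counts))

def select_pivots(src_unlabeled, trg_unlabeled, src_texts, src_labels,
                  num_pivots=100, min_freq=20):
    src_freq = ngram_counts(src_unlabeled)
    trg_freq = ngram_counts(trg_unlabeled)
    # (a) frequency criterion: frequent in the unlabeled data of both domains
    candidates = [f for f, c in src_freq.items()
                  if c >= min_freq and trg_freq.get(f, 0) >= min_freq]
    # (b) among the candidates, rank by mutual information with the source labels
    vec = CountVectorizer(vocabulary=candidates, binary=True)
    X = vec.fit_transform(src_texts)
    mi = mutual_info_classif(X, src_labels, discrete_features=True)
    ranked = sorted(zip(candidates, mi), key=lambda p: -p[1])
    return [f for f, _ in ranked[:num_pivots]]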

PERL pre-training (Step 1, Figure 1a) In
order to inject prior language knowledge to our
model, we first initialize the PERL encoder with
a powerful pre-trained CWE model. As noted
above, our rationale is that the general language
knowledge encoded in these models, which is not
specific to the source or target domains, should
be useful for DA just as it has shown useful for
in-domain learning. In this work we use BERT,
although any other CWE model that employs

Figure 1: Illustrations of the three PERL steps. PRD and PLR stand for the BERT prediction head and pooler
head, respectively, FC is a fully connected layer, and msk stands for masked tokens embeddings (embeddings of
tokens that were masked). NSP and MLM are the next sentence prediction and masked language model objectives.
For the definitions of the PRD and PLR layers as well as the NSP objective, see Devlin et al. (2019). We mark
frozen layers (layers whose parameters are kept fixed) and non-frozen layers with snow-flake and fire symbols,
respectively. The token embedding and BERT layers values at the end of each step initialize the corresponding
layers of the next step model. The BERT box of the fine tuning step is described in more detail in Figure 2.

the MLM objective for pre-training (Step 1) and
fine-tuning (Step 2), could have been used.

PERL fine-tuning (Step 2, Figure 1b) This
step is the core novelty of PERL. Our goal is to
refine the initialized encoder on unlabeled data
from the source and the target domains, using the
distinction between pivot and non-pivot features.
For this aim we fine-tune the parameters of the
pre-trained BERT using its MLM objective, but
we choose the masked words so that the model
learns to map non-pivot to pivot features. Recall
that when building the MLM training task, each
training example consists of an input text in which
some of the words are masked, and the task of
the model is to predict the identity of each of the
masked words given the rest of the (non-masked)
input text. Whereas in standard MLM training
all input tokens have the same probability to be
masked, in the PERL fine-tuning step we change
both the masking probability and the prediction
task so that the desired non-pivot to pivot mapping
is learned. We next describe these two changes; see
also a detailed graphical illustration in Figure 2.


Figure 2: The PERL pivot-based fine-tuning task (Step 2).
In this example two tokens are masked, general and
good, only the latter is a pivot. The architecture is
identical to that of BERT but the MLM task and the
masking process are different, taking into account the
pivot/non-pivot distinction.

1. Prediction task. While in standard MLM
the task is to predict a token out of the
entire vocabulary, here we define a pivot-
based prediction task. Particularly, the model
should predict whether the masked token is
a pivot feature or not, and if it is then it has
to identify the pivot. That is, this is a multi-
class classification task where the number of
classes is equal to the number of pivots plus
1 (for the non-pivot prediction).

Put more formally, the modified pivot-based
MLM objective is:

p(y_i = j) = \frac{e^{f(h_i) \cdot W_j}}{\sum_{k=1}^{|P|} e^{f(h_i) \cdot W_k} + e^{f(h_i) \cdot W_{none}}}

where y_i is a masked unigram or bigram at position
i, P is the set of pivot features (token unigrams and
bigrams), h_i is the encoder representation for the i-
th token, W (the FC-Pivots layer of Figure 1b and
Figure 2) is the pivot predictor matrix that maps
from the latent space to the pivot set space (W_a is
the a-th row of W), and f is a non-linear function
composed of a dense layer, a gelu activation layer
and LayerNorm (the PRD layer of Figure 1b and
Figure 2).

2. Masking process. Instead of masking each
input token (unigram) with the same prob-
ability, we perform the following masking
process. For each input token (unigram) we
first check whether it forms a bigram pivot
together with the next token, and if so we
mask this bigram with a probability of α. If
the answer is negative, we check if the token
at hand is a unigram pivot and if so we again
mask it with a probability of α. Finally, if
the token is not a pivot we mask it with a
probability of β. Our hyper-parameter tuning
process revealed that the values of α = 0.5
and β = 0.1 provide strong results across
our various experimental setups (see more
on this in §6). This way PERL gives a higher
probability to pivot masking, and by doing
so the encoder parameters are fine-tuned so
that they can predict (mostly) pivot features
based (mostly) on non-pivot input (a code
sketch of both changes is given after this list).

Designing the fine-tuning task this way yields
two advantages. First, the model should shape
its parameters so that most of the information
about the input pivots is preserved, while most of
the information preserved about the non-pivots is
what is needed in order to predict the existence of
the pivots. This way the model keeps mostly the
information about unigrams and bigrams that are
shared among the two domains and are significant
for the supervised task, thus hopefully increasing
its cross-domain generalization capacity.

Second, standard MLM, which has recently
been used for fine-tuning in domain adapta-
tion (Lee et al., 2020; Han and Eisenstein, 2019),
performs a multi-class classification task with 30K
tokens,3 which requires ∼ 23M parameters as in
the FC1 layer of Figure 1. By focusing PERL on
pivot prediction, we can use only a factor of
(|P|+1)/30K of the FC layer parameters, as we
do in the FC-pivots layer (Figure 1, where |P| is
the number of pivots; in our experiments |P| ∈ [100, 500]).

Supervised task training (Step 3, Figure 1c)
To adjust PERL for a downstream task, we place a
classification network on top of its encoder. While
training on labeled data from the source domain
and testing on the target domain, each input text
is first represented by the encoder and is then fed
to the classification network. Because our focus
in this work is on the representation learning, the
classification network is kept simple, consisting
of one convolution layer followed by an average
pooling layer and a linear layer. When training
for the downstream task, the encoder weights are
frozen.
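A minimal sketch of such a task classifier is shown below (our illustration; the layer sizes are picked from the hyper-parameter grids in §4 and are not the authors' final choices). The input is the sequence of hidden states produced by the frozen PERL encoder.

import torch.nn as nn

class SentimentHead(nn.Module):
    def __init__(self, hidden=768, n_filters=32, filter_size=9, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(hidden, n_filters, kernel_size=filter_size,
                              padding=filter_size // 2)
        self.pool = nn.AdaptiveAvgPool1d(1)          # average pooling over the sequence
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, encoder_states):               # (batch, seq_len, hidden), from the frozen encoder
        x = self.conv(encoder_states.transpose(1, 2))    # -> (batch, n_filters, seq_len)
        x = self.pool(x).squeeze(-1)                     # -> (batch, n_filters)
        return self.out(x)

# The encoder weights stay frozen during task training, e.g.:
# for p in encoder.parameters(): p.requires_grad = False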

R-PERL A potential limitation of PERL is that
it ignores the semantics of its pivots. While the
negative pivots sad and unhappy encode similar
information with respect to the sentiment classi-
fication task, PERL considers them as two dif-
ferent output classes. To alleviate this, we propose
the regularized PERL (R-PERL) model where
pivot-similarity information is taken into account.
To achieve this we construct the FC-pivots
matrix of R-PERL (Figures 1b and 2) based on the
Token Embedding matrix learned by BERT in its
pre-training stage (Figure 1a). Particularly, we fix
the unigram pivot rows of the FC-pivots matrix
to the corresponding rows in BERT's Token
Embedding matrix, and the bigram pivot rows
to the mean of the Token Embedding rows that
correspond to the unigrams that form this bigram.
The FC-pivots matrix of R-PERL is kept fixed
during fine-tuning.
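A sketch of one way to build this fixed FC-pivots matrix from BERT's token embedding table is shown below. The treatment of pivots that split into several WordPieces and of the extra "none" row are our simplifying assumptions, and the variable names are illustrative.

import torch

def build_fc_pivots(token_emb, tok2id, pivots):
    """token_emb: (vocab, hidden) embedding matrix from pre-trained BERT (Step 1).
    pivots: list of unigram/bigram strings; returns a frozen (|P|+1, hidden) matrix."""
    rows = []
    for p in pivots:
        parts = p.split()                                  # 1 word for unigrams, 2 for bigrams
        vecs = torch.stack([token_emb[tok2id[w]] for w in parts])
        rows.append(vecs.mean(dim=0))                      # unigram: its own row; bigram: mean of the two
    rows.append(torch.zeros(token_emb.size(1)))            # row for the extra "none" class (assumption)
    W = torch.stack(rows)
    W.requires_grad_(False)                                # kept fixed during fine-tuning
    return W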

Our assumptions are that: (1) Pivots with similar
meaning, such as sad and unhappy, have similar
representations in the Token Embedding matrix

3The BERT implementation we use keeps a fixed 30K
word vocabulary, derived from its pre-training process.


learned at the pre-training stage (Step 1); and
(2) There is a positive correlation between the
appearance of such pivots (i.e., they tend to
appear, or not appear, together; see Ziser and
Reichart [2017] for similar considerations). In its
fine-tuning step, R-PERL is hence biased to learn
similar representations to such pivots in order
to capture the positive correlation between them.
This follows from the fact that pivot probability
is computed by taking the dot product of its
representation with its corresponding row in the
FC-pivots matrix.

4 Experiments

Tasks and Domains Following a large body of
prior DA work, we focus on the task of binary
sentiment classification. For compatibility with
previous literature, we first experiment with the
four legacy product review domains of Blitzer
et al. (2007): Books (B), DVDs (D), Electronic
items (E), and Kitchen appliances (K), with a
total of 12 cross-domain setups. Each domain has
2,000 labeled reviews, 1,000 positive and 1,000
negative, and unlabeled reviews as follows: B:
6,000, D: 34,741, E: 13,153 and K: 16,785.

We next experiment in a more challenging
setup, considering an airline review dataset (A)
(Nguyen, 2015; Ziser and Reichart, 2018). This
setup is challenging both due to the differences
between the product and service domains, and
because the prior probability of observing a
positive review at the A domain is much lower
than the same probability in the product domains.4
For the A domain, following Ziser and Reichart
(2018), we randomly sampled 1,000 positive and
1,000 negative reviews for our labeled set, and
39,396 reviews for our unlabeled set. Due to the
heavy computational demands of the experiments,
we arbitrarily chose 3 product to airline and 3
airline to product setups.

We further consider an additional modern
domain: IMDb (I) (Maas et al., 2011),5 which
is commonly used in recent sentiment analysis
work. This dataset consists of 50,000 movie
reviews from IMDb (25,000 positive and 25,000
negative), where there is a limitation on the
number of reviews per movie. We randomly
sampled 2,000 labeled reviews, 1,000 positive
and 1,000 negative, for our labeled set, and the
remaining 48,000 reviews form our unlabeled set.6
As above, we arbitrarily chose 2 IMDb to product
and 2 product to IMDb setups for our experiments.

4This analysis, performed by Ziser and Reichart (2018),
is based on the gold labels of the unlabeled data.
5The details of the IMDb dataset are available at:
http://www.andrew-maas.net/data/sentiment.

Pivot-based representation learning has shown
instrumental for DA. We hypothesize that it can
also be beneficial for in-domain tasks, as it focuses
the representation on the information encoded in
prominent unigrams and bigrams. To test this
hypothesis we experiment in an in-domain setup,
with the IMDb movie review dataset. We follow
the same experimental setup as in the domain
adaptation case, except that only IMDb unlabeled
data is used for fine-tuning, and the frequency
criterion in pivot selection is defined with respect
to this dataset.

We randomly sampled 25,000 training and
25,000 test examples, keeping the two sets
balanced, and additional 50,000 reviews formed
an unlabeled balanced set.7 We consider 6 setups,
differing in their training set size: 100, 500, 1K,
2K, 10K, and 20K randomly sampled examples.

Baselines We compare our PERL and R-PERL
models to the following baselines: (a+b) PBLM-
CNN and PBLM-LSTM (Ziser and Reichart,
2018), differing only in their classification layer
(CNN vs. LSTM);8 (c) HATN (Li et al., 2018);9
(d) BERT; and (e) Fine-tuned BERT (following
Lee et al., 2020 and Han and Eisenstein, 2019):
This model is identical to PERL, except that
the fine-tuning stage is performed with a
standard MLM instead of our pivot-based
MLM. BERT, Fine-tuned BERT, PBLM-CNN,
PERL, and R-PERL all use the same CNN-based
sentiment classifier, while HATN jointly learns
the feature representation and performs sentiment
classification.

Cross-validation We use a five-fold cross-
validation protocol, where in every fold 80% of
the source domain examples are randomly selected
for training data, and 20% for development data
(both sets are kept balanced). For each model we
report the average results across the five folds. In
each fold we tune the hyper-parameters so that
6We make sure that all reviews of the same movie appear
either in the training set or in the test set.
7These reviews are also part of the IMDb dataset.
8https://github.com/yftah89/PBLM-Domain-Adaptation.
9https://github.com/hsqmlzno1/HATN.


to minimize the cross-entropy development data
loss.

Hyper-parameter Tuning For all models we
use the WordPiece word embeddings (Wu et al.,
2016) with a vocabulary size of 30k, and the same
optimizer (with the same hyper-parameters) as in
their original paper. For all pivot-based methods
we consider the unigrams and bigrams that appear
at least 20 times both in the unlabeled data of
the source domain and in the unlabeled data of
the target domain as candidates for pivots,10 and
from these we select the |P| candidates with the
highest mutual information with the task source
domain label (|P| = {100, 200, . . . , 500}). The
exception is HATN that automatically selects its
pivots, which are limited to unigrams.

We next describe the hyper-parameters of each
of the models. Due to our extensive experiments
(22 DA and 6 in-domain setups, 5-fold cross-
validation), we limit our search space, especially
for the heavier components of the models.

R-PERL, PERL, BERT and Fine-tuned BERT
For the encoder, we use the BERT-base uncased
architecture with the same hyper-parameters as in
Devlin et al. (2019), tuning for PERL, R-PERL
and Fine-tuned BERT the number of fine-tuning
epochs (out of: 20, 40, 60) and the number
of unfrozen BERT layers during the fine-tuning
process (1, 2, 3, 5, 8, 12). For PERL and R-
PERL we tune the number of pivots (100, 200,
300, 400, 500) as well as α and β (0.1, 0.3, 0.5,
0.8). The supervised task classifier is a basic CNN
architecture, which enables us to search over the
number of filters (out of: 16, 32, 64), the filter size
(7, 9, 11) and the training batch size (32, 64).

PBLM-LSTM and PBLM-CNN For PBLM we
tune the input word embedding size (32, 64, 128,
256), the number of pivots (100, 200, 300, 400,
500), and the hidden dimension (128, 256, 512).
For the LSTM classification layer of PBLM-
LSTM we consider the same hidden dimension
and input word embedding size as for the PBLM
encoder. For the CNN classification layer of
PBLM-CNN, following Ziser and Reichart (2018)
we use 250 filters and a kernel size of 3. In each
setup we choose the PBLM model (PBLM-LSTM
or PBLM-CNN) that yields better test set accuracy
and report its result, under PBLM-Max.

10In the in-domain experiments we consider the IMDb
unlabeled data.

HATN The hyper-parameters of Li et al. (2018)
were tuned on a larger training set than ours, and
they hence yield sub-optimal performance in our
setup. We tune the training batch size (20, 50, 300),
the hidden layer size (20, 100, 300), and the word
embedding size (50, 100, 300).

5 Results

Overall results Table 1 presents domain adap-
tation results, and is divided into two panels. The
top panel reports results on the 12 setups derived
from the 4 legacy product review domains of
Blitzer et al. (2007) (denoted with P ⇔ P). The
bottom panel reports results for 10 setups invol-
ving product review domains and the IMDb movie
review domain (left side; denoted P ⇔ I) or
the airline review domain (right side; denoted
P ⇔ A). Table 2 presents in-domain results on
the IMDb domain, for various training set sizes.

Domain Adaptation As presented in Table 1,
PERL models are superior in 20 out of 22 DA
setups, with R-PERL performing best in 17 out of
22 setups. In the P ⇔ P setups, their averaged
performance (top table, All column) is 87.5%
and 86.9% (for R-PERL and PERL, respectively)
compared with 82.3% of HATN and 80.7% of
PBLM-Max. Importantly, in the more challenging
setups, the performance of one of these baselines
substantially degrades. Particularly, the averaged
R-PERL and PERL performance in the P ⇔ I
setups are 84.7% and 84.4%, respectively (bottom
panel, left All column), compared with 75.5%
of HATN and 69.0% of PBLM-Max. In the
P ⇔ A setups the averaged R-PERL and PERL
performances are 84.2% and 82.9%, respectively
(bottom panel, right All column), compared with
80.5% of PBLM-Max and only 71.8% of HATN.
The performance of BERT and Fine-tuned
BERT also degrades on the challenging setups:
From an average of 80.2% (BERT) and 84.1%
(Fine-tuned BERT) in P ⇔ P setups, to 74.2%
and 78.9%, respectively, in P ⇔ I setups, and
to 75.6% and 79.4%, respectively, in P ⇔ A
setups. R-PERL and PERL, in contrast, remain
stable across setups, with an averaged accuracy of
84.2–87.5% (R-PERL) and 82.9–86.8% (PERL).
The IMDb and airline domains differ from the
product domains in their topic (movies [IMDb]
and services [airline] vs. products). Moreover, the
unlabeled data from the airline domain contains

(top table: P ⇔ P setups)
                 D→K   D→B   E→D   B→D   B→E   B→K   E→B   E→K   D→E   K→D   K→E   K→B   ALL
BERT             80.6  81.0  76.8  82.5  78.8  82.0  78.2  85.1  78.5  77.7  84.7  76.5  80.2
Fine-tuned BERT  84.4  84.1  81.7  86.9  84.2  86.7  80.2  89.2  81.5  79.8  88.6  82.0  84.1
PBLM-Max         84.2  82.5  77.6  83.3  77.6  82.5  71.4  87.8  74.2  79.8  87.1  80.4  80.7
HATN             82.2  83.5  78.8  85.4  78.0  81.2  80.0  87.4  81.2  81.0  85.9  83.2  82.3
PERL             86.5  85.0  85.0  89.9  87.0  89.9  84.3  90.6  81.9  84.6  90.7  87.1  86.9
R-PERL           87.8  85.6  84.8  90.4  87.2  90.2  83.9  91.2  83.0  85.6  91.2  89.3  87.5

(bottom table, left: P ⇔ I setups)
                 I→E   I→K   E→I   K→I   ALL
BERT             75.4  78.8  72.2  70.6  74.2
Fine-tuned BERT  81.5  78.0  77.6  78.7  78.9
PBLM-Max         70.1  69.8  67.0  69.0  69.0
HATN             74.0  74.4  74.8  78.9  75.5
PERL             87.1  86.3  82.0  82.2  84.4
R-PERL           87.9  86.0  82.5  82.5  84.7

(bottom table, right: P ⇔ A setups)
                 A→B   A→K   A→E   B→A   K→A   E→A   ALL
BERT             70.9  78.8  77.1  72.1  74.0  81.0  75.6
Fine-tuned BERT  72.9  81.9  83.0  79.5  76.3  82.8  79.4
PBLM-Max         70.6  82.6  81.1  83.8  87.4  87.7  80.5
HATN             58.7  68.8  64.1  77.6  78.5  83.0  71.8
PERL             77.1  84.2  84.6  82.1  83.9  85.3  82.9
R-PERL           78.4  85.9  85.9  84.0  85.1  85.9  84.2

Table 1: Domain adaptation results. The top table is for the legacy product review domains of Blitzer et al. (2007)
(denoted as the P ⇔ P setups in the text). The bottom table involves selected legacy domains as well as the IMDb
movie review domain (left; denoted as P ⇔ I) or the airline review domain (right; denoted as P ⇔ A). The All
columns present averaged results across the setups to their left.


Num Sentences   BERT   Fine-tuned BERT   PERL   R-PERL
100             67.9   76.4              81.6   83.9
500             73.9   83.3              84.3   84.6
1K              75.3   83.9              84.6   84.9
2K              77.9   83.6              85.3   85.3
10K             80.9   86.9              87.1   87.5
20K             81.7   86.0              87.8   88.1

Table 2: In-domain results on the IMDb movie
review domain with increasing training set size.

an increased fraction of negative reviews (see §4).
Finally, the IMDb and airline reviews are also
more recent. The success of PERL in the P ⇔ I
and P ⇔ A setups is of particular importance,
as it indicates the potential of our algorithm to
adapt supervised NLP algorithms to domains that
substantially differ from their training domain.

Finally, our results clearly indicate the positive
impact of a pivot-aware approach when fine-
tuning BERT with unlabeled source and target
data. Indeed, the averaged gaps between Fine-
tuned BERT and BERT (3.9% for P ⇔ P, 4.7%
for P ⇔ I, and 3.8% for P ⇔ A) are much
smaller than the corresponding gaps between
R-PERL and BERT (7.3% for P ⇔ P, 10.5%
for P ⇔ I, and 8.6% for P ⇔ A).

In-domain Results In this setup both the labeled
and the unlabeled data, used for supervised
task training (labeled data, Step 3), fine-tuning
(unlabeled data, Step 2), and pivot selection (both
datasets) come from the same domain (IMDb). As
shown in Table 2, PERL outperforms BERT and
Fine-tuned BERT for all training set sizes.

As expected, the impact of (R-)PERL
diminishes as more labeled training data become
available: from 7.5% (R-PERL vs. Fine-tuned
BERT) when 100 sentences are available, to 2.1%
for 20K training sentences. To our knowledge,
the effectiveness of pivot-based methods for in-
domain learning has not been demonstrated in the
past.

6 Ablation Analysis and Discussion

In order to shed more light on PERL, we conduct
an ablation analysis. We start by uncovering the
hyper-parameters that have strong impact on its
performance, and analyzing its stability across
hyper-parameter configurations. We then explore

the impact of some of the design choices we made
when constructing the model.

Figure 3: The impact of the number of unfrozen PERL
layers during fine-tuning (Step 2).

In order to keep our analysis concise and to
avoid heavy computations, we have to consider
only a handful of arbitrarily chosen DA setups
for each analysis. We follow the five-fold cross-
validation protocol of §4 for hyper-parameter
tuning, except that in some of the analyses a
hyper-parameter of interest is kept fixed.

6.1 Hyper-parameter Analysis

In this analysis we focus on one hyper-parameter
that is relevant only for methods that use massively
pre-trained encoders (the number of unfrozen
encoder layers during fine-tuning), as well as on
two hyper-parameters that impact the core of our
modified MLM objective (number of pivots and
the pivot and non-pivot masking probabilities).
We finally perform stability analysis across
hyper-parameter configurations.

Number of Unfrozen BERT Layers during
Fine-Tuning (Step 2, Figure 1b) In Figure 3
we compare PERL final sentiment classification
accuracy with six alternatives: 1, 2, 3, 5, 8,
or 12 unfrozen layers, going from the top to
the bottom layers. We consider 4 arbitrarily
chosen DA setups, where the number of unfrozen
layers is kept fixed during the five-fold cross-
validation process. The general trend is clear:
PERL performance improves as more layers are
unfrozen, and this improvement saturates at 8
unfrozen layers (for the K→A setup the saturation
is at 5 layers). The classification accuracy im-
provement (compared to 1 unfrozen layer) is of
4% or more in three of the setups (K→A is again
the exception with only ∼ 2% improvement).
Across the experiments of this paper, this hyper-
parameter has been the single most influential
hyper-parameter of the PERL, R-PERL and
Fine-tuned BERT models.

Figure 4: PERL sentiment classification accuracy
across four setups with a varying number of pivots.

Number of Pivots Following previous work
(e.g., Ziser and Reichart, 2018), our hyper-
parameter tuning process considers 100 to 500
pivots in steps of 100. We would next like to
explore the impact of this hyper-parameter on
PERL performance. Figure 4 presents our results,
for four arbitrarily selected setups. In 3 of 4
setups PERL performance is stable across pivot
numbers. In 2 setups, 100 is the optimal number
of pivots (for the A → B setup with a large
gap), and in the 2 other setups it lags behind
the best value by no more than 0.2%. These
two characteristics—model stability across pivot
numbers and somewhat better performance when
using fewer pivots—were observed across our
experiments with PERL and R-PERL.

Pivot and Non-Pivot Masking Probabilities
We next study the impact of the pivot and non-
pivot masking probabilities, used during PERL
fine-tuning (α and β, respectively, see §3). For
both α and β we consider the values of 0.1, 0.3,
0.5, and 0.8. Figure 5 presents heat maps that
summarize our results. A first observation is the
relative stability of PERL to the values of these
hyper-parameters: The gaps between the best and
worst performing configurations are 2.6% (E →
D), 1.2% (B → E), 3.1% (K → D), and 5.0% (A →
B). A second observation is that extreme α values
(0.1 and 0.8) tend to harm the model. Finally, in 3
of 4 cases the best model performance is achieved
with α = 0.5 and β = 0.1.

Figure 5: Heat maps of PERL performance with
different pivot (α) and non-pivot (β) masking pro-
babilities. A darker color corresponds to a higher
sentiment classification accuracy.

Stability Analysis We finally turn to analyze the
stability of the PERL models compared with the
baselines. Previous work on PBLM and HATN
has demonstrated their instability across model
configurations (see Ziser and Reichart [2019]
for PBLM and Cui et al. [2019] for HATN).
As noted in Ziser and Reichart (2019), cross-
configuration stability is of particular importance
in unsupervised domain adaptation as the hyper-
parameter configuration is selected using un-
labeled data from the source, rather than the target
domain.

In this analysis a hyper-parameter value is not
considered for a model if it is not included in the
best hyper-parameter configuration of that model
for at least one DA setup. Hence, for PERL we
fix the number of unfrozen layers (8), the number
of pivots (100), and set (α, β) = (0.5, 0.1), and
for PBLM we consider only word embedding sizes
of 128 and 256. Other than that, we consider
all possible hyper-parameter configurations of
all models (§4, 54 configurations for PERL, R-
PERL and Fine-tuned BERT, 18 for BERT, 30 for
PBLM and 27 for HATN). Table 3 presents the
minimum (min), maximum (max), average (avg),
and standard deviation (std) of the test set scores
across the hyper-parameter configurations of each
model, for 4 arbitrarily selected setups.

In all 4 setups, PERL and R-PERL consistently
achieve higher avg, max, and min values and lower
std values compared to the other models (with the
exception of PBLM achieving a higher max for
K → A). Moreover, the std values of PBLM
and especially HATN are substantially higher
than those of the models that use BERT. Yet,
PERL and R-PERL demonstrate lower std values
compared to BERT and Fine-tuned BERT in 3 of
4 setups, indicating that our method contributes to
stability beyond the documented contribution of
BERT itself (Hao et al., 2019).

6.2 Design Choice Analysis

Impact of Pivot Selection One design choice
that impacts our results is the method through
which pivots are selected. We next compare three
alternatives to our pivot selection method, keeping
all other aspects of PERL fixed. As above, we
arbitrarily select four setups.

We consider the following pivot selection
methods: (a) Random-Frequent: Pivots randomly
selected from the unigrams and bigrams
that appear at least 80 times in the unlabeled
data of each of the domains; (b) High-MI, No
Target: We select the pivots that have the highest
mutual information (MI) with the source domain
label, but appear less than 10 times in the target
domain unlabeled data; (c) Oracle (Miller, 2019):
Here the pivots are selected according to our
method, but the labeled data used for pivot-label
MI computation is the target domain test data
rather than the source domain training data. This
is an upper bound on the performance of our
method since it uses target domain labeled data,
which is not available to us. For all methods we
select 100 pivots (see above).

Table 5 presents the results of the four PERL
variants, and compares them to BERT and Fine-
tuned BERT. We observe four patterns in the
results. First, PERL with our pivot selection
method, which emphasizes both high MI with
the task label and high frequency in both the
source and target domains, is the best performing
model. Second, PERL with Random-Frequent
pivot selection is substantially outperformed by
PERL, but it still performs better than BERT (in 3
of 4 setups), probably because BERT is not tuned
on unlabeled data from the participating domains.
However, PERL with Random-Frequent pivots is

E → D              avg   max   min   std
R-PERL             84.6  85.8  83.1  0.7
PERL               85.2  86.0  84.4  0.4
Fine-tuned BERT    81.3  83.2  79.0  1.2
BERT               75.0  76.8  70.6  1.8
PBLM               71.7  79.3  65.9  3.4
HATN               73.7  81.1  53.9  10.7

B → K              avg   max   min   std
R-PERL             89.5  90.5  88.8  0.5
PERL               89.4  90.2  88.8  0.3
Fine-tuned BERT    86.9  87.7  84.9  0.8
BERT               81.1  82.5  78.6  1.1
PBLM               78.6  84.1  71.3  3.3
HATN               76.8  82.8  59.5  7.7

A → B              avg   max   min   std
R-PERL             75.3  79.0  72.0  1.7
PERL               73.9  77.1  70.9  1.7
Fine-tuned BERT    72.1  74.2  68.2  1.7
BERT               69.9  73.0  66.9  1.8
PBLM               64.2  71.6  60.9  2.7
HATN               57.6  65.0  53.7  3.5

K → A              avg   max   min   std
R-PERL             85.3  86.4  84.6  0.5
PERL               83.8  84.9  81.5  0.9
Fine-tuned BERT    77.8  82.1  67.1  4.2
BERT               70.4  74.0  65.1  2.6
PBLM               76.1  86.1  66.2  6.8
HATN               72.1  79.2  53.9  9.9

Table 3: Stability analysis.

outperformed by the Fine-tuned BERT in all
setups, indicating that it provides a sub-optimal
way of exploiting source and target unlabeled data.
Third, in 3 of 4 setups, PERL with the High-MI,
No Target pivots is outperformed by the baseline
BERT model. This is a clear indication of the
sub-optimality of this pivot selection method that
yields a model that is inferior even to a model
that was not tuned on source and target domain
data. Finally, although, unsurprisingly, PERL with
oracle pivots outperforms the standard PERL, the
gap is smaller than 2% in all four cases. Our results
clearly demonstrate the strong positive impact of
our pivot selection method on the performance of
PERL.

                  B → E                                              A → K
                  5 layers  8 layers  10 layers  12 layers (full)    5 layers  8 layers  10 layers  12 layers (full)
BERT              70.9      75.9      80.6       78.8                71.2      74.9      81.2       78.8
Fine-tuned BERT   74.6      76.5      84.2       84.2                74.0      76.3      80.8       81.9
PERL (Ours)       81.1      83.2      88.2       87.0                77.7      80.2      84.7       84.2

Table 4: Classification accuracy with reduced-size encoders.

                      B → E   K → D   E → K   D → B
BERT                  78.8    77.7    85.1    81.0
Fine-tuned BERT       84.2    79.8    89.2    84.1
High-MI, No Target    76.2    76.4    84.9    83.7
Random-Frequent       79.7    76.8    85.5    81.7
PERL (Ours)           87.0    84.6    90.6    85.0
Oracle                88.9    85.6    91.5    86.7

Table 5: Impact of PERL's pivot selection method.

                        B → E   K → D   A → B   I → E
No fine-tuning
  BERT                  78.8    77.7    70.9    75.4
Source data only
  Fine-tuned BERT       80.7    79.8    69.4    81.0
  PERL                  79.6    82.2    69.8    84.4
Target data only
  Fine-tuned BERT       82.0    80.9    71.6    81.1
  PERL                  86.9    83.0    71.8    84.2
Source and target data
  Fine-tuned BERT       84.2    79.8    72.9    81.5
  PERL                  87.0    84.6    77.1    87.1

Table 6: Impact of fine-tuning data selection.

Unlabeled Data Selection Another design
choice we consider is the impact of the type
of fine-tuning data. While we followed previous
work (e.g., Ziser and Reichart, 2018) and used
the unlabeled data from both the source and target
domains, it might be that data from only one of
the domains, particularly the target, is a better
choice. As above, we explore this question on
4 arbitrarily selected domain pairs. The results,
presented in Table 6, clearly indicate that our
choice to use unlabeled data from both domains
is optimal, particularly when transferring from a
non-product domain (A or I) to a product domain.

Reduced Size Encoder We finally explore the
effect of the fine-tuning step on the performance
of reduced-size models. By doing this we address
a major limitation of pre-trained encoders—their
size, which prevents them from running on small
computational devices and dictates long run times.

For this experiment we prune the top encoder
layers before its fine-tuning step, yielding three
new model sizes, with 5, 8, or 10 layers, compared
with the full 12 layers. This is done both for Fine-
tuned BERT and for PERL. We then tune the
number of the encoder's top unfrozen layers during
fine-tuning, as follows: 5-layer encoder (1, 2, 3);
8-layer encoder (1, 3, 4, 5); 10-layer encoder (1,
3, 5, 8); and full encoder (1, 2, 3, 5, 8, 12). For
comparison, we utilize the BERT model when
its top layers are pruned, and no fine-tuning is
performed. We focus on two arbitrarily selected
DA setups.
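A sketch of this kind of layer pruning with the huggingface transformers BertModel (whose transformer blocks are stored in a ModuleList) could look as follows; the exact pruning point and API details in the authors' code may differ.

from transformers import BertModel

def prune_top_layers(model: BertModel, keep: int) -> BertModel:
    # keep only the bottom `keep` transformer layers, dropping the top ones
    model.encoder.layer = model.encoder.layer[:keep]
    model.config.num_hidden_layers = keep
    return model

bert = prune_top_layers(BertModel.from_pretrained("bert-base-uncased"), keep=8)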

Table 4 presents accuracy results. In both setups
PERL with 10 layers is the best performing
model. Moreover, for each number of layers,
PERL outperforms the other two models, with
particularly substantial improvements for 5 and
8 layers (i.e., 7.3% and 6.7%, over BERT and
Fine-tuned BERT, respectively, for B → E and 8
layers).

Reduced-size PERL is of course much faster
than the full model. The averaged run-time of the
full (12 layers) PERL on our test-sets is 196.5
msec and 9.9 msec on CPU (skylake i9-7920X,
2.9 GHz, single thread) and GPU (GeForce GTX
1080 Ti), respectively. For 8 layers the numbers
drop to 132.4 msec (CPU) and 6.9 msec (GPU),
and for 5 layers to 84.0 (CPU) and 4.7 (GPU)
msec.

7 Conclusions

We presented PERL, a domain-adaptation model
that fine-tunes a massively pre-trained deep con-
textualized embedding encoder (BERT) with a
pivot-based MLM objective. PERL outperforms
strong baselines across 22 sentiment classification
DA setups, improves in-domain model perfor-
mance, increases its cross-configuration stability
and yields effective reduced-size models.

Our focus in this paper is on binary sentiment
classification, as was done in a large body of
previous DA work. In future work we would
like to extend PERL's reach to structured (e.g.,
dependency parsing and aspect-based sentiment
classification) and generation (e.g., abstractive
summarization and machine translation) NLP
tasks.

Acknowledgments

We would like to thank the action editor and the
reviewers, Yftah Ziser, as well as the members of
the IE@Technion NLP group for their valuable
feedback and advice. This research was partially
funded by an ISF personal grant no. 1625/18.

References

Shai Ben-David, John Blitzer, Koby Crammer,
Alex Kulesza, Fernando Pereira, and Jennifer
Wortman Vaughan. 2010. A theory of learning
from different domains. Machine Learning,
79(1-2):151–175.

John Blitzer, Mark Dredze, and Fernando Pereira.
2007. Biographies, Bollywood, boom-boxes
and blenders: Domain adaptation for sentiment
classification. In John A. Carroll, Antal van den
Bosch, and Annie Zaenen, editors, ACL 2007,
Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics,
June 23-30, 2007, Prague, Czech Republic. The
Association for Computational Linguistics.

John Blitzer, Ryan T. McDonald, and Fernando
Pereira. 2006. Domain adaptation with structu-
ral correspondence learning. In Dan Jurafsky
and Éric Gaussier, editors, EMNLP 2006,
Proceedings of the 2006 Conference on
Empirical Methods in Natural Language
Processing, 22-23 July 2006, Sydney, Australia,
pages 120–128. ACL.

Danushka Bollegala, Takanori Maehara, and
Ken-ichi Kawarabayashi. 2015. Unsupervised
cross-domain word representation learning. In
Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistics
and the 7th International Joint Conference on
Natural Language Processing of the Asian
Federation of Natural Language Processing,
ACL 2015, July 26-31, 2015, Beijing, China,
Volume 1: Long Papers, pages 730–740. The
Association for Computer Linguistics.

Minmin Chen, Kilian Q. Weinberger, and Yixin
Chen. 2011. Automatic feature decomposition
for single view co-training. In Lise Getoor
and Tobias Scheffer, editors, Proceedings of
the 28th International Conference on Machine
Learning, ICML 2011, Bellevue, Washington,
USA, June 28 – July 2, 2011, pages 953–960.
Omnipress.

Minmin Chen, Zhixiang Eddie Xu, Kilian Q.
Weinberger, and Fei Sha. 2012. Marginalized
denoising autoencoders for domain adaptation.
In Proceedings of the 29th International
Conference on Machine Learning, ICML 2012,
Edinburgh, Scotland, UK, June 26 – July 1,
2012. icml.cc / Omnipress.

Stéphane Clinchant, Gabriela Csurka, and Boris
Chidlovskii. 2016. A domain adaptation regu-
larization for denoising autoencoders. In Pro-
ceedings of the 54th Annual Meeting of the
Association for Computational Linguistics, ACL
2016, August 7-12, 2016, Berlin, Germany,
Volume 2: Short Papers. The Association for
Computer Linguistics.

Wanyun Cui, Guangyu Zheng, Zhiqiang Shen,
Sihang Jiang, and Wei Wang. 2019. Transfer
learning for sequences via learning to collocate.
In 7th International Conference on Learning
Representations, ICLR 2019, New Orleans,
这, 美国, 可能 6-9, 2019. OpenReview.net.

domains with synchronous neural
语言
型号. In Proc. of the xLite Workshop on
Cross-Lingual Technologies, NIPS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, 和
Kristina Toutanova. 2019. BERT: pre-training
of deep bidirectional transformers for language
理解. In Jill Burstein, Christy Doran,
and Thamar Solorio, 编辑, 会议记录
这 2019 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术,
NAACL-HLT 2019, 明尼阿波利斯, 明尼苏达州, 美国,
六月 2-7, 2019, 体积 1 (Long and Short
文件), pages 4171–4186. 协会
计算语言学.

Timothy Dozat and Christopher D. 曼宁.
2017. Deep biaffine attention for neural depen-
dency parsing. In 5th International Conference
ICLR 2017,
on Learning Representations,
Toulon, 法国, 四月 24-26, 2017, 会议
Track Proceedings. OpenReview.net.

Sergey Edunov, Myle Ott, Michael Auli, 和
David Grangier. 2018. Understanding back-
translation at scale. In Ellen Riloff, 大卫
蒋, Julia Hockenmaier, and Jun’ichi Tsujii,
编辑, 诉讼程序 2018 会议
on Empirical Methods in Natural Language
加工, 布鲁塞尔, 比利时, 十月 31 –
十一月 4, 2018, pages 489–500. 协会
for Computational Linguistics.

Yaroslav Ganin, Evgeniya Ustinova, Hana
Ajakan, Pascal Germain, Hugo Larochelle,
Franc¸ois Laviolette, Mario Marchand, 和
Victor S. Lempitsky. 2016. Domain-adversarial
training of neural networks. J. Mach. 学习.
Res., 17:59:1–59:35.

Xavier Glorot, Antoine Bordes, and Yoshua
本吉奥. 2011. Domain adaptation for large-
scale sentiment classification: A deep learning
方法. In Lise Getoor and Tobias Scheffer,
编辑, Proceedings of the 28th International
Conference on Machine Learning, ICML 2011,
Bellevue, 华盛顿, 美国, 六月 28 – 七月 2,
2011, pages 513–520. Omnipress.

Stephan Gouws, Gert-Jan Van Rooyen, 和
Yoshua Bengio. 2012. Learning structural
语言学的
correspondences across different

Xiaochuang Han and Jacob Eisenstein. 2019.
Unsupervised domain adaptation of contextual-
ized embeddings: A case study in early modern
english. CoRR, abs/1904.02817.

Yaru Hao, Li Dong, Furu Wei, and Ke Xu.
2019. Visualizing and understanding the
effectiveness of BERT.
In Kentaro Inui,
Jing Jiang, Vincent Ng, and Xiaojun Wan,
编辑, 诉讼程序 2019 会议
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing,
EMNLP-IJCNLP 2019, 香港, 中国,
十一月 3-7, 2019, pages 4141–4150.
计算语言学协会.

Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. 2006. Correcting sample selection bias by unlabeled data. In Bernhard Schölkopf, John C. Platt, and Thomas Hofmann, editors, Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 601–608. MIT Press.

Hal Daumé III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. J. Artif. Intell. Res., 26:101–126.

Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In John A. Carroll, Antal van den Bosch, and Annie Zaenen, editors, ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic. The Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020,

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
3
2
8
1
9
2
3
7
3
2

/

/
t

A
C
_
A
_
0
0
3
2
8
p
d

.

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Addis Ababa, Ethiopia, April 26-30, 2020.
OpenReview.net.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Zheng Li, Ying Wei, Yu Zhang, and Qiang Yang. 2018. Hierarchical attention transfer network for cross-domain sentiment classification. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5852–5859. AAAI Press.

Zheng Li, Yu Zhang, Ying Wei, Yuxiang Wu, and Qiang Yang. 2017. End-to-end adversarial memory network for cross-domain sentiment classification. In Carles Sierra, editor, Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 2237–2243. ijcai.org.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard S. Zemel. 2016. The variational fair autoencoder. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Dekang Lin, Yuji Matsumoto, and Rada Mihalcea, editors, The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 142–150. The Association for Computer Linguistics.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. 2008. Domain adaptation with multiple sources. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 1041–1048. Curran Associates, Inc.

David McClosky, Eugene Charniak, and Mark Johnson. 2010. Automatic domain adaptation for parsing. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA, pages 28–36. The Association for Computational Linguistics.

Timothy A. Miller. 2019. Simplified neural unsupervised domain adaptation. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 414–419. Association for Computational Linguistics.

Quang Nguyen. 2015. The airline review dataset.

Thien Huu Nguyen, Barbara Plank, and Ralph Grishman. 2015. Semantic representations for domain adaptation: A case study on the tree kernel-based method for relation extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 635–644. The Association for Computer Linguistics.

Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2010. Cross-domain sentiment classification via spectral feature alignment. In Michael Rappa, Paul Jones, Juliana Freire, and Soumen Chakrabarti, editors, Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pages 751–760. ACM.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Barbara Plank and Alessandro Moschitti. 2013. Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, pages 1498–1507. The Association for Computer Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understandingpaper.pdf.

Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI Blog, 1(8).

Brian Roark and Michiel Bacchiani. 2003. Supervised and unsupervised PCFG adaptation to novel domains. In Marti A. Hearst and Mari Ostendorf, editors, Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, Canada, May 27 – June 1, 2003. The Association for Computational Linguistics.

Guy Rotman and Roi Reichart. 2019. Deep contextualized self-training for low resource dependency parsing. Transactions of the Association for Computational Linguistics, 7:695–713.

Alexander M. Rush, Roi Reichart, Michael Collins, and Amir Globerson. 2012. Improved parsing and POS tagging using inter-sentence consistency constraints. In Jun'ichi Tsujii, James Henderson, and Marius Pasca, editors, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, July 12-14, 2012, Jeju Island, Korea, pages 1434–1444. ACL.

Tobias Schnabel and Hinrich Schütze. 2014.
FLORS: fast and simple domain adaptation
for part-of-speech tagging. Transactions of
the Association for Computational Linguistics,
2:15–26.

Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. 2007. Direct importance estimation with model selection and its application to covariate shift adaptation. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 1433–1440. Curran Associates, Inc.

Manshu Tu and Bing Wang. 2019. Adding prior knowledge in hierarchical attention neural network for cross domain sentiment classification. IEEE Access, 7:32578–32588.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International Conference Proceeding Series, pages 1096–1103. ACM.

Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Rémi Louf,
Morgan Funtowicz, and Jamie Brew. 2019.
Huggingface’s transformers: State-of-the-art
natural language processing. CoRR, abs/1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin
Gao, Klaus Macherey, Jeff Klingner, Apurva
Shah, Melvin Johnson, Xiaobing Liu, Lukasz
Kaiser, Stephan Gouws, Yoshikiyo Kato,
Taku Kudo, Hideto Kazawa, Keith Stevens,
George Kurian, Nishant Patil, Wei Wang,
Cliff Young, Jason Smith, Jason Riesa, Alex
Rudnick, Oriol Vinyals, Greg Corrado, Macduff
Hughes, and Jeffrey Dean. 2016. Google's
neural machine translation system: Bridging the
gap between human and machine translation.
CoRR, abs/1609.08144.

Gui-Rong Xue, Wenyuan Dai, Qiang Yang, and Yong Yu. 2008. Topic-bridged PLSA for cross-domain text classification. In Sung-Hyon Myaeng, Douglas W. Oard, Fabrizio Sebastiani, Tat-Seng Chua, and Mun-Kew Leong, editors, Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20-24, 2008, pages 627–634. ACM.

Yi Yang and Jacob Eisenstein. 2014. Fast easy unsupervised domain adaptation with marginalized structured dropout. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 538–544. The Association for Computer Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 5754–5764.

Jianfei Yu and Jing Jiang. 2016. Learning sentence embeddings with auxiliary tasks for cross-domain sentiment classification. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 236–246. The Association for Computational Linguistics.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 1441–1451. Association for Computational Linguistics.

Yftah Ziser and Roi Reichart. 2017. Neural structural correspondence learning for domain adaptation. In Roger Levy and Lucia Specia, editors, Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pages 400–410. Association for Computational Linguistics.

Yftah Ziser and Roi Reichart. 2018. Pivot based language modeling for improved neural domain adaptation. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1241–1251. Association for Computational Linguistics.

Yftah Ziser and Roi Reichart. 2019. Task refinement learning for improved accuracy and stability of unsupervised domain adaptation. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 5895–5906. Association for Computational Linguistics.
