PERL: Pivot-based Domain Adaptation for Pre-trained Deep
Contextualized Embedding Models
Eyal Ben-David∗
Carmel Rabinovitz∗
Roi Reichart
Technion, Israel Institute of Technology
{eyalbd12@campus.|carmelrab@campus.|roiri@}technion.ac.il
Abstract
Pivot-based neural representation models have led to significant
progress in domain adaptation for NLP. However, previous research
following this approach utilizes only labeled data from the source
domain and unlabeled data from the source and target domains, but
neglects to incorporate massive unlabeled corpora that are not
necessarily drawn from these domains. To alleviate this, we propose
PERL: A representation learning model that extends contextualized
word embedding models such as BERT (Devlin et al., 2019) with
pivot-based fine-tuning. PERL outperforms strong baselines across 22
sentiment classification domain adaptation setups, improves in-domain
model performance, yields effective reduced-size models, and increases
model stability.1
1 Introduction
Natural Language Processing (NLP) algorithms are constantly improving,
gradually approaching human-level performance (Dozat and Manning,
2017; Edunov et al., 2018; Radford et al., 2018). However, those
algorithms often depend on the availability of large amounts of
manually annotated data from the domain in which the task is
performed. Unfortunately, collecting such annotated data is often
costly and laborious, which substantially limits the applicability of
NLP technology.
Domain Adaptation (DA), training an algorithm on annotated data from
a source domain so that it can be effectively applied to other target
domains, is one of the ways to solve the above bottleneck.
∗Both authors contributed equally to this work.
1Our code is at https://github.com/eyalbd2/PERL.
Indeed, over the years substantial efforts have been devoted to the DA
challenge (Roark and Bacchiani, 2003; Daumé III and Marcu, 2006;
Ben-David et al., 2010; Jiang and Zhai, 2007; McClosky et al., 2010;
Rush et al., 2012; Schnabel and Schütze, 2014). Our focus in this
paper is on unsupervised DA, the setup we consider most realistic. In
this setup labeled data is available only from the source domain and
unlabeled data is available from both the source and the target
domains.
While various approaches for DA have been proposed (§2), with the
prominence of deep neural network (DNN) modeling, attention has
recently focused on representation learning approaches. Within
representation learning for unsupervised DA, two approaches have been
shown to be particularly useful. In one line of work, DNN-based
methods that use compress-based noise reduction to learn cross-domain
features have been developed (Glorot et al., 2011; Chen et al., 2012).
In another line of work, methods based on the distinction between
pivot and non-pivot features (Blitzer et al., 2006, 2007) learn a
joint feature representation for the source and the target domains.
Later on, Ziser and Reichart (2017, 2018), and Li et al. (2018)
married the two approaches and achieved substantial improvements on a
variety of DA setups.
Despite their success, pivot-based DNN models still only utilize
labeled data from the source domain and unlabeled data from both the
source and the target domains, but neglect to incorporate massive
unlabeled corpora that are not necessarily drawn from these domains.
With the recent game-changing success of contextualized word embedding
models trained on such massive corpora (Devlin et al., 2019; Peters et
al., 2018), it is natural to ask whether information from such corpora
can enhance these DA methods, particularly given that background
knowledge from non-contextualized embeddings has proven useful for
Transactions of the Association for Computational Linguistics, vol. 8, pp. 504–521, 2020. https://doi.org/10.1162/tacl_a_00328
Action Editor: Jimmy Lin. Submission batch: 4/2020; Revision batch: 2020; Published 8/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
DA (Plank and Moschitti, 2013; Nguyen et al., 2015).
In this paper we hence propose an unsupervised DA approach that
extends leading approaches based on DNNs and pivot-based ideas, so
that they can incorporate information encoded in massive corpora (§3).
Our model, named PERL: Pivot-based Encoder Representation of Language,
builds on massively pre-trained contextualized word embedding models
such as BERT (Devlin et al., 2019). To adjust the representations
learned by these models so that they close the gap between the source
and target domains, we fine-tune their parameters using a pivot-based
variant of the Masked Language Modeling (MLM) objective, optimized on
unlabeled data from both the source and the target domains. We further
present R-PERL (regularized PERL), which facilitates parameter sharing
for pivots with similar meaning.
We perform extensive experimentation in various unsupervised DA setups
of the task of binary sentiment classification (§4, 5). First, for
compatibility with previous work, we experiment with the legacy
product review domains of Blitzer et al. (2007) (12 setups). We then
experiment with more challenging setups, adapting between the above
domains and the airline review domain (Nguyen, 2015) used in Ziser and
Reichart (2018) (4 setups), as well as the IMDb movie review domain
(Maas et al., 2011) (6 setups). We compare PERL to the best performing
pivot-based methods (Ziser and Reichart, 2018; Li et al., 2018) and to
DA approaches that fine-tune a massively pre-trained BERT model by
optimizing its standard MLM objective using target-domain unlabeled
data (Lee et al., 2020; Han and Eisenstein, 2019). PERL and R-PERL
substantially outperform these baselines, emphasizing the additive
effect of massive pre-training and pivot-based fine-tuning.
As an additional contribution, we show that pivot-based learning is
effective beyond improving domain adaptation accuracy. In particular,
we show that an in-domain variant of PERL substantially improves the
in-domain performance of a BERT-based sentiment classifier, for
varying training set sizes (from 100 to 20K labeled examples). We also
show that PERL facilitates the generation of effective reduced-size DA
models. Finally, we perform an extensive ablation study (§6) that
uncovers PERL’s crucial design choices and demonstrates the stability
of PERL to hyper-parameter selection compared to other DA methods.
2 Background and Previous Work
There are several approaches to DA, including instance re-weighting
(Sugiyama et al., 2007; Huang et al., 2006; Mansour et al., 2008),
sub-sampling from the participating domains (Chen et al., 2011), and
DA through representation learning, where a joint representation is
learned based on texts from the source and target domains (Blitzer et
al., 2007; Xue et al., 2008; Ziser and Reichart, 2017, 2018). We first
describe the unsupervised DA pipeline, continue with representation
learning methods for DA with a focus on pivot-based methods, and,
finally, describe contextualized embedding models.
Unsupervised Domain Adaptation through Representation Learning As
noted in §1, our focus in this work is on unsupervised DA through
representation learning. A common pipeline for this setup consists of
two steps: (A) Learning a representation model (often referred to as
the encoder) using the source and target unlabeled data; and (B)
Training a supervised classifier on the source domain labeled data. To
facilitate domain adaptation, every text fed to the classifier in the
second step is first represented by the pre-trained encoder. This is
performed both when the classifier is trained in the source domain and
when it is applied to new text from the target domain.
Exceptions to this pipeline are end-to-end models that jointly learn
to perform the cross-domain text representation and the classification
task. This is achieved by training a unified objective on the source
domain labeled data and the unlabeled data from both the source and
the target. Among these models are domain adversarial networks (Ganin
et al., 2016), which were strongly outperformed by Ziser and Reichart
(2018), to which we compare our methods, and the hierarchical
attention transfer network (HATN; Li et al., 2018), which is one of
our baselines (see below).
Unsupervised DA through representation learning has followed two main
avenues. The first avenue consists of works that aim to explicitly
build a feature representation that bridges the gap between the
domains. A seminal framework in this line is structural correspondence
learning (SCL; Blitzer et al., 2006, 2007), which splits the feature
space into pivot and non-pivot features. A large
number of works have followed this idea (e.g., Pan et al., 2010; Gouws
et al., 2012; Bollegala et al., 2015; Yu and Jiang, 2016; Li et al.,
2017, 2018; Tu and Wang, 2019; Ziser and Reichart, 2017, 2018) and we
discuss it below.
Works in the second avenue learn cross-domain representations by
training autoencoders (AEs) on the unlabeled data from the source and
target domains. This way they hope to obtain a more robust
representation, which is hopefully better suited for DA. Examples of
such models include the stacked denoising AE (SDA; Vincent et al.,
2008; Glorot et al., 2011), the marginalized SDA and its variants
(MSDA; Chen et al., 2012; Yang and Eisenstein, 2014; Clinchant et al.,
2016), and variational AE based models (Louizos et al., 2016).
Recently, Ziser and Reichart (2017, 2018) and Li et al. (2018) married
these approaches and presented pivot-based approaches where the
representation model is based on DNN encoders (AE, long short-term
memory [LSTM], or hierarchical attention networks). Because their
methods outperformed the above models, we aim to extend them to models
that can also exploit massive out of (source and target) domain
corpora. We next elaborate on pivot-based approaches.
Pivot-based Domain Adaptation Proposed by Blitzer et al. (2006, 2007)
through their SCL framework, the main idea of pivot-based DA is to
divide the shared feature space of the source and the target domains
into two complementary subsets: one of pivots and one of non-pivots.
Pivot features are defined based on two criteria: (a) They are
frequent in the unlabeled data of both domains; and (b) They are
prominent for the classification task defined by the source domain
labeled data. Non-pivot features are those features that do not meet
at least one of the above criteria. While SCL is based on linear
models, there have been some very successful recent efforts to extend
this framework so that non-linear encoders (DNNs) are utilized. Here
we focus on the latter line of work, which produces much better
results, and do not elaborate on SCL any further.
Ziser and Reichart (2018) have presented the Pivot Based Language
Model (PBLM), which incorporates pre-training and pivot-based
learning. PBLM is a variant of an LSTM-based language model, but
instead of predicting at each point the most likely next input word,
it predicts the next input unigram or bigram if one of these is a
pivot (if both are, it predicts the bigram), and NONE otherwise. In
the unsupervised DA pipeline PBLM is trained on the source and target
unlabeled data. Then, when the task classifier is trained and applied
to the target domain, PBLM is used as a contextualized word embedding
layer. Notice that PBLM is not pre-trained on massive out of (source
and target) domain corpora, and its single-layer, unidirectional LSTM
architecture is probably not ideal for knowledge encoding from
such corpora.
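
The PBLM prediction target described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `pblm_targets`, the whitespace tokenization, and the representation of bigrams as space-joined strings are all assumptions made for illustration.

```python
def pblm_targets(tokens, pivots):
    """For each position t, return the label a PBLM-style model would predict.

    The label is the upcoming bigram if it is a pivot, otherwise the upcoming
    unigram if it is a pivot, otherwise the string "NONE".
    `pivots` is a set of unigram and space-joined bigram strings (assumed format).
    """
    targets = []
    for t in range(len(tokens) - 1):
        next_unigram = tokens[t + 1]
        # The upcoming bigram exists only if two more tokens follow position t.
        next_bigram = (
            f"{tokens[t + 1]} {tokens[t + 2]}" if t + 2 < len(tokens) else None
        )
        if next_bigram in pivots:      # bigram takes priority when both are pivots
            targets.append(next_bigram)
        elif next_unigram in pivots:
            targets.append(next_unigram)
        else:
            targets.append("NONE")
    return targets


example_tokens = "this book is very good".split()
example_pivots = {"good", "very good"}
example_targets = pblm_targets(example_tokens, example_pivots)
```

In this toy example the positions before "very" receive the bigram label once "very good" is the upcoming input, and the unigram label "good" is used only when no bigram pivot starts at the next position.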
Another work in this line is HATN (Li et al., 2018). This model
automatically learns the pivot/non-pivot distinction, rather than
following the SCL definition as Ziser and Reichart (2017, 2018) do.
HATN consists of two hierarchical attention networks, P-net and
NP-net. First, it trains the P-net on the source labeled data. Then,
it decodes the most prominent tokens of P-net (i.e., tokens that
received the highest attention values), and considers them as its
pivots. Finally, it simultaneously trains the P-net and the NP-net on
both the labeled and the unlabeled data, such that P-net is
adversarially trained to predict the domain of the input example
(Ganin et al., 2016) and NP-net is trained to predict its pivots, and
the hidden representations from both networks serve for the task label
(sentiment) prediction.
Both HATN and PBLM strongly outperform a large variety of previous DA
models on various cross-domain sentiment classification setups. Hence,
they are our major baselines in this work. Like PBLM, we use the same
definition of the pivot and non-pivot subsets as in Blitzer et al.
(2007). Like HATN, we also use an attention-based DNN. Unlike both
models, we design our model so that it incorporates pivot-based
learning with pre-training on massive out of (source and target)
domain corpora. We next discuss this pre-training process, which is
also known as training models for contextualized word embeddings.
Contextualized Word Embedding Models Contextualized word embedding
(CWE) models are trained on massive corpora (Peters et al., 2018;
Radford et al., 2019). They typically utilize a language modeling
objective or a closely related variant (Peters et al., 2018; Ziser and
Reichart, 2018; Devlin et al., 2019; Yang et al., 2019), although in
some recent papers the model is trained on a mixture of basic NLP
tasks (Zhang et al., 2019; Rotman and Reichart, 2019). The contribution
of such models to the state-of-the-art in a variety
of NLP tasks is already well-established.
CWE models typically follow three steps: (1) Pre-training: Where a DNN
(referred to as the encoder of the model) is first trained on massive
unlabeled corpora which represent a broad domain (such as English
Wikipedia); (2) Fine-tuning: An optional step, where the encoder is
refined on unlabeled text of interest. As noted above, Lee et al.
(2020) and Han and Eisenstein (2019) tuned BERT on unlabeled target
domain data to facilitate domain adaptation; and (3) Supervised task
training: Where task specific layers are trained on labeled data for a
downstream task of interest.
PERL uses a pre-trained encoder, BERT in this paper. BERT’s
architecture is based on multi-head attention layers, trained with a
two-component objective: (a) MLM and (b) Is-next-sentence prediction
(NSP). For Step 2, PERL modifies only the MLM objective and it can
hence be implemented within any CWE framework that uses this objective
(Liu et al., 2019; Lan et al., 2020; Yang et al., 2019).
MLM is a modified language modeling objective, adjusted to
self-attention models. When building the pre-training task, all input
tokens have the same probability to be masked.2 After the masking
process, the model has to predict a distribution over the vocabulary
for each masked token given the non-masked tokens. The input text may
have more than one masked token, and when predicting one masked token,
information from the other masked tokens is not utilized.
In the next section we describe our PERL domain adaptation model. The
novel component of this model is a pivot-based MLM objective,
optimized at the fine-tuning step (Step 2) of the CWE pipeline, using
source and target unlabeled data.
3 Domain adaptation with PERL
PERL uses pivot features in order to learn a representation that
bridges the gap between two domains. Contrary to previous pivot-based
DA representation models, it exploits unlabeled data from the source
and target domains, and also from massive out of source and target
domain corpora.
2We use the huggingface BERT code (Wolf et al., 2019):
https://github.com/huggingface/transformers,
where the masking probability is 0.15.
PERL consists of three steps that correspond to the three steps of CWE
models, as described in §2: (1) Pre-training (Figure 1a): in which it
utilizes a pre-trained CWE model (encoder, BERT in this work) that was
trained on massive corpora; (2) Fine-tuning (Figure 1b): where it
refines some of the pre-trained encoder weights, based on a
pivot-based objective that is optimized on unlabeled data from the
source and target domains; and (3) Supervised task training (Figure
1c): where task specific layers are trained on source domain labeled
data for the downstream task of interest.
Our pivot selection method is identical to that of Blitzer et al.
(2007) and Ziser and Reichart (2017, 2018). That is, the pivots are
selected independently of the above three-step protocol.
We further present a variant of PERL, denoted by R-PERL, where the
non-contextualized embedding matrix of the BERT model trained at Step
(1) is used in order to regularize PERL during its fine-tuning stage
(Step 2). We elaborate on this model towards the end of this section.
We next provide a detailed description.
Pivot Selection Being a pivot-based language representation model,
PERL is based on high quality pivot extraction. Since the
representation learning is based on a masked language modeling task,
the feature set we address consists of the unigrams and bigrams of the
vocabulary. We base the division of this feature set into pivots and
non-pivots on unlabeled data from the source and target domains. Pivot
features are: (a) Frequent in the unlabeled data from the source and
target domains; and (b) Among those frequent features, pivot features
are the ones whose mutual information with the task label according to
source domain labeled data crosses a pre-defined threshold. Features
that do not meet the above two criteria form the non-pivot feature
subset.
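
The two pivot criteria can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' code: the function names, the whitespace-tokenized inputs, the use of binary feature presence for the mutual information estimate, and the choice to keep the `num_pivots` highest-scoring candidates (instead of applying an explicit threshold) are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens):
    """Unigrams plus space-joined bigrams of a token list."""
    return list(tokens) + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def mutual_information(feature, labeled_docs):
    """MI (in nats) between binary presence of `feature` and the document label."""
    n = len(labeled_docs)
    joint = Counter((feature in set(ngrams(toks)), y) for toks, y in labeled_docs)
    p_x, p_y = Counter(), Counter()
    for (x, y), count in joint.items():
        p_x[x] += count
        p_y[y] += count
    return sum(
        (c / n) * math.log((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in joint.items()
    )


def select_pivots(src_unlabeled, tgt_unlabeled, src_labeled, num_pivots, min_count):
    """Criterion (a): frequent in both domains. Criterion (b): high MI with the label."""
    src_counts = Counter(g for doc in src_unlabeled for g in ngrams(doc))
    tgt_counts = Counter(g for doc in tgt_unlabeled for g in ngrams(doc))
    candidates = [
        g for g in src_counts
        if src_counts[g] >= min_count and tgt_counts[g] >= min_count
    ]
    # Sort by descending MI; ties are broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda g: (-mutual_information(g, src_labeled), g))
    return ranked[:num_pivots]


# Toy data (hypothetical): books as source domain, kitchen appliances as target.
src_unlabeled = [["great", "book"], ["bad", "book"], ["great", "plot"]]
tgt_unlabeled = [["great", "blender"], ["bad", "blender"], ["great", "price"]]
src_labeled = [(["great", "book"], 1), (["great", "plot"], 1),
               (["bad", "book"], 0), (["bad", "plot"], 0)]
toy_pivots = select_pivots(src_unlabeled, tgt_unlabeled, src_labeled,
                           num_pivots=2, min_count=2)
```

In the toy data, "book" is frequent in the source domain but never appears in the target domain, so it fails criterion (a) even before mutual information is considered.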
PERL pre-training (Step 1, Figure 1a) In order to inject prior
language knowledge into our model, we first initialize the PERL
encoder with a powerful pre-trained CWE model. As noted above, our
rationale is that the general language knowledge encoded in these
models, which is not specific to the source or target domains, should
be useful for DA just as it has proven useful for in-domain learning.
In this work we use BERT, although any other CWE model that employs
Figure 1: Illustrations of the three PERL steps. PRD and PLR stand for the BERT prediction head and pooler
head, respectively, FC is a fully connected layer, and msk stands for masked tokens embeddings (embeddings of
tokens that were masked). NSP and MLM are the next sentence prediction and masked language model objectives.
For the definitions of the PRD and PLR layers as well as the NSP objective, see Devlin et al. (2019). We mark
frozen layers (layers whose parameters are kept fixed) and non-frozen layers with snow-flake and fire symbols,
respectively. The token embedding and BERT layer values at the end of each step initialize the corresponding
layers of the next step model. The BERT box of the fine-tuning step is described in more detail in Figure 2.
the MLM objective for pre-training (Step 1) and
fine-tuning (Step 2) could have been used.
PERL fine-tuning (Step 2, Figure 1b) This step is the core novelty of
PERL. Our goal is to refine the initialized encoder on unlabeled data
from the source and the target domains, using the distinction between
pivot and non-pivot features. To this end we fine-tune the parameters
of the pre-trained BERT using its MLM objective, but we choose the
masked words so that the model learns to map non-pivot to pivot
features. Recall that when building the MLM training task, each
training example consists of an input text in which some of the words
are masked, and the task of the model is to predict the identity of
each of the masked words given the rest of the (non-masked) input
text. Whereas in standard MLM training all input tokens have the same
probability to be masked, in the PERL fine-tuning step we change both
the masking probability and the prediction task so that the desired
non-pivot to pivot mapping is learned. We next describe these two
changes; see also a detailed graphical illustration in Figure 2.
Figure 2: The PERL pivot-based fine-tuning task (Step 2).
In this example two tokens are masked, general and
good; only the latter is a pivot. The architecture is
identical to that of BERT but the MLM task and the
masking process are different, taking into account the
pivot/non-pivot distinction.
1. Prediction task. While in standard MLM the task is to predict a
token out of the entire vocabulary, here we define a pivot-based
prediction task. In particular, the model should predict whether the
masked token is a pivot feature or not, and if it is then it has to
identify the pivot. That is, this is a multi-class classification task
where the number of classes is equal to the number of pivots plus 1
(for the non-pivot prediction).
Put more formally, the modified pivot-based MLM objective is:
$$p(y_i = j) = \frac{e^{f(h_i) \cdot W_j}}{\sum_{k=1}^{|P|} e^{f(h_i) \cdot W_k} + e^{f(h_i) \cdot W_{\mathrm{none}}}}$$
where $y_i$ is a masked unigram or bigram at position $i$, $P$ is the
set of pivot features (token unigrams and bigrams), $h_i$ is the
encoder representation for the $i$-th token, $W$ (the FC-Pivots layer
of Figure 1b and Figure 2) is the pivot predictor matrix that maps
from the latent space to the pivot set space ($W_a$ is the $a$-th row
of $W$), and $f$ is a non-linear function composed of a dense layer, a
gelu activation layer and LayerNorm (the PRD layer of Figure 1b and
Figure 2).
2. Masking process. Instead of masking each input token (unigram) with
the same probability, we perform the following masking process. For
each input token (unigram) we first check whether it forms a bigram
pivot together with the next token, and if so we mask this bigram with
a probability of α. If the answer is negative, we check if the token
at hand is a unigram pivot and if so we again mask it with a
probability of α. Finally, if the token is not a pivot we mask it with
a probability of β. Our hyper-parameter tuning process revealed that
the values of α = 0.5 and β = 0.1 provide strong results across our
various experimental setups (see more on this in §6). This way PERL
gives a higher probability to pivot masking, and by doing so the
encoder parameters are fine-tuned so that they can predict (mostly)
pivot features based (mostly) on non-pivot input.
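
The masking process can be sketched as a single left-to-right scan. This is a minimal illustration, not the released implementation: the function name `pivot_mask`, the span representation, and the decision to skip past both tokens of a bigram pivot even when it is not masked are assumptions that the text above does not settle.

```python
import random


def pivot_mask(tokens, pivots, alpha=0.5, beta=0.1, rng=random):
    """Return a list of (start, length) spans selected for masking.

    Bigram pivots are checked first, then unigram pivots (both masked with
    probability alpha); every other token is masked with probability beta.
    `pivots` is a set of unigram and space-joined bigram strings (assumed format).
    """
    spans = []
    i = 0
    while i < len(tokens):
        bigram = f"{tokens[i]} {tokens[i + 1]}" if i + 1 < len(tokens) else None
        if bigram in pivots:
            if rng.random() < alpha:
                spans.append((i, 2))
            i += 2  # assumption: a bigram pivot is consumed as one unit
        elif tokens[i] in pivots:
            if rng.random() < alpha:
                spans.append((i, 1))
            i += 1
        else:
            if rng.random() < beta:
                spans.append((i, 1))
            i += 1
    return spans


review = "the food was very good".split()
review_pivots = {"good", "very good"}
# Degenerate probabilities make the behaviour deterministic for inspection.
only_pivots_masked = pivot_mask(review, review_pivots, alpha=1.0, beta=0.0)
only_non_pivots_masked = pivot_mask(review, review_pivots, alpha=0.0, beta=1.0)
```

With the values reported above (α = 0.5, β = 0.1), a pivot is five times more likely to be masked than a non-pivot token, which is what pushes the encoder to predict pivots from mostly non-pivot context.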
Designing the fine-tuning task this way yields two advantages. First,
the model should shape its parameters so that most of the information
about the input pivots is preserved, while most of the information
preserved about the non-pivots is what is needed in order to predict
the existence of the pivots. This way the model keeps mostly the
information about unigrams and bigrams that are shared among the two
domains and are significant for the supervised task, thus hopefully
increasing its cross-domain generalization capacity.
Second, standard MLM, which has recently been used for fine-tuning in
domain adaptation (Lee et al., 2020; Han and Eisenstein, 2019),
performs a multi-class classification task with 30K tokens,3 which
requires ∼23M parameters as in the FC1 layer of Figure 1. By focusing
PERL on pivot prediction, we can use only a factor of (|P|+1)/30K of
the FC layer parameters, as we do in the FC-pivots layer (Figure 1,
where |P| is the number of pivots; in our experiments
|P| ∈ [100, 500]).
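
The pivot prediction layer and the parameter saving discussed above can be made concrete with a short sketch. It assumes a hidden size of 768 (the BERT-base value) and the rounded 30K vocabulary mentioned in the text, ignores bias terms, and takes the output of f as a plain list of floats; the function names are hypothetical.

```python
import math

HIDDEN_SIZE = 768      # BERT-base hidden dimension
VOCAB_SIZE = 30_000    # rounded vocabulary size used in the text


def pivot_distribution(f_h, weight_rows):
    """Softmax over |P| + 1 classes; the last row of `weight_rows` is the NONE class.

    `f_h` stands for f(h_i), the transformed encoder output at a masked position.
    """
    scores = [sum(a * b for a, b in zip(f_h, row)) for row in weight_rows]
    max_score = max(scores)                    # subtract max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_weight_count(num_classes, hidden_size=HIDDEN_SIZE):
    """Number of weights in a linear output layer (bias terms ignored)."""
    return num_classes * hidden_size


mlm_head = head_weight_count(VOCAB_SIZE)   # 23,040,000 weights, i.e. about 23M
pivot_head = head_weight_count(500 + 1)    # 384,768 weights for |P| = 500
ratio = pivot_head / mlm_head              # equals (|P| + 1) / 30K

# Two pivots plus the NONE class, with a 2-dimensional toy representation.
toy_probs = pivot_distribution([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
```

Under these assumptions the largest pivot head considered in the experiments (|P| = 500) uses under 2% of the weights of the full-vocabulary MLM head.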
Supervised task training (Step 3, Figure 1c) To adjust PERL for a
downstream task, we place a classification network on top of its
encoder. While training on labeled data from the source domain and
testing on the target domain, each input text is first represented by
the encoder and is then fed to the classification network. Because our
focus in this work is on the representation learning, the
classification network is kept simple, consisting of one convolution
layer followed by an average pooling layer and a linear layer. When
training for the downstream task, the encoder weights are frozen.
R-PERL A potential limitation of PERL is that it ignores the semantics
of its pivots. While the negative pivots sad and unhappy encode
similar information with respect to the sentiment classification task,
PERL considers them as two different output classes. To alleviate
this, we propose the regularized PERL (R-PERL) model where
pivot-similarity information is taken into account.
To achieve this we construct the FC-pivots
matrix of R-PERL (Figure 1b and 2) based on the
Token Embedding matrix learned by BERT in its
pre-training stage (Figure 1a). In particular, we fix
the unigram pivot rows of the FC-pivots matrix
to the corresponding rows in BERT’s Token
Embedding matrix, and the bigram pivot rows
to the mean of the Token Embedding rows that
correspond to the unigrams that form this bigram.
The FC-pivots matrix of R-PERL is kept fixed
during fine-tuning.
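
The construction of the fixed FC-pivots matrix can be sketched as follows. This is an illustration under stated assumptions: each unigram is taken to map to exactly one row of the token embedding matrix (word-piece splitting is ignored), the treatment of the non-pivot (NONE) row is not specified in this section and is therefore left out, and the names `build_fc_pivots`, `token_embedding`, and `vocab` are hypothetical.

```python
def build_fc_pivots(pivots, token_embedding, vocab):
    """Build one fixed row per pivot from a pre-trained token embedding matrix.

    A unigram pivot copies its embedding row; a bigram pivot uses the mean of
    the rows of its two unigrams. `vocab` maps a token string to a row index.
    """
    rows = []
    for pivot in pivots:
        vectors = [token_embedding[vocab[word]] for word in pivot.split()]
        # For a unigram this mean is the embedding row itself.
        rows.append([sum(column) / len(vectors) for column in zip(*vectors)])
    return rows


# Toy 2-dimensional embedding matrix (hypothetical values).
toy_vocab = {"sad": 0, "unhappy": 1}
toy_embedding = [[1.0, 2.0], [3.0, 4.0]]
fc_pivots = build_fc_pivots(["sad", "sad unhappy"], toy_embedding, toy_vocab)
```

Because the rows are copied from pre-trained embeddings and then frozen, pivots whose embeddings are close receive similar scores for the same encoder output, which is the parameter-sharing effect the regularization is meant to provide.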
Our assumptions are that: (1) Pivots with similar
meaning, such as sad and unhappy, have similar
representations in the Token Embedding matrix
3The BERT implementation we use keeps a fixed 30K
word vocabulary, derived from its pre-training process.
learned at the pre-training stage (Step 1); and (2) There is a
positive correlation between the appearance of such pivots (i.e., they
tend to appear, or not appear, together; see Ziser and Reichart [2017]
for similar considerations). In its fine-tuning step, R-PERL is hence
biased to learn similar representations for such pivots in order to
capture the positive correlation between them. This follows from the
fact that the probability of a pivot is computed from the dot product
of the encoder representation of the masked position with the pivot's
row in the FC-pivots matrix.
4 Experiments
Tasks and Domains Following a large body of prior DA work, we focus on
the task of binary sentiment classification. For compatibility with
previous literature, we first experiment with the four legacy product
review domains of Blitzer et al. (2007): Books (B), DVDs (D),
Electronic items (E), and Kitchen appliances (K), with a total of 12
cross-domain setups. Each domain has 2,000 labeled reviews, 1,000
positive and 1,000 negative, and unlabeled reviews as follows: B:
6,000, D: 34,741, E: 13,153 and K: 16,785.
We next experiment in a more challenging setup, considering an airline
review dataset (A) (Nguyen, 2015; Ziser and Reichart, 2018). This
setup is challenging both due to the differences between the product
and service domains, and because the prior probability of observing a
positive review in the A domain is much lower than the same
probability in the product domains.4 For the A domain, following Ziser
and Reichart (2018), we randomly sampled 1,000 positive and 1,000
negative reviews for our labeled set, and 39,396 reviews for our
unlabeled set. Due to the heavy computational demands of the
experiments, we arbitrarily chose 3 product to airline and 3
airline to product setups.
We further consider an additional modern domain: IMDb (I) (Maas et
al., 2011),5 which is commonly used in recent sentiment analysis
work. This dataset consists of 50,000 movie reviews from IMDb (25,000
positive and 25,000 negative), where there is a limitation on the
number of reviews per movie. We randomly
4This analysis, performed by Ziser and Reichart (2018),
is based on the gold labels of the unlabeled data.
5The details of the IMDb dataset are available at:
http://www.andrew-maas.net/data/sentiment.
sampled 2,000 labeled reviews, 1,000 positive
and 1,000 negative, for our labeled set, and the
remaining 48,000 reviews form our unlabeled set.6
As above, we arbitrarily chose 2 IMDb to product
and 2 product to IMDb setups for our experiments.
Pivot-based representation learning has proven
instrumental for DA. We hypothesize that it can
also be beneficial for in-domain tasks, as it focuses
the representation on the information encoded in
prominent unigrams and bigrams. To test this
hypothesis we experiment in an in-domain setup,
with the IMDb movie review dataset. We follow
the same experimental setup as in the domain
adaptation case, except that only IMDb unlabeled
data is used for fine-tuning, and the frequency
criterion in pivot selection is defined with respect
to this dataset.
We randomly sampled 25,000 training and 25,000 test examples, keeping
the two sets balanced, and an additional 50,000 reviews formed an
unlabeled balanced set.7 We consider 6 setups, differing in their
training set size: 100, 500, 1K, 2K, 10K, and 20K randomly sampled
examples.
Baselines We compare our PERL and R-PERL models to the following
baselines: (a+b) PBLM-CNN and PBLM-LSTM (Ziser and Reichart, 2018),
differing only in their classification layer (CNN vs. LSTM);8 (c) HATN
(Li et al., 2018);9 (d) BERT; and (e) Fine-tuned BERT (following Lee
et al., 2020 and Han and Eisenstein, 2019): This model is identical to
PERL, except that the fine-tuning stage is performed with a standard
MLM instead of our pivot-based MLM. BERT, Fine-tuned BERT, PBLM-CNN,
PERL, and R-PERL all use the same CNN-based sentiment classifier,
while HATN jointly learns the feature representation and performs
sentiment classification.
Cross-validation We use a five-fold cross-validation protocol, where
in every fold 80% of the source domain examples are randomly selected
for training data, and 20% for development data (both sets are kept
balanced). For each model we report the average results across the
five folds. In each fold we tune the hyper-parameters so as
6We make sure that all reviews of the same movie appear
either in the training set or in the test set.
7These reviews are also part of the IMDb dataset.
8https://github.com/yftah89/PBLM-Domain-
Adaptation.
9https://github.com/hsqmlzno1/HATN.
to minimize the cross-entropy development data
loss.
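
One way to realize the balanced 80%/20% splits of this protocol is sketched below. The text does not say whether the five development sets are disjoint; this sketch assumes a stratified five-fold partition, and the function name `balanced_folds` and the `(text, label)` example format are hypothetical.

```python
import random


def balanced_folds(examples, num_folds=5, seed=0):
    """Yield (train, dev) splits in which each label keeps its overall proportion.

    `examples` is a list of (text, label) pairs. Each label group is shuffled
    once, then every num_folds-th example of the group goes to the dev set.
    """
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    for group in by_label.values():
        rng.shuffle(group)
    for fold in range(num_folds):
        train, dev = [], []
        for group in by_label.values():
            for index, example in enumerate(group):
                (dev if index % num_folds == fold else train).append(example)
        yield train, dev


# Toy labeled set: 10 positive and 10 negative reviews (placeholder texts).
toy_examples = [(f"pos {i}", 1) for i in range(10)] + [(f"neg {i}", 0) for i in range(10)]
toy_folds = list(balanced_folds(toy_examples))
```

Hyper-parameters would then be chosen per fold by the development-set cross-entropy loss, and the reported number is the average over the five folds.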
Hyper-parameter Tuning For all models we
use the WordPiece word embeddings (Wu et al.,
2016) with a vocabulary size of 30k, and the same
optimizer (with the same hyper-parameters) as in
their original paper. For all pivot-based methods
we consider the unigrams and bigrams that appear
at least 20 times both in the unlabeled data of
the source domain and in the unlabeled data of
the target domain as candidates for pivots,10 and
from these we select the |P| candidates with the
highest mutual information with the task source
domain label (|P| ∈ {100, 200, . . . , 500}). The
exception is HATN, which automatically selects its
pivots, which are limited to unigrams.
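The pivot selection criterion above (frequency in both unlabeled sets, then mutual information with the source label) can be illustrated with the following minimal sketch. It is not our released implementation: the helper names (`ngrams`, `mutual_information`, `select_pivots`), the binary presence feature used for the MI computation, and the toy reviews are assumptions made for the example, and the frequency threshold is lowered from 20 to 2 so that the toy data yields candidates.

```python
import math
from collections import Counter

def ngrams(tokens):
    """Unigrams and bigrams of a token list."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def mutual_information(presence, labels):
    """MI (in nats) between a binary feature and a discrete label."""
    n = len(labels)
    joint = Counter(zip(presence, labels))
    count_x, count_y = Counter(presence), Counter(labels)
    return sum((c / n) * math.log(c * n / (count_x[x] * count_y[y]))
               for (x, y), c in joint.items())

def select_pivots(src_labeled, src_unlabeled, tgt_unlabeled, num_pivots, min_count=20):
    """Keep n-grams frequent in both domains, rank them by MI with the source label."""
    src_counts = Counter(g for doc in src_unlabeled for g in ngrams(doc))
    tgt_counts = Counter(g for doc in tgt_unlabeled for g in ngrams(doc))
    candidates = [g for g in src_counts
                  if src_counts[g] >= min_count and tgt_counts[g] >= min_count]
    labels = [label for _, label in src_labeled]
    doc_sets = [set(ngrams(doc)) for doc, _ in src_labeled]
    scored = [(mutual_information([g in s for s in doc_sets], labels), g)
              for g in candidates]
    scored.sort(key=lambda item: (-item[0], item[1]))  # ties broken alphabetically
    return [g for _, g in scored[:num_pivots]]

# Hypothetical toy data: four labeled source reviews, unlabeled source and target text.
src_labeled = [("great sound".split(), 1), ("great value".split(), 1),
               ("poor sound".split(), 0), ("poor value".split(), 0)]
src_unlabeled = [doc for doc, _ in src_labeled] * 2
tgt_unlabeled = [s.split() for s in
                 ("great sound", "poor sound", "great cast", "poor cast")]
pivots = select_pivots(src_labeled, src_unlabeled, tgt_unlabeled,
                       num_pivots=2, min_count=2)
```

In this toy case "sound" is frequent in both domains but carries no label information, so it is ranked below the two sentiment words.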
We next describe the hyper-parameters of
each of the models. Due to our extensive
experimentation (22 DA and 6 in-domain setups,
5-fold cross-validation), we limit our search space,
especially for the heavier components of the
models.
R-PERL, PERL, BERT and Fine-tuned BERT
For the encoder, we use the BERT-base uncased
architecture with the same hyper-parameters as in
Devlin et al. (2019), tuning for PERL, R-PERL
and Fine-tuned BERT the number of fine-tuning
epochs (out of: 20, 40, 60) and the number
of unfrozen BERT layers during the fine-tuning
process (1, 2, 3, 5, 8, 12). For PERL and R-PERL
we tune the number of pivots (100, 200,
300, 400, 500) as well as α and β (0.1, 0.3, 0.5,
0.8). The supervised task classifier is a basic CNN
architecture, which enables us to search over the
number of filters (out of: 16, 32, 64), the filter size
(7, 9, 11) and the training batch size (32, 64).
PBLM-LSTM and PBLM-CNN For PBLM we
tune the input word embedding size (32, 64, 128,
256), the number of pivots (100, 200, 300, 400,
500), and the hidden dimension (128, 256, 512).
For the LSTM classification layer of PBLM-
LSTM we consider the same hidden dimension
and input word embedding size as for the PBLM
encoder. For the CNN classification layer of
PBLM-CNN, following Ziser and Reichart (2018)
we use 250 filters and a kernel size of 3. In each
setup we choose the PBLM model (PBLM-LSTM
or PBLM-CNN) that yields better test set accuracy
and report its result, under PBLM-Max.
10In the in-domain experiments we consider the IMDb
unlabeled data.
HATN The hyper-parameters of Li et al. (2018)
were tuned on a larger training set than ours, and
they hence yield sub-optimal performance in our
setup. We tune the training batch size (20, 50, 300),
the hidden layer size (20, 100, 300), and the word
embedding size (50, 100, 300).
5 Results
Overall results Table 1 presents domain adaptation
results, and is divided into two panels. The
top panel reports results on the 12 setups derived
from the 4 legacy product review domains of
Blitzer et al. (2007) (denoted with P ⇔ P). The
bottom panel reports results for 10 setups involving
product review domains and the IMDb movie
review domain (left side; denoted P ⇔ I) or
the airline review domain (right side; denoted
P ⇔ A). Table 2 presents in-domain results on
the IMDb domain, for various training set sizes.
Domain Adaptation As presented in Table 1,
PERL models are superior in 20 out of 22 DA
setups, with R-PERL performing best in 17 out of
22 setups. In the P ⇔ P setups, their averaged
performance (top table, All column) is 87.5%
and 86.9% (for R-PERL and PERL, respectively)
compared with 82.3% of HATN and 80.7% of
PBLM-Max. Importantly, in the more challenging
setups, the performance of one of these baselines
substantially degrades. Particularly, the averaged
R-PERL and PERL performance in the P ⇔ I
setups is 84.7% and 84.4%, respectively (bottom
panel, left All column), compared with 75.5%
of HATN and 69.0% of PBLM-Max. In the
P ⇔ A setups the averaged R-PERL and PERL
performances are 84.2% and 82.9%, respectively
(bottom panel, right All column), compared with
80.5% of PBLM-Max and only 71.8% of HATN.
The performance of BERT and Fine-tuned
BERT also degrades on the challenging setups:
From an average of 80.2% (BERT) and 84.1%
(Fine-tuned BERT) in P ⇔ P setups, to 74.2%
and 78.9%, respectively, in P ⇔ I setups, and
to 75.6% and 79.4%, respectively, in P ⇔ A
setups. R-PERL and PERL, in contrast, remain
stable across setups, with an averaged accuracy of
84.2–87.5% (R-PERL) and 82.9–86.9% (PERL).
The IMDb and airline domains differ from the
product domains in their topic (movies [IMDb]
and services [airline] vs. products). Moreover, the
unlabeled data from the airline domain contains
P ⇔ P setups:
                  D→K   D→B   E→D   B→D   B→E   B→K   E→B
BERT              82.5  81.0  76.8  80.6  78.8  82.0  78.2
Fine-tuned BERT   86.9  84.1  81.7  84.4  84.2  86.7  80.2
PBLM-Max          83.3  82.5  77.6  84.2  77.6  82.5  71.4
HATN              85.4  83.5  78.8  82.2  78.0  81.2  80.0
PERL              89.9  85.0  85.0  86.5  87.0  89.9  84.3
R-PERL            90.4  85.6  84.8  87.8  87.2  90.2  83.9

                  E→K   D→E   K→D   K→E   K→B   ALL
BERT              85.1  76.5  77.7  84.7  78.5  80.2
Fine-tuned BERT   89.2  82.0  79.8  88.6  81.5  84.1
PBLM-Max          87.8  80.4  79.8  87.1  74.2  80.7
HATN              87.4  83.2  81.0  85.9  81.2  82.3
PERL              90.6  87.1  84.6  90.7  81.9  86.9
R-PERL            91.2  89.3  85.6  91.2  83.0  87.5

P ⇔ I setups:
                  I→E   I→K   E→I   K→I   ALL
BERT              75.4  78.8  72.2  70.6  74.2
Fine-tuned BERT   81.5  78.0  77.6  78.7  78.9
PBLM-Max          70.1  69.8  67.0  69.0  69.0
HATN              74.0  74.4  74.8  78.9  75.5
PERL              87.1  86.3  82.0  82.2  84.4
R-PERL            87.9  86.0  82.5  82.5  84.7

P ⇔ A setups:
                  A→B   A→K   A→E   B→A   K→A   E→A   ALL
BERT              70.9  78.8  77.1  72.1  74.0  81.0  75.6
Fine-tuned BERT   72.9  81.9  83.0  79.5  76.3  82.8  79.4
PBLM-Max          70.6  82.6  81.1  83.8  87.4  87.7  80.5
HATN              58.7  68.8  64.1  77.6  78.5  83.0  71.8
PERL              77.1  84.2  84.6  82.1  83.9  85.3  82.9
R-PERL            78.4  85.9  85.9  84.0  85.1  85.9  84.2

Table 1: Domain adaptation results. The top table is for the legacy product review domains of Blitzer et al. (2007) (denoted as the P ⇔ P setups in the
text). The bottom table involves selected legacy domains as well as the IMDb movie review domain (left; denoted as P ⇔ I) or the airline review domain
(right; denoted as P ⇔ A). The All columns present averaged results across the setups to their left.
Num Sentences   BERT   Fine-tuned BERT   PERL   R-PERL
100             67.9   76.4              81.6   83.9
500             73.9   83.3              84.3   84.6
1k              75.3   83.9              84.6   84.9
2k              77.9   83.6              85.3   85.3
10k             80.9   86.9              87.1   87.5
20k             81.7   86.0              87.8   88.1

Table 2: In-domain results on the IMDb movie
review domain with increasing training set size.
an increased fraction of negative reviews (see §4).
Finally, the IMDb and airline reviews are also
more recent. The success of PERL in the P ⇔ I
and P ⇔ A setups is of particular importance,
as it indicates the potential of our algorithm to
adapt supervised NLP algorithms to domains that
substantially differ from their training domain.
Finally, our results clearly indicate the positive
impact of a pivot-aware approach when fine-tuning
BERT with unlabeled source and target
data. Indeed, the averaged gaps between Fine-tuned
BERT and BERT (3.9% for P ⇔ P, 4.7%
for P ⇔ I, and 3.8% for P ⇔ A) are much
smaller than the corresponding gaps between
R-PERL and BERT (7.3% for P ⇔ P, 10.5%
for P ⇔ I, and 8.6% for P ⇔ A).
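The gaps quoted above follow directly from the All columns of Table 1. The short computation below reproduces them; the dictionary layout and variable names are only a convenience chosen for this illustration.

```python
# Averaged accuracies (All columns of Table 1) for the three groups of setups.
averages = {
    "P<->P": {"BERT": 80.2, "Fine-tuned BERT": 84.1, "R-PERL": 87.5},
    "P<->I": {"BERT": 74.2, "Fine-tuned BERT": 78.9, "R-PERL": 84.7},
    "P<->A": {"BERT": 75.6, "Fine-tuned BERT": 79.4, "R-PERL": 84.2},
}

# Gap of each fine-tuned model over the plain BERT baseline, in accuracy points.
gaps = {
    group: {
        model: round(scores[model] - scores["BERT"], 1)
        for model in ("Fine-tuned BERT", "R-PERL")
    }
    for group, scores in averages.items()
}

for group, gap in gaps.items():
    print(group, gap)
```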
In-domain Results In this setup both the labeled
and the unlabeled data, used for supervised
task training (labeled data, Step 3), fine-tuning
(unlabeled data, Step 2), and pivot selection (both
datasets) come from the same domain (IMDb). As
shown in Table 2, PERL outperforms BERT and
Fine-tuned BERT for all training set sizes.
As expected, the impact of (R-)PERL
diminishes as more labeled training data become
available: from 7.5% (R-PERL vs. Fine-tuned
BERT) when 100 sentences are available, to 2.1%
for 20K training sentences. To our knowledge,
the effectiveness of pivot-based methods for
in-domain learning has not been demonstrated in the
past.
6 Ablation Analysis and Discussion
In order to shed more light on PERL, we conduct
an ablation analysis. We start by uncovering the
hyper-parameters that have strong impact on its
performance, and analyzing its stability across
hyper-parameter configurations. We then explore
the impact of some of the design choices we made
when constructing the model.
Figure 3: The impact of the number of unfrozen PERL
layers during fine-tuning (Step 2).
In order to keep our analysis concise and to
avoid heavy computations, we have to consider
only a handful of arbitrarily chosen DA setups
for each analysis. We follow the five-fold cross-
validation protocol of §4 for hyper-parameter
tuning, except that in some of the analyses a
hyper-parameter of interest is kept fixed.
6.1 Hyper-parameter Analysis
In this analysis we focus on one hyper-parameter
that is relevant only for methods that use massively
pre-trained encoders (the number of unfrozen
encoder layers during fine-tuning), as well as on
two hyper-parameters that impact the core of our
modified MLM objective (number of pivots and
the pivot and non-pivot masking probabilities).
We finally perform stability analysis across
hyper-parameter configurations.
Number of Unfrozen BERT Layers during
Fine Tuning (Step 2, Figure 1b) In Figure 3
we compare PERL final sentiment classification
accuracy with six alternatives: 1, 2, 3, 5, 8,
or 12 unfrozen layers, going from the top to
the bottom layers. We consider 4 arbitrarily
chosen DA setups, where the number of unfrozen
layers is kept fixed during the five-fold
cross-validation process. The general trend is clear:
PERL performance improves as more layers are
unfrozen, and this improvement saturates at 8
unfrozen layers (for the K→A setup the saturation
is at 5 layers). The classification accuracy
improvement (compared to 1 unfrozen layer) is of
4% or more in three of the setups (K→A is again
the exception with only ∼2% improvement).
Across the experiments of this paper, this has
been the single most influential
hyper-parameter of the PERL, R-PERL and
Fine-tuned BERT models.
Figure 4: PERL sentiment classification accuracy
across four setups with a varying number of pivots.
Number of Pivots Following previous work
(e.g., Ziser and Reichart, 2018), our
hyper-parameter tuning process considers 100 to 500
pivots in steps of 100. We would next like to
explore the impact of this hyper-parameter on
PERL performance. Figure 4 presents our results,
for four arbitrarily selected setups. In 3 of 4
setups PERL performance is stable across pivot
numbers. In 2 setups, 100 is the optimal number
of pivots (for the A → B setup with a large
gap), and in the 2 other setups it lags behind
the best value by no more than 0.2%. These
two characteristics (model stability across pivot
numbers and somewhat better performance when
using fewer pivots) were observed across our
experiments with PERL and R-PERL.
Pivot and Non-Pivot Masking Probabilities
We next study the impact of the pivot and
non-pivot masking probabilities, used during PERL
fine-tuning (α and β, respectively, see §3). For
both α and β we consider the values of 0.1, 0.3,
0.5, and 0.8. Figure 5 presents heat maps that
summarize our results. A first observation is the
relative stability of PERL to the values of these
hyper-parameters: The gaps between the best and
worst performing configurations are 2.6% (E →
D), 1.2% (B → E), 3.1% (K → D), and 5.0% (A →
B). A second observation is that extreme α values
(0.1 and 0.8) tend to harm the model. Finally, in 3
of 4 cases the best model performance is achieved
with α = 0.5 and β = 0.1.
Figure 5: Heat maps of PERL performance with
different pivot (a) and non-pivot (b) masking
probabilities. A darker color corresponds to a higher
sentiment classification accuracy.
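To make the role of α and β concrete, the following sketch masks a tokenized review with separate probabilities for pivot and non-pivot tokens. It is a simplified illustration rather than our fine-tuning code: the function name `pivot_mask`, the `NONE` target assigned to masked non-pivots, the restriction to unigram pivots, and the toy sentence are all assumptions of the example; the actual objective is the one defined in §3.

```python
import random

NONE_LABEL = "NONE"  # assumed target for masked tokens that are not pivots

def pivot_mask(tokens, pivots, alpha, beta, rng, mask_token="[MASK]"):
    """Mask pivots with probability alpha and non-pivots with probability beta.

    Returns the masked sequence and, per position, the prediction target
    (the pivot itself, NONE_LABEL, or None when the position is not masked).
    """
    masked, targets = [], []
    for token in tokens:
        is_pivot = token in pivots
        if rng.random() < (alpha if is_pivot else beta):
            masked.append(mask_token)
            targets.append(token if is_pivot else NONE_LABEL)
        else:
            masked.append(token)
            targets.append(None)
    return masked, targets

# Hypothetical example with two unigram pivots.
tokens = "the battery life is great but the screen is poor".split()
pivots = {"great", "poor"}
masked, targets = pivot_mask(tokens, pivots, alpha=0.5, beta=0.1,
                             rng=random.Random(0))
```

With α much larger than β, as in the best-performing configuration above, most of the prediction targets are pivots, which is what makes the fine-tuning step pivot-aware.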
Stability Analysis We finally turn to analyze the
stability of the PERL models compared with the
baselines. Previous work on PBLM and HATN
has demonstrated their instability across model
configurations (see Ziser and Reichart [2019]
for PBLM and Cui et al. [2019] for HATN).
As noted in Ziser and Reichart (2019),
cross-configuration stability is of particular importance
in unsupervised domain adaptation as the
hyper-parameter configuration is selected using
unlabeled data from the source, rather than the target
domain.
In this analysis a hyper-parameter value is not
considered for a model if it is not included in the
best hyper-parameter configuration of that model
for at least one DA setup. Hence, for PERL we
fix the number of unfrozen layers (8), the number
of pivots (100), and set (α, β) = (0.5, 0.1), and
for PBLM we consider only word embedding sizes
of 128 and 256. Other than that, we consider
all possible hyper-parameter configurations of
all models (§4, 54 configurations for PERL,
R-PERL and Fine-tuned BERT, 18 for BERT, 30 for
PBLM and 27 for HATN). Table 3 presents the
minimum (min), maximum (max), average (avg),
and standard deviation (std) of the test set scores
across the hyper-parameter configurations of each
model, for 4 arbitrarily selected setups.
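The configuration counts above can be checked by enumerating the grids of §4 once the fixed hyper-parameters are removed. The decomposition below is one reading that is consistent with the reported counts; the dictionary keys, the helper names, and the use of the population standard deviation in `summarize` are assumptions of this sketch.

```python
from itertools import product
from statistics import mean, pstdev

cnn = {"filters": [16, 32, 64], "filter_size": [7, 9, 11], "batch_size": [32, 64]}
grids = {
    # Unfrozen layers, pivots, alpha and beta are held fixed in this analysis.
    "PERL / R-PERL / Fine-tuned BERT": {"epochs": [20, 40, 60], **cnn},
    "BERT": cnn,
    "PBLM": {"embedding": [128, 256], "pivots": [100, 200, 300, 400, 500],
             "hidden": [128, 256, 512]},
    "HATN": {"batch_size": [20, 50, 300], "hidden": [20, 100, 300],
             "embedding": [50, 100, 300]},
}

def configurations(grid):
    """All combinations of the values in a hyper-parameter grid."""
    return [dict(zip(grid, values)) for values in product(*grid.values())]

def summarize(scores):
    """Statistics of the kind reported in Table 3 for one model and setup."""
    return {"min": min(scores), "max": max(scores),
            "avg": mean(scores), "std": pstdev(scores)}

counts = {name: len(configurations(grid)) for name, grid in grids.items()}
```

Here `counts` gives 54, 18, 30 and 27 configurations, matching the numbers stated above; `summarize` would be applied to the test scores of all configurations of a model in a given setup.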
In all 4 setups, PERL and R-PERL consistently
achieve higher avg, max, and min values and lower
std values compared to the other models (with the
exception of PBLM achieving higher max for
K → A). Moreover, the std values of PBLM
and especially HATN are substantially higher
than those of the models that use BERT. Yet,
PERL and R-PERL demonstrate lower std values
compared to BERT and Fine-tuned BERT in 3 of
4 setups, indicating that our method contributes to
stability beyond the documented contribution of
BERT itself (Hao et al., 2019).
6.2 Design Choice Analysis
Impact of Pivot Selection One design choice
that impacts our results is the method through
which pivots are selected. We next compare three
alternatives to our pivot selection method, keeping
all other aspects of PERL fixed. As above, we
arbitrarily select four setups.
We consider the following pivot selection
methods: (a) Random-Frequent: Pivots are
randomly selected from the unigrams and bigrams
that appear at least 80 times in the unlabeled
data of each of the domains; (b) High-MI, No
Target: We select the pivots that have the highest
mutual information (MI) with the source domain
label, but appear less than 10 times in the target
domain unlabeled data; (c) Oracle (Miller, 2019):
Here the pivots are selected according to our
method, but the labeled data used for pivot-label
MI computation is the target domain test data
rather than the source domain training data. This
is an upper bound on the performance of our
method since it uses target domain labeled data,
which is not available to us. For all methods we
select 100 pivots (see above).
Table 5 presents the results of the four PERL
variants, and compares them to BERT and
Fine-tuned BERT. We observe four patterns in the
results. First, PERL with our pivot selection
method, which emphasizes both high MI with
the task label and high frequency in both the
source and target domains, is the best performing
model. Second, PERL with Random-Frequent
pivot selection is substantially outperformed by
PERL, but it still performs better than BERT (in 3
of 4 setups), probably because BERT is not tuned
on unlabeled data from the participating domains.
Yet, PERL with Random-Frequent pivots is
E→D               avg   max   min   std
R-PERL            84.6  85.8  83.1   0.7
PERL              85.2  86.0  84.4   0.4
Fine-tuned BERT   81.3  83.2  79.0   1.2
BERT              75.0  76.8  70.6   1.8
PBLM              71.7  79.3  65.9   3.4
HATN              73.7  81.1  53.9  10.7

B→K               avg   max   min   std
R-PERL            89.5  90.5  88.8   0.5
PERL              89.4  90.2  88.8   0.3
Fine-tuned BERT   86.9  87.7  84.9   0.8
BERT              81.1  82.5  78.6   1.1
PBLM              78.6  84.1  71.3   3.3
HATN              76.8  82.8  59.5   7.7

A→B               avg   max   min   std
R-PERL            75.3  79.0  72.0   1.7
PERL              73.9  77.1  70.9   1.7
Fine-tuned BERT   72.1  74.2  68.2   1.7
BERT              69.9  73.0  66.9   1.8
PBLM              64.2  71.6  60.9   2.7
HATN              57.6  65.0  53.7   3.5

K→A               avg   max   min   std
R-PERL            85.3  86.4  84.6   0.5
PERL              83.8  84.9  81.5   0.9
Fine-tuned BERT   77.8  82.1  67.1   4.2
BERT              70.4  74.0  65.1   2.6
PBLM              76.1  86.1  66.2   6.8
HATN              72.1  79.2  53.9   9.9

Table 3: Stability analysis.
outperformed by the Fine-tuned BERT in all
setups, indicating that it provides a sub-optimal
way of exploiting source and target unlabeled data.
Third, in 3 of 4 setups, PERL with the High-MI,
No Target pivots is outperformed by the baseline
BERT model. This is a clear indication of the
sub-optimality of this pivot selection method that
yields a model that is inferior even to a model
that was not tuned on source and target domain
data. Finally, although, unsurprisingly, PERL with
oracle pivots outperforms the standard PERL, the
gap is smaller than 2% in all four cases. Our results
clearly demonstrate the strong positive impact of
our pivot selection method on the performance of
PERL.
B → E             5 layers   8 layers   10 layers   12 layers (full)
BERT              70.9       75.9       80.6        78.8
Fine-tuned BERT   74.6       76.5       84.2        84.2
PERL (Ours)       81.1       83.2       88.2        87.0

A → K             5 layers   8 layers   10 layers   12 layers (full)
BERT              71.2       74.9       81.2        78.8
Fine-tuned BERT   74.0       76.3       80.8        81.9
PERL (Ours)       77.7       80.2       84.7        84.2

Table 4: Classification accuracy with reduced-size encoders.
                     B → E   K → D   E → K   D → B
BERT                 78.8    77.7    85.1    81.0
Fine-tuned BERT      84.2    79.8    89.2    84.1
High-MI, No Target   76.2    76.4    84.9    83.7
Random-Frequent      79.7    76.8    85.5    81.7
PERL (Ours)          87.0    84.6    90.6    85.0
Oracle               88.9    85.6    91.5    86.7

Table 5: Impact of PERL’s pivot selection method.
                         B → E   K → D   A → B   I → E
No fine-tuning
  BERT                   78.8    77.7    70.9    75.4
Source data only
  Fine-tuned BERT        80.7    79.8    69.4    81.0
  PERL                   79.6    82.2    69.8    84.4
Target data only
  Fine-tuned BERT        82.0    80.9    71.6    81.1
  PERL                   86.9    83.0    71.8    84.2
Source and target data
  Fine-tuned BERT        84.2    79.8    72.9    81.5
  PERL                   87.0    84.6    77.1    87.1

Table 6: Impact of fine-tuning data selection.
Unlabeled Data Selection Another design
choice we consider is the impact of the type
of fine-tuning data. While we followed previous
work (e.g., Ziser and Reichart, 2018) and used
the unlabeled data from both the source and target
domains, it might be that data from only one of
the domains, particularly the target, is a better
choice. As above, we explore this question on
4 arbitrarily selected domain pairs. The results,
presented in Table 6, clearly indicate that our
choice to use unlabeled data from both domains
is optimal, particularly when transferring from a
non-product domain (A or I) to a product domain.
Reduced Size Encoder We finally explore the
effect of the fine-tuning step on the performance
of reduced-size models. By doing this we address
a major limitation of pre-trained encoders, namely their
size, which prevents them from running on small
computational devices and dictates long run times.
For this experiment we prune the top encoder
layers before its fine-tuning step, yielding three
new model sizes, with 5, 8, or 10 layers, compared
with the full 12 layers. This is done both for
Fine-tuned BERT and for PERL. We then tune the
number of the encoder’s top unfrozen layers during
fine-tuning, as follows: 5 layer-encoder (1, 2, 3);
8 layer-encoder (1, 3, 4, 5); 10 layer-encoder (1,
3, 5, 8); and full encoder (1, 2, 3, 5, 8, 12). For
comparison, we utilize the BERT model when
its top layers are pruned, and no fine-tuning is
performed. We focus on two arbitrarily selected
DA setups.
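The pruning and unfreezing scheme above can be summarized by the following sketch, which only tracks layer indices and does not touch an actual encoder. The function name `layer_plan`, the index convention (0 is the bottom layer), and the returned dictionary are assumptions made for the illustration; the grid of unfrozen-layer counts is the one listed in the text.

```python
# Candidate numbers of unfrozen top layers for each encoder size (from the text).
UNFROZEN_GRID = {5: (1, 2, 3), 8: (1, 3, 4, 5), 10: (1, 3, 5, 8),
                 12: (1, 2, 3, 5, 8, 12)}

def layer_plan(kept_layers, unfrozen_top, total_layers=12):
    """Split encoder layer indices (0 = bottom) into pruned, frozen and trainable."""
    if not 1 <= unfrozen_top <= kept_layers <= total_layers:
        raise ValueError("need 1 <= unfrozen_top <= kept_layers <= total_layers")
    kept = list(range(kept_layers))                   # bottom layers survive pruning
    pruned = list(range(kept_layers, total_layers))   # top layers are removed
    split = kept_layers - unfrozen_top
    return {"pruned": pruned, "frozen": kept[:split], "trainable": kept[split:]}

# One plan per (encoder size, number of unfrozen layers) pair in the grid.
plans = {(size, k): layer_plan(size, k)
         for size, ks in UNFROZEN_GRID.items() for k in ks}
```

For example, an 8-layer encoder with 3 unfrozen layers drops layers 8 to 11, keeps layers 0 to 4 frozen, and fine-tunes layers 5 to 7.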
Table 4 presents accuracy results. In both setups
PERL with 10 layers is the best performing
model. Moreover, for each number of layers,
PERL outperforms the other two models, with
particularly substantial improvements for 5 and
8 layers (i.e., 7.3% and 6.7%, over BERT and
Fine-tuned BERT, respectively, for B → E and 8
layers).
Reduced-size PERL is of course much faster
than the full model. The averaged run-time of the
full (12 layers) PERL on our test-sets is 196.5
msec and 9.9 msec on CPU (Skylake i9-7920X,
2.9 GHz, single thread) and GPU (GeForce GTX
1080 Ti), respectively. For 8 layers the numbers
drop to 132.4 msec (CPU) and 6.9 msec (GPU)
and for 5 layers to 84.0 (CPU) and 4.7 (GPU)
msec.
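The relative speed-ups implied by these run-times can be computed directly; the variable names below are chosen only for this illustration.

```python
# Averaged run-times in msec, as reported above, keyed by number of encoder layers.
runtime_msec = {
    12: {"CPU": 196.5, "GPU": 9.9},
    8: {"CPU": 132.4, "GPU": 6.9},
    5: {"CPU": 84.0, "GPU": 4.7},
}

full = runtime_msec[12]
# Speed-up of each reduced-size encoder relative to the full 12-layer model.
speedup = {
    layers: {device: round(full[device] / t, 2) for device, t in times.items()}
    for layers, times in runtime_msec.items()
}
```

On CPU this corresponds to a speed-up of roughly 1.5x for 8 layers and 2.3x for 5 layers.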
7 Conclusions
We presented PERL, a domain-adaptation model
that fine-tunes a massively pre-trained deep
contextualized embedding encoder (BERT) with a
pivot-based MLM objective. PERL outperforms
strong baselines across 22 sentiment classification
DA setups, improves in-domain model performance,
increases its cross-configuration stability
and yields effective reduced-size models.
Our focus in this paper is on binary sentiment
classification, as was done in a large body of
previous DA work. In future work we would
like to extend PERL’s reach to structured (e.g.,
dependency parsing and aspect-based sentiment
classification) and generation (e.g., abstractive
summarization and machine translation) NLP
tasks.
Acknowledgments
We would like to thank the action editor and the
reviewers, Yftah Ziser, as well as the members of
the IE@Technion NLP group for their valuable
feedback and advice. This research was partially
funded by an ISF personal grant no. 1625/18.
References
Shai Ben-David, John Blitzer, Koby Crammer,
Alex Kulesza, Fernando Pereira, and Jennifer
Wortman Vaughan. 2010. A theory of learning
from different domains. Machine Learning,
79(1-2):151–175.
John Blitzer, Mark Dredze, and Fernando Pereira.
2007. Biographies, Bollywood, boom-boxes
and blenders: Domain adaptation for sentiment
classification. In John A. Carroll, Antal van den
Bosch, and Annie Zaenen, editors, ACL 2007,
Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics,
June 23-30, 2007, Prague, Czech Republic. The
Association for Computational Linguistics.
John Blitzer, Ryan T. McDonald, and Fernando
Pereira. 2006. Domain adaptation with structural
correspondence learning. In Dan Jurafsky
and Éric Gaussier, editors, EMNLP 2006,
Proceedings of the 2006 Conference on
Empirical Methods in Natural Language
Processing, 22-23 July 2006, Sydney, Australia,
pages 120–128. ACL.
Danushka Bollegala, Takanori Maehara, and
Ken-ichi Kawarabayashi. 2015. Unsupervised
cross-domain word representation learning. In
Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistics
and the 7th International Joint Conference on
Natural Language Processing of the Asian
Federation of Natural Language Processing,
ACL 2015, July 26-31, 2015, Beijing, China,
Volume 1: Long Papers, pages 730–740. The
Association for Computer Linguistics.
Minmin Chen, Kilian Q. Weinberger, and Yixin
Chen. 2011. Automatic feature decomposition
for single view co-training. In Lise Getoor
and Tobias Scheffer, editors, Proceedings of
the 28th International Conference on Machine
Learning, ICML 2011, Bellevue, Washington,
USA, June 28 – July 2, 2011, pages 953–960.
Omnipress.
Minmin Chen, Zhixiang Eddie Xu, Kilian Q.
Weinberger, and Fei Sha. 2012. Marginalized
denoising autoencoders for domain adaptation.
In Proceedings of the 29th International
Conference on Machine Learning, ICML 2012,
Edinburgh, Scotland, UK, June 26 – July 1,
2012. icml.cc / Omnipress.
Stéphane Clinchant, Gabriela Csurka, and Boris
Chidlovskii. 2016. A domain adaptation
regularization for denoising autoencoders. In
Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics, ACL
2016, August 7-12, 2016, Berlin, Germany,
Volume 2: Short Papers. The Association for
Computer Linguistics.
Wanyun Cui, Guangyu Zheng, Zhiqiang Shen,
Sihang Jiang, and Wei Wang. 2019. Transfer
learning for sequences via learning to collocate.
In 7th International Conference on Learning
Representations, ICLR 2019, New Orleans,
LA, USA, May 6-9, 2019. OpenReview.net.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: pre-training
of deep bidirectional transformers for language
understanding. In Jill Burstein, Christy Doran,
and Thamar Solorio, editors, Proceedings of
the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
NAACL-HLT 2019, Minneapolis, MN, USA,
June 2-7, 2019, Volume 1 (Long and Short
Papers), pages 4171–4186. Association for
Computational Linguistics.
Timothy Dozat and Christopher D. Manning.
2017. Deep biaffine attention for neural
dependency parsing. In 5th International Conference
on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference
Track Proceedings. OpenReview.net.
Sergey Edunov, Myle Ott, Michael Auli, and
David Grangier. 2018. Understanding
back-translation at scale. In Ellen Riloff, David
Chiang, Julia Hockenmaier, and Jun’ichi Tsujii,
editors, Proceedings of the 2018 Conference
on Empirical Methods in Natural Language
Processing, Brussels, Belgium, October 31 –
November 4, 2018, pages 489–500. Association
for Computational Linguistics.
Yaroslav Ganin, Evgeniya Ustinova, Hana
Ajakan, Pascal Germain, Hugo Larochelle,
François Laviolette, Mario Marchand, and
Victor S. Lempitsky. 2016. Domain-adversarial
training of neural networks. J. Mach. Learn.
Res., 17:59:1–59:35.
Xavier Glorot, Antoine Bordes, and Yoshua
Bengio. 2011. Domain adaptation for
large-scale sentiment classification: A deep learning
approach. In Lise Getoor and Tobias Scheffer,
editors, Proceedings of the 28th International
Conference on Machine Learning, ICML 2011,
Bellevue, Washington, USA, June 28 – July 2,
2011, pages 513–520. Omnipress.
Stephan Gouws, Gert-Jan Van Rooyen, and
Yoshua Bengio. 2012. Learning structural
correspondences across different linguistic
domains with synchronous neural language
models. In Proc. of the xLite Workshop on
Cross-Lingual Technologies, NIPS.
Xiaochuang Han and Jacob Eisenstein. 2019.
Unsupervised domain adaptation of contextualized
embeddings: A case study in early modern
english. CoRR, abs/1904.02817.
Yaru Hao, Li Dong, Furu Wei, and Ke Xu.
2019. Visualizing and understanding the
effectiveness of BERT. In Kentaro Inui,
Jing Jiang, Vincent Ng, and Xiaojun Wan,
editors, Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing,
EMNLP-IJCNLP 2019, Hong Kong, China,
November 3-7, 2019, pages 4141–4150.
Association for Computational Linguistics.
Jiayuan Huang, Alexander J. Smola, Arthur
Gretton, Karsten M. Borgwardt, and Bernhard
Schölkopf. 2006. Correcting sample selection
bias by unlabeled data. In Bernhard Schölkopf,
John C. Platt, and Thomas Hofmann, editors,
Advances in Neural Information Processing
Systems 19, Proceedings of the Twentieth
Annual Conference on Neural Information
Processing Systems, Vancouver, British Columbia,
Canada, December 4-7, 2006, pages 601–608.
MIT Press.
Hal Daumé III and Daniel Marcu. 2006. Domain
adaptation for statistical classifiers. J. Artif.
Intell. Res., 26:101–126.
Jing Jiang and ChengXiang Zhai. 2007. Instance
weighting for domain adaptation in NLP. In
John A. Carroll, Antal van den Bosch, and
Annie Zaenen, editors, ACL 2007,
Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics,
June 23-30, 2007, Prague, Czech Republic. The
Association for Computational Linguistics.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
Kevin Gimpel, Piyush Sharma, and
Radu Soricut. 2020. ALBERT: A lite BERT
for self-supervised learning of language
representations. In 8th International Conference
on Learning Representations, ICLR 2020,
Addis Ababa, Ethiopia, Abril 26-30, 2020.
OpenReview.net.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim,
Donghyeon Kim, Sunkyu Kim, Chan Ho So,
and Jaewoo Kang. 2020. Biobert: A pre-trained
language representation model
biomedical
text mining. Bioinformatics,
for biomedical
36(4):1234–1240.
Zheng Li, Ying Wei, Yu Zhang, and Qiang Yang.
2018. Hierarchical attention transfer network
for cross-domain sentiment classification. En
Sheila A. McIlraith and Kilian Q. Weinberger,
editores, Actas de
the Thirty-Second
Conferencia AAAI sobre Inteligencia Artificial,
(AAAI-18), the 30th innovative Applications of
Artificial Intelligence (IAAI-18), and the 8th
AAAI Symposium on Educational Advances
in Artificial
Inteligencia (EAAI-18), Nuevo
Orleans, Luisiana, EE.UU, Febrero 2-7, 2018,
pages 5852–5859. AAAI Press.
Zheng Li, Yu Zhang, Ying Wei, Yuxiang Wu,
and Qiang Yang. 2017. End-to-end adversarial
memory network for cross-domain sentiment
clasificación. In Carles Sierra, editor, Pro-
ceedings of the Twenty-Sixth International Joint
Conference on Artificial Intelligence, IJCAI
2017, Melbourne, Australia, Agosto 19-25,
2017, pages 2237–2243. ijcai.org.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
mike lewis, Lucas Zettlemoyer, and Veselin
Stoyanov. 2019. Roberta: A robustly optimized
BERT pretraining approach. CORR, abs/1907.
11692.
Christos Louizos, Kevin Swersky, Yujia Li, máx.
Welling, and Richard S. Zemel. 2016. El
variational fair autoencoder. In Yoshua Bengio
and Yann LeCun, editores, 4th International
Conferencia sobre Representaciones del Aprendizaje, ICLR
2016, San Juan, Puerto Rico, Puede 2-4, 2016,
Conference Track Proceedings.
andres. Maas, Raymond E. Daly, Peter T.
Pham, Dan Huang, Andrew Y. Ng, y
Christopher Potts. 2011. Learning word vectors
for sentiment analysis. In Dekang Lin, Yuji
Matsumoto, and Rada Mihalcea, editores, El
49ª Reunión Anual de la Asociación de
519
Ligüística computacional: Human Language
Technologies, Proceedings of the Conference,
19-24 Junio, 2011, Portland, Oregón, EE.UU,
pages 142–150. The Association for Computer
Lingüística.
Yishay Mansour, Mehryar Mohri, and Afshin
Rostamizadeh. 2008. Domain adaptation with
multiple sources.
In Daphne Koller, Valle
Schuurmans, Yoshua Bengio, and L´eon Bottou,
Información
editores, Avances
Sistemas de procesamiento 21, Actas de
el
Twenty-Second Annual Conference on Neural
Sistemas de procesamiento de información, vancouver,
British Columbia, Canada, December 8-11,
2008, pages 1041–1048. Asociados Curran,
Cª.
in Neural
David McClosky, Eugene Charniak, and Mark
Johnson. 2010. Automatic domain adaptation
for parsing. In Human Language Technologies:
Conference of the North American Chapter of
the Association of Computational Linguistics,
Proceedings, June 2-4, 2010, Los Angeles,
California, USA, pages 28–36. The Association
for Computational Linguistics.
Timothy A. Miller. 2019. Simplified neural
unsupervised domain adaptation. In Jill
Burstein, Christy Doran, and Thamar Solorio,
editors, Proceedings of the 2019 Conference
of the North American Chapter of the
Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT
2019, Minneapolis, Minnesota, USA, June 2-7,
2019, Volume 1 (Long and Short Papers),
pages 414–419. Association for Computational
Linguistics.
Quang Nguyen. 2015. The airline review dataset.
Thien Huu Nguyen, Barbara Plank, and Ralph
Grishman. 2015. Semantic representations for
domain adaptation: A case study on the tree
kernel-based method for relation extraction. In
Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistics
and the 7th International Joint Conference on
Natural Language Processing of the Asian
Federation of Natural Language Processing,
ACL 2015, July 26-31, 2015, Beijing, China,
Volume 1: Long Papers, pages 635–644. The
Association for Computer Linguistics.
Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun,
Qiang Yang, and Zheng Chen. 2010. Cross-domain
sentiment classification via spectral
feature alignment. In Michael Rappa, Paul
Jones, Juliana Freire, and Soumen Chakrabarti,
editors, Proceedings of the 19th International
Conference on World Wide Web, WWW 2010,
Raleigh, North Carolina, USA, April 26-30,
2010, pages 751–760. ACM.
Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized
word representations. In Marilyn A.
Walker, Heng Ji, and Amanda Stent, editors,
Proceedings of the 2018 Conference of
the North American Chapter of the Association
for Computational Linguistics: Human
Language Technologies, NAACL-HLT 2018,
New Orleans, Louisiana, USA, June 1-6, 2018,
Volume 1 (Long Papers), pages 2227–2237.
Association for Computational Linguistics.
Barbara Plank and Alessandro Moschitti. 2013.
Embedding semantic similarity in tree kernels
for domain adaptation of relation extraction.
In Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics,
ACL 2013, 4-9 August 2013, Sofia, Bulgaria,
Volume 1: Long Papers, pages 1498–1507. The
Association for Computer Linguistics.
Alec Radford, Karthik Narasimhan, Tim
Salimans, and Ilya Sutskever. 2018. Improving
language understanding by generative pre-
training. https://s3-us-west-2.amazonaws.com/
openai-assets/researchcovers/languageunsuper
vised/language understandingpaper.pdf.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI Blog, 1(8).
Brian Roark and Michiel Bacchiani. 2003.
Supervised and unsupervised PCFG adaptation to
novel domains. In Marti A. Hearst and Mari
Ostendorf, editors, Human Language Technology
Conference of the North American
Chapter of the Association for Computational
Linguistics, HLT-NAACL 2003, Edmonton,
Canada, May 27 – June 1, 2003. The
Association for Computational Linguistics.
Guy Rotman and Roi Reichart. 2019. Deep
contextualized self-training for low resource
dependency parsing. Transactions of the Association
for Computational Linguistics, 7:695–713.
Alexander M. Rush, Roi Reichart, Michael
Collins, and Amir Globerson. 2012. Improved
parsing and POS tagging using inter-sentence
consistency constraints. In Jun'ichi Tsujii,
James Henderson, and Marius Pasca, editors,
Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language
Processing and Computational Natural Language
Learning, EMNLP-CoNLL 2012, July 12-14,
2012, Jeju Island, Korea, pages 1434–1444. ACL.
Tobias Schnabel and Hinrich Schütze. 2014.
FLORS: fast and simple domain adaptation
for part-of-speech tagging. Transactions of
the Association for Computational Linguistics,
2:15–26.
Masashi Sugiyama, Shinichi Nakajima, Hisashi
Kashima, Paul von Bünau, and Motoaki
Kawanabe. 2007. Direct importance estimation
with model selection and its application
to covariate shift adaptation. In John C. Platt,
Daphne Koller, Yoram Singer, and Sam T.
Roweis, editors, Advances in Neural Information
Processing Systems 20, Proceedings of
the Twenty-First Annual Conference on Neural
Information Processing Systems, Vancouver,
British Columbia, Canada, December 3-6,
2007, pages 1433–1440. Curran Associates,
Inc.
Manshu Tu and Bing Wang. 2019. Adding
prior knowledge in hierarchical attention neural
network for cross domain sentiment
classification. IEEE Access, 7:32578–32588.
Pascal Vincent, Hugo Larochelle, Yoshua Bengio,
and Pierre-Antoine Manzagol. 2008. Extracting
and composing robust features with denoising
autoencoders. In William W. Cohen, Andrew
McCallum, and Sam T. Roweis, editors,
Machine Learning, Proceedings of the
Twenty-Fifth International Conference (ICML 2008),
Helsinki, Finland, June 5-9, 2008, volume 307
of ACM International Conference Proceeding
Series, pages 1096–1103. ACM.
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Rémi Louf,
Morgan Funtowicz, and Jamie Brew. 2019.
HuggingFace's transformers: State-of-the-art
natural language processing. CoRR,
abs/1910.03771.
Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin
Gao, Klaus Macherey, Jeff Klingner, Apurva
Shah, Melvin Johnson, Xiaobing Liu, Lukasz
Kaiser, Stephan Gouws, Yoshikiyo Kato,
Taku Kudo, Hideto Kazawa, Keith Stevens,
George Kurian, Nishant Patil, Wei Wang,
Cliff Young, Jason Smith, Jason Riesa, Alex
Rudnick, Oriol Vinyals, Greg Corrado, Macduff
Hughes, and Jeffrey Dean. 2016. Google's
neural machine translation system: Bridging the
gap between human and machine translation.
CoRR, abs/1609.08144.
Gui-Rong Xue, Wenyuan Dai, Qiang Yang,
and Yong Yu. 2008. Topic-bridged PLSA for
cross-domain text classification. In Sung-Hyon
Myaeng, Douglas W. Oard, Fabrizio Sebastiani,
Tat-Seng Chua, and Mun-Kew Leong, editors,
Proceedings of the 31st Annual International
ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR 2008,
Singapore, July 20-24, 2008, pages 627–634.
ACM.
Yi Yang and Jacob Eisenstein. 2014. Fast easy
unsupervised domain adaptation with marginalized
structured dropout. In Proceedings of the
52nd Annual Meeting of the Association for
Computational Linguistics, ACL 2014, June
22-27, 2014, Baltimore, Maryland, USA, Volume 2:
Short Papers, pages 538–544. The Association
for Computer Linguistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G.
Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. 2019. XLNet: Generalized autoregressive
pretraining for language understanding. In
Hanna M. Wallach, Hugo Larochelle, Alina
Beygelzimer, Florence d'Alché-Buc, Emily B.
Fox, and Roman Garnett, editors, Advances
in Neural Information Processing Systems 32:
Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019,
8-14 December 2019, Vancouver, BC, Canada,
pages 5754–5764.
Jianfei Yu and Jing Jiang. 2016. Learning
sentence embeddings with auxiliary tasks
for cross-domain sentiment classification. In
Jian Su, Xavier Carreras, and Kevin Duh,
editors, Proceedings of the 2016 Conference
on Empirical Methods in Natural Language
Processing, EMNLP 2016, Austin, Texas, USA,
November 1-4, 2016, pages 236–246. The
Association for Computational Linguistics.
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin
Jiang, Maosong Sun, and Qun Liu. 2019.
ERNIE: Enhanced language representation
with informative entities. In Anna Korhonen,
David R. Traum, and Lluís Màrquez, editors,
Proceedings of the 57th Conference of the
Association for Computational Linguistics, ACL
2019, Florence, Italy, July 28-August 2, 2019,
Volume 1: Long Papers, pages 1441–1451.
Association for Computational Linguistics.
Yftah Ziser and Roi Reichart. 2017. Neural
structural correspondence learning for domain
adaptation. In Roger Levy and Lucia Specia,
editors, Proceedings of the 21st Conference
on Computational Natural Language Learning
(CoNLL 2017), Vancouver, Canada, August
3-4, 2017, pages 400–410. Association for
Computational Linguistics.
Yftah Ziser and Roi Reichart. 2018. Pivot based
language modeling for improved neural domain
adaptation. In Marilyn A. Walker, Heng Ji, and
Amanda Stent, editors, Proceedings of the 2018
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT
2018, New Orleans, Louisiana, USA, June
1-6, 2018, Volume 1 (Long Papers), pages
1241–1251. Association for Computational
Linguistics.
Yftah Ziser and Roi Reichart. 2019. Task
refinement learning for improved accuracy and
stability of unsupervised domain adaptation.
In Anna Korhonen, David R. Traum, and
Lluís Màrquez, editors, Proceedings of the
57th Conference of the Association for
Computational Linguistics, ACL 2019, Florence,
Italy, July 28-August 2, 2019, Volume 1: Long
Papers, pages 5895–5906. Association for
Computational Linguistics.