Model Compression for Domain Adaptation through Causal Effect
Estimation
Guy Rotman∗, Amir Feder∗, Roi Reichart
Faculty of Industrial Engineering and Management, Technion, IIT, Israel
grotman@campus.technion.ac.il
feder@campus.technion.ac.il
roiri@technion.ac.il
Abstract
Recent improvements in the predictive quality
of natural language processing systems are of-
ten dependent on a substantial increase in the
number of model parameters. This has led to
various attempts of compressing such models,
but existing methods have not considered the
differences in the predictive power of various
model components or in the generalizability
of the compressed models. To understand the
connection between model compression and
out-of-distribution generalization, we define
the task of compressing language representa-
tion models such that they perform best in a
domain adaptation setting. We choose to ad-
dress this problem from a causal perspective,
attempting to estimate the average treatment
effect (ATE) of a model component, such as
a single layer, on the model’s predictions.
Our proposed ATE-guided Model Compression
scheme (AMoC) generates many model
candidates, differing by the model components
that were removed. Then, we select the best
candidate through a stepwise regression model
that utilizes the ATE to predict the expected
performance on the target domain. AMoC
outperforms strong baselines on dozens of do-
main pairs across three text classification and
sequence tagging tasks.1
1 Introduction
The rise of deep neural networks has transformed
the way we represent language, allowing models
to learn useful features directly from raw inputs.
However, recent improvements in the predictive
quality of language representations are often re-
lated to a substantial increase in the number of
model parameters. Indeed, the introduction of the
Transformer architecture (Vaswani et al., 2017)
∗Authors contributed equally.
1 Our code and data are available at: https://github.com/rotmanguy/AMoC.
and attention-based models (Devlin et al., 2019;
Liu et al., 2019; Brown et al., 2020) have improved
performance on most natural language processing
(NLP) tasks, while facilitating a large increase in
model sizes.
Since large models require a significant amount
of computation and memory during training and
inference, there is a growing demand for com-
pressing such models while retaining the most
relevant information. While recent attempts have
shown promising results (Sanh et al., 2019), they
have some limitations. Specifically, they attempt
to mimic the behavior of the larger models without
trying to understand the information preserved or
lost in the compression process.
In compressing the information represented in
billions of parameters, we identify three main
challenges. First, current methods for model
compression are not interpretable. While the im-
portance of different model parameters is certainly
not uniform, it is hard to know a priori which
of the model components should be discarded
in the compression process. This notion of fea-
ture importance has not yet trickled down into
compression methods, and they often attempt to
solve a dimensionality reduction problem where
a smaller model aims to mimic the predictions of
the larger model. Nevertheless, not all parameters
are born equal, and only a subset of the informa-
tion captured in the network is actually useful for
generalization (Frankle and Carbin, 2018).
The second challenge we observe in model
compression is out-of-distribution generalization.
Typically, compressed models are tested for their
in-domain generalization. However, in reality the
distribution of examples often varies and differs
from that seen during training. Without testing
for the generalization of the compressed models
on different test-set distributions, it is hard to fully
assess what was lost in the compression process.
The setting explored in domain adaptation pro-
vides us with a platform to test the ability of the
compressed models to generalize across domains,
where some information that the model has learned
to rely on might not exist. Strong model perfor-
mance across domains provides a stronger signal
on retaining valuable information.
Lastly, another challenge we identify in training
and selecting compressed models is confidence es-
timation. In trying to understand what gives large
models the advantage over their smaller competi-
tors, recent probing efforts have discovered that
commonly used models such as BERT (Devlin
et al., 2019) learn to capture semantic and syntactic
information in different layers and neurons
across the network (Rogers et al., 2021). While
some features might be crucial for the model, oth-
ers could learn spurious correlations that are only
present in the training set and are absent in the
test set (Kaushik et al., 2019). Such cases have
led to some intuitive common practices such as
keeping only layers with the same parity or the top
or bottom layers (Fan et al., 2019; Sajjad et al.,
2020). Those practices can be good on average,
but do not provide model confidence scores or
success rate estimates on unseen data.
Our approach addresses each of the three main
challenges we identify, as it allows estimating
the marginal effect of each model component, is
designed and tested for out-of-distribution generalization,
and provides estimates for each compressed
model’s performance on an unlabeled target
domain. We dive here into the connection be-
tween model compression and out-of-distribution
generalization, and ask whether compression
schemes should consider the effect of individual
model components on the resulting compressed
Modell. Insbesondere, we present a method that
attempts to compress a model while maintain-
ing components that can generalize well across
domains.
Inspired by causal inference (Pearl, 1995), our
compression scheme is based on estimating the
average effect of model components on the decisions
the model makes, in both the source and
target domains. In causal inference, we measure
the effect of interventions by comparing the differ-
ence in outcome between the control and treatment
groups. In our setting, we take advantage of the
fact that we have access to unlabeled target ex-
amples, and treat the model’s predictions as our
outcome variable. We then try to estimate the
effect of a subset of the model components, such
as one or more layers, on the model’s output.
To do that, we propose an approximation of a
counterfactual model where a model component
of choice is removed. We train an instance of the
model without that component and keep every-
thing else equal apart from the input and output to
that component, which allows us to perform only a
small number of gradient steps. Using this approx-
imation, we then estimate the average treatment
effect (ATE) by comparing the predictions of the
base model to those of its counterfactual instance.
Since our compressed models are very effi-
ciently trained, we can generate a large number of
such models for each source-target domain pair.
We then train a regression model on our training
domain pairs in order to predict how well a com-
pressed model would generalize from a source to
a target domain, using the ATE as well as other
variables. This regression model can then be ap-
plied to new source-target domain pairs in order
to select the compressed model that best supports
cross-domain generalization.
To organize our contributions, we formulate
three research questions:
1. Can we produce a compressed model that out-
performs all baselines in out-of-distribution
generalization?
2. Does the model component we decide to
remove indeed hurt performance the least?
3. Can we use the average treatment effect to
guide our model selection process?
In § 6 we directly address each of the three
research questions, and demonstrate the usefulness
of our method, ATE-guided model compression
(AMoC), to improve model generalization.
2 Previous Work
Previous work on the intersection of neural model
compression, domain adaptation, and causal in-
ference is limited, as our application of causal
inference to model compression and our discus-
sion of the connection between compression and
cross-domain generalization are novel. However,
there is an abundance of work in each field on
its own, and on the connection between domain
adaptation and causal inference. Since our goal
is to explore the connection between compression
and out-of-distribution generalization, as framed
in the setting of domain adaptation, we survey the
literature on model compression and the connec-
tion between generalization, causality, and domain
adaptation.
2.1 Model Compression
NLP models have increased exponentially in
size, growing from less than a million parameters
a few years ago to hundreds of billions. Since the
introduction of the Transformer architecture, this
trend has strengthened, with some models
reaching more than 175 billion parameters (Brown
et al., 2020). As a result, there has been a growing
interest in compressing the information captured
in Transformers into smaller models (Chen et al.,
2020; Ganesh et al., 2020; Sun et al., 2020).
Usually, such smaller models are trained us-
ing the base model as a teacher, with the smaller
student model learning to predict its output prob-
abilities (Hinton et al., 2015; Jiao et al., 2020;
Sanh et al., 2019). However, even if the student
closely matches the teacher’s soft labels, their
internal representations may be considerably dif-
ferent. This internal mismatch can undermine the
generalization capabilities originally intended to
be transferred from the teacher to the student
(Aguilar et al., 2020; Mirzadeh et al., 2020).
As an alternative, we try not to interfere with or alter
the learned representation of the model. Compres-
sion schemes such as those presented in Sanh et al.
(2019) discard model components randomly. In-
stead, we choose to focus on understanding which
components of the model capture the information
that is most useful for it to perform well across
domains, and hence should not be discarded.
2.2 Domain Adaptation and Causality
Domain adaptation is a longstanding challenge
in machine learning (ML) and NLP, which deals
with cases where the train and test sets are drawn
from different distributions. A great effort has
been dedicated to exploiting labels from both source
and target domains for that purpose (Daumé III
et al., 2010; Sato et al., 2017; Cui et al., 2018;
Lin and Lu, 2018; Wang et al., 2018). However,
a much more challenging and realistic scenario,
also termed unsupervised domain adaptation, oc-
curs when no labeled target samples exist (Blitzer
et al., 2006; Ganin et al., 2016; Ziser and Reichart,
2017, 2018a,b, 2019; Rotman and Reichart, 2019;
Ben-David et al., 2020). In this setting, we have
access to labeled and unlabeled data from the
source domain and to unlabeled data from the
target, and models are tested by their performance
on unseen examples from the target domain.
A closely related task is domain adaptation
success prediction. This task explores the pos-
sibility of predicting the expected performance
degradation between source and target domains
(McClosky et al., 2010; Elsahar and Gallé,
2019). Similar to predicting performance in a
given NLP task, methods for predicting domain
adaptation success often rely on in-domain per-
formance and distance metrics estimating the
difference between the source and target dis-
tributions (Reichart and Rappoport, 2007; Ravi
et al., 2008; Louis and Nenkova, 2009; Van Asch
and Daelemans, 2010; Xia et al., 2020). While
these efforts have demonstrated the importance of
out-of-domain performance prediction, they have
not, as far as we know, been made in relation to
model compression.
As the fundamental purpose of domain adaptation
algorithms is improving the out-of-distribution
generalization of learning models, it
is often linked with causal inference (Johansson
et al., 2016). In causal inference we typically
care about estimating the effect that an inter-
vention on a variable of interest would have on
an outcome (Pearl, 2009). Recently, using causal
methods to improve the out-of-distribution per-
formance of trained classifiers is gaining traction
(Rojas-Carulla et al., 2018; Wald et al., 2021).
Indeed, recent papers applied a causal approach
to domain adaptation. Some researchers proposed
using causal graphs to predict under distribution
shifts (Schölkopf et al., 2012) and to understand
the type of shift (Zhang et al., 2013). Adapting
these ideas to computer vision, Gong et al. (2016)
were one of the first to propose a causal graph
describing the generative process of an image
as being generated by a ‘‘domain’’. The causal
graph served for learning invariant components
that transfer across domains. Since then, the notion
of invariant prediction has emerged as an
important operational concept in causal inference
(Peters et al., 2017). This idea has been used to
learn classifiers that are robust to domain shifts
and can perform well on unseen target distribu-
tions (Gong et al., 2016; Magliacane et al., 2018;
Rojas-Carulla et al., 2018; Greenfeld and Shalit,
2020).
Here we borrow ideas from causality to help
us reason about the importance of specific model
components, such as individual layers. That is,
we estimate the effect of a given model component
(denoted as the treatment) on the model’s
predictions in the unlabeled target domain, and
use the estimated effect as an evaluation of the
importance of this component. Our treatment effect
estimation method is inspired by previous
causal model explanation work (Goyal et al., 2019;
Feder et al., 2021), although our algorithm is very
different.
3 Causal Terminology
Causal methodology is most commonly used in
cases where the goal is estimating effects on
real-world outcomes, but it can be adapted to
help us understand and explain what affects NLP
models (Feder et al., 2021). Specifically, we can
think of intervening on a model and altering its
components as a causal question, and measure the
effect of this intervention on model predictions.
A core benefit of this approach is that we can
estimate treatment effects on a model's predictions
without the need for manually-labeled target data.
Borrowing causal methodology into our setting,
we treat model components as our treatment, and
try to estimate the effect of removing them on our
model’s predictions. The predictions of a model
are driven by its components, and by changing
one component and holding everything else equal,
we can estimate the effect of this intervention. We
can use this estimation in deciding which model
component should be kept in the compression
Verfahren.
As the link between model compression and
causal inference was not explored previously, we
provide here a short introduction to causal infer-
ence and its basic terminology, focusing on its
application to our use case. We then discuss the
connection to Pearl’s do-operator (Pearl et al.,
2009) and the estimation of treatment effects.
Imagine we have a model m that classifies
examples into one of L classes. Given a set C
of K model components, which we hypothesize
might affect the model’s decision, we denote the
set of binary variables I_c = {I_{c_j} ∈ {0, 1} | j ∈ {1, . . . , K}},
where each corresponds to the inclusion of the
corresponding component in the model, that is,
if I_{c_j} = 1 then the j-th component (c_j) is in the
model. Our goal is to assess how the model’s predictions
are affected by the components in C. As
we are interested in the effect on the class probability
assigned by m, we measure this probability
for an example x, and denote it for a class l as
z(m(x))_l and for all L classes as \vec{z}(m(x)).
Using this setup, we can now define the ATE,
the common metric used when estimating causal
effects. The ATE is the difference in mean outcomes
between the treatment and control groups, and
using do-calculus (Pearl, 1995) we can define it
as follows:
Definition 1 (Average Treatment Effect (ATE))
The average treatment effect of a binary treatment
I_{c_j} on the outcome \vec{z}(m(x)) is:

ATE(c_j) = \mathbb{E}\left[\vec{z}(m(x)) \mid do(I_{c_j} = 1)\right] - \mathbb{E}\left[\vec{z}(m(x)) \mid do(I_{c_j} = 0)\right],    (1)

where the do-operator is a mathematical operator
introduced by Pearl (1995), which indicates
that we intervene on c_j such that it is included
(do(I_{c_j} = 1)) or not (do(I_{c_j} = 0)) in the model.
While the setup usually explored with do-
calculus involves a fixed joint-distribution where
treatments are assigned to individuals (or exam-
ples), we borrow intuition from a specialized case
where interventions are made on the process which
generates outcomes given examples. This type of
an intervention is called Process Control, Und
was proposed by Pearl et al. (2009) and further
explored by Bottou et al. (2013). This unique
setup is designed to improve our understanding of
the behavior of complex learning systems and
predict
the consequences of changes made to
the system. Kürzlich, Feder et al. (2021) gebraucht
it to intervene on language representation models,
generating a counterfactual representation model
through an adversarial training algorithm which
biases the representation model to forget infor-
mation about treatment concepts and maintain
information about control concepts.
In our approach we intervene on the j-th com-
ponent, by holding the rest of the model fixed and
training only the parameters that control the input
and output to that component. This is crucial for
our estimation procedure as we want to know the
effect of the j-th component on a specific model
instance. This effect can be computed by compar-
ing the predictions of the original model instance
to those of the intervened model (see below).
This computation is fundamentally different from
measuring the conditional probability where the
j-th component is not in the model by estimating
E[\vec{z}(m(x)) \mid I_{c_j} = 0].
4 Methodology
We start by describing the task of compressing
models such that they perform well on
out-of-distribution examples, detailing the domain
adaptation framework we focus on. Then, we describe
our compression scheme, which is designed to allow
us to approximate the ATE and is responsible for
producing compressed model candidates. Finally,
we propose a regression model that uses the ATE
and other features to predict a candidate model’s
performance on a target domain. This regression
allows us to select a strong candidate model.
4.1 Task Definition and Framework
To test the ability of a compressed model to
generalize on out-of-distribution examples, we
choose to focus on a domain adaptation setting. An
appealing property of domain adaptation setups is
that they allow us to measure out-of-distribution
performance in a very natural way by training on
one domain and testing on another.
In our setup, during training, we have access
to n source-target domain pairs (S_i, T_i)_{i=1}^n. For
each pair we assume to have labeled data from the
source domains (L_{S_i})_{i=1}^n and unlabeled data from
the source and target domains (U_{S_i}, U_{T_i})_{i=1}^n.
We also assume to have held-out labeled data
for all domains, for measuring test performance
(H_{S_i}, H_{T_i})_{i=1}^n. At test time we are given an unseen
domain pair (S_{n+1}, T_{n+1}) with labeled source data
L_{S_{n+1}} and unlabeled data from both domains, U_{S_{n+1}}
and U_{T_{n+1}}, respectively. Our goal is to classify
examples from the unseen target domain T_{n+1} using
a compressed model m^{n+1} trained on the new
source domain.
For each domain pair in (S_i, T_i)_{i=1}^n, we generate
a set of K candidate models M^i = {m^i_1, . . . , m^i_K},
differing by the model components
that were removed from the base model
m^i_B. For each candidate, we compute the ATE and
other relevant features, which we discuss in § 4.3.
Then, using the training domain pairs, for which
we have access to a limited amount of labeled
target data, we train a stepwise linear regression
to predict the performance of all candidate models
in {M^i}_{i=1}^n on their target domain. Finally, at test
time, after computing the regression features on
the unseen source-target pair, we use the trained
regression model to select the compressed model
(m^{n+1})* ∈ M^{n+1} that is expected to perform best
on the unseen unlabeled target domain.
While this task definition relies on a limited
number of labeled examples from some target do-
mains at training time, at test time we only use
labeled examples from the source domain and un-
labeled examples from the target. We elaborate
on our compression scheme, responsible for gen-
erating the compressed model candidates in § 4.2.
We then describe the regression features and the
regression model in § 4.3 and § 4.4, respectively.
4.2 Compression Scheme
Our compression scheme (AMoC) assumes to
operate on a large classifier, consisting of an
encoder-decoder architecture, that serves as the
base model being compressed. In such models,
the encoder is the language representation model
(e.g., BERT), and the decoder is the task classifier.
Each input sentence x to the base model m^i_B is
encoded by the encoder e. Then, the encoded
sentence e(x) is passed through the decoder d to
compute a distribution over the label space
L: \vec{z}(m^i_B(x)) = Softmax(d(e(x))). AMoC is
designed to remove a set of encoder components,
and can in principle be used with any language
encoder.
As described in Algorithm 1, AMoC generates
candidate compressed versions of m^i_B. In each iteration
it selects from C, the set containing subsets
of encoder components, a candidate c_k ∈ C to be
removed.2 The goal of this process is to generate
many compressed model candidates, such that the
k-th candidate m^i_k differs from the base model m^i_B
only by the effect of the parameters in c_k on the
model’s predictions. After generating these candidates,
AMoC tries to choose the best performing
model for the unseen target domain.
When generating the k-th compressed model of
the i-th source-target pair, we start by removing
all parameters in c_k from the computational graph
of m^i_B. Then, we connect the predecessor of each
detached component from c_k to its successor in
the graph, which yields the new m^i_k (see Figure 1).
To estimate the effect of ck on the predictions of
2Zum Beispiel, if components correspond to layers, and we
wish to remove an individual layer from a 12-layer encoder,
then C = {{ich}|i ∈ {1, . . . , 12}}.
Algorithm 1 ATE-Guided Model Compression (AMoC)
Input: Domain pairs (S_i, T_i)_{i=1}^{n+1} with labeled source data (L_{S_i})_{i=1}^{n+1}, unlabeled source and target data (U_{S_i}, U_{T_i})_{i=1}^{n+1}, labeled held-out source and target data (H_{S_i}, H_{T_i})_{i=1}^{n}, and a set C of subsets of encoder components to be removed.
Algorithm:
1. For each domain pair in (S_i, T_i)_{i=1}^{n}:
   (a) Train the base model m^i_B on L_{S_i}.
   (b) For c_k ∈ C:
       - Freeze all encoder parameters.
       - Remove every component in c_k from m^i_B.
       - Connect and unfreeze the remaining components according to § 4.2.
       - Fine-tune the new model m^i_k on L_{S_i} for one or more epochs.
       - Compute ÂTE_{S_i}(c_k) and ÂTE_{T_i}(c_k) according to Eq. 2, using U_{S_i} and U_{T_i}.
       - Compute the remaining features in § 4.3.
2. Train the stepwise regression according to Eq. 4, using all compressed models generated in step 1.
3. Repeat steps 1(a)-1(b) for (S_{n+1}, T_{n+1}) and choose (m^{n+1})* with the highest expected performance according to the regression model.
m^i_B, we freeze all remaining model parameters in
m^i_k and fine-tune it for one or more epochs, training
only the decoder and the parameters of the
new connections between the predecessors and
successors of the removed components. An advantage
of this procedure is that we can efficiently
generate many model candidates. Figure 1 demonstrates
this process on a simple architecture when
considering the removal of layer components.
Guiding our model selection step is the ATE
of c_k on the base model m^i_B. The generation of
each compressed candidate m^i_k is designed to allow
us to estimate the effect of c_k on the model’s
predictions. In comparing the predictions of m^i_B
to those of the compressed model m^i_k on many examples,
we try to mimic the process of generating
control and treatment groups. As is done in controlled
experiments, we compare examples that
are given a treatment, namely, encoded by the
compressed model m^i_k, and examples that were
encoded by the base model m^i_B. Intervening on
the example-generating process was explored previously
in the causality literature by Bottou et al.
(2013) and Feder et al. (2021).
Alongside the ATE, we compute other fea-
tures that might be predictive of a compressed
model’s performance on an unlabeled target do-
main, which we discuss in detail in § 4.3. Using
those features and the ATE, we train a linear step-
wise regression to predict a compressed model’s
performance on target domains (§ 4.4). Finally,
at test time AMoC is given an unseen domain
pair and applies the regression in order to choose
the compressed source model expected to perform
best on the target domain. Using the regression,
we can estimate the power of the ATE in predict-
ing model performance and answer Question 3
of § 1.
In diesem Papier, we choose to focus on the removal
of sets of layers, as done in previous work (Fan
et al., 2019; Sanh et al., 2019; Sajjad et al., 2020).
While our method can support any other parameter
partitioning, such as clusters of neurons, we leave
this for future work. In the case of layers, to estab-
lish the new compressed model we simply connect
the remaining layers according to their hierarchy.
For example, for a base model with a 12-layer
encoder and c = {2, 3, 7}, the unconnected components
are {1}, {4, 5, 6}, and {8, 9, 10, 11, 12}.
Layer 1 will then be connected to layer 4, and
layer 6 to layer 8. The compressed model will then
be trained for one or more epochs where only
the decoder and layers 1 and 6 (using the original
indices) are fine-tuned. In cases where layer 1 is
removed, the embedding layer is connected to the
first unremoved layer and is fine-tuned.
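Since the paper compresses a 12-layer BERT encoder by dropping layer sets and reconnecting the remaining layers, the following is a minimal sketch of that surgery, assuming a HuggingFace BertModel; the helper names are ours and the released AMoC code may organize this step differently. The decoder parameters (not shown) would also stay trainable during the subsequent fine-tuning.

```python
# A minimal sketch of the layer-removal and freezing step described above,
# assuming a HuggingFace BertModel encoder (not the authors' exact code).
import torch
from transformers import BertModel

def remove_layers(encoder: BertModel, removed: set) -> BertModel:
    """Drop the 1-indexed layers in `removed` and reconnect the rest in order."""
    kept = [layer for i, layer in enumerate(encoder.encoder.layer, start=1)
            if i not in removed]
    encoder.encoder.layer = torch.nn.ModuleList(kept)
    encoder.config.num_hidden_layers = len(kept)
    return encoder

def freeze_all_but_new_connections(encoder: BertModel, removed: set, num_layers: int = 12):
    """Freeze every encoder parameter, then unfreeze the predecessor of each
    removed block (or the embeddings, if layer 1 was removed)."""
    for p in encoder.parameters():
        p.requires_grad = False
    kept = [i for i in range(1, num_layers + 1) if i not in removed]
    predecessors = set()
    for i in sorted(removed):
        preds = [j for j in kept if j < i]
        if preds:
            predecessors.add(max(preds))
        else:
            # Layer 1 was removed: the embedding layer now feeds the first kept layer.
            for p in encoder.embeddings.parameters():
                p.requires_grad = True
    # Map original layer indices to positions in the truncated ModuleList.
    for pos, orig in enumerate(kept):
        if orig in predecessors:
            for p in encoder.encoder.layer[pos].parameters():
                p.requires_grad = True

encoder = BertModel.from_pretrained("bert-base-uncased")
encoder = remove_layers(encoder, removed={2, 3, 7})
freeze_all_but_new_connections(encoder, removed={2, 3, 7})
# For c = {2, 3, 7}, this leaves exactly layers 1 and 6 (original indices) trainable,
# matching the worked example in Section 4.2.
```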
4.3 Regression Features
Apart from the ATE, which estimates the impact
of the intervention on the base model, we naturally
need to consider other features. Indeed, without
any information on the target domain, predicting
that a model will perform the same as in the source
domain could be a reasonable first-order approx-
imation (McClosky et al., 2010). Also, adding
information on the distance between the source
and target distributions (Van Asch and Daelemans,
2010) or on the type of components that were re-
moved (such as the number of layers) might also
be useful for predicting the model’s success. Wir
present here all the features we consider, Und
discuss their usefulness in predicting model per-
Form. To answer Q3, we need to show that
given all this information, the ATE is still predictive
of the model’s performance in the target
domain.
ATE Our main variable of interest is the av-
erage treatment effect of the components in ck
on the predictions of the model. In our compres-
sion scheme, we estimate for a specific domain
Figur 1: An example of our method with a 3-layer encoder when considering the removal of layer components.
(A) At first, the base model is trained (Alg. 1, step 1(A)). (B) The second encoder layer is removed from the base
Modell, and the first layer is connected to the final encoder layer. The compressed model is then fine-tuned for one
or more epochs, where only the parameters of the first layer and the decoder are updated (Alg. 1, step 1(B)). Wir
mark frozen layers and non-frozen layers with snowflakes and fire symbols, jeweils.
d ∈ {S_i, T_i} the ATE for each compressed model
m^i_k by comparing it to the base model m^i_B:

\widehat{ATE}_d(c_k) = \frac{1}{|U_d|} \sum_{x \in U_d} \left\langle \vec{z}\left(m^i_B(x)\right) - \vec{z}\left(m^i_k(x)\right) \right\rangle    (2)

where the operator ⟨·⟩ denotes the total variation
distance: a summation over the absolute values
of the vector coordinates.3 As we are interested in the
effect on the probability assigned to each class by
the classifier m^i_k, we measure the class probability
of its output for an example x, as proposed by
Feder et al. (2021).4
In our regression model we choose to include
the ATE of the source and the target domains,
ÂTE_{S_i}(c_k) (estimated on U_{S_i}) and ÂTE_{T_i}(c_k)
(estimated on U_{T_i}), respectively. We note that
in computing the ATE we only require the pre-
dictions of the models, and do not need labeled
Daten.
3 For a three-class prediction and a single example, where the probability distributions for the base and the compressed models are (0.7, 0.2, 0.1) and (0.5, 0.1, 0.4), respectively, ÂTE_i(c_k) = |0.7 − 0.5| + |0.2 − 0.1| + |0.1 − 0.4| = 0.6.
4 For sequence tagging tasks, we first compute sentence-level ATEs by averaging the word-level probability differences, and then average those ATEs to get the final ATE.
In-domain Performance A common metric for
selecting a classification model is its performance
on a held-out set. Indeed, in cases where we do
not have access to any information from the target
domain, the naive choice is the best performing
model on a held-out source domain set (Elsahar
and Gallé, 2019). Thus, for every c_k ∈ C we
compute the performance of m^i_k on H_{S_i}.
Domain Classification An important variable
when predicting model performance on an unseen
test domain is the distance between its training
domain and that test domain (Elsahar and Gallé,
2019). While there are many ways to approximate
this distance, we choose to do so by training a
domain classifier on U_{S_i} and U_{T_i}, classifying
each example according to its domain. We then
compute the average probability assigned to the
target examples to belong to the source domain,
according to the domain classifier:

\widehat{P}(S_i \mid T_i) = \frac{1}{|H_{T_i}|} \sum_{x \in H_{T_i}} P(S_i \mid x),    (3)

where P(S_i | x) denotes, for an unlabeled target
example x, the probability that it belongs to the
source domain S_i, based on the domain classifier.
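A small sketch of this distance feature is given below. For brevity it trains a logistic regression over precomputed sentence embeddings as the domain classifier, whereas the paper's domain classifier shares the task classifier's architecture and the BERT encoder; the function and variable names are ours.

```python
# A sketch of the distance feature in Eq. 3, using a simple domain classifier
# over precomputed embeddings (an assumption for brevity, not the paper's setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

def domain_distance_feature(emb_source_unlabeled, emb_target_unlabeled, emb_target_heldout):
    X = np.vstack([emb_source_unlabeled, emb_target_unlabeled])
    y = np.concatenate([np.ones(len(emb_source_unlabeled)),    # 1 = source
                        np.zeros(len(emb_target_unlabeled))])  # 0 = target
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # Average probability that a held-out target example looks like source data.
    p_source = clf.predict_proba(emb_target_heldout)[:, 1]
    return float(p_source.mean())
```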
Compression-size Effects We include in our
regression binary variables indicating the number
of layers that were removed. Naturally, we assume
that the larger the number of layers removed, the
bigger the gap from the base model should be.
4.4 Regression Analysis
In order to decide which ck should be removed
from the base model, we follow the process de-
scribed in Algorithm 1 for all c ∈ C and end up
with many candidate compressed models, differ-
ing by the model components that were removed.
As our goal is to choose a candidate model to be
used in an unseen target domain, we train a stan-
dard linear stepwise regression model (Hocking,
1976; Draper and Smith, 1998; Dubossarsky et al.,
2020) to predict the candidate’s performance on
the seen target domains:
Y = β_0 + β_1 X_1 + · · · + β_m X_m + ε,    (4)

where Y is performance on these target domains,
computed using their held-out sets (H_{T_i})_{i=1}^n, and
X_1, · · · , X_m are the variables described in
§ 4.3, including the ATE. In stepwise regression
variables are added to the model incrementally
only if their marginal addition for predicting Y is
statistically significant (P < 0.01). This method
is useful for finding variables with maximal and
unique contribution to the explanation of Y . The
value of this regression is two-fold in our case
as it allows us to: (1) get a predictive model
that can choose a high quality compressed model
candidate, and (2) estimate the predictive power
of the ATE on model performance in the target
domain.
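A sketch of this forward stepwise selection with the p < 0.01 entry criterion, using statsmodels OLS, is shown below; the feature column names in the usage comment are illustrative only and are not the exact names used in the paper.

```python
# A sketch of the forward stepwise regression of Eq. 4: features are added one
# at a time only if their marginal p-value is below the entry threshold.
import pandas as pd
import statsmodels.api as sm

def stepwise_select(X: pd.DataFrame, y: pd.Series, p_enter: float = 0.01):
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for cand in remaining:
            design = sm.add_constant(X[selected + [cand]])
            pvals[cand] = sm.OLS(y, design).fit().pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_enter:
            break
        selected.append(best)
        remaining.remove(best)
    final = sm.OLS(y, sm.add_constant(X[selected])).fit()
    return selected, final

# Illustrative usage (column names are ours, not the paper's):
# selected, model = stepwise_select(
#     df[["f1_source", "ate_source", "ate_target",
#         "p_source_given_target", "removed_6", "removed_8"]],
#     df["f1_target"])
# model.predict(...) can then rank compressed candidates for an unseen domain pair.
```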
5 Experiments
5.1 Data
We consider three challenging data sets (tasks):
(1) The Amazon product reviews data set for sen-
timent classification (He and McAuley, 2016).5
This data set consists of product reviews and
metadata, from which we choose 6 distinct do-
mains: Amazon Instant Video (AIV), Beauty (B),
Digital Music (DM), Musical Instruments (MI),
Sports and Outdoors (SAO), and Video Games
(VG). All reviews are annotated with an integer
score between 0 and 5. We label reviews with a score greater than 3 as
positive and those below 3 as negative. Ambiguous
reviews (rating = 3) are discarded. Since the data
set does not contain development and test sets, we
randomly split each domain into training (64%),
development (16%), and test (20%) sets.
(2) The Multi-Genre Natural Language In-
ference (MultiNLI) corpus for natural language
inference classification (Williams et al., 2018).6
This corpus consists of pairs of sentences, a
premise and a hypothesis, where the premise
either entails the hypothesis, is neutral to it, or contradicts
it. The MultiNLI data set extends upon the
SNLI corpus (Bowman et al., 2015), assembled
from image captions, to 10 additional domains:
5 matched domains, containing training, devel-
opment and test samples and 5 mismatched,
containing only development and test samples.
We experiment with the original SNLI corpus
(Captions domain) as well as the matched version
of MultiNLI, containing the Fiction, Government,
Slate, Telephone and Travel domains, for a total
of 6 domains.
(3) The OntoNotes 5.0 data set (Hovy et al.,
2006), consisting of sentences annotated with
named entities, part-of-speech tags and parse
trees.7 We focus on the Named Entity Recogni-
tion (NER) task with 6 different English domains:
Broadcast Conversation (BC), Broadcast News
(BN), Magazine (MZ), Newswire (NW), Tele-
phone Conversation (TC), and Web data (WB).
This setup allows us to evaluate the quality of
AMoC on a sequence tagging task.
The statistics of our experimental setups are
reported in Table 1. Since the test sets of the
MultiNLI domains are not publicly available, we
treat the original development sets as our test
sets, and randomly choose 2,000 examples from
the training set of each domain to serve as the
development sets. We use the original splits of
the SNLI as they are all publicly available. Since
our data sets manifest class imbalance phenomena
we use the macro average F1 as our evaluation
measure.
For the regression step of Algorithm 1, we
use the development set of each target domain
to compute the model’s macro F1 score (for the
Y and the in-domain performance variables). We
compute the ATE variables on the development
sets of both domains, train the domain classifier
on unlabeled versions of the training sets, and
compute P̂(S|T) on the target development set.
5http://jmcauley.ucsd.edu/data/amazon/.
6https://cims.nyu.edu/∼sbowman/multinli/.
7https://catalog.ldc.upenn.edu/LDC2013T19.
Amazon Reviews            Train    Dev     Test
Amazon Instant Video        21K    5.2K    6.5K
Beauty                     112K     28K     35K
Digital Music               37K    9.2K     11K
Musical Instruments          6K    1.5K    1.9K
Sports and Outdoors        174K     43K     54K
Video Games                130K     32K     40K

MultiNLI                  Train    Dev     Test
Captions                   550K     10K     10K
Fiction                     75K      2K      2K
Government                  75K      2K      2K
Slate                       75K      2K      2K
Telephone                   81K      2K      2K
Travel                      75K      2K      2K

OntoNotes                 Train    Dev     Test
Broadcast Conversation     173K     30K     36K
Broadcast News             207K     25K     26K
Magazine                   161K     15K     17K
News                       878K    148K     60K
Telephone Conversation      92K     11K     11K
Web                        361K     48K     50K

Table 1: Data statistics. We report the number of
sentences for Amazon Reviews and MultiNLI,
and the number of tokens for OntoNotes.
5.2 Model and Baselines
Model The encoder being compressed is the
BERT-base model (Devlin et al., 2019). BERT
is a 12-layer Transformer model (Vaswani et al.,
2017; Radford et al., 2018), representing textual
inputs contextually and sequentially. Our
decoder consists of a layer attention mechanism
(Kondratyuk and Straka, 2019), which computes
a parameterized weighted average over the layers'
outputs, followed by a 1D convolution with
a max-pooling operation and a final Softmax
layer. Figure 1(a) presents a simplified version
of the architecture of this model with 3 encoder
layers.
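A hedged PyTorch sketch of such a decoder is given below: a learned softmax weighting over the encoder's per-layer outputs, a 1D convolution with max pooling over the sequence, and a softmax classifier. The window size (9), number of channels (16), and dropout (0.1) follow the hyperparameters reported in § 5.4; all other details (dimensions, default label count) are assumptions rather than the authors' exact implementation, and the sequence-tagging variant would differ.

```python
# A sketch of the decoder described above: layer attention, 1D convolution,
# max pooling over the sequence, and a softmax output layer.
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, num_layers: int, hidden: int = 768, channels: int = 16,
                 window: int = 9, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.dropout = nn.Dropout(dropout)
        self.conv = nn.Conv1d(hidden, channels, kernel_size=window, padding=window // 2)
        self.out = nn.Linear(channels, num_labels)

    def forward(self, layer_states):
        # layer_states: (num_layers, batch, seq_len, hidden), e.g. stacked BERT layer outputs.
        weights = torch.softmax(self.layer_weights, dim=0)
        mixed = (weights[:, None, None, None] * layer_states).sum(dim=0)
        mixed = self.dropout(mixed)                      # (batch, seq_len, hidden)
        conv = self.conv(mixed.transpose(1, 2))          # (batch, channels, seq_len)
        pooled = conv.max(dim=2).values                  # max pooling over the sequence
        return torch.softmax(self.out(pooled), dim=-1)   # class probabilities
```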
Baselines To put our results in context of pre-
vious model compression work, we compare our
models to three strong baselines. Like AMoC,
the baselines generate reduced-size encoders.
These encoders are augmented with the same
decoder as in our model to yield the baseline
architectures.
The first baseline is DistilBERT (DB) (Sanh
et al., 2019): A 6-layer compressed version of
BERT-base,
trained on the masked language
modelling task with the goal of mimicking the
predictions of the larger model. We used its
default setting,
i.e., removal of 6 layers with
c = {2, 4, 6, 7, 9, 11}. Sanh et al. (2019) demon-
strated that DistilBERT achieves comparable
results to the large model with only half of its
layers.
Since DistilBERT was not designed or tested
on out-of-distribution data, we create an addi-
tional version, denoted as DB + DA. In this
version, the training process is performed on the
masked language modelling task using an unla-
beled version of the training data from both the
source and the target domains, with its original
hyperparameters.
We also add an adaptation-aware
baseline: DB + GR, the DistilBERT model
equipped with the gradient reversal (GR) layer
(Ganin and Lempitsky, 2015). Particularly, we
augment the DistilBERT model with a domain
classifier, similar in structure to the task classifier,
which aims to distinguish between the unlabeled
source and the unlabeled target examples. By re-
versing the gradients resulting from the objective
function of this classifier, the encoder is biased to
produce domain-invariant representations. We set
the weights of the main task loss and the domain
classification loss to 1 and 0.1, respectively.
Another baseline is LayerDrop (LD), a pro-
cedure that applies layer dropout during training,
making the model robust to the removal of certain
layers during inference (Fan et al., 2019). During
training, we apply a fixed dropout rate of 0.5 for all
layers. At inference, we apply their Every Other
strategy by removing all even layers to obtain a
reduced 6-layer model.
Finally, we compare AMoC to ALBERT, a
recently proposed BERT-based variant designed
to mimic the performance of the larger BERT
model with only a tenth of its parameters (11M
parameters compared to BERT’s 110M parame-
ters) (Lan et al., 2020). ALBERT is trained with
cross-layer parameter sharing and sentence order-
ing objectives, leading to better model efficiency.
Unlike other baselines explored here, it is not
directly comparable since it consists of 12 layers
and was pre-trained on substantially more data. As
such, we do not include it in the main results ta-
ble (Table 2), and instead discuss its performance
compared to AMoC in Section 6.
Table 2: Domain adaptation results in terms of macro F1 scores on Amazon Reviews
(top), MultiNLI (middle), and OntoNotes (bottom) with 6 removed layers. S and T denote Source and Target, respectively.
Each target-domain panel reports, for every source domain and their average (AVG), the scores of Base, AMoC, DB,
DB+DA, DB+GR, and LD. The best result among the compressed models (all models except Base) is highlighted in bold. We
mark results that outperform the uncompressed Base model with an underscore.
5.3 Compression Scheme Experiments
While our compression algorithm is neither
restricted to a specific deep neural network ar-
chitecture nor to the removal of certain model
components, we follow previous work and focus
on the removal of layer sets (Fan et al., 2019; Sanh
et al., 2019; Sajjad et al., 2020). With the goal of
addressing our research questions posed in § 1,
we perform extensive compression experiments
on the 12-layer BERT by considering the removal
of 4, 6, and 8 layers. For each number of layers
removed, we randomly sample 100 layer sets to
generate our model candidates. To be able to test
our method on all domain pairs, we randomly split
these pairs into five 20% domain pair sets and train
five regression models, differing in the set used
for testing. Our splits respect the restriction that
no test set domain (source or target) appears in the
training set.
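For concreteness, a small sketch of this candidate-sampling step is shown below, assuming a 12-layer encoder; the function name and seed are ours.

```python
# A small sketch of the candidate-generation step in Section 5.3: randomly
# sample 100 distinct layer sets of a given size from a 12-layer encoder.
import random

def sample_layer_sets(num_layers: int = 12, set_size: int = 6,
                      num_candidates: int = 100, seed: int = 0):
    rng = random.Random(seed)
    sampled = set()
    while len(sampled) < num_candidates:
        sampled.add(tuple(sorted(rng.sample(range(1, num_layers + 1), set_size))))
    return [set(s) for s in sampled]

candidates = sample_layer_sets(set_size=6)  # each candidate is a set c_k of layers to remove
```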
5.4 Hyperparameters
We implement all models using HuggingFace’s
Transformers package (Wolf et al., 2020).8 We
consider the following hyperparameters for the
uncompressed models: Training for 10 epochs
8https://github.com/huggingface/transformers.
(Amazon Reviews and MultiNLI) or 30 epochs
(OntoNotes) with an early stopping criterion ac-
cording to the development set, optimizing all
parameters using the ADAM optimizer (Kingma
and Ba, 2015) with a weight decay of 0.01 and a
learning rate of 1e-4, a batch size of 32, a window
size of 9, 16 output channels for the 1D convolu-
tion, and a dropout layer probability of 0.1 for the
layer attention module. The compressed models
are trained on the labeled source data for 1 epoch
(Amazon Reviews and MultiNLI) or 10 epochs
(OntoNotes).
The domain classifiers are identical in architecture
to our task classifiers and use the
uncompressed encoder after it was optimized
during the above task-based training. These classifiers
are trained on the unlabeled version of the
source and target training sets for 25 epochs with
early stopping, using the same hyperparameters as
above.
6 Results
Performance of Compressed Models Table 2
reports macro F1 scores for all domain pairs of
the Amazon Reviews, MultiNLI, and OntoNotes
data sets, when considering the removal of 6 lay-
ers, and Figure 2 provides summary statistics.
Clearly, AMoC outperforms all baselines in the
vast majority of setups (see, e.g., the lower graphs
of Figure 2). Moreover, its average target-domain
performance (across the 5 source domains) im-
proves over the second best model (DB + DA)
by up to 4.56%, 5.16%, and 1.63%, on Amazon
Reviews, MultiNLI, and OntoNotes, respectively
(lowest rows of each table in Table 2; see also
the average across setups in the upper graphs of
Figure 2). These results provide a positive answer
to Q1 of § 1, by indicating the superiority of
AMoC over strong alternatives.
DB+GR is overall the worst performing base-
line, followed by DB, with an average degradation
of 11.3% and 8.2% macro F1 score, respectively,
compared to the more successful cross-domain
oriented variant DB + DA. This implies that
out-of-the-box compressed models such as DB
struggle to generalize well to out-of-distribution
data. DB + DA also performs worse than AMoC
in a large portion of the experiments. These results
are even more appealing given that AMoC does
not perform any gradient step on the target data,
performing only a small number of gradient steps
Figure 2: Summary of domain adaptation results. Over-
all average score (top) and overall number of wins
(bottom) over all source-target domain pairs.
on the source data. In fact, AMoC only uses the
unlabeled target data for computing the regres-
sion features. Lastly, LD, another strong baseline
which was specifically designed to remove layers
from BERT, is surpassed by AMoC by as much as
6.76% F1, when averaging over all source-target
domain pairs.
Finally, we compare AMoC to ALBERT. We
find that on average ALBERT is outperformed
by AMoC by 8.8% F1 on Amazon Reviews,
and by 1.6% F1 on MultiNLI. On OntoNotes the
performance gap between ALBERT and AMoC
is an astounding 24.8% F1 in favor of AMoC,
which might be a result of ALBERT being an
uncased model, an important feature for NER
tasks.
Compressed Model Selection We next evalu-
ate how well the regression model and its variables
predict the performance of a candidate compressed
model on the target domain. Table 3 presents the
Adjusted R2, indicating the share of the variance
in the predicted outcome that the variables ex-
plain. Across all experiments and regardless of
the number of layers removed, our regression
model predicts well the performance on unseen
domain pairs, averaging an R2 of 0.881, 0.916,
and 0.826 on Amazon Reviews, MultiNLI, and
OntoNotes, respectively. This indicates that our
                     # of removed layers
Data set             4       6       8      Average
Amazon Reviews     0.844   0.898   0.902     0.881
MultiNLI           0.902   0.921   0.926     0.916
OntoNotes          0.827   0.830   0.821     0.826

Table 3: Adjusted R2 on the test set for each
type of compression (4, 6, or 8 layers) on each
data set.
regression properly estimates the performance of
candidate models.
Another support for this observation is that in
75% of the experiments the model selected by the
regression is among the top 10 performing com-
pressed candidates. In 55% of the experiments, it
is among the top 5 models. On average it performs
only 1% worse than the best performing com-
pressed model. Combined with the high adjusted
R2 across experiments, this suggests a positive
answer to Q2 of § 1.
Finally, as expected, we find that AMoC is
often outperformed by the full model. However,
the gap between the models is small, averaging
only 1.26%. Moreover, in almost 25% of all
experiments AMoC was able to surpass the full
model (underscored scores in Table 2).
Marginal Effects of Regression Variables
While the performance of the model on data drawn
from the same distribution may also be indicative
of its out-of-distribution performance, additional
information is likely to be needed in order to
make an exact prediction. Here, we supplement
this indicator with the variables described in § 4.3
and ask whether they can be useful to select the
best compressed model out of a set of candidates.
Table 4 presents the most statistically significant
variables in our stepwise regression analysis. It
demonstrates that the ATE and the model’s per-
formance in the source domain are usually very
indicative of the model’s performance.
Indeed, most of the regression’s predictive
power comes from the model performance on
the source domain (F 1S) and the treatment effects
on the source and target domains (\AT ES, \AT ET ).
\P (S|T )) and the
In contrast, the distance metric (
\P (S|T ))
interaction terms (\AT ET ·
contribute much less to the total R2. The predic-
tive power of the ATE in both source and target
domains suggests a positive answer to Q3 of § 1.
\P (S|T ), F 1S ·
                 Amazon              MultiNLI            OntoNotes
Variable        β        ΔR2        β        ΔR2        β        ΔR2
F1_S           0.435    0.603    −0.299    0.143      0.748    0.510
ÂTE_T         −1.207    0.239    −0.666    0.413      117.5    0.202
ÂTE_S          1.836    0.029     0.557    0.232      125.9    0.072
P̂(S|T)        −0.298    0.028    −0.652    0.061      15.60    0.052
8 layers      −0.137    0.001    −0.303    0.001     −3.145    0.001
6 layers      −0.066    0        −0.146    0.007     −1.020    0.005
const         −0.560    0.007    −0.092    0.029     −115.8    0.004

Table 4: Stepwise regression coefficients (β) and
their marginal contribution to the adjusted R2
(∆R2) across all experiments on the three data sets.
7 Additional Analysis
7.1 Layer Importance
To further understand the importance of each
of BERT’s layers, we compute the frequency in
which each layer appears in the best candidate
model, namely, the model with the highest F1
score on the target test set, of every experiment.
Figure 3 captures the layer frequencies across
the different data sets and across the number of
removed layers.
The plots suggest that the two final layers, lay-
ers 11 and 12, are the least important layers with
average frequencies of 30.3% and 24.8%, respec-
tively. Additionally, in most cases layer 1 is ranked
below the other layers. These results imply that
the compressed models are able to better recover
from the loss of parameters when the external lay-
ers are removed. The most important layer appears
to be layer 4, with an average frequency of 73.3%.
Finally, we notice that a large frequency variance
exists across the different subplots. Such variance
supports our hypothesis that the decision of which
layers to remove should not be based solely on the
architecture of the model.
To pin down the importance of a specific layer
for a given base model, we utilize a similar regres-
sion analysis to that of § 6. Specifically, we train
a regression model on all compressed candidates
for a given source-target domain pair (in all three
tasks), adding indicator variables for the exclusion
of each layer from the model. This model asso-
ciates each layer with a regression coefficient,
which can be interpreted as the marginal effect
Figure 3: Layer frequency at the best (oracle) compressed models when considering the removal of 4, 6, and 8
layers in the three data sets.
of that layer being removed on expected target
performance. We then compute for each layer
its average coefficient across source-target pairs
(Table 5, β column) and compare it to the frac-
tion of source-target pairs where this layer is not
included in the best possible (oracle) compressed
model (Table 5, P (Layer removed) column).
As can be seen in the table, layers whose removal
is associated with better model performance
are more often not included in the best performing
compressed models. Indeed, the Spearman’s rank
correlation between the two rankings is as high as
0.924. Such analysis demonstrates that the regres-
sion model used as part of AMoC not only selects
high quality candidates, but can also shed light on
the importance of individual layers.
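The following is a minimal sketch, under assumed data-frame and column names rather than the paper’s actual code, of the per-layer regression and the rank correlation described above.

# Minimal sketch (assumed data layout): regress target F1 on indicators of
# which layers were removed, average each layer's coefficient across
# source-target pairs, and correlate with its removal rate in the oracle models.
import pandas as pd
import statsmodels.api as sm
from scipy.stats import spearmanr

LAYERS = [f"removed_{i}" for i in range(1, 13)]  # 1 if layer i was removed

def layer_coefficients(candidates):
    # One regression per source-target pair; return the mean coefficient per layer.
    coefs = []
    for _, pair_df in candidates.groupby(["source", "target"]):
        X = sm.add_constant(pair_df[LAYERS])
        fit = sm.OLS(pair_df["f1_target"], X).fit()
        coefs.append(fit.params[LAYERS])
    return pd.concat(coefs, axis=1).mean(axis=1)

def oracle_removal_rate(oracle):
    # oracle: one row per pair with the removed_* indicators of its best model.
    return oracle[LAYERS].mean()

# rho, _ = spearmanr(layer_coefficients(candidates), oracle_removal_rate(oracle))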
7.2 Training Epochs
We next analyze the number of epochs required
to fine-tune our compressed models. For each
data set (task) we randomly choose for every
target domain 10 compressed models and cre-
ate two alternatives, differing in the number of
training epochs performed after layer removal:
One trained for a single epoch and another for
5 epochs (Amazon Reviews, MultiNLI) or 10
epochs (OntoNotes).

Rank    ¯β       P(Layer removed)
1       0.0448   0.300
2       0.0464   0.333
3       0.0473   0.333
4       0.0483   0.333
5       0.0487   0.416
6       0.0495   0.555
7       0.0501   0.472
8       0.0507   0.638
9       0.0514   0.500
10      0.0522   0.638
11      0.0538   0.611
12      0.0577   0.666

Table 5: Layer rank according to the regression coefficients (¯β) and the probability that the layer was removed from the best compressed model. Results are averaged across all source-target domain pairs in our experiments.

Table 6 compares the average F1 (target-domain task performance) and \ATE_T differences between the two alternatives, on the target domain test and dev sets, respectively. The results suggest that when training for
more epochs on Amazon Reviews and MultiNLI, the differences in both F1 and ATE are negligible. For OntoNotes (NER), in contrast, additional training improves the F1, suggesting that further training of the compressed model candidates may be favorable for sequence tagging tasks such as NER.

                  F1 Difference   \ATE_T Difference
Amazon Reviews        0.080            0.011
MNLI                 −0.250            0.003
OntoNotes             2.940           −0.009

Table 6: F1 and ATE differences when training AMoC after layer removal for multiple epochs vs. a single epoch.

               Overall Parameters   Trainable Parameters                  Train Time Reduction
BERT-base      110M                 110M                                  ×1
DistilBERT     66M                  66M                                   ×1.83
AMoC           110M − 7M · L        7M · min{L, 12 − L} + 17M · 1{1∈c}    ×11

Table 7: Comparison of the number of parameters and the training time between BERT-base, DistilBERT, and AMoC when removing L layers. AMoC’s number of trainable parameters is an upper bound.
7.3 Space and Time Complexity
Table 7 compares the number of overall and
trainable parameters and the training time of
BERT, DistilBERT, and AMoC. Removing L
layers from BERT yields a reduction of 7L mil-
lion parameters. As can be seen in the table,
AMoC requires training only a small fraction of
the overall parameters. Since we only unfreeze
one layer for each new connected component, in
the worst case our algorithm requires training
min{L, 12 − L} layers. The only exception is
in the case where Layer 1 is removed (1 ∈ c).
In such a case we unfreeze the embedding layer,
which adds 24 million trained parameters. In terms
of total training time (one epoch of task-based
fine-tuning), when averaging over all setups, a
single compressed AMoC model is ×11 faster
than BERT and ×6 faster than DistilBERT.
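As a worked example of the arithmetic in Table 7 (all counts in millions, following the table’s figures), consider the sketch below; the function name is ours.

# Worked example of the Table 7 arithmetic (parameter counts in millions,
# following the table's figures); `removed` is the set of removed layer indices.
def amoc_param_counts(removed):
    L = len(removed)
    overall = 110 - 7 * L            # BERT-base minus roughly 7M parameters per removed layer
    trainable = 7 * min(L, 12 - L)   # upper bound: one unfrozen layer per connected component
    if 1 in removed:                 # removing layer 1 also unfreezes the embedding layer
        trainable += 17
    return overall, trainable

# Removing layers {1, 2, 11, 12}: overall 110 - 28 = 82M,
# trainable at most 7 * min(4, 8) + 17 = 45M.
print(amoc_param_counts({1, 2, 11, 12}))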
7.4 Design Choices
Computing the ATE Following Goyal et al.
(2019) and Feder et al. (2021), we implement
the ATE with the total variation distance be-
tween the probability output of the original model
and that of the compressed models. To verify
the quality of this design choice, we re-ran our
experiments with the ATE calculated as the
KL-divergence between the same distributions.
While the results in both conditions are
qualitatively similar, we found a consistent
improvement in R2 (0.05 on average across
setups) when using the total variation distance.
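For concreteness, the two distances can be sketched as follows for a single example, where p and q denote the class probabilities of the original and compressed models; the per-example quantities are then averaged into the ATE estimates.

# Sketch of the two distances for a single example's class probabilities:
# p from the original model, q from a compressed candidate.
import numpy as np

def total_variation(p, q):
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p) + eps
    q = np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

print(total_variation([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))  # 0.2
print(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))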
Regression Analysis Our regression approach
is designed to allow us to both select high-quality
compressed candidates and to interpret the im-
portance of each explanatory variable, including
the ATEs. As this regression has relatively few
features, we do not expect to lose significant
predictive power by choosing to focus on linear
predictors. To verify this, we re-ran our experiments
using a fully connected feed-forward network9 to
predict target performance. This model, which is
less interpretable than our regression, is also less
accurate: we observed a 1-3% increase in mean
squared error with the network.
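A minimal sketch of this comparison is given below; the feature matrix, hidden width, and other hyperparameters are placeholders rather than the tuned values described in footnote 9.

# Minimal sketch (placeholder features/hyperparameters): linear predictor vs.
# a one-hidden-layer feed-forward network for predicting target-domain F1.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def compare_predictors(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    linear = LinearRegression().fit(X_tr, y_tr)
    mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                       random_state=seed).fit(X_tr, y_tr)
    return {"linear_mse": mean_squared_error(y_te, linear.predict(X_te)),
            "mlp_mse": mean_squared_error(y_te, mlp.predict(X_te))}

# Example on synthetic data standing in for the candidate features:
rng = np.random.RandomState(0)
X = rng.randn(200, 8)
y = X @ rng.randn(8) + 0.1 * rng.randn(200)
print(compare_predictors(X, y))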
8 Conclusion
We explored the relationship between model com-
pression and out-of-distribution generalization.
AMoC, our proposed algorithm, relies on causal
inference tools for estimating the effects of inter-
ventions. It hence creates an interpretable process
that allows us to understand the role of specific model
components. Our results indicate that AMoC is
able to produce a smaller model with minimal loss
in performance across domains, without any use
of target labeled data at test time (Q1).
AMoC can efficiently train a large number of
compressed model candidates, which can then serve
as training examples for a regression model. We
have shown that this approach results in a high
quality estimation of the performance of com-
pressed models on unseen target domains (Q2).
Moreover, our stepwise regression analysis indi-
cates that the \ATE_S and \ATE_T estimates are
instrumental for these attractive properties (Q3).
As training and test set mismatches are com-
mon, we steered our model compression research
towards out-of-domain generalization. Besides
its realistic nature, this setup poses additional
modeling challenges, such as understanding the
proximity between domains, identifying which
9With one intermediate layer, the same input features as the
regression, and hyperparameters tuned on the development
set of each source-target pair.
components are invariant to domain shift, and es-
timating performance on unseen domains. Hence,
AMoC is designed for model compression in
the out-of-distribution setup. We leave the design
of similar in-domain compression methods for
future work.
Finally, we believe that using causal methods
to produce compressed NLP models that generalize
well across distributions is a promising research
direction, and we hope that more work will
be done at this intersection.
Acknowledgments
We would like to thank the action editor and
the reviewers, as well as the members of the
IE@Technion NLP group for their valuable feed-
back and advice. This research was partially
funded by an ISF personal grant No. 1625/18.
References
Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin
Yao, Xing Fan, and Chenlei Guo. 2020.
Knowledge distillation from internal repre-
sentations. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 34,
pages 7350–7357. https://doi.org/10
.1609/aaai.v34i05.6229
Eyal Ben-David, Carmel Rabinovitz, and Roi
Reichart. 2020. PERL: Pivot-based domain
adaptation for pre-trained deep contextual-
ized embedding models. Transactions of the
Association for Computational Linguistics,
8:504–521. https://doi.org/10.1162
/tacl_a_00328
John Blitzer, Ryan McDonald, and Fernando
Pereira. 2006. Domain adaptation with structural
correspondence learning. In Proceedings of
the 2006 Conference on Empirical
Methods in Natural Language Processing,
pages 120–128. https://doi.org/10.3115
/1610075.1610094
Léon Bottou, Jonas Peters, Joaquin Quiñonero-
Candela, Denis X Charles, D Max Chickering,
Elon Portugaly, Dipankar Ray, Patrice Simard,
and Ed Snelson. 2013. Counterfactual reasoning
and learning systems: The example of compu-
tational advertising. The Journal of Machine
Learning Research, 14(1):3207–3260.
Samuel Bowman, Gabor Angeli, Christopher
Potts, and Christopher D. Manning. 2015. A
large annotated corpus for learning natural
language inference. In Proceedings of the
2015 Conference on Empirical Methods in
Natural Language Processing, pages 632–642.
https://doi.org/10.18653/v1/D15
-1075
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah,
Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel M. Ziegler,
Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark
Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher
Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. 2020. Language
models are few-shot learners. arXiv preprint
arXiv:2005.14165.
Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen
Wang, Bofang Li, Bolin Ding, Hongbo Deng,
Jun Huang, Wei Lin, and Jingren Zhou. 2020.
Adabert: Task-adaptive bert compression with
differentiable neural architecture search. In Pro-
ceedings of
the Twenty-Ninth International
Joint Conference on Artificial Intelligence,
IJCAI-20, pages 2463–2469. International Joint
Conferences on Artificial Intelligence Organi-
zation. Main track. https://doi.org/10
.24963/ijcai.2020/341
Wanyun Cui, Guangyu Zheng, Zhiqiang Shen,
Sihang Jiang, and Wei Wang. 2018. Transfer
learning for sequences via learning to collo-
cate. In International Conference on Learning
Representations.
Hal Daumé III, Abhishek Kumar, and Avishek
Saha. 2010. Frustratingly easy semi-supervised
domain adaptation. In Proceedings of the 2010
Workshop on Domain Adaptation for Natural
Language Processing, pages 53–59.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long
and Short Papers), pages 4171–4186.
Norman R. Draper and Harry Smith. 1998. Applied
Regression Analysis, volume 326. John Wiley
& Sons. https://doi.org/10.1002
/9781118625590
Haim Dubossarsky, Ivan Vulić, Roi Reichart, and
Anna Korhonen. 2020. The secret is in the
spectra: Predicting cross-lingual task perfor-
mance with spectral similarity measures. In
Proceedings of the 2020 Conference on Empir-
ical Methods in Natural Language Processing,
pages 2377–2390. https://doi.org/10
.18653/v1/2020.emnlp-main.186
Hady Elsahar and Matthias Gallé. 2019. To an-
notate or not? Predicting performance drop
under domain shift. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th Interna-
tional Joint Conference on Natural Language
Processing, pages 2163–2173. https://doi
.org/10.18653/v1/D19-1222
Angela Fan, Edouard Grave, and Armand Joulin.
2019. Reducing transformer depth on de-
mand with structured dropout. In International
Conference on Learning Representations.
Amir Feder, Nadav Oved, Uri Shalit, and Roi
Reichart. 2021. CausaLM: Causal model ex-
planation through counterfactual language mod-
els. Computational Linguistics, 47(2):333–386.
https://doi.org/10.1162/coli a 00404
Yaroslav Ganin, Evgeniya Ustinova, Hana
Ajakan, Pascal Germain, Hugo Larochelle,
François Laviolette, Mario Marchand, and
Victor Lempitsky. 2016. Domain-adversarial
training of neural networks. The Journal of
Machine Learning Research, 17(1):2096–2030.
Mingming Gong, Kun Zhang, Tongliang Liu,
Dacheng Tao, Clark Glymour, and Bernhard
Schölkopf. 2016. Domain adaptation with
conditional transferable components. In In-
ternational Conference on Machine Learning,
pages 2839–2848.
Yash Goyal, Amir Feder, Uri Shalit, and
Been Kim. 2019. Explaining classifiers with
causal concept effect (cace). arXiv preprint
arXiv:1907.07165.
Daniel Greenfeld and Uri Shalit. 2020. Robust
learning with the Hilbert-Schmidt indepen-
dence criterion. In Proceedings of the 37th
International Conference on Machine Learn-
ing, volume 119 of Proceedings of Machine
Learning Research, pages 3759–3768. PMLR.
Ruining He and Julian McAuley. 2016. Ups
and downs: Modeling the visual evolution
of fashion trends with one-class collaborative
filtering. In Proceedings of
the 25th Inter-
national Conference on World Wide Web,
pages 507–517.
Geoffrey Hinton, Oriol Vinyals, and Jeffrey
Dean. 2015. Distilling the knowledge in a
neural network. In NIPS Deep Learning and
Representation Learning Workshop.
Jonathan Frankle and Michael Carbin. 2018.
The lottery ticket hypothesis: Finding sparse,
trainable neural networks. In International
Conference on Learning Representations.
Ronald R. Hocking. 1976. A biometrics invited
paper. The analysis and selection of variables
in linear regression. Biometrics, 32(1):1–49.
https://doi.org/10.2307/2529336
Prakhar Ganesh, Yao Chen, Xin Lou,
Mohammad Ali Khan, Yin Yang, Deming
Chen, Marianne Winslett, Hassan Sajjad, and
Preslav Nakov. 2020. Compressing large-scale
transformer-based models: A case study on bert.
arXiv preprint arXiv:2002.11985.
Yaroslav Ganin and Victor Lempitsky. 2015.
Unsupervised domain adaptation by backpropa-
gation. In International Conference on Machine
Learning, pages 1180–1189. PMLR.
Eduard Hovy, Mitch Marcus, Martha Palmer,
Lance Ramshaw, and Ralph Weischedel. 2006.
Ontonotes: The 90% solution. In Proceedings of
the Human Language Technology Conference
of the NAACL, Companion Volume: Short Pa-
pers, pages 57–60. https://doi.org/10
.3115/1614049.1614064
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin
Jiang, Xiao Chen, Linlin Li, Fang Wang, and
Qun Liu. 2020. TinyBERT: Distilling BERT
for natural language understanding. In Find-
ings of the Association for Computational
Linguistics: EMNLP 2020, pages 4163–4174,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.findings-emnlp.372
Fredrik Johansson, Uri Shalit, and David Sontag.
2016. Learning representations for counterfac-
tual inference. In International Conference on
Machine Learning, pages 3020–3029.
Divyansh Kaushik, Eduard Hovy, and Zachary
Lipton. 2019. Learning the difference that
makes a difference with counterfactually-
augmented data. In International Conference
on Learning Representations.
Diederik P. Kingma and Jimmy Ba. 2015.
Adam: A method for stochastic optimiza-
tion. In International Conference on Learning
Representations.
Dan Kondratyuk and Milan Straka. 2019. 75
languages, 1 model: Parsing universal depen-
dencies universally. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th Interna-
tional Joint Conference on Natural Language
Processing, pages 2779–2795. https://doi
.org/10.18653/v1/D19-1279
Zhenzhong Lan, Mingda Chen, Sebastian
Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. 2020. Albert: A lite BERT for
self-supervised learning of language representa-
tions. In International Conference on Learning
Representations.
Bill Yuchen Lin and Wei Lu. 2018. Neural
adaptation layers for cross-domain named en-
tity recognition. In Proceedings of the 2018
Conference on Empirical Methods in Natural
Language Processing, pages 2012–2022.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly opti-
mized bert pretraining approach. arXiv preprint
arXiv:1907.11692.
Annie Louis and Ani Nenkova. 2009. Performance
confidence estimation for automatic summa-
rization. In Proceedings of the 12th Conference
of the European Chapter of the ACL (EACL
2009), pages 541–548. Association for Compu-
tational Linguistics. https://doi.org/10
.3115/1609067.1609127
Sara Magliacane, Thijs van Ommen, Tom
Claassen, Stephan Bongers, Philip Versteeg,
and Joris M. Mooij. 2018. Domain adap-
tation by using causal
inference to predict
invariant conditional distributions. In Advances
in Neural Information Processing Systems,
pages 10846–10856.
David McClosky, Eugene Charniak, and Mark
Johnson. 2010. Automatic domain adaptation
for parsing. In Human Language Technolo-
gies: The 2010 Annual Conference of
the
North American Chapter of the Association
for Computational Linguistics, pages 28–36.
Association for Computational Linguistics.
Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang
Li, Nir Levine, Akihiro Matsukawa, and Hassan
Ghasemzadeh. 2020. Improved knowledge dis-
tillation via teacher assistant. In Proceedings of
the AAAI Conference on Artificial Intelligence,
volume 34, pages 5191–5198. https://doi
.org/10.1609/aaai.v34i04.5963
Judea Pearl. 1995. Causal diagrams for empirical
research. Biometrika, 82(4):669–688. https://
doi.org/10.1093/biomet/82.4.669
Judea Pearl. 2009. Causality. Cambridge Univer-
sity Press.
Judea Pearl. 2009. Causal inference in statistics:
An overview. Statistics Surveys, 3:96–146.
https://doi.org/10.1214/09-SS057
Jonas Peters, Dominik Janzing, and Bernhard
Schölkopf. 2017. Elements of Causal Inference:
Foundations and Learning Algorithms. The
MIT Press.
Alec Radford, Karthik Narasimhan, Tim
Salimans, and Ilya Sutskever. 2018. Improv-
ing language understanding with unsupervised
learning. Technical report, OpenAI.
Sujith Ravi, Kevin Knight, and Radu Soricut.
2008. Automatic prediction of parser accu-
racy. In Proceedings of the 2008 Conference
on Empirical Methods in Natural Language
Processing, pages 887–896. https://doi
.org/10.3115/1613715.1613829
Roi Reichart and Ari Rappoport. 2007. An ensem-
ble method for selection of high quality parses.
In Proceedings of the 45th Annual Meeting of
the Association of Computational Linguistics,
pages 408–415.
Anna Rogers, Olga Kovaleva,
and Anna
Rumshisky. 2021. A primer in BERTology:
What we know about how BERT works. Trans-
actions of the Association for Computational
Linguistics, 8:842–866. https://doi.org
/10.1162/tacl_a_00349
Mateo Rojas-Carulla, Bernhard
Schölkopf,
Richard Turner, and Jonas Peters. 2018. In-
variant models for causal
transfer learning.
The Journal of Machine Learning Research,
19(1):1309–1342.
Guy Rotman and Roi Reichart. 2019. Deep
contextualized self-training for low resource
dependency parsing. Transactions of
the
Association for Computational Linguistics,
7:695–713. https://doi.org/10.1162
/tacl_a_00294
Hassan Sajjad, Fahim Dalvi, Nadir Durrani,
and Preslav Nakov. 2020. Poor man’s BERT:
Smaller and faster transformer models. arXiv
preprint arXiv:2004.03844.
Victor Sanh, Lysandre Debut, Julien Chaumond,
and Thomas Wolf. 2019. Distilbert, a distilled
version of BERT: smaller, faster, cheaper and
lighter. In Proceedings of the 5th Workshop
on Energy Efficient Machine Learning and
Cognitive Computing in Advances in Neural
Information Processing Systems.
Motoki Sato, Hitoshi Manabe, Hiroshi Noji, and
Yuji Matsumoto. 2017. Adversarial training for
cross-domain universal dependency parsing. In
Proceedings of the CoNLL 2017 Shared Task:
Multilingual Parsing from Raw Text to Univer-
sal Dependencies. https://doi.org/10
.18653/v1/K17-3007
Bernhard Schölkopf, Dominik Janzing, Jonas
Peters, Eleni Sgouritsa, Kun Zhang, and Joris
Mooij. 2012. On causal and anticausal learn-
ing. In Proceedings of the 29th International
Conference on International Conference on
Machine Learning, pages 459–466.
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie
Liu, Yiming Yang, and Denny Zhou. 2020.
MobileBERT: A compact task-agnostic BERT
for resource-limited devices. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 2158–2170.
Association for Computational Linguistics.
Vincent Van Asch and Walter Daelemans. 2010.
Using domain similarity for performance esti-
mation. In Proceedings of the 2010 Workshop
on Domain Adaptation for Natural Language
Processing, pages 31–36.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances
in Neural Information Processing Systems,
pages 5998–6008.
Yoav Wald, Amir Feder, Daniel Greenfeld,
and Uri Shalit. 2021. On calibration and
out-of-domain generalization. arXiv preprint
arXiv:2102.10395.
Zhenghui Wang, Yanru Qu, Liheng Chen,
Jian Shen, Weinan Zhang, Shaodian Zhang,
Yimei Gao, Gen Gu, Ken Chen, and Yong
Yu. 2018. Label-aware double transfer learn-
ing for cross-specialty medical named entity
recognition. In Proceedings of the 2018 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long
Papers), pages 1–15. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/N18-1001
Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding through in-
ference. In Proceedings of the 2018 Conference
of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Pa-
pers), pages 1112–1122. https://doi.org
/10.18653/v1/N18-1101
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Rémi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin
Lhoest,
and Alexander M. Rush. 2020.
Transformers: State-of-the-art natural language
processing. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural
Language Processing: System Demonstra-
tions, pages 38–45. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2020.emnlp-demos.6
Mengzhou Xia, Antonios Anastasopoulos,
Ruochen Xu, Yiming Yang, and Graham
Neubig. 2020. Predicting performance for natu-
ral language processing tasks. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 8625–8646.
Association for Computational Linguistics.
Kun Zhang, Bernhard Schölkopf, Krikamol
Muandet, and Zhikun Wang. 2013. Do-
main adaptation under target and conditional
shift. In International Conference on Machine
Learning, pages 819–827.
Yftah Ziser and Roi Reichart. 2017. Neural
structural correspondence learning for domain
adaptation. In Proceedings of the 21st Con-
ference on Computational Natural Language
Learning (CoNLL 2017), pages 400–410.
https://doi.org/10.18653/v1/K17
-1040
Yftah Ziser and Roi Reichart. 2018a. Deep
pivot-based modeling for cross-language
cross-domain transfer with minimal guid-
ance. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language
Processing, pages 238–249. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/D18-1022
Yftah Ziser and Roi Reichart. 2018b. Pivot based
language modeling for improved neural do-
main adaptation. In Proceedings of the 2018
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long Papers), pages 1241–1251. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/N18-1112
Yftah Ziser and Roi Reichart. 2019. Task re-
finement learning for improved accuracy and
stability of unsupervised domain adaptation. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 5895–5906. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P19-1591