Reducing Confusion in Active Learning for Part-of-Speech Tagging
Aditi Chaudhary1, Antonios Anastasopoulos2,∗, Zaid Sheikh1, Graham Neubig1
1Language Technologies Institute, Carnegie Mellon University
2Department of Computer Science, George Mason University
{aschaudh,zsheikh,gneubig}@cs.cmu.edu
antonis@gmu.edu
Abstract
Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost. This is now an essential tool for building low-resource syntactic analyzers such as part-of-speech (POS) taggers. Existing AL heuristics are generally designed on the principle of selecting uncertain yet representative training instances, where annotating these instances may reduce a large number of errors. However, in an empirical study across six typologically diverse languages (German, Swedish, Galician, North Sami, Persian, and Ukrainian), we found the surprising result that even in an oracle scenario where we know the true uncertainty of predictions, these current heuristics are far from optimal. Based on this analysis, we pose the problem of AL as selecting instances that maximally reduce the confusion between particular pairs of output tags. Extensive experimentation on the aforementioned languages shows that our proposed AL strategy outperforms other AL strategies by a significant margin. We also present auxiliary results demonstrating the importance of proper calibration of models, which we ensure through cross-view training, and analysis demonstrating how our proposed strategy selects examples that more closely follow the oracle data distribution. The code is publicly released here.1
1 Introduction
Part-of-speech (POS) tagging is a crucial step for language understanding. It is used both in automatic language understanding applications such as named entity recognition (NER; Ankita and Nazeer, 2018) and question answering (QA; Wang et al., 2018), and in manual language understanding by linguists who are attempting to answer linguistic questions or document less-resourced languages (Anastasopoulos et al., 2018). Much prior work (Huang et al., 2015; Bohnet et al., 2018) on developing high-quality POS taggers uses neural network methods, which rely on the availability of large amounts of labelled data. However, such resources are not readily available for the majority of the world’s 7000 languages (Hammarström et al., 2018). Furthermore, manually annotating large amounts of text with trained experts is an expensive and time-consuming task, even more so when linguists/annotators might not be native speakers of the language.
∗Work done at Carnegie Mellon University.
1https://github.com/Aditi138/CRAL.
Active learning (AL; Lewis, 1995; Settles, 2009) is a family of methods that aim to train effective models with less human effort and cost by selecting a subset of data that maximizes the end model performance. Although many methods have been proposed for AL in sequence labeling (Settles and Craven, 2008; Marcheggiani and Artières, 2014; Fang and Cohn, 2017), through an empirical study across six typologically diverse languages we show that within the same task setup these methods perform inconsistently. Furthermore, even in an oracle scenario, where we have access to the true labels during data selection, existing methods are far from optimal.
We posit that the primary reason for this inconsistent performance is that while existing methods consider uncertainty in predictions, they do not consider the direction of the uncertainty with respect to the output labels. For example, in Figure 1 we consider the German token “die,” which may be either a pronoun (PRO) or determiner (DET). According to the initial model (iteration 0), “die” was labeled as PRO the majority of the time, but a significant amount of probability mass was also assigned to other output tags (OTHER) for many examples. Based on this, existing AL algorithms that select uncertain tokens will likely select “die” because it is frequent and
Transactions of the Association for Computational Linguistics, vol. 9, pp. 1–16, 2021. https://doi.org/10.1162/tacl_a_00350
Action Editor: Yuji Matsumoto. Submission batch: 3/2020; Revision batch: 6/2020; Published 02/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: Illustration of selecting representative token-tag combinations to reduce confusion between the output tags on the German token “die” in an idealized scenario where we know the true model confusion.
its predictions are not certain, but they may select an instance of “die” with either a gold label of PRO or DET. Intuitively, because we would like to correct errors where tokens with true labels of DET are mislabeled by the model as PRO, asking the human annotator to tag an instance with a true label of PRO, even if it is uncertain, is not likely to be of much benefit.
Inspired by this observation, we pose the problem of AL for POS tagging as selecting tokens that maximally reduce the confusion between the output tags. For instance, in the example above, we would attempt to pick a token-tag pair “die/DET” to reduce potential errors of the model over-predicting PRO despite its belief that DET is also a plausible option. We demonstrate the features of this model in an oracle setting where we know the true model confusions (as in Figure 1), and also describe how we can approximate this strategy when we do not know the true confusions.
We evaluate our proposed AL method by running simulation experiments on six typologically diverse languages, namely German, Swedish, Galician, North Sami, Persian, and Ukrainian, improving upon models seeded with cross-lingual transfer from related languages (Cotterell and Heigold, 2017). In addition, we conduct human annotation experiments on Griko, an endangered language that truly lacks significant resources. Our contributions are as follows:
1. We empirically demonstrate the shortcomings of existing AL methods under both conventional and “oracle” settings. Based on the subsequent analysis, we propose a new AL method that achieves a +2.92 average per-token accuracy improvement over existing methods under conventional settings, and a +2.08 average per-token accuracy improvement under the oracle setting.
2. We conduct extensive analysis measuring how closely the data selected by our proposed AL method matches the oracle data distribution.
3. We further demonstrate the importance of model calibration, that is, the accuracy of the model’s probability estimates themselves, and demonstrate that cross-view training (Clark et al., 2018) is an effective way to improve calibration.
4. We perform human annotation using the proposed method on an endangered language, Griko, and find our proposed method to perform better than the existing methods. In this process, we collect 300 new token-level annotations, which will help further Griko NLP.
2 Background: Active Learning
Generally, AL methods are designed to select data based on two criteria: “informativeness” and “representativeness” (Huang et al., 2010). Informativeness represents the ability of the selected data to reduce the model uncertainty on its predictions, and representativeness measures how well the selected data represent the entire unlabeled data. AL is an iterative process and is typically implemented in a batched fashion for neural models (Sener and Savarese, 2018). In a given iteration, a batch of data is selected using some heuristic, and the end model is trained on it until convergence. This trained model is then used to select the next batch for annotation, and so on.
In this work we focus on token-level AL methods, which require annotation of individual tokens in context, rather than full-sequence annotation, which is more time-consuming. Given an unlabeled pool of sequences $D = \{x_1, x_2, \cdots, x_n\}$ and a model $\theta$, $P_\theta(y_{i,t} = j \mid x_i)$ denotes the output probability of the output tag $j \in J$ produced by the model $\theta$ for the token $x_{i,t}$ in the input sequence $x_i$. $J$ denotes the set of POS tags. Most popular methods (Settles, 2009; Fang and Cohn, 2017) define “informativeness” using either uncertainty sampling or query-by-committee. We provide a brief review of these existing methods.
• Uncertainty Sampling (UNS; Fang and Cohn, 2017) selects the most uncertain word types in the unlabeled corpus $D$ for annotation. First, it calculates the token entropy $H(x_{i,t}; \theta)$ for each unlabeled sequence $x_i \in D$ under model $\theta$, defined as
$$p_{i,t,j} := P_\theta(y_{i,t} = j \mid x_i), \qquad H(x_{i,t}; \theta) = -\sum_{j \in J} p_{i,t,j} \log p_{i,t,j}.$$
Next, this entropy is aggregated over all token occurrences across $D$ to get an uncertainty score $S_{\mathrm{UNS}}(z)$ for each word type $z$:
$$S_{\mathrm{UNS}}(z) = \sum_{x_i \in D} \; \sum_{x_{i,t} = z} H(x_{i,t}; \theta).$$
• Query-by-Committee (QBC; Settles and Craven, 2008) selects the tokens having the highest disagreement between a committee of models $C = \{\theta_1, \theta_2, \theta_3, \cdots\}$, which is aggregated over all token occurrences. The token-level disagreement scores are defined as
$$S_{\mathrm{DIS}}(x_{i,t}) = |C| - \max_{y \in [\hat{y}^{\theta_1}_{i,t}, \hat{y}^{\theta_2}_{i,t}, \cdots, \hat{y}^{\theta_c}_{i,t}]} V(y),$$
where $V(y)$ is the number of “votes” received for the token label $y$, and $\hat{y}^{\theta_c}_{i,t}$ is the prediction with the highest score according to model $\theta_c$ for the token $x_{i,t}$. These disagreement scores are then aggregated over word types:
$$S_{\mathrm{QBC}}(z) = \sum_{x_i \in D} \; \sum_{x_{i,t} = z} S_{\mathrm{DIS}}(x_{i,t}).$$
Finally, regardless of whether we use a UNS-based or QBC-based score, the top $b$ word types with the highest aggregated score are then selected as the to-label set
$$X_{\mathrm{LABEL}} = b\text{-}\arg\max_{z} S(z),$$
where $b$-$\arg\max$ selects the top $b$ word types having the highest $S(z)$. Fang and Cohn (2017) and Chaudhary et al. (2019) further attempt to include the “representativeness” criterion by combining uncertainty sampling with a bias towards high-frequency tokens/spans.
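As a minimal illustration of the two scores and the $b$-$\arg\max$ selection, the Python sketch below assumes that per-token tag marginals (for UNS) or committee votes (for QBC) are already available as plain Python lists. The data layout and the function names are hypothetical and are not taken from the released code.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Entropy of one token's marginal distribution over tags."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uns_scores(corpus):
    """S_UNS(z): token entropy summed over all occurrences of word type z.

    corpus: list of sentences; each sentence is a list of
    (word_type, probs) pairs (hypothetical layout).
    """
    scores = defaultdict(float)
    for sentence in corpus:
        for word, probs in sentence:
            scores[word] += token_entropy(probs)
    return dict(scores)


def qbc_scores(corpus):
    """S_QBC(z): |C| minus the largest vote count, summed over occurrences.

    corpus: each token is (word_type, votes), where votes lists the
    top prediction of every committee member.
    """
    scores = defaultdict(float)
    for sentence in corpus:
        for word, votes in sentence:
            max_votes = max(votes.count(tag) for tag in set(votes))
            scores[word] += len(votes) - max_votes
    return dict(scores)


def top_b(scores, b):
    """b-argmax: the b word types with the highest aggregated score."""
    return sorted(scores, key=scores.get, reverse=True)[:b]
```

For instance, `top_b(uns_scores(corpus), 50)` would return the 50 word types to annotate in one iteration.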
Failings of Current AL Methods Although these methods are widely used, in a preliminary empirical study we found that these existing methods are less than optimal and fail to bring consistent gains across multiple settings. Ideally, having a single strategy that performs the best across a diverse language set is useful for other researchers who plan to use AL for new languages. Instead of researchers experimenting with different strategies with human annotation, which is costly, having a single strategy known a priori will reduce both time and human annotation effort.
Specifically, we demonstrate this problem of inconsistency through a set of oracle experiments, where the data selection algorithm has access to the true labels. We hope that these experiments serve as an upper bound for their non-oracle counterparts, so if existing methods do not achieve gains even in this case, they will certainly be even less promising when true labels are not available at data selection time, as is the case in standard AL.
Concretely, as an oracle uncertainty sampling method, UNS-ORACLE, we select types with the highest negative log likelihood of their true label. As an “oracle” QBC method, QBC-ORACLE, we select types having the largest number of incorrect predictions. We conduct 20 AL iterations for each of these methods across six typologically diverse languages.2
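The two oracle criteria can be sketched as follows. The sketch assumes that scores are summed over all occurrences of a word type and that QBC-ORACLE counts the errors of a single model; both are assumptions made for illustration, as is the input layout.

```python
import math
from collections import defaultdict


def oracle_scores(corpus):
    """Return (UNS-ORACLE, QBC-ORACLE) scores per word type.

    corpus: list of sentences; each token is (word_type, probs, gold),
    where probs maps tag -> probability (hypothetical layout).
    """
    nll = defaultdict(float)   # negative log likelihood of the true label
    errors = defaultdict(int)  # number of incorrect predictions
    for sentence in corpus:
        for word, probs, gold in sentence:
            # Clamp to avoid log(0) when the gold tag has zero probability.
            nll[word] += -math.log(max(probs.get(gold, 0.0), 1e-12))
            predicted = max(probs, key=probs.get)
            errors[word] += int(predicted != gold)
    return dict(nll), dict(errors)
```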
First, we observe that between the oracle methods (Figure 2) no method consistently performs the best across all six languages. Second, we find that just considering uncertainty leads to an unbalanced selection of the resulting tags. To drive this point home, Table 1 shows the output tags selected for the German token “zu” across multiple iterations. UNS-ORACLE selects the most frequent output tag, failing to select tokens from other output tags. Whereas QBC-ORACLE selects tokens having multiple tags, the distribution is not in proportion with the true tag distribution. Our hypothesis is that this inconsistent performance occurs because none of the methods consider the confusion between output tags while selecting data. This is especially important for POS tagging because we find that the existing methods tend to select highly syncretic word types. Syncretism is a linguistic phenomenon where distinctions required by syntax are not realized by morphology, meaning a word type can have multiple POS tags based on context.3 This is expected because syncretic word types, owing to their inherent ambiguity, cause high uncertainty, which is the underlying criterion for most AL methods.
2More details on the experimental setup in Section §5.
3Details can be found in Section §5.2, Table 3.
Figure 2: Illustrating the inconsistent performance of the UNS-ORACLE and QBC-ORACLE methods. The y-axis is the difference in POS accuracy between these two methods, averaged across 20 iterations with a batch size of 50.
              QBC-ORACLE            UNS-ORACLE
ITERATION-1   PART=1                ADP=1
ITERATION-2   PART=1,ADP=1          ADP=2
ITERATION-3   ADV=1,PART=1,ADP=1    ADP=2
ITERATION-4   ADV=1,PART=1,ADP=2    ADP=3
Table 1: Each cell is the tag selected for the German token “zu” at each iteration. The gold output tag distribution for “zu” is ADP=194, PART=103, ADV=5, PROPN=5, ADJ=1.
3 Confusion-Reducing Active Learning
To address the limitations of the existing methods, we propose a confusion-reducing active learning (CRAL) strategy, which aims at reducing the confusion between the output tags. In order to combine both “informativeness” and “representativeness”, we follow a two-step algorithm:
1. Find the most confusing word types. The goal of this step is to find $b$ word types that would maximally reduce the model confusion within the output tags. For each token $x_{i,t}$ in the unlabeled sequence $x_i \in D$, we first define the confusion as the sum of probability $P_\theta(y_{i,t} = j \mid x_i)$ of all output tags $J$ other than the highest probability output tag $\hat{y}_{i,t}$:
$$S_{\mathrm{CONF}}(x_{i,t}) = 1 - P_\theta(y_{i,t} = \hat{y}_{i,t} \mid x_i),$$
then sum this over all instances of type $z$:
$$S_{\mathrm{CRAL}}(z) = \sum_{x_i \in D} \; \sum_{x_{i,t} = z} S_{\mathrm{CONF}}(x_{i,t}).$$
Again selecting the top $b$ types having the highest score (given by $b$-$\arg\max$) gives us the most confusing word types ($X_{\mathrm{INIT}}$). For each token, we also store the output tag that had the second highest probability, which we refer to as the “most confusing output tag” for a particular $x_{i,t}$:
$$O(x_{i,t}, j) = \begin{cases} 1 & \text{if } j = \arg\max_{j' \in J \setminus \{\hat{y}_{i,t}\}} p_{i,t,j'} \\ 0 & \text{otherwise.} \end{cases}$$
For each word type $z$, we aggregate the frequency of the most confusing output tag across all token occurrences:
$$O_{\mathrm{CRAL}}(z, j) = \sum_{x_i \in D} \; \sum_{x_{i,t} = z} O(x_{i,t}, j),$$
and compute the output tag with the highest frequency as the most confusing output tag for type $z$:
$$L(z) = \arg\max_{j \in J} O_{\mathrm{CRAL}}(z, j).$$
For each of the top $b$ most confusing word types, we retrieve its most confusing output tag, resulting in type-tag pairs given by $L_{\mathrm{INIT}} = \{\langle z_1, j_1 \rangle, \cdots, \langle z_b, j_b \rangle\}$. This process is illustrated in steps 7–14 in Algorithm 1.
2. Find the most representative token instances. Now that we have the most confusing type-tag pairs $L_{\mathrm{INIT}}$, our final step is selecting the most representative token instances for annotation. For each type-tag tuple $\langle z_k, j_k \rangle \in L_{\mathrm{INIT}}$, we first retrieve contextualized representations for all token occurrences ($x_{i,t} = z_k$) of the word type $z_k$ from the encoder of the POS model. We express this in shorthand as $c_{i,t} := \mathrm{enc}(x_{i,t})$. Because the true labels are unknown, there is no certain way of knowing which tokens have the “most confusing output tag” as the true label. Therefore, each token representation
Algorithm 1 CONFUSION-REDUCING AL
1: D ← unlabeled set of sequences
2: Z ← vocabulary
3: J ← output tag-set
4: b ← active learning batch size
5: Pθ(y_{i,t} = j | x_i) ← marginal probability
6: p_{i,t,j} := Pθ(y_{i,t} = j | x_i)
7: for x_i ∈ D do
8:     for x_{i,t} ∈ x_i do
9:         z ← x_{i,t}
10:        S_CRAL(z) ← S_CRAL(z) + (1 − p_{i,t,ŷ_{i,t}})
11:        ĵ ← arg max_{j ∈ J \ {ŷ_{i,t}}} p_{i,t,j}
12:        O_CRAL(z, ĵ) ← O_CRAL(z, ĵ) + 1
13: X_INIT ← b-arg max_{z ∈ Z} S_CRAL(z)
14: for z_k ∈ X_INIT do
15:     j_k ← arg max_{j ∈ J} O_CRAL(z_k, j)
16:     for x_{i,t} ∈ D s.t. x_{i,t} = z_k do
17:         c_{x_{i,t}} ← enc(x_{i,t})
18:         W_{x_{i,t}} = p_{i,t,j_k} ∗ c_{x_{i,t}}
19:     X_LABEL(z_k) = CENTROID{W_{x_{i,t} = z_k}}
$c_{i,t}$ is weighted with the model confidence of the most confusing tag $j_k$, given by
$$W_{x_{i,t}} = P_\theta(y_{i,t} = j_k \mid x_i) \ast c_{i,t}.$$
Finally, the token instance that is closest to the centroid of this weighted token set becomes the most representative instance for annotation. Going forward, we also refer to the most representative token instance as the centroid for simplicity.4 This process is repeated for each of the word types $z_k$, resulting in the to-label set $X_{\mathrm{LABEL}}$. This is illustrated in steps 14–19 in Algorithm 1.
During the annotation process, the selected representative tokens of each selected confusing word type are presented in context, similar to Fang and Cohn (2017) and Chaudhary et al. (2019).
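The two steps can be sketched in Python with NumPy. This is an illustrative sketch under stated assumptions, not the released implementation: the input layout is hypothetical, the centroid is taken as the mean of the weighted vectors, and closeness is measured by Euclidean distance in the weighted space.

```python
from collections import Counter, defaultdict

import numpy as np


def cral_select(tokens, b):
    """Pick one representative token for each of the b most confusing types.

    tokens: list of (word_type, probs, encoding); probs is a 1-D array of
    tag marginals (at least two tags) and encoding is a 1-D contextual
    vector (hypothetical layout).
    Returns {word_type: (most_confusing_tag_index, token_index)}.
    """
    confusion = defaultdict(float)      # S_CRAL(z)
    second_best = defaultdict(Counter)  # O_CRAL(z, j)
    occurrences = defaultdict(list)
    for idx, (word, probs, _) in enumerate(tokens):
        order = np.argsort(probs)
        confusion[word] += 1.0 - probs[order[-1]]
        second_best[word][int(order[-2])] += 1
        occurrences[word].append(idx)

    selected = {}
    for word in sorted(confusion, key=confusion.get, reverse=True)[:b]:
        tag = second_best[word].most_common(1)[0][0]
        # Weight each encoding by the probability of the most confusing tag.
        weighted = np.stack(
            [tokens[i][1][tag] * tokens[i][2] for i in occurrences[word]]
        )
        centroid = weighted.mean(axis=0)
        nearest = int(np.argmin(np.linalg.norm(weighted - centroid, axis=1)))
        selected[word] = (tag, occurrences[word][nearest])
    return selected
```

Weighting by the probability of the most confusing tag pulls the centroid towards occurrences that the model considers likely to carry that tag, which is the stated purpose of the weighting when gold labels are unavailable.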
4Sener and Savarese (2018) describe why choosing the centroid is a good approximation of representativeness. They pose AL as a core-set selection problem, where a core set is a subset of data such that a model trained on it closely matches the performance of the model trained on the entire dataset. They show that finding the core set is equivalent to choosing $b$ center points such that the largest distance between a data point and its nearest center is minimized. We take inspiration from this result in using the centroid as the most representative instance.
4 Model and Training Regimen
Now that we have a method to select data
for annotation, we present our POS tagger in
Section §4.1, followed by the training algorithm
in Section §4.2.
4.1 Model Architecture
Our POS tagging model is a hierarchical neural conditional random field (CRF) tagger (Ma and Hovy, 2016; Lample et al., 2016; Yang et al., 2017). Each token $(x, t)$ from the input sequence $x$ is first passed through a character-level Bi-LSTM, followed by a self-attention layer (Vaswani et al., 2017), followed by another Bi-LSTM to capture information about the subword structure of the words. Finally, these character-level representations are fed into a token-level Bi-LSTM in order to create contextual representations $c_t = \overrightarrow{h_t} : \overleftarrow{h_t}$, where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the representations from the forward and backward LSTMs, and “:” denotes the concatenation operation. The encoded representations are then used by the CRF decoder to produce the output sequence.
Because we acquire token-level annotations, we cannot directly use the traditional CRF, which expects a fully labeled sequence. Instead, we use a constrained CRF (Bellare and McCallum, 2007), which computes the loss only for annotated tokens by marginalizing over the un-annotated tokens, as has also been done in prior token-level AL models (Fang and Cohn, 2017; Chaudhary et al., 2019).
Given an input sequence $x$ and a label sequence $y$, the traditional CRF computes the likelihood as follows:
$$p_\theta(y \mid x) = \frac{\prod_{t=1}^{N} \psi_t(y_{t-1}, y_t, x, t)}{Z(x)}, \qquad Z(x) = \sum_{y \in Y(N)} \prod_{t=1}^{N} \psi_t(y_{t-1}, y_t, x, t),$$
where $N$ is the length of the sequence and $Y(N)$ denotes the set of all possible label sequences with length $N$. $\psi_t(y_{t-1}, y_t, x) = \exp(W^{T}_{y_{t-1},y_t} x_t + b_{y_{t-1},y_t})$ is the energy function, where $W^{T}_{y_{t-1},y_t}$ and $b_{y_{t-1},y_t}$ are the weight vector and bias corresponding to the label pair $(y_{t-1}, y_t)$, respectively.
In constrained CRF training, $Y_L$ denotes the set of all possible sequences that are congruent with the observed annotations, and the likelihood is computed as:
$$p_\theta(Y_L \mid x) = \sum_{y \in Y_L} p_\theta(y \mid x).$$
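A small NumPy sketch of this marginalization is given below. It assumes, for simplicity, that the potential $\psi_t$ factors into a per-token tag score plus a tag-transition score (the names `emit` and `trans` are hypothetical) and it ignores start and end transitions. The constrained likelihood is then a ratio of two forward-algorithm sums: one restricted to label sequences that agree with the annotations and one over all sequences.

```python
import numpy as np


def log_partition(emit, trans, allowed=None):
    """Forward algorithm in log space.

    emit: (N, K) per-token tag scores; trans: (K, K) transition scores
    indexed as [previous_tag, current_tag].
    allowed: optional (N, K) boolean mask; disallowed tags are excluded.
    """
    scores = emit if allowed is None else np.where(allowed, emit, -np.inf)
    alpha = scores[0]
    for t in range(1, len(scores)):
        # Log-sum-exp over the previous tag, for every current tag.
        alpha = np.logaddexp.reduce(alpha[:, None] + trans, axis=0) + scores[t]
    return np.logaddexp.reduce(alpha)


def constrained_log_likelihood(emit, trans, annotations):
    """log p(Y_L | x); annotations maps token position -> gold tag index."""
    allowed = np.ones(emit.shape, dtype=bool)
    for t, tag in annotations.items():
        allowed[t] = False
        allowed[t, tag] = True
    return log_partition(emit, trans, allowed) - log_partition(emit, trans)
```

With an empty `annotations` dictionary the two sums coincide and the log-likelihood is zero, which reflects that an un-annotated sequence contributes no supervised loss.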
4.2 Cross-view Training Regimen
In order to further improve this model, we apply cross-view training (CVT), a semi-supervised learning method (Clark et al., 2018). On unlabeled examples, CVT trains auxiliary prediction modules, which look at restricted “views” of an input sequence, to match the prediction from the full view. By forcing the auxiliary modules to match the full-view module, CVT improves the model’s representation learning. Not only does it improve downstream performance under low-resource conditions, it also improves the model calibration overall (§5.4). Having a well-calibrated model is quite useful for AL, as a well-calibrated model tends to assign lower probabilities to “true” incorrect predictions, which allows the AL measure to select these incorrect tokens for annotation.
CVT is composed of four auxiliary prediction modules, namely: the forward module $\theta_{fwd}$ that makes predictions without looking at the right of the current token, the backward module $\theta_{bwd}$ that makes predictions without looking at the left of the current token, the future module $\theta_{fut}$ that does not look at either the right context or the current token, and the past module $\theta_{pst}$ that does not look at either the left context or the current token. The token representations $c_t$ for each module can be seen as follows:
$$c^{fwd}_t = \overrightarrow{h_t}, \quad c^{bwd}_t = \overleftarrow{h_t}, \quad c^{fut}_t = \overrightarrow{h_{t-1}}, \quad c^{pst}_t = \overleftarrow{h_{t+1}}, \quad c^{full}_t = \overrightarrow{h_t} : \overleftarrow{h_t}.$$
For an unlabeled sequence $x$, the full-view model $\theta_{full}$ first produces soft targets $p_\theta(y \mid x)$ after inference. CVT matches the soft predictions from the $V$ auxiliary modules by minimizing their KL-divergence. Although the CRF produces a probability distribution over all possible output sequences, for computational feasibility we compute the token-level KL-divergence using $p_\theta(y_t \mid x)$, which is the marginal probability distribution of token $(x, t)$ over all output tags $T$. This is calculated easily from the forward-backward algorithm:
$$L_{\mathrm{CVT}} = \frac{1}{|D|} \sum_{x_i \in D} \; \sum_{x_{i,t} \in x_i} \; \sum_{v=1}^{V} \mathrm{KL}\left(p^{full}_i \,\|\, p^{v}_i\right),$$
where $p^{full}_i := P^{full}_i(y_{i,t} = j \mid x_i)$, $p^{v}_i := P^{v}_i(y_{i,t} = j \mid x_i)$, and $|D|$ is the total number of unlabeled examples in $D$.
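The value of this objective can be sketched in NumPy as follows, assuming the token marginals of the full view and of each auxiliary view have already been computed. The function and argument names are hypothetical, and gradient handling is omitted.

```python
import numpy as np


def cvt_loss(full_marginals, aux_marginals, eps=1e-12):
    """Token-level KL(p_full || p_v), summed over tokens and views, over |D|.

    full_marginals: list over sequences of (N_i, K) arrays of tag marginals.
    aux_marginals: list over sequences of lists holding one (N_i, K) array
    per auxiliary view.
    """
    total = 0.0
    for p_full, views in zip(full_marginals, aux_marginals):
        for p_view in views:
            # eps guards against log(0) for tags with zero probability.
            log_ratio = np.log(p_full + eps) - np.log(p_view + eps)
            total += float(np.sum(p_full * log_ratio))
    return total / len(full_marginals)
```

The loss is zero exactly when every auxiliary view reproduces the full-view marginals, which is the matching behaviour the text describes.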
4.3 Cross-Lingual Transfer Learning
Using the architecture described above, for any given target language we first train a POS model on a group of related high-resource languages. We then fine-tune this pre-trained model on the newly acquired annotations in the target language, as obtained from an AL method. The objective of cross-lingual transfer learning is to warm-start the POS model on the target language. Several methods have been proposed in the past, including annotation projection (Zitouni and Florian, 2008) and model transfer using pre-trained models such as m-BERT (Devlin et al., 2019). In this work our primary focus is on designing an active learning method, so we simply pre-train a POS model on a group of related high-resource languages (Cotterell and Heigold, 2017), which is a computationally cheap solution, a crucial requirement for running multiple AL iterations. Moreover, recent work (Siddhant et al., 2020) has shown the advantage of pre-training using a selected set of related languages over a model pre-trained over all available languages.
Following this, for a given target language we first select a set of typologically related languages. An initial set of transfer languages is obtained using the automated tool provided by Lin et al. (2019), which leverages features such as phylogenetic similarity, typology, lexical overlap, and size of available data in order to predict a list of optimal transfer languages. This list can then be refined using the experimenter’s intuition. Finally, a POS model is trained on the concatenated corpora of the related languages. Similar to Johnson et al. (2017), a language identification token is added at the beginning and end of each sequence.
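A minimal sketch of this preprocessing step is shown below. The token format `<code>` and the function names are assumptions made for illustration; the text only states that a language identification token is added at the beginning and end of each sequence.

```python
def add_language_tokens(tokens, lang_code):
    """Wrap a token sequence with a language identification token."""
    marker = f"<{lang_code}>"  # hypothetical token format
    return [marker] + list(tokens) + [marker]


def build_transfer_corpus(treebanks):
    """Concatenate related-language corpora into one training corpus.

    treebanks: dict mapping a language code to a list of token sequences.
    """
    corpus = []
    for lang_code, sentences in treebanks.items():
        for sentence in sentences:
            corpus.append(add_language_tokens(sentence, lang_code))
    return corpus
```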
5 Simulation Experiments
In this section, we describe the simulation experiments used for evaluating our method. Under this setting, we use the provided training data as our unlabeled pool and simulate annotations by using the gold labels for each AL method.
Datasets: For the simulation experiments, we test on six typologically diverse languages: German, Swedish, North Sami, Persian, Ukrainian, and Galician. We use data from the Universal Dependencies (UD) v2.3 (Nivre et al., 2016; Nivre et al., 2018; Kirov et al., 2018) treebanks with the
TARGET LANGUAGE           TRANSFER LANGUAGES (TREEBANK)
German (de-gsd)           English (en-ewt) + Dutch (nl-alpino)
Swedish (sv-lines)        Norwegian (no-nynorsk) + Danish (da-ddt)
North Sami (sme-giella)   Finnish (fi-ftb)
Persian (fa-seraji)       Urdu (ur-udtb) + Russian (ru-gsd)
Galician (gl-treegal)     Spanish (es-gsd) + Portuguese (pt-gsd)
Ukrainian (uk-iu)         Russian (ru-gsd)
Griko                     Greek (el-gdt) + Italian (it-postwita)
Table 2: Dataset details describing the group of related languages over which the model was pre-trained for a given target language.
same train/dev/test split as proposed in McCarthy et al. (2018). For each target language, the set of related languages used for pre-training is listed in Table 2. Because the Persian and Urdu datasets are in the Perso-Arabic script, there is no orthographic overlap among the transfer and the target languages. Therefore, for Persian we use uroman,5 a publicly available tool for romanization.
Baselines: As described in Section §2, we compare our proposed method (CRAL) with Uncertainty Sampling (UNS) and Query-by-Committee (QBC). We also compare with a random baseline (RAND) that selects tokens randomly from the unlabeled data $D$. For QBC, we use the following committee of models: $C = \{\theta_{fwd}, \theta_{bwd}, \theta_{full}\}$, where the $\theta_i$ are the CVT views (§4.2). We do not include $\theta_{fut}$ and $\theta_{pst}$ as they are much weaker in comparison to the other views.6 For CRAL, UNS, and RAND, we use the full model view.
Model Hyperparameters: We use a hidden size of 25 for the character Bi-LSTM, 100 for the modeling layer, and 200 for the token-level Bi-LSTM. Character embeddings are 30-dimensional and are randomly initialized. We apply a dropout of 0.3 to the character embeddings before inputting them to the Bi-LSTM. A further dropout of 0.5 is applied to the output vectors of all Bi-LSTMs. The model is trained using the SGD optimizer with a learning rate of 0.015. The model is trained until convergence over a validation set.
Active Learning Parameters: For all AL
methods, we acquire annotations in batches of
5https://www.isi.edu/~ulf/uroman.html.
6We chose CVT views for QBC over the ensemble for computational reasons. Training three models independently would require three times the computation. Given that for each language we run 20 experiments, amounting to a total of 120 experiments, reducing the computational burden was preferred.
Figure 3: Our method (CRAL) outperforms existing AL methods for all six languages. The y-axis is the difference in POS accuracy between CRAL and other AL methods, averaged across 20 iterations with batch size 50.
Figure 4: Comparison of the POS performance across the different methods for 20 AL iterations for German.
50 and run 20 simulation experiments, resulting in a total of 1000 tokens annotated for each method. We pre-train the model using the above parameters and, after acquiring annotations, we fine-tune it with a learning rate proportional to the number of sentences in the labeled data, $lr = 2.5 \times 10^{-5}\,|X_{\mathrm{LABEL}}|$.
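As a worked example of these settings, the sketch below tabulates the annotation budget and the fine-tuning learning rate per iteration. It assumes that $|X_{\mathrm{LABEL}}|$ grows by one batch of 50 per iteration, which holds only if every annotated token comes from a distinct sentence; the helper names are hypothetical.

```python
BATCH_SIZE = 50        # tokens annotated per AL iteration
NUM_ITERATIONS = 20    # AL iterations per method
BASE_LR = 2.5e-5       # proportionality constant of the fine-tuning rate


def fine_tune_lr(num_labeled):
    """Learning rate proportional to the size of the labeled set."""
    return BASE_LR * num_labeled


def schedule():
    """Return (iteration, labeled count, learning rate) for every iteration.

    Assumption: the labeled set grows by exactly BATCH_SIZE per iteration.
    """
    return [
        (k, k * BATCH_SIZE, fine_tune_lr(k * BATCH_SIZE))
        for k in range(1, NUM_ITERATIONS + 1)
    ]
```

Under this assumption the rate grows linearly from $1.25 \times 10^{-3}$ after the first batch to $2.5 \times 10^{-2}$ once all 1000 tokens are annotated.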
5.1 Results
Figure 3 compares our proposed CRAL strategy with the existing baselines. The y-axis represents the difference in POS tagging performance between two AL methods, measured by accuracy and averaged across 20 iterations. Across all six languages, our proposed method CRAL shows significant performance gains over the other methods. In Figure 4 we plot the individual accuracy values across the 20 iterations for German, and we see that our proposed method CRAL performs consistently better across multiple iterations. We also see that the zero-shot model on German (iteration 0) gets a decent warm start because of cross-lingual transfer from Dutch and English.
Furthermore, to check how the performance of the AL methods is affected by the underlying POS tagger architecture, we conduct additional
TARGET LANGUAGE   CRAL     UNS      QBC
German            0.0465   0.0801   0.0849
Swedish           0.0811   0.1196   0.1013
North Sami        0.0270   0.0328   0.0346
Persian           0.0384   0.0583   0.0444
Galician          0.0722   0.0953   0.0674
Ukrainian         0.0770   0.1067   0.0665
Table 4: Wasserstein distance between the output tag distributions of the selected data and the gold data (lower is better). The above results are after 200 annotated tokens, i.e., four AL iterations.
types that can have multiple POS tags based on context. We compute the Wasserstein distance (a metric for the distance between two probability distributions) between the annotated tag distribution and the true tag distribution for each word type $z$:
$$WD(z) = \sum_{j \in J_z} \left| p^{AL}_j(z) - p^{*}_j(z) \right|,$$
where $J_z$ is the set of output tags for a word type $z$ in the selected active learning data, $p^{AL}_j(z)$ denotes the proportion of tokens annotated with tag $j$ in the selected data, and $p^{*}_j(z)$ is the proportion of tokens having tag $j$ in the entire gold data. A lower Wasserstein distance suggests higher similarity between the selected tag distribution and the true tag distribution. Given that each iteration selects unique tokens, this distance can be computed after each of $n$ iterations. Table 4 shows that our proposed strategy CRAL selects data that closely matches the gold data distribution for four out of the six languages.
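For a single word type, this quantity can be computed as in the sketch below, which assumes the per-tag differences enter as absolute values and that tags are given as plain lists of labels. The function name is hypothetical, and how per-type values are combined into the per-language numbers of Table 4 is not specified here, so the sketch stops at one word type.

```python
from collections import Counter


def tag_distribution_distance(selected_tags, gold_tags):
    """Sum over tags in the selected data of |p_AL_j(z) - p*_j(z)|.

    selected_tags: tags annotated for one word type in the selected data.
    gold_tags: tags of all occurrences of that word type in the gold data.
    """
    selected = Counter(selected_tags)
    gold = Counter(gold_tags)
    n_selected = sum(selected.values())
    n_gold = sum(gold.values())
    # J_z: only tags that occur for this type in the selected data.
    return sum(
        abs(selected[tag] / n_selected - gold[tag] / n_gold)
        for tag in selected
    )
```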
How effective is the AL method in reducing confusion across iterations? Across iterations, as more data is acquired, we expect the incorrect predictions from the previous iterations to be rectified in the subsequent iterations, ideally without damaging the accuracy of existing predictions. However, as seen in Table 3, the AL methods have a tendency to select syncretic word types, which suggests that across multiple iterations the same word types could get selected, albeit under a different context. This could lead to more confusion, thereby damaging the existing accuracy if the selected type is not a good representative of its annotated tag. Therefore, we calculate the number of existing correct
Figure 5: Comparing the difference in POS performance across the AL methods with the BRNN/MLP architecture, averaged across 20 iterations.
TARGET LANGUAGE    UNS     QBC     CRAL
German             74 %    76 %    82 %
Swedish            56 %    54 %    62 %
North-Sami         10 %    12 %    14 %
Persian            50 %    46 %    46 %
Galician           40 %    42 %    44 %
Ukrainian          38 %    48 %    48 %
Table 3: Percentage of syncretic word types in the first iteration of active learning (consisting of 50 types).
experiments with a different architecture. We replace the CRF layer with a linear layer and use token-level softmax to predict the tags, keeping the encoder as before. We present the results for four (North Sami, Swedish, German, Galician) of the six languages in Figure 5. Our proposed method CRAL still always outperforms QBC. We observe that only for North Sami does UNS outperform CRAL, which is similar to the results obtained from the BRNN/CRF architecture, where CRAL performs on par with UNS.
5.2 Analysis
In the previous section, we compared the different AL methods by measuring the average POS accuracy. In this section, we perform intrinsic evaluation to compare the quality of the selected data on two aspects:
How similar are the selected and the true data distributions? To measure this similarity, we compare the output tag distribution for each word type in the selected data with the tag distribution in the gold data. This evaluation is necessary because there are a significant number of syncretic word types in the selected data, as seen in Table 3. To recap, syncretic word types are word
Figure 6: Confusion score measures the percentage of correct predictions in the first iteration which were incorrectly predicted in the second iteration. Lower values suggest that the selected annotations in the subsequent iterations cause less damage on the model trained on the existing annotations.
Figure 7: In the oracle setting, our method (CRAL-ORACLE) outperforms UNS-ORACLE and QBC-ORACLE in most cases, while the non-oracle CRAL matches the performance of its oracle counterpart. The y-axis measures the difference in average accuracy across 20 iterations between the methods being compared.
predictions that were incorrectly predicted in the subsequent iteration, and present the results in Figure 6. A lower value suggests that the AL method was effective in improving overall accuracy without damaging the accuracy from existing annotations, and thereby was successful in reducing confusion. As Figure 6 shows, the proposed strategy CRAL is clearly more effective than the others at reducing confusion across iterations in most cases.
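The confusion score described above can be sketched in a few lines of Python. This is only an illustration under the assumption that the same evaluation tokens are tagged by the models of two consecutive iterations; the function name `confusion_score` and the list-based inputs are hypothetical.

```python
def confusion_score(gold, prev_pred, next_pred):
    """Percentage of tokens predicted correctly in the earlier iteration
    that are predicted incorrectly in the later iteration."""
    if not (len(gold) == len(prev_pred) == len(next_pred)):
        raise ValueError("all three label sequences must have the same length")
    # Indices of tokens the earlier model got right.
    was_correct = [i for i, (g, p) in enumerate(zip(gold, prev_pred)) if g == p]
    if not was_correct:
        return 0.0
    # Of those, count the ones the later model now gets wrong.
    broken = sum(1 for i in was_correct if next_pred[i] != gold[i])
    return 100.0 * broken / len(was_correct)
```

With four tokens of which three were correct in the earlier iteration and one of those three is wrong in the later one, the score is one third, i.e., about 33.3%.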
5.3 Oracle Results
In order to check how close to optimal our
proposed method CRAL is, we conduct ‘‘oracle’’
comparisons, where we have access to the gold
labels during data selection. The oracle versions
of existing methods UNS-ORACLE and QBC-ORACLE
have already been described in Section §2. For our
proposed method CRAL, we construct the oracle
version as follows:
CRAL-ORACLE: Select the types having the highest number of incorrect predictions. Within each type, select the output tag with the most incorrect predictions. This gives the most confusing output tag for a given word type. From the tokens having the most confusing output tag, select the token representative by taking the centroid of their respective contextualized representations, similar to the procedure described in Section 3.
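The three steps of CRAL-ORACLE can be sketched as follows. This is a minimal illustration rather than the released implementation, and it makes two assumptions that the description leaves open: the "most confusing output tag" is taken to be the wrongly predicted tag that occurs most often for the type, and the representative is the token whose representation lies closest (in Euclidean distance) to the unweighted centroid. All names (`cral_oracle_select`, `reps`, `k`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cral_oracle_select(words, gold, pred, reps, k):
    """words, gold, pred: parallel lists over tokens; reps: one vector per token.
    Returns up to k tuples (word_type, most_confusing_tag, token_index)."""
    # Step 1: collect misclassified tokens per word type (needs gold labels).
    wrong_by_type = defaultdict(list)
    for i, (w, g, p) in enumerate(zip(words, gold, pred)):
        if g != p:
            wrong_by_type[w].append(i)
    top_types = sorted(
        wrong_by_type, key=lambda w: len(wrong_by_type[w]), reverse=True
    )[:k]

    selected = []
    for w in top_types:
        idxs = wrong_by_type[w]
        # Step 2 (assumed reading): the wrong tag predicted most often.
        confusing_tag = Counter(pred[i] for i in idxs).most_common(1)[0][0]
        members = [i for i in idxs if pred[i] == confusing_tag]
        # Step 3: centroid of the members' representations, then nearest token.
        centroid = [
            sum(col) / len(members) for col in zip(*(reps[i] for i in members))
        ]
        best = min(members, key=lambda i: math.dist(reps[i], centroid))
        selected.append((w, confusing_tag, best))
    return selected
```

The non-oracle CRAL additionally weights the representations by model confidence (Section 3); that weighting is omitted here because the oracle variant has direct access to the errors.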
Figure 7 compares the performance gain of the POS model trained using CRAL-ORACLE over UNS-ORACLE and QBC-ORACLE (Figure 7.a, 7.b). Even under the ‘‘oracle’’ setting, our proposed method performs consistently better across all languages (except Ukrainian), unlike the existing methods, as seen in Figure 2. CRAL closely matches the performance of its corresponding ‘‘oracle’’ CRAL-ORACLE (Figure 7.c), which suggests that the proposed method is close to an optimal AL method. However, we note that CRAL-ORACLE is not a ‘‘true’’ upper bound, as for Ukrainian it does not outperform CRAL. We find that for Ukrainian, up to 250 tokens, the oracle method outperforms the non-oracle method, after which it underperforms. We hypothesize that this inconsistency is due to noisy annotations in Ukrainian. On analysis, we found that the oracle method predicts numerals as NUM, but in the gold data some of them are annotated as ADJ. We also find several tokens to have punctuation and numbers mixed with the letters.7
In order to verify whether CRAL is accurately selecting data at near-oracle levels, we analyze the intermediate steps leading to the data selection. For each selected word type $z \in X_{\mathrm{LABEL}}$, we analyze how well our proposed method of weighting encoder representations with the model confidence of the most confused tag and taking the centroid actually succeeds at ‘‘representative’’ token selection. If this is indeed the case, tokens in the vicinity of the centroid should also have the same ‘‘most confused tag’’ as their predicted label and thereby be misclassified instances. To verify this hypothesis, we compare how many of the 100 tokens closest to the centroid in the representation space, $X_{NN}(z)$, are truly misclassified. This score is given by p(z) for each selected word type z:
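\[
X_{NN}(z) = \operatorname*{b\text{-}arg\,min}_{x_{i,t} = z \,\in\, D} \left| c_{i,t} - c_z \right|
\]
\[
p(z) = \frac{\left| \{ x_{i,t} \in X_{NN}(z) : \hat{y}_{i,t} \neq y^{*}_{i,t} \} \right|}{\left| X_{NN}(z) \right|}
\]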
7 This is also noted on the UD page: https://universaldependencies.org/treebanks/uk_iu/index.html.
EXPERIMENT SETTING        CVT    SCE       ACCURACY
EN + NO → EN              −      0.0190    95.53
EN + NO → EN              +      0.0174    95.58
EN + NO + DE-200 → DE     −      0.1658    69.90
EN + NO + DE-200 → DE     +      0.1391    74.61
Table 5: Evaluating the effect of CVT across two experimental settings. EN: English, NO: Norwegian, DE-200: 200 German annotations. Left of ‘‘→’’ are the pre-training languages and right of ‘‘→’’ is the language on which the model is evaluated. Accuracy measures the POS model performance (higher is better) and SCE measures the model calibration (lower is better).
are placed in bin 1, the next 10% in bin 2, and so on. We conduct two ablation experiments to measure the effect of CVT. First, we train a joint POS model on English and Norwegian datasets using all available training data, and evaluate on the English test set. Second, we use this pre-trained model and fine-tune on 200 randomly sampled German annotations and evaluate on German test data. We train models with and without CVT, denoted by +/− in Table 5. We find that training with CVT results in both higher accuracy and lower calibration error (SCE). This effect of CVT is much more pronounced in the second experiment, which presents a low-resource scenario and is common in an AL framework.
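The binning procedure described in this section can be illustrated with a short Python sketch. It follows the prose above (per-tag sorting, bins holding equal shares of the sorted predictions, and a plain average over bins), which is an assumption about the exact variant; it is not the reference implementation of Nixon et al. (2019), and the name `static_calibration_error` and its arguments are hypothetical.

```python
def static_calibration_error(probs, labels, num_bins=10):
    """probs: list of per-token probability vectors (N x K);
    labels: list of N gold class indices.
    Returns the mean |accuracy - confidence| over all non-empty (class, bin) cells."""
    if not probs:
        raise ValueError("need at least one prediction")
    n, k = len(probs), len(probs[0])
    total, used_bins = 0.0, 0
    for c in range(k):
        # Sort tokens by the probability assigned to class c.
        order = sorted(range(n), key=lambda i: probs[i][c])
        for b in range(num_bins):
            # The b-th bin holds the next 1/num_bins share of the sorted tokens.
            chunk = order[b * n // num_bins:(b + 1) * n // num_bins]
            if not chunk:
                continue
            conf = sum(probs[i][c] for i in chunk) / len(chunk)
            acc = sum(1 for i in chunk if labels[i] == c) / len(chunk)
            total += abs(acc - conf)
            used_bins += 1
    return total / used_bins
```

A model that assigns probability 1 to the correct tag everywhere scores 0, while a model that always outputs a uniform distribution over two tags on data with a single gold tag scores 0.5.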
6 Human Annotation Experiment
In this section, we apply our proposed approach on Griko, an endangered language spoken by around 20,000 people in southern Italy, in the Grecìa Salentina area southeast of Lecce. The only available online Griko corpus, referred to as UoI (Lekakou et al., 2013),8 consists of 330 utterances by nine native speakers having POS annotations. Additionally, Anastasopoulos et al. (2018) collected, processed, and released 114 stories, of which only the first 10 stories were annotated by experts and have gold-standard annotations. We conduct human annotation experiments on the remaining unannotated stories in order to compare the different active learning methods.
Setup: We use Modern Greek and Italian as
the two related languages to train our initial
8 http://griko.project.uoi.gr.
Figure 8: We report the mean and median of p over all the 50 token-tag pairs selected by the first AL iteration of CRAL. We see that across all languages the majority of the token-tag pairs satisfy the criterion of using weighted representations with centroid for token selection.
where b = 100, $c_z$ is the contextualized representation of the representative instance for the word type z (namely, the centroid), and $c_{i,t}$ is the contextualized representation of z’s token instance $x_{i,t}$. $y^{*}_{i,t}$ and $\hat{y}_{i,t}$ are the true and predicted labels of $x_{i,t}$. We report the mean and median of p across all the selected tokens of the first AL iteration in Figure 8. We see that for all languages the median is high (i.e., > 0.8), which suggests that the majority of the token-tag pairs satisfy this criterion, thus supporting the step of weighting the token representations and choosing the centroid for annotation.
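To make the score p(z) concrete, the following minimal Python sketch takes the token instances of one word type, finds the b tokens nearest to the centroid, and returns the fraction of them that are misclassified. Euclidean distance is assumed for the distance between representations, and the function and argument names are hypothetical.

```python
import math


def neighbourhood_misclassification(reps, gold, pred, centroid, b=100):
    """reps, gold, pred: parallel lists over the token instances of one word type z.
    Returns p(z): the share of the b tokens nearest to the centroid whose
    predicted label differs from the gold label."""
    # X_NN(z): indices of the b instances closest to the centroid.
    order = sorted(range(len(reps)), key=lambda i: math.dist(reps[i], centroid))
    nearest = order[:b]
    if not nearest:
        raise ValueError("word type has no token instances")
    wrong = sum(1 for i in nearest if pred[i] != gold[i])
    return wrong / len(nearest)
```

A value close to 1 means that the neighbourhood of the centroid consists almost entirely of misclassified tokens, which is what the centroid-based selection step relies on.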
We also compare the percentage of token-tag overlap between the data selected by CRAL and by its oracle counterpart, CRAL-ORACLE. For the first AL iteration, the proposed method CRAL has more than 50% overlap with the oracle method for all languages, providing some evidence as to why CRAL is matching the oracle performance.
5.4 Effect of Cross-View Training
As mentioned in Section 4.2, we use CVT not only to improve our model overall but also to obtain a well-calibrated model, which can be important for active learning. A model is well-calibrated when its predicted probabilities over the outcomes reflect the true probabilities over these outcomes (Nixon et al., 2019). We use Static Calibration Error (SCE), a metric proposed by Nixon et al. (2019), to measure the model calibration. SCE bins the model predictions separately for each output tag probability and computes the calibration error within each bin, which is averaged across all the bins to produce a single score. For each output tag, bins are created by sorting the predictions based on the output class probability. Hence, the first 10%
                      AL      ITERATION-0    ITERATION-1    ITERATION-2    ITERATION-3    IA AGR.    WD
Linguist-1            CRAL    52.93          63.42 (10)     69.07 (10)     65.16 (16)     0.58       0.281
Linguist-1            QBC     52.93          55.82 (15)     62.03 (17)     66.51 (15)     0.68       0.243
Linguist-1            UNS     52.93          56.14 (15)     57.04 (15)     65.73 (11)     0.58       0.379
Linguist-2            CRAL    52.93          61.24 (15)     67.24 (20)     67.05 (18)     0.70       0.346
Linguist-2            QBC     52.93          56.52 (20)     65.96 (20)     66.71 (17)     0.72       0.245
Linguist-2            UNS     52.93          55.45 (17)     58.80 (17)     65.73 (20)     0.70       0.363
Linguist-3 (Expert)   CRAL    52.93          65.63          69.17          68.09          −          0.159
Linguist-3 (Expert)   QBC     52.93          60.50          65.69          56.20          −          0.170
Linguist-3 (Expert)   UNS     52.93          58.51          64.26          65.93          −          0.125
Table 6: Griko test set POS accuracy after each AL annotation iteration. Each iteration consists of 50 token-level annotations. The number in parentheses is the time in minutes required for annotation. The IA AGR. column reports the inter-annotator agreement against the expert linguist for the first iteration. WD is the Wasserstein distance between the selected tokens and the test distribution.
POS model.9 To further improve the model, we fine-tune on the UoI corpus, which consists of 360 labeled sentences. We evaluate the AL performance on the 10 gold-labelled stories from Anastasopoulos et al. (2018), of which the first two stories, comprising 143 labeled sentences, are used as the validation set and the remaining 800 labeled sentences form the test set. We use the unannotated stories as our unlabeled pool. We compare CRAL with UNS and QBC, conducting three AL iterations for each method, where each iteration selects only 50 tokens for annotation. The annotations are provided by two linguists familiar with modern Greek and somewhat familiar with Griko. To familiarize the linguists with the annotation interface, a practice session was conducted in modern Greek. In the interface, tokens that need to be annotated are highlighted and presented with their surrounding context. The linguist then simply selects the appropriate POS tag for each highlighted token. Because we do not have gold annotations for these experiments, we also obtained annotations from a third linguist who is more familiar with Griko grammar.
Results: Table 6 presents the results on three iterations for each AL method, with our proposed method CRAL outperforming the other methods in most cases. We note that we found several frequent tokens (i.e., 863/13,740 tokens) in the supposedly gold-standard Griko test data to be inconsistently annotated. Specifically, the original annotations did not distinguish between coordinating (CCONJ) and subordinating conjunctions (SCONJ), unlike the UD schema. As a result, when converting the test data to the UD schema, all conjunctions were tagged as subordinating ones. Our annotation tool, however, allowed for either CCONJ or SCONJ as tags, and the annotators did make use of them. With the help of a senior Griko linguist (Linguist-3), we identified a few types of conjunctions that are always coordinating: variations of ‘‘and’’ (ce and c’), and of ‘‘or’’ (e or i). We fixed these annotations and used them in our experiments.
9 With Italian being the dominant language in the region, code switching phenomena appear in the Griko corpora.
For Linguist-1, we observe a decrease in performance in Iteration-3. One possible reason for this decrease is Linguist-1’s poor annotation quality, which is also reflected in their low inter-annotator agreement scores. We observe a slight decrease for other linguists, which we hypothesize is due to domain mismatch between the annotated data and the test data. In fact, the test set stories and the unlabeled ones originate from different time periods spanning a century, which can lead to slight differences in orthography and usage. For example, after three AL iterations, the token ‘‘i’’ had been annotated as CONJ twice and DET once, whereas in the test data all instances of ‘‘i’’ are annotated as DET. Similar to the simulation experiments, we compute the confusion score for all linguists in Figure 9. We find that, unlike in the simulation experiments, a model trained with UNS causes less damage on the existing annotations as compared to CRAL. However, we note that the model performance from the UNS annotations is much lower than CRAL to begin with.
              CRAL    UNS    QBC
Linguist-1    90      95     72
Linguist-2    84      88     72
Linguist-3    74      90     61
Table 7: Each cell denotes the number of annotated tokens that are also present in the test data.
better tagging accuracy because the WD metric is
only informing us how good an AL strategy is in
selecting data that matches closely the gold output
tag distribution for that selected data.
7 Related Work
Active Learning for POS Tagging: AL has been widely used for POS tagging. Garrette and Baldridge (2013) use graph-based label propagation to generalize initial POS annotations to the unlabeled corpus. Further, they find that under a constrained time setting, type-level annotations prove to be more useful than token-level annotations. In line with this, Fang and Cohn (2017) also select informative word types based on uncertainty sampling for low-resource POS tagging. They also construct a tag dictionary from these type-level annotations and then propagate the labels across the entire unlabeled corpus. However, in our initial analysis on uncertainty sampling, we found that adding label propagation harmed the accuracy in certain languages because of prevalent syncretism. Ringger et al. (2007) present different variations of uncertainty sampling and QBC methods for POS tagging. Similar to Fang and Cohn (2017), they find uncertainty sampling with frequency bias to be the best strategy. Settles and Craven (2008) present a survey of the different active learning strategies for sequence labeling tasks, and Marcheggiani and Artières (2014) discuss the strategies for acquiring partially labeled data. Sener and Savarese (2018) propose a core-set selection strategy aimed at finding the subset that is competitive across the unlabeled dataset. This work is most similar to ours with respect to using geometric center points as being the most representative. However, to the best of our knowledge, none of the existing works are targeted at reducing confusion within the output classes.
Figure 9: Confusion scores for the three Griko linguists. Lower values suggest that the selected annotations in the subsequent iterations cause less damage on the model trained on existing annotations.
We also compute the inter-annotator agreement in Iteration-1 with the expert (Linguist-3) (Table 6). We find that the agreement scores are lower than one would expect (cf. the annotation test run on modern Greek, for which we have gold annotations, yielded much higher inter-annotator agreement scores, over 90%). The explanation probably lies in our annotators having limited knowledge of Griko grammar, while our AL methods require annotations for ambiguous and ‘‘hard’’ tokens. However, this is a common scenario in language documentation, where linguists are often required to annotate in a language they are not very familiar with, which makes this task even more challenging. We also recorded the annotation time needed by each linguist for each iteration in Table 6. Compared with the UNS method, the linguists annotated on average 2.5 minutes faster using our proposed method, which suggests that UNS tends to select harder data instances for annotation.
Similar to the simulation experiments, we report the Wasserstein distance (WD) for all linguists in Table 6. However, unlike in the simulation setting, where the WD was computed with the gold training data, for the human experiments we do not have access to the gold annotations and therefore computed WD with the gold test data, which, however, is from a slightly different domain, and this affects the results somewhat. We observe that QBC has lower WD scores for Linguist-1 and Linguist-2 and UNS for Linguist-3. On further analysis, we find that even though QBC has lower WD, it also has the least coverage of the test data: it has the fewest annotated tokens that are present in the test data, as shown in Table 7. We would like to note that a lower WD score does not necessarily translate to
Low-resource POS Tagging: Several cross-lingual transfer techniques have been used for improving low-resource POS tagging. Cotterell and Heigold (2017) and Malaviya et al. (2018) train a joint neural model on related high-resource languages and find it to be very effective on low-resource languages. The main advantage of these methods is that they do not require any parallel text or dictionaries. Das and Petrov (2011), Täckström et al. (2013), Yarowsky et al. (2001), and Nicolai and Yarowsky (2019) use annotation projection methods to project POS annotations from one language to another. However, annotation projection methods use parallel text, which often might not be of good quality for low-resource languages.
8 Conclusion
We have presented a novel active learning method for low-resource POS tagging that works by reducing confusion between output tags. Using simulation experiments across six typologically diverse languages, we show that our confusion-reducing strategy achieves higher accuracy than existing methods. Further, we test our approach under a true setting of active learning where we ask linguists to document POS information for an endangered language, Griko. Despite the annotators being unfamiliar with the language, our proposed method achieves performance gains over the other methods in most iterations. For our next steps, we plan to explore the possibility of adapting our proposed method for complete morphological analysis, which poses an even harder challenge for AL data selection due to the complexity of the task.
Acknowledgments
The authors are grateful to the anonymous reviewers and the Action Editor who took the time to provide many interesting comments that made the paper significantly better, and to Eleni Antonakaki and Irini Amanaki, for participating in the human annotation experiments. This work is sponsored by the Dr. Robert Sansom Fellowship, the Waibel Presidential Fellowship, and the National Science Foundation under grant 1761548.
References
Antonios Anastasopoulos, Marika Lekakou, Josep Quer, Eleni Zimianiti, Justin DeBenedetto, and David Chiang. 2018. Part-of-speech tagging on an endangered language: A parallel griko-italian resource. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2529–2539, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Ankita and K. A. Abdul Nazeer. 2018. Part-of-Speech Tagging and Named Entity Recognition using Improved Hidden Markov Model and Bloom Filter. In 2018 International Conference on Computing, Power and Communication Technologies (GUCON), pages 1072–1077. IEEE.
Kedar Bellare and Andrew McCallum. 2007. Learning extractors from unlabeled text using relevant databases. In Sixth International Workshop on Information Integration on the Web.
Bernd Bohnet, Ryan McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler, and Joshua Maynez. 2018. Morphosyntactic Tagging with a meta-BiLSTM Model over Context Sensitive Token Encodings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2642–2652, Melbourne, Australia. Association for Computational Linguistics. DOI: https://www.aclweb.org/anthology/P18-1246
Aditi Chaudhary, Jiateng Xie, Zaid Sheikh, Graham Neubig, and Jaime Carbonell. 2019. A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5164–5174, Hong Kong, China. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1520
Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-Supervised
Sequence Modeling with Cross-View Training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1914–1925, Brussels, Belgium. Association for Computational Linguistics.
Ryan Cotterell and Georg Heigold. 2017. Cross-Lingual Character-Level Neural Morphological Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 748–759, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1078
Dipanjan Das and Slav Petrov. 2011. Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 600–609, Portland, Oregon, USA. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Meng Fang and Trevor Cohn. 2017. Model Transfer for Tagging Low-Resource Languages using a Bilingual Dictionary. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 587–593, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-2093
Dan Garrette and Jason Baldridge. 2013. Learning a Part-of-Speech Tagger from Two Hours of Annotation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 138–147, Atlanta, Georgia. Association for Computational Linguistics.
Harald Hammarström, Robert Forkel, and Martin Haspelmath. 2018. Glottolog 3.3. Max Planck Institute for the Science of Human History. Jena.
Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. 2010. Active Learning by Querying Informative and Representative Examples. In Advances in Neural Information Processing Systems, pages 892–900.
Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for Sequence Tagging. arXiv preprint arXiv:1508.01991.
Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, and Greg Corrado. 2017. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5:339–351. DOI: https://doi.org/10.1162/tacl_a_00065
Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian J. Mielke, Arya McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal Morphology. In Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan. European Language Resource Association.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N16-1030
Marika Lekakou, Valeria Baldissera, and Antonios Anastasopoulos. 2013. Documentation and
analysis of an endangered language: aspects of the grammar of griko.
David D. Lewis. 1995. Evaluating and optimizing autonomous text classification systems. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 246–254. DOI: https://doi.org/10.1145/215206.215366
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing Transfer Languages for Cross-Lingual Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.
Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics.
Chaitanya Malaviya, Matthew R. Gormley, and Graham Neubig. 2018. Neural Factor Graph Models for Cross-Lingual Morphological Tagging. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2653–2663, Melbourne, Australia. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P18-1247
Diego Marcheggiani and Thierry Artières. 2014. An experimental comparison of active learning strategies for partially labeled sequences. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 898–906, Doha, Qatar. Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/D14-1097
Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2018. Marrying Universal Dependencies and Universal Morphology. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 91–101. DOI: https://doi.org/10.18653/v1/W18-6011
Garrett Nicolai and David Yarowsky. 2019. Learning Morphosyntactic Analyzers from the Bible via Iterative Annotation Projection across 26 Languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1765–1774, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1172
Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666.
Jeremy Nixon, Mike Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. 2019. Measuring calibration in Deep Learning. arXiv preprint arXiv:1904.01685.
Eric Ringger, Peter McClanahan, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi, and Deryle Lonsdale. 2007. Active Learning for part-of-speech Tagging: Accelerating corpus annotation. In Proceedings of the Linguistic Annotation Workshop, pages 101–108. DOI: https://doi.org/10.3115/1642059.1642075
Ozan Sener and Silvio Savarese. 2018. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In International Conference on Learning Representations.
Burr Settles. 2009. Active Learning literature survey. University of Wisconsin-Madison Department of Computer Sciences.
Burr Settles and Mark Craven. 2008. An Analysis of Active Learning Strategies for Sequence Labeling Tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural
Language Processing, pages 1070–1079, Honolulu, HI. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1613715.1613855
Aditya Siddhant, Ankur Bapna, Henry Tsai, Jason Riesa, Karthik Raman, Melvin Johnson, Naveen Ari, and Orhan Firat. 2020. Evaluating the Cross-lingual Effectiveness of Massively Multilingual Neural Machine Translation. DOI: https://doi.org/10.1609/aaai.v34i05.6414
Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 1:1–12. DOI: https://doi.org/10.1162/tacl_a_00205
Joakim Nivre, Rogier Blokland, Niko Partanen, Michael Rießler, and Jack Rueter. 2018. Universal Dependencies 2.3.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Zhe Wang, Xiaoyi Liu, Limin Wang, Yu Qiao, Xiaohui Xie, and Charless Fowlkes. 2018. Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1888–1896, IEEE. DOI: https://doi.org/10.1109/WACV.2018.00209
Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv preprint arXiv:1703.06345.
David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research, pages 1–8. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1072133.1072187
Imed Zitouni and Radu Florian. 2008. Mention detection crossing the language barrier. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 600–609. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1613715.1613789