Partially Supervised Named Entity Recognition
via the Expected Entity Ratio Loss
Thomas Effland
Columbia University, USA
teffland@cs.columbia.edu
Michael Collins
Google Research, USA
mjcollins@google.com
Abstract
We study learning named entity recognizers in the presence of missing
entity annotations. We approach this setting as tagging with latent
variables and propose a novel loss, the Expected Entity Ratio, to learn
models in the presence of systematically missing tags. We show that our
approach is both theoretically sound and empirically useful.
Experimentally, we find that it meets or exceeds the performance of
strong and state-of-the-art baselines across a variety of languages,
annotation scenarios, and amounts of labeled data. In particular, we
find that it significantly outperforms the previous state-of-the-art
methods from Mayhew et al. (2019) and Li et al. (2021) by +12.7 and +2.3
F1 score in a challenging setting with only 1,000 biased annotations,
averaged across 7 datasets. We also show that, when combined with our
approach, a novel sparse annotation scheme outperforms exhaustive
annotation for modest annotation budgets.1
1 Introduction
Named entity recognition (NER) is a critical subtask of many
domain-specific natural language understanding tasks in NLP, such as
information extraction, entity linking, semantic parsing, and question
answering. For large, exhaustively annotated benchmark datasets, this
problem has been largely solved by fine-tuning of high-capacity
pretrained sentence encoders from massive-scale language modeling tasks
(Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019). However,
fully annotated datasets themselves are expensive to obtain at scale,
creating a barrier to rapid development of models in low-resource
situations.
Partial annotations, in contrast, may be much cheaper to obtain. For
example, when building a dataset for a new entity extraction task, a
domain expert may be able to annotate entity spans with high precision
at a lower recall by scanning through documents inexhaustively, creating
a higher diversity of contexts and surface forms by limiting the amount
of time spent on individual documents. In another scenario studied by
Mayhew et al. (2019), non-speaker annotators for low-resource languages
may only be able to recognize some of the more common entities in the
target language, but will miss many less common ones.
1We have published our implementation and experimental results at
https://github.com/teffland/ner-expected-entity-ratio.
In both of these situations, we wish to leverage partially annotated
training data with high precision but low recall for entity spans.
Because of the low recall, unannotated tokens are ambiguous and it is
not reasonable to assume they are non-entities (the O tag). We give an
example of this in Figure 1.
We address the problem of training NER taggers with partially labeled,
low-recall data by treating unannotated tags as latent variables for a
discriminative tagging model. We propose to combine marginal tag
likelihood training (Tsuboi et al., 2008) with a novel discriminative
criterion, the Expected Entity Ratio (EER), to control the relative
proportion of entity tags in the sentence. The proposed loss (1) can
flexibly incorporate prior knowledge about expected entity rates under
uncertainty; (2) theoretically recovers the true tagging distribution
under mild conditions; and (3) is easy to implement, fast to compute,
and amenable to standard gradient-based optimization. We evaluate our
method across 7 corpora in 6 languages along two diverse low-recall
annotation scenarios, one of which we introduce. We show that our method
performs as well as or better than the previous state-of-the-art methods
from Mayhew et al. (2019) and the recent work of Li et al. (2021) across
the studied languages, scenarios, and amounts of labeled entities.
Further, we show that our novel partial annotation scheme, when combined
with our method, outperforms exhaustive annotation for modest annotation
budgets.
Transactions of the Association for Computational Linguistics, vol. 9, pp. 1320–1335, 2021. https://doi.org/10.1162/tacl_a_00429
Action Editor: Noah Smith. Submission batch: 3/2021; Revision batch: 6/2021; Published 12/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: An example low-recall sentence with two entities (one is
missing) and its NER tags. The Gold row shows the true tags, the Raw row
shows a false negative induced by the standard "tokens without entity
annotations are non-entities" assumption, and the Latent row reflects
our view of unannotated tags as latent variables.
2 Related Works
A common paradigm for low-recall NER is automatically creating
silver-labeled data using outside resources. Bellare and McCallum (2007)
approach the problem by distantly supervising spans using a database
with marginal tag training. Carlson et al. (2009) similarly use a
gazetteer and adapt the structured perceptron (Collins, 2002) to handle
partially labeled sequences, while Nothman et al. (2008) use Wikipedia
and label-propagation heuristics. Peng et al. (2019) also use distant
supervision to silver-label entities, but use PU-learning with specified
class priors to estimate individual classifiers with ad-hoc decoding.
Yang et al. (2018) and Nooralahzadeh et al. (2019) optimize the marginal
likelihood (Tsuboi et al., 2008) of the distantly annotated tags but
require gazetteers and some fully labeled data to handle proper
prediction of the O tag. Greenberg et al. (2018) use a marginal
likelihood objective to pool overlapping NER tasks and datasets, but
must exploit cross-dataset constraints. Snorkel (Ratner et al., 2020)
uses many sources of weak supervision, but relies on high recall and
overlap to work. In contrast to these works, we do not use outside
resources.
Our problem setting has connections to PU-learning, which is classically
an approach to classification (Liu et al., 2002, 2003; Elkan and Noto,
2008; Grave, 2014), but here we work with tagging structures. Our
approach is also related to constraint-satisfaction methods for shaping
the model distribution such as CoDL (Chang et al., 2007), used by Mayhew
et al. (2019), and to Posterior Regularization (Ganchev et al., 2010),
the main differences being that we do not use the KL-divergence and use
gradient-based updates to a nonlinear model instead of closed-form
updates to a log-linear model.
The problem setup from Jie et al. (2019) and Mayhew et al. (2019) is the
same as ours, but Jie et al. (2019) use a cross-validated self-training
approach and Mayhew et al. (2019) use an iterative constraint-driven
self-training approach to down-weight possible false-negative O tags,
which they show to outperform Jie et al. (2019). Mayhew et al. (2019) is
the current state of the art on the CoNLL 2003 NER datasets (Tjong Kim
Sang and De Meulder, 2003) and we compare to their work in the
experiments. Recently, Li et al. (2021) have published a span-based
method that uses negative sampling of non-entity spans, but they do not
provide any supporting theoretical guarantees. We also compare to them
in the experiments.
3 Methods
In this section, we describe the proposed approach. We begin with a
description of the problem and notation in § 3.1, followed by the NER
tagging model in § 3.2. We then describe the supervised marginal tag
loss and our proposed auxiliary loss, used for learning on positive-only
annotations, in § 3.3 and § 3.4, respectively. Finally, in § 3.5 we
describe the full objective and give theory showing that our approach
recovers the true tagging distribution in the large-sample limit.
3.1 Problem Setup and Notation
We formulate NER as a tagging problem, as is standard (McCallum and Li,
2003; Lample et al., 2016; Devlin et al., 2019; Mayhew et al., 2019,
inter alia). In fully supervised tagging for NER, we are given an input
sentence $x_{1:n} = x_1 \ldots x_n$, $x_i \in \mathcal{X}$, of length $n$
tokens paired with a sequence $y_{1:n}$, $y_i \in \mathcal{Y}$, of tags
that encode the typed entity spans in the sentence. Following previous
work, we use the BILUO scheme (Ratinov and Roth, 2009). Under this
formulation, a NER dataset of fully annotated sentences is a set of
pairs of token and tag sequences:
\[ \mathcal{D}^m_s = \{(x^k_{1:n_k}, y^k_{1:n_k})\}_{k=1}^{m} \]
3.1.1 Partial Annotations
Typically, fully annotated tag sequences are derived from exhaustive
annotation schemes, where annotators mark all positive entity spans in
the text
and then the filler O tag can be perfectly inferred at
all unannotated tokens. Training a model on such
fully annotated data is easy enough with traditional
maximum likelihood estimation (McCallum and
Li, 2003; Lample et al., 2016).
In many cases, however, it is desirable to be able to learn on
incomplete, partially annotated training data that has high precision
for entity spans, but low recall (§ 4.2 discusses two such scenarios).
Because of the low recall, unannotated tokens are ambiguous and it is
not reasonable to assume they are non-entities (the O tag). Even in this
low-recall situation, prior work (Jie et al., 2019; Mayhew et al., 2019)
assumes that unannotated tokens are given this non-entity tag. Their
approaches then try to estimate which of these tags are "incorrect"
through self-training-like schemes, iteratively down-weighting the
contribution of these noisy tags to the loss with a meta training loop.
In contrast to prior work, we make no direct assumptions about
unannotated tokens and treat all such positions as latent tags. In this
view, a partially annotated sentence is a token sequence $x_{1:n}$
paired with a set of observed (tag, position) pairs. Given a sentence
$x_{1:n}$, we define
\[ y_O \subset \{(y, i) \mid y \in \mathcal{Y},\ 1 \le i \le n\} \]
as the set of observed tags $y$ at positions $i$. For example, in
Figure 1 we would have $y_O = \{(\text{U-ORG}, 7)\}$. Under this
formulation, we will be given a partially observed dataset:
\[ \mathcal{D}^m = \{(x^k_{1:n_k}, y^k_{O_k})\}_{k=1}^{m} \]
We use data of this form for the rest of the work.
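To make the latent-tag view concrete, the following is a minimal sketch of how a set of observed (tag, position) pairs could be turned into a per-position set of admissible tags. The tag inventory, the function name `allowed_tags`, and the sentence length of 9 are illustrative assumptions, not part of the paper or its released code.

```python
from typing import List, Set, Tuple

# Hypothetical tag inventory: a single entity type (ORG) under BILUO.
TAGS = ["O", "B-ORG", "I-ORG", "L-ORG", "U-ORG"]


def allowed_tags(n: int, observed: Set[Tuple[str, int]]) -> List[Set[str]]:
    """Return, for each 1-indexed position, the tags consistent with y_O.

    Observed positions admit only their annotated tag; every other
    position is latent and admits the full tag set.
    """
    allowed = [set(TAGS) for _ in range(n)]
    for tag, pos in observed:
        if not 1 <= pos <= n:
            raise ValueError(f"position {pos} outside sentence of length {n}")
        if tag not in TAGS:
            raise ValueError(f"unknown tag {tag!r}")
        allowed[pos - 1] = {tag}
    return allowed


# One observed tag, mirroring y_O = {(U-ORG, 7)}; the length 9 is assumed.
y_O = {("U-ORG", 7)}
mask = allowed_tags(9, y_O)
```

Here `mask[6]` contains only `U-ORG`, while all other positions remain unconstrained, which is the property the marginal loss in § 3.3 relies on.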
3.2 Tagging Model
We use a simple, relatively off-the-shelf tagging model for
$p(y_{1:n} \mid x_{1:n}; \theta)$. Our model, BERT-CRF, first encodes the
token sequence using a contextual Transformer-based (Vaswani et al.,
2017) encoder, initialized from a pretrained language-model objective
(Devlin et al., 2019; Liu et al., 2019). Given the output
representations from the last layer of the encoder, we then score each
tag individually with a linear layer, as in Devlin et al. (2019).
Finally, we model the distribution $p(y_{1:n} \mid x_{1:n})$ with a
linear-chain CRF (Lafferty et al., 2001), using the individual tag
scores and learned transition parameters $T$ as potentials.
Mathematically, our tagging model is given by:
\[ h_{1:n} = \mathrm{BERT}(x_{1:n}; \theta_{\text{BERT}}) \]
\[ \phi(i, y) = v_y^\top h_i \]
\[ \phi(i, y, y') = \phi(i, y) + T_{y,y'} \]
\[ p(y \mid x) = \frac{\exp\left\{\sum_{i=1}^{n-1} \phi(i, y_i, y_{i+1}) + \phi(n, y_n)\right\}}{Z(\phi)} \]
\[ Z(\phi) = \sum_{y'_{1:n} \in \mathcal{Y}^n} \exp\left\{\sum_{i=1}^{n-1} \phi(i, y'_i, y'_{i+1}) + \phi(n, y'_n)\right\} \]
where $\phi \in \mathbb{R}^{n \times |\mathcal{Y}| \times |\mathcal{Y}|}$ is
the tensor of individual potentials and
$\theta = \{\theta_{\text{BERT}}, T\} \cup \{v_y\}_{y \in \mathcal{Y}}$ is
the full set of model parameters.
A few important things to note: (1) although we call the encoder "BERT",
in practice we utilize various BERT-like pretrained transformer language
models from the HuggingFace Transformers (Huggingface, 2019) library;
(2) we apply grammaticality constraints to the transition parameters $T$
that cause the model to put zero mass on invalid transitions; and (3) we
do not use special start and end states, as pretrained transformers
already bookend the sentence with SOS and EOS tokens that can be assumed
to always be O tags. This, combined with the transition constraints,
guarantees that the tagger outputs valid sequences.
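The grammaticality constraints in note (2) can be sketched as an additive mask over the transition scores. The entity types, the helper `is_valid_transition`, and the dictionary layout below are assumptions chosen for illustration; the paper does not specify how its constraints are implemented.

```python
from itertools import product

TYPES = ["PER", "ORG"]  # hypothetical entity types
TAGS = ["O"] + [f"{p}-{t}" for t in TYPES for p in "BILU"]


def is_valid_transition(prev: str, nxt: str) -> bool:
    """BILUO grammar: B/I must be followed by I/L of the same type;
    O/L/U must be followed by O, B, or U (of any type)."""
    prev_prefix, _, prev_type = prev.partition("-")
    next_prefix, _, next_type = nxt.partition("-")
    if prev_prefix in ("B", "I"):
        return next_prefix in ("I", "L") and next_type == prev_type
    return next_prefix in ("O", "B", "U")


NEG_INF = float("-inf")
# Additive mask for the transition scores T: 0 keeps a transition, while
# -inf gives it zero probability mass after exponentiation.
transition_mask = {
    (a, b): 0.0 if is_valid_transition(a, b) else NEG_INF
    for a, b in product(TAGS, TAGS)
}
```

Because the first and last positions are assumed to be O tags (note (3)), this pairwise mask alone is enough to rule out sequences that open with I/L or end on an unfinished B/I.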
We choose this model architecture because it closely reflects recent
standard practice in applied NER (Devlin et al., 2019; Huggingface,
2019), where a pretrained transformer is fine-tuned to the tagging
dataset. However, we improve this practice by using a CRF layer on top
instead of predicting all tags independently. We stress that the
additional CRF layer has multiple benefits: the transition parameters
and global normalization improve model capacity and, importantly,
prevent invalid predictions. In preliminary experiments, we found that
invalid predictions were common in some of the few-annotation scenarios
we study here.
3.3 Supervised Marginal Tag Loss
We train our tagger on partially annotated data by maximizing the
marginal likelihood (Tsuboi et al., 2008) of the observed tags under the
model:
\[ L_p(\theta; \mathcal{D}^m) = \frac{1}{m} \sum_{(x^k_{1:n_k},\, y^k_{O_k}) \in \mathcal{D}^m} -\log p(y^k_{O_k} \mid x^k_{1:n_k}; \theta) \tag{1} \]
with
\[ \log p(y_O \mid x_{1:n}) = \log \sum_{y_{1:n} \models y_O} p(y_{1:n} \mid x_{1:n}) \tag{2} \]
where $y_{1:n} \models y_O$ means all taggings satisfying the
observations $y_O$. For tree-shaped CRFs, this loss is tractable with
dynamic programming.
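For a linear-chain CRF, the dynamic program behind Eqn. 2 can be sketched as two runs of the forward algorithm: one unconstrained, giving $\log Z$, and one restricted to taggings consistent with $y_O$. The pure-Python code below is an illustrative sketch rather than the released implementation; the names `log_partition` and `marginal_nll`, the integer tag ids, and the 0-indexed `allowed` sets are assumptions.

```python
import math
from typing import List, Optional, Set


def logsumexp(values: List[float]) -> float:
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(unary: List[List[float]], trans: List[List[float]],
                  allowed: Optional[List[Set[int]]] = None) -> float:
    """Forward algorithm over a linear-chain CRF.

    unary[i][y] is the tag score at position i and trans[p][y] the score of
    moving from tag p to tag y. If `allowed` is given, only tags in
    allowed[i] are summed over at position i, which restricts the sum to
    taggings consistent with the observations y_O.
    """
    n, num_tags = len(unary), len(unary[0])

    def ok(i: int, y: int) -> bool:
        return allowed is None or y in allowed[i]

    alpha = [unary[0][y] if ok(0, y) else float("-inf") for y in range(num_tags)]
    for i in range(1, n):
        alpha = [
            unary[i][y] + logsumexp([alpha[p] + trans[p][y] for p in range(num_tags)])
            if ok(i, y) else float("-inf")
            for y in range(num_tags)
        ]
    return logsumexp(alpha)


def marginal_nll(unary, trans, allowed) -> float:
    """-log p(y_O | x) = log Z - log Z_constrained."""
    return log_partition(unary, trans) - log_partition(unary, trans, allowed)
```

When every position is unconstrained, the two partition functions coincide and the loss is zero, which illustrates why this term alone gives the model no reason to predict O.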
Although it is possible to optimize only this loss for the given
partially annotated data, doing so alone has deleterious effects in our
scenario: the resulting model will not learn to meaningfully predict the
O tag, by far the most common tag (Jie et al., 2019), and will thus fail
to have acceptable performance, with high recall at nearly zero
precision. We need another term in the loss to encourage the model to
predict O tags, which we introduce next.
3.4 Expected Entity Ratio Loss
As has been observed in prior work (Augenstein et al., 2017; Peng et
al., 2019; Mayhew et al., 2019), named entity tags (versus O tags) occur
at relatively stable rates over the entire distribution of sentences for
different named entity datasets with the same task specification. For
any specific dataset, we call this proportion the "expected entity
ratio" (EER), which is simply the marginal probability of some tag $y$
being part of an entity span, $p(y \neq O)$. Given an estimate of this
EER, $\rho = p(y \neq O)$, for the dataset in question, we propose to
impose a second loss that directly encourages the tag marginals under
the model to match the given EER, up to a margin of uncertainty
$\gamma$. This loss is given by:
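\[ L_u(\theta; \mathcal{D}^m, \rho, \gamma) = \max\{0,\ |\rho - \hat{\rho}_\theta| - \gamma\} \tag{3} \]
where
\[ \hat{\rho}_\theta = \frac{\sum_{(x^k_{1:n_k},\, y^k_{O_k}) \in \mathcal{D}^m} \mathbb{E}_{p(y^k_{1:n_k} \mid x^k_{1:n_k}; \theta)}\left[\sum_{i=1}^{n_k} \mathbb{1}\{y^k_i \neq O\}\right]}{\sum_{(x^k_{1:n_k},\, y^k_{O_k}) \in \mathcal{D}^m} n_k} \tag{4} \]
is the model's expected rate of entity tags.
For linear-chain CRFs, the inner expected count
\[ \mathbb{E}_{p(y_{1:n} \mid x)}\left[\sum_{i=1}^{n} \mathbb{1}\{y_i \neq O\}\right] = \sum_{i=1}^{n} \sum_{y \in \mathcal{Y} \setminus \{O\}} p(y_i = y \mid x) \]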
can be computed exactly, because it factors over
the model potentials and reduces to a simple sum
over the tag marginals under the model,2 and is
differentiable. The outer expectation is not feasi-
ble for large datasets on modern hardware, so we
approximate it with Monte Carlo estimates from
mini-batches and optimize using stochastic gradi-
ent descent (Robbins and Monro, 1951).
We also note that the loss in Eqn. 3 takes the same form as the
ε-insensitive hinge loss for support vector regression machines (Vapnik,
1995; Drucker et al., 1996), though our use-case is quite different.
Additionally, this loss function is differentiable everywhere except at
the points where $\hat{\rho}_\theta = \rho \pm \gamma$.
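As a small illustration of Eqns. 3 and 4 on a mini-batch, the sketch below assumes the per-token probabilities of a non-O tag have already been obtained from the CRF marginals (for example with forward-backward, which is omitted here). The function name `eer_loss` and the numeric values of ρ and γ in the usage lines are illustrative assumptions.

```python
from typing import List


def eer_loss(non_o_marginals: List[List[float]], rho: float, gamma: float) -> float:
    """Hinge on the gap between the target ratio rho and the model's
    expected entity ratio over a batch of sentences.

    non_o_marginals[k][i] is the model probability that token i of
    sentence k is NOT tagged O, i.e. the sum of its marginals over all
    entity tags.
    """
    expected_entity_tokens = sum(sum(sentence) for sentence in non_o_marginals)
    total_tokens = sum(len(sentence) for sentence in non_o_marginals)
    rho_hat = expected_entity_tokens / total_tokens
    return max(0.0, abs(rho - rho_hat) - gamma)


# Two sentences (4 and 2 tokens); the model expects 2 of 6 tokens to be entities.
batch = [[0.9, 0.1, 0.0, 0.0], [0.5, 0.5]]
outside_margin = eer_loss(batch, rho=0.15, gamma=0.05)  # penalized
inside_margin = eer_loss(batch, rho=0.30, gamma=0.05)   # within the margin, zero loss
```

The loss vanishes whenever the model's expected ratio lies within γ of ρ, so the margin leaves room for an imprecise estimate of the entity rate.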
3.5 Combined Objective and Consistency
The final loss, presented in Eqn. 5, combines Eqns. 1 and 3 with a
balancing coefficient $\lambda_u$.
\[ L(\theta; \mathcal{D}, \lambda_u, \rho, \gamma) = L_p(\theta; \mathcal{D}) + \lambda_u L_u(\theta; \mathcal{D}, \rho, \gamma) \tag{5} \]
This loss has an intuitive explanation. The supervised loss $L_p$
optimizes the entity recall of the model. The addition of the EER loss
$L_u$ further controls the precision of the model. Together, they form a
principled objective whose optimum recovers the true distribution under
mild conditions.
We now present a theorem that gives insight into why the loss in Eqn. 5
is justified. First, we introduce the following set of assumptions:3
Assumption 1. Assume there are finite vocabularies of words
$\mathcal{X}$ and tags $\mathcal{Y}$, and that $\mathcal{Y}$ contains a
special tag O. We have some model $p(y_{1:n} \mid x_{1:n}; \theta)$ with
parameter space $\Theta$. Assume some distribution
$p_{X,Y}(x_{1:n}, y_{1:n})$ over sequence pairs
$x_{1:n} \in \mathcal{X}^+$, $y_{1:n} \in \mathcal{Y}^+$, and define
$S = \{x_{1:n} \in \mathcal{X}^+ : p_X(x_{1:n}) > 0\}$. Assume in
addition the following:
(a) $p_{Y|X}$ is deterministic: that is, for any $x_{1:n} \in S$, there
exists some $y_{1:n} \in \mathcal{Y}^+$ such that
$p_{Y|X}(y_{1:n} \mid x_{1:n}) = 1$.
2This follows from linearity of expectations:
$\mathbb{E}_{y_{1:n}}[\sum_i f(y_i)] = \sum_i \mathbb{E}_{y_{1:n}}[f(y_i)] = \sum_i \mathbb{E}_{y_i}[f(y_i)]$.
3We make use of the following definition: For any finite set $A$, define
$A^+$ to be the set of finite length sequences of symbols drawn from
$A$. That is, $A^+ = \{a_{1:n} : n > 0, \forall i, a_i \in A\}$.
(b) There is some parameter setting $\theta \in \Theta$ such that
$p(y_{1:n} \mid x_{1:n}; \theta) = p_{Y|X}(y_{1:n} \mid x_{1:n})$ for all
$(x_{1:n}, y_{1:n}) \in S \times \mathcal{Y}^+$.
(c) We have a set of training examples
$\mathcal{D}^m = \{(x^k_{1:n_k}, y^k_{1:n_k})\}_{k=1}^{m}$ drawn from the
distribution $p_X(x_{1:n}) \times \tilde{p}_{Y|X}(y_{1:n} \mid x_{1:n})$
where $\tilde{p}_{Y|X}$ has the following properties:
(c1) No false positives: for all $x_{1:n} \in S$, for all
$i \in \{1 \ldots n\}$, if $p_{Y|X}(y_i = O \mid x_{1:n}) = 1$, then
$\tilde{p}(y_i = O \mid x_{1:n}) = 1$.
(c2) Positive entity support: for all $x_{1:n} \in S$, for all
$i \in \{1 \ldots n\}$, if there is some $y \in \mathcal{Y}$ such that
$y \neq O$ and $p_{Y|X}(y_i = y \mid x_{1:n}) = 1$, then
$\tilde{p}(y_i = y \mid x_{1:n}) > 0$, and
$\tilde{p}(y_i = O \mid x_{1:n}) = 1 - \tilde{p}(y_i = y \mid x_{1:n})$.
That is, only $y$ and O are possible under $\tilde{p}$, and the tag $y$
has probability strictly greater than zero.
Given these assumptions, define $L_\infty$ to be the expected loss under
the distribution $\tilde{p}$:
\[ L_\infty(\theta; \lambda_u, \rho, \gamma) = \mathbb{E}_{\mathcal{D}^m \sim \tilde{p}}\left[L(\theta; \mathcal{D}^m, \lambda_u, \rho, \gamma)\right] \]
We can then state the following theorem.
Theorem 2. Assume that all conditions in Assumption 1 hold. Define
$\rho = \rho^*$ where $\rho^*$ is the known marginal entity tag
distribution, $\gamma = 0$, and $\lambda_u > 0$. Then for any
$\theta \in \arg\min L_\infty(\theta; \lambda_u, \rho, \gamma)$, the
following holds:
\[ \forall (x_{1:n}, y_{1:n}) \in S \times \mathcal{Y}^+, \quad p(y_{1:n} \mid x_{1:n}; \theta) = p_{Y|X}(y_{1:n} \mid x_{1:n}) \]
The proof of the theorem is in the Appendix.
Intuitively, this result is important because it shows that in the limit
of infinite data, parameter estimates optimizing the loss function will
recover the correct underlying distribution $p_{Y|X}$. More formally,
this theorem is the first critical step in proving consistency of an
estimation method based on optimization of the loss function. In
particular (see for example Section 4 of Ma and Collins, 2018) it should
be relatively straightforward to derive a result of the form
\[ P\left(\lim_{m \to \infty} d(\hat{p}^m_{Y|X}, p_{Y|X}) = 0\right) = 1 \]
under some appropriate definition of distance between distributions $d$,
where $\hat{p}^m_{Y|X}$ is the distribution under parameters $\theta_m$
derived from a random sample $\mathcal{D}^m$ of size $m$. However, for
reasons of space we leave this to future work.4
4 Benchmark Experiments
We evaluate our approach on 7 datasets in 6 lan-
guages for two diverse annotation scenarios (14
datasets in total) and compare to strong and state-
of-the-art baselines.
4.1 Corpora
Our original datasets come from two benchmark NER corpora in 6
languages. We use the English (eng-c), Spanish (esp), German (deu), and
Dutch (ned) languages from the CoNLL 2003 shared tasks (Tjong Kim Sang
and De Meulder, 2003). We also use the NER annotations for English
(eng-o), Mandarin Chinese (chi), and Arabic (ara) from the OntoNotes5
corpus (Hovy et al., 2006). By studying across this wide array of
corpora, we test the approaches in a variety of language settings, as
well as dataset and task sizes. The CoNLL corpus specifies 4 entity
classes while the OntoNotes corpus has 18 different classes and they
span 7.4K to 82K training sentences. We use standard train/dev/test
document splits. For each corpus, we generate two partially annotated
datasets according to the scenarios from § 4.2.
4.2 Simulated Annotation Scenarios
We simulate two partial annotation scenarios that model diverse
real-world situations. The first is the "Non-Native Speaker" (NNS)
scenario from Mayhew et al. (2019) and the second, "Exploratory Expert"
(EE), is a novel scenario inspired by industry. We choose these two
samplers to make our results more applicable to practitioners. The
simpler alternative, dropping entity annotations uniformly at random (as
in Jie et al., 2019, and Li et al., 2021), is not realistic, leaving an
overly
4One additional remark: Assumption 1 conditions (a) and (b) do not
strictly speaking include log-linear models, as probabilities in these
models cannot be strictly equal to 1 or 0. However, probabilities under
these models can approach arbitrarily close to 1 or 0; for simplicity we
present this version of the theorem here, but a more complete analysis
could use techniques similar to those in Della Pietra et al. (1997) that
make use of the closure of the set of distributions of the model, which
includes points on the boundary.
diverse set of surface mentions with none of the biases incurred by
real-world partial labeling. While there are other partial annotation
scenarios compatible with our method that we could have considered here
as well, such as using Wikipedia or gazetteers for silver-labeled
supervision, we chose to work with simulated scenarios that allow us to
study a large array of datasets without introducing the confounding
effects of choices for outside resources.
4.2.1 Scenario 1: Non-Native Speaker (NNS)
Our first low-recall scenario is the one proposed by Mayhew et al.
(2019), wherein they study NER datasets that simulate non-native speaker
annotators. To simulate data for this scenario, Mayhew et al. (2019)
downsample annotations grouped by mention until the recall reaches 50%.
For example, if "New York" is sampled, then all annotations with "New
York" as their mention in the text are dropped. After the recall is
dropped to 50%, the precision is lowered to 90% by adding short randomly
typed false-positive spans. The reasoning for this slightly more
complicated scheme is that it better reflects the biases incurred via
non-native speaker annotation. When non-native speakers exhaustively
annotate for NER, they often systematically miss unrecognized entities
and occasionally incorrectly annotate false-positive spans.5
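The recall-lowering step of this scenario can be sketched as follows. This is neither the sampler of Mayhew et al. (2019) nor the reimplementation used in the paper; the annotation tuple layout, the function name `drop_by_mention`, and the choice to stop at the first point where recall is at or below the target are all assumptions. The subsequent insertion of false-positive spans is not shown.

```python
import random
from collections import defaultdict
from typing import List, Tuple

# Hypothetical layout: (doc_id, start, end, entity type, mention text)
Annotation = Tuple[int, int, int, str, str]


def drop_by_mention(annotations: List[Annotation], target_recall: float = 0.5,
                    seed: int = 0) -> List[Annotation]:
    """Remove all annotations of randomly chosen mentions until the fraction
    of kept annotations is at or below target_recall."""
    rng = random.Random(seed)
    by_mention = defaultdict(list)
    for ann in annotations:
        by_mention[ann[4]].append(ann)
    mentions = sorted(by_mention)
    rng.shuffle(mentions)
    kept = len(annotations)
    dropped = set()
    for mention in mentions:
        if kept <= target_recall * len(annotations):
            break
        dropped.add(mention)
        kept -= len(by_mention[mention])
    return [ann for ann in annotations if ann[4] not in dropped]
```

Grouping by mention text means a surface form is either fully kept or fully removed, which is what makes the missing annotations systematic rather than uniformly random.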
The original sampling code used in Mayhew et al. (2019) is not available
and we have introduced datasets that were not in their study, so we
reimplemented their sampler and used our version across all of our
corpora for consistency. We do, however, run their model code on our
datasets, so our results with respect to their approach still hold.
4.2.2 Scenario 2: Exploratory Expert (EE)
In addition to the non-native speaker scenario of Mayhew et al. (2019),
we introduce a significantly different scenario that reflects another
common real-world low-recall NER situation. Though it has not been
studied before in the literature, it is inspired by accounts of
partially annotated datasets encountered in industry.
In the EE scenario, we suppose a new NER
task to be annotated by a domain expert with
5It is worth noting that the NNS scenario is also quite close to a
silver-labeled scenario using a seed dictionary with 50% recall, only it
has some additional false positive noise.
limited time. Here, in the initial "exploratory" phase of annotation,
the expert may wish to cover more ground by inexhaustively scanning
through documents in the corpus, annotating the first few entities they
see in a document before moving on, stopping once they have added M
total entity spans. The advantage of this approach is that, by being
inexhaustive, the resulting set of mentions and contexts will have more
diversity than by using exhaustive annotation. Compared to exhaustive
annotation, the disadvantage is that annotators may miss entities and
the annotations are biased toward the top of documents.
We simulate this scenario by first removing all annotations from the
dataset, then adding back entity spans with the following process.
First, we select a document at random without replacement, then scan
this document left to right, adding back entity spans with probability
0.8, until 10 entities have been added, then moving on to the next
random document. The process halts when M = 1,000 total entity spans
have been added back to the dataset. We note that this assumes that the
expert annotators are skimming, sometimes missing entities (20% of the
time), but also assumes that the expert does not make flagrant mistakes,
and so we do not insert random false-positive spans.
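The sampling process above can be sketched as below. The data layout (a dictionary from document id to gold spans), the function name `exploratory_expert_sample`, and the parameter names are assumptions for illustration; only the defaults (budget of 1,000, 10 entities per document, keep probability 0.8) come from the text.

```python
import random
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # hypothetical layout: (start, end, entity type)


def exploratory_expert_sample(docs: Dict[str, List[Span]], budget: int = 1000,
                              per_doc: int = 10, keep_prob: float = 0.8,
                              seed: int = 0) -> Dict[str, List[Span]]:
    """Start from no annotations and add gold spans back, document by document."""
    rng = random.Random(seed)
    doc_ids = sorted(docs)
    rng.shuffle(doc_ids)  # walking a shuffled list samples without replacement
    kept: Dict[str, List[Span]] = {doc_id: [] for doc_id in docs}
    total = 0
    for doc_id in doc_ids:
        if total >= budget:
            break
        for span in sorted(docs[doc_id]):  # left-to-right scan of the document
            if len(kept[doc_id]) >= per_doc or total >= budget:
                break
            if rng.random() < keep_prob:  # the expert skims and may miss a span
                kept[doc_id].append(span)
                total += 1
    return kept
```

Documents that are never visited keep an empty annotation list, which is what later makes the short and shortest preprocessing variants of § 4.4 possible.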
An important aspect of this scenario in our experiments is the scale of
the number of kept annotations. In previous work (Jie et al., 2019;
Mayhew et al., 2019; Li et al., 2021), the number of kept annotations is
not dropped below 50% of the complete dataset. By keeping only 1K
entities, this scenario is significantly more impoverished than those
previously studied (1K entities leaves less than 10% of annotations for
all datasets, ranging from 0.8% to 8.5%, depending on the corpus).
4.3 Approaches
We compare several modeling approaches on the
benchmark corpora, detailed below.
4.3.1 Gold
For comparison, we report our tagging model
trained with supervised sequence likelihood on
the original gold datasets. This provides an upper
bound on tagging performance and puts any
performance degradation from partially super-
vised datasets into perspective. We do not expect
any of the other methods to outperform this.
4.3.2 Raw
In the Raw-BERT baseline, we make the naive assumption that all
unobserved tags in the low-recall datasets are O tags, reflecting the
second row of Figure 1, and train with supervised likelihood. This is a
weak baseline that we expect to have low recall.
4.3.3 Cost-aware Decoding (Raw+CD)
This stronger baseline, suggested by a reviewer, explores a simple
modification to the Raw baseline at test time: we increase the cost of
predicting an O tag during inference in an attempt to artificially
increase the recall. That is, we introduce an additional hyperparameter
$b_O \ge 0$ that is subtracted from the O tag potentials, biasing the
model away from predicting O tags:
\[ \phi(i, y) = \begin{cases} v_y^\top h_i - b_O & y = O \\ v_y^\top h_i & \text{else} \end{cases} \]
Intuitively, this approach will work well if the
tag potentials consistently rank false negative en-
tity tokens higher than true O tokens. To select bO,
we perform a model-based hyperparameter search
(Head et al., 2020) using a Gaussian process with
30 evaluations on the validation set F1 score for
each dataset’s trained Raw-BERT model.
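The test-time adjustment can be sketched as a simple shift of the O-tag scores before decoding. The function name `apply_o_cost` and the list-of-lists layout of the potentials are assumptions; decoding itself (for example Viterbi over the shifted potentials) is not shown.

```python
from typing import List


def apply_o_cost(unary: List[List[float]], o_index: int, b_o: float) -> List[List[float]]:
    """Return a copy of the tag potentials with b_o subtracted from the O tag."""
    if b_o < 0:
        raise ValueError("b_o is assumed to be non-negative")
    return [
        [score - b_o if y == o_index else score for y, score in enumerate(row)]
        for row in unary
    ]


# Two tokens, two tags (index 0 is O). With b_o = 1.0 the first token's
# highest-scoring tag flips from O to the entity tag.
shifted = apply_o_cost([[2.0, 1.5], [0.0, 1.0]], o_index=0, b_o=1.0)
```

Since the shift is applied only at inference, the trained Raw-BERT parameters are untouched and $b_O$ can be tuned on the validation set alone.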
4.3.4 Constrained Binary Learning (CBL)
The CBL baseline is a state-of-the-art approach
to partially supervised NER from Mayhew et al.
(2019). The main idea of the approach is to esti-
mate which O tags are false negatives, and remove
them from training.
Constrained Binary Learning (CBL) approaches this through a constrained,
self-training-like meta-algorithm, based on Constraint-Driven Learning
(Chang et al., 2007). The algorithm starts off with a binarized version
of the problem (O tag vs not) and initializes instance weights of 1 for
all O tags. It then estimates their final weights by iteratively
training a model, predicting
tags for the training data, then down-weighing
some tags based on the confidence of these
predictions according to a linear-programming
constraint on the total number of allowed O tags.
At each iteration, the number of allowed O tags is
decreased slightly, and this loop is repeated until
the final target entity ratio (our ρ) is satisfied by
the weights. A final tagger is then trained on the
original tag set using a weighted modification of
the supervised tagging likelihood.
For this method, we used the code exactly as was provided, with the
following exception. For all non-English languages, we were not able to
obtain the original embeddings used in their experiments, and so we have
used language-specific pretrained embeddings from the FastText library
(Grave et al., 2018). The base tagging model from Mayhew et al. (2019)
utilizes the BiLSTM-CRF approach from Ma and Hovy (2016). The CBL
meta-algorithm, however, is agnostic to the underlying scoring
architecture of the CRF, and so we test the CBL algorithm both with
their BiLSTM scoring architecture and with our BERT-based scoring
architecture, which we call CBL-LSTM and CBL-BERT, respectively. By
testing the CBL meta-algorithm with our tagging model, we control for
the different modeling choices and get a clear view of how their CBL
approach compares to ours.
4.3.5 Span-based Negative Sampling (SNS)
The SNS-BERT baseline is a recent state-of-the-
art approach to partially supervised NER from Li
et al. (2021). It uses the same BERT-based encod-
ing architecture, but has a different modeling layer
on top. Instead of tagging each token, they use a span-based scheme,
treating each possible pair of tokens as a potential entity and
classifying all of the spans independently, using an ad-hoc decoding
step based on confidence to eliminate overlapping spans. To deal with
the resulting class
imbalance (O spans are overwhelmingly common)
and low-recall entity annotations, they propose to
sample spans from the set of unlabeled spans as
negatives. While it is possible that they incorrectly
sample false negative entities, they argue that this
has very low probability. For this method, we used
the code as provided but controlled for the same
encoding pretrained weights as our other models.
4.3.6 Expected Entity Ratio (EER)
The EER-BERT model implements our proposed approach, using the proposed
tagger (§ 3.2) and the loss function described in Eqn. 5.
4.4 Preprocessing
All datasets came in documents, pre-tokenized into words, with gold
sentence boundaries. Recent work (Akbik et al., 2019; Luoma and Pyysalo,
2020) has demonstrated that larger inter-sentential
document context is useful for maximizing
performance, so we work with full documents instead
of individual sentences.6 For approaches that used
a pretrained transformer, some documents did not
fit into the 512 token maximum length. In these
cases, we split documents into maximal contiguous
chunks at sentence boundaries. Also, for
pretrained transformer approaches we expand the
tag sequences to match the subword tokenizations.
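The chunking step can be sketched as a greedy grouping of consecutive sentences. This is a minimal illustration, assuming sentence lengths are already measured in subword tokens and ignoring any special tokens the encoder adds; the function name `chunk_document` is hypothetical.

```python
from typing import List

MAX_LEN = 512  # maximum length mentioned in the text


def chunk_document(sentence_lengths: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Greedily group consecutive sentence indices into chunks of at most max_len tokens.

    A chunk is closed only when adding the next sentence would exceed max_len,
    so every chunk is maximal. A single sentence longer than max_len ends up in
    its own chunk and would still need truncation downstream.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    current_len = 0
    for idx, length in enumerate(sentence_lengths):
        if current and current_len + length > max_len:
            chunks.append(current)
            current, current_len = [], 0
        current.append(idx)
        current_len += length
    if current:
        chunks.append(current)
    return chunks


# Hypothetical document with six sentences of the given subword lengths.
lengths = [200, 250, 100, 300, 212, 40]
chunks = chunk_document(lengths)
```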
Because the low-recall data in the EE scenario
concentrates annotations at the top of only a few
documents, it is possible to identify and omit large
unannotated portions of text from the training
data. We hypothesize that this will significantly
improve model outcomes for the baselines because
it significantly cuts down on the number of false
negative annotations. Therefore, we explore three
preprocessing variants for all EE models: (1) all
uses the full dataset as given; (2) short drops all
documents with no annotations; and (3) shortest
drops all sentences after the last annotation in a
document (subsuming short). Model names are
suffixed with their preprocessing variants. We
note that these approaches do not apply to the
NNS scenario, as it has many more annotations
spread more evenly throughout the data.
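The three preprocessing variants can be written down in a few lines. The sketch below assumes a simple hypothetical data layout (a document is a list of sentences, a sentence is a list of tags, and "O" marks a token with no observed annotation); it is an illustration of the description above, not the authors' implementation.

```python
from typing import List

Sentence = List[str]   # one tag per token; "O" means no observed annotation
Document = List[Sentence]


def has_annotation(sentence: Sentence) -> bool:
    return any(tag != "O" for tag in sentence)


def preprocess(docs: List[Document], variant: str) -> List[Document]:
    """Apply the 'all', 'short', or 'shortest' variant to a list of documents."""
    if variant == "all":
        return docs
    # 'short': keep only documents that contain at least one annotation.
    annotated_docs = [d for d in docs if any(has_annotation(s) for s in d)]
    if variant == "short":
        return annotated_docs
    if variant == "shortest":
        # Additionally drop every sentence after the last annotated one.
        trimmed = []
        for doc in annotated_docs:
            last = max(i for i, s in enumerate(doc) if has_annotation(s))
            trimmed.append(doc[: last + 1])
        return trimmed
    raise ValueError(f"unknown variant: {variant}")


# Hypothetical toy corpus: one annotated document and one unannotated document.
doc_a = [["B-PER", "O"], ["O", "O"], ["O"]]
doc_b = [["O", "O"], ["O"]]
corpus = [doc_a, doc_b]
```

On this toy corpus, `short` keeps only `doc_a`, and `shortest` further trims it to its first sentence.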
4.5 Hyperparameters
All hyperparameters were given reasonable
defaults, using recommendations from previous
work. For pretrained transformer models, we
used the Huggingface (2019) implementations of
roberta-base (Liu et al., 2019) on English
datasets and bert-base-multilingual-cased
(Devlin et al., 2019) for the other
languages. The vector representations used by these
models are 768-dimensional and we used matching
dimensions for other vector sizes throughout
the model. We used a learning rate of 2 × 10⁻⁵
with a slanted triangular schedule peaking at 10%
of the iterations (Devlin et al., 2019). For batch
size, we use the maximum batch size that will
allow us to train in memory on a Tesla V100 GPU
(14 for CoNLL data, 2 for OntoNotes5 data). We
found that training for more epochs than originally
recommended (Devlin et al., 2019) was necessary
for convergence and used 20 epochs for
the all variants and 50 epochs for the significantly
smaller short and shortest variants.7
6With the exception of the SNS (Li et al., 2021) baseline,
where we had to restrict to sentences because it is an O(n²)
span-based model and could not handle long text sequences,
running into memory issues.
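A slanted triangular schedule of the kind mentioned above can be sketched as a linear warm-up followed by a linear decay. The peak rate and the 10% peak position come from the text; decaying all the way to zero and the function name are assumptions made for illustration.

```python
def slanted_triangular_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 2e-5,      # learning rate from the text
    warmup_frac: float = 0.1,   # peak at 10% of the iterations
) -> float:
    """Linear warm-up to peak_lr, then linear decay to 0 (decay floor is an assumption)."""
    peak_step = max(1, int(total_steps * warmup_frac))
    if step <= peak_step:
        return peak_lr * (step / peak_step)
    decay_steps = max(1, total_steps - peak_step)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Learning rate at a few points of a hypothetical 1,000-step run.
schedule = [slanted_triangular_lr(s, 1000) for s in (0, 50, 100, 550, 1000)]
```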
The only hyperparameter we adjusted (from a
preliminary experiment measuring dev set
performance) was setting λu = 10. We originally
tried a weight of λu = 1, but then found that the
scale of the Lp loss massively overpowered Lu,
so we increased it to λu = 10, which yielded good
performance. We did not try other values after
that.
In important contrast to benchmark experiments
from prior work (Jie et al., 2019; Mayhew et al.,
2019), we do not assume we know the gold entity
tag ratio for each dataset when setting ρ. Instead,
to make the evaluation more realistic, we use
a reasonable guess of ρ = 0.15 with a margin
of uncertainty γ = 0.05 for all approaches and
datasets. We choose this range because it covers
most of the gold ratios observed in the datasets.8
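To make the roles of ρ, γ, and λu concrete, here is a small sketch of an interval penalty and its weighting against a supervised loss. It assumes the penalty is zero inside [ρ − γ, ρ + γ] and grows linearly with the distance outside it; the exact form of Lu is the one given by the paper's own equations, and the function names here are hypothetical.

```python
def interval_penalty(rho_hat: float, rho: float = 0.15, gamma: float = 0.05) -> float:
    """Zero when rho_hat lies in [rho - gamma, rho + gamma]; linear outside (assumed form)."""
    return max(0.0, abs(rho_hat - rho) - gamma)


def combined_loss(supervised_loss: float, rho_hat: float, lambda_u: float = 10.0) -> float:
    """Supervised loss plus lambda_u times the interval penalty."""
    return supervised_loss + lambda_u * interval_penalty(rho_hat)


inside = interval_penalty(0.18)   # within [0.10, 0.20]
above = interval_penalty(0.30)    # 0.10 above the upper end
total = combined_loss(1.0, 0.30)  # hypothetical supervised loss of 1.0
```

With λu = 10, a model whose entity ratio overshoots the interval by 0.10 pays a penalty comparable to a supervised loss of 1.0, which is the kind of rescaling the weighting is meant to achieve.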
4.6 Results
The results of our evaluation are presented in
Table 1. The first row shows the result of training
our tagger with the original gold data. These results
are competitive with previously published results
from similar pretrained transformers (Devlin et al.,
2019) that do not use ensembles or NER-specific
pretraining (Luoma and Pyysalo, 2020; Baevski
et al., 2019; Yamada et al., 2020). Interestingly,
we also found that our tagging CRF outperformed
the span-based independent distribution of Li
et al. (2021) on all gold datasets.
NNS Performance. The second set of rows
shows test F1 scores of models from § 4.3 for
the NNS sampled datasets. We first note that the
CBL-LSTM approach from Mayhew et al. (2019)
significantly underperformed for all non-English
languages (its scores are much lower than the results
from their paper with similar data). We used their
code as is, only changing the pretrained word
vectors, and so suspect that this is due to lower
quality word vectors obtained from FastText
instead of their custom-fit vectors. This is confirmed
by the results of using their CBL meta-algorithm
with our proposed tagging architecture, which is
competitive with EER-BERT in this setting.
Otherwise, we found that all strong baselines
and our method performed quite similarly. This
suggests that performance in the NNS regime
with relatively high recall (50%) and little label
noise per positively labeled mention is not
bottlenecked by approaches to resolving missing
mentions. Further improvements in this regime
will likely come from other sources, such as better
pretraining or supplemental corpora. Because
of this we recommend that future evaluations for
partially supervised NER focus on more impoverished
annotation counts, such as the EE scenario
we study next.
7For the CBL-LSTM approach, we use the hyperparameters
from Mayhew et al. (2019): these are more epochs (45),
and a higher learning rate of 10⁻³.
8In early experiments we found that the CBL code from
Mayhew et al. (2019) used the gold ratio plus 0.05. This
additional 0.05 turned out to be critical to getting competitive
performance, so in practice we use ρ = 0.2 for CBL.

Approach / Language   eng-c   deu     esp     ned     eng-o   chi     ara     avg
Gold-BERT-all         92.7    83.9    88.3    91.1    90.7    79.4    72.9    85.6
Gold-SNS-BERT-all     91.1    82.3    87.9    89.5    89.7    77.1    62.1    82.8

Non-Native Speaker Scenario (NNS): Recall=50%, Precision=90%
Approach              N1      N2      N3      N4      N5      N6      N7      avg
Raw-BERT-all          71.2    68.0    52.8    61.9    70.1    69.1    81.9    67.9
Raw+CD-BERT-all       79.9    80.9    60.1    64.9    77.2    78.4    86.3    75.4
CBL-LSTM-all          54.6    67.9    39.4    53.5    48.2    38.4    79.2    54.5
CBL-BERT-all          78.7    76.3    61.9    68.9    75.3    77.5    84.8    74.8
SNS-BERT-all          80.8    81.5    56.0    66.4    77.9    77.0    86.0    75.1
EER-BERT-all          80.9    84.5    56.6    66.6    76.9    77.3    88.0    75.8

Exploratory Expert Scenario (EE): 1,000 Annotations
Approach              E1      E2      E3      E4      E5      E6      E7      avg
Raw-BERT-all          0.4     2.6     0.0     5.3     0.4     0.7     2.4     1.7
Raw-BERT-short        44.1    37.2    0.0     15.4    28.4    44.4    32.4    28.8
Raw-BERT-shortest     80.7    65.4    69.1    42.0    67.5    73.0    57.1    65.0
Raw+CD-BERT-shortest  82.4    67.9    70.0    43.9    68.9    76.6    58.3    66.9
CBL-LSTM-all          60.2    27.5    33.3    15.3    23.1    41.2    29.9    32.9
CBL-LSTM-shortest     67.8    20.1    26.7    9.7     42.0    36.2    24.6    32.4
CBL-BERT-all          36.4    52.8    52.5    20.8    22.4    40.9    29.3    36.4
CBL-BERT-short        43.7    64.7    60.8    30.2    16.0    56.4    31.2    43.3
CBL-BERT-shortest     80.6    65.1    71.2    39.2    28.4    74.7    53.6    59.0
SNS-BERT-all          59.5    63.8    70.3    0.0     14.0    70.8    28.8    43.9
SNS-BERT-short        64.4    62.6    64.1    0.0     40.7    70.8    46.4    49.9
SNS-BERT-shortest     83.9    70.1    77.1    40.7    75.6    76.8    63.3    69.6
EER-BERT-all          86.3    73.2    80.2    42.9    61.2    80.2    56.2    68.6
EER-BERT-short        89.0†   72.2    80.3†   46.8†   75.9    76.5    61.4    71.7†
EER-BERT-shortest     87.3†   73.6†   74.2    42.1    74.0    76.5    64.3    70.3

Table 1: Benchmark test set F1 scores across different languages and annotation scenarios. Best models
in bold. † indicates that for EE the test F1 score is statistically significantly better than SNS-BERT-shortest
(p < 0.01) (details in footnote 9). Other pairs between SNS-BERT-shortest and EER-BERT-short/shortest
were not significant. In the NNS and EE blocks, the per-language columns (N1-N7, E1-E7) are listed in the
order in which they appear in the source; their language labels are not recoverable there, and only the
avg column is identified.
EE Performance. In the third group of rows,
we show test F1 scores for each model using the
more challenging EE scenario with only 1,000
kept annotations. In this setting, using the dataset
as is for supervised training (Raw-BERT-all)
fails to converge, but smarter preprocessing
largely alleviates this problem, with Raw-BERT-shortest
obtaining an average F1 of 65.0. Adding
cost-aware decoding (Raw+CD-BERT-shortest)
further improves upon the standard baseline
(F1 66.9).
Even with only 1,000 biased and incomplete
annotations (less than 10% of the original
annotations for all datasets), we find that our approach
(EER-BERT-short) still achieves an F1 score of
71.7 on average. This outperforms the best strong
baselines: Raw+CD, CBL, and SNS, by 4.8, 12.7,
and 2.3 F1 score, respectively. The closest
baseline, SNS-BERT-shortest from Li et al. (2021),
is competitive with EER-BERT-short on four of
the datasets, but performs significantly worse on
the other three as well as overall,9 leading us
to conclude that our method has a performance
edge in this regime. Further, EER-BERT-short
performs only 4.1 average F1 worse on EE
data than EER-BERT-all on NNS data. We also
note that EER-BERT-shortest significantly
outperformed SNS-BERT-shortest on two datasets, but
we failed to reject the null hypothesis overall.
Another important finding is that EER-BERT is
much more robust to preprocessing choices than
the baselines. The baselines all view missing en-
tities as O tags/spans (at least to start) and these
relatively common false negatives severely throw
off convergence. By removing most of the unanno-
tated text with preprocessing, we effectively create
a much smaller corpus that has nearly 80% recall
(for shortest). In contrast, EER-BERT’s view of
the data makes no assertions about the class of in-
dividual unobserved tokens and so is less sensitive
to the relative proportion of false negative anno-
tations. This is useful in practice, as our approach
should better handle partial annotation scenarios
with wider varieties of false negative proportions
that may not be so easily addressed with simple
preprocessing.
Speed. A pragmatic appeal of our approach
compared to CBL (Mayhew et al., 2019) is training
time. On NNS data, EER-BERT-all is on average
7.6 times faster than CBL-BERT-all and on EE
data EER-BERT-short is 2.2 times faster than
CBL-BERT-shortest, even though it uses more
data. This is because EER does not require a
costly outer self-training loop.10
9We assessed significance between model pairs using
a percentile bootstrap of F1 score differences, resampling
test set documents with replacement 100K times (Efron and
Tibshirani, 1994) and measuring the paired F1 score
differences of EER-BERT-short/shortest and SNS-BERT-shortest.
Significance was assessed by whether the two-sided 99% con-
fidence interval contained 0.0. To assess overall significance,
we concatenated the test datasets before bootstrapping.
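The paired percentile bootstrap described in footnote 9 can be sketched as follows. This is an illustrative version, not the authors' code: it assumes each system's output has been reduced to per-document (true positive, false positive, false negative) counts, computes micro-averaged F1 on each resample, and uses far fewer resamples by default than the 100K reported.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (true pos., false pos., false neg.) for one document


def micro_f1(counts: List[Counts]) -> float:
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_ci(
    sys_a: List[Counts],
    sys_b: List[Counts],
    n_resamples: int = 1000,
    alpha: float = 0.01,
    seed: int = 0,
) -> Tuple[float, float]:
    """Two-sided (1 - alpha) percentile interval for F1(A) - F1(B).

    The same resampled document indices are used for both systems (paired).
    """
    assert len(sys_a) == len(sys_b)
    rng = random.Random(seed)
    n = len(sys_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(micro_f1([sys_a[i] for i in idx]) - micro_f1([sys_b[i] for i in idx]))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Hypothetical per-document counts for two systems on 20 test documents.
system_a = [(9, 1, 1)] * 20
system_b = [(5, 5, 5)] * 20
lo, hi = paired_bootstrap_ci(system_a, system_b, n_resamples=200)
significant = not (lo <= 0.0 <= hi)
```

The difference is declared significant when the interval excludes 0.0, matching the criterion stated in the footnote.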
Conclusion. These results illustrate that our
approach outperforms the previous strong and
state-of-the-art baselines in the challenging
low-recall EE setting with only 1K annotations
while also being more robust to the relative
proportions of false negatives in the training
corpus.11
10We unfortunately cannot comment on the relative speed of
SNS because runtimes cannot be inferred from the SNS
code output, though we do not expect a fundamental speed
advantage of one over the other, as neither uses self-training.
11We also note that the EE scenario averages for all
models are significantly affected by the poor performance on
the Arabic OntoNotes5 (ara) dataset. After further inspection
of the training curves, we found that all models exhibited very
slow convergence on this dataset and/or failed to converge in
the allotted number of epochs.
4.7 Analysis of EER Hyperparameters
Recall that the definition of our EER loss in
Eqn. 3 defines an acceptable region ρ̂θ ∈ [ρ −
γ, ρ + γ] of learned models and that in our
benchmark experiments, we used ρ = 0.15 and γ = 0.05 for
all datasets, regardless of the true entity ratios
ρ∗. Two interesting questions then are (1) "how
sensitive is the procedure to choices of ρ and γ?";
and (2) "how closely do the final learned models
reflect the true entity ratios for the data?". We
address these next.
4.7.1 Robustness to Choices of ρ and γ
To study robustness we varied choices of ρ and
γ for EER-BERT-short on the CoNLL English
EE dataset with three randomly sampled datasets.
Table 2 shows test F1 scores across seeds for
various settings of ρ ± γ. We first show three
point estimates with γ = 0.0, the first at ρ =
ρ∗ = 0.23, then shifted around ρ∗ left and right
to ρ = 0.15 and ρ = 0.30, respectively. We then
widen the range with γ = 0.05 and show the
benchmark result ρ = 0.15, followed by shifts of
ρ ± 0.1. Finally we show a very wide range of
ρ = 0.15, γ = 0.15.

Varying EER HPs: Test F1 Scores
[ρ − γ, ρ + γ]    RS0     RS1     RS2     Avg.
[0.23, 0.23]∗     86.8    87.4    87.0    87.1
[0.15, 0.15]      89.3    87.1    87.8    88.1
[0.30, 0.30]      79.1    79.4    79.7    79.4
[0.10, 0.20]†     87.6    88.2    87.8    87.9
[0.20, 0.30]      83.9    83.8    84.1    83.9
[0.00, 0.10]      89.2    87.1    87.8    88.0
[0.00, 0.30]      83.9    83.8    84.0    83.9

Table 2: CoNLL English EE EER-short test set
F1 across three randomly sampled datasets. ∗:
ρ = ρ∗. †: benchmark experiment setting.
From the table we can glean two interesting
points. The first is that in settings where the high
end of the range of acceptable EERs is greater than
ρ∗ (when ρ + γ = 0.30) there is a substantial
drop in performance (mean = 82.3). The second
is that the complementary group of settings, where
ρ + γ ≤ ρ∗, are all high-performing with little
variance (mean = 87.8, std = 0.4). Together they
suggest that the proposed EER approach is
sensitive mainly to the high end of the interval and
that it is best to conservatively estimate that value,
whereas the low end of the range is unimportant.
This result agrees well with the intuitions provided
in § 3.5: Because Lp is encouraging models with
high recall without regard for precision (ρ̂θ → 1),
it is best to set ρ + γ such that Lu introduces
a tension in the combined loss by encouraging
ρ̂θ ≤ ρ∗. This is not the whole story, however, as
we discuss next.
4.7.2 Convergence Towards ρ∗
The results from the previous experiment suggest
that Lu simply serves to drive ρ̂θ → ρ + γ.
Because we used ρ + γ = 0.2 for all datasets in
the benchmark, we would then expect to see a
result that ρ̂θ ≈ 0.2 for all models.
We tested this hypothesis by calculating the
entity ratio ρ̂θ of final trained EER-BERT-short
models for the EE datasets (leaving out ara, since
it failed to converge) and calculated the average
difference of each ρ̂θ with respect to the
corresponding true ρ∗, resulting in a mean absolute
error of only 0.018. This is much closer on
average than if the models just converged to 0.2
(the mean absolute error then would be 0.048),
indicating that our approach tends to converge
more closely to the true entity ratio ρ∗ than the
estimate given by ρ + γ. In particular, we found
that all final models had ρ̂θ < 0.2 except CoNLL
English, where ρ̂θ = 0.23, quite close to the gold
ρ∗ even though it was outside of the target range.
This result is encouraging in that it suggests the
EER loss, in balance with the supervised marginal
tag loss, does more to recover ρ∗ than just drive
ρ̂θ → ρ + γ.
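The comparison above boils down to two small computations: an entity ratio per dataset and a mean absolute error against the gold ratios. The sketch below is only illustrative. It counts hard predicted tags, which may differ from an expectation-based ratio computed from model marginals, and all numbers in the example are hypothetical rather than values from the experiments.

```python
from typing import Dict, List


def entity_ratio(tag_sequences: List[List[str]]) -> float:
    """Fraction of tokens whose tag is not 'O'."""
    total = sum(len(seq) for seq in tag_sequences)
    entity = sum(tag != "O" for seq in tag_sequences for tag in seq)
    return entity / total if total else 0.0


def mean_absolute_error(estimates: Dict[str, float], gold: Dict[str, float]) -> float:
    """Average of |estimate - gold| over the datasets listed in gold."""
    return sum(abs(estimates[name] - gold[name]) for name in gold) / len(gold)


# Hypothetical predictions for two short sentences.
predicted = [["B-PER", "I-PER", "O", "O"], ["O", "O", "B-LOC", "O"]]
ratio = entity_ratio(predicted)

# Hypothetical per-dataset ratios (not the paper's numbers).
model_ratios = {"dataset_x": 0.10, "dataset_y": 0.20}
gold_ratios = {"dataset_x": 0.12, "dataset_y": 0.17}
mae = mean_absolute_error(model_ratios, gold_ratios)
```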
5 EE vs. Exhaustive Experiments
In situations where we only have partially an-
notated data without the option for exhaustive
annotations, the utility of being able to train with
the data as provided is self-evident. However,
given the potential upsides of partial annota-
tion relative to exhaustive annotation—mentally
less taxing and increased contextual diversity for
a fixed annotation budget—it is natural to ask
whether it is actually better to go with a sparse
annotation scheme.
5.1 Annotation Speed User Study
We begin with a user study of annotation speed,
comparing EE to the standard exhaustive anno-
tation scheme. Following methodology from Li
et al. (2020), we recorded 8 annotation sessions
from 4 NLP researchers familiar with NER. Using
the OntoNotes5 English corpus, we asked each
annotator to annotate for two 20-minute sessions
using the BRAT (Stenetorp et al., 2012) annota-
tion tool, one exhaustively and the other following
the EE scheme. We split documents into two ran-
domized groups and systematically varied which
group was annotated with each scheme and in
what order to control for document and ordering
variation effects. Then, for each annotator, we
measured the number of annotated entities per
minute for both schemes and report the ratio of
EE annotations per minute to exhaustive annota-
tions per minute (i.e., the relative speed of EE to
exhaustive). We found that, although speed varied
greatly between annotators (ranging from roughly
4 annotations/min to 9 annotations/min across ses-
sions), EE annotation and exhaustive annotation
were essentially the same speed, with EE being
3% faster on average. Thus we may fairly compare
exhaustive and EE schemes using model perfor-
mance at the same number of annotations, which
we do next.12
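The relative speed can be recomputed from the session counts given in footnote 12. The sketch below assumes the counts are listed in matching annotator order and shows two plausible aggregations (mean of per-annotator ratios, and ratio of pooled counts); which one the authors used is not stated.

```python
ee_counts = [91, 90, 109, 179]          # EE sessions (footnote 12)
exhaustive_counts = [83, 85, 117, 170]  # matching exhaustive sessions (footnote 12)
SESSION_MINUTES = 20

# Annotations per minute for each session.
ee_rates = [c / SESSION_MINUTES for c in ee_counts]
exhaustive_rates = [c / SESSION_MINUTES for c in exhaustive_counts]

# Relative speed of EE to exhaustive, aggregated two ways.
per_annotator_ratio = [ee / ex for ee, ex in zip(ee_rates, exhaustive_rates)]
mean_ratio = sum(per_annotator_ratio) / len(per_annotator_ratio)
pooled_ratio = sum(ee_counts) / sum(exhaustive_counts)
```

Both aggregations come out a few percent above 1, in line with the reported finding that the two schemes run at essentially the same speed.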
12The exact numbers of annotated entities among the four
annotation sessions for EE were 91, 90, 109, and 179. For
Exhaustive the matching annotation counts were 83, 85, 117,
and 170.
5.2 Performance Learning Curves
In this experiment, we compare the best
traditional supervised training from the benchmark
(Raw-BERT-shortest) with our proposed
approach (EER-BERT-short) on EE-annotated
and exhaustively annotated documents from
CoNLL'03 English (eng-c) at several annotation
budgets, M ∈ {100 (0.4%), 500 (2.1%),
1K (4.3%), 5K (21.3%), 10K (42.6%)}. For
each annotation budget, we sampled three datasets
with different random seeds for both annotation
schemes and trained both modeling approaches.
This allows us to study how all four combinations
of annotation style and training methods perform
at varying magnitudes of annotation counts. In
addition to low-recall annotations, we compared
our EER approach to supervised training on the
gold data.
Figure 2: Test performance as a function of the number
of observed training annotations for the Exhaustive vs.
EE annotation on CoNLL English. Lines are averages
and shaded regions are ±1 standard error.
In Figure 2, we show learning curves for the
average test performance of all four annotation/
training variants. From the plot, we can infer
several points. First, on EE-annotated data, using
our EER loss substantially outperforms traditional
likelihood training at all amounts of partial
annotation, but the opposite is true on exhaustively
annotated data. This indicates that the training
method should be tailored to the annotation
scheme.
The comparison between EE data with EER
training versus exhaustive data with likelihood
training is more nuanced. At only 100 annotations,
exhaustive annotation worked best on average in
our sample, but all methods exhibit high variance
due to the large variation in which entities were
annotated. Interestingly, at modest sizes of only
500 and 1K annotations, EE annotated data with
our proposed EER-short approach outperformed
exhaustive annotation with traditional supervised
training, with gains of +1.8 and +1.5 average F1
for 500 and 1K annotations, respectively. These
results, however, reverse as the annotation counts
grow: At 5K annotations, the two approaches
perform the same (90.8) and, at even larger
annotation counts, exhaustive annotation with traditional
training outperforms our approach by +0.5 at 10K
annotations and +0.8 on the gold dataset. This
indicates that EE annotation, paired with our EER
loss, is competitive with, and potentially advantageous
over, exhaustive annotation and traditional training
at modest annotation counts, but that exhaustive
annotation with traditional training is better
at large annotation counts. This suggests that a
hybrid annotation approach where we sparsely
annotate data at first, but eventually switch to
exhaustive annotations as the process progresses,
is a promising direction of future work. We note
that our EER loss can easily incorporate observed
O tags from exhaustively annotated documents
in yO and so would work in this setup without
modification.
6 Conclusions
We study learning NER taggers in the presence
of partially labeled data and propose a simple,
fast, and theoretically principled approach, the
Expected Entity Ratio loss, to deal with low-recall
annotations. We show empirically that it
outperforms the previous state of the art across a variety
of languages, annotation scenarios, and amounts
of labeled data. Additionally, we give evidence
that sparse annotations, when paired with our
approach, are a viable alternative to exhaustive
annotation for modest annotation budgets.
Though we study two simulated annotation
scenarios to provide controlled experiments, our
proposed EER approach is compatible with a
variety of other incomplete annotation scenarios, such
as incidental annotations (e.g., from Web links
on Wikipedia), initialization from seed annotations
produced by incomplete distant supervision/gazetteers,
or use as a learning procedure embedded in an
active/iterative learning framework, which we intend
to explore in future work.
Acknowledgments
We would like to thank Chris Kedzie, Giannis
Karamanolakis, and the reviewers for helpful
conversations and feedback.
References
Alan Akbik, Tanja Bergmann, and Roland
Vollgraf. 2019. Pooled contextualized embed-
dings for named entity recognition. In Pro-
ceedings of the 2019 Conference of the North
American Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers),
pages 724–728, Minneapolis, Minnesota. Asso-
ciation for Computational Linguistics. https://
doi.org/10.18653/v1/N19-1078
Isabelle Augenstein, Leon Derczynski,
and
Kalina Bontcheva. 2017. Generalisation in
named entity recognition: A quantitative anal-
ysis. Computer Speech & Language, 44:61–83.
https://doi.org/10.1016/j.csl.2017
.01.012
Alexei Baevski, Sergey Edunov, Yinhan Liu,
Luke Zettlemoyer, and Michael Auli. 2019.
Cloze-driven pretraining of self-attention net-
works. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Process-
ing (EMNLP-IJCNLP), pages 5360–5369,
Hong Kong, China. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/D19-1539
Kedar Bellare and Andrew McCallum. 2007.
Learning extractors from unlabeled text us-
ing relevant databases. In Sixth International
Workshop on Information Integration on the
Web.
Andrew Carlson, Scott Gaffney, and Flavian
Vasile. 2009. Learning a named entity tagger
from gazetteers with the partial perceptron. In
AAAI Spring Symposium: Learning by Reading
and Learning to Read, pages 7–13.
Ming-Wei Chang, Lev Ratinov,
and Dan
Roth. 2007. Guiding semi-supervision with
constraint-driven learning. In Proceedings of
the 45th Annual Meeting of the Association of
Computational Linguistics, pages 280–287.
Michael Collins. 2002. Discriminative training
methods for hidden Markov models: Theory
and experiments with perceptron algorithms.
In Proceedings of the ACL-02 Conference
on Empirical Methods in Natural Language
Processing-Volume 10, pages 1–8. Association
for Computational Linguistics.
https://doi.org/10.3115/1118693.1118694
Stephen Della Pietra, Vincent Della Pietra,
and John Lafferty. 1997.
Inducing fea-
tures of random fields. IEEE Transactions
on Pattern Analysis and Machine Intelli-
gence, 19(4):380–393. https://doi.org
/10.1109/34.588021
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In NAACL.
Harris Drucker, Chris
J. C. Burges, Linda
Kaufman, Alex Smola, and Vladimir Vapnik.
1996. Support vector regression machines. In
NIPS.
Bradley Efron and Robert J. Tibshirani. 1994. An
Introduction to the Bootstrap, CRC Press.
Charles Elkan and Keith Noto. 2008. Learning
classifiers from only positive and unlabeled
data. In Proceedings of the 14th ACM SIGKDD
International Conference on Knowledge Dis-
covery and Data Mining, pages 213–220.
Kuzman Ganchev, João Graça, Jennifer
Gillenwater, and Ben Taskar. 2010. Posterior
regularization for structured latent variable
models. The Journal of Machine Learning Re-
search, 11:2001–2049.
Edouard Grave. 2014. Weakly supervised named
entity classification. In Workshop on Automated
Knowledge Base Construction (AKBC).
Edouard Grave, Piotr Bojanowski, Prakhar Gupta,
Armand Joulin, and Tomas Mikolov. 2018.
Learning word vectors for 157 languages. In
Proceedings of the International Conference
on Language Resources and Evaluation (LREC
2018).
Nathan Greenberg, Trapit Bansal, Patrick Verga,
and Andrew McCallum. 2018. Marginal likeli-
hood training of BiLSTM-CRF for biomedical
named entity recognition from disjoint label
sets. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language
Processing, pages 2824–2829. https://doi
.org/10.18653/v1/D18-1306
Tim Head, Manoj Kumar, Holger Nahrstaedt,
Gilles Louppe, and Iaroslav Shcherbatyi. 2020.
scikit-optimize/scikit-optimize.
Eduard Hovy, Mitchell Marcus, Martha Palmer,
Lance Ramshaw, and Ralph Weischedel. 2006.
Ontonotes: The 90% solution. In Proceedings
of
the Human Language Technology Con-
ference of the NAACL, Companion Volume:
Short Papers, NAACL-Short ’06, pages 57–60,
Stroudsburg, PA, USA. Association for Compu-
tational Linguistics. https://doi.org/10
.3115/1614049.1614064
HuggingFace Inc. 2019. PyTorch Pretrained
BERT: The Big & Extending Repository of
pretrained Transformers.
Zhanming Jie, Pengjun Xie, Wei Lu, Ruixue Ding,
and Linlin Li. 2019. Better modeling of incom-
plete annotations for named entity recognition.
In Proceedings of NAACL.
John Lafferty, Andrew McCallum, and Fernando
Pereira. 2001. Conditional random fields: Prob-
abilistic models for segmenting and labeling
sequence data. In ICML.
Guillaume Lample, Miguel Ballesteros, Sandeep
Subramanian, K. Kawakami, and Chris Dyer.
2016. Neural architectures for named entity
recognition. In NAACL. https://doi.org
/10.18653/v1/N16-1030
Belinda Z. Li, Gabriel Stanovsky, and Luke
Zettlemoyer. 2020. Active learning for coref-
erence resolution using discrete annotation. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguis-
tics, pages 8320–8331, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2020.acl-main.738
Yangming Li, Lemao Liu, and Shuming Shi. 2021.
Empirical analysis of unlabeled entity problem
in named entity recognition. In International
Conference on Learning Representations.
Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee,
and Philip S. Yu. 2003. Building text classi-
fiers using positive and unlabeled examples. In
Third IEEE International Conference on Data
Mining, pages 179–186. IEEE.
Bing Liu, Wee Sun Lee, Philip S. Yu, and
Xiaoli Li. 2002. Partially supervised classifi-
cation of text documents. In ICML, volume 2,
pages 387–394.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly opti-
mized BERT pretraining approach. arXiv pre-
print arXiv:1907.11692v1.
Jouni Luoma and Sampo Pyysalo. 2020. Exploring
cross-sentence contexts for named
entity recognition with BERT. In COLING.
https://doi.org/10.18653/v1/2020.coling-main.78
Xuezhe Ma and Eduard Hovy. 2016. End-to-end
sequence labeling via bi-directional LSTM-
CNNs-CRF. ArXiv, abs/1603.01354.
Zhuang Ma and Michael Collins. 2018. Noise con-
trastive estimation and negative sampling for
conditional models: Consistency and statistical
efficiency. In EMNLP.
Stephen Mayhew, Snigdha Chaturvedi, Chen-Tse
Tsai, and Dan Roth. 2019. Named entity recog-
nition with partially annotated training data. In
Proceedings of the 23rd Conference on Compu-
tational Natural Language Learning (CoNLL),
pages 645–655, Hong Kong, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/K19-1060
Andrew McCallum and Wei Li. 2003. Early
results for named entity recognition with
conditional random fields, feature induction
and web-enhanced lexicons. In Proceedings
of the seventh conference on Natural language
learning at HLT-NAACL 2003-Volume 4,
pages 188–191. Association for Computational
Linguistics. https://doi.org/10.3115/1119176.1119206
Farhad Nooralahzadeh, Jan Tore Lønning, and
Lilja Øvrelid. 2019. Reinforcement-based
denoising of distantly supervised ner with par-
tial annotation. In DeepLo@EMNLP-IJCNLP.
https://doi.org/10.18653/v1/D19
-6125
Joel Nothman, James R. Curran, and Tara Murphy.
2008. Transforming Wikipedia into named
entity training data. In Proceedings of the
Australasian Language Technology Association
Workshop 2008, pages 124–132.
Minlong Peng, Xiaoyu Xing, Qi Zhang, Jinlan
Fu, and Xuanjing Huang. 2019. Distantly super-
vised named entity recognition using positive-
unlabeled learning. In ACL. https://doi
.org/10.18653/v1/P19-1231
Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized
word representations. In NAACL.
https://doi.org/10.18653/v1/N18-1202
Lev Ratinov and Dan Roth. 2009. Design chal-
lenges and misconceptions in named entity
recognition. In Proceedings of the Thirteenth
Conference on Computational Natural Lan-
guage Learning, pages 147–155. Association
for Computational Linguistics. https://
doi.org/10.3115/1596374.1596399
Alexander Ratner, Stephen H. Bach, Henry
Ehrenberg, Jason Fries, Sen Wu, and
Christopher Ré. 2020. Snorkel: Rapid training
data creation with weak supervision. The VLDB
Journal, 29(2):709–730.
https://doi.org/10.1007/s00778-019-00552-1, Pubmed:
32214778
Herbert Robbins and Sutton Monro. 1951. A
stochastic approximation method. The Annals
of Mathematical Statistics, pages 400–407.
https://doi.org/10.1214/aoms/1177729586
Pontus Stenetorp, Sampo Pyysalo, Goran Topić,
Tomoko Ohta, Sophia Ananiadou, and Jun’ichi
Tsujii. 2012. BRAT: A web-based tool for
NLP-assisted text annotation. In Proceedings
of the Demonstrations Session at EACL 2012,
Avignon, France. Association for Computa-
tional Linguistics.
Erik F. Tjong Kim Sang and Fien De Meulder.
2003. Introduction to the CoNLL-2003 shared
task: Language-independent named entity
recognition. In Proceedings of the Seventh Con-
ference on Natural Language Learning at
HLT-NAACL 2003, pages 142–147. https://
doi.org/10.3115/1119176.1119195
Yuta Tsuboi, Hisashi Kashima, Hiroki Oda, Shinsuke Mori, and Yuji Matsumoto. 2008. Training conditional random fields using incomplete annotations. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 897–904. Association for Computational Linguistics. https://doi.org/10.3115/1599081.1599194
Vladimir N. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg. https://doi.org/10.1007/978-1-4757-2440-0
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.523
Yaosheng Yang, Wenliang Chen, Zhenghua Li, Zhengqiu He, and Min Zhang. 2018. Distantly supervised NER with partial annotation learning and reinforcement learning. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2159–2169.
A Appendix: Proof of Theorem 2
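Proof of Theorem 2: We have

$$L_\infty(\theta; \lambda_u, \rho, \gamma) = g(\theta) + h(\theta)$$

where $g(\theta) = E[L_p(\theta; D_m)]$ and $h(\theta) = E[\lambda_u L_u(\theta; D_m, \rho, \gamma)]$. Note that

$$g(\theta) = \sum_{x_{1:n},\, y_{1:n}} \tilde{p}(x_{1:n}, y_{1:n})\, g'(x_{1:n}, y_{1:n}, \theta) \tag{6}$$

where $\tilde{p}(x_{1:n}, y_{1:n}) = p_X(x_{1:n}) \times \tilde{p}(y_{1:n} \mid x_{1:n})$ and

$$g'(x_{1:n}, y_{1:n}, \theta) = -\log \sum_{y'_{1:n} \models y_{1:n}} p(y'_{1:n} \mid x_{1:n}; \theta) \tag{7}$$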
Define $\theta^*$ to be such that $\forall x_{1:n} \in \mathcal{X}$, $\forall y_{1:n}$, $p(y_{1:n} \mid x_{1:n}; \theta^*) = p_{Y|X}(y_{1:n} \mid x_{1:n})$ (by Assumption 1(b) such a parameter setting must exist). The following properties are easily verified to hold: (1) $\forall \theta$, $g(\theta) \ge 0$, $h(\theta) \ge 0$ and (2) $g(\theta^*) = h(\theta^*) = 0$. Hence $\theta^*$ is a minimizer of $g(\theta) + h(\theta)$.

We now show that any minimizer $\theta'$ of $g(\theta) + h(\theta)$ must satisfy the property that $\forall x_{1:n} \in \mathcal{X}$, $\forall y_{1:n}$, $p(y_{1:n} \mid x_{1:n}; \theta') = p_{Y|X}(y_{1:n} \mid x_{1:n})$. Because both terms are non-negative and the value 0 is attained at $\theta^*$, for $\theta'$ to be a minimizer of $g(\theta) + h(\theta)$ it must be the case that $g(\theta') = h(\theta') = 0$. We then note the following steps:

(i) By Lemma 3, if $g(\theta') = 0$ it must hold that $\forall x_{1:n} \in \mathcal{X}$, $\forall i \in \{1 \ldots n\}$ such that $p_{Y|X}(y_i = y \mid x_{1:n}) = 1$ and $y \ne o$, $p(y_i = y \mid x_{1:n}; \theta') = 1$.

(ii) It remains to be shown that $\forall x_{1:n} \in \mathcal{X}$, $\forall i \in \{1 \ldots n\}$ such that $p_{Y|X}(y_i = y \mid x_{1:n}) = 1$ and $y = o$, $p(y_i = y \mid x_{1:n}; \theta') = 1$.

(iii) Property (ii) follows from (i) through proof by contradiction. If $\exists\, x_{1:n} \in \mathcal{X}$ together with $i \in \{1 \ldots n\}$ such that $p_{Y|X}(y_i = y \mid x_{1:n}) = 1$ and $y = o$, and $p(y_i = y \mid x_{1:n}; \theta') < 1$, it must be the case that $h(\theta') > 0$, because the expected number of $o$ tags under $\theta'$ is strictly less than the expected number of $o$ tags under $p_{Y|X}$. $\blacksquare$
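To make the role of $g'$ in Eq. 7 concrete, the following is a minimal Python sketch that evaluates it by brute-force enumeration. It is an illustration under stated assumptions, not the authors' implementation: the tag set `TAGS`, the per-token independent toy model, and the helper names `sequence_prob`, `is_consistent`, and `g_prime` are all hypothetical, and unobserved (latent) positions are marked with `None`.

```python
import itertools
import math

# Hypothetical tag set for illustration; "o" is the non-entity tag.
TAGS = ["o", "PER", "LOC"]


def sequence_prob(token_probs, tags):
    """Probability of a full tag sequence under a toy model in which
    tags are independent across positions (an assumption of this sketch)."""
    prob = 1.0
    for dist, tag in zip(token_probs, tags):
        prob *= dist[tag]
    return prob


def is_consistent(full_tags, partial_tags):
    """True if full_tags agrees with partial_tags at every observed
    (non-None) position, i.e. the relation y' |= y in Eq. 7."""
    return all(obs is None or obs == tag
               for tag, obs in zip(full_tags, partial_tags))


def g_prime(token_probs, partial_tags):
    """Negative log of the total probability of all full tag sequences
    that are consistent with the partial annotation (Eq. 7)."""
    total = sum(
        sequence_prob(token_probs, full)
        for full in itertools.product(TAGS, repeat=len(partial_tags))
        if is_consistent(full, partial_tags)
    )
    return -math.log(total)


# One observed entity tag followed by two latent positions.
partial = ["PER", None, None]

# Model A puts probability 1 on the observed entity tag.
confident = [
    {"o": 0.0, "PER": 1.0, "LOC": 0.0},
    {"o": 0.5, "PER": 0.25, "LOC": 0.25},
    {"o": 0.25, "PER": 0.25, "LOC": 0.5},
]

# Model B puts only probability 0.5 on the observed entity tag.
unsure = [
    {"o": 0.5, "PER": 0.5, "LOC": 0.0},
    {"o": 0.5, "PER": 0.25, "LOC": 0.25},
    {"o": 0.25, "PER": 0.25, "LOC": 0.5},
]

print(g_prime(confident, partial))
print(g_prime(unsure, partial))
```

In this toy setting the first value is zero (up to rounding) and the second equals $\log 2$. The value is zero exactly when the model puts all of its mass on the observed entity tag, whatever it predicts at the latent positions. This is why $g(\theta') = 0$ alone yields only property (i), and why the $h$ term is needed in step (iii) to pin down the $o$ tags.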
Lemma 3. Define $g(\theta)$ and $g'(x_{1:n}, y_{1:n}, \theta)$ as in Eqs. 6 and 7 above. For any value of $\theta$ such that $g(\theta) = 0$, $\forall x_{1:n} \in \mathcal{X}$, $\forall i \in \{1 \ldots n\}$ such that $p_{Y|X}(y_i = y \mid x_{1:n}) = 1$ and $y \ne o$, $p(y_i = y \mid x_{1:n}; \theta) = 1$.

Proof: If $g(\theta) = 0$, then for all $x_{1:n}, y_{1:n}$ such that $\tilde{p}(x_{1:n}, y_{1:n}) > 0$, it must be the case that

$$-\log \sum_{y'_{1:n} \models y_{1:n}} p(y'_{1:n} \mid x_{1:n}; \theta) = 0 \quad \text{and hence} \quad \sum_{y'_{1:n} \models y_{1:n}} p(y'_{1:n} \mid x_{1:n}; \theta) = 1.$$

The proof is then by contradiction: if there exists some $x_{1:n} \in \mathcal{X}$, $i \in \{1 \ldots n\}$ such that $p_{Y|X}(y_i = y \mid x_{1:n}) = 1$ and $y \ne o$ and $p(y_i = y \mid x_{1:n}; \theta) < 1$, it must be the case that there exists some $y_{1:n}$ such that $\tilde{p}(x_{1:n}, y_{1:n}) > 0$, $y_i = y$, and

$$\sum_{y'_{1:n} \models y_{1:n}} p(y'_{1:n} \mid x_{1:n}; \theta) < 1,$$

which contradicts the requirement that this sum equal 1.
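The contradiction that closes the proof of Lemma 3 rests on a marginalization bound, which can be spelled out as follows. This is a sketch of our reading of the argument. It assumes that $y'_{1:n} \models y_{1:n}$ means $y'_{1:n}$ agrees with $y_{1:n}$ at every observed position, and that position $i$, which carries the entity tag $y$, is among them.

```latex
\begin{aligned}
\sum_{y'_{1:n} \models y_{1:n}} p(y'_{1:n} \mid x_{1:n}; \theta)
  &\le \sum_{y'_{1:n} :\, y'_i = y} p(y'_{1:n} \mid x_{1:n}; \theta)
   = p(y_i = y \mid x_{1:n}; \theta) < 1, \\
\text{so}\quad g'(x_{1:n}, y_{1:n}, \theta) &> 0
  \quad\text{and}\quad
  g(\theta) \ge \tilde{p}(x_{1:n}, y_{1:n})\, g'(x_{1:n}, y_{1:n}, \theta) > 0 .
\end{aligned}
```

The last inequality holds because every term in the sum of Eq. 6 is non-negative, so a single strictly positive term makes $g(\theta) > 0$, contradicting $g(\theta) = 0$.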