Ultra-fine Entity Typing with Indirect Supervision from Natural
Language Inference

Bangzheng Li⋄†∗, Wenpeng Yin‡, and Muhao Chen⋄

⋄University of Southern California, USA
†University of Illinois at Urbana-Champaign, USA
‡Temple University, USA
vincentleebang@gmail.com; wenpeng.yin@temple.edu; muhaoche@usc.edu

Abstract

The task of ultra-fine entity typing (UFET) seeks to predict diverse and free-form words or phrases that describe the appropriate types of entities mentioned in sentences. A key challenge for this task lies in the large number of types and the scarcity of annotated data per type. Existing systems formulate the task as a multi-way classification problem and train directly or distantly supervised classifiers. This causes two issues: (i) the classifiers do not capture the type semantics because types are often converted into indices; (ii) systems developed in this way are limited to predicting within a pre-defined type set, and often fall short of generalizing to types that are rarely seen or unseen in training.

This work presents LITE, a new approach that formulates entity typing as a natural language inference (NLI) problem, making use of (i) the indirect supervision from NLI to infer type information meaningfully represented as textual hypotheses and alleviate the data scarcity issue, as well as (ii) a learning-to-rank objective to avoid the pre-defining of a type set. Experiments show that, with limited training data, LITE obtains state-of-the-art performance on the UFET task. In addition, LITE demonstrates its strong generalizability: it not only yields the best results on other fine-grained entity typing benchmarks but, more importantly, a pre-trained LITE system also works well on new data containing unseen types.¹

1 Introduction

Entity typing, inferring the semantic types of the entity mentions in text, is a fundamental and long-lasting research problem in natural language understanding. The resulting type information can help with grounding human language components to real-world concepts (Chandu et al., 2021), and provide valuable prior knowledge for natural language understanding tasks such as entity linking (Ling et al., 2015; Onoe and Durrett, 2020), question answering (Yavuz et al., 2016), and information extraction (Koch et al., 2014). Prior studies have mainly formulated the task as a multi-way classification problem (Wang et al., 2021; Zhang et al., 2019; Chen et al., 2020a; Hu et al., 2020).

∗ This work was done when the first author was visiting the University of Southern California.

¹ Our models and implementation are available at https://github.com/luka-group/lite.

However, earlier efforts for entity typing are far from enough for representing real-world scenarios, where types of entities can be extremely diverse. Accordingly, the community has recently paid much attention to more fine-grained modeling of types for entities. One representative work is the Ultra-fine Entity Typing (UFET) benchmark created by Choi et al. (2018). The task seeks to search for the most appropriate types for an entity among over ten thousand free-form type candidates. The drastic increase of types forces us to question whether the multi-way classification framework is still suitable for UFET. In this context, two main issues are noticed from prior work. First, prior studies have not tried to understand the target types since most classification systems converted all types into indices. Without knowing the semantics of types, it is hard to match an entity mention to a correct type, especially when there is not sufficient annotated data for each type. Second, existing entity typing systems are far behind the desired capability in real-world applications in which any open-form types can appear. Specifically, those pre-trained multi-way classifiers cannot recognize types that are unseen in training, especially when there is no reasonable mapping from existing types to unseen type labels, unless the classifiers are re-trained to include those new types.

Transactions of the Association for Computational Linguistics, vol. 10, pp. 607–622, 2022. https://doi.org/10.1162/tacl_a_00479
Action Editor: Benjamin Van Durme. Submission batch: 10/2021; Revision batch: 12/2021; Published 5/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

To alleviate the aforementioned challenges, we propose a new learning framework that seeks to enhance ultra-fine entity typing with indirect supervision from natural language inference (NLI) (Dagan et al., 2006). Specifically, our method LITE (Language Inference based Typing of Entities) treats each entity-mentioning sentence as a premise in NLI. Using simple, template-based generation techniques, a candidate type is transformed into a textual description and is treated as the hypothesis in NLI. Based on the premise sentence and a hypothesis description of a candidate type, the entailment score given by an NLI model is regarded as the confidence of the type. On top of the pre-trained NLI model, LITE optimizes a learning-to-rank objective, which aims at scoring hypotheses of positive types higher than the hypotheses of sampled negative types. Finally, the label candidates whose hypotheses obtain scores above a threshold are given as predictions by the model.

Technically, LITE benefits ultra-fine entity typing from three perspectives. First, the inference ability of a pre-trained NLI model can provide effective indirect supervision to improve the prediction of type information. Second, the hypothesis, as a type description, also provides a semantically rich representation of the type, which further benefits few-shot learning with insufficient labeled data. Moreover, to handle the dependency of type labels in different granularities, we also utilize the inference ability of an NLI model to learn that the finer label hypothesis of an entity mention entails its general label hypothesis. Experimental results on the UFET benchmark (Choi et al., 2018) show that LITE drastically outperforms the recent state-of-the-art (SOTA) systems (Dai et al., 2021; Onoe et al., 2021; Liu et al., 2021) without any need for the distantly supervised data that they rely on. In addition, LITE also yields the best performance on traditional (less) fine-grained entity typing tasks.² What's more, because we adopt a learning-to-rank objective to optimize the inference ability of LITE rather than classification on a specified label space, it is feasible to apply the trained model across different typing data sets. We therefore test its transferability by training on UFET and evaluating on traditional fine-grained benchmarks, which yields promising results. Moreover, we also examine the time efficiency of LITE and discuss the trade-off between training and inference costs in comparison with prior methods.

² Note that although these more traditional entity typing tasks are termed ‘‘fine-grained entity typing’’, their typing systems are much less fine-grained than that of UFET.

To summarize, the contributions of our work are threefold. First, to our knowledge, this is the first work that uses NLI formulation and NLI supervision to handle entity typing. As a result, our system is able to retain the labels’ semantics and encode the label dependency effectively. Second, our system offers SOTA performance on both ultra-fine entity typing and regular fine-grained typing tasks, being particularly strong at predicting zero-shot and few-shot cases. Finally, we show that our system, once trained, can also work on different test sets that are free to have unseen types.

2 Related Work

Entity Typing. Traditional entity typing was introduced and thoroughly studied by Ling and Weld (2012). One main challenge that earlier efforts focused on was obtaining sufficient training data to develop the typing model. To do so, automatic annotation has been commonly used in a series of studies (Gillick et al., 2014; Ling and Weld, 2012; Yogatama et al., 2015). Later research sought further improvement by modeling the label dependency with a hierarchy-aware loss (Ren et al., 2016; Xu and Barbosa, 2018). External knowledge from knowledge bases has also been introduced to capture the semantic relations or relatedness of type information (Jin et al., 2019; Dai et al., 2019; Obeidat et al., 2019). Ding et al. (2021) adopt prompts to model the relationship between entities and type labels, which is similar to our template-based type description generation. However, their prompts are intended for label generation from masked language models whereas our templates realize the supervision from NLI.

More recently, Choi et al. (2018) proposed the ultra-fine entity typing (UFET) task, which involved free-form type labeling to realize the open-domain label space with much more comprehensive coverage of types. As the UFET task poses non-trivial learning and inference problems, several methods have been explored to model the structure of the label space more effectively. Xiong et al. (2019) utilized a graph propagation layer to impose label-relation bias in order to capture type dependencies implicitly. Onoe and Durrett (2019) trained a filtering and relabeling model with the human annotated data to denoise the automatically generated data for training. Onoe et al. (2021) introduced box embeddings (Vilnis et al., 2018) to represent the dependency among multiple levels of type labels as the topology of axis-aligned hyper-rectangles (boxes). To further cope with insufficient training data, Dai et al. (2021) used a pre-trained language model to augment (noisy) training data with masked entity generation. Different from their strategy of augmenting training data, our approach generates type descriptions to leverage indirect supervision from NLI, which requires no additional data samples.

Natural Language Inference and Its Applications. Early approaches towards NLI problems were based on studying lexical semantics and syntactic relations (Dagan et al., 2006). Subsequent research then introduced deep-learning methods to this task to capture contextual semantics. Parikh et al. (2016) utilized Bi-LSTM (Hochreiter and Schmidhuber, 1997) to encode the input tokens and used an attention mechanism to capture substructures of input sentences. Most recent work develops end-to-end trained NLI models that leverage pre-trained language models (Devlin et al., 2019; Liu et al., 2019) for sentence pair representation and large learning resources (Bowman et al., 2015; Williams et al., 2018) for training.

Specifically, because pre-trained NLI models benefit from generalizable logical inference, current literature has also proposed to leverage NLI models to improve prediction tasks with insufficient training labels, including zero-shot and few-shot text classification (Yin et al., 2019). Shen et al. (2021) adopted RoBERTa-large-MNLI (Liu et al., 2019) to calculate the document similarity for document multi-class classification. Chen et al. (2021) proposed to verify the output of a QA system with NLI models by converting the question and answer into a hypothesis and extracting textual evidence from the reference document as the premise.

Recent work by Yin et al. (2020) and White et al. (2017) is particularly relevant to this topic; it utilizes NLI as a unified solver for several text classification tasks such as co-reference resolution and multiple choice QA in a few-shot or fully-supervised manner. In contrast, our work adopts a learning-to-rank objective for inference in a large candidate space, which not only enhances learning under data-scarce conditions, but is also free to be adapted to infer new labels that are unseen in training. Yin et al. (2020) also proposed an approach to transform the co-reference resolution task into an NLI format, and we adapted it as one of our template generation methods, which is discussed in §3.2.

3 Method

In this section, we introduce the proposed method for (ultra-fine) entity typing with NLI. We start with the problem definition and an overview of our NLI-based entity typing framework (§3.1), followed by technical details of type description generation (§3.2), label dependency modeling (§3.3), learning objective (§3.4), and inference (§3.5).

3.1 Preliminaries

Problem Definition. The input of an entity typing task is a sentence s and an entity mention of interest e ∈ s. This task aims at typing e with one or more type labels from the label space L. For example, in ‘‘Jay is currently working on his Spring 09 collection, which is being sponsored by the YKK Group.’’, the entity ‘‘Jay’’ should be labeled as person, designer, or creator instead of organization or location.

The structure of the label space L can vary. For example, in some benchmarks like OntoNotes (Gillick et al., 2014), labels are provided in canonical form and strictly depend on their ancestor types. In this case, a type label bridge appears as /location/transit/bridge. However, in benchmarks like FIGER (Ling and Weld, 2012), some labels have a dependency on their ancestors while the others are free-form and uncategorized. For example, the label film is given as /art/film but currency appears as a single word. For our primary task, ultra-fine entity typing, the UFET benchmark (Choi et al., 2018) provides no ontology of the labels, and the label vocabulary consists of free-form words only. In this case, film star and person can appear independently in an annotation set with no dependency information provided.

Figure 1: Entity typing by LITE with indirect supervision from NLI.

Overview of LITE. Given a sentence with at least one entity mention, LITE treats the sentence as the premise in NLI, and then learns to type the entity in three consecutive steps (Figure 1). First, LITE employs a simple, low-cost template-based technique to generate a natural language description for a type candidate. This type description is treated as the hypothesis in NLI. For this step, we explore three different description generation templates (§3.2). Second, to capture label dependency, whether or not the type ontology is provided, LITE consistently generates type descriptions for any ancestors of the original type label of the same sentence and learns their logical dependencies (§3.3). These two steps create positive cases of type descriptions for the entity mention in the sentence. Last, LITE fine-tunes a pre-trained NLI model with a learning-to-rank objective that ranks the positive case(s) over negative-sampled type descriptions according to the entailment score (§3.4). During the inference phase, given another sentence that mentions an entity to be typed, our model predicts the type whose hypothetical type description obtains the highest entailment score (§3.5). In this way, LITE can effectively leverage indirect supervision signals of a (pre-trained) NLI model to infer the type information of a mentioned entity.

3.2 Type Description Generation

Given each sentence s with an annotated entity mention e, LITE first generates a natural language type description T(a) for the type label annotation a. The description will later act as a hypothesis in NLI. Specifically, we consider several generation techniques to obtain such type descriptions, the details of which are described as follows.

• Taxonomic statement. The first template directly connects the entity mention and the type label with an ‘‘is-a’’ statement, namely, ‘‘[ENTITY] is a [LABEL]’’.

• Contextual explanation. The second template generates a declarative sentence that adds a context-related connective. The generated type description is in the form of ‘‘In this context, [ENTITY] is referring to [LABEL]’’.

• Label substitution. Yin et al. (2020) proposed to transform the co-reference resolution problem into an NLI format by replacing the pronoun mentions with candidate entities. Inspired by their transformation, this technique directly replaces the [ENTITY] in the original sentence with [LABEL]. Therefore, the NLI model will treat the modified sentence with a ‘‘type mention’’ as the hypothesis of the original sentence with the entity mention.
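The three templates above can be sketched as small string functions. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, article choice ("a" vs. "an") is ignored, and label substitution is simplified to replacing the first occurrence of the mention string.

```python
def taxonomic_statement(entity: str, label: str) -> str:
    """'[ENTITY] is a [LABEL]' template (article choice is not handled)."""
    return f"{entity} is a {label}."


def contextual_explanation(entity: str, label: str) -> str:
    """'In this context, [ENTITY] is referring to [LABEL]' template."""
    return f"In this context, {entity} is referring to {label}."


def label_substitution(sentence: str, entity: str, label: str) -> str:
    """Replace the first occurrence of the entity mention with the label.

    Simplification: a real system would replace the annotated mention span,
    not the first string match.
    """
    replaced = sentence.replace(entity, label, 1)
    # Capitalize the first character in case the label now starts the sentence.
    return replaced[:1].upper() + replaced[1:]
```

Each function returns the hypothesis string; the original sentence is kept unchanged as the premise.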

We describe the technical details of training and
inference steps of LITE in the rest of the section.

As shown in Table 1, each template provides a semantically meaningful way to connect the entity with a label. In this way, the inference ability of an NLI model can be leveraged to capture the relationship of entity and label, given the original entity-mentioning sentence as the premise.

| Template | Type Description | Premise-Hypothesis Pair for NLI |
|---|---|---|
| Taxonomic Statement | Jay is a producer. | Premise: ‘‘Jay is currently working on his Spring 09 collection, . . . ’’ Hypothesis: ‘‘Jay is a producer.’’ |
| Contextual Explanation | In this context, career at a company is referring to duration. | Premise: ‘‘No one expects a career at a company any more, . . . ’’ Hypothesis: ‘‘In this context, career at a company is referring to duration.’’ |
| Label Substitution | Musician knows how to make a hip-hop record sound good. | Premise: ‘‘He knows how to make a hip-hop record sound good.’’ Hypothesis: ‘‘Musician knows how to make a hip-hop record sound good.’’ |

Table 1: Type description instances of three templates. Entity mentions are boldfaced and underlined whereas label words are only boldfaced.

In particular, we have also tried the automatic template generation method proposed by Gao et al. (2021), which has led to the adoption of the contextual explanation template. This technique adopts the pre-trained text-to-text Transformer T5 (Raffel et al., 2020) to generate prompt sentences for fine-tuning language models. In our case, T5 mask tokens are added between the sentence, the entity, and the label. Since T5 is trained to fill in the blanks within its input, the output tokens can be used as the template for our type description. For example, given the sentence ‘‘Anyway, Nell is their new singer, and I would never interrupt her show.’’, the entity Nell and the annotations (singer, musician, person), we can formulate the input to T5 as ‘‘Anyway, Nell is their new singer, and I would never interrupt her show. <X> Nell <Y> singer <Z>’’. T5 will then fill in the placeholders <X>, <Y>, and <Z>, and output ‘‘. . . I would never interrupt her show. In fact, Nell is a singer.’’ We observe that most of the templates generated by T5 follow a format in which a prepositional phrase (in fact, in this context, in addition, etc.) is followed by a statement such as ‘‘[ENTITY] is a [LABEL]’’ or ‘‘[ENTITY] was [LABEL]’’. Accordingly, we select the above contextual explanation template, which is the most representative pattern observed in the generations.
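As a rough illustration of the input construction described above, the sketch below places T5 sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, `<extra_id_2>`, which are part of the released T5 vocabulary) between the sentence, the entity, and the label, and then turns three filled-in spans into a reusable template. The helper names are hypothetical, the exact spacing is an assumption, and the generation call to T5 itself is deliberately left out.

```python
def build_t5_infill_input(sentence: str, entity: str, label: str) -> str:
    """Place T5 sentinel tokens between the sentence, the entity, and the label."""
    return f"{sentence} <extra_id_0> {entity} <extra_id_1> {label} <extra_id_2>"


def fills_to_template(fills: tuple[str, str, str]) -> str:
    """Turn the three spans filled in by the model into a template string.

    `fills` holds the text generated for the three blanks, in order.
    """
    before_entity, between, after_label = fills
    return f"{before_entity} [ENTITY] {between} [LABEL]{after_label}"
```

With the fills ("In fact,", "is a", "."), the second helper yields a template of the prepositional-phrase-plus-statement shape discussed in the text.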

In the training process, we use one of the three templates to generate the hypotheses, and the same template is also used to obtain the candidate hypotheses in inference. According to our preliminary results on the dev set, the taxonomic statement generation generally gives better performance than the others under most settings; the analysis is presented in §4.3. Hence, the main experimentation is reported for the configuration where LITE uses the type descriptions based on the taxonomic statement.

3.3 Modeling Label Dependency

The rich entity type vocabulary may form hierarchies that enforce logical dependency among labels of different specificity. Thus, we extend the generation process of type descriptions to better capture such label dependency. In detail, for a specific type label for which LITE has generated a type description, if there are ancestor types, we not only generate descriptions for each of the ancestor types, but also conduct learning among these type descriptions. The descendant type description acts as the premise and the ancestor type description acts as the hypothesis. For example, in OntoNotes (Gillick et al., 2014) or FIGER (Ling and Weld, 2012), suppose a sentence mentions the entity London, which is labeled as /location/city. If the taxonomic statement based description generation is used, LITE will yield descriptions for both levels of types, that is, ‘‘London is a city’’ and ‘‘London is a location’’. In such a case, the more fine-grained type description ‘‘London is a city’’ can act as the premise of the more coarse-grained description ‘‘London is a location’’, so as to help capture the dependency between the two labels ‘‘city’’ and ‘‘location’’. Such paired type descriptions are added to training and will be captured by the dependency loss $\mathcal{L}_d$ as described in §3.4.
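The pairing of descendant and ancestor descriptions can be sketched as follows for ontology-style labels such as /location/city. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the taxonomic statement template is hard-coded, and every (finer, coarser) combination along the path is paired.

```python
def ancestor_labels(type_path: str) -> list[str]:
    """Split a path like '/location/city' into ['location', 'city'] (coarse to fine)."""
    return [part for part in type_path.split("/") if part]


def dependency_pairs(entity: str, type_path: str) -> list[tuple[str, str]]:
    """Build (premise, hypothesis) pairs: a finer description entails a coarser one."""
    labels = ancestor_labels(type_path)
    pairs = []
    for fine_idx in range(1, len(labels)):
        for coarse_idx in range(fine_idx):
            premise = f"{entity} is a {labels[fine_idx]}."
            hypothesis = f"{entity} is a {labels[coarse_idx]}."
            pairs.append((premise, hypothesis))
    return pairs
```

For a two-level path this produces exactly one pair; for a three-level path such as /location/transit/bridge it produces three.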

This technique to capture label dependency can be easily adapted to tasks where a type ontology is unavailable, but each instance is directly annotated with multiple type labels of different specificity. Particularly for the UFET task (Choi et al., 2018), while no ontology is provided for the label space, the task separates the type label vocabulary into different levels of specificity, namely, general, fine, and ultra-fine ones. Because its annotation of an entity in a sentence includes multiple labels of different specificity, we can still utilize the aforementioned dependency modeling method. For example, an entity Mike Tyson may be simultaneously labeled as person (general), sportsman (fine), and boxer (ultra-fine). Similar to using an ontology, each pair of descendant and ancestor descriptions among the three generated descriptions ‘‘Mike Tyson is a person’’, ‘‘Mike Tyson is a sportsman’’, and ‘‘Mike Tyson is a boxer’’ is also added to training.

3.4 Learning Objective

Let $L$ be the type vocabulary. The learning objective of LITE is to conduct learning-to-rank on top of the NLI model. Given a sentence $s$ with mentioned entity $e$, we use $P$ to denote all true type labels of $e$, which may include the original label and any induced ancestor labels as described in §3.3. Then, for each label $p \in P$ whose type description is generated as $H(p)$ by one of the techniques in §3.2, the NLI model calculates the entailment score $\varepsilon(s, H(p)) \in [0, 1]$ for the premise $s$ and hypothesis $H(p)$. Meanwhile, negative sampling randomly selects a false label $p' \in L \setminus P$. Following the same procedure as above, the entailment score $\varepsilon(s, H(p'))$ is obtained for the premise $s$ and the negative-sample hypothesis $H(p')$. The margin ranking loss for an annotated training case is then defined as

\[ \mathcal{L}_t = [\varepsilon(s, H(p')) - \varepsilon(s, H(p)) + \gamma]_+ . \]

$[x]_+$ denotes the positive part of the input $x$ (i.e., $\max(x, 0)$) and $\gamma$ is a non-negative constant.

We also similarly define a ranking loss to model the label dependency. Still given the above annotated sentence $s$ and the set of all true type labels $P$, as described in §3.3, for any existing pair of ancestor type $p_{an}$ and descendant type $p_{de}$ from $P$, the training phase also captures the entailment relation between their descriptions. This process regards $H(p_{de})$ as the premise and $H(p_{an})$ as the hypothesis, and the NLI model therefore yields an entailment score $\varepsilon(H(p_{de}), H(p_{an}))$. The label dependency loss is then defined as the following ranking loss:

\[ \mathcal{L}_d = [\varepsilon(H(p_{de}), H(p'_{an})) - \varepsilon(H(p_{de}), H(p_{an})) + \gamma]_+ , \]

where $p'_{an}$ is a negative-sampled type label.

The eventual learning objective is to optimize the following joint loss:

\[ \mathcal{L} = \frac{1}{|S|} \sum_{s \in S} \frac{1}{|P_s|} \sum_{p \in P_s} \mathcal{L}_t + \lambda \mathcal{L}_d \]

where $S$ denotes the dataset containing sentences with typed entities, and $P_s$ denotes the set of true labels on an entity of the sentence instance $s$. In this way, all annotations of each entity mention will be involved in training. $\lambda$ here is a non-negative hyperparameter that controls the influence of dependency modeling.
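The losses above can be illustrated numerically with plain Python floats standing in for entailment scores. This is a sketch, not the authors' training code: the function names are hypothetical, the default margin and λ simply mirror the values reported in §4.1, and the way the dependency term is averaged in `joint_loss` is an assumption, since the formula leaves the aggregation open.

```python
import random
from typing import Sequence


def hinge(x: float) -> float:
    """Positive part [x]_+ = max(x, 0)."""
    return max(x, 0.0)


def sample_negative(vocabulary: Sequence[str], positives: set[str], rng: random.Random) -> str:
    """Draw one false label uniformly from the vocabulary minus the true labels."""
    candidates = [label for label in vocabulary if label not in positives]
    return rng.choice(candidates)


def ranking_loss(score_positive: float, score_negative: float, margin: float = 0.1) -> float:
    """Margin ranking loss; the same form serves the typing and the dependency term."""
    return hinge(score_negative - score_positive + margin)


def joint_loss(typing_losses: list[list[float]], dependency_losses: list[float], lam: float = 0.05) -> float:
    """Mean typing loss per sentence, averaged over sentences, plus lam * mean dependency loss.

    Assumption: dependency losses are averaged over all collected pairs.
    """
    per_sentence = [sum(losses) / len(losses) for losses in typing_losses if losses]
    typing_term = sum(per_sentence) / len(per_sentence) if per_sentence else 0.0
    dependency_term = sum(dependency_losses) / len(dependency_losses) if dependency_losses else 0.0
    return typing_term + lam * dependency_term
```

The loss is zero whenever the positive hypothesis already outscores the negative one by at least the margin, so only violated rankings contribute gradient in an actual training setup.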

3.5 Inference

The inference phase of LITE performs ranking on descriptions for all type labels from the vocabulary. For any given sentence s mentioning an entity e, LITE accordingly generates a type description for each candidate type label. Then, taking the sentence s as the premise, the fine-tuned NLI model ranks the hypothetical type descriptions according to their entailment scores. Finally, LITE selects the type label whose description receives the highest entailment score, or, in cases where multi-label prediction is required, predicts all labels whose entailment scores exceed a threshold.
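The inference procedure can be sketched as a ranking over candidate labels. In this illustration the NLI model is abstracted as a callable `entail_score(premise, hypothesis)` returning a value in [0, 1]; that callable, the function name, and the hard-coded taxonomic template are assumptions of the sketch rather than part of the described system.

```python
from typing import Callable, Optional, Sequence


def predict_types(
    sentence: str,
    entity: str,
    candidate_labels: Sequence[str],
    entail_score: Callable[[str, str], float],
    threshold: Optional[float] = None,
) -> list[str]:
    """Rank candidate labels by entailment score of their type descriptions.

    Without a threshold, return the single top-ranked label; with a threshold,
    return every label scoring above it (multi-label prediction).
    """
    scored = sorted(
        ((entail_score(sentence, f"{entity} is a {label}."), label) for label in candidate_labels),
        reverse=True,
    )
    if not scored:
        return []
    if threshold is None:
        return [scored[0][1]]
    return [label for score, label in scored if score > threshold]
```

Because every candidate label requires one forward pass of the NLI model per sentence, inference cost grows linearly with the size of the label vocabulary.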

4 Experiment

In this section, we present the experimental evaluation of the LITE framework, based on both UFET (§4.1) and traditional (less) fine-grained entity typing tasks (§4.2). In addition, we also conduct comprehensive ablation studies to understand the effectiveness of the incorporated techniques (§4.3).

4.1 Ultra-Fine Entity Typing

We use the UFET benchmark created by Choi et al. (2018) for evaluation. The UFET dataset consists of two parts. (i) Human-labeled data (L): 5,994 instances split into train/dev/test by 1:1:1 (1,998 for each); (ii) Distant supervision data (D): including 5.2M instances that are automatically labeled by linking entities to a KB, and 20M instances generated by headword extraction. We follow the original design of the benchmark to evaluate loose macro-averaged precision (P), recall (R), and F1.
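For readers unfamiliar with the metric, the sketch below shows one common way to compute loose macro-averaged scores: precision and recall are computed per example from the overlap of predicted and gold label sets, averaged over examples, and F1 is the harmonic mean of the two averages. Skipping examples with empty predictions (for precision) or empty gold sets (for recall) is an assumption of this sketch; the benchmark's official scorer should be consulted for the exact convention.

```python
def loose_macro_prf(gold: list[set[str]], pred: list[set[str]]) -> tuple[float, float, float]:
    """Loose macro precision, recall, and F1 over parallel lists of label sets."""
    precisions, recalls = [], []
    for gold_labels, pred_labels in zip(gold, pred):
        overlap = len(gold_labels & pred_labels)
        if pred_labels:  # precision is only defined when something was predicted
            precisions.append(overlap / len(pred_labels))
        if gold_labels:  # recall is only defined when gold labels exist
            recalls.append(overlap / len(gold_labels))
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1
```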

Training Data. In our approach, the supervision can come from the MNLI data (NLI) (Williams et al., 2018), distant supervision data (D), and the human-labeled data (L). Therefore, we investigate the best combination of training data by exploring the following different training pipelines:

• LITE_{NLI}: Pre-train on MNLI³, then predict directly, without any tuning on D or L;

• LITE_{L}: Only fine-tune on L;

• LITE_{NLI+L}: Pre-train on MNLI, then fine-tune on L;

• LITE_{D+L}: Pre-train on D, then fine-tune on L;

• LITE_{NLI+D+L}: First pre-train on MNLI, then on D, finally fine-tune on L.

Model Configurations. Our system is first initialized as RoBERTa-large (Liu et al., 2019) and AdamW (Loshchilov and Hutter, 2018) is used to optimize the model. The hyperparameters as well as the output threshold are tuned on the dev set: batch size 16, pre-training (D) learning rate 1e-6, fine-tuning (L) learning rate 5e-6, margin γ = 0.1 and λ = 0.05. The pre-training epochs are limited to 5, which is enough considering the large size of pre-training data. The fine-tuning epochs are limited to 2,000; models are evaluated every 30 epochs on dev and the best model is kept to conduct inference on test.

Baselines. We compare LITE with the follow-
ing strong baselines. Except for LRN which is
merely trained on the human annotated data, all the
other baselines incorporate the distant supervision
data as extra training resource.

• UFET-biLSTM (Choi et al., 2018) repre-
sents words using the GloVe embedding
(Pennington et al., 2014) and captures se-
mantic information of sentences, entities,
as well as labels with a bi-LSTM and
a character-level CNN.
It also learns a
type label embedding matrix to operate in-
ner product with the context and mention
representation for classification.

• LabelGCN (Xiong et al., 2019) improves
UFET-biLSTM by stacking a GCN layer on
the top to capture the latent label dependency.
• LDET (Onoe and Durrett, 2019) applies ELMo embeddings (Peters et al., 2018) for word representation and adopts LSTM as its sentence and mention encoders. Similar to UFET-biLSTM, it learns a matrix to compute inner product with each input representation for classification. In addition, LDET also trains a filter and relabeler to fix the label inconsistency in the distant supervision training data.

• Box4Types (Onoe et al., 2021) introduces box embeddings to handle the type dependency problems. It uses BERT-large-uncased (Devlin et al., 2019) as the backbone and projects the hidden classification vector to a hyper-rectangular (box) space. Each type from the label space is also represented as a box and the classification is fulfilled by computing the intersection of the input text and type boxes.

• LRN (Liu et al., 2021) encodes the context and entity with BERT-base-uncased. Then two LSTM-based auto-regression networks capture the context-label relation and the label-label relation via attention mechanisms, respectively, in order to generate labels. They simultaneously construct bipartite graphs for sentence tokens, entities, and generated labels to perform relation reasoning and predict more labels.

• MLMET (Dai et al., 2021), the prior SOTA system, first generates additional distant supervision data by the BERT Masked Language Model, then stacks a linear layer on BERT to learn the classifier on the union label space.

³ This is obtained from huggingface.co/roberta-large-mnli.
Ergebnisse. Tisch 2 compares LITE with baselines,
in which LITE adopts the taxonomic statement
Vorlage (d.h., ‘‘[ENTITY] ist ein [LABEL]’’).

Overall, LITENLI+L demonstrates SOTA performance over other baselines, outperforming the prior top system MLMET (Dai et al., 2021) with 1.5% absolute improvement on F1. Recall that MLMET built a multi-way classifier on its newly collected distant supervision data and the human-labeled data, whereas our LITE optimizes a textual entailment scheme on the entailment data (i.e., MNLI) and the human-labeled entity typing data. This comparison verifies the effectiveness of using the entailment scheme and the indirect supervision from NLI.

Model | P | R | F1
UFET-biLSTM (Choi et al., 2018) | 48.1 | 23.2 | 31.3
LabelGCN (Xiong et al., 2019) | 50.3 | 29.2 | 36.9
LDET (Onoe and Durrett, 2019) | 51.5 | 33.0 | 40.1
Box4Types (Onoe et al., 2021) | 52.8 | 38.8 | 44.8
LRN (Liu et al., 2021) | 54.5 | 38.9 | 45.4
MLMET (Dai et al., 2021) | 53.6 | 45.3 | 49.1
LITENLI | 1.5 | 7.1 | 2.5
LITEL | 48.7 | 45.8 | 47.2
LITED+L | 27.5 | 56.4 | 37.0
LITENLI+D+L | 45.4 | 49.9 | 47.4
LITENLI+L | 52.4 | 48.9 | 50.6
–w/o label dependency | 53.3 | 46.6 | 49.7

Table 2: Results on the ultra-fine entity typing task. LITE series are equipped with the Taxonomic Statement template. ‘‘w/o label dependency’’ is applied to the ‘‘NLI+L’’ setting. The F1 result by LITENLI+L is statistically significant (p-value < 0.01 in t-test) in comparison with the best baseline result by MLMET.

The bottom block in Table 2 further explores the best combination of available training data. First, training on MNLI alone (i.e., LITENLI) does not provide promising results. This could be because MNLI does not generalize well to the UFET task. LITEL removes the supervision from NLI as compared to LITENLI+L, causing a noticeable performance drop. In addition, the comparison between LITENLI+L and LITED+L illustrates that the MNLI data, as an out-of-domain resource, provides even more beneficial supervision than the distant annotations. To our knowledge, this is the first work to show that, rather than relying on gathering distant supervision data in the (entity-mentioning context, type) style, it is possible to find more effective supervision from other tasks (e.g., from entailment data) to boost the performance. However, when we incorporate the distant supervision data (D) into LITENLI+L, the new system LITENLI+D+L performs worse. We present more detailed analyses in §4.3.

In addition, we also investigate the contribution of label dependency modeling by removing it from LITENLI+L. As the results in Table 2 show, incorporating label dependency improves recall by a large margin (from 46.6 to 48.9) despite a minor drop in precision (from 53.3 to 52.4), leading to a notable overall improvement in F1 (from 49.7 to 50.6).

4.2 Fine-grained Entity Typing

In addition to UFET, we are also interested in (i) the effectiveness of LITE on entity typing tasks with far fewer types, and (ii) whether the LITE model learned on the ultra-fine task can be used for inference on other entity typing tasks, which often have unseen types, even without further tuning. To that end, we evaluate LITE on OntoNotes (Gillick et al., 2014) and FIGER (Ling and Weld, 2012), two popular fine-grained entity typing benchmarks.
OntoNotes contains 3.4M automatically labeled entity mentions for training and 11k manually annotated instances that are split into 8k for dev set and 2k for test set. Its label space consists of 88 types and one more other type. In inference, LITE outputs other if none of the 88 types is scored over the threshold described in §3.5.

FIGER contains 2M data samples labeled with 113 types. The dev set and test set include 1,000 and 562 samples, respectively. Within its label space, 82 types have a dependency relation with their ancestor or descendant types while the other 30 types are uncategorized free-form words.

Results. Table 3 reports baseline results as well as results of two variants of LITE: one is pre-trained on UFET and directly transfers to predict on the two target benchmarks; the other conducts task-specific training on the target benchmark after pre-training on MNLI. The task-specific training variant outperforms the respective prior SOTA on both benchmarks (OntoNotes: 86.4 vs. 85.4 in macro-F1, 80.9 vs. 80.4 in micro-F1; FIGER: 86.7 vs. 84.9 in macro-F1, 83.3 vs. 81.5 in micro-F1).

Model | OntoNotes macro-F1 | OntoNotes micro-F1 | FIGER macro-F1 | FIGER micro-F1
Hierarchy-Typing (Chen et al., 2020b) | 73.0 | 68.1 | 83.0 | 79.8
Box4Types (Onoe and Durrett, 2020) | 77.3 | 70.9 | 79.4 | 75.0
DSAM (Hu et al., 2020) | 83.1 | 78.2 | 83.3 | 81.5
SEPREM (Xu et al., 2021) | – | – | 86.1 | 82.1
MLMET (Dai et al., 2021) | 85.4 | 80.4 | – | –
LITE pre-trained on NLI+UFET | 86.6 | 81.4 | 80.1 | 74.7
LITE NLI+task-specific training | 86.4 | 80.9 | 86.7 | 83.3

Table 3: Results for fine-grained entity typing. All LITE model results are statistically significant (p-value < 0.05 in t-test) in comparison with the best baseline results by MLMET on OntoNotes and by SEPREM on FIGER.

Data Source | Sentence | Labels
Entity Linking | (a) From 1928-1929, he enrolled in graduate coursework at Yale University in New Haven, Connecticut. | location, author, province, cemetery, person
Head Word | (b) Once Upon Andalasia is a video game based on the film of the same name. | art, film
Head Word | (c) You can also use them in casseroles and they can be grated and fried if you want to make hash browns. | brown
Head Word | (d) He has written a number of short stories in different fictional worlds, including Dragonlance, Forgotten Realms, Ravenloft and Thieves’ World. | number
Head Word | (e) Despite obvious parallels and relationships, video art is not film. | film

Table 4: Examples of two sources of distant supervision data (one from entity linking, the other from head word extraction). In the right ‘‘Labels’’ column, correct types are boldfaced while incorrect ones are in gray.

An interesting advantage of LITE lies in its transferability across benchmarks. Table 3 demonstrates that our LITE (pre-trained on UFET) offers competitive performance on both OntoNotes and FIGER even with only zero-shot transfer (it even exceeds the ‘‘task-specific training’’ version on OntoNotes).4 Although there are disjoint type labels between these two datasets and UFET, there exist manually crafted mappings from UFET labels to them (e.g., ‘‘musician’’ to ‘‘/person/artist/music’’). In this way, traditional multi-way classifiers still work across the datasets after type mapping, though we prefer to avoid human involvement in real-world applications. To further test the transferability of LITE, a more challenging experimental setting for zero-shot type prediction is conducted and analyzed in §4.3.

4 LITE pre-trained on UFET performs worse on FIGER than LITE with task-specific training. The main reason could be that a larger portion of FIGER test data comes with an entity of proper noun to be labeled with more compositional types, such as government agency, athlete, sports facility, which have appeared much less often on UFET.
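The fallback to other described for OntoNotes above can be sketched as follows. This is only an illustration under stated assumptions: the function name `predict_types`, the example scores, and the threshold value are hypothetical, and ‘‘scored over the threshold’’ is read here as a strict greater-than comparison.

```python
from typing import Dict, List


def predict_types(scores: Dict[str, float], threshold: float, fallback: str = "other") -> List[str]:
    """Keep every type scored above the threshold; fall back to `fallback` if none is.

    `scores` maps a candidate type to its entailment confidence (hypothetical values).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = [label for label, score in ranked if score > threshold]
    return selected if selected else [fallback]


if __name__ == "__main__":
    # Hypothetical confidence scores and threshold, for illustration only.
    print(predict_types({"/person": 0.91, "/person/artist": 0.74, "/location": 0.05}, 0.5))
    print(predict_types({"/person": 0.12, "/location": 0.08}, 0.5))
```

Because the decision is made per type against a threshold rather than over a fixed softmax layer, the same routine applies to any label set for which hypotheses can be written.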
4.3 Analysis

Through the following analyses, we try to answer the following questions: (i) Why did the distant supervision data not help (as Table 2 indicates)? (ii) How effective is each type description template (Table 1)? (iii) With the NLI-style formulation and the indirect supervision, does LITE generalize better for zero-shot and few-shot prediction? Is trained LITE transferable to new benchmarks with unseen types? (iv) On which entity types does our model perform better, and which ones remain challenging? (v) How efficient is LITE?

Distant Supervision Data. As Table 2 indicates, adding distant supervision data in LITENLI+D+L even leads to a drop of 3.2% absolute score in F1 from LITENLI+L. This is likely because the distant supervision data (D) are overall noisy (Onoe and Durrett, 2019). Table 4 lists some frequent and typical problems that exist in D based on entity linking and head-word extraction. In general, they lead to two problems.

On the one hand, a large number of false positive types are introduced. Considering example (a) in Table 4, the state Connecticut is labeled as author, cemetery, and person. For example (c), hash brown is labeled as brown, turning the concept of food into color. Additionally, the head-word method falls short in capturing the semantics. In example (d), number is falsely extracted as the type for a number of short stories because of the preposition ‘‘of’’.

On the other hand, such distant supervision may not comprehensively recall positive types. For instance, examples (b) and (e) are both about the entity ‘‘film’’ where the recalled types are correct. However, in the human annotated data, entity ‘‘film’’ may also be labeled as (‘‘film’’, ‘‘art’’, ‘‘movie’’, ‘‘show’’, ‘‘entertainment’’, ‘‘creation’’). In this situation, those missed positive types (i.e., ‘‘movie’’, ‘‘show’’, ‘‘entertainment’’, and ‘‘creation’’) will be selected by the negative sampling process of LITE and therefore negatively influence the performance. The comparison between LITENLI+L and LITED+L can further justify the superiority of the indirect supervision from NLI over that from the distant supervision data.

Templates | LITENLI+L R | LITENLI+L P | LITENLI+L F1 | LITENLI+D+L R | LITENLI+D+L P | LITENLI+D+L F1 | LITED+L R | LITED+L P | LITED+L F1
Taxonomic Statement | 48.9 | 52.4 | 50.6 | 49.9 | 45.4 | 47.4 | 56.4 | 27.5 | 37.0
Contextual Explanation | 49.2 | 50.8 | 50.2 | 48.5 | 45.3 | 46.8 | 55.4 | 26.9 | 36.2
Label Substitution | 49.3 | 47.4 | 48.3 | 50.7 | 42.5 | 46.2 | 59.3 | 24.8 | 35.0

Table 5: Behavior of different type description templates under three training settings.

Type Description Templates. Table 5 reveals how template choices affect the typing performance. The taxonomic statement template achieves the best F1 under all three training settings. The contextual explanation template yields close yet worse results, but label substitution leads to a more noticeable F1 drop. This may result from the absence of an entity mention in the hypothesis by label substitution. For instance, in ‘‘Soft eye shields are placed on the babies to protect their eyes.’’, LITE with label substitution generates related but incorrect type labels such as treatment, attention, or tissue.

Few- and Zero-shot Prediction. In §4.2, we discussed transferring LITE trained on UFET to other fine-grained entity typing benchmarks. Nevertheless, because UFET labels are still inclusive of them with mapping, we conducted a further experiment in which portions of UFET training labels are randomly filtered out so that 40% of the testing labels are unseen in training. We then investigated the LITENLI+L performance on test types that have zero or a few labeled examples in the training set. Figure 2 shows the results of LITENLI+L and the strongest baseline, MLMET. Note that while the held-out set of type labels is completely unseen to LITE, the full type vocabulary is provided for MLMET during its LM-based data augmentation process in this experiment.

Figure 2: Performance comparison of our system LITE and the prior SOTA system, MLMET, on the filtered version of UFET for zero-shot and few-shot typing. The zero-shot labels correspond to the 40% test set type labels that are unseen in training. We also report the performance on other few-shot type labels.

As the results show, performance on more frequent labels is, as expected, better than on rare labels. LITENLI+L outperforms MLMET on all the listed sets of zero- and few-shot labels; this reveals the strong low-shot prediction performance of our model. Particularly, on the extremely challenging zero-shot labels, LITENLI+L drastically exceeds MLMET by 32.9% vs. 10.8% in F1. Hence, it is demonstrated that the NLI-based entity typing succeeds in more reliably representing and inferring rare and unseen entity types.

The main difference between the NLI framework and multi-way classifiers is that NLI makes use of the semantics of input text as well as the label text; conventional classifiers, however, only model the semantics of input text. Encoding the semantics on the label side is particularly beneficial when the type set is very large and many types lack training data. When some test labels are filtered out in the training process, LITE still performs well with its inference manner, but classifiers (like MLMET) fail to recognize the semantics of unseen labels merely with their features. In this way, LITE maintains high performance when transferring across benchmarks with disjoint type vocabularies.

Group | Input | True Labels | LITE Prediction | MLMET Prediction
LITE exceeds MLMET | (a) The University of California communications major gave her mother a fitting present, surprising herself by winning the 50-meter backstroke gold medal. | athlete, person, swimmer*, contestant, scholar, child | athlete, person, swimmer*, female, student, winner | athlete, person, child, adult, female, mother, woman
LITE exceeds MLMET | (b) The apology is being viewed as a watershed in Australia, with major television networks airing it live and crowd gathering around huge screens in the city. | event, apology*, plea, regret | event, apology*, ceremony, happening, concept | event, message
LITE exceeds MLMET | (c) A drawing table is also sometimes called a mechanical desk because, for several centuries, most mechanical desks were drawing tables. | object, desk, furniture*, board, desk, table | object, desk, furniture* | object, desk, computer
MLMET exceeds LITE | (d) He attended the University of Virginia, where he played basketball and baseball; his brother Bill also played baseball for the University. | basketball*, baseball, fun, action, activity, contact sport, game, sport, athletics, ball game, ball, event | activity, game, sport, event, ball game, ball, athletics | activity, game, sport, event, basketball*
MLMET exceeds LITE | (e) The manner in which it was confirmed however smacked of an acrimonious end to the relationship between club and player with Chelsea. | manner*, way, concept, style, method | event | manner*, event

Table 6: Case study of labels on which LITE improves MLMET or MLMET outperforms LITE. Correct predictions are in blue and * indicates the representative label words for the discussed pattern.

Case Study.
We randomly sampled 100 labels on which LITE improves over MLMET by at least 50% in F1; the typical patterns we recognized are:

• Contextual inference (28%): In case (a) of Table 6, considering the information ‘‘winning the 50-meter backstroke gold medal’’, LITE successfully types her with swimmer in addition to athlete that is given by MLMET.

• Coreference (20%): In case (b), LITE correctly refers the pronoun entity it to ‘‘apology’’ but MLMET merely captures local information ‘‘tv network airing’’ to obtain the label words event, message.

• Hypernym (19%): In case (c), even if there is no mention of furniture in the text, LITE gives a high confidence score to this type that is a hypernym of mechanical desks. Nevertheless, MLMET only obtains trivial answers such as desk, object.

On the other hand, we also sampled 100 labels on which MLMET performs better, and we conclude that LITE falls short mainly in the following scenarios:

• Multiple nominal words (30%): In sample (d) of Table 6, due to the ambiguous meaning of the type hypothesis ‘‘basketball and baseball is a basketball’’, LITE fails to predict the groundtruth label basketball.

• Clause (28%): Instance (e) illustrates a common situation when clauses are included in the entity mention, where the effectiveness of type descriptions is harmed. The clausal information distracts LITE from focusing on the key part of the entity.

Model | Named Entity R | Named Entity P | Named Entity F1 | Pronoun R | Pronoun P | Pronoun F1 | Nominal R | Nominal P | Nominal F1
LITENLI+L | 55.5 | 58.6 | 57.0 | 57.5 | 51.2 | 54.2 | 47.1 | 45.3 | 46.2
MLMET | 54.4 | 58.3 | 56.3 | 50.0 | 57.2 | 53.4 | 38.9 | 49.5 | 43.5

Table 7: Performance comparison of LITE and prior SOTA, MLMET, on named entity, pronoun, and nominal entities, respectively.
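As a quick sanity check on how the scores reported in Table 7 relate to each other, the sketch below recomputes F1 as the harmonic mean of the reported P and R, F1 = 2PR / (P + R). That the paper derives F1 this way is an assumption on our part; the reported values agree with it to within rounding (0.1). The helper name `f1_from_pr` is illustrative.

```python
from typing import Dict, Tuple


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) as given in Table 7.
TABLE_7: Dict[Tuple[str, str], Tuple[float, float, float]] = {
    ("LITE", "named entity"): (58.6, 55.5, 57.0),
    ("MLMET", "named entity"): (58.3, 54.4, 56.3),
    ("LITE", "pronoun"): (51.2, 57.5, 54.2),
    ("MLMET", "pronoun"): (57.2, 50.0, 53.4),
    ("LITE", "nominal"): (45.3, 47.1, 46.2),
    ("MLMET", "nominal"): (49.5, 38.9, 43.5),
}

if __name__ == "__main__":
    for (model, category), (p, r, reported) in TABLE_7.items():
        print(f"{model:5s} {category:12s} recomputed={f1_from_pr(p, r):.2f} reported={reported}")
```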
Prediction on Different Categories of Entity Mentions. We also investigated the prediction of LITE on three different categories of entity mentions from the UFET test data: named entities, pronouns, and nominals. For each category of mentions, we randomly sample 100 instances; the performance comparison against MLMET is reported in Table 7.

According to the results, LITE consistently outperforms MLMET on all three categories of entities, and the improvement on nominal phrases (46.2% vs. 43.5% in F1) is the largest. This partly aligns with the capability of making inferences based on noun hypernyms, as discussed in the Case Study. Meanwhile, typing on nominals seems to be more challenging than on the other two categories of entities, which, from our observation, is mainly due to two reasons. First, nominal phrases with multiple words are more difficult to capture by the language model in general. Second, nominals are sometimes less concrete than pronouns and named entities, hence LITE also generates more abstract type labels. For example, LITE has labeled the drink in an instance as substance, which is too abstract and is not recognized by human annotators.

Time Efficiency. In general, LITE has a much lower training cost, of around 40 hours, than the previous strongest (data-augmentation-based) model MLMET, which requires over 180 hours, on the UFET task.5 During the inference step, it takes about 35 seconds per new sentence for our model to do inference with a fixed type vocabulary of over 10,000 different labels, while a common multi-way classifier merely requires around 0.2 seconds. Such a big difference in inference cost results from encoding longer texts and from encoding the same text multiple times.

5 All time estimations are given by experiments performed on a commodity server with a TITAN RTX. Training and evaluation batch sizes are maximized to 16 or 128 for LITE and MLMET, respectively.
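To put the inference figures above in perspective, the following back-of-the-envelope sketch converts them into a per-hypothesis cost. It assumes, for illustration, that inference time scales linearly with the number of candidate types and uses 10,000 as a lower bound for ‘‘over 10,000’’, so the per-pair number is an upper bound rather than a measured value.

```python
# Figures reported in the text; the linear-scaling reading is an assumption.
LITE_SECONDS_PER_SENTENCE = 35.0
CLASSIFIER_SECONDS_PER_SENTENCE = 0.2
NUM_TYPES_LOWER_BOUND = 10_000  # "over 10,000"; exact vocabulary size not restated here


def per_pair_ms(total_seconds: float, num_types: int) -> float:
    """Average time per premise-hypothesis pair, in milliseconds."""
    return total_seconds / num_types * 1000.0


def slowdown(nli_seconds: float, classifier_seconds: float) -> float:
    """How many times slower NLI-style inference is per sentence."""
    return nli_seconds / classifier_seconds


if __name__ == "__main__":
    ms = per_pair_ms(LITE_SECONDS_PER_SENTENCE, NUM_TYPES_LOWER_BOUND)
    ratio = slowdown(LITE_SECONDS_PER_SENTENCE, CLASSIFIER_SECONDS_PER_SENTENCE)
    print(f"at most about {ms:.1f} ms per (sentence, type) pair")
    print(f"about {ratio:.0f}x slower per sentence than the multi-way classifier")
```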
Inference can be accelerated by modifying the encoding model structure, which will be discussed in §5. However, LITE is much more efficient on dynamic type vocabulary. It requires almost no re-calculation when new, un-mappable labels are added to an existing type set, but multi-way classifiers need re-training with an extended classifier every time (e.g., over 180 hours by the previous SOTA).

5 Conclusion and Future Work

We propose a new model, LITE, that leverages indirect supervision from NLI to type entities in texts. Through template-based type hypothesis generation, LITE formulates the entity typing task as a language inference task, and meanwhile the semantically rich hypothesis remedies the data scarcity problem in the UFET benchmark. Additionally, the learning-to-rank objective further helps LITE with generalized prediction across benchmarks with disjoint type sets. Our experimental results illustrate that LITE promisingly offers SOTA on UFET, OntoNotes, and FIGER, and yields strong performance on zero-shot and few-shot types. LITE pretrained on UFET also yields strong transferability by outperforming SOTA baselines when directly making predictions on OntoNotes and FIGER.

For future research, as mentioned in §4.3, we first plan to investigate ways to accelerate LITE by utilizing a late-binding cross-encoder (Pang et al., 2020) for linear-complexity NLI, and incorporating high-dimensional indexing techniques like ball trees in inference. To be specific, the premise and hypotheses can first be encoded separately, and the resulting representations can later be used to evaluate the confidence score of premise-hypothesis representation pairs through a trained network.
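The late-binding idea described above can be sketched as follows: the premise is encoded once, each hypothesis is encoded on its own, and a light scoring function is applied to the pairs of representations. Everything here is a stand-in chosen for illustration: `toy_encode` is a hypothetical bag-of-tokens hashing encoder rather than a language model, and `pair_score` is a fixed dot product with a sigmoid rather than a trained network.

```python
import math
from typing import List, Tuple

DIM = 8  # size of the toy representation (arbitrary)


def toy_encode(text: str) -> List[float]:
    """Hypothetical stand-in encoder: bucket tokens into a normalised count vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[sum(ord(ch) for ch in token) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def pair_score(premise_vec: List[float], hypothesis_vec: List[float]) -> float:
    """Stand-in for the trained scoring network: sigmoid of a dot product."""
    dot = sum(p * h for p, h in zip(premise_vec, hypothesis_vec))
    return 1.0 / (1.0 + math.exp(-dot))


def rank_types(sentence: str, mention: str, labels: List[str]) -> List[Tuple[str, float]]:
    """Encode the premise once, encode each hypothesis separately, then score the pairs."""
    premise_vec = toy_encode(sentence)  # single pass over the (long) premise
    ranked = []
    for label in labels:
        hypothesis_vec = toy_encode(f"{mention} is a {label}")  # short text only
        ranked.append((label, pair_score(premise_vec, hypothesis_vec)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    for label, score in rank_types(
        "A drawing table is also sometimes called a mechanical desk.",
        "mechanical desk",
        ["furniture", "object", "person"],
    ):
        print(f"{label}: {score:.3f}")
```

The design point is that the long premise is no longer re-encoded jointly with every hypothesis; only short hypothesis texts and a cheap pairwise scorer scale with the size of the type vocabulary.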
With such acceleration, we expect little loss in performance, so LITE can still maintain its strong transferability and zero-shot prediction.

In addition, we plan to extend NLI-based indirect supervision to information extraction tasks such as relation extraction and event extraction. Incorporating abstention-awareness (Dhamija et al., 2018) for handling unknown types is another meaningful direction. Additionally, Poliak et al. (2018) recast diverse types of reasoning datasets including NER, relation extraction, and sentiment analysis into the NLI structure, which we plan to incorporate as extra indirect supervision for LITE to further enhance the robustness of entity typing.

Acknowledgments

The authors appreciate the reviewers and editors for their insightful comments and suggestions. The authors would also like to thank Hongliang Dai and Yangqiu Song from the Hong Kong University of Science and Technology for sharing the resources and implementation of MLMET, and thank Eunsol Choi from the University of Texas at Austin for sharing the full UFET distant supervision data. This material is partly supported by the National Science Foundation of United States grant IIS 2105329, and the DARPA MCS program under contract no. N660011924033 with the United States Office Of Naval Research.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Khyathi Raghavi Chandu, Yonatan Bisk, and Alan W. Black. 2021. Grounding ‘grounding’ in NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4283–4305, Online. Association for Computational Linguistics.

Jifan Chen, Eunsol Choi, and Greg Durrett. 2021. Can NLI models verify QA systems’ predictions? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3841–3854, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Tongfei Chen, Yunmo Chen, and Benjamin Van Durme. 2020a. Hierarchical entity typing via multi-level learning to rank. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8465–8475, Online. Association for Computational Linguistics.

Tongfei Chen, Yunmo Chen, and Benjamin Van Durme. 2020b. Hierarchical entity typing via multi-level learning to rank. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8465–8475, Online. Association for Computational Linguistics.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 87–96, Melbourne, Australia. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190, Berlin, Heidelberg. Springer Berlin Heidelberg.

Hongliang Dai, Donghong Du, Xin Li, and Yangqiu Song. 2019. Improving fine-grained entity typing with entity linking. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6210–6215, Hong Kong, China. Association for Computational Linguistics.

Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1790–1799, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Akshay Raj Dhamija, Manuel Günther, and Terrance E. Boult. 2018. Reducing network agnostophobia. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS), pages 9175–9186.

Ning Ding, Yulin Chen, Xu Han, Guangwei Xu, Pengjun Xie, Hai-Tao Zheng, Zhiyuan Liu, Juanzi Li, and Hong-Gee Kim. 2021. Prompt-learning for fine-grained entity typing. arXiv preprint arXiv:2108.10604.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.

Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. 2014. Context-dependent fine-grained entity type tagging. arXiv preprint arXiv:1412.1820.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yanfeng Hu, Xue Qiao, Luo Xing, and Chen Peng. 2020. Diversified semantic attention model for fine-grained entity typing. IEEE Access, 9:2251–2265.

Hailong Jin, Lei Hou, Juanzi Li, and Tiansi Dong. 2019. Fine-grained entity typing via hierarchical multi graph convolutional networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4969–4978, Hong Kong, China. Association for Computational Linguistics.

Mitchell Koch, John Gilmer, Stephen Soderland, and Daniel S. Weld. 2014. Type-aware distantly supervised relation extraction with linked arguments. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1891–1901, Doha, Qatar. Association for Computational Linguistics.

Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, 3:315–328. https://doi.org/10.1162/tacl_a_00141

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Twenty-Sixth AAAI Conference on Artificial Intelligence.

Qing Liu, Hongyu Lin, Xinyan Xiao, Xianpei Han, Le Sun, and Hua Wu. 2021. Fine-grained entity typing via label reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4611–4622, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.

Rasha Obeidat, Xiaoli Fern, Hamed Shahbazi, and Prasad Tadepalli. 2019. Description-based zero-shot fine-grained entity typing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 807–814, Minneapolis, Minnesota. Association for Computational Linguistics.

Yasumasa Onoe, Michael Boratko, Andrew McCallum, and Greg Durrett. 2021. Modeling fine-grained entity types with box embeddings. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2051–2064, Online. Association for Computational Linguistics.

Yasumasa Onoe and Greg Durrett. 2019. Learning to denoise distantly-labeled data for entity typing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2407–2417, Minneapolis, Minnesota. Association for Computational Linguistics.

Yasumasa Onoe and Greg Durrett. 2020. Interpretable entity representations through large-scale typing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 612–624, Online. Association for Computational Linguistics.

Shuai Pang, Jianqiang Ma, Zeyu Yan, Yang Zhang, and Jianping Shen. 2020. FASTMATCH: Accelerating the inference of BERT-based text matching. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6459–6469, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–81, Brussels, Belgium. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Xiang Ren, Wenqi He, Meng Qu, Clare R Voss, Heng Ji, and Jiawei Han. 2016. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1825–1834.

Jiaming Shen, Wenda Qiu, Yu Meng, Jingbo Shang, Xiang Ren, and Jiawei Han. 2021. TaxoClass: Hierarchical multi-label text classification using only class names. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4239–4249, Online. Association for Computational Linguistics.

Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. 2018. Probabilistic embedding of knowledge graphs with box lattice measures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 263–272, Melbourne, Australia. Association for Computational Linguistics.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2021. K-Adapter: Infusing knowledge into pre-trained models with adapters. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1405–1418, Online. Association for Computational Linguistics.

Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 996–1005, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Wenhan Xiong, Jiawei Wu, Deren Lei, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019. Imposing label-relational inductive bias for extremely fine-grained entity typing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 773–784, Minneapolis, Minnesota. Association for Computational Linguistics.

Peng Xu and Denilson Barbosa. 2018. Neural fine-grained entity type classification with hierarchy-aware loss. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 16–25, New Orleans, Louisiana. Association for Computational Linguistics.

Zenan Xu, Daya Guo, Duyu Tang, Qinliang Su, Linjun Shou, Ming Gong, Wanjun Zhong, Xiaojun Quan, Daxin Jiang, and Nan Duan. 2021. Syntax-enhanced pre-trained model. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5412–5422, Online. Association for Computational Linguistics.

Semih Yavuz, Izzeddin Gur, Yu Su, Mudhakar Srivatsa, and Xifeng Yan. 2016. Improving semantic parsing via answer type inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 149–159, Austin, Texas. Association for Computational Linguistics.

Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3914–3923, Hong Kong, China. Association for Computational Linguistics.

Wenpeng Yin, Nazneen Fatema Rajani, Dragomir Radev, Richard Socher, and Caiming Xiong. 2020. Universal natural language processing with limited annotations: Try few-shot textual entailment as a start. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8229–8239, Online. Association for Computational Linguistics.

Dani Yogatama, Daniel Gillick, and Nevena Lazic. 2015. Embedding methods for fine grained entity type classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 291–296, Beijing, China. Association for Computational Linguistics.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.