Predicting Document Coverage for Relation Extraction
Sneha Singhania, Simon Razniewski, Gerhard Weikum
Max Planck Institute for Informatics, Germany
{ssinghan,srazniew,weikum}@mpi-inf.mpg.de
Abstract
This paper presents a new task of predicting
the coverage of a text document for relation
extraction (RE): Does the document contain
many relational
tuples for a given entity?
Coverage predictions are useful in selecting
the best documents for knowledge base con-
struction with large input corpora. To study
this problem, we present a dataset of 31,366
diverse documents for 520 entities. We an-
alyze the correlation of document coverage
with features like length, entity mention fre-
quency, Alexa rank, language complexity, and
information retrieval scores. Each of these
features has only moderate predictive power.
We employ methods combining features with
statistical models like TF-IDF and language
models like BERT. The model combining fea-
tures and BERT, HERB, achieves an F1 score
of up to 46%. We demonstrate the utility of
coverage predictions on two use cases: KB
construction and claim refutation.
1 Introduction
Motivation and Problem Relation extraction
(RE) from text documents is an important NLP
task with a range of downstream applications (Han
et al., 2020). For these applications, it is vital
to understand the quality of RE results. While
extractors typically provide confidence (or preci-
sion) scores, this paper puts forward the notion
of RE coverage (or recall). Given an input doc-
ument and an RE method, coverage measures
the fraction of the extracted relations compared
to the complete ground-truth that holds in real-
ity. We consider this on a per-subject and per-
predicate basis—for example, all memberships
of Bill Gates in organizations or all companies
founded by Elon Musk.
Document coverage for RE varies widely. Con-
sider the three text snippets about Tesla as shown
in Figure 1. The first text contains all five founders
of Tesla, while the second text contains only two
of them, and the third has just one. Analogously,
for the entity Tesla and the relation founded-by, we
see that text 1 has coverage 1, text 2 has coverage
0.4, and text 3 has coverage 0.2.
When applying RE at scale, for example, to
populate or augment a knowledge base (KB), an
RE system may have to process a huge number
of input documents that differ widely in their
coverage. As state-of-the-art extractors are based
on heavy-duty neural networks (Lin et al., 2016;
Zhang et al., 2017; Soares et al., 2019; Yao et al.,
2019), processing all documents in a large cor-
pus may be prohibitively expensive and inefficient. Instead,
prioritizing the input documents by identifying
the best documents with high coverage could be
effective. This is why coverage prediction is cru-
cial for large-scale RE, but the problem has not
been explored so far.
This problem would be easy if we could first
run a neural RE system on each document and
then assess the yield, either by comparison to
withheld labeled data or by sampling followed by
human inspection. However, this is exactly the
computational bottleneck that we must avoid. The
challenge is to estimate document coverage, for
a given entity and relation of interest, with inex-
pensive and lightweight techniques for document
processing.
Approach and Contributions This paper pre-
sents the first systematic approach for analyzing
and predicting document coverage for relation
extraction. To facilitate extensive experimental
study on this novel task, we introduce a large
Document Coverage (DoCo) dataset of 31,366
web documents for 520 distinct entities spanning
8 relations, along with automated extractions and
coverage labels. Table 1 provides samples from
DoCo for each relation.
We employ a classifier architecture that we
call HERB (for Heuristics with BERT), based
on a document’s lightweight features and addi-
tionally incorporates pretrained language models
like BERT without any costly re-training and
fine-tuning. The best configuration of this
classifier achieves an F1-score of up to 46%.
The classifier provides scores for its predictions
and thus also supports ranking documents by their
expected yield for the RE task at hand.
Figure 1: Sample documents from our Document
Coverage (DoCo) dataset.
We evaluate our approach against a range of
state-of-the-art baselines. Our results show that
heuristic features like text length, entity men-
tion frequency, language complexity, Alexa rank,
or information retrieval scores have only mod-
erate predictive power. However, in combination
with pre-trained language models, the proposed
classifier gives useful predictions of document
coverage.
We further study the role of coverage prediction
in two extrinsic use cases: KB construction and
claim refutation. For KB construction, we show
that coverage estimates by HERB are effective
in ranking candidate documents and can substan-
tially reduce the number of web pages one needs
to process for building a reasonably complete KB.
For the task of claim refutation (e.g., Tim Cook
is the CEO of Microsoft), we show that cover-
age estimates for different documents can provide
counter-evidence that can help to invalidate false
statements obtained by RE systems.
The salient contributions of this work are:
1. We introduce the novel task of predicting
document information coverage for RE.
2. To support experimental comparisons, we
present a large dataset of annotated web
documents.
3. We propose a set of heuristics for coverage
estimation, analyze them in isolation and in
combination with an inexpensive standard
embedding-based document model.
4. We study the application of our classifier
on two important use cases: KB construction
and claim refutation. Experiments show that
our predictor is useful in both of these tasks.
Our data, models, and code are publicly available.1
2 Related Work
Relation Extraction (RE) RE is the task of
identifying the relation types between two entities
that are mentioned together in a sentence or in
proximity within a document (e.g., in the same
paragraph). RE has a long history in NLP re-
search (Mintz et al., 2009; Riedel et al., 2010),
with a recent overview given by Han et al. (2020).
State-of-the-art methods are based on deep neu-
ral networks trained via distant supervision (Lin
et al., 2016; Zhang et al., 2017; Soares et al.,
2019; Yao et al., 2019). On the practical side,
RE is available in several commercial APIs for
information extraction from text. In our exper-
iments, we make use of Rosette2 and Diffbot.3
Our approach is agnostic to the choice of extrac-
tors, though; any RE tool can be plugged in.
Knowledge Base Construction (KBC) RE
plays a crucial part in the more comprehensive
KBC task: identifying instances of entity pairs
that stand in a given relation in order to construct
a knowledge base (Mitchell et al., 2018; Weikum
et al., 2021; Hogan et al., 2021).
The input is typically a set of documents, often
assumed to be fixed and given upfront. This dis-
regards the critical issue of benefit/cost trade-offs,
which mandates identifying high-yield inputs for
resource-bounded KBC. Identifying relevant, ex-
pressive, and preferable sources for KBC is often
referred to as source discovery. Source discovery
can be performed via IR-style ranking of docu-
ments or can be based on heuristic estimators of
the yield of relation extractors (Wang et al., 2019;
Razniewski et al., 2019). The former work, in
1www.mpi-inf.mpg.de/document-coverage
-prediction.
2https://rosette.com/.
3https://www.diffbot.com/.
Entity | Relation | Content | Coverage
George W. Bush | family | President Bush grew up in Midland, Texas, as the eldest son of Barbara and George H.W. Bush . . . and met Laura Welch. They were married in 1977 . . . twin daughters: Barbara, married to Craig Coyne, and Jenna, married to Henry Hager. The Bushes also are the proud grandparents of Margaret Laura ''Mila'', Poppy Louise, and Henry Harold ''Hal'' Hager . . . | 1
FedEx | partner-org | FedEx Corp. . . . to acquire ShopRunner, the e-commerce . . . acquires the International Express business of Flying Cargo Group . . . acquires Manton Air-Sea Pty Ltd, a leading provider . . . acquires P2P Mailing Limited, a leading . . . acquires Northwest Research, a leader in inventory . . . acquires TNT Express . . . acquires GENCO . . . acquires Bongo International . . . acquires the Supaswift businesses in South Africa . . . acquires Rapidão Cometa . . . | 1
Warren Buffett | member-of | He formed Buffett Partnership Ltd. in 1956, and by 1965 he had assumed control of Berkshire Hathaway . . . Following Berkshire Hathaway's significant investment in Coca-Cola, Buffett became . . . director of Citigroup Global Markets Holdings, Graham Holdings Company and The Gillette Company . . . | 0.8
Indra Nooyi | edu-at | Nooyi was born in Chennai, India, and moved to the US in 1978 when she entered the Yale School of Management . . . secured her B.S. from Madras Christian College and her M.B.A. from Indian Institute of Management Calcutta . . . | 0.75
J. K. Rowling | position-held | Rowling is one of the best-selling authors today . . . job of a researcher and bilingual secretary for Amnesty International . . . position of a teacher led to her relocating to Portugal . . . | 0.67
Apple Inc. | founded-by | Steve Jobs, the co-founder of Apple Computers . . . switched over to managing the Apple ''Macintosh'' project that was started . . . | 0.33
Intel | board-member | Andy D. Bryant stepped down as chairman . . . Dr. Omar Ishrak to succeed . . . Alyssa Henry was elected to Intel's board. Her election marks the seventh new independent director . . . | 0.125
3M | ceo | The American multinational conglomerate corporation 3M was formerly known as Minnesota Mining and Manufacturing Company. It's based in the suburbs . . . | 0
Table 1: Sample entity-relation-document triples for all eight relations present in our DoCo dataset.
particular, approaches yield optimization as a set
coverage maximization problem through shared
properties of extracted entities. The latter uses
textual features in a supervised SVM or LSTM
model, a baseline with which we also compare in
our experiments.
Document Ranking in IR Information retrieval
(IR) ranks documents by relevance to a query
with keywords or telegraphic phrases. Relevance
judgments are based on the perception of infor-
mativeness concerning the query and its underly-
ing user intent. Standard metrics for assessment,
like precision, recall, and nDCG (Järvelin and
Kekäläinen, 2002), are not applicable to our set-
ting. The notion of coverage pursued in this pa-
per refers to the yield of structured outputs by
RE systems rather than document relevance. For
example, a query-topic-wise highly relevant doc-
ument that contains few extractable facts about
named entities would still have low RE coverage.
Relevance of Coverage Estimates Understand-
ing and incorporating document coverage pre-
diction into NLP-based information extraction
is essential for several reasons. For resource-
bounded KB construction, it is crucial to know
which documents are most promising for extrac-
tion with limited budgets for crawling and RE pro-
cessing and/or human annotation (Ipeirotis et al.,
2007; Wang et al., 2019). For claim refutation,
coverage estimates can help to assess statements
as questionable if documents with high coverage
do not support them. So far, claim evaluation sys-
tems mostly rely on textual cues about factual-
ity or source credibility (Nakashole and Mitchell,
2014; Rashkin et al., 2017; Thorne et al., 2018;
Chen et al., 2019).
For question answering over knowledge bases,
it is important to know whether a KB can be relied
upon in terms of complete answer sets (Darari
et al., 2013; Hopkinson et al., 2018; Arnaout
et al., 2021). Current coverage estimation tech-
niques for KBs do this analysis only post-hoc after
the KB is fully constructed (Galárraga et al., 2017;
Luggen et al., 2019), losing access to valuable
information from extraction time.
3 Coverage Prediction
We take an entity-centric perspective, and view
RE methods as functions mapping document-
entity-relation tuples onto the set of objects found
in the document. Formally, given a document d,
Figure 2: Dataset Construction Pipeline. There are two main phases: 1) corpus collection to create GTweb, and 2)
coverage calculation. Phase 1 involves: io) for each entity ei, n websites are collected using the Bing search API,
ii) text is scraped from each website, iii) RE tuples from documents are extracted via Rosette/Diffbot, and iv) RE
tuples are deduplicated and consolidated to form GTweb. The scraped documents are stored as inputs for phase 2
which consists of: io) for each document di, previously extracted relations are collected, and ii) based on the choice
of GT, coverage is calculated to create the final DoCo dataset.
an entity e, a relation r, a ground truth GT of ob-
jects that stand in relation r with e, and a relation
extraction method extr, the document coverage
of d for (e, r) applying extr is defined as:

    coverage_extr(d, e, r) = |extr(d, e, r) ∩ GT| / |GT|        (1)
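A minimal sketch of Equation (1), assuming the extractor output and the ground truth are available as plain sets of object names; the function and variable names here are illustrative, not part of our released code:

```python
def coverage(extracted_objects, gt_objects):
    """Equation (1): fraction of ground-truth objects recovered from the document."""
    gt = set(gt_objects)
    if not gt:
        return 0.0  # coverage is undefined for an empty ground truth; default to 0
    return len(set(extracted_objects) & gt) / len(gt)

# Text 2 of Figure 1 mentions two of Tesla's five founders (illustrative choice of two):
print(coverage(
    {"Elon Musk", "Martin Eberhard"},
    {"Elon Musk", "Martin Eberhard", "Marc Tarpenning", "JB Straubel", "Ian Wright"}
))  # 0.4
```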
The task thus takes the form of a classical pre-
diction problem, either as a numerical coverage
value or binarized class label. In Section 5, we
propose several heuristics and methods that can
be used to predict coverage for a given document.
To study this novel problem, we require evalua-
tion data. The following section thus deals with
the generation of a large and diverse document
coverage dataset.
4 Dataset Construction
A thorough study of document coverage predic-
tion requires a corpus with two characteristics:
(io) relation diversity (cioè., documents containing
enough automatically extractable relations) E
(ii) content diversity (i.e., multiple documents
with varying content per entity). Existing text cor-
pora, like the popular NYT (Sandhaus, 2008)
and Newsroom dataset (Grusky et al., 2018),
contain ample numbers of articles that mention
newsworthy entities; however, the articles are pri-
marily short, mentioning only very few relations.
On the other end, machine-translated multilin-
gual versions of Wikipedia articles (Roy et al.,
2020) allow extraction of many relations but
lack diversity.
For the novel
task of predicting document
information coverage, we thus built the DoCo
(Document Coverage) dataset, consisting of
31,366 web documents for 520 distinct entities,
each with its coverage value. Figure 2 illustrates
the dataset construction.
Entity Selection First, well-known entities of
two types, person (PER) and organization (ORG),
were selected from popular ranking lists by Time
1004 and Forbes5,6 (‘‘Influential people around
the globe’’, ‘‘Most valuable tech companies’’).
These entities covered 12 diverse sub-domains, in-
cluding politicians, entrepreneurs, singers, sports
figures, writers, and actors, for PER, and tech-
nology, automobile, retail, conglomerate, pharma-
ceuticals, and financial corporations, for ORG.
Popular and long-tail entities for PER, companies
4https://time.com/collection/100-most
-influential-people-2020/.
5https://forbes.com/forbes-400/.
6https://forbes.com/lists/global2000
/#9a993675ac04.
across demographics and with differing net worth
for ORG, were chosen to further obtain documents
with varying content.
Relation | Wikidata Property
member-of | member of (P463), member of political party (P102), part of (P361), employer (P108), owner of (P1830), record label (P264), member of sports team (P54)
family | father (P22), mother (P25), spouse (P26), child (P40), stepparent (P3448), sibling (P3373)
edu-at | educated at (P69)
position-held | position held (P39), occupation (P106)
partner-org | owner of (P1830), owned by (P127), member of (P463), parent organization (P749), subsidiary (P355)
founded-by | founded by (P112)
ceo | chief executive officer (P169)
board-member | board member (P3320)
Table 2: Wikidata property names and identifiers
used to create GTwiki.
Ground Truth We considered three ground-truth
labels to calculate coverage for each document:
1. Wikidata (GTwiki): A popular KB providing
data for most relations yet having coverage
limitations (Galárraga et al., 2017; Luggen
et al., 2019). For example, for Bill Gates,
Microsoft and other popularly associated
companies for the member-of relation are
present, but niche entities like Honeywell
are missing. Depending on the entity type
and sub-domain, we created the ground-truth
labels by choosing those Wikidata proper-
ties that best matched the semantics of the
8 selected relations. Table 2 provides the
complete information.
2. Web Extractions (GTweb): We used the set
of frequent extractions across all the docu-
ments in DoCo as web-aggregated ground
truth. For a given entity-relation (e, r), an
extraction was determined frequent if it ap-
peared in at least 5% of total documents
corresponding to e, or if its count was no less
than 5 times the highest counted tuple for
(e, r). Deciding frequent extractions relative
to total document count and other tuples’ fre-
quencies for an entity resulted in noise-free
ground-truth labels.
3. Wikidata and Web Extractions (GTwikiweb):
We merged both previous variants using set
union operation and phrase embeddings with
cosine similarity for higher recall.
Websites and Content We aimed to collect 100
diverse URLs per entity by issuing a set of
search engine queries per entity, for example,
‘‘about PER’’, ‘‘PER biography’’, ‘‘ORG his-
tory''. A total of 6 sets of queries for PER and 10
for ORG were designed. Since the URLs returned
over the set of queries were not always unique, we
retained the duplicated URL only once.
Extracting textual content without noisy
headers, menus, and comments required a
labor-intensive scraping step. We handled the
multi-domain content scraping task through a
combination of libraries like Newspaper3k,7
Readability,8 and online scraping services like
Import.io9 and ParseHub.10 We ensured high-
quality scraped content by applying rule-based
filters to remove noisy elements like embedded
ads and reference links. The scraped documents
covered a range of website domains, including
biographical sites, news articles, official company
profiles, newsletters, and so on.
Relation Tuples Each document in DoCo was
processed by two relation extraction APIs, Rosette
and Diffbot. To annotate each document with
coverage, we focused only on the entity queried
initially to obtain the document. For our experi-
mental study, we selected the following frequently
occurring relations: member-of, family, edu-at,
and position-held, for PER, and partner-org,
founded-by, ceo, and board-member, for ORG.
For more accurate coverage calculation, the RE
tuples were deduplicated, for example, (Gates,
member-of, Microsoft Corp.) would become (Bill
Gates, member-of, Microsoft), via alignment to
Wikidata identifiers returned by the APIs.
The relations extracted by the APIs are fine-
grained like person-member-of, person-employee-
of, org-acquired-by, and org-subsidiary-of. We
combined the first two as member-of for PER
and the last two as partner-org for ORG as
coarse-grained relations.
7https://newspaper.readthedocs.io/en
/latest/.
8https://pypi.org/project/readability
-lxml/.
9https://www.import.io/.
10https://www.parsehub.com/.
# PER entities: 250
# ORG entities: 270
# Relations: 8
# Documents: 31,366
Doc. length range (words): [20, 10906]
# Unique website domains: 600
# Doc. with non-zero RE tuples: 26,956
# Doc. with non-zero coverage: 14,086
# Doc. in class informative: 7,103 (22.6%)
Table 3: Characteristics of the DoCo dataset.
Relation | GTwiki | GTweb | GTwikiweb
member-of | 3.61 | 6.51 | 7.12
family | 2.21 | 4.0 | 4.41
edu-at | 2.26 | 2.07 | 2.58
position-held | 5.86 | 7.76 | 10.37
partner-org | 6.16 | 4.26 | 3.12
founded-by | 1.07 | 1.06 | 1.66
ceo | 1.03 | 2.77 | 2.86
board-member | 0.47 | 1.44 | 1.75
Table 4: Average number of objects per entity.
Coverage Calculation Coverage was computed
on a per entity-relation-document basis using
Equation (1). Even though real-valued coverage
values are computed while constructing the data-
set, it is often not possible to give nuanced pre-
dictions at test time. Consider the text ‘‘. . . Musk
is a co-founder of Tesla …’’. The term co-founder
clearly indicates the presence of multiple found-
ers; however, the context does not provide any
clue on the total number of co-founders. For exam-
ple, there could be one other co-founder (coverage
0.5) O 9 other co-founders (coverage 0.1).
Coverage Binarization We binarized the cov-
erage values to circumvent this problem, splitting
documents into two classes: informative and un-
informative. The binarization method comprised
an absolute and a relative threshold: A document
was labeled as informative or 1 if its coverage
was greater than 0.5, or greater than the coverage
of at least 85% of documents for the same (e, r);
otherwise, it was labeled as uninformative or 0.
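A minimal sketch of this binarization rule, assuming `cov` maps document ids to real-valued coverage values for one (e, r) pair; the relative criterion is approximated here by the 85th percentile:

```python
import numpy as np

def binarize(cov, abs_threshold=0.5, rel_percentile=85):
    """Label a document 1 (informative) if its coverage exceeds 0.5 or exceeds
    the coverage of at least 85% of documents for the same (e, r), else 0."""
    rel_threshold = np.percentile(list(cov.values()), rel_percentile)
    return {doc: int(c > abs_threshold or c > rel_threshold)
            for doc, c in cov.items()}
```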
Dataset Characteristics After filtering dupli-
cates, irrelevant URLs like social media handles,
and video-content websites, we obtained a total
of 31,366 documents for 520 entities. Table 3
provides an overview of the DoCo dataset. We
can see that DoCo's labels are imbalanced, as
only 22.6% of the documents are informative and
77.4% are uninformative. The count of documents
with non-zero RE tuples is higher than those with
non-zero coverage since the RE tuples were not
always related to the subject entity, hence irrel-
evant towards coverage calculation.
Table 4 gives the average number of objects
present in each ground truth variant. On average
Relation | Human | Diffbot | Rosette | GTwiki | GTweb | GTwikiweb
member-of | 4.36 | 3.66 | 5.04 | 4.54 | 7.12 | 9.22
family | 4.74 | 3.82 | 0.66 | 1.76 | 5.78 | 6.64
edu-at | 1.72 | 2.5 | 2.52 | 2.94 | 3.08 | 2.18
position-held | 2.9 | 4.26 | – | 6.7 | 6.14 | 9.52
partner-org | 3.7 | 0.72 | 2.26 | 0.8 | 5.04 | 5.92
founded-by | 1.34 | 0.58 | 1.8 | 0.78 | 2.84 | 2.96
ceo | 2.02 | 1.96 | – | 1.68 | 4.32 | 4.2
board-member | 2.62 | 1.54 | – | 2.82 | 3.48 | 2.64
Table 5: Average tuple count per relation. The RE
tool with higher tuple count (boldfaced) is chosen
for each relation.
across relations, the number of objects in GTweb
is higher than those in GTwiki by 23.7%, and
GTwikiweb is higher than those in GTwiki by 28.8%.
This implies that GTweb and GTwiki can have over-
lapping objects, and GTweb might contain extra
objects towards GTwikiweb creation.
Dataset Quality We analyzed the quality of the
DoCo dataset by comparing automatic relation
extractions to extractions given by human anno-
tators. A sample of 400 documents was selected,
50 per relation, with half from the high-coverage
range and the rest from the low-coverage range.
Each document was annotated with all correct
tuples for the document’s main subject entity.
Table 5 shows the observed averaged counts.
We note that the human annotators extracted a
substantial number of tuples for all 8 relations,
indicating the richness and breadth of the DoCo
documents. The two automatic extractors mostly
yielded smaller numbers of tuples, with a few
exceptions. These exceptions include spurious
tuples, though. The ground-truth variants con-
sistently suggest higher numbers, but except for
the conservative GTwiki, these are usually over-
estimates due to spurious tuples. The GT variants
should thus be seen as upper bounds for the true
RE coverage.
We analyzed how well the automatic anno-
tations reflect human annotations’ coverage by
computing Pearson correlation coefficients for the
entire set of 400 sample documents. For a re-
lation, the RE tool with higher averaged count
was chosen for our experiments, and the corre-
lation for (Human, RE) is 0.68. This shows that
optimizing for coverage by automatic RE tools
is highly correlated with the overarching goal of
approximating human-quality outputs.
5 Approach
We aim to model coverage prediction by process-
ing unstructured document text by inexpensive
lightweight techniques. This is crucial for identi-
fying promising documents before embarking on
heavy-duty RE.
Heuristics We devise several simple heuristics
involving textual features for document coverage;
a brief sketch of a few of them follows the list.
1. Document Length: The length of a document
is a proxy for the amount of information
contained. Longer documents may express
more relations.
2. NER Frequency: Length can be misleading
when a document is verbose, yet uninfor-
mative. The count of named-entity mentions
matching the relation domain (e.g., persons
for the relation family, or organizations for
the relation member-of ) could correlate with
coverage.
3. Entity Saliency: The more frequently an en-
tity is mentioned in a document, the more
likely the document expresses relations for
that entity.
4. IR-Relevance Signals: The surface similarity
of the entire document with the input query
is another cue. We adopt BM25 (Robertson
et al., 1995), a classical and still power-
ful IR model for ranking documents, using
⟨e⟩ + ⟨r⟩ as query, where e and r are the tar-
get entity and relation, respectively. Recent
advances on neural rankers are also considered
(Nogueira and Cho, 2020). We fol-
low Nogueira et al. (2020) and use the T5
sequence-to-sequence model (Raffel et al., 2020)
to rank documents.
5. Website Popularity: Popular websites may be
visited often because they are more informa-
tive. We use the standard Alexa rank11 as a
measure of popularity.
6. Text Complexity: RE methods are effective
on simpler text, and may not be able to
effectively extract relations from documents
written in complex prose. We use the Flesch
score (Flesch and Gould, 1949), a popular
text readability measure.
7. Random: We contrast the predictive power
of our proposed methods with two random
baselines: A fair coin, and a biased coin main-
taining the label imbalance in our test set.
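A minimal sketch of three of the above heuristics (document length, entity saliency, and BM25 relevance), using the rank-bm25 package referenced in Section 6; matching the entity by plain string counting is a simplification:

```python
from rank_bm25 import BM25Okapi

def document_length(doc: str) -> int:
    return len(doc.split())                      # heuristic 1: word count

def entity_saliency(doc: str, entity: str) -> int:
    return doc.lower().count(entity.lower())     # heuristic 3: mention frequency

def bm25_scores(docs, entity, relation):
    # heuristic 4: BM25 relevance of every document to the query "<entity> <relation>"
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    return bm25.get_scores(f"{entity} {relation}".lower().split())
```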
Methods We use several inexpensive statistical
models for document representation and feed them
to a logistic regression classifier; a brief sketch of
the TF-IDF variants follows the next three items.
8. Latent Topic Modeling: Topics in a document
could be a useful indicator of coverage. For
example, for relation family, latent topics
like ancestry or personal life are relevant.
We use Latent Dirichlet Allocation (LDA)
(Blei et al., 2003) to model documents as
distributional vectors.
9. BOW+TFIDF: A simple yet effective statis-
tic to measure word importance given a
document in a corpus is the product of term
frequency and inverse document frequency
(TF-IDF). We vectorize a document into a
Bag-of-Words (BOW) representation with
TF-IDF weights.
10. Ngrams+TFIDF: A document is vectorized
using frequent n-grams (n ≤ 3) with TF-IDF
weights.
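A minimal sketch of the BOW+TFIDF and Ngrams+TFIDF baselines with scikit-learn; the hyperparameters shown are illustrative defaults, not the exact settings used in our experiments:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

bow_tfidf = make_pipeline(
    TfidfVectorizer(),                    # unigram bag-of-words with TF-IDF weights
    LogisticRegression(max_iter=1000))
ngrams_tfidf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),  # n-grams with n <= 3, TF-IDF weighted
    LogisticRegression(max_iter=1000))

# train_docs (list of strings) and train_labels (0/1) are assumed to be prepared:
# bow_tfidf.fit(train_docs, train_labels); predictions = bow_tfidf.predict(test_docs)
```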
We employ two neural baselines: an LSTM and
a pre-trained language model (BERT).
11. LSTM: Previous work by Razniewski et al.
(2019) used textual features to estimate the
presence of a complete set of objects in a text
segment. We adopt their architecture, rep-
resenting documents using 100 dimensional
11www.alexa.com.
Figure 3: Architecture for Heuristics combined with
TF-IDF (Heu+TFIDF).
GloVe embeddings (Pennington et al., 2014),
and processing them in LSTM (Hochreiter
and Schmidhuber, 1997), followed by a
feed-forward layer with ReLU activation
before the classifier.
12. Language Model (BERT): Without costly
re-training or fine-tuning, we utilize pre-
trained BERT embeddings (Devlin et al.,
2019) in a feature-based approach by extract-
ing activations from the last four hidden
layers. As in the original work, these contex-
tual embeddings are fed to a two-layer 768-
dimensional BiLSTM before the classifier.
Our experiments (Sec. 6.2) reveal that each of our
proposed heuristics has only moderate predic-
tive power. We therefore formulate a lightweight
classifier to combine heuristics with the best per-
forming statistical model (TF-IDF), or language
modello (BERT).
13. Heuristics with BOW+TFIDF (Heu+TFIDF):
We combine TF-IDF with heuristics (one to
six) using stacked Logistic Regression (LR)
(Figure 3). In level 1, the TF-IDF vector and
each individual heuristic are fed to separate
LR classifiers. In level 2, all the outputs of
level 1 LRs are concatenated and fed to a
final LR classifier for coverage prediction.
The entire model is jointly trained.
14. Heuristics with BERT (HERB): We com-
bine BERT with heuristics (one to six) in a
two-step process (Figure 4; sketched below).
In the first step,
we reuse the BERT model above (with no ad-
ditional training or fine-tuning) for coverage
prediction. This prediction is then concate-
nated with heuristics to form a single vector,
which is fed to a LR classifier.
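A minimal sketch of HERB's second step, assuming the frozen BERT-based classifier's prediction score and the six heuristic values have already been computed per document; feature ordering and variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def herb_features(bert_score, heuristics):
    # heuristics: [doc_length, ner_count, entity_saliency, bm25, alexa_rank, flesch_score]
    return np.concatenate(([bert_score], heuristics))

# X = np.vstack([herb_features(s, h) for s, h in zip(bert_scores, heuristic_values)])
# herb = LogisticRegression().fit(X, train_labels)   # final coverage classifier
```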
Figure 4: Architecture for Heuristics combined with
BERT Prediction (HERB).
6 Experiments
6.1 Setup
Dataset We considered two automatic RE tools,
Rosette and Diffbot, extr ∈ {Rosette, Diffbot},
and three ground truth variants: GTwiki, GTweb,
GTwikiweb. For each relation, we report on the
combination of RE tool and GT variant that
achieves the highest count of documents classi-
fied as high-coverage.
Each relation had a separate labeled set of doc-
uments, split into 70% train, 10% validation and
20% test. Information leakage was prevented by
splitting along entities, i.e., all documents on the
same entity would exclusively be in one of train,
validation or test set. The number of training sam-
ples per relation varies from 664 (board-member)
A 3604 (position-held). Since the label distribu-
tion in DoCo is imbalanced, the uninformative
(O 0) class in all train datasets were undersam-
pled to obtain a 50:50 distribution, while the val-
idation and test datasets were kept unchanged to
reflect the real-world imbalance. Named entities
and numbers were masked.
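A minimal sketch of the entity-level split and the undersampling of the majority class, assuming a pandas DataFrame `df` with one row per document and columns "entity" and "label"; the 70/10/20 proportions follow the description above:

```python
import numpy as np
import pandas as pd

def split_by_entity(df, seed=0):
    entities = df["entity"].unique()
    rng = np.random.default_rng(seed)
    rng.shuffle(entities)
    n = len(entities)
    train_e = set(entities[:int(0.7 * n)])
    val_e = set(entities[int(0.7 * n):int(0.8 * n)])
    train = df[df["entity"].isin(train_e)]
    val = df[df["entity"].isin(val_e)]
    test = df[~df["entity"].isin(train_e | val_e)]
    # undersample the uninformative (majority) class in the training split only
    pos, neg = train[train["label"] == 1], train[train["label"] == 0]
    train = pd.concat([pos, neg.sample(n=len(pos), random_state=seed)])
    return train, val, test
```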
Models Each proposed heuristic was turned into
a classifier by first ranking documents according
to the heuristic, and then labeling the top 50%
documents as class 1 or informative. We used the
Okapi BM2512 and monoT513 open-source imple-
mentations for IR ranking. The monoT5 model
is generally used for passage ranking, and as
DoCo documents are much longer with multiple
passages, we used the MaxP algorithm (Dai and
Callan, 2019) to compute the document rank-
ing. Since the difference in performance between
T5 and BM25 models is negligible, we chose
the simpler yet equally effective BM25 model as
IR-relevance signal for HERB.
12https://pypi.org/project/rank-bm25/.
13https://github.com/castorini/pygaggle.
Method | member-of | family | edu-at | position-held | partner-org | founded-by | ceo | board-member | Avg.
(PER relations: member-of, family, edu-at, position-held; ORG relations: partner-org, founded-by, ceo, board-member)
Random (biased) | 5.7 | 6.8 | 4.9 | 10.0 | 7.5 | 1.2 | 13.5 | 3.7 | 6.6
Random (fair) | 15.7 | 11.1 | 12.6 | 15.4 | 15.2 | 8.9 | 21.3 | 7.2 | 13.4
Text Complexity | 9.6 | 5.4 | 6.1 | 10.3 | 3.5 | 3.3 | 15 | 5.4 | 7.3
Alexa Ranking | 12.6 | 9.8 | 8.1 | 12.4 | 16.7 | 11.3 | 24.8 | 7.3 | 12.9
Entity Saliency | 17.8 | 14.3 | 11.9 | 18.2 | 14.7 | 8.4 | 24.6 | 7.1 | 14.6
Document Length | 20.5 | 19.0 | 15.5 | 21.9 | 23.9 | 12.8 | 28.8 | 8.5 | 18.9
NER Count | 24.3 | 19.8 | 18.2 | – | 21.1 | 13.7 | 34.5 | 11.8 | 20.5
BM25 IR | 27.1 | 21.1 | 18.8 | 26.3 | 21.8 | 12.9 | 36.6 | 12.1 | 22.1
T5 IR | 26.9 | 23.2 | 20.3 | 29.6 | 19.5 | 15.4 | 41.1 | 13.1 | 23.6
LDA Topic Model | 19.3 | 19.0 | 14.5 | 21.1 | 15.7 | 8.6 | 25.2 | 11.5 | 16.9
GloVe+LSTM | 16.5 | 28.6 | 19.8 | 32.9 | 24.2 | 19.5 | 24.4 | 4.9 | 21.3
Ngrams+TFIDF | 36.2 | 40.0 | 25.6 | 40.2 | 18.6 | 25.5 | 41.8 | 30.2 | 32.3
BOW+TFIDF | 36.0 | 41.0 | 29.2 | 42.1 | 17.2 | 28.3 | 40.6 | 32.1 | 33.3
BERT | 40.4 | 39.7 | 35.7 | 44.4 | 22.0 | 30.8 | 43.0 | 33.8 | 36.2
Heu+TFIDF | 41.9 | 43.5 | 31.3 | 36.5 | 35.1 | 28.2 | 41.4 | 22.0 | 35.0
HERB | 44.2 | 41.7 | 40.5 | 45.6 | 28.8 | 32.5 | 46.2 | 34.8 | 39.3
Table 6: F1-scores (%) obtained on the coverage prediction task by various heuristics and methods.
Feature-based methods, including topic model-
ing with LDA, TF-IDF, and n-grams, were fed to
a Logistic Regression classifier. In the LSTM
architecture, we used 100 dimensional GloVe
embeddings with a vocabulary size of 100,000,
and a 100 dimensional hidden state for LSTM.
For pre-trained language models, we used the
BERT-base-uncased14 model (without additional
retraining or fine-tuning) to encode sentences, by
summing the [CLS] token’s representation from
the last four hidden layers. Input documents were
padded or truncated to 650 sentences, and rep-
resented through sentence encodings. Coverage
classification was performed using the feature-
based approach outlined in Devlin et al. (2019).
We constructed mini-batches of size 32, used
the Adam optimizer initialized with a constant
learning rate of 1e-05 and 1e-09 epsilon value,
and trained for 200 epochs. Because our dataset is
imbalanced, we monitored validation precision to
save the best model, and report optimal F1-scores
(Lipton et al., 2014) to compare results.
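A minimal sketch of the sentence encoding described above, using the Hugging Face transformers library with a frozen bert-base-uncased model; sentence splitting and the downstream BiLSTM classifier are omitted:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def encode_sentence(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states   # embedding layer + 12 layers
    # sum the [CLS] vector (position 0) over the last four hidden layers
    return torch.stack([h[0, 0] for h in hidden_states[-4:]]).sum(dim=0)  # shape (768,)
```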
6.2 Results
Our results are shown in Table 6. Each heuristic
gives a mediocre performance, with T5 IR achiev-
ing the highest average F1 of 23.6 among the
heuristics. In the trained group of models, LDA
has the lowest average F1 of 16.9, while BERT
performs the best with an average F1 of 36.2.
14https://huggingface.co/bert-base-uncased.
Although each heuristic has moderate predic-
tive power, combining them with statistical mod-
els like TF-IDF, or pre-trained language models
like BERT, gives the best performance. Among
the combination models, HERB outperforms
Heu+TFIDF in a clear majority of relations.
Model Analysis Statistical models like BOW+
TFIDF and Ngrams+TFIDF performed compara-
bly to BERT for a minority of relations. To better
understand these models, we analyzed highly pos-
itive and negative features. Tavolo 7 provides note-
worthy examples. We observe the presence of
semantically relevant phrases. We also inspect the
weights of the trained LR classifier of HERB.
Across relations, BERT had the highest average
weight (5.05), followed by BM25 (2.56), while
NER Count had the lowest weight (0.07).
Feature Ablations We further perform an abla-
tion analysis, with Table 8 showing the average
F1-scores when individual heuristics are removed
from HERB. Removing either BM25 or Text
Complexity leads to a significant drop in perfor-
mance, indicating that other heuristics or BERT
do not capture these features well.
Human Performance Finally, we compare the
results against human performance on identifying
high-coverage documents. For each relation, 10
randomly sampled test documents were labeled
as informative or uninformative for RE solely by
reading the document. Averaged over all relations,
Relation | Important Phrases
member-of | [org], is part of, ambassador, is associated with, [org] partner
family | [person], married, father, wife, children, daughter, parents, [number]
edu-at | [org], graduated, degree, studied, [org] in [number], is part of
position-held | [person], leader, president, actor, professor, writer, founder, police, portman
partner-org | [org], [number] [org], subsidiary, merger, the company, member of
founded-by | [person], founder, director, executive, chairman, co founder, head of, chief executive
ceo | ceo, [person] director, chief, officer, founders, chief executive officer, president
board-member | [org], [person], chairman, executive, board of directors, [number] senior executive, officer in charge, representative director
Table 7: Highly weighted phrases given by the trained LR classifier of Ngrams+TFIDF and
BOW+TFIDF.
HERB | 39.3%
– Doc. Length | 36.8% (–2.44)
– Entity Saliency | 36.4% (–2.85)
– Alexa Ranking | 36.3% (–3.03)
– NER Count | 36.2% (–3.11)
– BM25 | 36.0% (–3.29)
– Text Complexity | 35.7% (–3.62)
Table 8: Average F1 performance with
feature ablations. Text Complexity and
BM25 are most important.
humans obtained an F1 score of 70.42%, com-
pared with HERB predictions reaching an average
F1 of 39.3%, and all baselines were signifi-
cantly inferior. The large gap between humans
and learned predictors shows the hardness of the
coverage prediction task and underlines the need
for the presented research.
7 Analysis and Discussion
Domain Dependency To investigate how
strongly prediction depends on in-domain training
data, we performed a stress test, where the train,
validation, and test sets were split along domains
(e.g., singers vs. entrepreneurs vs. politicians).
Table 9 shows the resulting F1-scores (%). For
HERB, the average F1-score on the in-domain
test set is 34.3%, while on the out-of-domain test
set is 34.2%—that is, there is no notable drop for
the challenging domain-transfer case. We observe
a minor drop for larger relations, while even
increases are visible for the smallest two relations.
This suggests that HERB learned generalizable
features that are beneficial across domains.
Evaluation of Document Ranking So far, we
have evaluated our methods on a binary pre-
diction problem. Tuttavia, use cases frequently
require a ranking capability (see also Sec. 8). We
additionally evaluate our methods on a ranking
task, where documents are ranked by the score of
positive predictions.
We use the mean Normalized Discounted
Cumulative Gain (mean nDCG) (Järvelin and
Kekäläinen, 2002) as the evaluation metric. A
similar performance trend to the F1 metric is ob-
served among our methods. HERB performs the
best with an average nDCG score of 0.45 across
relations, while BERT and Heu+TFIDF have 0.44
and 0.43, respectively.
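For reference, a minimal sketch of the nDCG computation over one ranking, where `relevances` are the binary coverage labels of the documents in the order produced by a method:

```python
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    return float(np.sum(relevances / np.log2(np.arange(2, len(relevances) + 2))))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([1, 0, 1, 0, 0]))  # one relevant document at rank 1, another at rank 3
```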
RE Limitations The performance of RE meth-
ods significantly impacts the quality of GTWeb as
well as the RE coverage of documents. Although
we used state-of-the-art commercial APIs, these
nonetheless struggle on open web documents. To
illustrate this, we randomly sampled 40 docu-
ments from DoCo and compared the count of
RE tuples returned by Diffbot/Rosette against the
count by a human relation extractor. Diffbot re-
turned 60.6% fewer relational tuples, and Rosette
returned 72.3% fewer, suggesting the need for
further improvement of RE methods.
Error Analysis We analyzed the incorrect pre-
dictions by HERB and categorized the errors. For
each relation, we randomly sampled 10 incorrectly
predicted documents, 5 false positives and 5 false
negatives. Out of the total 80 samples, 63.75%
of documents contained partial information for
the chosen relation; on 15% of documents the
IE methods failed to extract all the necessary RE
Setting | member-of | family | edu-at | position-held | partner-org | founded-by | ceo | board-member | Avg.
HERB (in-domain) | 40.8 | 41.8 | 34.9 | 42.8 | 28.4 | 17.1 | 45.4 | 23.3 | 34.3
HERB (out-of-domain) | 35.7 | 39.7 | 32.5 | 39.1 | 29.4 | 23.8 | 42.3 | 31.1 | 34.2
Training Data Size | 2194 | 1650 | 1458 | 2940 | 1124 | 828 | 2058 | 608 | –
Table 9: Comparison of F1-scores (%) of HERB on the in-domain and out-of-domain test set.
tuples; the ground truth for 3.75% of documents
had an incomplete set of objects; 3.75% docu-
ments had noisy content; and 2.5% of documents had
incomplete information due to failure of scraping
methods on complex website layouts.
Multiple documents in the low-information
category contained speculative content—for ex-
ample, considerations about candidates for a new
appointment as a board member or CEO. In other
cases, the document would mention the increased
count of board members, but not their names.
A few documents also had partial information
leading to false positives—for example, a docu-
ment partially talking about the footballer Sergio
Ag¨uero for the family relation was incorrectly
classified as informative; as it also contained a
complete family history about another footballer,
Diego Maradona (Sergio’s father-in-law).
Conversely, documents may contain informa-
tion relevant to a relation without actual mention
of the relation, which leads to false negatives.
Per esempio, a document on the LinkedIn Cor-
poration stating ‘‘ . . . Weiner stepped down from
LinkedIn . . . He named Ryan Roslansky as his re-
placement.’’ was labeled uninformative for the ceo
relation. Although Ryan Roslansky and LinkedIn
are related through the ceo relation, the implicit
statement was not noticed by HERB.
We specifically inspected the IR baselines’
performance to understand better why these are
mediocre predictors at best. The IR signals about
entire documents merely reflect that a document is
on the proper topic given by the query entity, but
that does not necessarily imply that the document
contains many relational facts about the target
entity. For RE coverage, IR-style document-
query relevance is a necessary cue but not a
sufficient criterion.
Efficiency and Scalability We measured the
run-time of HERB against a state-of-the-art neural
model for document-level RE (DocRED) (Yao
et al., 2019). Based on the DocRED leaderboard,15
we selected the currently best open-source method:
the Transformer-based Structured Self-Attention
Network (SSAN) (Xu et al., 2021).
A sample of 100 documents from DoCo was
given to both HERB and SSAN and processed as
follows. For HERB, features are computed utiliz-
ing BERT, followed by coverage prediction. For
SSAN, documents first need to be pre-processed
to construct the necessary DocRED representa-
tion. This includes named entity recognition and
pair-wise co-reference resolution, using Stanza16
to properly group same-entity occurrences.
The measurements show the following. HERB
takes about 2 seconds, on average, to pro-
cess one document, whereas SSAN requires 13.6
seconds—a factor of 6.8 higher in run-time and
resource consumption. The difference becomes
even more prominent for very long documents
with many named entity mentions. HERB’s
run-time grows linearly with document length,
while SSAN’s run-time exhibits quadratic growth
with the number of entity mentions.
This quadratic complexity of full-fledged neu-
ral RE has inherent reasons (as stated in Yao et al.,
2019). Document-level relation extraction gener-
ally requires computations for all possible pairs
of entity mentions. The neural RE methods need
to have the positions of candidate entity pairs
as input, which necessitates considering all pairs
of mentions.
8 Applications
To demonstrate the importance of coverage pre-
diction, we evaluated its utility in two use cases,
knowledge base construction and claim refuta-
tion. For the former, we discuss the importance of
ranking documents by RE coverage (Section 8.1)
15https://competitions.codalab.org/competitions
/20717#results.
16https://stanfordnlp.github.io/stanza/.
and a practically relevant setting where RE is
constrained by resource budgets (Section 8.2).
8.1 Document Ranking for Relation
Extraction
Relation extraction plays a pivotal role in KB
construction. We show the relevance of cover-
age estimates for prioritizing among documents.
Entities from our test dataset serve as subjects for
RE. We select top k documents from the test data-
set corpus by four different techniques. We com-
pare the performance of each method by the total
number of extracted RE tuples per subject and
compute recall w.r.t. the Wikidata ground-truth.
1. Random: A random sample of documents.
Figure 5: Total yield (top) and precision (bottom)
of KBC based on different ranking methods for
documents.
2. IR-Relevance: Using BM25 to identify the
most relevant documents.
Method | RE Count | #Docs Processed
SSAN | 59 | 410
HERB+SSAN | 96 | 318
Table 10: Relation extraction under run-time
constraint.
3. Coverage Prediction: HERB’s predictions to
rank documents.
4. Coverage Oracle: Selecting documents by
their ground-truth labels from DoCo. Questo
ranking gives an upper bound on what an
ideal method could achieve.
Setup The document coverage calculation is on
a per (e, r) pair basis. In a single iteration, all
the proposed methods are given a set of docu-
ments partitioned by (e, r) pairs. Each method
uses its technique to rank the documents, and
the top k ranked documents are given to the RE
API (Rosette or Diffbot) for obtaining the set of
relational tuples.
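A minimal sketch of this evaluation loop; `rank_fn` and `run_re_api` are hypothetical stand-ins for the four ranking techniques and the Rosette/Diffbot calls:

```python
def extract_from_top_k(docs_by_pair, rank_fn, run_re_api, k=10):
    tuples_per_pair = {}
    for (entity, relation), docs in docs_by_pair.items():
        ranked = rank_fn(docs, entity, relation)          # best-first document order
        extracted = set()
        for doc in ranked[:k]:                            # only top-k documents are processed
            extracted |= set(run_re_api(doc, entity, relation))
        tuples_per_pair[(entity, relation)] = extracted
    return tuples_per_pair
```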
Results Figure 5 (top) compares the total RE
tuples obtained by the proposed methods, aver-
aged across test dataset entities and 8 chosen
relations. Notably, BM25 does not perform much
better than random, while coverage prediction is
not far behind the perfect ranking defined by the
coverage oracle. Ordering documents by cover-
age prediction instead of IR-relevance gives 50%
more extractions from the top-10 documents.
Figure 5 (bottom) shows the number of RE tu-
ples that match the Wikidata KB, thus comparing
the methods on precision. As was foreseeable, the
coverage oracle method wins due to the usage
of correct coverage values for ranking. HERB’s
coverage prediction performance is considerably
higher than IR-relevance and other methods, while
it matches the coverage oracle for k ≥ 4. Beyond
k > 15, all methods yield nearly the same sets of
tuples, hence similar precision.
8.2 Budget-constrained Relation Extraction
Document coverage predictions are particularly
important for massive-scale RE tasks targeted at
long-tail entities, such as populating or augment-
ing a domain-specific knowledge base (e.g., about
diabetes or jazz music). Such tasks may require
screening a huge number of documents. Therefore,
practically viable RE methods need to operate un-
der budget constraints, regarding the monetary
cost of computational resources (e.g., using and
paying for cloud servers) as well as the cost of
energy consumption and environmental impact.
In the experiment described here, we simulate
this setting, comparing standard RE by SSAN
against HERB-enhanced RE where HERB prior-
itizes documents for RE by SSAN. We assume a
budget of 10 minutes of processing time and give
both methods 100 candidate documents. SSAN
selects documents randomly and processes them
until it runs out of time. HERB+SSAN sorts docu-
ments by HERB scores for high coverage and then
lets SSAN process them in this order. The time
Subject | Relation | Object | Document Snippet
Alphabet Inc. | ceo | Susan Wojcicki | Susan Wojcicki is CEO of Alphabet subsidiary YouTube, which has 2 billion monthly users.
Oracle Corporation | founded-by | David Agus | Oracle Co-founder Larry Ellison and acclaimed physician and scientist Dr. David Agus formed Sensei Holdings, Inc.
PepsiCo | board-member | Joan Crawford | Film actress Joan Crawford, after marrying Pepsi-Cola president Alfred N. Steele became a spokesperson for Pepsi.
Table 11: Incorrect claims extracted by Diffbot RE API from documents predicted as low coverage.
for HERB itself is part of the 10-minute budget
for the HERB+SSAN method.
As a proof-of-concept, we ran this experiment
for a sample of 10 different entities (each with a
pool of 100 documents).
Table 10 shows the results. Due to the upfront
cost of HERB, HERB+SSAN processes fewer
documents within the 10-minute budget, but its
yield is substantially higher than that of SSAN
alone, by a factor of 1.63. This demonstrates the
need for document-coverage prediction towards
realistic usage.
8.3 Claim Refutation
Our second use case is fact-checking, specifically
the case of refuting false claims by providing
counter-evidence via RE.
Reasoning Extraction confidence and document
coverage are conceptually independent notions.
Tuttavia, when looking at sets of documents, an
interesting relation emerges. Consider two docu-
ments, d1 with high coverage, and d2 with low
coverage, along with two claims c1 and c2 from
the respective documents, extracted with the same
confidence. Can we use coverage information to
make claims about extraction correctness?
We propose the following hypothesis: Given
that d1 is asserted to have high coverage, we can
conclude that any statement not mentioned in d1
(like c2) is more likely false. In contrast, the low
coverage of d2 implies that d2 is unlikely to contain
all factual statements. Thus, c1 not being found in
d2 is no indication that it could not be true.
Validation We experimentally validated the
correctness of the above reasoning as follows.
From the collection of relation extractions from
the test dataset documents, we randomly sampled
69 pairs of claims for the same entity and rela-
zione, which had low support (cioè., extraction found
only in one website). We then ordered the pairs
by the coverage of the documents that did not
express them, obtaining 69 claims with relatively
higher coverage in non-expressing documents and
69 claims with relatively lower coverage.
We manually verified the correctness of each
claim on the Internet, verifying annotator agree-
ment on a sub-sample, where we found a high
Fleiss’ Kappa (Fleiss, 1971) inter-annotator agree-
ment of 0.82.
Using these annotations, we found that from the
69 claims absent from lower-coverage documents,
58% (40) were correct, while from those absent
from higher-coverage documents, only 36% (25)
were correct. In other words, the fraction of correct
claims absent from low-coverage documents is 1.6
times higher; so coverage can be used as a feature
for claim refutation.
Table 11 shows examples of claims absent from
high-coverage documents.
9 Conclusion
This paper introduces the new task of docu-
ment coverage prediction and a large dataset
for experimental study of the task. Our methods
show that heuristic features can boost the per-
formance of pre-trained language models without
costly fine-tuning. Moreover, we demonstrate the
value of coverage estimates for the use cases
of knowledge base construction and claim refu-
tation. Our future research includes developing a
user-friendly tool to support knowledge engineers.
Acknowledgments
We thank Andrew Yates for his suggestions. Fur-
ther thanks to the anonymous reviewers, action
editor, and fellow researchers at MPI, for their
comments towards improving our paper. This
work is supported by the German Science Foun-
dation (DFG: Deutsche Forschungsgemeinschaft)
by grant 4530095897: ‘‘Negative Knowledge at
Web Scale’’.
References
Hiba Arnaout, Simon Razniewski, Gerhard
Weikum, and Jeff Z. Pan. 2021. Negative
statements considered useful. Journal of Web
Semantics, 71:100661. https://doi.org
/10.1016/j.websem.2021.100661
David M. Blei, Andrew Ng, and Michael I. Jordan.
2003. Latent Dirichlet allocation. Journal of
Machine Learning Research, 3:993–1022.
Sihao Chen, Daniel Khashabi, Wenpeng Yin,
Chris Callison-Burch, and Dan Roth. 2019.
Seeing things from a different angle: Discov-
ering diverse perspectives about claims. In
Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Pa-
pers), pages 542–557, Minneapolis, Minnesota.
Association for Computational Linguistics.
Zhuyun Dai and Jamie Callan. 2019. Deeper text
understanding for IR with contextual neural
language modeling. In Proceedings of the
42nd International ACM SIGIR Conference
on Research and Development in Information
Retrieval, pages 985–988, New York, NY,
USA. Association for Computing Machinery.
https://doi.org/10.1145/3331184
.3331303
Fariz Darari, Werner Nutt, Giuseppe Pirrò, and
Simon Razniewski. 2013. Completeness state-
ments about RDF data sources and their use
for query answering. In The Semantic Web –
ISWC 2013, pages 66–83. Springer Berlin
Heidelberg. https://doi.org/10.1007
/978-3-642-41335-3_5
Jacob Devlin, Ming-Wei Chang, Kenton Lee, E
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Joseph L. Fleiss. 1971. Measuring nominal scale
agreement among many raters. Psychological
Bulletin, 76(5):378–382.
Rudolf Flesch and Alan J. Gould. 1949. IL
Art of Readable Writing, volume 8. Harper
New York. https://doi.org/10.1037
/h0031619
Luis Galárraga, Simon Razniewski, Antoine
Amarilli, and Fabian M. Suchanek. 2017. Pre-
dicting completeness in knowledge bases. In
Proceedings of the Tenth ACM International
Conference on Web Search and Data Mining,
pages 375–383, New York, NY, USA. Associ-
ation for Computing Machinery. https://
doi.org/10.1145/3018661.3018739
Max Grusky, Mor Naaman, and Yoav Artzi.
2018. Newsroom: A dataset of 1.3 million
summaries with diverse extractive strategies.
In Proceedings of the 2018 Conference of
the North American Chapter of the Associa-
tion for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Pa-
pers), pages 708–719, New Orleans, Louisiana.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N18
-1065
Xu Han, Tianyu Gao, Yankai Lin, H. Peng,
Yaoliang Yang, Chaojun Xiao, Zhiyuan Liu,
Peng Li, Maosong Sun, and Jie Zhou. 2020.
More data, more relations, more context and
more openness: A review and outlook for re-
lation extraction. In Proceedings of the 1st
Conference of the Asia-Pacific Chapter of
the Association for Computational Linguistics
and the 10th International Joint Conference on
Natural Language Processing, pages 745–758,
Suzhou, China. Association for Computational
Linguistics.
Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780. https://doi.org/10
.1162/neco.1997.9.8.1735, PubMed:
9377276
Aidan Hogan, Eva Blomqvist, Michael Cochez,
Claudia d’Amato, Gerard de Melo, Claudia
Gutiérrez, Sabrina Kirrane, José Emilio Labra
Gayo, Roberto Navigli, Sebastian Neumaier,
Axel-Cyrille N. Ngomo, Axel Polleres, Sabbir
M. Rashid, Anisa Rula, Lukas Schmelzeisen,
Juan Sequeda, Steffen Staab, and Antoine
Zimmermann. 2021. Knowledge graphs. ACM
Computing Surveys, 64(4):96–104. https://
doi.org/10.1145/3447772
Andrew Hopkinson, Amit Gurdasani, Dave Palfrey, and Arpit Mittal. 2018. Demand-weighted completeness prediction for a knowledge base. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 200–207, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-3025
Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay
Jain, and Luis Gravano. 2007. Towards a query
optimizer for text-centric tasks. ACM Trans-
actions on Database Systems, 32(4):21–es.
Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446. https://doi.org/10.1145/582415.582418
Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133, Berlin, Germany. Association for Computational Linguistics.
Zachary C. Lipton, Charles Elkan, and Balakrishnan Narayanaswamy. 2014. Optimal thresholding of classifiers to maximize F1 measure. In Machine Learning and Knowledge Discovery in Databases, pages 225–239, Berlin, Heidelberg. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-44851-9_15, PubMed: 26023687
Michael Luggen, Djellel Difallah, Cristina Sarasua, Gianluca Demartini, and Philippe Cudré-Mauroux. 2019. Non-parametric class completeness estimators for collaborative knowledge graphs—the case of Wikidata. In The Semantic Web – ISWC 2019, pages 453–469, Cham. Springer International Publishing. https://doi.org/10.1007/978-3-030-30793-6_26
Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics. https://doi.org/10.3115/1690219.1690287
Tom M. Mitchell, William W. Cohen, Estevam R. Hruschka, Partha P. Talukdar, Bo Yang, J. Betteridge, Andrew Carlson, Bhavana Dalvi, Matt Gardner, Bryan Kisiel, Jayant Krishnamurthy, Ni Lao, Kathryn Mazaitis, Thahir Mohamed, Ndapandula Nakashole, Emmanouil Antonios Platanios, Alan Ritter, Mehdi Samadi, Burr Settles, Richard C. Wang, Derry T. Wijaya, Abhinav Gupta, Xinlei Chen, Abulhair Saparov, Malcolm Greaves, and Joel Welling. 2018. Never-ending learning. Communications of the ACM, 61(5):103–115. https://doi.org/10.1145/3191513
Ndapandula Nakashole and Tom M. Mitchell. 2014. Language-aware truth assessment of fact candidates. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1009–1019, Baltimore, Maryland. Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-1095
Rodrigo Nogueira and Kyunghyun Cho. 2020.
Passage re-ranking with BERT. ArXiv, abs/
1901.04085.
Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep,
and Jimmy Lin. 2020. Document ranking with
a pretrained sequence-to-sequence model. In
Findings of the Association for Computational
Linguistica: EMNLP 2020, pages 708–718,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2020.findings-emnlp.63
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140):1–67.
Hannah Rashkin, Eunsol Choi, Jin Y. Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1317
Simon Razniewski, Nitisha Jain, Paramita Mirza, and Gerhard Weikum. 2019. Coverage of information extraction from sentences and paragraphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5771–5776, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1583
Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163, Berlin, Heidelberg. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-15939-8_10
Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1995. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3), pages 109–126. Gaithersburg, MD: NIST.
Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.
Livio B. Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1074
Xiaolan Wang, Xin L. Dong, Yang Li, and Alexandra Meliou. 2019. MIDAS: Finding the right web sources to fill knowledge gaps. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 578–589. https://doi.org/10.1109/ICDE.2019.00058
Gerhard Weikum, Luna Dong, Simon Razniewski,
and Fabian M. Suchanek. 2021. Machine
knowledge: Creation and curation of com-
prehensive knowledge bases. Foundations
and Trends in Databases, 10(2–4):108–490.
https://doi.org/10.1561/1900000064
Benfeng Xu, Quan Wang, Yajuan Lyu, Yong Zhu, and Zhendong Mao. 2021. Entity structure within and throughout: Modeling mention dependencies for document-level relation extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35(16), pages 14149–14157.
Dwaipayan Roy, Sumit Bhatia, and Prateek Jain. 2020. A topic-aligned multilingual corpus of Wikipedia articles for studying information asymmetry in low resource languages. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2373–2380, Marseille, France. European Language Resources Association.
Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777, Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653/v1/P19-1074
Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1004