DATA PAPER

CN-DBpedia2: An Extraction and Verification Framework
for Enriching Chinese Encyclopedia Knowledge Base

Bo Xu1, Jiaqing Liang2, Chenhao Xie2, Bin Liang2, Lihan Chen2 & Yanghua Xiao2†

1School of Computer Science and Technology, Donghua University, Shanghai 200051, China

2Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai 200433, China

Keywords: Knowledge graph; Entity typing; Slot filling; Information extraction; Crowdsourcing

Citation: B. Xu, J. Liang, C. Xie, B. Liang, L. Chen, & Y. Xiao. CN-DBpedia2: An extraction and verification framework for enriching Chinese encyclopedia knowledge base. Data Intelligence 1(2019), 244-261. doi: 10.1162/dint_a_00017

Received: January 4, 2019; Revised: May 14, 2019; Accepted: May 24, 2019

ABSTRACT

Knowledge bases play an important role in machine understanding and are widely used in various
applications, such as search engines, recommendation systems and question answering. However, most
knowledge bases are incomplete, which causes many downstream applications to perform poorly because
they cannot find the corresponding facts in the knowledge bases. In this paper, we propose an extraction and
verification framework to enrich knowledge bases. Specifically, based on the existing knowledge base,
we first extract new facts from the description texts of entities. However, not all newly extracted facts can be
added directly to the knowledge base, because the extraction may introduce errors. We therefore propose a
novel crowdsourcing-based verification step to verify the candidate facts. Finally, we apply this framework
to the existing knowledge base CN-DBpedia and construct a new version, CN-DBpedia2,
which additionally contains the high-confidence facts extracted from the description texts of entities.

1. INTRODUCTION

In recent years, there has been a great amount of efforts in trying to harvest knowledge from Web, and
a variety of knowledge graphs (KGs) or knowledge bases (KBs) have been constructed, such as YAGO [1],
DBpedia [2], Freebase [3] and CN-DBpedia [4]. These knowledge bases play important roles in many
applications, such as search engine [5], recommendation system [6] and question answering [7].

† Corresponding author: Yanghua Xiao (Email: shawyh@fudan.edu.cn; ORCID: 0000-0001-8403-9591).

© 2019 Chinese Academy of Sciences Published under a Creative Commons Attribution 4.0 International (CC BY 4.0)


However, knowledge bases are generally incomplete. Facts in current KBs (e.g., DBpedia [8], YAGO [1],
Freebase [3] and CN-DBpedia [4]) are mainly obtained from the carefully edited structured content (e.g.,
infobox and category information) of Web pages on online encyclopedia websites (e.g., Wikipedia and
Baidu Baike). Since knowledge is rich but editors have limited editing capacity, this structured content is
often incomplete, and so are the facts extracted directly from it. According to Catriple [9], only 44.2% of
articles in Wikipedia have infobox information. Likewise, in Baidu Baike, the largest Chinese online
encyclopedia website, almost 32% (nearly 3 million) of entities lack infobox and category information
altogether [10].

An incomplete knowledge base leads to poor performance in many downstream applications, since they
cannot find the corresponding facts. For example, if the knowledge graph lacks the fact about Donald
Trump's birthday, it cannot answer the question "When was Donald Trump born?".

To address this challenge, we propose an extraction and verification framework to enrich knowledge
bases. Based on the existing knowledge base, we first extract new facts from the description texts of
entities. However, not all newly extracted facts can be added directly to the knowledge base, because the
extraction may introduce errors [11, 12, 13]. For example, Table 1 reports the F1 scores of state-of-the-art
text-based extractors on the slot filling benchmark TAC data set, including a pattern-based method
(PATdist [11]), traditional machine learning methods (Mintz++ [14], SVMskip [11]), a graphical model
(MIMLRE [15]) and a neural network based method (CNNcontext [11]); the facts extracted by these
extractors still contain a lot of noise. This motivates us to employ a novel crowdsourcing method to verify
the extracted facts. Considering the human cost, we only verify the low-confidence facts. In the end, only
two types of extracted facts are added to the knowledge base: facts with high confidence, and facts with
low confidence that are verified by humans as correct.

Table 1. The F1 scores on the slot filling benchmark TAC data set (dev: data from 2012/2013, eval: data from 2014) [11].

                          Mintz++      MIMLRE       PATdist      SVMskip      CNNcontext
                          dev   eval   dev   eval   dev   eval   dev   eval   dev   eval
per:age                   .84   .71    .83   .73    .69   .80    .86   .74    .83   .76
per:alternate names       .29   .03    .29   .03    .50   .50    .35   .02    .32   .04
per:children              .76   .43    .77   .48    .10   .07    .81   .68    .82   .61
per:cause of death        .76   .42    .75   .36    .44   .11    .82   .32    .77   .52
per:date of birth         1.0   .60    .99   .60    .67   .57    1.0   .67    1.0   .77
per:date of death         .67   .45    .67   .45    .30   .32    .79   .54    .72   .48
per:empl memb of          .38   .36    .41   .37    .24   .22    .42   .36    .41   .37
per:location of birth     .56   .22    .56   .22    .30   .30    .59   .27    .59   .23
per:loc of death          .65   .41    .66   .43    .13   .00    .64   .34    .63   .28
per:loc of residence      .14   .11    .15   .18    .10   .03    .31   .33    .20   .23
average                   .53   .41    .54   .42    .35   .36    .62   .48    .60   .46

 http://baike.baidu.com/



The missing facts in a knowledge base mainly involve two kinds of relationships: between entities and
entities, and between entities and concepts. In this paper, we use the description texts of entities to
enrich the knowledge base through two subtasks, entity typing and slot filling. Our contributions are as
follows:

1) First, for the entity typing subtask, we propose a multi-instance learning model that processes textual
information as well as heterogeneous information.
2) Second, for the slot filling subtask, we use a transfer learning strategy to extract the values of
long-tailed predicates.
3) Third, we propose a novel implicit crowdsourcing approach to verify low-confidence new facts.
4) Finally, we apply this framework to the existing knowledge base CN-DBpedia and release a new
version, CN-DBpedia2, which additionally contains the facts extracted from the description texts of
entities. By April 2019, CN-DBpedia2 contained about 16,024,656 entities and 228,499,155 facts.

The rest of this paper is organized as follows. Section 2 introduces the system architecture of CN-DBpedia2.
Section 3 and Section 4 detail the methods of entity typing and slot filling. Section 5 introduces how to
verify those low-confidence new facts. Section 6 presents the statistics of our new system. Finally, Section
7 concludes the paper.

2. SYSTEM ARCHITECTURE

The system architecture of CN-DBpedia2 is shown in Figure 1, which is an extension of CN-DBpedia.
CN-DBpedia2 uses Baidu Baike, Hudong Baike, Chinese Wikipedia and other domain encyclopedia
websites as data sources, and the pipeline process consists of five components:

- The extraction component is used to extract raw facts from the articles of the encyclopedia websites,
including crawling the Web pages of all the entities in the data sources and extracting the raw facts
from the structured text of the pages.
- The normalization component is used to normalize the raw facts, including the normalization of
attributes/predicates and values of the facts.
- The enrichment component is used to extract new facts that cannot be obtained directly from the
structured text of Web pages.
- The correction component is used to correct some of the erroneous facts in the knowledge base,
including error detection and crowdsourcing-based error correction.
- The update component is used to keep the knowledge base fresh, including periodic updates and
active updates.





Figure 1. System architecture of CN-DBpedia2. Note: It is an extension of CN-DBpedia, and the dotted components are new features.

CN-DBpedia2 is different from CN-DBpedia in the enrichment component. We propose an extraction
and verification framework to enrich the knowledge bases, which includes three new features, entity typing,
slot filling and fact verification. As shown in Figure 2, we use both entity typing and slot filling methods to
extract new facts from the description texts of entities, and low-confidence facts need to be verified before
they are added to the knowledge base.
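The following is a minimal sketch of how such an extraction-and-verification routing could look in code. It is illustrative only: the helper names (extract_types, extract_slots, crowd_verify, kb.add) and the 0.9 confidence threshold are hypothetical placeholders, not the production pipeline.

```python
def enrich(entity, description_text, kb, extract_types, extract_slots, crowd_verify,
           high_conf=0.9):
    """Extract candidate facts from an entity description and decide what to add."""
    candidates = []
    # Entity typing: (entity, isA, type) candidates with confidence scores.
    candidates += extract_types(entity, description_text)
    # Slot filling: (entity, attribute, value) candidates with confidence scores.
    candidates += extract_slots(entity, description_text)

    for fact, confidence in candidates:
        if confidence >= high_conf:
            kb.add(fact)                      # high-confidence facts are added directly
        elif crowd_verify(fact, description_text):
            kb.add(fact)                      # low-confidence facts need human verification
        # facts that fail verification are discarded
```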



Figure 2. The detail of enrichment component in CN-DBpedia2.

3. ENTITY TYPING

The entity typing task is to find a set of types/concepts for each entity in knowledge bases. An entity
contains both structured and unstructured features. In CN-DBpedia, we have used structured features to
type entities [10]. In CN-DBpedia2, we first use unstructured features alone to type entities and then use
both structured and unstructured features together to type the entities.

3.1 Entity Typing from Unstructured Features

We propose a multi-instance method, METIC [13], to type entities with unstructured features alone.
An entity may have multiple mentions in a corpus; we take each mention of an entity in the KB as an
instance of the entity and learn the types of the entity from its multiple instances. Specifically, we first
use an end-to-end neural network model to type each instance of an entity (mention typing), and then
use an integer linear programming (ILP) method to aggregate the predicted types from multiple instances
(type fusion). The framework of our solution is shown in Figure 3.

In the offline phase, we train models for the two subtasks, mention typing and type fusion, separately. For
mention typing, we model it as multi-label classification. In our setting, we use a distant supervision method
to construct the training data automatically and build a supervised learning model (more specifically, we
propose a neural network model). For type fusion, we model it as a constrained optimization problem and
propose an integer linear programming model to solve it. The constraints are derived from the semantic
relationships between types.

In the online phase, we use the models built in the offline phase to enrich the types of each entity in the
KB. For each entity e, we first employ an existing entity linking system [16] to discover entity mentions
from its corpus. Each mention and its corresponding context are fed into the mention typing model.
The model then derives a set of types, each associated with a probability (i.e., P(t|mi)). The types and
their probabilities derived from each mention are further fed into the integer linear programming model,
with constraints specified as exclusiveness or compatibility among types. The model finally selects a
subset of all candidate types as the final output.




Figure 3. Framework of METIC [13], a multi-instance method for entity typing from unstructured features.

In mention typing step, we propose a neural network model, as shown in Figure 4. We first divide the
sentence into three parts: the left context of the mention, the mention part and the right context of the
mention. Each part of the sentence is fed into a parallel neural network with similar structure: a word
embedding layer and a BiLSTM (bidirectional long short-term memory) layer [17]. We then concatenate the
outputs of these BiLSTM layers to generate the final output.
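The following is a minimal PyTorch sketch of this three-branch model. The embedding and hidden sizes are illustrative assumptions, not the settings used in METIC, and the multi-label output simply applies a sigmoid per type.

```python
import torch
import torch.nn as nn

class MentionTyping(nn.Module):
    """Sketch of the three-branch BiLSTM mention-typing model described above."""

    def __init__(self, vocab_size, num_types, emb_dim=100, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # one BiLSTM per sentence part: left context, mention, right context
        self.encoders = nn.ModuleList(
            [nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
             for _ in range(3)])
        self.classifier = nn.Linear(3 * 2 * hidden, num_types)

    def forward(self, left, mention, right):
        outs = []
        for part, lstm in zip((left, mention, right), self.encoders):
            _, (h, _) = lstm(self.embed(part))            # h: (2, batch, hidden)
            outs.append(torch.cat([h[0], h[1]], dim=-1))  # concat both directions
        logits = self.classifier(torch.cat(outs, dim=-1))
        return torch.sigmoid(logits)   # P(t | mention) for multi-label typing
```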

In the type fusion step, we propose an integer linear programming (ILP) model to aggregate all the types
derived from the mentions of an entity and to reduce the noise. ILP is an optimization model with
constraints in which all variables are required to be non-negative integers [18]. For each entity e, we first
define a decision variable $x_{e,t}$ for every candidate type t. These variables are binary and indicate
whether entity e belongs to type t or not. Our ILP model is as follows:

Maximize

$$\sum_{t \in T} \Big( \max_{m \in M_e} P(t \mid m) - h \Big) \times x_{e,t}$$

Subject to

$$\mathrm{ME}(t_1, t_2) \Rightarrow x_{e,t_1} + x_{e,t_2} \le 1$$
$$\mathrm{ISA}(t_1, t_2) \Rightarrow x_{e,t_1} \le x_{e,t_2}$$
$$x_{e,t} \in \{0, 1\}$$

where $\max_{m \in M_e} P(t \mid m)$ represents the maximum probability that one mention of the entity e belongs to type
t, and h is the threshold (in our experiments, we set the threshold to 0.5).




Figure 4. A neural network model for mention typing.

We propose two constraints for our ILP model: a type disjointness constraint and a type hierarchy
constraint. The type disjointness constraint states that an entity cannot belong to two semantically mutually
exclusive types simultaneously, such as Person and Organization. Types with no overlap in entities, or an
insignificant overlap below a specified threshold, are considered to be disjoint [19]. The type hierarchy
constraint states that if an entity does not belong to a type t, then it will certainly not belong to any of t's
sub-types. For example, an entity that does not belong to type Artist should not be classified as type Actor.
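The following is a sketch of the type-fusion ILP, implemented with the off-the-shelf PuLP solver; the paper does not prescribe a particular solver, and the inputs (candidate_probs, disjoint_pairs, isa_pairs) are hypothetical data structures standing in for the mention-typing output and the type taxonomy.

```python
import pulp

def type_fusion(entity, candidate_probs, disjoint_pairs, isa_pairs, h=0.5):
    """candidate_probs: {type: max_m P(t|m)}; disjoint_pairs: mutually exclusive
    type pairs; isa_pairs: (sub_type, super_type) pairs from the taxonomy."""
    prob = pulp.LpProblem("type_fusion", pulp.LpMaximize)
    x = {t: pulp.LpVariable(f"x_{t}", cat="Binary") for t in candidate_probs}

    # objective: sum_t (max_m P(t|m) - h) * x_{e,t}
    prob += pulp.lpSum((p - h) * x[t] for t, p in candidate_probs.items())

    # type disjointness: an entity cannot take two mutually exclusive types
    for t1, t2 in disjoint_pairs:
        if t1 in x and t2 in x:
            prob += x[t1] + x[t2] <= 1
    # type hierarchy: a sub-type can only be selected if its super-type is
    for sub, sup in isa_pairs:
        if sub in x and sup in x:
            prob += x[sub] <= x[sup]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [t for t, var in x.items() if var.value() == 1]
```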

3.2 Entity Typing from Heterogeneous Features

In order to take advantage of both the structured and unstructured features of an entity, we propose a new
framework, METIH (Multi-instance Entity TypIng from Heterogeneous features), which is a modified version
of METIC. As shown in Figure 5, most of the components are the same as in METIC, except the ones
marked by the dotted line. Specifically, in the type fusion step, we treat the prediction result from structured
features as a new instance of an entity, and use a new ILP model to aggregate those prediction results. The
new ILP model is as follows:



Maximize

$$\sum_{t \in T} \Big( \max\big( \max_{m \in M_e} P(t \mid m),\ d(t \in C_e) \big) - h \Big) \times x_{e,t}$$

Subject to

$$\mathrm{ME}(t_1, t_2) \Rightarrow x_{e,t_1} + x_{e,t_2} \le 1$$
$$\mathrm{ISA}(t_1, t_2) \Rightarrow x_{e,t_1} \le x_{e,t_2}$$
$$x_{e,t} \in \{0, 1\}$$

where the function $d(t \in C_e)$ is defined as follows:

$$d(t \in C_e) = \begin{cases} 1, & \text{if type } t \text{ belongs to } C_e \\ 0, & \text{else} \end{cases} \qquad (1)$$

$\max\big(\max_{m \in M_e} P(t \mid m),\ d(t \in C_e)\big)$ represents the maximum probability that entity e belongs to type t, where
$\max_{m \in M_e} P(t \mid m)$ is the maximum probability from unstructured features, while $d(t \in C_e)$ is the probability from
structured features.
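A small sketch of how the combined objective coefficients could be computed before being handed to the same ILP as above; the function and argument names are illustrative assumptions.

```python
def combined_scores(mention_probs, structured_types, h=0.5):
    """mention_probs maps each candidate type t to max_m P(t|m) from text;
    structured_types is the type set C_e predicted from structured features,
    so d(t in C_e) is 1 or 0. Returns the per-type ILP objective coefficients."""
    candidates = set(mention_probs) | set(structured_types)
    return {t: max(mention_probs.get(t, 0.0),
                   1.0 if t in structured_types else 0.0) - h
            for t in candidates}
```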


Figure 5. Framework of METIH, a multi-instance method for entity typing from heterogeneous features.

4. SLOT FILLING

Some entities’ attribute values cannot be directly extracted due to missing information, so we extract the
attribute values from text corpus. We cast it as a slot filling task, given an entity and an attribute, our goal
is to extract the values from the description text of the entity. The state-of-the-art methods usually use
supervised learning to build extractors for each predicate, and treat it as a sequence tagging problem.


However, the training samples for the predicates are unbalanced: head predicates usually have a large
number of training samples, while long-tailed ones have only a few. In CN-DBpedia, we have extracted
the values for head predicates. In CN-DBpedia2, we focus on extracting the values for long-tailed
predicates.

A naive solution for slot filling is to train a single-predicate extractor for each predicate. The single-predicate
extractor is trained separately on the data of each predicate. In the case of long-tailed predicates
with insufficient training data, the single-predicate extractors may not be fully trained, which impairs the
performance. Therefore, we propose a Multiple Predicate Extractor with Prior Knowledge (MPK) model.
The MPK model also takes the predicate as an input. We first pre-train a model using all the head
predicates, and then fine-tune the model for each long-tailed predicate by transfer learning. Since most
of the parameters in the model are shared across different predicates, the long-tailed predicates can
utilize the abundant training data of the head predicates through transfer learning.

Naturally, we expect to utilize the training samples from other predicates, which motivates us to develop
a multiple predicate extractor structure. The multiple predicate extractor network structure is shown in
Figure 6. The extractor is divided into five parts: 1) text embedding layer, 2) knowledge embedding layer,
3) text-knowledge attention layer, 4) encoder layer and 5) output layer.
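A constructor-only PyTorch skeleton listing these five parts is sketched below; the layer widths, head count and sub-module choices are assumptions rather than the paper's settings, and the forward pass is spelled out piecewise in the sketches accompanying Sections 4.2-4.3.

```python
import torch.nn as nn

class MPKExtractor(nn.Module):
    """Skeleton of the five-part extractor in Figure 6 (illustrative only)."""

    def __init__(self, vocab_size, type_vocab, pred_vocab, num_tags,
                 emb=100, hidden=100):
        super().__init__()
        d = 2 * hidden
        # 1) text embedding layer: word embeddings encoded by a BiLSTM
        self.word_emb = nn.Embedding(vocab_size, emb)
        self.text_enc = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        # 2) knowledge embedding layer: type self-attention + predicate embedding
        self.type_emb = nn.Embedding(type_vocab, d)
        self.pred_emb = nn.Embedding(pred_vocab, d)
        self.type_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        # 3) text-knowledge attention layer (BiDAF-style, Sections 4.2-4.3)
        self.trilinear = nn.Linear(3 * d, 1, bias=False)
        # 4) model encoder layer: two further BiLSTMs over the fused sequence
        self.encoder = nn.LSTM(4 * d, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        # 5) output layer: per-token tag scores (a CRF would normally sit on top)
        self.out = nn.Linear(d, num_tags)
```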


Figure 6. Network structures of the multiple predicate extractor.



4.1 Text Embedding Layer

We first pre-process the sentences. The text embedding layer aims to capture the text features before
extraction, including both word and phrase information. We then encode the concatenation of these
embedded vectors with a BiLSTM layer.

4.2 Knowledge Embedding Layer

In this layer, we embed the prior knowledge as guidance for the subsequent information extraction, i.e.,
it decides what kind of information should be extracted. The prior knowledge includes the types of the
entity (subject) and the predicate specifying the information to extract.

Type Encoder Layer. One entity may belong to multiple types, which imply different aspects of the
entity. We use a self-attention layer to embed all the types of an entity, based on the multi-head attention
mechanism proposed by [20]. Specifically, each type (the query) is mapped to a weighted sum of all
types (the keys) in the input, with weights based on the similarity between the query and each key as
measured by the dot product.

Predicate-Type Attention Layer. We combine the type and predicate information in this layer. The
predicate and the type together determine what is to be extracted. For example, with the entity type person
and the predicate birthplace, we can determine that the task is to extract where the person was born,
which is similar to a query. However, an entity often has many noisy types (e.g., actor and producer in
Figure 6). We use predicate-to-type attention to select suitable types to form the query.
Specifically, we use T and P to denote the encoded types and the predicate, respectively. The predicate-to-type
similarity is computed as $S_0 \in \mathbb{R}^{1 \times m}$. We then normalize the only row of $S_0$ by applying the softmax function,
deriving a matrix $\bar{S}_0$. Then the predicate-to-type attention is computed as $A_0 = \bar{S}_0 \cdot T \in \mathbb{R}^{1 \times d}$. The similarity
function used here is the trilinear function [21]:

$$f(t, p) = W_0[t, p, t \odot p] \qquad (2)$$

where $\odot$ is the element-wise multiplication and $W_0$ is a trainable variable.

The output of this module is an encoded knowledge sequence U, where each unit is $u = W_1[t, p, t \odot p, a_0 t]$,
where t and p denote an encoded type and the predicate embedding, respectively, and $a_0$ is an attention value
in $A_0$. Each unit u specifies an aspect of the extraction task, like the birthPlace of a person.
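A minimal sketch of this layer is given below. It assumes the types have already passed through the self-attention encoder, and it reads the term $a_0 t$ as an element-wise product between the attended vector $A_0$ and each type vector, which is an interpretation where the text is ambiguous; the parameter names w0 and w1 are placeholders for the trainable weights.

```python
import torch
import torch.nn.functional as F

def trilinear(x, y, w0):
    """Equation (2): f(x, y) = W0 [x, y, x*y]; w0 is a trainable vector of size 3d."""
    return torch.cat([x, y.expand_as(x), x * y], dim=-1) @ w0

def knowledge_sequence(types, pred, w0, w1):
    """types: (m, d) self-attended type encodings T; pred: (d,) predicate embedding P;
    w0: (3d,), w1: (4d, d) trainable parameters. Returns the knowledge sequence U."""
    s0 = trilinear(types, pred, w0)            # (m,) predicate-to-type similarities S0
    s0_bar = F.softmax(s0, dim=0)              # normalise the single row of S0
    a0 = s0_bar @ types                        # predicate-to-type attention A0, (d,)
    feats = torch.cat([types, pred.expand_as(types),
                       types * pred, types * a0], dim=-1)   # (m, 4d)
    return feats @ w1                          # knowledge sequence U, (m, d)
```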

4.3 Text-Knowledge Attention Layer

This module uses the knowledge as a query to guide the extractor. We use H and U to denote the
encoded text and knowledge. Then we adopt the attention similar to the BiDAF [21] to model the interaction
between the text and the knowledge, including text-to-knowledge attention and knowledge-to-text attention.




The text-to-knowledge attention is constructed as follows. We first use Equation (2) to compute the
similarities between each pair of encoded text and knowledge units, rendering a similarity matrix
$S \in \mathbb{R}^{m \times m}$. We then normalize each row of $S$ by applying the softmax function, getting a matrix $\bar{S}$.
Then the text-to-knowledge attention is computed as $A = \bar{S} \cdot U \in \mathbb{R}^{m \times d}$.

We additionally use a form of knowledge-to-text attention. Specifically, we perform a maximum
reduction on each column of $S$ and apply the softmax function to get the knowledge-to-text
attention $B = \mathrm{softmax}(\max(S, 2))$.
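The sketch below puts the two attentions together in the BiDAF style and also forms the per-token input of the model encoder layer described in Section 4.4. The reduction axes follow the standard BiDAF construction, which is an assumption where the text is terse, and the tensor names H, U and w0 mirror the symbols above.

```python
import torch
import torch.nn.functional as F

def text_knowledge_attention(H, U, w0):
    """H: (n, d) encoded text, U: (m, d) knowledge sequence, w0: (3d,) trilinear
    weights from Equation (2). Returns the model-encoder input [h, a, h*a, h*b]."""
    n, m = H.size(0), U.size(0)
    # similarity S[i, j] = f(H[i], U[j]) using the trilinear function
    Hi = H.unsqueeze(1).expand(n, m, -1)                 # (n, m, d)
    Uj = U.unsqueeze(0).expand(n, m, -1)                 # (n, m, d)
    S = torch.cat([Hi, Uj, Hi * Uj], dim=-1) @ w0        # (n, m)
    # text-to-knowledge attention: softmax over each row, then attend over U
    A = F.softmax(S, dim=1) @ U                          # (n, d)
    # knowledge-to-text attention: max reduction over knowledge, softmax over text
    b = F.softmax(S.max(dim=1).values, dim=0)            # (n,)
    B = b.unsqueeze(0) @ H                               # (1, d) attended text vector
    return torch.cat([H, A, H * A, H * B.expand_as(H)], dim=-1)   # (n, 4d)
```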

4.4 Model Encoder Layer

The input of this layer at each position is $[h, a, h \odot a, h \odot b]$, where a and b are the corresponding rows of the
attention matrices A and B, respectively. This module contains two BiLSTM layers.

4.5 Output Layer

In the output layer, we adopt a Conditional Random Field (CRF) layer, as it has been proven effective
for sequence tagging tasks [22].

4.6 Training

To tackle the insufficient training data for the long-tailed predicates, we want the model to benefit
from the abundant training samples of the head predicates. To achieve this, we let the model first learn
how to extract the values of the head predicates, and then learn the extraction of the long-tailed predicates
through transfer learning.

Specifically, we build an extractor for each long-tailed predicate. The training process includes two steps.
First, we pre-train the MPK model on the training data of the Top-K head predicates. Then we fine-tune the
MPK model on the training data of each long-tailed predicate.
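A minimal sketch of this two-step training loop is given below. It assumes the model exposes a loss() helper and uses Adam with arbitrary epoch counts and learning rates; none of these choices are prescribed by the paper.

```python
import copy
import torch

def train_mpk(model, head_data, longtail_data, epochs_pre=5, epochs_ft=3, lr=1e-3):
    """head_data pools the training samples of the Top-K head predicates;
    longtail_data maps each long-tailed predicate to its (small) training set."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # Step 1: pre-train the shared extractor on the head predicates.
    for _ in range(epochs_pre):
        for batch in head_data:
            opt.zero_grad()
            model.loss(batch).backward()     # assumed: model exposes a loss() helper
            opt.step()

    # Step 2: fine-tune a copy of the pre-trained model per long-tailed predicate.
    extractors = {}
    for predicate, data in longtail_data.items():
        ft_model = copy.deepcopy(model)
        ft_opt = torch.optim.Adam(ft_model.parameters(), lr=lr / 10)
        for _ in range(epochs_ft):
            for batch in data:
                ft_opt.zero_grad()
                ft_model.loss(batch).backward()
                ft_opt.step()
        extractors[predicate] = ft_model
    return extractors
```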

5. FACT VERIFICATION

Through the two steps of entity typing and slot filling, we can obtain many facts from the text. However,
some of them are wrong, and adding these errors to an existing knowledge base would reduce its quality.
Hence we need humans to verify the low-confidence facts before adding them to the current knowledge
base. The verification task is as follows: given a piece of text and a fact extracted from the text, our goal is
to let people judge whether this fact can be inferred from the text.

In general, it is very costly to have experts verify all the facts. To solve this problem, we propose a novel
implicit crowdsourcing approach that asks users to verify the extracted facts. Our idea is inspired by
CAPTCHA [23], which is a program that can generate and grade tests that most humans can pass but
current computer programs cannot. Such a program can be used to differentiate humans from computers.




As shown in Figure 7, a CAPTCHA asks users to type the distorted text shown in an image. If the typed
characters match the characters in the image, the test is considered passed.

Figure 7. An example of CAPTCHA. Note: In this case, the answer is “overlooks inquiry”.

Specifically, we propose a different type of CAPTCHA, rcCAPTCHA, which is a reading comprehension
test. Note that our goal is to let people verify whether a fact can be inferred from a piece of text. For each
test, the reading passage is a piece of text, the question is generated from the fact, and the answer options
are generated from both the text and the fact.

For different types of facts, we propose different ways to generate the question and answer options. Given a
fact derived from entity typing, such as (e, isA, t), where e ∈ E is an entity and t ∈ T is a type, we generate
a question such as "Which type does e belong to?"; the answer options are t and all its siblings in
the type taxonomy, and the correct answer in this test is t. Since one entity may belong to many types, the
entity e may belong to some sibling types of t according to other extracted facts; we exclude those types
from the answer options. Given a fact derived from slot filling, such as (e, a, v), where a ∈ A is an
attribute and v is the attribute value, we generate a question such as "What is the a of e?"; the answer
options are all the words in the text (including v), and the correct answer in this test is v. For each
extracted fact, we generate some reading comprehension tests. When multiple users click on the correct
answer and pass the test, the extracted fact is considered correct.
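A small sketch of this test generation is shown below. The helpers type_taxonomy.siblings(t) and known_types(e) are hypothetical stand-ins for the type hierarchy and the already-extracted types of e, and the question templates simply mirror the two cases above.

```python
import random

def make_test(fact, text, type_taxonomy, known_types):
    """fact is either (e, 'isA', t) or (e, a, v); returns one rcCAPTCHA test."""
    subject, predicate, obj = fact
    if predicate == "isA":
        question = f"Which type does {subject} belong to?"
        # options: t plus its siblings, excluding other known types of the entity
        options = [obj] + [s for s in type_taxonomy.siblings(obj)
                           if s not in known_types(subject)]
    else:
        question = f"What is the {predicate} of {subject}?"
        # options: every distinct word in the passage (the value v is among them)
        options = list(dict.fromkeys(text.split()))
    random.shuffle(options)
    return {"passage": text, "question": question,
            "options": options, "answer": obj}
```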

To verify the accuracy of the facts we extract, we released a free rcCAPTCHA API and deployed it
on multiple websites. The CN-DBpedia search engine is one of those websites, and our rcCAPTCHA system
is triggered when the number of user searches exceeds a certain threshold. The search can only be continued
if the user correctly answers the question from the rcCAPTCHA system. Figure 8 shows an
instance of our system. The instance is used to verify whether the fact (Galare Thong Tower, star rating,
3-star) is correct. Based on this fact, the system generates a question (What is the star rating of Galare Thong
Tower?) and a text containing the answer (Galare Thong Tower is a 3-star hotel located in Chiang Mai… It
is a 9-minute drive from Wat Chiang Man…). If the majority of users click on "3-star" in the text, the
triple is considered correct.
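A minimal sketch of the majority-vote decision follows; the 20-vote minimum and 50% ratio are illustrative assumptions based on the statistics reported below, not the system's exact thresholds.

```python
def aggregate(answers, correct_answer, min_votes=20, ratio=0.5):
    """Accept a fact once enough users answered its test and most chose correctly."""
    if len(answers) < min_votes:
        return None                                   # keep collecting answers
    hits = sum(1 for a in answers if a == correct_answer)
    return hits / len(answers) > ratio                # True: add the fact to the KB
```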

 http://www.captcha.net/
 http://kw.fudan.edu.cn/apis/supervcode/
 http://kw.fudan.edu.cn/cndbpedia/




Figure 8. An example of our rcCAPTCHA system. Note: The word marked in yellow is the correct answer.

Our rcCAPTCHA system is triggered thousands of times per day on average, and each fact is verified by
an average of 20 users. We randomly sampled 100 extracted facts that were verified as correct, and their
accuracy was 100%. Although the current verification speed is slow, as our system is used by more and
more people, the verification will be accelerated.

6. STATISTICS FOR CN-DBPEDIA2

We present the statistics of CN-DBpedia2 in this section. The numbers of entities and facts in CN-DBpedia2
are much larger than in CN-DBpedia. By April 2019, CN-DBpedia2 contained about 16,024,656 entities and
228,499,155 facts, while CN-DBpedia only contained 10,341,196 entities and 88,454,264 facts. Table 2
shows the fact types in CN-DBpedia2, as well as the changes compared to CN-DBpedia. The increase
mainly comes from three aspects: first, new data sources have been added; second, textual information is
used; third, updating is implemented.

Table 2. Fact types in CN-DBpedia2, as well as changes compared to CN-DBpedia.

Rank   Rank Change   Fact Types           Fact Quantity   Quantity Change
1      0             Entity Infobox       149,350,983     +108,210,921
2      +1            Entity Types         45,065,595      +25,219,295
3      -1            Entity Tags          25,923,300      +6,057,489
4      0             Entity Information   8,016,829       +4,012,928
5      0             Entity SameAs        142,448         0

By using the new entity typing method, we found more types for entities in CN-DBpedia2. Table 3 shows
the top 20 concepts and the number of entities they contain in CN-DBpedia2, as well as the changes
compared to CN-DBpedia. Because textual information is taken into account, entities have gained more
concepts, and the sizes of the concepts have increased significantly (for the two concepts Company and
Single, the increase is mainly due to the extraction from the domain encyclopedia websites).




Table 3. The top 20 concepts and the number of entities included in CN-DBpedia2, as well as changes compared to CN-DBpedia.

Rank   Rank Change   Type                      Entity Quantity   Quantity Change
1      +1            Agent                     8,771,817         +6,766,894
2      +7            Organisation              7,019,610         +6,228,636
3      +9            Company                   6,545,250         +6,128,240
4      -3            Work                      4,786,508         +2,257,454
5      0             WrittenWork               2,918,856         +1,820,837
6      0             Book                      2,742,963         +1,686,857
7      +9            Novel                     1,504,497         +1,358,730
8      -5            Person                    1,488,264         +270,276
9      -5            Place                     1,401,132         +203,869
10     +7            MusicalWork               950,072           +699,196
11     +7            Single                    897,631           +691,346
12     -4            PopulatedPlace            546,813           -69,209
13     -3            Settlement                404,672           -57,410
14     -5            ArchitecturalStructure    250,638           -241,942
15     -2            Species                   247,723           +36,187
16     -2            Eukaryote                 237,418           +29,647
17     +2            Software                  230,900           +80,114
18     +2            Device                    199,341           +75,976
19     +2            Athlete                   167,293           +45,960
20     -5            Food                      163,823           -14,866

Based on CN-DBpedia2, we also publish some new APIs for natural language understanding, including
entity linking and question answering. By April 2019, these APIs had been called 950 million times
since their release in December 2015.

7. CONCLUSION

In this paper, we released a new version of the knowledge base, CN-DBpedia2, which is an extension of
CN-DBpedia. Based on the existing knowledge base, we additionally exploit both structured and unstructured
features to type entities, and propose a transfer learning strategy to extract the values of long-tailed
predicates. We then propose a novel implicit crowdsourcing approach to verify low-confidence new
facts, and the facts verified as correct are added to the knowledge base. By April 2019, CN-DBpedia2
contained about 16,024,656 entities and 228,499,155 facts, and the APIs had been called 950
million times.


 http://kw.fudan.edu.cn/apis



AUTHOR CONTRIBUTIONS

This work was a collaboration between all of the authors. Y. Xiao (shawyh@fudan.edu.cn, corresponding
author) is the leader of the CN-DBpedia project. B. Xu (xubo@dhu.edu.cn) led the work and summarized
the entity typing part. C. Xie (redreamality@gmail.com) and L. Chen (lhc825@gmail.com) summarized
the slot filling part. J. Liang (l.j.q.light@gmail.com) summarized the fact verification part. B. Liang
(liangbin@fudan.edu.cn) summarized the statistics part. All the authors have made meaningful and valuable
contributions in revising and proofreading the manuscript.

ACKNOWLEDGEMENTS

This paper was supported by the National Key R&D Program of China under Grant No. 2017YFC1201203,
by the Shanghai Sailing Program under Grant No. 19YF1402300, and by the Initial Research Funds for
Young Teachers of Donghua University under Grant No. 112-07-0053019.


REFERENCES

[1] F.M. Suchanek, G. Kasneci, & G. Weikum. YAGO: A core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, ACM, 2007, pp. 697–706. doi: 10.1145/1242572.1242667.
[2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, & Z. Ives. DBpedia: A nucleus for a web of open data. In: The Semantic Web, Springer, 2007, pp. 722–735. doi: 10.1007/978-3-540-76298-0_52.
[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, & J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, 2008, pp. 1247–1250. doi: 10.1145/1376616.1376746.
[4] B. Xu, Y. Xu, J. Liang, C. Xie, B. Liang, W. Cui, & Y. Xiao. CN-DBpedia: A never-ending Chinese knowledge extraction system. In: International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Springer, 2017, pp. 428–438. doi: 10.1007/978-3-319-60045-1_44.
[5] C. Xiong, R. Power, & J. Callan. Explicit semantic ranking for academic search via knowledge graph embedding. In: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 1271–1279. doi: 10.1145/3038912.3052558.
[6] D. Yang, J. He, H. Qin, Y. Xiao, & W. Wang. A graph-based recommendation across heterogeneous domains. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ACM, 2015, pp. 463–472. doi: 10.1145/2806416.2806523.
[7] W. Cui, Y. Xiao, & W. Wang. KBQA: An online template based question answering system over Freebase. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016), 2016, pp. 4240–4241. Available at: https://www.ijcai.org/Proceedings/16/Papers/640.pdf.
[8] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, … & C. Bizer. DBpedia—A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal 6(2) (2015), 167–195. doi: 10.3233/SW-140134.
[9] Q. Liu, K. Xu, L. Zhang, H. Wang, Y. Yu, & Y. Pan. Catriple: Extracting triples from Wikipedia categories. In: Asian Semantic Web Conference, Springer, 2008, pp. 330–344. doi: 10.1007/978-3-540-89704-0_23.
[10] B. Xu, Y. Zhang, J. Liang, Y. Xiao, S.-W. Hwang, & W. Wang. Cross-lingual type inference. In: International Conference on Database Systems for Advanced Applications, Springer, 2016, pp. 447–462. doi: 10.1007/978-3-319-32025-0_28.


[11] H. Adel, B. Roth, & H. Schütze. Comparing convolutional neural networks to traditional models for slot filling. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 828–838. Available at: https://www.aclweb.org/anthology/N16-1097.
[12] Y. Zhang, V. Zhong, D. Chen, G. Angeli, & C.D. Manning. Position-aware attention and supervised data improve slot filling. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 35–45. doi: 10.18653/v1/D17-1004.
[13] B. Xu, Z. Luo, L. Huang, B. Liang, Y. Xiao, D. Yang, & W. Wang. METIC: Multi-instance entity typing from corpus. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, ACM, 2018, pp. 903–912. doi: 10.1145/3269206.3271804.
[14] M. Mintz, S. Bills, R. Snow, & D. Jurafsky. Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, Association for Computational Linguistics, 2009, pp. 1003–1011. doi: 10.3115/1690219.1690287.
[15] M. Surdeanu, J. Tibshirani, R. Nallapati, & C.D. Manning. Multi-instance multi-label learning for relation extraction. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, 2012, pp. 455–465. Available at: https://dl.acm.org/citation.cfm?id=2391003.
[16] L. Chen, J. Liang, C. Xie, & Y. Xiao. Short text entity linking with fine-grained topics. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, ACM, 2018, pp. 457–466. doi: 10.1145/3269206.3271809.
[17] S. Hochreiter, & J. Schmidhuber. Long short-term memory. Neural Computation 9 (1997), 1735–1780. doi: 10.1162/neco.1997.9.8.1735.
[18] J. Clarke, & M. Lapata. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research 31 (2008), 399–429. doi: 10.1613/jair.2433.
[19] N. Nakashole, T. Tylenda, & G. Weikum. Fine-grained semantic typing of emerging entities. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL, 2013, pp. 1488–1497. Available at: https://www.aclweb.org/anthology/P13-1146.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, & I. Polosukhin. Attention is all you need. In: I. Guyon, et al. (eds.) Advances in Neural Information Processing Systems 30, Curran Associates Inc., 2017, pp. 5998–6008. Available at: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
[21] M.J. Seo, A. Kembhavi, A. Farhadi, & H. Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint. arXiv:1611.01603, 2016.
[22] N. Reimers, & I. Gurevych. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 338–348. doi: 10.18653/v1/D17-1035.
[23] L. Von Ahn, M. Blum, N.J. Hopper, & J. Langford. CAPTCHA: Using hard AI problems for security. In: International Conference on the Theory and Applications of Cryptographic Techniques, Springer, 2003, pp. 294–311. doi: 10.1007/3-540-39200-9_18.




AUTHOR BIOGRAPHY

Bo Xu is currently a lecturer at the School of Computer Science and
Technology, Donghua University. He received his PhD degree from Fudan
University in 2018. His research interests include knowledge base construction
and applications. He has published several papers at major conferences
including the International Joint Conference on Artificial Intelligence (IJCAI),
the Conference on Information and Knowledge Management (CIKM), the
European Conference on Principles of Data Mining and Knowledge Discovery
(PKDD) and the International Conference on Database Systems for Advanced
Applications (DASFAA). He won the Best Student Paper Award at the 31st
National Database Conference (NDBC 2014).

Jiaqing Liang received his BS degree from the School of Computer Science,
Fudan University, China, in 2015. He is currently working toward the PhD
degree in the School of Computer Science, Fudan University, China. His
research interests include knowledge bases and deep learning for text data.

Chenhao Xie is currently a PhD candidate at the School of Computer
Science, Fudan University. He is the co-founder of Shuyan Inc (http://
shuyantech.com). His main research interests include information extraction
and knowledge graphs. He has published several papers at conferences
including the International Joint Conference on Artificial Intelligence (IJCAI),
the Conference on Information and Knowledge Management (CIKM), and the
International Conference on Data Mining (ICDM).




Bin Liang is a PhD candidate in the School of Computer Science at Fudan
University, China. He also completed his undergraduate studies at the
Department of Computer Science and Technology at Fudan University. His
research interests include data science, knowledge graph, recommender
systems and social network mining.

Lihan Chen is a PhD candidate in the School of Computer Science at
Fudan University, China. His research interests include entity linking and
information extraction.

Yanghua Xiao is a full professor of computer science at Fudan University.
He is one of the young "973" scientists. His research interests include big data
management and mining, graph databases and knowledge graphs. Recently,
he has published more than 70 papers in international leading journals and
top conferences. He won the Best PhD Thesis Nomination of the Chinese
Computer Federation (CCF) in 2010, the CCF 2014 Natural Science Award
(second level), and the ACM (CCF) Shanghai Distinguished Young Scientists
Nomination Award.

