Compositional Generalization in Multilingual
Semantic Parsing over Wikidata

Ruixiang Cui, Rahul Aralikatte, Heather Lent, and Daniel Hershcovich
Department of Computer Science
University of Copenhagen, Denmark
{rc, rahul, hcl, dh}@di.ku.dk

Abstract

Semantic parsing (SP) allows humans to leverage vast knowledge resources through natural interaction. However, parsers are mostly designed for and evaluated on English resources, such as CFQ (Keysers et al., 2020), the current standard benchmark based on English data generated from grammar rules and oriented towards Freebase, an outdated knowledge base. We propose a method for creating a multilingual, parallel dataset of question-query pairs, grounded in Wikidata. We introduce such a dataset, which we call Multilingual Compositional Wikidata Questions (MCWQ), and use it to analyze the compositional generalization of semantic parsers in Hebrew, Kannada, Chinese, and English. While within-language generalization is comparable across languages, experiments on zero-shot cross-lingual transfer demonstrate that cross-lingual compositional generalization fails, even with state-of-the-art pretrained multilingual encoders. Furthermore, our methodology, dataset, and results will facilitate future research on SP in more realistic and diverse settings than has been possible with existing resources.

1 Introduction

Semantic parsers grounded in knowledge bases (KBs) enable knowledge base question answering (KBQA) for complex questions. Many semantic parsers are grounded in KBs such as Freebase (Bollacker et al., 2008), DBpedia (Lehmann et al., 2015), and Wikidata (Pellissier Tanon et al., 2016), and models can learn to answer questions about unseen entities and properties (Herzig and Berant, 2017; Cheng and Lapata, 2018; Shen et al., 2019; Sas et al., 2020). An important desired ability is compositional generalization—the ability to generalize to unseen combinations of known constituents (Oren et al., 2020; Kim and Linzen, 2020).


One of the most widely used datasets for measuring compositional generalization in KBQA is CFQ (Compositional Freebase Questions; Keysers et al., 2020), which was generated using grammar rules, and is based on Freebase, an outdated and unmaintained English-only KB. While the need to expand language technology to many languages is widely acknowledged (Joshi et al., 2020), the lack of a benchmark for compositional generalization in multilingual semantic parsing (SP) hinders KBQA in languages other than English. Furthermore, progress in both SP and KB necessitates that benchmarks can be reused and adapted for future methods.

Wikidata is a multilingual KB, with entity and property labels in a multitude of languages. It has grown continuously over the years and is an important complement to Wikipedia. Much effort has been made to migrate Freebase data to Wikidata (Pellissier Tanon et al., 2016; Diefenbach et al., 2017; Hogan et al., 2021), but only in English. Investigating compositional generalization in cross-lingual SP requires a multilingual dataset, a gap we address in this work.

We leverage Wikidata and CFQ to create Multilingual Compositional Wikidata Questions (MCWQ), a new multilingual dataset of compositional questions grounded in Wikidata (see Figure 1 for an example). Beyond the original English, an Indo-European language using the Latin script, we create parallel datasets of questions in Hebrew, Kannada, and Chinese, which use different scripts and belong to different language families: Afroasiatic, Dravidian, and Sino-Tibetan, respectively. Our dataset includes questions in the four languages and their associated SPARQL queries.

Our contributions are:

• a method to automatically migrate a KBQA
dataset to another KB and extend it to diverse
languages and domains,

Transactions of the Association for Computational Linguistics, vol. 10, pp. 937–955, 2022. https://doi.org/10.1162/tacl_00499
Action Editor: Emily Pitler. Submission batch: 4/2022; Revision batch: 6/2022; Published 9/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Parsers trained on CFQ transform these questions into SPARQL queries, which can subsequently be executed against Freebase to answer the original questions (in this case, ‘‘Yes’’).

CFQ uses the Distribution-Based Compositionality Assessment (DBCA) method to generate multiple train-test splits with maximally divergent examples in terms of compounds, while maintaining a low divergence in terms of primitive elements (atoms). In these maximum compound divergence (MCD) splits, the test set is constrained to examples containing novel compounds, that is, new ways of composing the atoms seen during training. For measuring compositional generalization, named entities in the questions are anonymized so that models cannot simply learn the relationship between entities and properties. CFQ contains 239,357 English question-answer pairs, which encompass 49,320 question patterns and 34,921 SPARQL query patterns. Table 1 shows selected fields of an example in CFQ. In their experiments, Keysers et al. (2020) trained semantic parsers using several architectures on various train-test splits. They demonstrated a strong negative correlation between models' accuracy (correctness of the full generated SPARQL query) and compound divergence across a variety of system architectures—all models generalized poorly in the high-divergence settings, highlighting the need to improve compositional generalization in SP.

By the time CFQ was released, Freebase had already been shut down. On that account, to our knowledge, there is no existing SP dataset targeting compositional generalization that is grounded in a currently usable KB, which contains up-to-date information. We therefore migrate the dataset to such a KB, namely, Wikidata, in §3.

Moreover, only a few studies have evaluated semantic parsers' performance in a multilingual setting, due to the scarcity of multilingual KBQA datasets (Perevalov et al., 2022b). No comparable benchmark exists for languages other than English, and it is therefore not clear whether results are generalizable to other languages. Compositional generalization in typologically distant languages may pose completely different challenges, as these languages may have different ways to compose meaning (Evans and Levinson, 2009). We create such a multilingual dataset in §4, leveraging the multilinguality of Wikidata.

Figure 1: An example from the MCWQ dataset. The question in every language corresponds to the same Wikidata SPARQL query, which, upon execution, returns the answer (which is positive in this case).

• a benchmark for measuring compositional generalization in SP for KBQA over Wikidata in four typologically diverse languages,

• monolingual experiments with different SP architectures in each of the four languages, demonstrating similar within-language generalization, and

• zero-shot cross-lingual experiments using pretrained multilingual encoders, demonstrating that compositional generalization from English to the other languages fails.

Our code for generating the dataset and for the experiments, as well as the dataset itself and trained models, are publicly available at https://github.com/coastalcph/seq2sparql.

2 Limitations of CFQ

CFQ (Keysers et al., 2020) is a dataset for measuring compositional generalization in SP. It targets the task of parsing questions in English into SPARQL queries executable on the Freebase KB (Bollacker et al., 2008). CFQ contains questions as in Table 1, as well as the following English question (with entities surrounded by brackets):

‘‘Was [United Artists] founded by [Mr. Fix-it]’s star, founded by [D. W. Griffith], founded by [Mary Pickford], and founded by [The Star Boarder]’s star?’’



CFQ field                   Content

questionWithBrackets        Did [‘Murder’ Legendre]’s male actor marry [Lillian Lugosi]
questionPatternModEntities  Did M0 ’s male actor marry M2
questionWithMids            Did m.0h4y854 ’s male actor marry m.0hpnx3b
sparql                      SELECT count(*) WHERE { ?x0 ns:film.actor.film/ns:film.performance.character ns:m.0h4y854 . ?x0 ns:people.person.gender ns:m.05zppz . ?x0 ns:people.person.spouse_s/ns:fictional_universe.marriage_of_fictional_characters.spouses ns:m.0hpnx3b . FILTER ( ?x0 != ns:m.0hpnx3b ) }
sparqlPatternModEntities    SELECT count(*) WHERE { ?x0 ns:film.actor.film/ns:film.performance.character M0 . ?x0 ns:people.person.gender ns:m.05zppz . ?x0 ns:people.person.spouse_s/ns:fictional_universe.marriage_of_fictional_characters.spouses M2 . FILTER ( ?x0 != M2 ) }

Table 1: Selected fields in a CFQ entry. questionWithBrackets is the full English question with entities surrounded by brackets. questionPatternModEntities is the question with entities replaced by placeholders. In questionWithMids, the entity codes (Freebase machine IDs; MIDs) are given instead of their labels. sparql is the fully executable SPARQL query for the question, and in sparqlPatternModEntities the entity codes are replaced by placeholders.

3 Migration to Wikidata

Wikidata is widely accepted as the replacement for Freebase. It is actively maintained, represents knowledge in a multitude of languages and domains, and also supports SPARQL. Migrating Freebase queries to Wikidata, however, is not trivial, as there is no established full mapping between the KBs' properties and entities. An obvious alternative to migration would be a replication of the original CFQ generation process but with Wikidata as the KB. Before delving into the details of the migration process, let us motivate the decision not to pursue that option: The grammar used to generate CFQ was not made available by Keysers et al. (2020) and is prohibitively complex to reverse-engineer. Our migration process, on the other hand, is general and can similarly be applied to migrating other datasets from Freebase to Wikidata. Finally, many competitive models with specialized architectures have been developed for CFQ (Guo et al., 2020; Herzig et al., 2021; Gai et al., 2021). Our migrated dataset is formally similar and facilitates their evaluation and the development of new methods.

3.1 Property Mapping

As can be seen in Table 1, the WHERE clause in a SPARQL query consists of a list of triples, where the second element in each triple is the property (e.g., ns:people.person.gender). CFQ uses 51 unique properties in its SPARQL queries, mostly belonging to the cinematography domain. These Freebase properties cannot be applied directly to Wikidata, which uses different property codes known as P-codes (e.g., P21). We therefore need to map the Freebase properties to Wikidata properties.

As a first step in the migration process, we check which Freebase properties used in CFQ have corresponding Wikidata properties. Using a publicly available repository providing a partial mapping between the KBs,1 we identify that 22 out of 51 Freebase properties in CFQ can be directly mapped to Wikidata properties.2 The other 29 require further processing:

Fourteen properties are the reverse of other properties, which do not have Wikidata counterparts. For example, ns:film.director.film is the reverse of ns:film.film.directed_by, and only the latter has a Wikidata mapping, P57. We resolve the problem by swapping the entities around the property.

Another 15 properties deal with judging whether an entity has a certain quality. In CFQ, ?x1 a ns:film.director asks whether ?x1 is a director. Wikidata does not contain such unary properties. Therefore, we need to treat these

1https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping.

2While some Freebase properties have multiple cor-
responding Wikidata properties, we consider a property
mappable as long as it has at least one mapping.



CFQ properties as entities in Wikidata. For example, director is wd:Q2526255, so we paraphrase the query as ?x1 wdt:P106 wd:Q2526255, asking whether ?x1's occupation (P106) is director. In addition, we substitute the art director property from CFQ with the composer property because the former has no equivalent in Wikidata. Finally, we filter out queries with reverse marks over properties, for example, ?x0 ˆns:people.person.gender M0, due to incompatibility with the question generation process (§3.2).

After filtering, we remain with 236,304 entries with only fully-mappable properties—98.7% of all entries in CFQ. We additionally make the necessary SPARQL syntax modifications for Wikidata.3

3.2 Entity Substitution

A large number of entities in Freebase are absent from Wikidata. For example, neither of the entities in Table 1 exists in Wikidata. Moreover, unlike the case of properties, to our knowledge, there is no comprehensive or even partial mapping of Freebase entity IDs (i.e., Freebase machine IDs, MIDs, such as ns:m.05zppz) to Wikidata entity IDs (i.e., Q-codes, such as wd:Q6581097). We replicate the grounding process carried out by Keysers et al. (2020), substituting entity placeholders with compatible entity codes by executing the queries against Wikidata:

1. Replacing entity placeholders with SPARQL variables (e.g., ?v0), we obtain queries that return sets of compatible candidate entity assignments instead of simply an answer for a given assignment of entities.

2. We add constraints for the entities to be distinct, to avoid nonsensical redundancies (e.g., due to conjunction of identical clauses).

3. Special entities, representing nationalities and genders, are regarded as part of the question patterns in CFQ (and are not replaced with placeholders). Before running the queries, we thus replace all such entities with corresponding Wikidata Q-codes (instead of variables).

4. We execute the queries against the Wikidata query service4 to get the satisfying assignments of entity combinations, with which we replace the placeholders in the sparqlPatternModEntities fields.

5. Finally, we insert the Q-codes into the English questions in the questionWithMids field and the corresponding entity labels into the questionWithBrackets to obtain the English questions for our dataset.

Along this process, 52.5% of the queries have at least one satisfying assignment. The resulting question-query pairs constitute our English dataset. They maintain the SPARQL patterns of CFQ, but the queries are all executable on Wikidata.
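Steps 1 and 2 above can be sketched as a small transformation over a query's WHERE-clause body. This is a simplified illustration, not the authors' implementation: it assumes the input is the triple-pattern string with M0, M1, ... placeholders, and emits a SELECT query that asks the endpoint for one distinct entity assignment.

```python
import re

def grounding_query(pattern_body: str) -> str:
    """Turn a SPARQL WHERE-clause body with entity placeholders (M0, M1, ...)
    into a query that searches Wikidata for a satisfying entity assignment:
    placeholders become variables ?v0, ?v1, ..., pairwise FILTERs keep the
    assignment entities distinct, and LIMIT 1 requests a single assignment.
    """
    placeholders = sorted(set(re.findall(r"\bM(\d+)\b", pattern_body)), key=int)
    body = pattern_body
    for i in placeholders:
        body = re.sub(rf"\bM{i}\b", f"?v{i}", body)
    select = "SELECT " + " ".join(f"?v{i}" for i in placeholders)
    distinct = " ".join(
        f"FILTER ( ?v{a} != ?v{b} )"
        for k, a in enumerate(placeholders)
        for b in placeholders[k + 1:]
    )
    return f"{select} WHERE {{ {body} {distinct} }} LIMIT 1"
```

The resulting query would then be sent to the Wikidata query service (step 4), and the returned bindings substituted back into the pattern (step 5).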

We obtain 124,187 question-query pairs, of which 67,523 are yes/no questions and 56,664 are wh- questions. The expected responses of the yes/no questions in this set are all ‘‘yes’’ due to our entity assignment process. To make MCWQ comparable to CFQ, which has both positive and negative answers, we sample alternative queries by replacing entities with ones from other queries whose preceding predicates are the same. Our negative sampling results in 30,418 questions with ‘‘no’’ answers.
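The negative sampling step can be sketched as below. This is a simplified illustration under stated assumptions: each entry is reduced to a (predicate, entity) pair standing in for a full query, and in the real pipeline the sampled queries' ‘‘no’’ answers would still have to be confirmed by executing them against Wikidata.

```python
import random
from collections import defaultdict

def sample_negatives(entries, seed=0):
    """For each (predicate, entity) pair, swap in an entity drawn from another
    query that uses the same preceding predicate, so the resulting query is
    expected to return ''no''. Entries without an alternative entity for
    their predicate are skipped.
    """
    rng = random.Random(seed)
    by_pred = defaultdict(set)
    for pred, ent in entries:
        by_pred[pred].add(ent)
    negatives = []
    for pred, ent in entries:
        candidates = sorted(by_pred[pred] - {ent})
        if candidates:  # only if another entity shares this predicate
            negatives.append((pred, rng.choice(candidates)))
    return negatives
```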

3.3 Migration Example

Consider the SPARQL pattern from Table 1:

SELECT count(*) WHERE {
  ?x0 ns:film.actor.film/ns:film.performance.character M0 .
  ?x0 ns:people.person.gender ns:m.05zppz .
  ?x0 ns:people.person.spouse_s/ns:fictional_universe.marriage_of_fictional_characters.spouses M2 .
  FILTER ( ?x0 != M2 ) }

We replace the properties and special entities (here the gender male: ns:m.05zppz → wd:Q6581097):

SELECT count(*) WHERE {
  ?x0 wdt:P453 M0 .
  ?x0 wdt:P21 wd:Q6581097 .
  ?x0 wdt:P26 M2 .
  FILTER ( ?x0 != M2 ) }

Then we replace the placeholders (e.g., M0) with variables and add constraints for getting only one assignment (which is enough for our purposes) with distinct entities. The resulting query is:

3CFQ uses SELECT count(*) WHERE to query yes/no questions, but this syntax is not supported by Wikidata. We replace it with ASK WHERE, intended for Boolean queries.

SELECT ?v0 ?v1 WHERE {
  ?x0 wdt:P453 ?v0 .
  ?x0 wdt:P21 wd:Q6581097 .
  ?x0 wdt:P26 ?v1 .
  FILTER ( ?x0 != ?v1 ) .
  FILTER ( ?v0 != ?v1 ) }
LIMIT 1

4https://query.wikidata.org/.




We execute the query and get wd:Q50807639 (Lohengrin) and wd:Q1560129 (Margarete Joswig) as satisfying answers for v0 and v1, respectively. Note that these are different from the entities in the original question (‘Murder’ Legendre and Lillian Lugosi)—in general, there is no guarantee that the same entities from CFQ will be preserved in our dataset. Then we put these answers back into the query, and make the necessary SPARQL syntax modifications for Wikidata. The final query for this entry is:

ASK WHERE {
  ?x0 wdt:P453 wd:Q50807639 .
  ?x0 wdt:P21 wd:Q6581097 .
  ?x0 wdt:P26 wd:Q1560129 .
  FILTER ( ?x0 != wd:Q1560129 ) }

As for the English question, we map the Freebase entities in the questionWithMids field to the labels of the obtained Wikidata entities. Therefore, the English question resulting from this process is:

Did [Lohengrin] ’s male actor marry [Margarete Joswig]?

Figure 2: Complexity distribution of the MCD1 split of CFQ (above) and MCWQ (below).

3.4 Dataset Statistics

We compare the statistics of MCWQ with CFQ in Table 3. MCWQ has 29,312 unique question patterns (mod entities, verbs, etc.), that is, 23.6% of questions cover all question patterns, compared to 20.6% in CFQ. Moreover, MCWQ has 86,353 unique query patterns (mod entities), resulting in 69.5% of instances covering all SPARQL patterns, 18% higher than in CFQ. Our dataset thus poses a greater challenge for compositional SP, and exhibits less redundancy in terms of duplicate query patterns. It is worth noting that the lower percentage of unique queries in MCWQ compared to CFQ results from the loss incurred when swapping entities in §3.1.

To be compositionally challenging, Keysers et al. (2020) generated the MCD splits to have high compound divergence while maintaining low atom divergence. As atoms in MCWQ are mapped from CFQ while leaving the compositional structure intact, we derive train-test splits of our dataset by inducing the train-test splits from CFQ on the corresponding subset of instances in our dataset.
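The split-induction step above amounts to intersecting each part of a CFQ split with the set of instances that survived migration. A minimal sketch (the representation of splits as index lists is an assumption, not the authors' data format):

```python
def induce_split(cfq_split, migrated_ids):
    """Derive an MCWQ train/dev/test split from a CFQ split by keeping, in
    each part, only the example indices that survived migration. Relative
    order within each part is preserved.
    """
    keep = set(migrated_ids)
    return {part: [i for i in ids if i in keep]
            for part, ids in cfq_split.items()}
```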

The complexity of questions in CFQ is measured by recursion depth and reflects the number of rule applications used to generate a question, which encompasses grammar, knowledge, inference, and resolution rules. While each question's complexity in MCWQ is the same as the corresponding CFQ question's, some questions cannot be migrated (see §3.1 and §3.2). To verify that the compound divergence is not affected, we compare the question complexity distribution of the two datasets in one of the three compositional splits (MCD1) in Figure 2. The training, development, and test sets of the split in CFQ and MCWQ follow a similar trend in general. The fluctuation in the complexity of questions in the MCWQ splits reflects the dataset's full distribution—see Figure 3.

Stemming from its entities and properties, CFQ questions are limited to the domain of movies. The entities in MCWQ, however, can in principle come from any domain, owing to our flexible entity replacement method. Though MCWQ's properties are still a subset of those used in CFQ, they are primarily in the movies domain. We also observe a few questions from literature, politics, and history in MCWQ.



Table 2: The MCWQ example from Figure 1. The English question is generated from the CFQ entry in Table 1 by the migration process described in §3.3, and the questions in the other languages are automatically translated (§4.1). The questionWithBrackets, questionPatternModEntities, sparql, and sparqlPatternModEntities fields are analogous to the CFQ ones. recursionDepth (which quantifies the question complexity) and expectedResponse (which is the answer returned upon execution of the query) are copied from the CFQ entry.

                      CFQ                MCWQ

Unique questions      239,357            124,187
Question patterns     49,320 (20.6%)     29,312 (23.6%)
Unique queries        228,149 (95.3%)    101,856 (82%)
Query patterns        123,262 (51.5%)    86,353 (69.5%)

Yes/no questions      130,571 (54.6%)    67,523 (54.4%)
Wh- questions         108,786 (45.5%)    56,664 (45.6%)

Table 3: Dataset statistics comparison for MCWQ and CFQ. Percentages are relative to all unique questions. Question patterns refer to mod entities, verbs, etc., while query patterns refer to mod entities only.

4.1 Generating Translations

Both question patterns and bracketed questions are translated separately with Google Cloud Translation5 from English.6 SPARQL queries remain unchanged, as both property and entity IDs are language-independent in Wikidata, which contains labels in different languages for each. Table 2 shows an example of a question in our dataset (which is generated from the same question as the CFQ instance from Table 1), as well as the resulting translations.

5https://cloud.google.com/translate.
6We attempted to translate bracketed questions and subsequently replace the bracketed entities with placeholders, as for question patterns. In preliminary experiments, we found that separate translation of question patterns is of higher translation quality. Therefore, we choose to translate question patterns and bracketed questions individually.

Figure 3: Complexity distribution of MCWQ, measured by recursion depth, compared to CFQ.

4 Generating Multilingual Questions

To create a typologically diverse dataset, starting from our English dataset (an Indo-European language using the Latin script), we use machine translation into three other languages from different families (Afroasiatic, Dravidian, and Sino-Tibetan), which use different scripts: Hebrew, Kannada, and Chinese (§4.1). For a comparison to machine translation and a more realistic evaluation with regard to compositional SP, we manually translate a subset of the test sets of the three MCD splits (§4.2) and evaluate the machine translation quality (§4.3).



As an additional technical necessity, we add a question mark to the end of each question before translation (as the original dataset does not include question marks) and remove trailing question marks from the translated questions before including them in our dataset. We find this step to be essential for translation quality.
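This question-mark handling can be sketched as a thin wrapper around the MT call. translate_fn here is a placeholder standing in for an MT system such as the Google Cloud Translation API; it is not part of the authors' released code.

```python
def translate_question(question: str, translate_fn) -> str:
    """Append ''?'' before translating (the source questions lack one) and
    strip any trailing question mark from the MT output, as described above.
    Also strips the full-width mark that Chinese output may use (an
    assumption for illustration).
    """
    translated = translate_fn(question + "?")
    return translated.rstrip("?？")
```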

4.2 Gold Test Set

CFQ and other datasets for evaluating compositional generalization (Lake and Baroni, 2018; Kim and Linzen, 2020) are generated from grammars. However, it has not been investigated how well models trained on them generalize to human questions. As a step towards that goal, we evaluate whether models trained with automatically generated and translated questions can generalize to high-quality human-translated questions. For that purpose, we obtain the intersection of the test sets of the MCD splits (1,860 entries), and sample two translated yes/no questions and two wh- questions for each complexity level (if available). This sample, termed test-intersection-MT, has 155 entries in total. The authors (one native speaker for each language) manually translate the English questions into Hebrew, Kannada, and Chinese. We term the resulting dataset test-intersection-gold.

4.3 Translation Quality

We compute the BLEU (Papineni et al., 2002) scores of test-intersection-MT against test-intersection-gold using SacreBLEU (Post, 2018), resulting in 87.4, 76.6, and 82.8 for Hebrew, Kannada, and Chinese, respectively. This indicates the high quality of the machine translation outputs.
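For intuition about the metric, the following is a simplified corpus-level BLEU (one reference per hypothesis, uniform weights over n-grams up to 4, with brevity penalty). It is an illustrative sketch only; reported scores should come from SacreBLEU, whose tokenization and smoothing differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU in the 0-100 range."""
    hyp_len = ref_len = 0
    matches = [0] * max_n
    totals = [0] * max_n
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in totals or 0 in matches:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```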

In addition, one author for each language manually assesses translation quality for one sampled question from each complexity level from the full dataset (40 in total). We rate the translations on a scale of 1–5 for fluency and for meaning preservation, with 1 being poor and 5 being optimal. Despite occasional translation issues, mostly attributed to lexical choice or morphological agreement, we confirm that the translations are of high quality. Across languages, more than 80% of examples score 3 or higher in fluency and meaning preservation. The average meaning preservation scores for Hebrew, Kannada, and Chinese are 4.4, 3.9, and 4.0, respectively. For fluency, they are 3.6, 3.9, and 4.4, respectively.

As a control, one of the authors (a native English speaker) evaluated English fluency for the same sample of 40 questions. Only 62% of patterns were rated 3 or above. While all English questions are grammatical, many suffer from poor fluency, tracing back to their automatic generation using rules. Some translations are rated higher in terms of fluency, mainly due to annotator leniency (focusing on disfluencies that might result from translation) and paraphrasing of unnatural constructions by the MT system (especially at lower complexities).

5 Experiments

While specialized architectures have achieved state-of-the-art results on CFQ (Guo et al., 2020, 2021; Gai et al., 2021), these approaches are English- or Freebase-specific. We therefore experiment with sequence-to-sequence (seq2seq) models, among which T5 (Raffel et al., 2020) has been shown to perform best on CFQ (Herzig et al., 2021). We evaluate these models for each language separately (§5.1), and subsequently evaluate their cross-lingual compositional generalization (§5.2).

5.1 Monolingual Experiments

We evaluate six models' monolingual parsing performance on the three MCD splits and a random split of MCWQ. As done by Keysers et al. (2020), entities are masked during training, except those that are part of the question patterns (genders and nationalities).
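Entity masking of this kind can be sketched as a regex replacement over the bracketed questions. A minimal illustration, assuming bracketed entities are masked in order of appearance with contiguous M0, M1, ... indices (in CFQ/MCWQ the actual placeholder numbering need not be contiguous, e.g., M0 and M2), and that gender and nationality entities appear without brackets and are left untouched:

```python
import re

def mask_entities(question_with_brackets: str) -> str:
    """Replace each [bracketed entity] with a placeholder M0, M1, ..."""
    counter = iter(range(1000))
    return re.sub(r"\[[^\]]*\]",
                  lambda m: f"M{next(counter)}",
                  question_with_brackets)
```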

We experiment with two seq2seq architectures on MCWQ for each language, with the same hyperparameters tuned by Keysers et al. (2020) on the CFQ random split: LSTM (Hochreiter and Schmidhuber, 1997) with an attention mechanism (Bahdanau et al., 2015) and Evolved Transformer (So et al., 2019), both implemented using Tensor2Tensor (Vaswani et al., 2018). Separate models are trained and evaluated per language, with randomly initialized (not pretrained) encoders. We train a model for each of the three MCD splits plus a random split for each language. We also experiment with pretrained language models (PLMs), to assess whether the multilingual PLMs mBERT (Devlin et al., 2019) and mT5


Exact Match (%)    MCD1                    MCD2                    MCD3                    MCDmean                 Random
                   En    He    Kn    Zh    En    He    Kn    Zh    En    He    Kn    Zh    En    He    Kn    Zh    En    He    Kn    Zh

LSTM+Attention     38.2  29.3  27.1  26.1  6.3   5.6   9.9   7.5   13.6  11.5  15.7  15.1  19.4  15.5  17.6  16.2  96.6  80.8  88.7  86.8
E. Transformer     53.3  35    30.7  31    16.5  8.7   11.9  10.2  18.2  13    18.1  15.5  29.3  18.9  20.2  18.9  99    90.4  93.7  92.2
mBERT              49.5  38.7  34.4  35.6  13.4  11.4  12.3  15.1  17    18    18.1  19.4  26.6  22.7  21.6  23.4  98.7  91    95.1  93.3
T5-base+RIR        57.4  --    --    --    14.6  --    --    --    12.3  --    --    --    28.1  --    --    --    98.5  --    --    --
mT5-small+RIR      77.6  57.8  55    52.8  13    12.6  8.2   21.1  24.3  17.5  31.4  34.9  38.3  29.3  31.5  36.3  98.6  90    93.8  91.8
mT5-base+RIR       55.5  59.5  49.1  30.2  27.7  16.6  16.6  23    18.2  23.4  30.5  35.6  33.8  33.2  32.1  29.6  99.1  90.6  94.2  92.2

Table 4: Monolingual evaluation: Exact match accuracies on MCWQ. MCDmean is the mean accuracy of all three MCD splits. Random represents a random split of MCWQ; this is an upper bound on the performance, shown only for comparison. As SPARQL BLEU scores are highly correlated with accuracies in this experiment, we only show the latter here.

(Xue et al., 2020), are as effective for monolingual compositional generalization as an English-only PLM, using the Transformers library (Wolf et al., 2020).

For mBERT, we fine-tune a multi_cased_L-12_H-768_A-12 encoder and a randomly initialized decoder of the same architecture. We train for 100 epochs with a patience of 25, batch size 128, and learning rate of 5 × 10−5 with linear decay.

For T5, we fine-tune T5-base on MCWQ English, and mT5-small and mT5-base on each language separately. We use the default hyperparameter settings except for trying two learning rates, 5e−4 and 3e−5 (see results below). SPARQL queries are pre-processed using reversible intermediate representations (RIR), previously shown (Herzig et al., 2021) to facilitate compositional generalization for T5. We fine-tune all models for 50K steps.

We use six Titan RTX GPUs for training, with a batch size of 36 for T5-base, 24 for mT5-small, and 12 for mT5-base. We use two random seeds for T5-base. It takes 384 hours to finish a round of mT5-small experiments, 120 hours for T5-base, and 592 hours for mT5-base.

In addition to exact-match accuracy, we report the BLEU scores of the predictions computed with SacreBLEU, as a large portion of the generated queries is partially (but not fully) correct.

Results The results are shown in Table 4. While models generalize almost perfectly in the random split for all four languages, the MCD splits are much harder, with the highest mean accuracies of 38.3%, 33.2%, 32.1%, and 36.3% for English, Hebrew, Kannada, and Chinese, respectively. For comparison, on CFQ, T5-base+RIR has an accuracy of 60.8% on MCDmean (Herzig et al., 2021). One reason for this decrease in performance is the smaller training data: The MCWQ dataset is 52.5% of the size of CFQ. Moreover, MCWQ has less redundancy than CFQ in terms of duplicate questions and SPARQL patterns, rendering models' potential strategy of simply memorizing patterns less effective.

Contrary to expectation, mT5-base does not outperform mT5-small. During training, we found that mT5-base reached a minimum loss early (after 1k steps). By changing the learning rate from the default 3e−5 to 5e−4, we seem to have overcome the local minimum. Training mT5-small with learning rate 5e−4 also yields better performance. Moreover, the batch size we use for mT5-base may not be optimal, but we could not experiment with larger batch sizes due to resource limitations.

Comparing the performance across languages, mT5-base performs best on Hebrew and Kannada on average, while mT5-small has the best performance on English and Chinese. Due to resource limitations, we were not able to look deeper into the effect of hyperparameters or evaluate larger models. However, our experiments show that while multilingual compositional generalization is challenging for seq2seq semantic parsers, within-language generalization is comparable between languages. Nevertheless, English is always the easiest (at least marginally). A potential cause is that most semantic query languages were initially designed to represent and retrieve data stored in English databases, and thus have a bias towards English. After all, SPARQL syntax is closer to English than to Hebrew, Kannada, and Chinese. While translation errors might have an effect as well, we have seen in §4.3 that translation quality is high.

To investigate further, we plot the complexity
distribution of true predictions (exactly matching



                   MCDmean                 Random
                   En    He    Kn    Zh    En    He    Kn    Zh

SPARQL BLEU
mT5-small+RIR      87.5  53.8  53.2  59    99.9  60.4  59.9  63.8
mT5-base+RIR       86.4  46.4  46    52.7  99.9  63.2  63.5  70.6

Exact Match (%)
mT5-small+RIR      38.3  0.2   0.3   0.2   98.6  0.5   0.4   1.1
mT5-base+RIR       33.8  0.4   0.7   1.5   99.1  1.1   0.9   7.2

桌子 5: Mean BLEU scores and exact match ac-
curacies on the three MCD splits and on a random
split in zero-shot cross-lingual transfer experi-
ments on MCWQ. The gray text represents the
models’ monolingual performance on English,
given for reference (the exact match accuracies
are copied from Table 4). The black text indicates
the zero-shot cross-lingual transfer performances
on Hebrew, Kannada, and Chinese of a model
trained on English. While the scores for individ-
ual MCD splits are omitted for brevity, in all three
MCD splits, the accuracies are below 1% (除了
on MCD2 Chinese, 存在 4%).

the gold SPARQL) per language by the two best
systems in Figure 4. We witness a near-linear
performance decay from complexity level 19 on-
wards. We find that mT5-base is better than
mT5-small at lower complexity levels, despite
the latter's superior overall performance. Interest-
ingly, translated questions seem to make the parsers
generalize better at higher complexity, as shown
in the figure: for mT5-small, the three non-English
models successfully parse more questions in the
complexity range 46–50 than the English one, and
for mT5-base in the range 44–50. As discussed
in §4.3, machine-translated questions tend to have
higher fluency than the English questions; we
conjecture that this smoothing helps the parsers
understand and learn from higher-complexity
questions.

Figure 4: Two mT5 models' numbers of correct
predictions, summed over the three MCD splits in
monolingual experiments, plotted by complexity level.
Each line represents a language. While mT5-small
generalizes better overall, mT5-base is better at
lower complexities (which require less compositional
generalization).

5.2 Zero-shot Cross-lingual Parsing

Zero-shot cross-lingual SP has witnessed new
advances with the development of PLMs (Shao
et al., 2020; Sherborne and Lapata, 2022). Be-
cause translating datasets and training KBQA
systems is expensive, it is beneficial to leverage
multilingual PLMs, fine-tuned on English data,
for generating SPARQL queries over Wikidata
given natural language questions in different lan-
guages. While compositional generalization is
difficult even in a monolingual setting, it is inter-
esting to investigate whether multilingual PLMs
can transfer in cross-lingual SP over Wikidata.
Simple seq2seq T5/mT5 models perform reason-
ably well (> 30% accuracy) on monolingual SP
on some splits (see §5.1). We investigate whether
the learned multilingual representations of such
models enable compositional generalization even
without target-language training. We use mT5-
small+RIR and mT5-base+RIR, the two best
models trained and evaluated on English in the
previous experiments, to predict on the other
languages.

Results The results are shown in Table 5. Both
BLEU and exact match accuracy of the predicted
SPARQL queries drop drastically when the model
is evaluated on Hebrew, Kannada, and Chinese.
mT5-small+RIR achieves 38.3% accuracy on
MCDmean English, but less than 0.3% in zero-shot
parsing on the three non-English languages.

Even putting aside compositionality evaluation,
as seen in the random split, the exact match ac-
curacy in the zero-shot cross-lingual setting is


still low. The relatively high BLEU scores can
be attributed to the small overall vocabulary
used in SPARQL queries. Interestingly, although
mT5-base+RIR does not outperform mT5-small+RIR
on MCDmean English, it yields better performance
in the zero-shot setting: for Hebrew, Kannada,
and Chinese, its accuracies are 0.2%, 0.4%, and
1.3% higher, respectively. For mT5-base, Chinese
is slightly easier to parse in the zero-shot setting
than Hebrew and Kannada, outperforming them
by 1.1% and 0.8%, respectively.

To conclude, zero-shot cross-lingual transfer
from English to Hebrew, Kannada, and Chinese
fails to generate valid queries in MCWQ. A po-
tential cause of this unsuccessful transfer is that
the four languages in MCWQ belong to different
language families and have low linguistic sim-
ilarity. It remains to be investigated whether
such cross-lingual transfer is more effective
between related languages, such as from English
to German (Lin et al., 2019).

6 Analysis

6.1 Evaluation with Gold Translation

Most existing compositional generalization data-
sets focus on SP (Lake and Baroni, 2018; Kim
and Linzen, 2020; Keysers et al., 2020). These
datasets are composed either in an artificial lan-
guage or in English using grammar rules. With
test-intersection-gold, proposed in §4.2, we in-
vestigate whether models can generalize from
a synthetic, automatically translated dataset to a
manually translated one.

We use the monolingual models trained on the
three MCD splits to parse test-intersection-gold.
In Table 6, we present the mean BLEU scores and
exact match accuracies of the predicted SPARQL
queries. There is no substantial difference be-
tween the performances on the two intersection
sets, except for Kannada, which shows a 4% accu-
racy drop on average. These results indicate that
MCWQ has sufficiently high translation qual-
ity and that models trained on such synthetic
data can generalize to high-quality, manually
translated questions.
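For concreteness, the exact match metric used throughout can be sketched in a few lines. This is a simplified illustration (only whitespace is normalized), not the actual evaluation script used in our experiments, and the query pair is hypothetical:

```python
import re

def normalize(query: str) -> str:
    """Collapse runs of whitespace so that trivial formatting
    differences do not count as prediction errors."""
    return re.sub(r"\s+", " ", query.strip())

def exact_match_accuracy(predictions, references) -> float:
    """Share of predicted queries identical to the gold query
    after normalization."""
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(predictions)

# Hypothetical mini-batch: the second prediction swaps in the wrong property.
gold = ["SELECT ?x0 WHERE { ?x0 wdt:P57 M0 }",
        "SELECT ?x0 WHERE { ?x0 wdt:P57 M1 }"]
pred = ["SELECT ?x0  WHERE { ?x0 wdt:P57 M0 }",
        "SELECT ?x0 WHERE { ?x0 wdt:P58 M1 }"]
print(exact_match_accuracy(pred, gold))  # 0.5
```

Unlike BLEU, this metric gives no credit for partial overlap, which is why the two metrics diverge so sharply in the zero-shot setting.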

6.2 Categorizing Errors

In an empirical analysis, we categorize typical
prediction errors on test-intersection-gold and
test-intersection-MT into six types: missing prop-
厄蒂, extra property, wrong property (哪里的

               test-intersection-MT      test-intersection-gold
               En    He    Kn    Zh       He    Kn    Zh

SPARQL BLEU
mT5-small+RIR  86.1  82.5  78.9  85.1     81.8  77.7  86.0
mT5-base+RIR   85.5  83.7  81.8  83.2     83.8  80.9  83.8

Exact Match (%)
mT5-small+RIR  45.6  35.7  32.7  38.5     35.9  28.2  39.8
mT5-base+RIR   40.4  41.9  40.2  38.7     41.1  34.0  38.9

Table 6: Mean BLEU scores and exact match ac-
curacies of the monolingual models (§5.1) on
test-intersection-MT and test-intersection-gold.
The numbers are averaged over the predictions
of the monolingual models trained on the three
MCD splits. Overall, there is no substantial dif-
ference between the performances on the two
intersection sets, demonstrating the reliability of
evaluating on machine-translated data in this case.

Figure 5: Number of errors per category in differ-
ent SPARQL predictions on test-intersection-MT and
test-intersection-gold, averaged across monolingual
mT5-small+RIR models trained on the three MCD
splits. The total number of items in each test set is 155.

two property sets have the same number of prop-
erties, but the elements do not match), missing
entity, extra entity, and wrong entity (again, the same
number of entities but different entity sets). We
plot the mean number of errors per category, as
well as the number of predictions with multiple
errors, in Figure 5 for the monolingual mT5-small
models. Overall, model predictions tend to have
more missing properties and entities than extra
ones. The error types vary across languages, how-
ever. For example, on Hebrew, models make
more missing property/entity errors than on other
languages, while on Kannada they make more ex-
tra property/entity errors than on the others. About
70 out of 155 examples contain multiple
errors for all languages, with Kannada having
slightly more.
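This categorization can be sketched as follows, under the simplifying assumption that properties and entities are identifiable in the query string by their wdt: and wd: prefixes; the example queries are hypothetical:

```python
import re

def wikidata_ids(query: str, prefix: str) -> set:
    """Set of Wikidata IDs carrying the given prefix:
    'wdt' for properties, 'wd' for entities."""
    return set(re.findall(rf"\b{prefix}:(\w+)", query))

def error_categories(pred: str, gold: str) -> list:
    """Assign error categories by comparing the predicted and gold ID sets."""
    errors = []
    for prefix, kind in (("wdt", "property"), ("wd", "entity")):
        p, g = wikidata_ids(pred, prefix), wikidata_ids(gold, prefix)
        if p == g:
            continue
        if len(p) < len(g):
            errors.append(f"missing {kind}")
        elif len(p) > len(g):
            errors.append(f"extra {kind}")
        else:  # same number of IDs, but the sets differ
            errors.append(f"wrong {kind}")
    return errors

gold = "SELECT ?x0 WHERE { ?x0 wdt:P57 wd:Q100 . ?x0 wdt:P58 wd:Q100 }"
pred = "SELECT ?x0 WHERE { ?x0 wdt:P57 wd:Q100 }"
print(error_categories(pred, gold))  # ['missing property']
```

A prediction can trigger several categories at once (for example, both a missing property and a missing entity), which is how a single example comes to contain multiple errors.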


Figure 6: Number of errors per category in differ-
ent zero-shot cross-lingual SPARQL predictions on
test-intersection-MT, averaged across mT5-small+
RIR models trained on the three MCD splits in En-
glish. In addition, mean error counts on the English
set are given for comparison. The total number of
items in each test set is 155.

Comparing errors on test-intersection-gold and
test-intersection-MT, we find that missing properties
are more common in gold for all languages. For
Hebrew and Kannada, extra properties and enti-
ties are also more common in gold; for Chinese,
however, these and missing entities are less com-
mon in gold than in MT.

In Figure 6 we plot the error statistics for zero-
shot cross-lingual transfer using mT5-small
models. There are drastically more error occur-
rences: for both missing and extra properties and
entities, the numbers are about double those from
the monolingual experiments. The number of wrong
property/entity errors remains similar, due to the
difficulty of even predicting a set of the correct
size in this setting. For all three target languages,
nearly all predictions contain multiple errors. These
statistics indicate the variety and pervasiveness of
the errors.

6.3 Other Observations

We also find that, comparatively, parsers perform
well on short questions in all four languages.
This is expected, as the compositionality of these
questions is inherently low. In languages other
than English, the models perform well when the
translations are faithful. In cases where the trans-
lations are less faithful or fluent but the models
still generate correct queries, we hypothesize that
translation acts as a data regularizer, especially at
higher complexity, as demonstrated in Figure 4.

Among wrong entity errors, the most common
cause across languages is the shuffling of entity
placeholders. In the example shown in Figure 7,
the model generates M1 wdt:P57 M2 instead of
M0 wdt:P57 M2, which indicates incorrect
interpretation of the predicate-argument structure.

Figure 7: Example of an error reflecting incorrect
predicate-argument structure. wdt:P57 is director
and wdt:P58 is screenwriter. Incorrect triples are
shown in red and missed triples in blue.

7 Related Work

Compositional Generalization Compositional
generalization has witnessed great developments
in recent years. SCAN (Lake and Baroni, 2018),
a synthetic dataset consisting of natural language
and command pairs, is an early dataset designed
to systematically evaluate neural networks' gen-
eralization ability. CFQ and COGS are two more
realistic benchmarks following SCAN. Various
approaches have been developed to enhance com-
positional generalization, for example, using
hierarchical poset decoding (Guo et al., 2020),
combining relevant queries (Das et al., 2021),
using span representations (Herzig and Berant,
2021), and graph encoding (Gai et al., 2021).
Beyond pure language, the evaluation of
compositional generalization has been extended
to image captioning and situated language under-
standing (Nikolaus et al., 2019; Ruis et al., 2020).
Multilingual and cross-lingual compositional gen-
eralization is an important and challenging field
to which our paper aims to bring researchers'
attention.

Knowledge Base Question Answering Com-
pared to machine reading comprehension
(Rajpurkar et al., 2016; Joshi et al., 2017; Shao
et al., 2018; Dua et al., 2019; d'Hoffschmidt
et al., 2020), KBQA is less diverse in terms of
datasets. Datasets such as WebQuestions (Berant
et al., 2013), SimpleQuestions (Bordes et al., 2015),
ComplexWebQuestions (Talmor and Berant, 2018),
FreebaseQA (Jiang et al., 2019), GrailQA (Gu
et al., 2021), CFQ, and *CFQ (Tsarkov et al., 2021)
were built on Freebase, a now-discontinued
KB. SimpleQuestions2Wikidata (Diefenbach et al.,


2017) and ComplexSequentialQuestions (Saha
et al., 2018) are based on Wikidata, but, like most
others, they are monolingual English datasets.
Related to our work is RuBQ (Korablinov and
Braslavski, 2020; Rybin et al., 2021), an English-
Russian dataset for KBQA over Wikidata. 尽管
the dataset is bilingual, it uses crowdsourced ques-
tions and is not designed for compositionality
analysis. Recently, Thorne et al. (2021) proposed
WIKINLDB, a Wikidata-based English KBQA
dataset, focusing on scalability rather than compo-
sitionality. Other related datasets include QALM
(Kaffee et al., 2019), a dataset for multilingual
question answering over a set of different popular
knowledge graphs, intended to help determine the
multilinguality of those knowledge graphs. Simi-
larly, QALD-9 (Ngomo, 2018) and QALD-9-plus
(Perevalov et al., 2022A) support the develop-
ment of multilingual question answering systems,
tied to DBpedia and Wikidata, respectively. The
goal of both datasets is to expand QA systems to
more languages rather than improving composi-
tionality. KQA Pro (Cao et al., 2022), a work
concurrent with ours, is an English KBQA dataset
over Wikidata with a focus on compositional reasoning.
Wikidata has been leveraged across many NLP
tasks such as coreference resolution (Aralikatte
等人。, 2019), frame-semantic parsing (Sas et al.,
2020), entity linking (Kannan Ravi et al., 2021),
and named entity recognition (Nie et al., 2021).
As for KBQA, the full potential of Wikidata is
yet to be explored.

Multilingual and Cross-lingual Modeling
Benchmarks such as XGLUE (Liang et al., 2020)
and XTREME (Hu et al., 2020) focus on multi-
lingual classification and generation tasks. Cross-
lingual learning has been studied across multiple
fields, such as sentiment analysis (Abdalla and
Hirst, 2017), document classification (Dong and
de Melo, 2019), POS tagging (Kim et al., 2017),
and syntactic parsing (Rasooli and Collins, 2017).
In recent years, multilingual PLMs have become a
primary tool for extending NLP applications to
low-resource languages, as these models reduce
the need to train individual models for each
language, for which less data may be available.
Several studies have explored the limitations
of such models in terms of practical usability
for low-resource languages (Wu and Dredze,
2020), as well as the underlying elements
that make cross-lingual transfer learning viable
(Dufter and Schütze, 2020). Beyond these PLMs,
other works improve cross-lingual learning by
making particular changes to the encoder-decoder
architecture, such as adding adapters to attune
to specific information (Artetxe et al., 2020b;
Pfeiffer et al., 2020).

For cross-lingual SP, Sherborne and Lapata
(2022) explored zero-shot SP by aligning latent
representations. Zero-shot cross-lingual SP has
also been studied in dialogue modeling (Nicosia
et al., 2021). Yang et al. (2021) present augmenta-
tion methods for Discourse Representation Theory
(Liu et al., 2021b). Oepen et al. (2020) explore
cross-framework and cross-lingual SP for mean-
ing representations. To the best of our knowledge,
our work is the first to study cross-lingual
transfer learning in KBQA.

8 Limitations

MCWQ is based on CFQ, a dataset generated
from grammar rules, and hence inherits its unnatural-
ness in question-query pairs of high complexity.
Second, we use machine translation to make
MCWQ multilingual. Although this is the domi-
nant approach for generating multilingual datasets
(Ruder et al., 2021), and we have provided ev-
idence through human evaluation and comparative
experiments (§4.3 and §5.1) that MCWQ has
reasonable translation accuracy and fluency, ma-
chine translation nevertheless introduces sub-
standard translation artifacts (Artetxe et al.,
2020a). One alternative is to write rules for tem-
plate translation. The amount of work could
be reduced by referring to a recent work (Goodwin
et al., 2021) in which English rules are provided
for syntactic dependency parsing on CFQ's ques-
tion fields.

Furthermore, the assumption that an English KB
is a ''canonical'' conceptualization is unjustified,
as speakers of other languages may know and care
about other entities and relationships (Liu et al.,
2021a; Hershcovich et al., 2022a). Therefore,
future work must create multilingual SP datasets
by sourcing questions from native speakers rather
than translating them.

9 Conclusion

The field of KBQA has been saturated with
work on English, due to both the inherent chal-
lenges of translating datasets and the reliance on
English-only KBs. In this work, we presented a


                              MCWQ mT5-base+RIR
1. Model publicly available?  Yes
2. Time to train final model  592 hours
3. Time for all experiments   1315 hours
4. Energy consumption         2209.2 kWh
5. Location for computations  Denmark
6. Energy mix at location     191 gCO2eq/kWh
7. CO2eq for final model      189.96 kg
8. CO2eq for all experiments  421.96 kg

Table 7: Climate performance model card for
mT5-base+RIR fine-tuned on all splits and
languages.

method for migrating the existing CFQ dataset
to Wikidata and created a challenging multilin-
gual dataset, MCWQ, targeting compositional
generalization in multilingual and cross-lingual
SP. In our experiments, we observe that pre-
trained multilingual language models struggle to
transfer and generalize compositionally across
languages. Our dataset will facilitate building
robust multilingual semantic parsers by serving
as a benchmark for evaluation of cross-lingual
compositional generalization.

10 Environmental Impact

Following the climate-aware practice proposed
by Hershcovich et al. (2022b), we present a cli-
mate performance model card in Table 7. ''Time
to train final model'' is the sum over splits and
languages for mT5-base+RIR, while ''Time for
all experiments'' also includes the experiments
with the English-only T5-base+RIR across all
splits. Although the work does not have a direct pos-
itive environmental impact, a better understanding
of compositional generalization, resulting from
our work, will facilitate more efficient modeling
and therefore reduce emissions in the long term.

Acknowledgments

The authors thank Anders Søgaard and Miryam
de Lhoneux for their comments and suggestions,
as well as the TACL editors and several rounds of
reviewers for their constructive evaluation. This
project has received funding from the European
Union's Horizon 2020 research and innovation
programme under the Marie Skłodowska-Curie
grant agreement No. 801199 (Heather Lent).

References

Mohamed Abdalla and Graeme Hirst. 2017.
Cross-lingual sentiment analysis without (good)
翻译. In Proceedings of the Eighth In-
ternational Joint Conference on Natural Lan-
guage Processing (体积 1: Long Papers),
pages 506–515, Taipei, 台湾. Asian Federa-
tion of Natural Language Processing.

Rahul Aralikatte, Heather Lent, Ana Valeria
冈萨雷斯, Daniel Herschcovich, Chen Qiu,
Anders Sandholm, Michael Ringaard, 和
Anders Søgaard. 2019. Rewarding coreference
resolvers for being consistent with world knowl-
edge. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 1229–1235, Hong Kong,
China. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/D19-1118

Mikel Artetxe, Gorka Labaka, and Eneko
Agirre. 2020a. Translation artifacts in cross-
lingual transfer learning. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 7674–7684, Online. Association for Com-
putational Linguistics. https://doi.org/10
.18653/v1/2020.emnlp-main.618

Mikel Artetxe, Sebastian Ruder, and Dani
Yogatama. 2020乙. On the cross-lingual trans-
ferability of monolingual representations. 在
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 4623–4637, 在线的. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.421

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua
本吉奥. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Repre-
句子, ICLR 2015.

Jonathan Berant, Andrew Chou, Roy Frostig, 和
Percy Liang. 2013. Semantic parsing on Free-
base from question-answer pairs. In Proceedings
of the 2013 Conference on Empirical
Methods in Natural Language Processing,
pages 1533–1544, Seattle, Washington, USA.
Association for Computational Linguistics.


Kurt Bollacker, Colin Evans, Praveen Paritosh,
Tim Sturge, and Jamie Taylor. 2008. Freebase:
A collaboratively created graph database for
structuring human knowledge. In Proceedings
of the 2008 ACM SIGMOD International Con-
ference on Management of Data, SIGMOD '08,
pages 1247–1250, 纽约, 纽约, 美国. Asso-
ciation for Computing Machinery. https://
doi.org/10.1145/1376616.1376746

Antoine Bordes, Nicolas Usunier, Sumit Chopra,
and Jason Weston. 2015. Large-scale simple
question answering with memory networks.
arXiv 预印本 arXiv:1506.02075.

Shulin Cao, Jiaxin Shi, Liangming Pan, Lunyiu
Nie, Yutong Xiang, Lei Hou, Juanzi Li, Bin He,
and Hanwang Zhang. 2022. KQA pro: A dataset
with explicit compositional programs for com-
plex question answering over knowledge base.
In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 6101–6119,
都柏林, 爱尔兰. Association for Computational
语言学. https://doi.org/10.18653
/v1/2022.acl-long.422

Jianpeng Cheng and Mirella Lapata. 2018.
Weakly-supervised neural semantic parsing
with a generative ranker. In Proceedings of
the 22nd Conference on Computational Natural
Language Learning, pages 356–367, 布鲁塞尔,
比利时. Association for Computational Lin-
语言学. https://doi.org/10.18653/v1
/K18-1035

Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya
Godbole, Ethan Perez, Jay Yoon Lee, Lizhen
Tan, Lazaros Polymenakos,
and Andrew
麦卡勒姆. 2021. Case-based reasoning for nat-
ural language queries over knowledge bases.
In Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Pro-
cessing, pages 9594–9611, Online and Punta
Cana, Dominican Republic. Association for
Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, 和
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
抽动症: 人类语言技术, 体积 1
(Long and Short Papers), pages 4171–4186,

明尼阿波利斯, Minnesota. Association for Com-
putational Linguistics.

Martin d’Hoffschmidt, Wacim Belblidia, Quentin
Heinrich, Tom Brendlé, and Maxime Vidal.
2020. FQuAD: French question answering
dataset. In Findings of the Association for
Computational Linguistics: EMNLP 2020,
pages 1193–1208, 在线的. Association for Com-
putational Linguistics. https://doi.org/10
.18653/v1/2020.findings-emnlp.107

Dennis Diefenbach, Thomas Pellissier Tanon,
K. 辛格, 和P. Maret. 2017. Question an-
swering benchmarks for Wikidata. In Interna-
tional Semantic Web Conference.

Xin Dong and Gerard de Melo. 2019. A robust
self-learning framework for cross-lingual text
classification. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Lan-
guage Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 6306–6310,
Hong Kong, China. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-1658

Dheeru Dua, Yizhong Wang, Pradeep Dasigi,
Gabriel Stanovsky, Sameer Singh, and Matt
加德纳. 2019. DROP: A reading comprehen-
sion benchmark requiring discrete reasoning
over paragraphs. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
抽动症: 人类语言技术, 体积 1
(Long and Short Papers), pages 2368–2378,
明尼阿波利斯, Minnesota. Association for Com-
putational Linguistics.

Philipp Dufter and Hinrich Schütze. 2020. Iden-
tifying elements essential for BERT's multilin-
guality. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 4423–4437, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.emnlp-main.358

Nicholas Evans and Stephen C. Levinson.
2009. The myth of language universals: 兰-
guage diversity and its importance for cogni-
tive science. Behavioral and Brain Sciences,
32(5):429–448. https://doi.org/10.1017
/S0140525X0999094X


Yu Gai, Paras Jain, Wendi Zhang, Joseph
Gonzalez, Dawn Song, and Ion Stoica. 2021.
Grounded graph decoding improves composi-
tional generalization in question answering. In
Grounded graph decoding improves composi-
tional generalization in question answering. 在
Findings of the Association for Computational
语言学: EMNLP 2021, pages 1829–1838,
Punta Cana, Dominican Republic. 协会
for Computational Linguistics. https://土井
.org/10.18653/v1/2021.findings
-emnlp.157

Emily Goodwin, Siva Reddy, Timothy J.
O’Donnell, and Dzmitry Bahdanau. 2021. 康姆-
positional generalization in dependency parsing.

Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler,
Percy Liang, Xifeng Yan, and Yu Su. 2021.
Beyond iid: Three levels of generalization
for question answering on knowledge bases.
In Proceedings of the Web Conference 2021,
pages 3477–3488. https://doi.org/10
.1145/3442381.3449992

Yinuo Guo, Zeqi Lin, Jian-Guang Lou, and
Dongmei Zhang. 2020. Hierarchical poset
decoding for compositional generalization in
language. Advances in Neural Information
Processing Systems, 33:6913–6924.

Yinuo Guo, Hualei Zhu, Zeqi Lin, Bei Chen,
Jian-Guang Lou, and Dongmei Zhang. 2021.
Revisiting iterative back-translation from the
perspective of compositional generalization. 在
AAAI’21.

丹尼尔·赫什科维奇, Stella Frank, Heather
Lent, Miryam de Lhoneux, Mostafa Abdou,
Stephanie Brandl, Emanuele Bugliarello, Laura
Cabello Piqueras, Ilias Chalkidis, Ruixiang
Cui, Constanza Fierro, Katerina Margatina,
Phillip Rust, and Anders Søgaard. 2022A. Chal-
lenges and strategies in cross-cultural NLP. In
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 6997–7013,
都柏林, 爱尔兰. Association for Computational
语言学. https://doi.org/10.18653
/v1/2022.acl-long.482

丹尼尔·赫什科维奇, Nicolas Webersinke,
Mathias Kraus, Julia Anna Bingler, and Markus
Leippold. 2022乙. Towards climate awareness
in NLP research. arXiv 预印本 arXiv:2205.
05071.

Jonathan Herzig and Jonathan Berant. 2017. 新-
ral semantic parsing over multiple knowledge-

bases. In Proceedings of the 55th Annual
Meeting of the Association for Computa-
tional Linguistics (Volume 2: Short Papers),
pages 623–628, Vancouver, Canada. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P17-2098

Jonathan Herzig and Jonathan Berant. 2021.
Span-based semantic parsing for compositional
generalization. In Proceedings of the 59th An-
nual Meeting of the Association for Computa-
tional Linguistics and the 11th International
Joint Conference on Natural Language Process-
英 (体积 1: Long Papers), pages 908–921,
在线的. Association for Computational Lin-
语言学. https://doi.org/10.18653/v1
/2021.acl-long.74

Jonathan Herzig, Peter Shaw, Ming-Wei Chang,
Kelvin Guu, Panupong Pasupat, and Yuan
张. 2021. Unlocking compositional gen-
eralization in pre-trained models using inter-
mediate representations. arXiv 预印本 arXiv:
2104.07478.

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780. https://doi.org/10.1162
/neco.1997.9.8.1735

Aidan Hogan, Eva Blomqvist, Michael Cochez,
Claudia d’Amato, Gerard de Melo, Claudio
Gutierrez, José Emilio Labra Gayo, S. Kirrane,
Sebastian Neumaier, Axel Polleres, 右. Navigli,
Axel-Cyrille Ngonga Ngomo, Sabbir M. Rashid,
Anisa Rula, Lukas Schmelzeisen, Juan Sequeda,
Steffen Staab, and Antoine Zimmermann. 2021.

Knowledge graphs. Communications of the
ACM, 64:96–104. https://doi.org/10
.1145/3418294

Junjie Hu, Sebastian Ruder, Aditya Siddhant,
Graham Neubig, Orhan Firat, and Melvin
约翰逊. 2020. XTREME: A massively
multilingual multi-task benchmark for eval-
uating cross-lingual generalisation. In Inter-
national Conference on Machine Learning,
pages 4411–4421. PMLR.

Kelvin Jiang, Dekun Wu, and Hui Jiang.
2019. FreebaseQA: A new factoid QA data
set matching trivia-style question-answer pairs
with Freebase. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
抽动症: 人类语言技术, 体积 1


(Long and Short Papers), pages 318–323,
明尼阿波利斯, Minnesota. Association for Com-
putational Linguistics.

Mandar Joshi, Eunsol Choi, Daniel Weld, 和
Luke Zettlemoyer. 2017. TriviaQA: A large
scale distantly supervised challenge dataset for
reading comprehension. In Proceedings of the
55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 1601–1611, Vancouver, Canada.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P17
-1147

Pratik Joshi, Sebastin Santy, Amar Budhiraja,
Kalika Bali, and Monojit Choudhury. 2020.
The state and fate of linguistic diversity and in-
clusion in the NLP world. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 6282–6293,
在线的. Association for Computational Lin-
语言学. https://doi.org/10.18653/v1
/2020.acl-main.560

Lucie-Aimée Kaffee, Kemele M. Endris, Elena
Simperl, and Maria-Esther Vidal. 2019. Rank-
ing knowledge graphs by capturing knowledge
about languages and labels. In Proceedings of
the 10th International Conference on Knowl-
edge Capture, K-CAP 2019, Marina Del Rey,
CA, 美国, November 19–21, 2019. ACM.
https://doi.org/10.1145/3360901
.3364443

Manoj Prabhakar Kannan Ravi, Kuldeep Singh,
Isaiah Onando Mulang’, Saeedeh Shekarpour,
Johannes Hoffart, and Jens Lehmann. 2021.
CHOLAN: A modular approach for neural
entity linking on Wikipedia and Wikidata.
In Proceedings of the 16th Conference of
the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 504–514, 在线的. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.eacl-main.40

Daniel Keysers, Nathanael Sch¨arli, Nathan Scales,
Hylke Buisman, Daniel Furrer, Sergii Kashubin,
Nikola Momchev, Danila Sinopalnikov, Lukasz
Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao
王, Marc van Zee, and Olivier Bousquet.
2020. Measuring compositional generaliza-
的: A comprehensive method on realistic

数据. In International Conference on Learning
Representations.

Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya,
and Eric Fosler-Lussier. 2017. Cross-lingual
transfer learning for POS tagging without cross-
lingual resources. In Proceedings of the 2017
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2832–2838,
Copenhagen, Denmark. Association for Com-
putational Linguistics.

Najoung Kim and Tal Linzen. 2020. COGS: A
compositional generalization challenge based
on semantic interpretation. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 9087–9105, Online. Association for
Computational Linguistics.

Vladislav Korablinov and Pavel Braslavski.
2020. RuBQ: A Russian dataset for question
answering over Wikidata. In International Se-
mantic Web Conference. https://doi.org
/10.1007/978-3-030-62466-8_7

Brenden M. Lake and Marco Baroni. 2018.
Generalization without systematicity: 上
compositional skills of sequence-to-sequence
recurrent networks. In Proceedings of the 35th
International Conference on Machine Learn-
英, ICML 2018, Stockholmsm¨assan, Stock-
holm, 瑞典, July 10–15, 2018, 体积 80 的
Proceedings of Machine Learning Research,
pages 2879–2888. PMLR.

Jens Lehmann, Robert Isele, Max Jakob, Anja
Jentzsch, Dimitris Kontokostas, Pablo N.
Mendes, Sebastian Hellmann, Mohamed
Morsey, Patrick van Kleef, Sören Auer, and
Christian Bizer. 2015. DBpedia – A large-
scale, multilingual knowledge base extracted
from Wikipedia. Semantic Web, 6(2):167–195.
https://doi.org/10.3233/SW-140134

Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu,
Fenfei Guo, Weizhen Qi, Ming Gong, Linjun
Shou, Daxin Jiang, Guihong Cao, Xiaodong
Fan, Ruofei Zhang, Rahul Agrawal, 爱德华
Cui, Sining Wei, Taroon Bharti, Ying Qiao,
Jiun-Hung Chen, Winnie Wu, Shuguang Liu,
Fan Yang, Daniel Campos, Rangan Majumder,
and Ming Zhou. 2020. XGLUE: A new bench-
mark dataset for cross-lingual pre-training,
understanding and generation. In Proceedings
of the 2020 Conference on Empirical Methods


Natural Language Processing (EMNLP), pages 6008–6018, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.484


Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.

Fangyu Liu, Emanuele Bugliarello, Edoardo
Maria Ponti, Siva Reddy, Nigel Collier, 和
Desmond Elliott. 2021A. Visually grounded
reasoning across languages and cultures. In Pro-
ceedings of the 2021 Conference on Empiri-
cal Methods in Natural Language Processing,
pages 10467–10485, Online and Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics.

Jiangming Liu, Shay B. 科恩, Mirella Lapata,
and Johan Bos. 2021乙. Universal Discourse
Representation Structure Parsing. Computa-
tional Linguistics, 47(2):445–476.

Ngonga Ngomo. 2018. 9th challenge on question answering over linked data (QALD-9). Language, 7(1):58–64.

Massimo Nicosia, Zhongdi Qu, and Yasemin Altun. 2021. Translate & Fill: Improving zero-shot multilingual semantic parsing with synthetic data. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3272–3284, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.279

Binling Nie, Ruixue Ding, Pengjun Xie, Fei Huang, Chen Qian, and Luo Si. 2021. Knowledge-aware named entity recognition with alleviating heterogeneity. In Proceedings of the AAAI Conference on Artificial Intelligence.

Mitja Nikolaus, Mostafa Abdou, Matthew Lamm, Rahul Aralikatte, and Desmond Elliott. 2019. Compositional generalization in image captioning. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 87–98, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/K19-1009

Stephan Oepen, Omri Abend, Lasha Abzianidze, Johan Bos, Jan Hajic, Daniel Hershcovich, Bin Li, Tim O'Gorman, Nianwen Xue, and Daniel Zeman. 2020. MRP 2020: The second shared task on cross-framework and cross-lingual meaning representation parsing. In Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 1–22, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.conll-shared.1

Inbar Oren, Jonathan Herzig, Nitish Gupta, Matt Gardner, and Jonathan Berant. 2020. Improving compositional generalization in semantic parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2482–2495, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.225

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135

Thomas Pellissier Tanon, Denny Vrandečić, Sebastian Schaffert, Thomas Steiner, and Lydia Pintscher. 2016. From Freebase to Wikidata: The great migration. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 1419–1428. https://doi.org/10.1145/2872427.2874809

Aleksandr Perevalov, Dennis Diefenbach, Ricardo Usbeck, and Andreas Both. 2022a. QALD-9-plus: A multilingual dataset for question answering over DBpedia and Wikidata translated by native speakers. In 2022 IEEE 16th International Conference on Semantic Computing (ICSC). IEEE. https://doi.org/10.1109/ICSC52841.2022.00045

Aleksandr Perevalov, Axel-Cyrille Ngonga
Ngomo, and Andreas Both. 2022乙. Enhancing
the accessibility of knowledge graph question

answering systems through multilingualization.
In 2022 IEEE 16th International Conference
on Semantic Computing (ICSC), pages 251–256.
https://doi.org/10.1109/ICSC52841
.2022.00048

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.617

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6319

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1264

Mohammad Sadegh Rasooli and Michael Collins.
2017. Cross-lingual syntactic transfer with
limited resources. Transactions of the Associa-
tion for Computational Linguistics, 5:279–293.
https://doi.org/10.1162/tacl_a_00061

Sebastian Ruder, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu, Junjie Hu, Dan Garrette, Graham Neubig, and Melvin Johnson. 2021. XTREME-R: Towards more challenging and nuanced multilingual evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10215–10245, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.802

Laura Ruis, Jacob Andreas, Marco Baroni, Diane
Bouchacourt, and Brenden M. Lake. 2020.
A benchmark for systematic generalization in
grounded language understanding. Advances
in Neural Information Processing Systems,
33:19861–19872.

Ivan Rybin, Vladislav Korablinov, Pavel Efimov, and Pavel Braslavski. 2021. RuBQ 2.0: An innovated Russian question answering dataset. In Eighteenth Extended Semantic Web Conference – Resources Track. https://doi.org/10.1007/978-3-030-77385-4_32

Amrita Saha, Vardaan Pahuja, Mitesh M. Khapra, Karthik Sankaranarayanan, and A. P. S. Chandar. 2018. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In AAAI. https://doi.org/10.1609/aaai.v32i1.11332

Cezar Sas, Meriem Beloucif, and Anders Søgaard. 2020. WikiBank: Using Wikidata to improve multilingual frame-semantic parsing. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4183–4189, Marseille, France. European Language Resources Association.

Bo Shao, Yeyun Gong, Weizhen Qi, Nan Duan, and Xiaola Lin. 2020. Multi-level alignment pretraining for multi-lingual semantic parsing. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3246–3256, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.289

Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying
Tseng, and Sam Tsai. 2018. DRCD: A Chinese
machine reading comprehension dataset.

Tao Shen, Xiubo Geng, Tao Qin, Daya Guo, Duyu Tang, Nan Duan, Guodong Long, and Daxin Jiang. 2019. Multi-task learning for conversational question answering over a large-scale knowledge base. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2442–2451, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1248


Ashish Vaswani, Samy Bengio, Eugene Brevdo,
Francois Chollet, Aidan N. Gomez, Stephan
Gouws, Llion Jones, Łukasz Kaiser, Nal
Kalchbrenner, Niki Parmar, Ryan Sepassi,
Noam Shazeer, and Jakob Uszkoreit. 2018.
Tensor2tensor for neural machine translation.
CoRR, abs/1803.07416.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Shijie Wu and Mark Dredze. 2020. Are all lan-
guages created equal in multilingual BERT? In
Proceedings of the 5th Workshop on Representa-
tion Learning for NLP, pages 120–130, 在线的.
计算语言学协会.

Linting Xue, Noah Constant, Adam Roberts,
Mihir Kale, Rami Al-Rfou, Aditya Siddhant,
Aditya Barua, and Colin Raffel. 2020. mT5:
A massively multilingual pre-trained text-to-
text transformer. arXiv 预印本 arXiv:2010.
11934.

Jingfeng Yang, Federico Fancellu, Bonnie Webber, and Diyi Yang. 2021. Frustratingly simple but surprisingly strong: Using language-independent features for zero-shot cross-lingual semantic parsing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5848–5856, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.472


Tom Sherborne and Mirella Lapata. 2022.
Zero-shot cross-lingual semantic parsing. 在
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 4134–4153,
Dublin, Ireland. Association for Computational
语言学. https://doi.org/10.18653
/v1/2022.acl-long.285

David So, Quoc Le, and Chen Liang. 2019. 这
evolved transformer. In Proceedings of the 36th
International Conference on Machine Learn-
ing, volume 97 of Proceedings of Machine
Learning Research, pages 5877–5886. PMLR.

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1059

James Thorne, Majid Yazdani, Marzieh Saeidi,
Fabrizio Silvestri, Sebastian Riedel, and Alon
Halevy. 2021. Database reasoning over text.
In Proceedings of the 59th Annual Meeting of
the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3091–3104, Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.acl-long.241

Dmitry Tsarkov, Tibor Tihon, Nathan Scales, Nikola Momchev, Danila Sinopalnikov, and Nathanael Schärli. 2021. *-CFQ: Analyzing the scalability of machine learning on a compositional task. In Proceedings of the AAAI Conference on Artificial Intelligence.
