Design Choices for Crowdsourcing Implicit Discourse Relations:
Revealing the Biases Introduced by Task Design

Valentina Pyatkin1  Frances Yung2  Merel C. J. Scholman2,3
Reut Tsarfaty1  Ido Dagan1  Vera Demberg2
1Bar-Ilan University, Ramat Gan, Israel
2Saarland University, Saarbrücken, Germany
3Utrecht University, Utrecht, the Netherlands

{pyatkin,tsarfaty}@biu.ac.il; dagan@cs.biu.ac.il
{frances,scholman,vera}@coli.uni-saarland.de

Abstract

Disagreement in natural language annotation has mostly been studied from a perspective of biases introduced by the annotators and the annotation frameworks. Here, we propose to analyze another source of bias—task design bias, which has a particularly strong impact on crowdsourced linguistic annotations where natural language is used to elicit the interpretation of lay annotators. For this purpose we look at implicit discourse relation annotation, a task that has repeatedly been shown to be difficult due to the relations' ambiguity. We compare the annotations of 1,200 discourse relations obtained using two distinct annotation tasks and quantify the biases of both methods across four different domains. Both methods are natural language annotation tasks designed for crowdsourcing. We show that the task design can push annotators towards certain relations and that some discourse relation senses can be better elicited with one or the other annotation approach. We also conclude that this type of bias should be taken into account when training and testing models.

1 Introduction

Crowdsourcing has become a popular method for data collection. It not only allows researchers to collect large amounts of annotated data in a shorter amount of time, but also captures human inference in natural language, which should be the goal of benchmark NLP tasks (Manning, 2006). In order to obtain reliable annotations, the crowdsourced labels are traditionally aggregated to a single label per item, using simple majority voting or annotation models that reduce noise from the data based on the disagreement among the annotators (Hovy et al., 2013; Passonneau and Carpenter, 2014). However, there is increasing consensus that disagreement in annotation cannot be generally discarded as noise in a range of NLP tasks, such as natural language inference (De Marneffe et al., 2012; Pavlick and Kwiatkowski, 2019; Chen et al., 2020; Nie et al., 2020), word sense disambiguation (Jurgens, 2013), question answering (Min et al., 2020; Ferracane et al., 2021), anaphora resolution (Poesio and Artstein, 2005; Poesio et al., 2006), sentiment analysis (Díaz et al., 2018; Cowen et al., 2019), and stance classification (Waseem, 2016; Luo et al., 2020). Label distributions have been proposed to replace categorical labels in order to represent the label ambiguity (Aroyo and Welty, 2013; Pavlick and Kwiatkowski, 2019; Uma et al., 2021; Dumitrache et al., 2021).

There are various reasons behind the ambiguity of linguistic annotations (Dumitrache, 2015; Jiang and de Marneffe, 2022). Aroyo and Welty (2013) summarize the sources of ambiguity into three categories: the text, the annotators, and the annotation scheme. In downstream NLP tasks, it would be helpful if models could detect possible alternative interpretations of ambiguous texts, or predict a distribution of interpretations by a population. In addition to the existing work on disagreement due to annotators' bias, the effect of annotation frameworks has also been studied, such as the discussion on whether entailment should include pragmatic inferences (Pavlick and Kwiatkowski, 2019), the effect of the granularity of the collected labels (Chung et al., 2019), or the system of labels that categorize the linguistic phenomena (Demberg et al., 2019). In this work, we examine the effect of task design bias, which is independent of the annotation framework, on the quality of crowdsourced annotations.

Specifically, we look at inter-sentential implicit discourse relation (DR) annotation, i.e., semantic or pragmatic relations between two adjacent sentences without a discourse connective to which the sense of the relation can be attributed. Figure 1 shows an example of an implicit relation that can be annotated as Conjunction or Result.

Figure 1: Example of two relational arguments (S1 and S2) and the DC and QA annotation in the middle.

Implicit DR annotation is arguably the hardest task in discourse parsing. Discourse coherence is a feature of the mental representation that readers form of a text, rather than of the linguistic material itself (Sanders et al., 1992). Discourse annotation thus relies on annotators' interpretation of a text. Further, relations can often be interpreted in various ways (Rohde et al., 2016), with multiple valid readings holding at the same time. These factors make discourse relation annotation, especially for implicit relations, a particularly difficult task. We collect 10 different annotations per DR, thereby focusing on distributional representations, which are more informative than categorical labels.

Since DR annotation labels are often abstract terms that are not easily understood by lay individuals, we focus on ''natural language'' task designs. Decomposing and simplifying an annotation task, where the DR labels can be obtained indirectly from the natural language annotations, has been shown to work well for crowdsourcing (Chang et al., 2016; Scholman and Demberg, 2017; Pyatkin et al., 2020). Crowdsourcing with natural language has become increasingly popular. This includes tasks such as NLI (Bowman et al., 2015), SRL (FitzGerald et al., 2018), and QA (Rajpurkar et al., 2018). This trend is further visible in modeling approaches that cast traditional structured prediction tasks into NL tasks, such as for co-reference (Aralikatte et al., 2021), discourse comprehension (Ko et al., 2021), or bridging anaphora (Hou, 2020; Elazar et al., 2022). It is therefore of interest to the broader research community to see how task design biases can arise, even when the tasks are more accessible to the lay public.

We examine two distinct natural language crowdsourcing discourse relation annotation tasks (Figure 1): Yung et al. (2019) derive relation labels from discourse connectives (DCs) that crowd workers insert; Pyatkin et al. (2020) derive labels from Question Answer (QA) pairs that crowd workers write. Both task designs employ natural language annotations instead of labels from a taxonomy. The two task designs, DC and QA, are used to annotate 1,200 implicit discourse relations in 4 different domains. This allows us to explore how the task design impacts the obtained annotations, as well as the biases that are inherent to each method. To do so we showcase the difference of various inter-annotator agreement metrics on annotations with distributional and aggregated labels.

We find that both methods have strengths and weaknesses in identifying certain types of relations. We further see that these biases are also affected by the domain. In a series of discourse relation classification experiments, we demonstrate the benefits of collecting annotations with mixed methodologies, we show that training with a soft loss with distributions as targets improves model performance, and we find that cross-task generalization is harder than cross-domain generalization.

The outline of the paper is as follows. We introduce the notion of task design bias and analyze its effect on crowdsourcing implicit DRs, using two different task designs (Section 3–4). Next, we quantify strengths and weaknesses of each method using the obtained annotations, and suggest ways to reduce task bias (Section 5). Then we look at genre-specific task bias (Section 6). Finally, we demonstrate the task bias effect on DR classification performance (Section 7).

2 Background

2.1 Annotation Biases

Annotation tends to be an inherently ambiguous task, often with multiple possible interpretations and without a single ground truth (Aroyo and Welty, 2013). An increasing amount of research has studied annotation disagreements and biases.

Prior studies have focused on how crowdworkers can be biased. Worker biases are subject to various factors, such as educational or cultural background, or other demographic characteristics. Prabhakaran et al. (2021) point out that for more subjective annotation tasks, the socio-demographic background of annotators contributes to multiple annotation perspectives and argue that label aggregation obfuscates such perspectives. Instead, soft labels are proposed, such as the ones provided by the CrowdTruth method (Dumitrache et al., 2018), which require multiple judgments to be collected per instance (Uma et al., 2021). Bowman and Dahl (2021) suggest that annotations that are subject to bias from methodological artifacts should not be included in benchmark datasets. In contrast, Basile et al. (2021) argue that all kinds of human disagreements should be predicted by NLU models and thus included in evaluation datasets.

In contrast to annotator bias, a limited amount of research is available on bias related to the formulation of the task. Jakobsen et al. (2022) show that argument annotations exhibit widely different levels of social group disparity depending on which guidelines the annotators followed. Similarly, Buechel and Hahn (2017a,b) study different design choices for crowdsourcing emotion annotations and show that the perspective that annotators are asked to take in the guidelines affects annotation quality and distribution. Jiang et al. (2017) study the effect of workflow for paraphrase collection and find that examples based on previous contributions prompt workers to produce more diverging paraphrases. Hube et al. (2019) show that biased subjective judgment annotations can be mitigated by asking workers to think about responses other workers might give and by making workers aware of their possible biases. Thus, the available research suggests that task design can affect the annotation output in various ways. Further research has studied the collection of multiple labels: Jurgens (2013) compares selection and scale rating and finds that workers would choose an additional label for a word sense labelling task. In contrast, Scholman and Demberg (2017) find that workers usually opt not to provide an additional DR label even when allowed. Chung et al. (2019) compare various label collection methods including single/multiple labelling, ranking, and probability assignment. We focus on the biases in DR annotation approaches using the same set of labels, but translated into different ''natural language'' for crowdsourcing.

2.2 DR Annotation

Various frameworks exist that can be used to annotate discourse relations, such as RST (Mann and Thompson, 1988) and SDRT (Asher, 1993). In this work, we focus on the annotation of implicit discourse relations, following the framework used to annotate the Penn Discourse Treebank 3.0 (PDTB, Webber et al., 2019). PDTB's sense classification is structured as a three-level hierarchy, with four coarse-grained sense groups in the first level and more fine-grained senses for each of the next levels.1 The process is a combination of manual and automated annotation: An automated process identifies potential explicit connectives, and annotators then decide on whether the potential connective is indeed a true connective. If so, they specify one or more senses that hold between its arguments. If no connective or alternative lexicalization is present (i.e., for implicit relations), each annotator provides one or more connectives that together express the sense(s) they infer.

DR datasets, such as PDTB (Webber et al., 2019), RST-DT (Carlson and Marcu, 2001), and TED-MDB (Zeyrek et al., 2019), are commonly annotated by trained annotators, who are expected to be familiar with extensive guidelines written for a given task (Plank et al., 2014; Artstein and Poesio, 2008; Riezler, 2014). However, there have also been efforts to crowdsource discourse relation annotations (Kawahara et al., 2014; Kishimoto et al., 2018; Scholman and Demberg, 2017; Pyatkin et al., 2020). We investigate two crowdsourcing approaches that annotate inter-sentential implicit DRs and we deterministically map the NL annotations to the PDTB3 label framework.
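To make this deterministic mapping concrete, the following Python sketch resolves natural language annotations from either task to PDTB 3.0 senses through simple lookup tables. Only a handful of entries are shown; apart from the mappings that appear in this paper (in addition for CONJUNCTION, consequently for RESULT, and question prefixes such as What is similar to..? and What is an example of..?), the table contents and function names are illustrative assumptions, not the actual connective bank or prefix inventory.

```python
# Illustrative sketch of the deterministic NL-annotation -> PDTB3 sense mapping.
# Only a few entries are shown; the real connective bank and question-prefix
# inventory used by the DC and QA tasks are not reproduced here.

DC_CONNECTIVE_TO_SENSE = {
    "in addition": "conjunction",     # example given for Figure 1
    "consequently": "result",         # example given for Figure 1
    "on the contrary": "contrast",    # disambiguation option for "however"
    "despite this": "arg2-as-denier",
}

QA_PREFIX_TO_SENSE = {
    "What is similar to": "similarity",
    "What is an example of": "arg2-as-instance",  # arg1 vs. arg2 depends on
    "What is the result of": "result",            # which sentence is the question
}

def dc_label(second_step_connective: str) -> str:
    """Map the unambiguous connective chosen in the second DC step to a sense."""
    return DC_CONNECTIVE_TO_SENSE[second_step_connective.lower()]

def qa_label(question: str) -> str:
    """Map the prefix of a crowd-written question to a sense."""
    for prefix, sense in QA_PREFIX_TO_SENSE.items():
        if question.startswith(prefix):
            return sense
    raise ValueError(f"No known prefix for question: {question!r}")

print(dc_label("consequently"))                                       # result
print(qa_label("What is the result of Caesar being assassinated?"))   # result
```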

2.2.1 Crowdsourcing DRs with the DC Method

Yung et al. (2019) developed a crowdsourcing discourse relation annotation method using discourse connectives, referred to as the DC method. For every instance, participants first provide a connective that, in their view, best expresses the relation between the two arguments.

1We merge the belief and speech-act relation senses (which cannot be distinguished reliably by QA and DC) with their corresponding more general relation senses.


Note that the connective chosen by the participant might be ambiguous. Therefore, participants disambiguate the relation in a second step, by selecting a connective from a list that is generated dynamically based on the connective provided in the first step. When the first-step insertion does not match any entry in the connective bank (from which the list of disambiguating connectives is generated), participants are presented with a default list of twelve connectives expressing a variety of relations. Based on the connectives chosen in the two steps, the inferred relation sense can be extracted. For example, the CONJUNCTION reading in Figure 1 can be expressed by in addition, and the RESULT reading can be expressed by consequently.

The DC method was used to create a crowdsourced corpus of 6,505 discourse-annotated implicit relations, named DiscoGeM (Scholman et al., 2022a). A subset of DiscoGeM is used in the current study (see Section 3).

2.2.2 Crowdsourcing DRs by the QA Method

Pyatkin et al. (2020) proposed to crowdsource discourse relations using QA pairs. They collected a dataset of intra-sentential QA annotations which aim to represent discourse relations by including one of the propositions in the question and the other in the respective answer, with the question prefix (What is similar to..?, What is an example of..?) mapping to a relation sense. Their method was later extended to also work inter-sententially (Scholman et al., 2022). In this work we make use of the extended approach that relates two distinct sentences through a question and answer. The following QA pair, for example, connects the two sentences in Figure 1 with a RESULT relation.

(1) What is the result of Caesar being assassinated by a group of rebellious senators? (S1) – A new series of civil wars broke out [...] (S2)

The annotation process consists of the following steps: From two consecutive sentences, annotators are asked to choose a sentence that will be used to formulate a question. The other sentence functions as an answer to that question. Next they start building a question by choosing a question prefix and by completing the question with content from the chosen sentence.

Since it is possible to choose either of the two sentences as question/answer for a specific set of symmetric relations (i.e., What is the reason a new series of civil wars broke out?), we consider both possible formulations as equivalent.

The set of possible question prefixes covers all PDTB 3.0 senses (excluding belief and speech-act relations). The direction of the relation sense, e.g., arg1-as-denier vs. arg2-as-denier, is determined by which of the two sentences is chosen for the question/answer. While Pyatkin et al. (2020) allowed crowdworkers to form multiple QA pairs per instance, i.e., annotate more than one discourse sense per relation, we decided to limit the task to 1 sense per relation per worker. We took this decision in order for the QA method to be more comparable to the DC method, which also only allows the insertion of a single connective.

3 Method

3.1 Data

We annotated 1,200 inter-sentential discourse relations using both the DC and the QA task design.2 Of these 1,200 relations, 900 were taken from the DiscoGeM corpus and 300 from the PDTB 3.0.

DiscoGeM Relations  The 900 DiscoGeM instances that were included in the current study represent different domains: 296 instances were taken from the subset of DiscoGeM relations drawn from Europarl proceedings (written proceedings of prepared political speech taken from the Europarl corpus; Koehn, 2005), 304 instances were taken from the literature subset (narrative text from five English books),3 and 300 instances from the Wikipedia subset of DiscoGeM (informative text, taken from the summaries of 30 Wikipedia articles). These different genres enable a cross-genre comparison. This is necessary, given that the prevalence of certain relation types can differ across genres (Rehbein et al., 2016; Scholman et al., 2022a; Webber, 2009).

These 900 relations were already labeled using the DC method in DiscoGeM; we additionally collect labels using the QA method for the current study. In addition to crowd-sourced labels using the DC and QA methods, the Wikipedia subset was also annotated by three trained annotators.4

2The annotations are available at https://github.com/merelscholman/DiscoGeM.

3Animal Farm by George Orwell, Harry Potter and the Philosopher's Stone by J. K. Rowling, The Hitchhiker's Guide to the Galaxy by Douglas Adams, The Great Gatsby by F. Scott Fitzgerald, and The Hobbit by J. R. R. Tolkien.

4Instances were labeled by two annotators and verified by a third; Cohen's κ agreement between the first annotator and the reference label was .82 (88% agreement), and between the second annotator and the reference label was .96 (97% agreement). See Scholman et al. (2022a) for additional details.

Forty-seven percent of these Wikipedia instances were labeled with multiple senses by the expert annotators (i.e., were considered to be ambiguous or to express multiple readings).

PDTB Relations  The PDTB relations were included for the purpose of comparing our annotations with traditional PDTB gold standard annotations. These instances (all inter-sentential) were selected to represent all relational classes, randomly sampling at most 15 and at least 2 relation instances per class (for classes with fewer than 15 relation instances we sampled all existing relations). The reference labels for the PDTB instances consist of the original PDTB labels annotated as part of the PDTB3 corpus. Only 8% of these consisted of multiple senses.

3.2 Crowdworkers

Crowdworkers were recruited via Prolific using a selection approach (Scholman et al., 2022) that has been shown to result in a good trade-off between quality and time/monetary efforts for DR annotation. Crowdworkers had to meet the following requirements: be native English speakers, reside in the UK, Ireland, the USA, or Canada, and have obtained at least an undergraduate degree.

Workers who fulfilled these conditions could participate in an initial recruitment task, for which they were asked to annotate a text with either the DC or QA method and were shown immediate feedback on their performance. Workers with an accuracy ≥ 0.5 on this task were qualified to participate in further tasks. We hence created a unique set of crowdworkers for each method. The DC annotations (collected as part of DiscoGeM) were provided by a final set of 199 selected crowdworkers; QA had a final set of 43 selected crowdworkers.5 Quality was monitored throughout the production data collection and qualifications were adjusted according to performance.

Every instance was annotated by 10 workers per method. This number was chosen based on parity with previous research. For example, Snow et al. (2008) show that a sample of 10 crowdsourced annotations per instance yields satisfactory accuracy for various linguistic annotation tasks.

Scholman and Demberg (2017) found that assigning a new group of 10 annotators to annotate the same instances resulted in a near-perfect replication of the connective insertions in an earlier DC study.

Instances were annotated in batches of 20. For QA, one batch took about 20 minutes to complete, and for DC, 7 minutes. Workers were reimbursed about $2.50 and $1.88 per batch, respectively.

5The larger set of selected workers in the DC method is because more data was annotated by DC workers as part of the creation of DiscoGeM.

3.3 Inter-annotator Agreement

We evaluate the two DR annotation methods by the inter-annotator agreement (IAA) between the annotations collected by both methods, and by the IAA with reference annotations collected from trained annotators.

Cohen's kappa (Cohen, 1960) is a metric frequently used to measure IAA. For DR annotations, a Cohen's kappa of .7 is considered to reflect good IAA (Spooren and Degand, 2010). However, prior research has shown that agreement on implicit relations is more difficult to reach than on explicit relations: Kishimoto et al. (2018) report an F1 of .51 on crowdsourced annotations of implicits using a tagset with 7 level-2 labels; Zikánová et al. (2019) report κ = .47 (58%) on expert annotations of implicits using a tagset with 23 level-2 labels; and Demberg et al. (2019) find that PDTB and RST-DT annotators agree on the relation sense on 37% of implicit relations. Cohen's kappa is primarily used for comparison between single labels, and the IAAs reported in these studies are also based on single aggregated labels.

However, we also want to compare the obtained 10 annotations per instance with our reference labels, which also contain multiple labels. The comparison becomes less straightforward when there are multiple labels because the chance of agreement is inflated and partial agreement should be treated differently. We thus measure the IAA between multiple labels in terms of both full and partial agreement rates, as well as the multi-label kappa metric proposed by Marchal et al. (2022). This metric adjusts the multi-label agreements with bootstrapped expected agreement. We consider all the labels annotated by the crowdworkers in each instance, excluding minority labels with only one vote.6
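As a minimal sketch (assuming the vote counts per item are available as dictionaries), the full and partial agreement rates between the two methods can be computed as follows; the "more than one vote" filter follows footnote 6.

```python
# Sketch: full vs. partial agreement between the multi-label annotations of
# two methods (e.g., QA and DC). Input format is assumed: for each item,
# a dict mapping sense labels to the number of crowd votes (out of ~10).

def sub_labels(votes: dict[str, int], min_votes: int = 2) -> frozenset[str]:
    """Keep only labels supported by more than one vote (cf. footnote 6)."""
    return frozenset(lbl for lbl, n in votes.items() if n >= min_votes)

def agreement_rates(qa_items: list[dict[str, int]],
                    dc_items: list[dict[str, int]]) -> tuple[float, float]:
    full = partial = 0
    for qa_votes, dc_votes in zip(qa_items, dc_items):
        qa, dc = sub_labels(qa_votes), sub_labels(dc_votes)
        full += qa == dc          # all sub-labels match
        partial += bool(qa & dc)  # at least one sub-label matches
    n = len(qa_items)
    return full / n, partial / n

# toy example with two items
qa = [{"conjunction": 5, "result": 4, "precedence": 1},
      {"arg2-as-detail": 7, "conjunction": 3}]
dc = [{"result": 6, "conjunction": 3, "arg2-as-denier": 1},
      {"conjunction": 8, "result": 2}]
print(agreement_rates(qa, dc))  # (0.5, 1.0): one full match, both at least partial
```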

6We assumed there were 10 votes per item and removed labels with less than 20% of the votes, even though in rare cases there could be 9 or 11 votes. On average, the removed labels represent 24.8% of the votes per item.


                           Europarl    Novel       Wiki.       PDTB        All
Item counts                296         304         300         302         1202
QA sub-labels/item         2.13        2.21        2.26        2.45        2.21
DC sub-labels/item         2.37        2.00        2.09        2.21        2.17
full/+partial agreement    .051/.841   .092/.865   .060/.920   .050/.884   .063/.878
multi-label kappa          .813        .842        .903        .868        .857
JSD                        .505        .492        .482        .510        .497

Table 1: Comparison between the labels obtained by DC vs. QA. Full (or +partial) agreement means all (or at least one) sub-label matches. Multi-label kappa is adapted from Marchal et al. (2022). JSD is calculated based on the actual distributions of the crowdsourced sub-labels, excluding labels with only one vote (smaller values are better).

In addition, we compare the distributions of the crowdsourced labels using the Jensen-Shannon divergence (JSD) following existing reports (Erk and McCarthy, 2009; Nie et al., 2020; Zhang et al., 2021). Similarly, minority labels with only one vote are excluded. Since distributions are not available in the reference labels, when comparing with the reference labels we evaluate by the JSD based on the flattened distributions of the labels, which means we replace the original distribution of the votes with an even distribution over the labels that have been voted for by more than one annotator. We call this version JSDflat.
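A minimal sketch of the JSD computation and of the flattening used for JSDflat is given below; the choice of numpy and of a base-2 logarithm are assumptions of the sketch, not details taken from the paper.

```python
import numpy as np

def jsd(p: np.ndarray, q: np.ndarray, base: float = 2.0) -> float:
    """Jensen-Shannon divergence between two discrete label distributions."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * (np.log(a[mask] / b[mask]) / np.log(base)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def drop_minority(votes: np.ndarray) -> np.ndarray:
    """Actual vote distribution, excluding labels with only one vote."""
    kept = np.where(votes >= 2, votes, 0.0)
    return kept / kept.sum()

def flatten(votes: np.ndarray) -> np.ndarray:
    """JSDflat: uniform distribution over labels with more than one vote."""
    kept = (votes >= 2).astype(float)
    return kept / kept.sum()

# toy example over a fixed inventory of 4 senses
qa_votes  = np.array([5.0, 4.0, 1.0, 0.0])
dc_votes  = np.array([3.0, 6.0, 0.0, 1.0])
ref_votes = np.array([0.0, 3.0, 0.0, 2.0])   # hypothetical reference sub-labels
print(jsd(drop_minority(qa_votes), drop_minority(dc_votes)))  # crowd vs. crowd
print(jsd(flatten(qa_votes), flatten(ref_votes)))             # JSDflat vs. reference
```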

As a third perspective on IAA we report agreement among annotators on an item annotated with QA/DC. Following previous work (Nie et al., 2020), we use the entropy of the soft labels to quantify the uncertainty of the crowd annotation. Here, labels with only one vote are also included, as they contribute to the annotation uncertainty. When calculating the entropy, we use a logarithmic base of n = 29, where n is the number of possible labels. A lower entropy value suggests that the annotators agree with each other more and the annotated label is more certain. As discussed in Section 1, the source of disagreement in annotations could come from the items, the annotators, and the methodology. High entropy across multiple annotations of a specific item within the same annotation task suggests that the item is ambiguous.
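This normalized entropy can be sketched as follows; with a logarithmic base equal to the number of possible labels (n = 29), values fall between 0 and 1, and single-vote labels are kept.

```python
import math

def annotation_entropy(votes: dict[str, int], n_labels: int = 29) -> float:
    """Entropy of one item's vote distribution, with a logarithmic base equal
    to the number of possible labels so the maximum value is 1."""
    total = sum(votes.values())
    probs = [v / total for v in votes.values() if v > 0]
    return -sum(p * math.log(p, n_labels) for p in probs)

print(annotation_entropy({"result": 4, "conjunction": 3,
                          "succession": 2, "arg1-as-detail": 1}))  # ≈ 0.38
```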

4 Results

We first compare the IAA between the two crowdsourced annotations, then we discuss the IAA between DC/QA and the reference annotations, and lastly we perform an analysis based on annotation uncertainty. Here, the ''sub-labels'' of an instance means all relations that have received more than one annotation; and ''label distribution'' is the distribution of the votes of the sub-labels.

4.1 IAA Between the Methods

Table 1 shows that both methods yield more than two sub-labels per instance after excluding minority labels with only one vote. This supports the idea that multi-sense annotations better capture the fact that often more than one sense can hold implicitly between two discourse arguments.

Table 1 also presents the IAA between the labels crowdsourced with QA and DC per domain. The agreement between the two methods is good: the labels assigned by the two methods (or at least one of the sub-labels in case of a multi-label annotation) match for about 88% of the items. This speaks for the fact that both methods are valid, as similar sets of labels are produced.

The full agreement scores, however, are very low. This is expected, as the chance to match on all sub-labels is also very low compared to a single-label setting. The multi-label kappa (which takes chance agreement of multiple labels into account) and the JSD (which compares the distributions of the multiple labels) are hence more suitable. We note that the PDTB gold annotation that we use for evaluation does not assign multiple relations systematically and has a low rate of double labels. This explains why the PDTB subset has a high partial agreement while the JSD ends up being worst.

4.2 IAA Between Crowdsourced and Reference Labels

Table 2 compares the labels crowdsourced by each method and the reference labels, which are available for the Wikipedia and PDTB subsets.

                            Wiki.       PDTB
Item counts                 300         302
Ref. sub-labels/item        1.54        1.08
QA: sub-labels/item         2.26        2.45
  full/+partial agreement   .133/.887   .070/.487
  multi-label kappa         .857        .449
  JSDflat                   .468        .643
DC: sub-labels/item         2.09        2.21
  full/+partial agreement   .110/.853   .103/.569
  multi-label kappa         .817        .524
  JSDflat                   .483        .606

Table 2: Comparison against gold labels for the QA or DC methods. Since the distribution of the reference sub-labels is not available, JSDflat is calculated between uniform distributions of the sub-labels.

        Europarl   Wikipedia   Novel   PDTB
QA      0.40       0.38        0.38    0.41
DC      0.37       0.34        0.35    0.36

Table 3: Average entropy of the label distributions (10 annotations per relation) for QA/DC, split by domain.

It can be observed that both methods achieve higher full agreement with the reference labels than with each other on both domains. This indicates that the two methods are complementary, with each method better capturing different sense types. In particular, the QA method tends to show higher agreement with the reference for Wikipedia items, while the DC annotations show higher agreement with the reference for PDTB items. This can possibly be attributed to the development of the methodologies: The DC method was originally developed by testing on data from the PDTB in Yung et al. (2019), whereas the QA method was developed by testing on data from Wikipedia and Wikinews in Pyatkin et al. (2020).

4.3 Annotation Uncertainty

Table 3 compares the average entropy of the soft labels collected by both methods. It can be observed that the uncertainty among the labels chosen by the crowdworkers is similar across domains but always slightly lower for DC. We further look at the correlation between annotation uncertainty and cross-method agreement, and find that agreement between methods is substantially higher for those instances where within-method entropy was low. Similarly, we find that agreement between crowdsourced annotations and gold labels is highest for those relations where little entropy was found in crowdsourcing.

Figure 2: Correlation between the entropy of the annotations and the JSDflat between the crowdsourced labels and the reference.

Next, we want to check if the item effect is similar across different methods and domains. Figure 2 shows the correlation between the annotation entropy and the agreement with the reference of each item, for each method on the Wikipedia and PDTB subsets. It illustrates that the annotations of both methods diverge more from the reference as the uncertainty of the annotation increases. While the effect of uncertainty is similar across methods on the Wikipedia subset, the quality of the QA annotations depends more on the uncertainty compared to the DC annotations on the PDTB subset. This means that method bias also exists on the level of annotation uncertainty and should be taken into account when, for example, entropy is used as a criterion to select reliable annotations.

Label              FNQA   FNDC   FPQA   FPDC
conjunction         43     46    203    167
arg2-as-detail      42     62    167    152
precedence          19     18     18     37
arg2-as-denier      38     20     15     47
result              10      5    110    187
contrast             8     17     84     39
arg2-as-instance    10      7     44     57
cause               12     17     54     37
synchronous         20     27     11      5
arg2-as-subst       21     13      1      0
equivalence         22     22      2      1
succession          17     15     24      3
similarity           7      8     15     12
norel               12     12      0      0
arg1-as-detail       9      8     39     13
disjunction          5      4     10      0
arg1-as-denier       3      3     33     31
arg2-as-manner       2      2      9      0
arg2-as-excpt        2      2      1      0
arg2-as-goal         1      1      5      0
arg2-as-cond         1      1      0      0
arg2-as-negcond      1      1      0      0
arg1-as-goal         1      1      3      0

Table 4: FN and FP counts of each method, grouped by the reference sub-labels.

Figure 3: Distribution of the annotation errors by method. Labels annotated by at least 2 workers are compared against the reference labels of the Wikipedia and PDTB items. The relation types are arranged in descending order of the reference sub-label counts.

5 Sources of the Method Bias

In this section, we analyze method bias in terms of the sense labels collected by each method. We also examine the potential limitations of the methods which could have contributed to the bias, and demonstrate how we can utilize information on method bias to crowdsource more reliable labels. Finally, we provide a cross-domain analysis.

Table 5 presents the confusion matrix of the labels collected by both methods for the most frequent level-2 relations. Figure 3 and Table 4 show the distribution of the true and false positives of the sub-labels. These results show that both methods are biased towards certain DRs. The source of these biases can be categorized into two types, which we will detail in the following subsections.

5.1 Limitation of Natural Language for Annotation

There are limitations of representing DRs in natural language using both QA and DC. For example, the QA method confuses workers when the question phrase contains a connective:7

(2) ''Little tyke,'' chortled Mr. Dursley as he left the house. He got into his car and backed out of number four's drive. [QA:SUCCESSION, PRECEDENCE; DC:CONJUNCTION, PRECEDENCE]

In the above example, the majority of the workers formed the question ''After what he left the house?'', which was likely a confusion with ''What did he do after he left the house?''. This could explain the frequent confusion between PRECEDENCE and SUCCESSION by QA, resulting in the frequent FPs of SUCCESSION (Figure 3).8

7The examples are presented in the following format: italics = argument 1; bolded = argument 2; plain = contexts.
8Similarly, the question ''Despite what . . . ?'' is easily confused with ''despite'', which could explain the frequent FPs of arg1-as-denier by the QA method.

For DC, rare relations which lack a frequently used connective are harder to annotate, for example:



(3) He had made an arrangement with one of the cockerels to call him in the mornings half an hour earlier than anyone else, and would put in some volunteer labour at whatever seemed to be most needed, before the regular day's work began. His answer to every problem, every setback, was ''I will work harder!'', which he had adopted as his personal motto. [QA:ARG1-AS-INSTANCE; DC:RESULT]

It is difficult to use the DC method to annotate the ARG1-AS-INSTANCE relation due to a lack of typical, specific, and context-independent connective phrases that mark these rare relations, such as ''this is an example of''. In contrast, the QA method allows workers to make a question and answer pair in the reverse direction, with S1 being the answer to S2, using the same question word, e.g., What is an example of the fact that his answer to every problem [...] was ''I will work harder!''?. This allows workers to label rarer relation types that were not even uncovered by trained annotators.

Many common DCs are ambiguous, for example but and and, and can be hard to disambiguate. To address this, the DC method provides workers with unambiguous connectives in the second step. However, these unambiguous connectives are often relatively uncommon and come with different syntactic constraints, depending on whether they are coordinating or subordinating conjunctions or discourse adverbials. Thus, they do not fit in all contexts. Furthermore, some of the unambiguous connectives sound very ''heavy'' and would not be used naturally in a given sentence. For example, however is often inserted in the first step, but it can mark multiple relations and is disambiguated in the second step by the choice among on the contrary for CONTRAST, despite for ARG1-AS-DENIER, and despite this for ARG2-AS-DENIER. Despite this was chosen frequently since it can be applied to most contexts. This explains the DC method's bias towards arg2-as-denier against contrast (Figure 3: most FPs of arg2-as-denier and most FNs of contrast come from DC).

While the QA method also requires workers to select from a set of question starts, which also contain infrequent expressions (such as Unless what..?), workers are allowed to edit the text to improve the wording of the questions. This helps reduce the effect of bias towards more frequent question prefixes and makes crowdworkers doing the QA task more likely to choose infrequent relation senses than those doing the DC task.

5.2 Guideline Underspecification

Jiang and de Marneffe (2022) report that some disagreements in NLI tasks come from the loose definition of certain aspects of the task. We found that both QA and DC also do not give clear enough instructions in terms of argument spans. The DRs are annotated at the boundary of two consecutive sentences, but neither method limits workers to annotating DRs that span exactly the two sentences.

More specifically, the QA method allows the crowdworkers to form questions by copying spans from one of the sentences. While this makes sure that the relation lies locally between two consecutive sentences, it also sometimes happens that workers highlight partial spans and annotate relations that span over only parts of the sentences. For example:

(4) I agree with Mr Pirker, and it is probably the only thing I will agree with him on if we do vote on the Ludford report. It is going to be an interesting vote. [QA:ARG2-AS-DETAIL, REASON; DC:CONJUNCTION, RESULT]

In Ex. (4), workers constructed the question ''What provides more details on the vote on the Ludford report?''. This is similar to the instructions in PDTB 2.0 and 3.0's annotation manuals, which specify that annotators should take minimal spans that do not have to cover the entire sentence. Other relations should be inferred when the argument span is expanded to the whole sentence, for example a RESULT relation reflecting that there is little agreement, which will make the vote interesting.

Often, a sentence can be interpreted as the elaboration of certain entities in the previous sentence. This could explain why ARG1/2-AS-DETAIL tends to be overlabelled by QA. Figure 3 shows that QA has more than twice as many FP counts for ARG2-AS-DETAIL compared to DC; the contrast is even bigger for ARG1-AS-DETAIL. Yet it is not trivial to filter out such questions that only refer to a part of the sentence, because in some cases the highlighted entity does represent the whole argument span.9 Clearer instructions in the guidelines are desirable.

9Such as ‘‘a few final comments’’ in this example: Ladies
and gentlemen, I would like to make a few final comments.
This is not about the implementation of the habitats
directive.


Similarly, DC does not limit workers to annotating relations between the two sentences; consider:

(5) When two differently-doped regions exist in the same crystal, a semiconductor junction is created. The behavior of charge carriers, which include electrons, ions and electron holes, at these junctions is the basis of diodes, transistors and all modern electronics. [Ref:ARG2-AS-DETAIL; QA:ARG2-AS-DETAIL, CONJUNCTION; DC:CONJUNCTION, RESULT]

In this example, many people inserted as a result, which naturally marks the intra-sentence relation (...is created as a result.). Many relations are potentially spuriously labelled as RESULT, which is frequent between larger chunks of text. Table 5 shows that the most frequent confusion is between DC's CAUSE and QA's CONJUNCTION.10 Within the level-2 CAUSE relation sense, it is the level-3 RESULT relation that turns out to be the main contributor to the observed bias. Figure 3 also shows that most FPs of RESULT come from the DC method.

10A chi-squared test confirms that the observed distribution is significantly different from what could be expected based on chance disagreement.

Table 5: Confusion matrix for the most frequent level-2 sublabels which were annotated by at least 2 workers per relation; values are represented as colors.

5.3 Aggregating DR Annotations Based on Method Bias

The qualitative analysis above provides insights on certain method biases observed in the label distributions, such as QA's bias towards ARG1/2-AS-DETAIL and SUCCESSION and DC's bias towards CONCESSION and RESULT. Being aware of these biases would allow combining the methods: After first labelling all instances with the more cost-effective DC method, RESULT relations, which we know tend to be overlabelled by the DC method, may be re-annotated using the QA method. We simulate this for our data and find that this would increase the partial agreement from 0.853 to 0.913 for Wikipedia and from 0.569 to 0.596 for PDTB.
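The simulated combination can be expressed as a simple post-hoc aggregation rule, sketched below under the assumption that vote counts per item are available as dictionaries; the function names and data layout are hypothetical.

```python
# Sketch of the bias-aware aggregation simulated above: annotate everything
# with the cheaper DC task first, then fall back to the QA labels for items
# whose DC annotation is dominated by RESULT (a sense DC tends to overlabel).

def sub_labels(votes: dict[str, int], min_votes: int = 2) -> set[str]:
    return {lbl for lbl, n in votes.items() if n >= min_votes}

def majority(votes: dict[str, int]) -> str:
    return max(votes, key=votes.get)

def combined_labels(dc_votes: dict[str, int], qa_votes: dict[str, int]) -> set[str]:
    if majority(dc_votes) == "result":
        return sub_labels(qa_votes)   # re-annotated with the QA method
    return sub_labels(dc_votes)

def partial_agreement(items: list[tuple[dict, dict]], reference: list[set]) -> float:
    hits = sum(bool(combined_labels(dc, qa) & ref)
               for (dc, qa), ref in zip(items, reference))
    return hits / len(items)
```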

6 Analysis by Genre

For each of the four genres (Novel, Wikipedia, Europarl, and WSJ) we have ∼300 implicit DRs annotated by both DC and QA. Scholman et al. (2022a) show, based on the DC method, that in DiscoGeM, CONJUNCTION is prevalent in the Wikipedia domain, PRECEDENCE in Literature, and RESULT in Europarl. The QA annotations replicate this finding, as displayed in Figure 4.

It appears more difficult to obtain agreement with the majority labels in Europarl than in other genres, which is reflected in the average entropy (see Table 3) of the distributions for each genre, where DC has the highest entropy in the Europarl domain and QA the second highest (after PDTB). Table 1 confirms these findings, showing that the agreement between the two methods is highest for Wikipedia and lowest for Europarl.

In the latter domain, the DC method results in more CAUSAL relations: 36% of the CONJUNCTIONS labelled by QA are labelled as RESULT in DC.11 Manual inspection of these DC annotations reveals that workers chose considering this frequently only in the Europarl subset. This connective phrase is typically used to mark a pragmatic result relation, where the result reading comes from the belief of the speaker (Ex. (4)). This type of relation is expected to be more frequent in speech and argumentative contexts and is labelled as RESULT-BELIEF in PDTB3. QA does not have a question prefix available that could capture RESULT-BELIEF senses. The RESULT labels obtained by DC are therefore a better fit with the PDTB3 framework than QA's CONJUNCTIONS. CONCESSION is generally more prevalent with the DC method, especially in Europarl, with 9% compared to 3% for QA.

11This appeared to be distributed over many annotators and is thus a true method bias.


Figure 4: Level-2 sublabel counts of all the annotated labels of both methods, split by domain.
CONTRAST, on the other hand, seems to be favored by the QA method, with most (6%) of the CONTRAST relations found in Wikipedia, compared to 3% for DC. Figure 4 also highlights that for the QA approach, annotators tend to choose a wider variety of senses which are rarely ever annotated by DC, such as PURPOSE, CONDITION, and MANNER.

We conclude that encyclopedic and literary texts are the most suitable to be annotated using either DC or QA, as they show higher inter-method agreement (and for Wikipedia also higher agreement with gold). Spoken-language and argumentative domains, on the other hand, are trickier to annotate as they contain more pragmatic readings of the relations.

7 Case Studies: Effect of Task Design on DR Classification Models

Analysis of the crowdsourced annotations reveals that the two methods have different biases and different correlations with domains and the style (and possibly function) of the language used in the domains. We now investigate the effect of task design bias on the automatic prediction of implicit discourse relations. Specifically, we carry out two case studies to demonstrate the effect that task design and the resulting label distributions have on discourse parsing models.

Task and Setup  We formulate the task of predicting implicit discourse relations as follows. The input to the model are two sequences S1 and S2, which represent the arguments of a discourse relation. The targets are PDTB 3.0 sense types (including level-3). This model architecture is similar to the model for implicit DR prediction by Shi and Demberg (2019). We experiment with two different losses and targets: a cross-entropy loss where the target is a single majority label, and a soft cross-entropy loss where the target is a probability distribution over the annotated labels. Using the 10 annotations per instance, we obtain label distributions for each relation, which we use as soft targets. Training with a soft loss has been shown to improve generalization in vision and NLP tasks (Peterson et al., 2019; Uma et al., 2020). As suggested in Uma et al. (2020), we normalize the sense distribution over the 30 possible labels12 with a softmax.

Assume one has a relation with the following annotations: 4 RESULT, 3 CONJUNCTION, 2 SUCCESSION, 1 ARG1-AS-DETAIL. For the hard loss, the target would be the majority label: RESULT. For the soft loss, we normalize the counts (every label with no annotation has a count of 0) using a softmax, for a smoother distribution without zeros.
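The sketch below illustrates this target construction for the worked example and a soft cross-entropy loss over the 30 senses; it is a minimal PyTorch illustration under our own naming, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

NUM_SENSES = 30  # the label inventory listed in footnote 12

def soft_target(counts: dict[int, int]) -> torch.Tensor:
    """Softmax over raw annotation counts (absent labels count 0),
    giving a smooth target distribution without zeros."""
    votes = torch.zeros(NUM_SENSES)
    for label_idx, n in counts.items():
        votes[label_idx] = float(n)
    return F.softmax(votes, dim=-1)

def soft_cross_entropy(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a probability distribution instead of a class index."""
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# worked example: 4 RESULT, 3 CONJUNCTION, 2 SUCCESSION, 1 ARG1-AS-DETAIL
# (indices follow the order of the label list in footnote 12)
counts = {3: 4, 2: 3, 15: 2, 4: 1}
target = soft_target(counts)          # the hard-loss target would be index 3 (RESULT)
logits = torch.randn(1, NUM_SENSES)   # stand-in for model output
loss = soft_cross_entropy(logits, target.unsqueeze(0))
```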

We fine-tune DeBERTa (deberta-base) (He et al., 2020) in a sequence classification setup using the HuggingFace checkpoint (Wolf et al., 2020). The model trains for 30 epochs with early stopping and a batch size of 8.
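A possible realization of this setup with the HuggingFace Trainer is sketched below. Only the checkpoint, the number of epochs, the batch size, early stopping, and the soft loss are taken from the text; everything else (patience, evaluation schedule, dataset construction) is an assumption.

```python
from transformers import (AutoModelForSequenceClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

class SoftLossTrainer(Trainer):
    """Trainer variant whose loss is a soft cross-entropy against a label
    distribution (batch field "soft_labels") instead of a single class index."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        soft_labels = inputs.pop("soft_labels")
        outputs = model(**inputs)
        log_probs = outputs.logits.log_softmax(dim=-1)
        loss = -(soft_labels * log_probs).sum(dim=-1).mean()
        return (loss, outputs) if return_outputs else loss

def build_trainer(train_dataset, dev_dataset):
    # datasets are assumed to hold tokenized S1/S2 pairs plus a "soft_labels"
    # vector over the 30 senses (or a single label index for the hard-loss setup)
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-base", num_labels=30)
    args = TrainingArguments(
        output_dir="dr-classifier",
        num_train_epochs=30,              # stated in the text
        per_device_train_batch_size=8,    # stated in the text
        evaluation_strategy="epoch",      # assumption; needed for early stopping
        save_strategy="epoch",
        load_best_model_at_end=True,
    )
    return SoftLossTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # assumption
    )
```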

Data  In addition to the 1,200 instances we analyzed in the current contribution, we additionally use all annotations from DiscoGeM as training data. DiscoGeM, which was annotated with the DC method, adds 2,756 Novel relations, 2,504 Europarl relations, and 345 Wikipedia relations. We formulate different setups for the case studies.

12precedence, arg2-as-detail, conjunction, result, arg1-as-detail, arg2-as-denier, contrast, arg1-as-denier, synchronous, cause, arg2-as-instance, arg2-as-cond, arg2-as-subst, similarity, disjunction, succession, arg1-as-goal, arg1-as-instance, arg2-as-goal, arg2-as-manner, arg1-as-manner, equivalence, arg2-as-excpt, arg1-as-excpt, arg1-as-cond, differentcon, norel, arg1-as-negcond, arg2-as-negcond, arg1-as-subst.


                 PDTBtest   Wikigold   TED-M.
DC               0.34†      0.65†      0.36
DC Soft          0.29*†     0.70*†     0.34*
QA+DC∩           0.34‡      0.67       0.37
QA+DC∩ Soft      0.38*‡     0.66*      0.31*
QA+DC∪           0.35       0.49       0.36
QA+DC∪ Soft      0.41*      0.67*      0.43*

Table 6: Accuracy of model (with soft vs. hard loss) prediction on gold labels. The model is trained either on DC data (DC), an intersection of DC and QA (∩), or the union of DC and QA (∪). The same symbol in a column indicates a statistically significant (McNemar test) difference in cross-model results.

7.1 Case 1: Incorporating Data from Different Task Designs

The purpose of this study is to see if a model trained on data crowdsourced by the DC/QA methods can generalize to traditionally annotated test sets. We thus test on the 300 Wikipedia relations annotated by experts (Wikigold), all implicit relations from the test set of PDTB 3.0 (PDTBtest), and the implicit relations of the English test set of TED-MDB (Zeyrek et al., 2020). For training data, we either use (1) all of the DiscoGeM annotations (Only DC); or (2) the 1,200 QA annotations from all four domains, plus 5,605 DC annotations from the rest of DiscoGeM (Intersection, ∩); or (3) 1,200 annotations which combine the label counts (e.g., 20 counts instead of 10) of QA and DC, plus 5,605 DC annotations from the rest of DiscoGeM (Union, ∪). We hypothesize that this union will lead to improved results due to the annotation distribution coming from a bigger sample. When testing on Wikigold, the corresponding subset of Wikipedia relations is removed from the training data. We randomly sampled 30 relations for dev.

Results  Table 6 shows how the model generalizes to traditionally annotated data. On PDTB and the Wikipedia test set, the model with a soft loss generally performs better than the hard loss model. TED-MDB, on the other hand, only contains a single label per relation, and training with a distributional loss is therefore less beneficial. Mixing DC and QA data only improves in the soft case for PDTB. The merging of the respective method label counts, on the other hand, leads to the best model performance on both PDTB and TED-MDB. On Wikipedia the best performance is obtained when training on soft DC-only distributions. Looking at the label-specific differences in performance, we observe that improvement on the Wikipedia test set mainly comes from better precision and recall when predicting ARG2-AS-DETAIL, while on PDTB, QA+DC∩ Soft is better at predicting CONJUNCTION. We conclude that training on data that comes from different task designs does not hurt performance, and even slightly improves performance when using majority vote labels. When training with a distribution, the union setup (∪) seems to work best.
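The McNemar tests behind the significance markers in Tables 6 and 7 can be reproduced along the following lines; the use of statsmodels and of the exact-test variant are implementation choices of this sketch, not details specified in the paper.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_p(pred_a, pred_b, gold):
    """Exact McNemar test comparing two models' majority-label predictions."""
    a_correct = np.asarray(pred_a) == np.asarray(gold)
    b_correct = np.asarray(pred_b) == np.asarray(gold)
    table = [[np.sum(a_correct & b_correct),  np.sum(a_correct & ~b_correct)],
             [np.sum(~a_correct & b_correct), np.sum(~a_correct & ~b_correct)]]
    return mcnemar(table, exact=True).pvalue
```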

7.2 Case 2: Cross-domain vs. Cross-method

The purpose of this study is to investigate how cross-domain generalization is affected by method bias. In other words, we want to compare a cross-domain and cross-method setup with a cross-domain and same-method setup. We test on the domain-specific data from the 1,200 instances annotated by QA and DC, respectively, and train on various domain configurations from DiscoGeM (excluding dev and test), together with the extra 300 PDTB instances, annotated by DC.

Table 7 shows the different combinations of data sets we use in this study (columns) as well as the results of in- and cross-domain and in- and cross-method predictions (rows). Both a change in domain and a change in annotation task lead to lower performance. Interestingly, the results show that the task factor has a stronger effect on performance than the domain: When training on DC distributions, the QA test results are worse than the DC test results in all cases. This indicates that task bias is an important factor to consider when training models. In general, except in the out-of-domain novel test case, training with a soft loss leads to the same or considerably better generalization accuracy than training with a hard loss. We thus confirm the findings of Peterson et al. (2019) and Uma et al. (2020) also for DR classification.

8 Discussion and Conclusion

DR annotation is a notoriously difficult task with low IAA. Annotations are not only subject to the interpretation of the coder (Spooren and Degand, 2010), but also to the framework (Demberg et al., 2019).

Table 7: Cross-domain and cross-method experiments, using a hard loss vs. a soft loss. Columns show the train setting and rows the test performance. Acc. is for predicting the majority label. JSD compares the predicted distribution (soft) with the target distribution. * indicates cross-method results are not statistically significant (McNemar's test).

The current study extends these findings by showing that the task design also crucially affects the output. We investigated the effect of two distinct crowdsourced DR annotation tasks on the obtained relation distributions. These two tasks are unique in that they use natural language to annotate. Even though these designs are more intuitive to lay individuals, we show that such natural language-based annotation designs also suffer from bias and leave room for varying interpretations (as do traditional annotation tasks).

The results show that both methods have unique biases, but also that both methods are valid, as similar sets of labels are produced. Further, the methods seem to be complementary: Both methods show higher agreement with the reference label than with each other. This indicates that the methods capture different sense types. The results further show that the textual domain can push each method towards different label distributions. Finally, we simulated how aggregating annotations based on method bias improves agreement.

We suggest several modifications to both methods for future work. For QA, we recommend replacing question prefix options which start with a connective, such as ''After what''. The revised options should ideally start with a Wh-question word, for example, ''What happens after..''. This would make the questions sound more natural and help to prevent confusion with respect to level-3 sense distinctions. For DC, an improved interface that allows workers to highlight argument spans could serve as a check that confirms the relation holds between the two consecutive sentences. Syntactic constraints making it difficult to insert certain rare connectives could also be mitigated if the workers are allowed to make minor edits to the texts.

Considering that both methods show benefits and possible downsides, it could be interesting to combine them for future crowdsourcing efforts. Given that obtaining DC annotations is cheaper and quicker, it could make sense to collect DC annotations on a larger scale and then use the QA method for a specific subset that shows high label entropy. Another option would be to merge both methods, by first letting the crowdworkers insert a connective and then using QAs for the second, connective-disambiguation step. Finally, since we showed that often more than one relation sense can hold, it would make sense to allow annotators to write multiple QA pairs or insert multiple possible connectives for a given relation.

The DR classification experiments revealed that generalization across data from different task designs is hard, in the DC and QA case even harder than cross-domain generalization. Additionally, we found that merging data distributions coming from different task designs can help boost performance on data coming from a third source (traditional annotations). Finally, we confirmed that soft modeling approaches using label distributions can improve discourse classification performance.

Task design bias has been identified as one source of annotation bias and acknowledged as an artifact of the dataset in other linguistic tasks as well (Pavlick and Kwiatkowski, 2019; Jiang and de Marneffe, 2022). Our findings show that the effect of this type of bias can be reduced by training with data collected by multiple methods. This could be the same for other NLP tasks, especially those cast in natural language.

Comparing their task designs could be an interesting future research direction. We therefore encourage researchers to be more conscious about the biases that crowdsourcing task design introduces.

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft, Funder ID: http://dx.doi.org/10.13039/501100001659, grant number SFB1102: Information Density and Linguistic Encoding, by the European Research Council, ERC-StG grant no. 677352, and the Israel Science Foundation grant 2827/21, for which we are grateful. We also thank the TACL reviewers and action editors for their thoughtful comments.

References

Rahul Aralikatte, Matthew Lamm, Daniel Hardt, and Anders Søgaard. 2021. Ellipsis resolution as question answering: An evaluation. In 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 810–817, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.68

Lora Aroyo and Chris Welty. 2013. Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. WebSci2013 ACM, 2013(2013).

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596. https://doi.org/10.1162/coli.07-034-R2

N. Asher. 1993. Reference to Abstract Objects in Discourse, volume 50. Kluwer, Norwell, MA, Dordrecht. https://doi.org/10.1007/978-94-011-1715-9

Valerio Basile, Michael Fell, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, Massimo Poesio, and Alexandra Uma. 2021. We need to consider disagreement in evaluation. In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, pages 15–21, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.bppf-1.3

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1075

Samuel Bowman and George Dahl. 2021. What will it take to fix benchmarking in natural language understanding? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4843–4855, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.385

Sven Buechel and Udo Hahn. 2017a. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/E17-2092

Sven Buechel and Udo Hahn. 2017b. Readers vs. writers vs. texts: Coping with different perspectives of text understanding in emotion annotation. In Proceedings of the 11th Linguistic Annotation Workshop, pages 1–12, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-0801

Lynn Carlson and Daniel Marcu. 2001. Discourse tagging reference manual. ISI Technical Report ISI-TR-545, 54:1–56.

Nancy Chang, Russell Lee-Goldman, and Michael Tseng. 2016. Linguistic wisdom from the crowd. In Third AAAI Conference on Human Computation and Crowdsourcing. https://doi.org/10.1609/hcomp.v3i1.13266

Tongfei Chen, Zheng Ping Jiang, Adam Poliak, Keisuke Sakaguchi, and Benjamin Van Durme. 2020. Uncertain natural language inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8772–8779, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.774
Meeting of the Association for Computational Linguistics, pages 8772–8779, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.774

John Joon Young Chung, Jean Y. Song, Sindhu Kutty, Sungsoo Hong, Juho Kim, and Walter S. Lasecki. 2019. Efficient elicitation approaches to estimate collective crowd answers. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–25. https://doi.org/10.1145/3359164

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46. https://doi.org/10.1177/001316446002000104

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90. https://doi.org/10.1177/1529100619850176, PubMed: 31313637

Marie-Catherine De Marneffe, Christopher D. Manning, and Christopher Potts. 2012. Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics, 38(2):301–333. https://doi.org/10.1162/COLI_a_00097

Vera Demberg, Merel C. J. Scholman, and Fatemeh Torabi Asr. 2019. How compatible are our discourse annotation frameworks? Insights from mapping RST-DT and PDTB annotations. Dialogue & Discourse, 10(1):87–135. https://doi.org/10.5087/dad.2019.104

Mark Díaz, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. 2018. Addressing age-related bias in sentiment analysis. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–14. https://doi.org/10.1145/3173574.3173986

Anca Dumitrache. 2015. Crowdsourcing disagreement for collecting semantic annotation. In European Semantic Web Conference, pages 701–710. Springer. https://doi.org/10.1007/978-3-319-18818-8_43

Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, and Chris Welty. 2018. CrowdTruth 2.0: Quality metrics for crowdsourcing with disagreement. In 1st Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing, and Short Paper 1st Workshop on Disentangling the Relation Between Crowdsourcing and Bias Management, SAD+CrowdBias 2018, pages 11–18. CEUR-WS.

Anca Dumitrache, Oana Inel, Benjamin Timmermans, Carlos Ortiz, Robert-Jan Sips, Lora Aroyo, and Chris Welty. 2021. Empirical methodology for crowdsourcing ground truth. Semantic Web, 12(3):403–421. https://doi.org/10.3233/SW-200415

Yanai Elazar, Victoria Basmov, Yoav Goldberg, and Reut Tsarfaty. 2022. Text-based NP enrichment. Transactions of the Association for Computational Linguistics, 10:764–784. https://doi.org/10.1162/tacl_a_00488

Katrin Erk and Diana McCarthy. 2009. Graded word sense assignment. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 440–449, Singapore. Association for Computational Linguistics.

Elisa Ferracane, Greg Durrett, Junyi Jessy Li, and Katrin Erk. 2021. Did they answer? Subjective acts and intents in conversational discourse. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1626–1644, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.129

Nicholas Fitzgerald, Julian Michael, Luheng He, and Luke Zettlemoyer. 2018. Large-scale QA-SRL parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2051–2060, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1191

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
8
6
2
1
5
4
4
5
1

/

/
t

A
C
_
A
_
0
0
5
8
6
p
d

.

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

attention. In International Conference on Learning Representations.

Yufang Hou. 2020. Bridging anaphora resolution as question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1428–1438, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.132

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1120–1130, Atlanta, Georgia. Association for Computational Linguistics.

Christoph Hube, Besnik Fetahu, and Ujwal Gadiraju. 2019. Understanding and mitigating worker biases in the crowdsourced collection of subjective judgments. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–12, New York, NY. Association for Computing Machinery.

Terne Sasha Thorn Jakobsen, Maria Barrett, Anders Søgaard, and David Lassen. 2022. The sensitivity of annotator bias to task definitions in argument mining. In Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022, pages 44–61, Marseille, France. European Language Resources Association.

Nanjiang Jiang and Marie-Catherine de Marneffe. 2022. Investigating reasons for disagreement in natural language inference. Transactions of the Association for Computational Linguistics, 10:1357–1374. https://doi.org/10.1162/tacl_a_00523

Youxuan Jiang, Jonathan K. Kummerfeld, and Walter Lasecki. 2017. Understanding task design trade-offs in crowdsourced paraphrase collection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 103–109, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-2017

David Jurgens. 2013. Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 556–562, Atlanta, Georgia. Association for Computational Linguistics.

Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi, and Manabu Sassano. 2014. Rapid development of a corpus with discourse annotations using two-stage crowdsourcing. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 269–278, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Yudai Kishimoto, Shinnosuke Sawada, Yugo Murawaki, Daisuke Kawahara, and Sadao Kurohashi. 2018. Improving crowdsourcing-based annotation of Japanese discourse relations. In LREC.

Wei-Jen Ko, Cutter Dalton, Mark Simmons, Eliza Fisher, Greg Durrett, and Junyi Jessy Li. 2021. Discourse comprehension: A question answering framework to represent sentence connections. arXiv preprint arXiv:2111.00701.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit X, pages 79–86, Phuket, Thailand.

Yiwei Luo, Dallas Card, and Dan Jurafsky. 2020. Detecting stance in media on global warming. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3296–3315, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.296

William C. Mann and Sandra A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3):243–281. https://doi.org/10.1515/text.1.1988.8.3.243

Christopher D. Manning. 2006. Local textual inference: It's hard to circumscribe, but you know it when you see it—and NLP needs it.

Marian Marchal, Merel Scholman, Frances Yung, and Vera Demberg. 2022. Establishing annotation quality in multi-label annotations. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3659–3668, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.466

Yixin Nie, Xiang Zhou, and Mohit Bansal. 2020. What can we learn from collective human opinions on natural language inference data? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9131–9143, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.734

Rebecca J. Passonneau and Bob Carpenter. 2014. The benefits of a model of annotation. Transactions of the Association for Computational Linguistics, 2:311–326. https://doi.org/10.1162/tacl_a_00185

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694. https://doi.org/10.1162/tacl_a_00293

Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. 2019. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9617–9626. https://doi.org/10.1109/ICCV.2019.00971

Barbara Plank, Dirk Hovy, and Anders Søgaard. 2014. Linguistically debatable or just plain wrong? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 507–511, Baltimore, Maryland. Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-2083

Massimo Poesio and Ron Artstein. 2005. The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, pages 76–83. https://doi.org/10.3115/1608829.1608840

Massimo Poesio, Patrick Sturt, Ron Artstein, and Ruth Filik. 2006. Underspecification and anaphora: Theoretical issues and preliminary evidence. Discourse Processes, 42(2):157–175. https://doi.org/10.1207/s15326950dp4202_4

Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, and Mark Diaz. 2021. On releasing annotator-level labels and information in datasets. In Proceedings of The Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop, pages 133–138, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.law-1.14

Valentina Pyatkin, Ayal Klein, Reut Tsarfaty, and Ido Dagan. 2020. QADiscourse - Discourse Relations as QA Pairs: Representation, crowdsourcing and baselines. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2804–2819, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.224

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-2124

Ines Rehbein, Merel Scholman, and Vera Demberg. 2016. Annotating discourse relations in spoken language: A comparison of the PDTB and CCR frameworks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation
(LREC'16), pages 1039–1046, Portorož, Slovenia. European Language Resources Association (ELRA).

Stefan Riezler. 2014. On the problem of theoretical terms in empirical computational linguistics. Computational Linguistics, 40(1):235–245. https://doi.org/10.1162/COLI_a_00182

Hannah Rohde, Anna Dickinson, Nathan Schneider, Christopher Clark, Annie Louis, and Bonnie Webber. 2016. Filling in the blanks in understanding discourse adverbials: Consistency, conflict, and context-dependence in a crowdsourced elicitation task. In Proceedings of the 10th Linguistic Annotation Workshop (LAW X), pages 49–58, Berlin, Germany. https://doi.org/10.18653/v1/W16-1707

Ted J. M. Sanders, Wilbert P. M. S. Spooren, and Leo G. M. Noordman. 1992. Toward a taxonomy of coherence relations. Discourse Processes, 15(1):1–35. https://doi.org/10.1080/01638539209544800

Merel C. J. Scholman and Vera Demberg. 2017. Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task. In Proceedings of the 11th Linguistic Annotation Workshop (LAW), pages 24–33, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-0803

Merel C. J. Scholman, Tianai Dong, Frances Yung, and Vera Demberg. 2022a. DiscoGeM: A crowdsourced corpus of genre-mixed implicit discourse relations. In Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC'22), Marseille, France. European Language Resources Association (ELRA).

Merel C. J. Scholman, Valentina Pyatkin, Frances Yung, Ido Dagan, Reut Tsarfaty, and Vera Demberg. 2022b. Design choices in crowdsourcing discourse relation annotations: The effect of worker selection and training. In Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC'22), Marseille, France. European Language Resources Association (ELRA).

Wei Shi and Vera Demberg. 2019. Learning to explicitate connectives with Seq2Seq network for implicit discourse relation classification. In Proceedings of the 13th International Conference on Computational Semantics – Long Papers, pages 188–199, Gothenburg, Sweden. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-0416

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 254–263, Honolulu, Hawaii. Association for Computational Linguistics.

Wilbert P. M. S. Spooren and Liesbeth Degand. 2010. Coding coherence relations: Reliability and validity. Corpus Linguistics and Linguistic Theory, 6(2):241–266. https://doi.org/10.1515/cllt.2010.009

Alexandra Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. 2020. A case for soft loss functions. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 8, pages 173–177. https://doi.org/10.1609/hcomp.v8i1.7478

Alexandra N. Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. 2021. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470. https://doi.org/10.1613/jair.1.12752

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5618

Bonnie Webber. 2009. Genre distinctions for discourse in the Penn TreeBank. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 674–682. https://doi.org/10.3115/1690219.1690240

Bonnie Webber, Rashmi Prasad, Alan Lee, and Aravind Joshi. 2019. The Penn Discourse Treebank 3.0 annotation manual. Philadelphia, University of Pennsylvania.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Frances Yung, Vera Demberg, and Merel Scholman. 2019. Crowdsourcing discourse relation annotations by a two-step connective insertion task. In Proceedings of the 13th Linguistic Annotation Workshop, pages 16–25, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-4003

Deniz Zeyrek, Amália Mendes, Yulia Grishina, Murathan Kurfalı, Samuel Gibbon, and Maciej Ogrodniczuk. 2019. TED Multilingual Discourse Bank (TED-MDB): A parallel corpus annotated in the PDTB style. Language Resources and Evaluation, 1–27.

Deniz Zeyrek, Amália Mendes, Yulia Grishina, Murathan Kurfalı, Samuel Gibbon, and Maciej Ogrodniczuk. 2020. TED Multilingual Discourse Bank (TED-MDB): A parallel corpus annotated in the PDTB style. Language Resources and Evaluation, 54(2):587–613. https://doi.org/10.1007/s10579-019-09445-9

Shujian Zhang, Chengyue Gong, and Eunsol Choi. 2021. Learning with different amounts of annotation: From zero to many labels. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7620–7632, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.601

Šárka Zikánová, Jiří Mírovský, and Pavlína Synková. 2019. Explicit and implicit discourse relations in the Prague Discourse Treebank. In Text, Speech, and Dialogue: 22nd International Conference, TSD 2019, Ljubljana, Slovenia, September 11–13, 2019, Proceedings 22, pages 236–248. Springer. https://doi.org/10.1007/978-3-030-27947-9_20
