Design Choices for Crowdsourcing Implicit Discourse Relations:
Revealing the Biases Introduced by Task Design

Valentina Pyatkin1 Frances Yung2 Merel C. J. Scholman2,3
Reut Tsarfaty1 Ido Dagan1 Vera Demberg2
1Bar Ilan University, Ramat Gan, Israel
2Saarland University, Saarbrücken, Germany
3Utrecht University, Utrecht, The Netherlands

{piatkiv,reut.tsarfaty}@biu.ac.il; dagan@cs.biu.ac.il
{frances,m.c.j.sholman,vera}@coli.uni-saarland.de

Abstract

Disagreement in natural language annotation has mostly been studied from the perspective of biases introduced by the annotators and the annotation frameworks. Here, we propose to analyze another source of bias: task design bias, which has a particularly strong impact on crowdsourced linguistic annotations where natural language is used to elicit the interpretation of lay annotators. For this purpose we look at implicit discourse relation annotation, a task that has repeatedly been shown to be difficult due to the relations’ ambiguity. We compare the annotations of 1,200 discourse relations obtained using two distinct annotation tasks and quantify the biases of both methods across four different domains. Both methods are natural language annotation tasks designed for crowdsourcing. We show that the task design can push annotators towards certain relations and that some discourse relation senses can be better elicited with one or the other annotation approach. We also conclude that this type of bias should be taken into account when training and testing models.

1 Introduction

Crowdsourcing has become a popular method for
data collection. It not only allows researchers
to collect large amounts of annotated data in a
shorter amount of time, but also captures human
inference in natural language, which should be the
goal of benchmark NLP tasks (Manning, 2006).
In order to obtain reliable annotations, the crowd-
sourced labels are traditionally aggregated to a
single label per item, using simple majority voting
or annotation models that reduce noise from the
data based on the disagreement among the annotators (Hovy et al., 2013; Passonneau and Carpenter, 2014). However, there is increasing consensus that disagreement in annotation cannot be generally discarded as noise in a range of NLP tasks, such as natural language inference (De Marneffe et al., 2012; Pavlick and Kwiatkowski, 2019; Chen et al., 2020; Nie et al., 2020), word sense disambiguation (Jurgens, 2013), question answering (Min et al., 2020; Ferracane et al., 2021), anaphora resolution (Poesio and Artstein, 2005; Poesio et al., 2006), sentiment analysis (Díaz et al., 2018; Cowen et al., 2019), and stance classification (Waseem, 2016; Luo et al., 2020). Label distributions have been proposed to replace categorical labels in order to represent the label ambiguity (Aroyo and Welty, 2013; Pavlick and Kwiatkowski, 2019; Uma et al., 2021; Dumitrache et al., 2021).

There are various reasons behind the ambiguity of linguistic annotations (Dumitrache, 2015; Jiang and de Marneffe, 2022). Aroyo and Welty (2013) summarize the sources of ambiguity into three categories: the text, the annotators, and the annotation scheme. In downstream NLP tasks, it would be helpful if models could detect possible alternative interpretations of ambiguous texts, or predict a distribution of interpretations by a population. In addition to the existing work on disagreement due to annotators’ bias, the effect of annotation frameworks has also been studied, such as the discussion on whether entailment should include pragmatic inferences (Pavlick and Kwiatkowski, 2019), the effect of the granularity of the collected labels (Chung et al., 2019), or the system of labels that categorize the linguistic phenomenon (Demberg et al., 2019). In this work, we examine the effect of task design bias, which is independent of the annotation framework, on the
quality of crowdsourced annotations. Specifically, we look at inter-sentential implicit discourse relation (DR) annotation, i.e., semantic or pragmatic relations between two adjacent sentences without a discourse connective to which the sense of the relation can be attributed. Figure 1 shows an example of an implicit relation that can be annotated as Conjunction or Result.

Implicit DR annotation is arguably the hardest task in discourse parsing. Discourse coherence is a feature of the mental representation that readers form of a text, rather than of the linguistic material itself (Sanders et al., 1992). Discourse annotation thus relies on annotators’ interpretation of a text. Further, relations can often be interpreted in various ways (Rohde et al., 2016), with multiple valid readings holding at the same time. These factors make discourse relation annotation, especially for implicit relations, a particularly difficult task. We collect 10 different annotations per DR, thereby focusing on distributional representations, which are more informative than categorical labels.

Since DR annotation labels are often abstract terms that are not easily understood by lay individuals, we focus on ‘‘natural language’’ task designs. Decomposing and simplifying an annotation task, where the DR labels can be obtained indirectly from the natural language annotations, has been shown to work well for crowdsourcing (Chang et al., 2016; Scholman and Demberg, 2017; Pyatkin et al., 2020). Crowdsourcing with natural language has become increasingly popular. This includes tasks such as NLI (Bowman et al., 2015), SRL (FitzGerald et al., 2018), and QA (Rajpurkar et al., 2018). This trend is further visible in modeling approaches that cast traditional structured prediction tasks into NL tasks, such as for co-reference (Aralikatte et al., 2021), discourse comprehension (Ko et al., 2021), or bridging anaphora (Hou, 2020; Elazar et al., 2022). It is therefore of interest to the broader research community to see how task design biases can arise, even when the tasks are more accessible to the lay public.

Transactions of the Association for Computational Linguistics, vol. 11, pp. 1014–1032, 2023. https://doi.org/10.1162/tacl_a_00586
Action Editor: Annie Louis. Submission batch: 1/2023; Revision batch: 1/2023; Published 8/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.



We examine two distinct natural language crowdsourcing discourse relation annotation tasks (Figure 1): Yung et al. (2019) derive relation labels from discourse connectives (DCs) that crowd workers insert; Pyatkin et al. (2020) derive labels from Question Answer (QA) pairs that crowd workers write. Both task designs employ natural language annotations instead of labels from a taxonomy. The two task designs, DC and QA, are used to annotate 1,200 implicit discourse relations in 4 different domains. This allows us to explore how the task design impacts the obtained annotations, as well as the biases that are inherent to each method. To do so we showcase the differences between various inter-annotator agreement metrics on annotations with distributional and aggregated labels.

We find that both methods have strengths and weaknesses in identifying certain types of relations. We further see that these biases are also affected by the domain. In a series of discourse relation classification experiments, we demonstrate the benefits of collecting annotations with mixed methodologies, we show that training with a soft loss with distributions as targets improves model performance, and we find that cross-task generalization is harder than cross-domain generalization.

The outline of the paper is as follows. We introduce the notion of task design bias and analyze its effect on crowdsourcing implicit DRs, using two different task designs (Sections 3–4). Next, we quantify strengths and weaknesses of each method using the obtained annotations, and suggest ways to reduce task bias (Section 5). Then we look at genre-specific task bias (Section 6). Lastly, we demonstrate the task bias effect on DR classification performance (Section 7).

2 Background

2.1 Annotation Biases

Annotation tends to be an inherently ambiguous
task, often with multiple possible interpretations
and without a single ground truth (Aroyo and
Welty, 2013). An increasing amount of research
has studied annotation disagreements and biases.

Figure 1: Example of two relational arguments (S1 and
S2) and the DC and QA annotation in the middle.



Prior studies have focused on how crowdworkers can be biased. Worker biases are subject to various factors, such as educational or cultural background, or other demographic characteristics. Prabhakaran et al. (2021) point out that for more subjective annotation tasks, the socio-demographic background of annotators contributes to multiple annotation perspectives and argue that label aggregation obfuscates such perspectives. Instead, soft labels are proposed, such as the ones provided by the CrowdTruth method (Dumitrache et al., 2018), which require multiple judgments to be collected per instance (Uma et al., 2021). Bowman and Dahl (2021) suggest that annotations that are subject to bias from methodological artifacts should not be included in benchmark datasets. In contrast, Basile et al. (2021) argue that all kinds of human disagreements should be predicted by NLU models and thus included in evaluation datasets.

In contrast to annotator bias, a limited amount of research is available on bias related to the formulation of the task. Jakobsen et al. (2022) show that argument annotations exhibit widely different levels of social group disparity depending on which guidelines the annotators followed. Similarly, Buechel and Hahn (2017a,b) study different design choices for crowdsourcing emotion annotations and show that the perspective that annotators are asked to take in the guidelines affects annotation quality and distribution. Jiang et al. (2017) study the effect of workflow for paraphrase collection and found that examples based on previous contributions prompt workers to produce more diverging paraphrases. Hube et al. (2019) show that biased subjective judgment annotations can be mitigated by asking workers to think about responses other workers might give and by making workers aware of their possible biases. Hence, the available research suggests that task design can affect the annotation output in various ways. More research studied the collection of multiple labels: Jurgens (2013) compares selection and scale rating and finds that workers would choose an additional label for a word sense labelling task. In contrast, Scholman and Demberg (2017) find that workers usually opt not to provide an additional DR label even when allowed. Chung et al. (2019) compare various label collection methods including single / multiple labelling, ranking, and probability assignment. We focus on the biases in DR annotation approaches using the same set of labels, but translated into different ‘‘natural language’’ for crowdsourcing.

2.2 DR Annotation

Various frameworks exist that can be used to an-
notate discourse relations, such as RST (Mann
and Thompson, 1988) and SDRT (Asher, 1993).
In this work, we focus on the annotation of im-
plicit discourse relations, following the framework
used to annotate the Penn Discourse Treebank 3.0
(PDTB, Webber et al., 2019). PDTB’s sense clas-
sification is structured as a three-level hierarchy,
with four coarse-grained sense groups in the first
level and more fine-grained senses for each of
the next levels.1 The process is a combination of
manual and automated annotation: An automated
process identifies potential explicit connectives,
and annotators then decide on whether the poten-
tial connective is indeed a true connective. If so,
they specify one or more senses that hold between
its arguments. If no connective or alternative lex-
icalization is present (i.e., for implicit relations),
each annotator provides one or more connectives
that together express the sense(s) they infer.

DR datasets, such as PDTB (Webber et al.,
2019), RST-DT (Carlson and Marcu, 2001), and
TED-MDB (Zeyrek et al., 2019), are commonly
annotated by trained annotators, who are ex-
pected to be familiar with extensive guidelines
written for a given task (Plank et al., 2014;
Artstein and Poesio, 2008; Riezler, 2014).
However, there have also been efforts to crowd-
source discourse relation annotations (Kawahara
et al., 2014; Kishimoto et al., 2018; Scholman and
Demberg, 2017; Pyatkin et al., 2020). We investi-
gate two crowdsourcing approaches that annotate
inter-sentential implicit DRs and we deterministi-
cally map the NL-annotations to the PDTB3 label
structure.

2.2.1 Crowdsourcing DRs with the DC Method

Yung et al. (2019) developed a crowdsourcing
discourse relation annotation method using dis-
course connectives, referred to as the DC method.
For every instance, participants first provide a
connective that in their view, best expresses the
relation between the two arguments. Note that the

1We merge the belief and speech-act relation senses
(which cannot be distinguished reliably by QA and DC) with
their corresponding more general relation senses.


connective chosen by the participant might be ambiguous. Therefore, participants disambiguate the relation in a second step, by selecting a connective from a list that is generated dynamically based on the connective provided in the first step. When the first-step insertion does not match any entry in the connective bank (from which the list of disambiguating connectives is generated), participants are presented with a default list of twelve connectives expressing a variety of relations. Based on the connectives chosen in the two steps, the inferred relation sense can be extracted. For example, the CONJUNCTION reading in Figure 1 can be expressed by in addition, and the RESULT reading can be expressed by consequently.
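To make the two-step lookup concrete, the following is a minimal sketch in Python; the connective bank, default list, and connective-to-sense mapping shown here are illustrative placeholders rather than the actual resources of Yung et al. (2019).

# Illustrative two-step DC lookup; the resources below are invented examples.
CONNECTIVE_BANK = {
    # free-text connective from step 1 -> disambiguation list shown in step 2
    "however": ["on the contrary", "despite this", "despite"],
    "so": ["consequently", "in other words"],
}
DEFAULT_LIST = ["in addition", "consequently", "despite this", "for example",
                "afterwards", "beforehand", "in other words", "instead",
                "similarly", "on the contrary", "because of this", "in short"]
SENSE_OF = {
    "in addition": "conjunction", "consequently": "result",
    "despite this": "arg2-as-denier", "despite": "arg1-as-denier",
    "on the contrary": "contrast", "for example": "arg2-as-instance",
}

def dc_label(step1_connective, step2_choice):
    """Map the two crowdworker choices to a PDTB-style sense label."""
    options = CONNECTIVE_BANK.get(step1_connective.lower(), DEFAULT_LIST)
    if step2_choice not in options:
        raise ValueError("step 2 must pick from the generated list")
    return SENSE_OF.get(step2_choice, "other")

print(dc_label("however", "despite this"))  # -> arg2-as-denier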

The DC method was used to create a crowd-
sourced corpus of 6,505 discourse-annotated
implicit relations, named DiscoGeM (Scholman
et al., 2022a). A subset of DiscoGeM is used in
the current study (see Section 3).

2.2.2 Crowdsourcing DRs by the QA Method

Pyatkin et al. (2020) proposed to crowdsource
discourse relations using QA pairs. They collected
a dataset of intra-sentential QA annotations which
aim to represent discourse relations by including
one of the propositions in the question and the
other in the respective answer, with the question
prefix (What is similar to..?, What is an example
of..?) mapping to a relation sense. Their method
was later extended to also work inter-sententially
(Scholman et al., 2022). In this work we make use
of the extended approach that relates two distinct
sentences through a question and answer. The
following QA pair, for example, connects the two
sentences in Figure 1 with a RESULT relation.

(1) What is the result of Caesar being assassi-
nated by a group of rebellious senators?(S1) –
A new series of civil wars broke out […](S2)

The annotation process consists of the following
steps: From two consecutive sentences, annotators
are asked to choose a sentence that will be used to
formulate a question. The other sentence functions
as an answer to that question. Next they start
building a question by choosing a question prefix
and by completing the question with content from
the chosen sentence.

Since it is possible to choose either of the two
sentences as question/answer for a specific set of
symmetric relations (i.e., What is the reason a
new series of civil wars broke out?), we consider
both possible formulations as equivalent.

The set of possible question prefixes covers all
PDTB 3.0 senses (excluding belief and speech-act
relations). The direction of the relation sense, e.g.,
arg1-as-denier vs. arg2-as-denier, is determined
by which of the two sentences is chosen for
the question/answer. While Pyatkin et al. (2020)
allowed crowdworkers to form multiple QA pairs
per instance, i.e., annotate more than one discourse
sense per relation, we decided to limit the task to
1 sense per relation per worker. We took this
decision in order for the QA method to be more
comparable to the DC method, which also only
allows the insertion of a single connective.
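The following sketch illustrates how a directed sense could be read off a QA annotation; the prefix inventory and the direction rule are simplified assumptions, not the exact mapping of Pyatkin et al. (2020).

# Illustrative mapping from a question prefix plus the chosen question
# sentence to a directed sense; only a handful of prefixes are shown.
PREFIX_SENSE = {
    "what is the result of": "result",
    "what is the reason": "reason",
    "what is an example of": "instance",
    "what is similar to": "similarity",
    "what provides more details on": "detail",
}
ASYMMETRIC = {"instance", "detail", "denier"}  # direction depends on the arguments

def qa_label(prefix, question_built_from):
    """question_built_from is 'S1' or 'S2', the sentence copied into the question."""
    sense = PREFIX_SENSE[prefix.lower()]
    if sense not in ASYMMETRIC:
        return sense  # symmetric senses: both formulations are equivalent
    # If the question is built from S1, the answer (S2) supplies the instance/detail.
    arg = "arg2" if question_built_from == "S1" else "arg1"
    return f"{arg}-as-{sense}"

print(qa_label("What is an example of", "S2"))  # -> arg1-as-instance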

3 Method

3.1 Data

We annotated 1,200 inter-sentential discourse re-
lations using both the DC and the QA task design.2
De estos 1,200 relaciones, 900 were taken from the
DiscoGeM corpus and 300 from the PDTB 3.0.

DiscoGeM Relations The 900 DiscoGeM in-
stances that were included in the current study
represent different domains: 296 instances were
taken from the subset of DiscoGeM relations that
were taken from Europarl proceedings (written
proceedings of prepared political speech taken
from the Europarl corpus; Koehn, 2005), 304
instances were taken from the literature subset
(narrative text from five English books),3 and 300
instances from the Wikipedia subset of DiscoGeM
(informative text, taken from the summaries of 30
Wikipedia articles). These different genres enable
a cross-genre comparison. This is necessary, given
that prevalence of certain relation types can dif-
fer across genres (Rehbein et al., 2016; Scholman
et al., 2022a; Webber, 2009).

These 900 relations were already labeled using the DC method in DiscoGeM; we additionally collect labels using the QA method for the current study. In addition to crowd-sourced labels using
the DC and QA methods, the Wikipedia subset
was also annotated by three trained annotators.4

2The annotations are available at https://github.com/merelscholman/DiscoGeM.

3Animal Farm by George Orwell, Harry Potter and the
Philosopher’s Stone by J. K. Rowling, The Hitchhikers Guide
to the Galaxy by Douglas Adams, The Great Gatsby by
F. Scott Fitzgerald, and The Hobbit by J. R. R. Tolkien.

4Instances were labeled by two annotators and verified by
a third; Cohen’s κ agreement between the first annotator and


Forty-seven percent of these Wikipedia instances
were labeled with multiple senses by the expert
annotators (i.e., were considered to be ambiguous
or express multiple readings).

PDTB Relations The PDTB relations were
included for the purpose of comparing our an-
notations with traditional PDTB gold standard
annotations. These instances (all inter-sentential)
were selected to represent all relational classes,
randomly sampling at most 15 and at least 2 (for
classes with less than 15 relation instances we
sampled all existing relations) relation instances
per class. The reference labels for the PDTB
instances consist of the original PDTB labels an-
notated as part of the PDTB3 corpus. Only 8% of
these consisted of multiple senses.
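A minimal sketch of this per-class sampling rule, assuming the PDTB relations are grouped by sense class; the treatment of classes with fewer than two instances is our reading of the text above.

import random

def sample_per_class(instances_by_class, cap=15, minimum=2, seed=0):
    """At most `cap` instances per sense class; classes with fewer than `cap`
    instances contribute all of them, provided they have at least `minimum`."""
    rng = random.Random(seed)
    sampled = []
    for sense, items in instances_by_class.items():
        if len(items) < minimum:
            continue
        sampled.extend(rng.sample(items, min(cap, len(items))))
    return sampled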

3.2 Crowdworkers

Crowdworkers were recruited via Prolific using
a selection approach (Scholman et al., 2022) that
has been shown to result in a good trade off be-
tween quality and time/monetary efforts for DR
annotation. Crowdworkers had to meet the fol-
lowing requirements: be native English speakers,
reside in the UK, Ireland, the USA, or Canada, and have
obtained at least an undergraduate degree.

Workers who fulfilled these conditions could participate in an initial recruitment task, for which they were asked to annotate a text with either the DC or QA method and were shown immediate feedback on their performance. Workers with an accuracy ≥ 0.5 on this task were qualified to participate in further tasks. We hence created a unique set of crowdworkers for each method. The DC annotations (collected as part of DiscoGeM) were provided by a final set of 199 selected crowdworkers; QA had a final set of 43 selected crowdworkers.5 Quality was monitored throughout the production data collection and qualifications were adjusted according to performance.

Every instance was annotated by 10 workers per
method. This number was chosen based on parity
with previous research. For example, Snow et al.
(2008) show that a sample of 10 crowdsourced an-
notations per instance yields satisfactory accuracy

the reference label was .82 (88% agreement), and between
the second and the reference label was .96 (97% agreement).
See Scholman et al. (2022a) for additional details.

5The larger set of selected workers in the DC method is
because more data was annotated by DC workers as part of
the creation of DiscoGeM.

for various linguistic annotation tasks. Scholman
and Demberg (2017) found that assigning a new
group of 10 annotators to annotate the same in-
stances resulted in a near-perfect replication of the
connective insertions in an earlier DC study.

Instances were annotated in batches of 20. For QA, one batch took about 20 minutes to complete, and for DC, 7 minutes. Workers were reimbursed about $2.50 and $1.88 per batch, respectively.

3.3 Inter-annotator Agreement

We evaluate the two DR annotation methods by
the inter-annotator agreement (IAA) between the
annotations collected by both methods and IAA
with reference annotations collected from trained
annotators.

Cohen’s kappa (Cohen, 1960) is a metric frequently used to measure IAA. For DR annotations, a Cohen’s kappa of .7 is considered to reflect good IAA (Spooren and Degand, 2010). However, prior research has shown that agreement on implicit relations is more difficult to reach than on explicit relations: Kishimoto et al. (2018) report an F1 of .51 on crowdsourced annotations of implicits using a tagset with 7 level-2 labels; Zikánová et al.
(2019) report κ = .47 (58%) on expert annotations
of implicits using a tagset with 23 level-2 labels;
and Demberg et al. (2019) find that PDTB and
RST-DT annotators agree on the relation sense on
37% of implicit relations. Cohen’s kappa is pri-
marily used for comparison between single labels
and the IAAs reported in these studies are also
based on single aggregated labels.

However, we also want to compare the obtained 10 annotations per instance with our reference labels, which can also contain multiple labels. The comparison becomes less straightforward when there are multiple labels, because the chance of agreement is inflated and partial agreement should be treated differently. We thus measure the IAA between multiple labels in terms of both full and partial agreement rates, as well as the multi-label kappa metric proposed by Marchal et al. (2022). This metric adjusts the multi-label agreements with bootstrapped expected agreement. We consider all the labels annotated by the crowdworkers in each instance, excluding minority labels with only one vote.6

6We assumed there were 10 votes per item and removed
labels with less than 20% of votes, even though in rare cases
there could be 9 or 11 votes. On average, the removed labels represent 24.8% of the votes per item.
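The following sketch shows how the full and partial agreement rates can be computed from the raw votes once single-vote labels are dropped; the bootstrap behind the multi-label kappa of Marchal et al. (2022) is only indicated, not reproduced.

from collections import Counter

def sub_labels(votes, min_votes=2):
    """Keep the labels chosen by at least `min_votes` of the ~10 annotators."""
    counts = Counter(votes)
    return {label for label, c in counts.items() if c >= min_votes}

def item_agreement(qa_votes, dc_votes):
    qa, dc = sub_labels(qa_votes), sub_labels(dc_votes)
    full = qa == dc              # all sub-labels match
    partial = bool(qa & dc)      # at least one sub-label matches
    return full, partial

qa = ["result"] * 4 + ["conjunction"] * 3 + ["succession"] * 2 + ["arg1-as-detail"]
dc = ["result"] * 5 + ["conjunction"] * 4 + ["precedence"]
print(item_agreement(qa, dc))    # (False, True): partial but not full agreement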


                          Europarl    Novel      Wiki.      PDTB       all
Item counts               296         304        300        302        1202
QA sub-labels/item        2.13        2.21       2.26       2.45       2.21
DC sub-labels/item        2.37        2.00       2.09       2.21       2.17
full/+partial agreement   .051/.841   .092/.865  .060/.920  .050/.884  .063/.878
multi-label kappa         .813        .842       .903       .868       .857
JSD                       .505        .492       .482       .510       .497

Table 1: Comparison between the labels obtained by DC vs. QA. Full (or +partial) agreement means all (or at least one sub-label) match(es). Multi-label kappa is adapted from Marchal et al. (2022). JSD is calculated based on the actual distributions of the crowdsourced sub-labels, excluding labels with only one vote (smaller values are better).

In addition, we compare the distributions of the crowdsourced labels using the Jensen-Shannon divergence (JSD), following existing reports (Erk and McCarthy, 2009; Nie et al., 2020; Zhang et al., 2021). Similarly, minority labels with only one vote are excluded. Since distributions are not available in the reference labels, when comparing with the reference labels we evaluate with a JSD based on the flattened distributions of the labels, which means we replace the original distribution of the votes with an even distribution over the labels that have been voted for by more than one annotator. We call this version JSD flat.
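For illustration, a minimal sketch of this comparison (assuming scipy, whose jensenshannon returns the square root of the divergence; the logarithm base used here is an assumption):

import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon

def distribution(votes, labels, min_votes=2, flat=False):
    """Vote counts -> probability vector; labels with < min_votes are dropped.
    With flat=True the surviving labels are spread uniformly (used for JSD flat)."""
    counts = Counter(votes)
    vec = np.array([counts[l] if counts[l] >= min_votes else 0 for l in labels],
                   dtype=float)
    if flat:
        vec = (vec > 0).astype(float)
    return vec / vec.sum()

def jsd(votes_a, votes_b, labels, flat=False):
    p = distribution(votes_a, labels, flat=flat)
    q = distribution(votes_b, labels, flat=flat)
    return jensenshannon(p, q, base=2) ** 2   # squared distance = divergence

labels = ["result", "conjunction", "succession", "arg2-as-detail"]
qa = ["result"] * 4 + ["conjunction"] * 4 + ["succession"] * 2
dc = ["result"] * 6 + ["conjunction"] * 3 + ["arg2-as-detail"]
print(round(jsd(qa, dc, labels), 3))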

As a third perspective on IAA, we report agreement among annotators on an item annotated with QA/DC. Following previous work (Nie et al., 2020), we use the entropy of the soft labels to quantify the uncertainty of the crowd annotation. Here, labels with only one vote are also included, as they contribute to the annotation uncertainty. When calculating the entropy, we use a logarithmic base of n = 29, where n is the number of possible labels. A lower entropy value suggests that the annotators agree with each other more and the annotated label is more certain. As discussed in Section 1, the source of disagreement in annotations can come from the items, the annotators, and the methodology. High entropy across multiple annotations of a specific item within the same annotation task suggests that the item is ambiguous.
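For illustration, a minimal sketch of this entropy computation; the resulting values are on the same scale as those reported in Table 3.

import math
from collections import Counter

def label_entropy(votes, n_labels=29):
    """Entropy of the soft label with logarithm base n_labels, so the value
    lies in [0, 1]; single-vote labels are kept, as they add uncertainty."""
    counts = Counter(votes)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log(p, n_labels) for p in probs)

votes = ["result"] * 4 + ["conjunction"] * 3 + ["succession"] * 2 + ["arg1-as-detail"]
print(round(label_entropy(votes), 3))   # 0.38 for a fairly dispersed item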

4 Results

We first compare the IAA between the two crowdsourced annotations, then we discuss the IAA between DC/QA and the reference annotations, and lastly we perform an analysis based on annotation uncertainty. Here, ‘‘sub-labels’’ of an instance means all relations that have received more than one annotation, and ‘‘label distribution’’ is the distribution of the votes of the sub-labels.

4.1 IAA Between the Methods

Table 1 shows that both methods yield more than
two sub-labels per instance after excluding minor-
ity labels with only one vote. This supports the
idea that multi-sense annotations better capture
the fact that often more than one sense can hold
implicitly between two discourse arguments.

Table 1 also presents the IAA between the labels crowdsourced with QA and DC per domain. The agreement between the two methods is good: The labels assigned by the two methods (or at least one of the sub-labels in case of a multi-label annotation) match for about 88% of the items. This speaks for the fact that both methods are valid, as similar sets of labels are produced.

The full agreement scores, however, are very low. This is expected, as the chance to match on all sub-labels is also very low compared to a single-label setting. The multi-label kappa (which takes chance agreement of multiple labels into account) and the JSD (which compares the distributions of the multiple labels) are hence more suitable. We note that the PDTB gold annotation that we use for evaluation does not assign multiple relations systematically and has a low rate of double labels. This explains why the PDTB subset has a high partial agreement while the JSD ends up being worst.


4.2 IAA Between Crowdsourced and Reference Labels

Table 2 compares the labels crowdsourced by each method and the reference labels, which are available for the Wikipedia and PDTB subsets.


                              Wiki.       PDTB
Item counts                   300         302
Ref. sub-labels/item          1.54        1.08
QA: sub-labels/item           2.26        2.45
    full/+partial agreement   .133/.887   .070/.487
    multi-label kappa         .857        .468
    JSDflat                   .449        .643
DC: sub-labels/item           2.09        2.21
    full/+partial agreement   .110/.853   .103/.569
    multi-label kappa         .817        .524
    JSDflat                   .483        .606

Table 2: Comparison against gold labels for the
QA or DC methods. Since the distribution of the
reference sub-labels is not available, JSD flat is
calculated between uniform distributions of the
sub-labels.

      Europarl   Wikipedia   Novel   PDTB
QA    0.40       0.38        0.38    0.41
DC    0.37       0.34        0.35    0.36

Table 3: Average entropy of the label distribu-
ciones (10 annotations per relation) for QA/DC,
split by domain.

It can be observed that both methods achieve higher full agreement with the reference labels than with each other on both domains. This indicates that the two methods are complementary, with each method better capturing different sense types. In particular, the QA method tends to show higher
agreement with the reference for Wikipedia items,
while the DC annotations show higher agreement
with the reference for PDTB items. This can
possibly be attributed to the development of the
methodologies: The DC method was originally
developed by testing on data from the PDTB in
Yung et al. (2019), whereas the QA method was
developed by testing on data from Wikipedia and
Wikinews in Pyatkin et al. (2020).

4.3 Annotation Uncertainty

Table 3 compares the average entropy of the
soft labels collected by both methods. It can be
observed that the uncertainty among the labels
chosen by the crowdworkers is similar across
domains but always slightly lower for DC. We
further look at the correlation between annotation
uncertainty and cross-method agreement, and find

Figure 2: Correlation between the entropy of the anno-
tations and the JSDflat between the crowdsourced labels
and reference.

that agreement between methods is substantially
higher for those instances where within-method
entropy was low. Similarly, we find that agree-
ment between crowdsourced annotations and gold
labels is highest for those relations, where little
entropy was found in crowdsourcing.

Next, we want to check if the item effect is similar across different methods and domains. Figure 2 shows the correlation between the annotation entropy and the agreement with the reference of each item, for each method, on the Wikipedia/PDTB subsets. It illustrates that the annotations of both methods diverge from the reference more as the uncertainty of the annotation increases. While the effect of uncertainty is similar across methods on the Wikipedia subset, the quality of the QA annotations depends more on the uncertainty compared to the DC annotations on the PDTB subset. This means that method bias also exists on the level of annotation uncertainty and should be taken into account when, for example, entropy is used as a criterion to select reliable annotations.


label               FNQA   FNDC   FPQA   FPDC
conjunction           43     46    203    167
arg2-as-detail        42     62    167    152
precedence            19     18     18     37
arg2-as-denier        38     20     15     47
result                10      5    110    187
contrast               8     17     84     39
arg2-as-instance      10      7     44     57
reason                12     17     54     37
synchronous           20     27     11      5
arg2-as-subst         21     13      1      0
equivalence           22     22      2      1
succession            17     15     24      3
similarity             7      8     15     12
norel                 12     12      0      0
arg1-as-detail         9      8     39     13
disjunction            5      4     10      0
arg1-as-denier         3      3     33     31
arg2-as-manner         2      2      9      0
arg2-as-excpt          2      2      1      0
arg2-as-goal           1      1      5      0
arg2-as-cond           1      1      0      0
arg2-as-negcond        1      1      0      0
arg1-as-goal           1      1      3      0

Table 4: FN and FP counts of each method
grouped by the reference sub-labels.



7The examples are presented in the following format:
italics = argument 1; bolded = argument 2; plain = contexts.
8Similarly, the question ‘‘Despite what . . . ?’’ is easily
confused with ‘‘despite…'', which could explain the frequent
FP of arg1-as-denier by the QA method.

Figure 3: Distribution of the annotation errors by
method. Labels annotated by at least 2 workers are
compared against the reference labels of the Wikipedia
and PDTB items. The relation types are arranged in
descending order of the ‘‘ref. sub-label counts.’’


5 Sources of the Method Bias

In this section, we analyze method bias in terms of the sense labels collected by each method. We also examine the potential limitations of the methods which could have contributed to the bias, and demonstrate how we can utilize information on method bias to crowdsource more reliable labels. Lastly, we provide a cross-domain analysis.

Table 5 presents the confusion matrix of the labels collected by both methods for the most frequent level-2 relations. Figure 3 and Table 4 show the distribution of the true and false positives of the sub-labels. These results show that both methods are biased towards certain DRs. The source of these biases can be categorized into two types, which we will detail in the following subsections.

5.1 Limitation of Natural Language for Annotation

There are limitations to representing DRs in natural language with both QA and DC. For example, the QA method confuses workers when the question phrase contains a connective:7

(2) ‘‘Little tyke,’’ chortled Mr. Dursley as he left the house. He got in his car and backed out of number four’s drive. [QA:SUCCESSION, PRECEDENCE; DC:CONJUNCTION, PRECEDENCE]

In the above example, the majority of the workers formed the question ‘‘After what he left the house?’’, which was likely a confusion with ‘‘What did he do after he left the house?’’. This could explain the frequent confusion between PRECEDENCE and SUCCESSION by QA, resulting in the frequent FPs of SUCCESSION (Figure 3).8

For DC, rare relations which lack a frequently used connective are harder to annotate, for example:

(3) He had made an arrangement with one of the cockerels to call him in the mornings half an
hour earlier than anyone else, and would put
in some volunteer labour at whatever seemed
to be most needed, before the regular day’s
work began. His answer to every problem,
every setback, was ‘‘I will work harder!''
which he had adopted as his personal
motto. [QA:ARG1-AS-INSTANCE; DC:RESULT]

It is difficult to use the DC method to annotate
the ARG1-AS-INSTANCE relation due to a lack of typ-
ical, specific, and context-independent connective
phrases that mark these rare relations, como
‘‘this is an example of …''. Por el contrario, the QA
method allows workers to make a question and
answer pair in the reverse direction, with S1 be-
ing the answer to S2, using the same question
palabras, p.ej., What is an example of the fact that
his answer to every problem […] was ‘‘I will work
harder!''?. This allows workers to label rarer rela-
tion types that were not even uncovered by trained
annotators.

Many common DCs are ambiguous, such as but and and, and can be hard to disambiguate. To address this, the DC method provides workers with unambiguous connectives in the second step. However, these unambiguous connectives are often relatively uncommon and come with different syntactic constraints, depending on whether they are coordinating or subordinating conjunctions or discourse adverbials. Hence, they do not fit in all contexts. Moreover, some of the unambiguous connectives sound very ‘‘heavy’’ and would not be used naturally in a given sentence. For example, however is often inserted in the first step, but can mark multiple relations and is disambiguated in the second step by the choice among on the contrary for CONTRAST, despite for ARG1-AS-DENIER, and despite this for ARG2-AS-DENIER. Despite this was chosen frequently since it can be applied to most contexts. This explains the DC method’s bias towards arg2-as-denier against contrast (Figure 3: most FPs of arg2-as-denier and most FNs of contrast come from DC).

While the QA method also requires workers to
select from a set of question starts, which also
contain infrequent expressions (such as Unless
what..?), workers are allowed to edit the text to
improve the wordings of the questions. This helps
reduce the effect of bias towards more frequent
question prefixes and makes crowdworkers doing
the QA task more likely to choose infrequent
relation senses than those doing the DC task.

5.2 Guideline Underspecification

Jiang and de Marneffe (2022) report that some
disagreements in NLI tasks come from the loose
definition of certain aspects of the task. We
found that both QA and DC also do not give
clear enough instructions in terms of argument
spans. The DRs are annotated at the boundary
of two consecutive sentences but both methods
do not limit workers to annotate DRs that span
exactly the two sentences.

More specifically, the QA method allows the
crowdworkers to form questions by copying spans
from one of the sentences. While this makes sure
that the relation lies locally between two consec-
utive sentences, it also sometimes happens that
workers highlight partial spans and annotate re-
lations that span over parts of the sentences. For example:

(4) I agree with Mr Pirker, and that is probably the only thing I will agree with him on if we do vote on the Ludford report. It is going to be an interesting vote. [QA:ARG2-AS-DETAIL, REASON; DC:CONJUNCTION, RESULT]

In Ex. (4), workers constructed the question
‘‘What provides more details on the vote on the
Ludford report?''. This is similar to the instruc-
tions in PDTB 2.0 and 3.0’s annotation manuals,
specifying that annotators should take minimal
spans which don’t have to span the entire sen-
tence. Other relations should be inferred when
the argument span is expanded to the whole sen-
tence, for example a RESULT relation reflecting that
there is little agreement, which will make the vote
interesting.

Often, a sentence can be interpreted as the elaboration of certain entities in the previous sentence. This could explain why ARG1/2-AS-DETAIL tends to be overlabelled by QA. Figure 3 shows that QA has more than twice as many FP counts for ARG2-AS-DETAIL compared to DC; the contrast is even bigger for ARG1-AS-DETAIL. Yet it is not trivial to filter out such questions that only refer to a part of the sentence, because in some cases the highlighted entity does represent the whole argument span.9 Clearer instructions in the guidelines are desirable.

9Such as ‘‘a few final comments’’ in this example: Ladies
and gentlemen, I would like to make a few final comments.
This is not about the implementation of the habitats
directive.


Similarly, DC does not limit workers to annotating relations between the two sentences; consider:

(5) When two differently-doped regions exist in the same crystal, a semiconductor junction is created. The behavior of charge carriers, which include electrons, ions and electron holes, at these junctions is the basis of diodes, transistors and all modern electronics. [Ref:ARG2-AS-DETAIL; QA:ARG2-AS-DETAIL, CONJUNCTION; DC:CONJUNCTION, RESULT]

In this example, many people inserted as a result, which naturally marks the intra-sentence relation (…is created as a result.). Many relations that are frequent between larger chunks of text are thus potentially spuriously labelled as RESULT. Table 5 shows that the most frequent confusion is between DC’s CAUSE and QA’s CONJUNCTION.10 Within the level-2 CAUSE relation sense, it is the level-3 RESULT relation that turns out to be the main contributor to the observed bias. Figure 3 also shows that most FPs of RESULT come from the DC method.

5.3 Aggregating DR Annotations Based on Method Bias

The qualitative analysis above provides insights on certain method biases observed in the label distributions, such as QA’s bias towards ARG1/2-AS-DETAIL and SUCCESSION and DC’s bias towards CONCESSION and RESULT. Being aware of these biases would allow one to combine the methods: After first labelling all instances with the more cost-effective DC method, RESULT relations, which we know tend to be overlabelled by the DC method, could be re-annotated using the QA method. We simulate this for our data and find that this would increase the partial agreement from 0.853 to 0.913 for Wikipedia and from 0.569 to 0.596 for PDTB.
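A minimal sketch of this simulated combination rule; the per-item data layout and the trigger set are illustrative.

def combine_by_bias(dc_labels, qa_labels, overlabelled_by_dc=frozenset({"result"})):
    """Start from the cheaper DC annotation and fall back to the QA annotation
    for items whose DC sub-labels contain a sense DC is known to over-label."""
    combined = {}
    for item, dc in dc_labels.items():
        combined[item] = qa_labels[item] if dc & overlabelled_by_dc else dc
    return combined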

6 Analysis by Genre

For each of the four genres (Novel, Wikipedia,
Europarl, and WSJ) we have ∼300 implicit DRs
annotated by both DC and QA. Scholman et al.
(2022a) showed, based on the DC method, that
in DiscoGeM, CONJUNCTION is prevalent in the
Wikipedia domain, PRECEDENCE in Literature and
RESULT in Europarl. The QA annotations replicate
this finding, as displayed in Figure 4.

It appears more difficult to obtain agreement
with the majority labels in Europarl than in other
genres, which is reflected in the average entropy
(see Table 3) of the distributions for each genre,
where DC has the highest entropy in the Europarl
domain and QA the second highest (after PDTB).
Table 1 confirms these findings, showing that the
agreement between the two methods is highest for
Wikipedia and lowest for Europarl.

In the latter domain, the DC method results
in more CAUSAL relations: 36% of the CONJUNC-
TIONS labelled by QA are labelled as RESULT
in DC.11 Manual inspection of these DC anno-
tations reveals that workers chose considering
this frequently only in the Europarl subset. This
connective phrase is typically used to mark a
pragmatic result relation, where the result reading
comes from the belief of the speaker (Ex. (4)).
This type of relation is expected to be more fre-
quent in speech and argumentative contexts and is
labelled as RESULT-BELIEF in PDTB3. QA does not
have a question prefix available that could capture
RESULT-BELIEF senses. The RESULT labels obtained
by DC are therefore a better fit with the PDTB3
framework than QA’s CONJUNCTIONS. CONCESSION
is generally more prevalent with the DC method,
especially in Europarl, with 9% compared to 3%
for QA. CONTRAST, on the other hand, seems to be
favored by the QA method, of which most (6%)

11This appeared to be distributed over many annotators

and is thus a true method bias.

Table 5: Confusion matrix for the most frequent level-2 sublabels which were annotated by at least 2 workers per relation; values are represented as colors.


10A chi-squared test confirms that the observed distribution
is significantly different from what could be expected based
on chance disagreement.


Figure 4: Level-2 sublabel counts of all the annotated labels of both methods, split by domain.

CONTRAST relations are found in Wikipedia, compared to 3% for DC. Figure 4 also highlights that
for the QA approach, annotators tend to choose
a wider variety of senses which are rarely ever
annotated by DC, such as PURPOSE, CONDITION, and
MANNER.

We conclude that encyclopedic and literary
texts are the most suitable to be annotated us-
ing either DC or QA, as they show higher
inter-method agreement (and for Wikipedia also
higher agreement with gold). Spoken-language
and argumentative domains on the other hand are
trickier to annotate as they contain more pragmatic
readings of the relations.

7 Case Studies: Effect of Task Design on DR Classification Models

Analysis of the crowdsourced annotations reveals
that the two methods have different biases and
different correlations with domains and the style
(and possibly function) of the language used in
the domains. We now investigate the effect of task
design bias on automatic prediction of implicit
discourse relations. Specifically, we carry out two
case studies to demonstrate the effect that task
design and the resulting label distributions have
on discourse parsing models.

Task and Setup We formulate the task of pre-
dicting implicit discourse relations as follows.
The input to the model consists of two sequences S1
and S2, which represent the arguments of a dis-
course relation. The targets are PDTB 3.0 sense
types (including level-3). This model architecture
is similar to the model for implicit DR prediction
by Shi and Demberg (2019). We experiment with
two different losses and targets: a cross-entropy
loss where the target is a single majority label

and a soft cross-entropy loss where the target is
a probability distribution over the annotated la-
bels. Using the 10 annotations per instance, we
obtain label distributions for each relation, which
we use as soft targets. Training with a soft loss has
been shown to improve generalization in vision
and NLP tasks (Peterson et al., 2019; Uma et al.,
2020). As suggested in Uma et al. (2020), we nor-
malize the sense-distribution over the 30 possible
labels12 with a softmax.

Assume one has a relation with the following
annotations: 4 RESULT, 3 CONJUNCTION, 2 SUCCESSION,
1 ARG1-AS-DETAIL. For the hard loss, the target
would be the majority label: RESULT. For the soft
loss we normalize the counts (every label with no
annotation has a count of 0) using a softmax, for
a smoother distribution without zeros.
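For illustration, a minimal sketch of this target construction, using the 4/3/2/1 counts from the example and a sense inventory truncated to those four labels:

import torch

SENSES = ["result", "conjunction", "succession", "arg1-as-detail"]  # 30 in the paper

def make_targets(vote_counts, senses=SENSES):
    """Hard target = index of the majority label; soft target = softmax over
    the raw counts (labels without votes enter with a count of 0)."""
    counts = torch.tensor([vote_counts.get(s, 0) for s in senses], dtype=torch.float)
    hard = int(counts.argmax())
    soft = torch.softmax(counts, dim=0)   # smooth distribution without zeros
    return hard, soft

hard, soft = make_targets({"result": 4, "conjunction": 3, "succession": 2,
                           "arg1-as-detail": 1})
print(SENSES[hard], [round(float(p), 3) for p in soft])
# result [0.644, 0.237, 0.087, 0.032]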

We fine-tune DeBERTa (deberta-base) (He
et al., 2020) in a sequence classification setup
using the HuggingFace checkpoint (Wolf et al.,
2020). The model trains for 30 epochs with early
stopping and a batch size of 8.
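A sketch of one possible implementation of the soft-loss training step (the hub checkpoint name and the learning rate are assumptions; the hard-loss variant would instead feed the majority-label index to a standard cross-entropy):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "microsoft/deberta-base"            # assumed hub id for deberta-base
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=30)
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

def soft_ce_step(s1_batch, s2_batch, soft_targets):
    """One update: the two arguments are encoded as a sentence pair and the
    loss is the cross-entropy against the label distribution (soft targets)."""
    enc = tok(s1_batch, s2_batch, padding=True, truncation=True, return_tensors="pt")
    logits = model(**enc).logits                                 # (batch, 30)
    loss = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    loss.backward()
    optim.step()
    optim.zero_grad()
    return float(loss)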

Data In addition to the 1,200 instances we analyzed in the current contribution, we additionally use all annotations from DiscoGeM as training data. DiscoGeM, which was annotated with the DC method, adds 2756 Novel relations, 2504 Europarl relations, and 345 Wikipedia relations. We formulate different setups for the case studies.
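A minimal sketch of the Union construction for a single item, which simply adds the two methods' vote counts before they are normalized into a soft target:

from collections import Counter

def union_counts(qa_counts, dc_counts):
    """Per item, add the QA and DC vote counts (e.g., 20 counts instead of 10)."""
    return Counter(qa_counts) + Counter(dc_counts)

qa = {"result": 3, "conjunction": 6, "precedence": 1}
dc = {"result": 6, "conjunction": 3, "arg2-as-detail": 1}
print(union_counts(qa, dc))
# Counter({'result': 9, 'conjunction': 9, 'precedence': 1, 'arg2-as-detail': 1})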

12precedence, arg2-as-detail, conjunction, result, arg1-as-
detail, arg2-as-denier, contrast, arg1-as-denier, synchronous,
reason, arg2-as-instance, arg2-as-cond, arg2-as-subst, simi-
larity, disjunction, succession, arg1-as-goal, arg1-as-instance,
arg2-as-goal, arg2-as-manner, arg1-as-manner, equivalence,
arg2-as-excpt, arg1-as-excpt, arg1-as-cond, differentcon,
norel, arg1-as-negcond, arg2-as-negcond, arg1-as-subst.


               PDTBtest   Wikigold   TED-M.
DC             0.34†      0.65†      0.36
DC Soft        0.29*†     0.70*†     0.34*
QA+DC∩         0.34‡      0.67       0.37
QA+DC∩ Soft    0.38*‡     0.66*      0.31*
QA+DC∪         0.35       0.49       0.36
QA+DC∪ Soft    0.41*      0.67*      0.43*

Table 6: Accuracy of model (with soft vs. hard loss) prediction on gold labels. The model is trained either on DC data (DC), an intersection of DC and QA (∩), or the union of DC and QA (∪). Same symbol in a column indicates a statistically significant (McNemar test) difference in cross-model results.

7.1 Case 1: Incorporating Data from Different Task Designs

The purpose of this study is to see if a model trained on data crowdsourced by the DC/QA methods can generalize to traditionally annotated test sets. We thus test on the 300 Wikipedia relations annotated by experts (Wiki gold), all implicit relations from the test set of PDTB 3.0 (PDTB test), and the implicit relations of the English test set of TED-MDB (Zeyrek et al., 2020). For training data, we either use (1) all of the DiscoGeM annotations (Only DC); or (2) 1200 QA annotations from all four domains, plus 5,605 DC annotations from the rest of DiscoGeM (Intersection, ∩); or (3) 1200 annotations which combine the label counts (e.g., 20 counts instead of 10) of QA and DC, plus 5,605 DC annotations from the rest of DiscoGeM (Union, ∪). We hypothesize that this union will lead to improved results due to the annotation distribution coming from a bigger sample. When testing on Wiki gold, the corresponding subset of Wikipedia relations is removed from the training data. We randomly sampled 30 relations for dev.

Results Table 6 shows how the model generalizes to traditionally annotated data. On the PDTB and the Wikipedia test sets, the model with a soft loss generally performs better than the hard loss model. TED-MDB, on the other hand, only contains a single label per relation, and training with a distributional loss is therefore less beneficial. Mixing DC and QA data only improves in the soft case for PDTB. The merging of the respective method label counts, on the other hand, leads to the best model performance on both PDTB and TED-MDB. On Wikipedia the best performance is obtained when training on soft DC-only distributions. Looking at the label-specific differences in performance, we observe that improvement on the Wikipedia test set mainly comes from better precision and recall when predicting ARG2-AS-DETAIL, while on PDTB QA+DC∩ Soft is better at predicting CONJUNCTION. We conclude that training on data that comes from different task designs does not hurt performance, and even slightly improves performance when using majority vote labels. When training with a distribution, the union setup (∪) seems to work best.
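For illustration, a sketch of one way to run the McNemar test used for the significance markers in Table 6, based on the paired correct/incorrect outcomes of two models (statsmodels is assumed here):

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(pred_a, pred_b, gold):
    """McNemar's test on the 2x2 table of per-item correctness of two models."""
    a_ok = np.asarray(pred_a) == np.asarray(gold)
    b_ok = np.asarray(pred_b) == np.asarray(gold)
    table = [[np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
             [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
    return mcnemar(table, exact=False).pvalue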

7.2 Case 2: Cross-domain vs Cross-method

The purpose of this study is to investigate how cross-domain generalization is affected by method bias. In other words, we want to compare a cross-domain and cross-method setup with a cross-domain and same-method setup. We test on the domain-specific data from the 1,200 instances annotated by QA and DC, respectively, and train on various domain configurations from DiscoGeM (excluding dev and test), together with the extra 300 PDTB instances, annotated by DC.
Table 7 shows the different combinations of data sets we use in this study (columns) as well as the results of in- and cross-domain and in- and cross-method predictions (rows). Both a change in domain and a change in annotation task lead to lower performance. Interestingly, the results show that the task factor has a stronger effect on performance than the domain: When training on DC distributions, the QA test results are worse than the DC test results in all cases. This indicates that task bias is an important factor to consider when training models. Generally, except in the out-of-domain novel test case, training with a soft loss leads to the same or considerably better generalization accuracy than training with a hard loss. We thus confirm the findings of Peterson et al. (2019) and Uma et al. (2020) also for DR classification.

8 Discussion and Conclusion

DR annotation is a notoriously difficult task with
low IAA. Annotations are not only subject to the
interpretation of the coder (Spooren and Degand,


Table 7: Cross-domain and cross-method experiments, using a hard loss vs. a soft loss. Columns show the training setting and rows the test performance. Acc. is for predicting the majority label. JSD compares the predicted distribution (soft) with the target distribution. * indicates cross-method results are not statistically significant (McNemar's test).

2010), but also to the framework (Demberg et al.,
2019). The current study extends these findings
by showing that the task design also crucially af-
fects the output. We investigated the effect of two
distinct crowdsourced DR annotation tasks on the
obtained relation distributions. These two tasks are
unique in that they use natural language to anno-
tate. Even though these designs are more intuitive
to lay individuals, we show that also such natu-
ral language-based annotation designs suffer from
bias and leave room for varying interpretations (as
do traditional annotation tasks).

The results show that both methods have unique biases, but also that both methods are valid, as similar sets of labels are produced. Further, the methods seem to be complementary: Both methods show higher agreement with the reference label than with each other. This indicates that the methods capture different sense types. The results further show that the textual domain can push each method towards different label distributions. Lastly, we simulated how aggregating annotations based on method bias improves agreement.

We suggest several modifications to both meth-
ods for future work. For QA, we recommend to
replace question prefix options which start with
a connective, such as ‘‘After what’’. The revised
options should ideally start with a Wh-question
word, for example, ‘‘What happens after..’’. This
would make the questions sound more natural and
help to prevent confusion with respect to level-3
sense distinctions. For DC, an improved interface
that allows workers to highlight argument spans
could serve as a screen that confirms the rela-
tion is between the two consecutive sentences.
Syntactic constraints making it difficult to insert
certain rare connectives could also be mitigated if

the workers are allowed to make minor edits to
the texts.

Considering that both methods show benefits
and possible downsides, it could be interesting to
combine them for future crowdsourcing efforts.
Given that obtaining DC annotations is cheaper
and quicker, it could make sense to collect DC
annotations on a larger scale and then use the QA
method for a specific subset that shows high label
entropy. Another option would be to merge both
methods, by first letting the crowdworkers insert
a connective and then use QAs for the second
connective-disambiguation step. Lastly, since we
showed that often more than one relation sense can
hold, it would make sense to allow annotators to
write multiple QA pairs or insert multiple possible
connectives for a given relation.

The DR classification experiments revealed that
generalization across data from different task de-
signs is hard, in the DC and QA case even harder
than cross-domain generalization. Furthermore,
we found that merging data distributions com-
ing from different task designs can help boost
performance on data coming from a third source
(traditional annotations). Lastly, we confirmed
that soft modeling approaches using label dis-
tributions can improve discourse classification
actuación.
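The difference between the hard-loss and soft-loss setups can be sketched as follows (a minimal PyTorch-style illustration under assumed data conventions, not the implementation used in the experiments): the hard loss trains against the single majority label, whereas the soft loss matches the full crowd label distribution.

```python
import torch
import torch.nn.functional as F

def hard_loss(logits, target_dist):
    """Cross-entropy against the majority label derived from the crowd distribution."""
    majority = target_dist.argmax(dim=1)
    return F.cross_entropy(logits, majority)

def soft_loss(logits, target_dist):
    """Cross-entropy against the full (soft) crowd label distribution."""
    log_probs = F.log_softmax(logits, dim=1)
    return -(target_dist * log_probs).sum(dim=1).mean()

# Toy batch: 2 items, 4 relation senses.
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0],
                       [0.2, 1.5, 0.3, 0.0]])
target = torch.tensor([[0.5, 0.3, 0.2, 0.0],
                       [0.1, 0.6, 0.2, 0.1]])
print(hard_loss(logits, target).item(), soft_loss(logits, target).item())
```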

Task design bias has been identified as one source of annotation bias and acknowledged as an artifact of the dataset in other linguistic tasks as well (Pavlick and Kwiatkowski, 2019; Jiang and de Marneffe, 2022). Our findings show that the effect of this type of bias can be reduced by training with data collected by multiple methods.

This could also be the case for other NLP tasks, especially those cast in natural language, and comparing their task designs could be an interesting future research direction. We therefore encourage researchers to be more conscious of the biases that crowdsourcing task design introduces.

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (Funder ID: http://dx.doi.org/10.13039/501100001659, grant number SFB1102: Information Density and Linguistic Encoding), by the European Research Council (ERC-StG grant no. 677352), and by the Israel Science Foundation (grant 2827/21), for which we are grateful. We also thank the TACL reviewers and action editors for their thoughtful comments.

References

Rahul Aralikatte, Matthew Lamm, Daniel Hardt, and Anders Søgaard. 2021. Ellipsis resolution as question answering: An evaluation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 810–817, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.68

Lora Aroyo and Chris Welty. 2013. Crowd truth:
Harnessing disagreement in crowdsourcing a
relation extraction gold standard. WebSci2013
ACM, 2013(2013).

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596. https://doi.org/10.1162/coli.07-034-R2

N. Asher. 1993. Reference to Abstract Objects in Discourse, volume 50. Kluwer, Norwell, MA, Dordrecht. https://doi.org/10.1007/978-94-011-1715-9

Valerio Basile, Michael Fell, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, Massimo Poesio, and Alexandra Uma. 2021. We need to consider disagreement in evaluation. In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, pages 15–21, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.bppf-1.3

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1075

Samuel Bowman and George Dahl. 2021. What will it take to fix benchmarking in natural language understanding? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4843–4855, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.385

Sven Buechel and Udo Hahn. 2017a. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/E17-2092

Sven Buechel and Udo Hahn. 2017b. Readers vs. writers vs. texts: Coping with different perspectives of text understanding in emotion annotation. In Proceedings of the 11th Linguistic Annotation Workshop, pages 1–12, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-0801

Lynn Carlson and Daniel Marcu. 2001. Discourse tagging reference manual. ISI Technical Report ISI-TR-545, 54:1–56.

Nancy Chang, Russell Lee-Goldman, and Michael Tseng. 2016. Linguistic wisdom from the crowd. In Third AAAI Conference on Human Computation and Crowdsourcing. https://doi.org/10.1609/hcomp.v3i1.13266

Tongfei Chen, Zheng Ping Jiang, Adam Poliak, Keisuke Sakaguchi, and Benjamin Van Durme. 2020. Uncertain natural language inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8772–8779, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.774

John Joon Young Chung, Jean Y. Song, Sindhu Kutty, Sungsoo Hong, Juho Kim, and Walter S. Lasecki. 2019. Efficient elicitation approaches to estimate collective crowd answers. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–25. https://doi.org/10.1145/3359164

Jacob Cohen. 1960. A coefficient of agreement for
nominal scales. Educational and Psychological
Measurement, 20(1):37–46. https://doi
.org/10.1177/001316446002000104

Alan Cowen, Disa Sauter, Jessica L. Tracy, and Dacher Keltner. 2019. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90. https://doi.org/10.1177/1529100619850176, PubMed: 31313637

Marie-Catherine De Marneffe, Christopher D. Manning, and Christopher Potts. 2012. Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics, 38(2):301–333. https://doi.org/10.1162/COLI_a_00097

Vera Demberg, Merel C. J. Scholman, and Fatemeh Torabi Asr. 2019. How compatible are our discourse annotation frameworks? Insights from mapping RST-DT and PDTB annotations. Dialogue & Discourse, 10(1):87–135. https://doi.org/10.5087/dad.2019.104

Mark Díaz, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. 2018. Addressing age-related bias in sentiment analysis. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–14. https://doi.org/10.1145/3173574.3173986

Anca Dumitrache. 2015. Crowdsourcing disagreement for collecting semantic annotation. In European Semantic Web Conference, pages 701–710. Springer. https://doi.org/10.1007/978-3-319-18818-8_43

Anca Dumitrache, Oana Inel, Lora Aroyo,
Benjamin Timmermans, and Chris Welty. 2018.
CrowdTruth 2.0: Quality metrics for crowd-
sourcing with disagreement. In 1st Workshop
on Subjectivity, Ambiguity and Disagreement
in Crowdsourcing, and Short Paper 1st Work-
shop on Disentangling the Relation Between
Crowdsourcing and Bias Management, SAD+
CrowdBias 2018, pages 11–18. CEUR-WS.

Anca Dumitrache, Oana Inel, Benjamin Timmermans, Carlos Ortiz, Robert-Jan Sips, Lora Aroyo, and Chris Welty. 2021. Empirical methodology for crowdsourcing ground truth. Semantic Web, 12(3):403–421. https://doi.org/10.3233/SW-200415

Yanai Elazar, Victoria Basmov, Yoav Goldberg, and Reut Tsarfaty. 2022. Text-based NP enrichment. Transactions of the Association for Computational Linguistics, 10:764–784. https://doi.org/10.1162/tacl_a_00488

Katrin Erk and Diana McCarthy. 2009. Graded word sense assignment. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 440–449, Singapore. Association for Computational Linguistics.

Elisa Ferracane, Greg Durrett, Junyi Jessy Li, and Katrin Erk. 2021. Did they answer? Subjective acts and intents in conversational discourse. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1626–1644, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.129

Nicholas FitzGerald, Julian Michael, Luheng He, and Luke Zettlemoyer. 2018. Large-scale QA-SRL parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2051–2060, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1191

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations.

Yufang Hou. 2020. Bridging anaphora resolution as question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1428–1438, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.132

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1120–1130, Atlanta, Georgia. Association for Computational Linguistics.

Christoph Hube, Besnik Fetahu, and Ujwal Gadiraju. 2019. Understanding and mitigating worker biases in the crowdsourced collection of subjective judgments. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–12, New York, NY. Association for Computing Machinery.

Terne Sasha Thorn Jakobsen, Maria Barrett, Anders Søgaard, and David Lassen. 2022. The sensitivity of annotator bias to task definitions in argument mining. In Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022, pages 44–61, Marseille, France. European Language Resources Association.

Nanjiang Jiang and Marie-Catherine de Marneffe. 2022. Investigating reasons for disagreement in natural language inference. Transactions of the Association for Computational Linguistics, 10:1357–1374. https://doi.org/10.1162/tacl_a_00523

Youxuan Jiang, Jonathan K. Kummerfeld, and Walter Lasecki. 2017. Understanding task design trade-offs in crowdsourced paraphrase collection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 103–109, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-2017

David Jurgens. 2013. Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 556–562, Atlanta, Georgia. Association for Computational Linguistics.

Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi, and Manabu Sassano. 2014. Rapid development of a corpus with discourse annotations using two-stage crowdsourcing. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 269–278, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Yudai Kishimoto, Shinnosuke Sawada, Yugo Murawaki, Daisuke Kawahara, and Sadao Kurohashi. 2018. Improving crowdsourcing-based annotation of Japanese discourse relations. In LREC.

Wei-Jen Ko, Cutter Dalton, Mark Simmons, Eliza Fisher, Greg Durrett, and Junyi Jessy Li. 2021. Discourse comprehension: A question answering framework to represent sentence connections. arXiv preprint arXiv:2111.00701.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit X, pages 79–86, Phuket, Thailand.

Yiwei Luo, Dallas Card, and Dan Jurafsky. 2020. Detecting stance in media on global warming. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3296–3315, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.296

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse, 8(3):243–281. https://doi.org/10.1515/text.1.1988.8.3.243

Christopher D. Manning. 2006. Local textual inference: It's hard to circumscribe, but you know it when you see it—and NLP needs it.

Marian Marchal, Merel Scholman, Frances Yung, and Vera Demberg. 2022. Establishing annotation quality in multi-label annotations. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3659–3668, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.466

Yixin Nie, Xiang Zhou, and Mohit Bansal. 2020. What can we learn from collective human opinions on natural language inference data? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9131–9143, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.734

Rebecca J. Passonneau and Bob Carpenter. 2014. The benefits of a model of annotation. Transactions of the Association for Computational Linguistics, 2:311–326. https://doi.org/10.1162/tacl_a_00185

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694. https://doi.org/10.1162/tacl_a_00293

Joshua C. Peterson, Ruairidh M. Battleday,
Thomas L. Griffiths, and Olga Russakovsky.
2019. Human uncertainty makes classification
more robust. In Proceedings of the IEEE/CVF
International Conference on Computer Vision,
pages 9617–9626. https://doi.org/10
.1109/ICCV.2019.00971

Barbara Plank, Dirk Hovy, and Anders Søgaard. 2014. Linguistically debatable or just plain wrong? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 507–511, Baltimore, Maryland. Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-2083

Massimo Poesio and Ron Artstein. 2005. The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, pages 76–83. https://doi.org/10.3115/1608829.1608840

Massimo Poesio, Patrick Sturt, Ron Artstein, and Ruth Filik. 2006. Underspecification and anaphora: Theoretical issues and preliminary evidence. Discourse Processes, 42(2):157–175. https://doi.org/10.1207/s15326950dp4202_4

Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, and Mark Diaz. 2021. On releasing annotator-level labels and information in datasets. In Proceedings of The Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop, pages 133–138, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.law-1.14

Valentina Pyatkin, Ayal Klein, Reut Tsarfaty, and Ido Dagan. 2020. QADiscourse - Discourse Relations as QA Pairs: Representation, crowdsourcing and baselines. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2804–2819, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.224

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-2124

Ines Rehbein, Merel Scholman, and Vera Demberg. 2016. Annotating discourse relations in spoken language: A comparison of the PDTB and CCR frameworks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1039–1046, Portorož, Slovenia. European Language Resources Association (ELRA).

Stefan Riezler. 2014. On the problem of theoretical terms in empirical computational linguistics. Computational Linguistics, 40(1):235–245. https://doi.org/10.1162/COLI_a_00182

Hannah Rohde, Anna Dickinson, Nathan Schneider, Christopher Clark, Annie Louis, and Bonnie Webber. 2016. Filling in the blanks in understanding discourse adverbials: Consistency, conflict, and context-dependence in a crowdsourced elicitation task. In Proceedings of the 10th Linguistic Annotation Workshop (LAW X), pages 49–58, Berlin, Germany. https://doi.org/10.18653/v1/W16-1707

Ted J. M. Sanders, Wilbert P. M. S. Spooren, and Leo G. M. Noordman. 1992. Toward a taxonomy of coherence relations. Discourse Processes, 15(1):1–35. https://doi.org/10.1080/01638539209544800

Merel C. J. Scholman and Vera Demberg. 2017. Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task. In Proceedings of the 11th Linguistic Annotation Workshop (LAW), pages 24–33, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-0803

Merel C. J. Scholman, Tianai Dong, Frances Yung, and Vera Demberg. 2022a. DiscoGeM: A crowdsourced corpus of genre-mixed implicit discourse relations. In Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC'22), Marseille, France. European Language Resources Association (ELRA).

Merel C. J. Scholman, Valentina Pyatkin, Frances Yung, Ido Dagan, Reut Tsarfaty, and Vera Demberg. 2022b. Design choices in crowdsourcing discourse relation annotations: The effect of worker selection and training. In Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC'22), Marseille, France. European Language Resources Association (ELRA).

Wei Shi and Vera Demberg. 2019. Learning to explicitate connectives with Seq2Seq network for implicit discourse relation classification. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, pages 188–199, Gothenburg, Sweden. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-0416

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 254–263, Honolulu, Hawaii. Association for Computational Linguistics.

Wilbert P. M. S. Spooren and Liesbeth Degand. 2010. Coding coherence relations: Reliability and validity. Corpus Linguistics and Linguistic Theory, 6(2):241–266. https://doi.org/10.1515/cllt.2010.009

Alexandra Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. 2020. A case for soft loss functions. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 8, pages 173–177. https://doi.org/10.1609/hcomp.v8i1.7478

Alexandra N. Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. 2021. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470. https://doi.org/10.1613/jair.1.12752

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5618

Bonnie Webber. 2009. Genre distinctions for discourse in the Penn TreeBank. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 674–682. https://doi.org/10.3115/1690219.1690240

Bonnie Webber, Rashmi Prasad, Alan Lee, and Aravind Joshi. 2019. The Penn Discourse Treebank 3.0 annotation manual. Philadelphia, University of Pennsylvania.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Frances Yung, Vera Demberg, and Merel Scholman. 2019. Crowdsourcing discourse relation annotations by a two-step connective insertion task. In Proceedings of the 13th Linguistic Annotation Workshop, pages 16–25, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-4003

Deniz Zeyrek, Amália Mendes, Yulia Grishina, Murathan Kurfalı, Samuel Gibbon, and Maciej Ogrodniczuk. 2019. TED multilingual discourse bank (TED-MDB): A parallel corpus annotated in the PDTB style. Language Resources and Evaluation, 1–27.

Deniz Zeyrek, Amália Mendes, Yulia Grishina, Murathan Kurfalı, Samuel Gibbon, and Maciej Ogrodniczuk. 2020. TED multilingual discourse bank (TED-MDB): A parallel corpus annotated in the PDTB style. Language Resources and Evaluation, 54(2):587–613. https://doi.org/10.1007/s10579-019-09445-9

Shujian Zhang, Chengyue Gong, and Eunsol Choi. 2021. Learning with different amounts of annotation: From zero to many labels. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7620–7632, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.601

Šárka Zikánová, Jiří Mírovský, and Pavlína Synková. 2019. Explicit and implicit discourse relations in the Prague Discourse Treebank. In Text, Speech, and Dialogue: 22nd International Conference, TSD 2019, Ljubljana, Slovenia, September 11–13, 2019, Proceedings 22, pages 236–248. Springer. https://doi.org/10.1007/978-3-030-27947-9_20
