Survey Article

Inter-Coder Agreement for
Computational Linguistics

Ron Artstein∗
University of Essex

Massimo Poesio∗∗
University of Essex/Università di Trento

This article is a survey of methods for measuring agreement among corpus annotators. It exposes
the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff’s
alpha as well as Scott’s pi and Cohen’s kappa; discusses the use of coefficients in several annota-
tion tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-
like measures in computational linguistics, may be more appropriate for many corpus annotation
tasks—but that their use makes the interpretation of the value of the coefficient even harder.

1. Introduction and Motivations

Since the mid 1990s, increasing effort has gone into putting semantics and discourse
research on the same empirical footing as other areas of computational linguistics (CL).
This soon led to worries about the subjectivity of the judgments required to create
annotated resources, much greater for semantics and pragmatics than for the aspects of
language interpretation of concern in the creation of early resources such as the Brown
Corpus (Francis and Kucera 1982), the British National Corpus (Leech, Garside, and
Bryant 1994), or the Penn Treebank (Marcus, Marcinkiewicz, and Santorini 1993). Prob-
lems with early proposals for assessing coders’ agreement on discourse segmentation
tasks (such as Passonneau and Litman 1993) led Carletta (1996) to suggest the adoption
of the K coefficient of agreement, a variant of Cohen’s κ (Cohen 1960), as this had already
been used for similar purposes in content analysis for a long time.1 Carletta’s proposals

∗ Now at the Institute for Creative Technologies, University of Southern California, 13274 Fiji Way, Marina
Del Rey, CA 90292.

∗∗ At the University of Essex: Department of Computing and Electronic Systems, University of Essex,
Wivenhoe Park, Colchester, CO4 3SQ, UK. E-mail: poesio@essex.ac.uk. At the University of Trento:
CIMeC, Università degli Studi di Trento, Palazzo Fedrigotti, Corso Bettini, 31, 38068 Rovereto (TN), Italy.
E-mail: massimo.poesio@unitn.it.

1 The literature is full of terminological inconsistencies. Carletta calls the coefficient of agreement she
argues for “kappa,” referring to Krippendorff (1980) and Siegel and Castellan (1988), and using Siegel
and Castellan’s terminology and definitions. However, Siegel and Castellan’s statistic, which they call K,
is actually Fleiss’s generalization to more than two coders of Scott’s π, not of the original Cohen’s κ; to
confuse matters further, Siegel and Castellan use the Greek letter κ to indicate the parameter which is
estimated by K. In what follows, we use κ to indicate Cohen’s original coefficient and its generalization
to more than two coders, and K for the coefficient discussed by Siegel and Castellan.

Submission received: 26 August 2005; revised submission received: 21 December 2007; accepted for
publication: 28 January 2008.

© 2008 Association for Computational Linguistics


were enormously influential, and K quickly became the de facto standard for measuring
agreement in computational linguistics not only in work on discourse (Carletta et al.
1997; Core and Allen 1997; Hearst 1997; Poesio and Vieira 1998; Di Eugenio 2000; Stolcke
et al. 2000; Carlson, Marcu, and Okurowski 2003) but also for other annotation tasks
(e.g., Véronis 1998; Bruce and Wiebe 1998; Stevenson and Gaizauskas 2000; Craggs and
McGee Wood 2004; Mieskes and Strube 2006). In the intervening years, however, a number
of questions have also been raised about K and similar coefficients—some already in
Carletta’s own work (Carletta et al. 1997)—ranging from simple questions about the
way the coefficient is computed (例如, whether it is really applicable when more than
two coders are used), to debates about which levels of agreement can be considered
‘acceptable’ (Di Eugenio 2000; Craggs and McGee Wood 2005), to the realization that K
is not appropriate for all types of agreement (Poesio and Vieira 1998; Marcu, Romera,
and Amorrortu 1999; Di Eugenio 2000; Stevenson and Gaizauskas 2000). Di Eugenio
raised the issue of the effect of skewed distributions on the value of K and pointed out
that the original κ developed by Cohen is based on very different assumptions about
coder bias from the K of Siegel and Castellan (1988), which is typically used in CL. The
issue of annotator bias was further debated in Di Eugenio and Glass (2004) and Craggs
and McGee Wood (2005). Di Eugenio and Glass pointed out that the choice of calculating
chance agreement by using individual coder marginals (κ) or pooled distributions (K)
can lead to reliability values falling on different sides of the accepted 0.67 cutoff point,
and recommended reporting both values. Craggs and McGee Wood argued, following
Krippendorff (2004a, 2004b), that measures like Cohen’s κ are inappropriate for measur-
ing agreement. Finally, Passonneau has been advocating the use of Krippendorff’s α
(Krippendorff 1980, 2004a) for coding tasks in CL which do not involve nominal and
disjoint categories, including anaphoric annotation, wordsense tagging, and summa-
rization (Passonneau 2004, 2006; Nenkova and Passonneau 2004; Passonneau, Habash,
and Rambow 2006).

Now that more than ten years have passed since Carletta’s original presentation
at the workshop on Empirical Methods in Discourse, it is time to reconsider the use
of coefficients of agreement in CL in a systematic way. In this article, a survey of
coefficients of agreement and their use in CL, we have three main goals. First, we discuss
in some detail the mathematics and underlying assumptions of the coefficients used or
mentioned in the CL and content analysis literatures. Second, we also cover in some
detail Krippendorff’s α, often mentioned but never really discussed in detail in previous
CL literature other than in the papers by Passonneau just mentioned. Third, we review
the past ten years of experience with coefficients of agreement in CL, reconsidering the
issues that have been raised also from a mathematical perspective.2

2. Coefficients of Agreement

2.1 Agreement, Reliability, and Validity

We begin with a quick recap of the goals of agreement studies, inspired by Krippendorff
(2004a, Section 11.1). Researchers who wish to use hand-coded data—that is, data in
which items are labeled with categories, whether to support an empirical claim or to
develop and test a computational model—need to show that such data are reliable.

2 Only part of our material could fit in this article. An extended version of the survey is available from

http://cswww.essex.ac.uk/Research/nle/arrau/.



The fundamental assumption behind the methodologies discussed in this article is that
data are reliable if coders can be shown to agree on the categories assigned to units to
an extent determined by the purposes of the study (Krippendorff 2004a; Craggs and
McGee Wood 2005). If different coders produce consistently similar results, then we can
infer that they have internalized a similar understanding of the annotation guidelines,
and we can expect them to perform consistently under this understanding.

Reliability is thus a prerequisite for demonstrating the validity of the coding
scheme—that is, to show that the coding scheme captures the “truth” of the phenom-
enon being studied, in case this matters: If the annotators are not consistent then either
some of them are wrong or else the annotation scheme is inappropriate for the data.
(Just as in real life, the fact that witnesses to an event disagree with each other makes it
difficult for third parties to know what actually happened.) 然而, it is important to
keep in mind that achieving good agreement cannot ensure validity: Two observers of
the same event may well share the same prejudice while still being objectively wrong.

2.2 A Common Notation

It is useful to think of a reliability study as involving a set of items (markables), a
set of categories, and a set of coders (annotators) who assign to each item a unique
category label. The discussions of reliability in the literature often use different notations
to express these concepts. We introduce a uniform notation, which we hope will make
the relations between the different coefficients of agreement clearer.


The set of items is { i | i ∈ I } and is of cardinality i.
The set of categories is { k | k ∈ K } and is of cardinality k.
The set of coders is { c | c ∈ C } and is of cardinality c.

Confusion also arises from the use of the letter P, which is used in the literature with at
least three distinct interpretations, namely “proportion,” “percent,” and “probability.”
We will use the following notation uniformly throughout the article.

Ao is observed agreement and Do is observed disagreement.

Ae and De are expected agreement and expected disagreement,
respectively. The relevant coefficient will be indicated with a superscript
when an ambiguity may arise (e.g., Aπ_e is the expected agreement used for
calculating π, and Aκ_e is the expected agreement used for calculating κ).

P(·) is reserved for the probability of a variable, and ˆP(·) is an estimate of
such probability from observed data.

Finally, we use n with a subscript to indicate the number of judgments of a given type:

nik is the number of coders who assigned item i to category k;

nck is the number of items assigned by coder c to category k;

nk is the total number of items assigned by all coders to category k.
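
To make the notation concrete, all three counts can be computed mechanically from raw
annotation data. The following Python sketch is ours rather than part of the original
article, and it assumes, purely for illustration, that the reliability data are stored as a
mapping from each coder to that coder’s item–category assignments.

from collections import Counter

def judgment_counts(annotations):
    # annotations: {coder: {item: category}} -- a hypothetical representation.
    n_ik = Counter()  # (item, category) -> number of coders assigning category to item
    n_ck = Counter()  # (coder, category) -> number of items the coder assigned to category
    n_k = Counter()   # category -> total number of assignments by all coders
    for coder, labels in annotations.items():
        for item, category in labels.items():
            n_ik[item, category] += 1
            n_ck[coder, category] += 1
            n_k[category] += 1
    return n_ik, n_ck, n_k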


2.3 Agreement Without Chance Correction

The simplest measure of agreement between two coders is percentage of agreement or
observed agreement, defined for example by Scott (1955, page 323) as “the percentage of
judgments on which the two analysts agree when coding the same data independently.”
This is the number of items on which the coders agree divided by the total number
of items. 更确切地说, and looking ahead to the following discussion, observed
agreement is the arithmetic mean of the agreement value agri for all items i ∈ I, defined
as follows:

agri = { 1 if the two coders assign i to the same category
       { 0 if the two coders assign i to different categories

Observed agreement over the values agri for all items i ∈ I is then:

Ao = (1/i) Σ_{i∈I} agri

For example, let us assume a very simple annotation scheme for dialogue acts in
information-seeking dialogues which makes a binary distinction between the categories
statement and info-request, as in the DAMSL dialogue act scheme (Allen and Core
1997). Two coders classify 100 utterances according to this scheme as shown in Table 1.
Percentage agreement for this data set is obtained by summing up the cells on the
diagonal and dividing by the total number of items: Ao = (20 + 50)/100 = 0.7.
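
The same computation can be written in a few lines of Python. The sketch below is
ours (it is not code from the article), and the dictionary encoding of Table 1, keyed by
(Coder A label, Coder B label), is a hypothetical representation chosen for illustration.

# Counts from Table 1: ("stat", "ireq") means Coder A chose statement, Coder B info-request.
table1 = {("stat", "stat"): 20, ("stat", "ireq"): 10,
          ("ireq", "stat"): 20, ("ireq", "ireq"): 50}

def observed_agreement(table):
    # Proportion of items on which the two coders assign the same category.
    total = sum(table.values())
    agreeing = sum(n for (label_a, label_b), n in table.items() if label_a == label_b)
    return agreeing / total

print(observed_agreement(table1))  # 0.7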

Observed agreement enters in the computation of all the measures of agreement we
consider, but on its own it does not yield values that can be compared across studies,
because some agreement is due to chance, and the amount of chance agreement is
affected by two factors that vary from one study to the other. First of all, as Scott (1955,
page 322) points out, “[percentage agreement] is biased in favor of dimensions with a
small number of categories.” In other words, given two coding schemes for the same
现象, the one with fewer categories will result in higher percentage agreement
just by chance. If two coders randomly classify utterances in a uniform manner using
the scheme of Table 1, we would expect an equal number of items to fall in each of the
four cells in the table, and therefore pure chance will cause the coders to agree on half of
the items (the two cells on the diagonal: 1/4 + 1/4). But suppose we want to refine the simple
binary coding scheme by introducing a new category, check, as in the MapTask coding
scheme (Carletta et al. 1997). If two coders randomly classify utterances in a uniform
manner using the three categories in the second scheme, they would only agree on a
third of the items (1/9 + 1/9 + 1/9).

桌子 1
A simple example of agreement on dialogue act tagging.

CODER A

STAT

IREQ

TOTAL

CODER B

STAT
IREQ
TOTAL

20
10
30

20
50
70

40
60
100


The second reason percentage agreement cannot be trusted is that it does not
correct for the distribution of items among categories: We expect a higher percentage
agreement when one category is much more common than the other. This problem,
already raised by Hsu and Field (2003, page 207) among others, can be illustrated
using the following example (Di Eugenio and Glass 2004, Example 3, pages 98–99).
Suppose 95% of utterances in a particular domain are statement, and only 5% are info-
request. We would then expect by chance that 0.95 × 0.95 = 0.9025 of the utterances
would be classified as statement by both coders, and 0.05 × 0.05 = 0.0025 as info-
request, so the coders would agree on 90.5% of the utterances. Under such circum-
stances, a seemingly high observed agreement of 90% is actually worse than expected by
chance.
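
Anticipating the chance-correction formula introduced in Section 2.4, the point can be
made explicit: with expected agreement Ae = 0.9025 + 0.0025 = 0.905 and observed
agreement Ao = 0.90, the chance-corrected value is

(Ao − Ae) / (1 − Ae) = (0.90 − 0.905) / (1 − 0.905) ≈ −0.05

that is, slightly worse than chance.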

The conclusion reached in the literature is that in order to get figures that are compa-
rable across studies, observed agreement has to be adjusted for chance agreement. 这些
are the measures we will review in the remainder of this article. We will not look at the
variants of percentage agreement used in CL work on discourse before the introduction
of kappa, such as percentage agreement with an expert and percentage agreement with
majority; see Carletta (1996) for discussion and criticism.3

2.4 Chance-Corrected Coefficients for Measuring Agreement between Two Coders

All of the coefficients of agreement discussed in this article correct for chance on the
basis of the same idea. First we find how much agreement is expected by chance: 让我们
call this value Ae. The value 1 − Ae will then measure how much agreement over and
above chance is attainable; the value Ao − Ae will tell us how much agreement beyond
chance was actually found. The ratio between Ao − Ae and 1 − Ae will then tell us which
proportion of the possible agreement beyond chance was actually observed. This idea
is expressed by the following formula.

S, π, κ = (Ao − Ae) / (1 − Ae)
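
To make the comparison concrete, the sketch below (ours, not code from the article)
computes S, π, and κ for two coders from a contingency table, using the expected-
agreement models spelled out in Sections 2.4.1–2.4.3; applied to the data of Table 1,
encoded as in the earlier sketch, it reproduces the values reported in Table 2.

def s_pi_kappa(table):
    # table: {(label_a, label_b): count} of judgments by two coders A and B.
    items = sum(table.values())
    categories = {k for pair in table for k in pair}
    a_o = sum(n for (ka, kb), n in table.items() if ka == kb) / items
    # Each coder's observed distribution over categories.
    p_a = {k: sum(n for (ka, _), n in table.items() if ka == k) / items for k in categories}
    p_b = {k: sum(n for (_, kb), n in table.items() if kb == k) / items for k in categories}
    ae_s = 1 / len(categories)                                     # uniform distribution (S)
    ae_pi = sum(((p_a[k] + p_b[k]) / 2) ** 2 for k in categories)  # pooled distribution (pi)
    ae_kappa = sum(p_a[k] * p_b[k] for k in categories)            # individual distributions (kappa)
    def chance_corrected(a_e):
        return (a_o - a_e) / (1 - a_e)
    return {"S": chance_corrected(ae_s),
            "pi": chance_corrected(ae_pi),
            "kappa": chance_corrected(ae_kappa)}

table1 = {("stat", "stat"): 20, ("stat", "ireq"): 10,
          ("ireq", "stat"): 20, ("ireq", "ireq"): 50}
print(s_pi_kappa(table1))  # S = 0.4, pi ≈ 0.341, kappa ≈ 0.348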

The three best-known coefficients, S (Bennett, Alpert, and Goldstein 1954), π (Scott
1955), and κ (Cohen 1960), and their generalizations, all use this formula; however
Krippendorff’s α is based on a related formula expressed in terms of disagreement
(see Section 2.6). All three coefficients therefore yield values of agreement between
−Ae/(1 − Ae) (no observed agreement) and 1 (observed agreement = 1), with the value 0
signifying chance agreement (observed agreement = expected agreement). Note also
that whenever agreement is less than perfect (Ao < 1), chance-corrected agreement will be strictly lower than observed agreement, because some amount of agreement is always expected by chance. Observed agreement Ao is easy to compute, and is the same for all three coefficients—the proportion of items on which the two coders agree. But the notion of chance agreement, or the probability that two coders will classify an arbitrary item as belonging to the same category by chance, requires a model of what would happen if coders’ behavior was only by chance. All three coefficients assume independence of the two coders—that is, that the chance of c1 and c2 agreeing on any given category k 3 The extended version of the article also includes a discussion of why χ2 and correlation coefficients are not appropriate for this task. 559 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 Table 2 The value of different coefficients applied to the data from Table 1. Coefficient Expected agreement Chance-corrected agreement S π κ 2 × ( 1 2 )2 = 0.5 0.352 + 0.652 = 0.545 0.3 × 0.4 + 0.6 × 0.7 = 0.54 (0.7 − 0.5)/(1 − 0.5) = 0.4 (0.7 − 0.545)/(1 − 0.545) ≈ 0.341 (0.7 − 0.54)/(1 − 0.54) ≈ 0.348 Observed agreement for all the coefficients is 0.7. is the product of the chance of each of them assigning an item to that category: P(k|c1) · P(k|c2).4 Expected agreement is then the probability of c1 and c2 agreeing on any category, that is, the sum of the product over all categories: AS e = A π e = A κ e = ∑ k∈K P(k|c1) · P(k|c2) The difference between S, π, and κ lies in the assumptions leading to the calculation of P(k|ci), the chance that coder ci will assign an arbitrary item to category k (Zwick 1988; Hsu and Field 2003). S: π: κ: This coefficient is based on the assumption that if coders were operating by chance alone, we would get a uniform distribution: That is, for any two coders cm, cn and any two categories kj, kl, P(kj|cm) = P(kl|cn). If coders were operating by chance alone, we would get the same distribution for each coder: For any two coders cm, cn and any category k, P(k|cm) = P(k|cn). If coders were operating by chance alone, we would get a separate distribution for each coder. Additionally, the lack of independent prior knowledge of the distribution of items among categories means that the distribution of categories (for π) and the priors for the individual coders (for κ) have to be estimated from the observed data. Table 2 demon- strates the effect of the different chance models on the coefficient values. The remainder of this section explains how the three coefficients are calculated when the reliability data come from two coders; we will discuss a variety of proposed generalizations starting in Section 2.5. 2.4.1 All Categories Are Equally Likely: S. The simplest way of discounting for chance is the one adopted to compute the coefficient S (Bennett, Alpert, and Goldstein 1954), also known in the literature as C, κn, G, and RE (see Zwick 1988; Hsu and Field 2003). As noted previously, the computation of S is based on an interpretation of chance as a random choice of category from a uniform distribution—that is, all categories are equally likely. 
If coders classify the items into k categories, then the chance P(k|ci) of 4 The independence assumption has been the subject of much criticism, for example by John S. Uebersax. http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm. 560 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL any coder assigning an item to category k under the uniformity assumption is 1 the total agreement expected by chance is k ; hence e = ∑ AS k∈K 1 k · 1 k = k · (cid:3) 2 (cid:2) 1 k = 1 k The calculation of the value of S for the figures in Table 1 is shown in Table 2. The coefficient S is problematic in many respects. The value of the coefficient can be artificially increased simply by adding spurious categories which the coders would never use (Scott 1955, pages 322–323). In the case of CL, for example, S would reward designing extremely fine-grained tagsets, provided that most tags are never actually encountered in real data. Additional limitations are noted by Hsu and Field (2003). It has been argued that uniformity is the best model for a chance distribution of items among categories if we have no independent prior knowledge of the distribution (Brennan and Prediger 1981). However, a lack of prior knowledge does not mean that the distribution cannot be estimated post hoc, and this is what the other coefficients do. 2.4.2 A Single Distribution: π. All of the other methods for discounting chance agreement we discuss in this article attempt to overcome the limitations of S’s strong uniformity assumption using an idea first proposed by Scott (1955): Use the actual behavior of the coders to estimate the prior distribution of the categories. As noted earlier, Scott based his characterization of π on the assumption that random assignment of categories to items, by any coder, is governed by the distribution of items among categories in the actual world. The best estimate of this distribution is ˆP(k), the observed proportion of items assigned to category k by both coders. P(k|c1) = P(k|c2) = ˆP(k) ˆP(k), the observed proportion of items assigned to category k by both coders, is the total number of assignments to k by both coders nk, divided by the overall number of assignments, which for the two-coder case is twice the number of items i: ˆP(k) = nk 2i Given the assumption that coders act independently, expected agreement is computed as follows. π e = ∑ A k∈K ˆP(k) · ˆP(k) = ∑ k∈K (cid:5) 2 (cid:4) nk 2i = 1 4i2 ∑ k∈K n2 k It is easy to show that for any set of coding data, Aπ e and therefore π ≤ S, with e the limiting case (equality) obtaining when the observed distribution of items among categories is uniform. ≥ AS 2.4.3 Individual Coder Distributions: κ. The method proposed by Cohen (1960) to calcu- late expected agreement Ae in his κ coefficient assumes that random assignment of categories to items is governed by prior distributions that are unique to each coder, and which reflect individual annotator bias. An individual coder’s prior distribution is 561 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . 
f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 estimated by looking at her actual distribution: P(k|ci), the probability that coder ci will classify an arbitrary item into category k, is estimated by using ˆP(k|ci), the proportion of items actually assigned by coder ci to category k; this is the number of assignments to k by ci, ncik, divided by the number of items i. P(k|ci) = ˆP(k|ci) = ncik i As in the case of S and π, the probability that the two coders c1 and c2 assign an item to a particular category k ∈ K is the joint probability of each coder making this assignment independently. For κ this joint probability is ˆP(k|c1) · ˆP(k|c2); expected agreement is then the sum of this joint probability over all the categories k ∈ K. κ e = ∑ A k∈K ˆP(k|c1) · ˆP(k|c2) = ∑ k∈K nc1k i · nc2k i = 1 i2 ∑ k∈K nc1knc2k It is easy to show that for any set of coding data, Aπ e and therefore π ≤ κ, with the e limiting case (equality) obtaining when the observed distributions of the two coders are identical. The relationship between κ and S is not fixed. ≥ Aκ 2.5 More Than Two Coders In corpus annotation practice, measuring reliability with only two coders is seldom considered enough, except for small-scale studies. Sometimes researchers run reliability studies with more than two coders, measure agreement separately for each pair of coders, and report the average. However, a better practice is to use generalized versions of the coefficients. A generalization of Scott’s π is proposed in Fleiss (1971), and a generalization of Cohen’s κ is given in Davies and Fleiss (1982). We will call these coefficients multi-π and multi-κ, respectively, dropping the multi-prefixes when no confusion is expected to arise.5 2.5.1 Fleiss’s Multi-π. With more than two coders, the observed agreement Ao can no longer be defined as the percentage of items on which there is agreement, because inevitably there will be items on which some coders agree and others disagree. The solution proposed in the literature is to measure pairwise agreement (Fleiss 1971): Define the amount of agreement on a particular item as the proportion of agreeing judgment pairs out of the total number of judgment pairs for that item. Multiple coders also pose a problem for the visualization of the data. When the number of coders c is greater than two, judgments cannot be shown in a contingency table like Table 1, because each coder has to be represented in a separate dimension. 5 Due to historical accident, the terminology in the literature is confusing. Fleiss (1971) proposed a coefficient of agreement for multiple coders and called it κ, even though it calculates expected agreement based on the cumulative distribution of judgments by all coders and is thus better thought of as a generalization of Scott’s π. This unfortunate choice of name was the cause of much confusion in subsequent literature: Often, studies which claim to give a generalization of κ to more than two coders actually report Fleiss’s coefficient (e.g., Bartko and Carpenter 1976; Siegel and Castellan 1988; Di Eugenio and Glass 2004). Since Carletta (1996) introduced reliability to the CL community based on the definitions of Siegel and Castellan (1988), the term “kappa” has been usually associated in this community with Siegel and Castellan’s K, which is in effect Fleiss’s coefficient, that is, a generalization of Scott’s π. 562 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . 
e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL Fleiss (1971) therefore uses a different type of table which lists each item with the num- ber of judgments it received for each category; Siegel and Castellan (1988) use a similar table, which Di Eugenio and Glass (2004) call an agreement table. Table 3 is an example of an agreement table, in which the same 100 utterances from Table 1 are labeled by three coders instead of two. Di Eugenio and Glass (page 97) note that compared to contingency tables like Table 1, agreement tables like Table 3 lose information because they do not say which coder gave each judgment. This information is not used in the calculation of π, but is necessary for determining the individual coders’ distributions in the calculation of κ. (Agreement tables also add information compared to contingency tables, namely, the identity of the items that make up each contingency class, but this information is not used in the calculation of either κ or π.) Let nik stand for the number of times an item i is classified in category k (i.e., the number of coders that make such a judgment): For example, given the distribution in Table 3, nUtt1Stat = 2 and nUtt1IReq = 1. Each category k contributes (nik 2 ) pairs of agreeing judgments for item i; the amount of agreement agri for item i is the sum of (nik 2 ) over all categories k ∈ K, divided by (c 2), the total number of judgment pairs per item. agri = (cid:3) (cid:2) nik 2 1 (c 2) ∑ k∈K = 1 c(c − 1) ∑ k∈K nik(nik − 1) For example, given the results in Table 3, we find the agreement value for Utterance 1 as follows. agr1 = 1 (3 2) (cid:2)(cid:2) (cid:3) (cid:2) + nUtt1Stat 2 nUtt1IReq 2 (cid:3)(cid:3) = 1 3 (1 + 0) ≈ 0.33 The overall observed agreement is the mean of agri for all items i ∈ I. Ao = 1 i ∑ i∈I agri = 1 ic(c − 1) ∑ i∈I ∑ k∈K nik(nik − 1) (Notice that this definition of observed agreement is equivalent to the mean of the two-coder observed agreement values from Section 2.4 for all coder pairs.) If observed agreement is measured on the basis of pairwise agreement (the pro- portion of agreeing judgment pairs), it makes sense to measure expected agreement in terms of pairwise comparisons as well, that is, as the probability that any pair of judg- ments for an item would be in agreement—or, said otherwise, the probability that two Table 3 Agreement table with three coders. STAT IREQ 2 0 1 1 3 2 90 (0.3) 210 (0.7) Utt1 Utt2 ... Utt100 TOTAL 563 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 arbitrary coders would make the same judgment for a particular item by chance. This is the approach taken by Fleiss (1971). Like Scott, Fleiss interprets “chance agreement” as the agreement expected on the basis of a single distribution which reflects the combined judgments of all coders, meaning that expected agreement is calculated using ˆP(k), the overall proportion of items assigned to category k, which is the total number of such assignments by all coders nk divided by the overall number of assignments. The latter, in turn, is the number of items i multiplied by the number of coders c. 
ˆP(k) = 1 ic nk As in the two-coder case, the probability that two arbitrary coders assign an item to a particular category k ∈ K is assumed to be the joint probability of each coder making this assignment independently, that is ( ˆP(k))2. The expected agreement is the sum of this joint probability over all the categories k ∈ K. π e = ∑ A k∈K (cid:6) (cid:7)2 = ∑ ˆP(k) k∈K (cid:2) (cid:3) 2 = 1 ic nk 1 (ic)2 ∑ k∈K n2 k Multi-π is the coefficient that Siegel and Castellan (1988) call K. 2.5.2 Multi-κ. It is fairly straightforward to adapt Fleiss’s proposal to generalize Cohen’s κ proper to more than two coders, calculating expected agreement based on individual coder marginals. A detailed proposal can be found in Davies and Fleiss (1982), or in the extended version of this article. 2.6 Krippendorff’s α and Other Weighted Agreement Coefficients A serious limitation of both π and κ is that all disagreements are treated equally. But especially for semantic and pragmatic features, disagreements are not all alike. Even for the relatively simple case of dialogue act tagging, a disagreement between an accept and a reject interpretation of an utterance is clearly more serious than a disagreement between an info-request and a check. For tasks such as anaphora resolution, where reliability is determined by measuring agreement on sets (coreference chains), allowing for degrees of disagreement becomes essential (see Section 4.4). Under such circum- stances, π and κ are not very useful. In this section we discuss two coefficients that make it possible to differentiate between types of disagreements: α (Krippendorff 1980, 2004a), which is a coefficient defined in a general way that is appropriate for use with multiple coders, different magnitudes of disagreement, and missing values, and is based on assumptions similar to those of π; and weighted kappa κw (Cohen 1968), a generalization of κ. 2.6.1 Krippendorff’s α. The coefficient α (Krippendorff 1980, 2004a) is an extremely ver- satile agreement coefficient based on assumptions similar to π, namely, that expected agreement is calculated by looking at the overall distribution of judgments without regard to which coders produced these judgments. It applies to multiple coders, and it allows for different magnitudes of disagreement. When all disagreements are con- sidered equal it is nearly identical to multi-π, correcting for small sample sizes by using an unbiased estimator for expected agreement. In this section we will present 564 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL Krippendorff’s α and relate it to the other coefficients discussed in this article, but we will start with α’s origins as a measure of variance, following a long tradition of using variance to measure reliability (see citations in Rajaratnam 1960; Krippendorff 1970). A sample’s variance s2 is defined as the sum of square differences from the mean SS = ∑(x − ¯x)2 divided by the degrees of freedom df . Variance is a useful way of looking at agreement if coders assign numerical values to the items, as in magnitude estimation tasks. Each item in a reliability study can be considered a separate level in a single-factor analysis of variance: The smaller the variance around each level, the higher the reliability. 
When agreement is perfect, the variance within the levels (s2 within) is zero; when agreement is at chance, the variance within the levels is equal to the variance between the levels, in which case it is also equal to the overall variance of the data: s2 within/s2 total are therefore 0 when agreement is perfect and 1 when agreement is at chance. Additionally, ≤ SStotal by definition, and df total < 2 df within the latter ratio is bounded at 2: SSwithin because each item has at least two judgments. Subtracting the ratio s2 total from 1 yields a coefficient which ranges between −1 and 1, where 1 signifies perfect agreement and 0 signifies chance agreement. between (that is, 1/F) and s2 total. The ratios s2 between = s2 within = s2 within/s2 within/s2 α = 1 − s2 within s2 total = 1 − SSwithin/df within SStotal/df total We can unpack the formula for α to bring it to a form which is similar to the other coefficients we have looked at, and which will allow generalizing α beyond simple numerical values. The first step is to get rid of the notion of arithmetic mean which lies at the heart of the measure of variance. We observe that for any set of numbers x1, . . . , xN with a mean ¯x = 1 n=1 xn, the sum of square differences from the mean SS can be N expressed as the sum of square of differences between all the (ordered) pairs of numbers, scaled by a factor of 1/2N. ∑N SS = N ∑ n=1 (xn − ¯x)2 = 1 2N N ∑ n=1 N ∑ m=1 (xn − xm)2 For calculating α we considered each item to be a separate level in an analysis of variance; the number of levels is thus the number of items i, and because each coder marks each item, the number of observations for each item is the number of coders c. Within-level variance is the sum of the square differences from the mean of each item, SSwithin = ∑i ∑c(xic − ¯xi)2, divided by the degrees of freedom df within = i(c − 1). We can express this as the sum of the squares of the differences between all of the judgment pairs for each item, summed over all items and scaled by the appropriate factor. We use the notation xic for the value given by coder c to item i, and ¯xi for the mean of all the values given to item i. s2 within = SSwithin df within = 1 i(c − 1) ∑ i∈I ∑ c∈C (xic − ¯xi)2 = 1 2ic(c − 1) c ∑ m=1 c ∑ n=1 ∑ i∈I (xicm − xicn )2 The total variance is the sum of the square differences of all judgments from the grand mean, SStotal = ∑i ∑c(xic − ¯x)2, divided by the degrees of freedom df total = ic − 1. This 565 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 can be expressed as the sum of the squares of the differences between all of the judg- ments pairs without regard to items, again scaled by the appropriate factor. The notation ¯x is the overall mean of all the judgments in the data. s2 total = SStotal df total = 1 ic − 1 ∑ i∈I ∑ c∈C (xic − ¯x)2 = 1 2ic(ic − 1) i ∑ j=1 c ∑ m=1 i ∑ l=1 c ∑ n=1 (xijcm − xil cn )2 Now that we have removed references to means from our formulas, we can abstract over the measure of variance. We define a distance function d which takes two numbers and returns the square of their difference. dab = (a − b)2 We also simplify the computation by counting all the identical value assignments together. Each unique value used by the coders will be considered a category k ∈ K. 
We use nik for the number of times item i is given the value k, that is, the number of coders that make such a judgment. For every (ordered) pair of distinct values ka, kb ∈ K there are nika nikb pairs of judgments of item i, whereas for non-distinct values there − 1) pairs. We use this notation to rewrite the formula for the within-level are nika (nika variance. Dα o, the observed disagreement for α, is defined as twice the variance within the levels in order to get rid of the factor 2 in the denominator; we also simplify the formula by using the multiplier nika nika for identical categories—this is allowed because dkk = 0 for all k. D α o = 2 s2 within = 1 ic(c − 1) k ∑ j=1 k ∑ l=1 ∑ i∈I nikj nikl dkjkl We perform the same simplification for the total variance, where nk stands for the total number of times the value k is assigned to any item by any coder. The expected disagreement for α, Dα e , is twice the total variance. D α e = 2 s2 total = 1 ic(ic − 1) k ∑ j=1 k ∑ l=1 nkj nkl dkjkl Because both expected and observed disagreement are twice the respective vari- ances, the coefficient α retains the same form when expressed with the disagreement values. α = 1 − Do De Now that α has been expressed without explicit reference to means, differences, and squares, it can be generalized to a variety of coding schemes in which the labels cannot be interpreted as numerical values: All one has to do is to replace the square difference function d with a different distance function. Krippendorff (1980, 2004a) offers distance metrics suitable for nominal, interval, ordinal, and ratio scales. Of particular interest is 566 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL the function for nominal categories, that is, a function which considers all distinct labels equally distant from one another. (cid:1) 0 if a = b 1 if a (cid:12)= b dab = It turns out that with this distance function, the observed disagreement Dα o is exactly the complement of the observed agreement of Fleiss’s multi-π, 1 − Aπ o , and the expected disagreement Dα e by a factor of (ic − 1)/ic; the difference is due to the fact that π uses a biased estimator of the expected agreement in the population whereas α uses an unbiased estimator. The following equation shows that given the correspondence between observed and expected agreement and disagreement, the co- efficients themselves are nearly equivalent. e differs from 1 − Aπ α = 1 − Dα o Dα e ≈ 1 − 1 − Aπ o 1 − Aπ e = − (1 − Aπ o ) 1 − Aπ e 1 − Aπ e = Aπ − Aπ e o 1 − Aπ e = π For nominal data, the coefficients π and α approach each other as either the number of items or the number of coders approaches infinity. Krippendorff’s α will work with any distance metric, provided that identical cat- egories always have a distance of zero (dkk = 0 for all k). Another useful constraint is symmetry (dab = dba for all a, b). This flexibility affords new possibilities for analysis, which we will illustrate in Section 4. We should also note, however, that the flexibility also creates new pitfalls, especially in cases where it is not clear what the natural dis- tance metric is. 
For example, there are different ways to measure dissimilarity between sets, and any of these measures can be justifiably used when the category labels are sets of items (as in the annotation of anaphoric relations). The different distance metrics yield different values of α for the same annotation data, making it difficult to interpret the resulting values. We will return to this problem in Section 4.4. 2.6.2 Cohen’s κw. A weighted variant of Cohen’s κ is presented in Cohen (1968). The implementation of weights is similar to that of Krippendorff’s α—each pair of cate- gories ka, kb ∈ K is associated with a weight dkakb , where a larger weight indicates more disagreement (Cohen uses the notation v; he does not place any general constraints on the weights—not even a requirement that a pair of identical categories have a weight of zero, or that the weights be symmetric across the diagonal). The coefficient is defined for two coders: The disagreement for a particular item i is the weight of the pair of categories assigned to it by the two coders, and the overall observed disagreement is the (normalized) mean disagreement of all the items. Let k(cn, i) denote the category assigned by coder cn to item i; then the disagreement for item i is disagri = dk(c1,i)k(c2,i). The observed disagreement Do is the mean of disagri for all items i, normalized to the interval [0, 1] through division by the maximal weight dmax. D κw o = 1 dmax 1 i ∑ i∈I disagri = 1 dmax 1 i ∑ i∈I dk(c1,i)k(c2,i) If we take all disagreements to be of equal weight, that is dkaka = 0 for all categories ka = 1 for all ka (cid:12)= kb, then the observed disagreement is exactly the complement and dkakb of the observed agreement as calculated in Section 2.4: D κw o = 1 − Aκ o. 567 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 Like κ, the coefficient κw interprets expected disagreement as the amount expected by chance from a distinct probability distribution for each coder. These individual distributions are estimated by ˆP(k|c), the proportion of items assigned by coder c to category k, that is the number of such assignments nck divided by the number of items i. ˆP(k|c) = 1 i nck The probability that coder c1 assigns an item to category ka and coder c2 assigns it to category kb is the joint probability of each coder making this assignment independently, namely, ˆP(ka|c1) ˆP(kb|c2). The expected disagreement is the mean of the weights for all (ordered) category pairs, weighted by the probabilities of the category pairs and normalized to the interval [0, 1] through division by the maximal weight. D κw e = 1 dmax k ∑ j=1 k ∑ l=1 ˆP(kj|c1) ˆP(kl|c2)dkjkl = 1 dmax 1 i2 k ∑ j=1 k ∑ l=1 nc1kj nc2kl dkjkl If we take all disagreements to be of equal weight then the expected disagreement is exactly the complement of the expected agreement for κ as calculated in Section 2.4: D κw e = 1 − Aκ e. Finally, the coefficient κw itself is the ratio of observed disagreement to expected disagreement, subtracted from 1 in order to yield a final value in terms of agreement. κw = 1 − Do De l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . 
e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o 2.7 An Integrated Example We end this section with an example illustrating how all of the agreement coefficients just discussed are computed. To facilitate comparisons, all computations will be based on the annotation statistics in Table 4. This confusion matrix reports the results of an experiment where two coders classify a set of utterances into three categories. 2.7.1 The Unweighted Coefficients. Observed agreement for all of the unweighted coeffi- cients (S, κ, and π) is calculated by counting the items on which the coders agree (the l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Table 4 An integrated coding example. CODER A STAT IREQ CHCK TOTAL STAT IREQ CODER B CHCK TOTAL 46 0 0 46 6 32 6 44 0 0 10 10 52 32 16 100 568 Artstein and Poesio Inter-Coder Agreement for CL figures on the diagonal of the confusion matrix in Table 4) and dividing by the total number of items. Ao = 46 + 32 + 10 100 = 0.88 The expected agreement values and the resulting values for the coefficients are shown in Table 5. The values of π and κ are very similar, which is to be expected when agreement is high, because this implies similar marginals. Notice that Aκ e , hence κ > 圆周率; 这
reflects a general property of κ and π, already mentioned in Section 2.4, which will be
elaborated in Section 3.1.

e < Aπ 2.7.2 Weighted Coefficients. Suppose we notice that whereas Statement and Info- Request are clearly distinct classifications, Check is somewhere between the two. We therefore opt to weigh the distances between the categories as follows (recall that 1 denotes maximal disagreement, and identical categories are in full agreement and thus have a distance of 0). Statement Info-Request Check Statement Info-Request Check 1 0 0.5 0 1 0.5 0.5 0.5 0 The observed disagreement is calculated by summing up all the cells in the contingency table, multiplying each cell by its respective weight, and dividing the total by the number of items (in the following calculation we ignore cells with zero items). Do = 46 × 0 + 6 × 1 + 32 × 0 + 6 × 0.5 + 10 × 0 100 = 6 + 3 100 = 0.09 The only sources of disagreement in the coding example of Table 4 are the six utterances marked as Info-Requests by coder A and Statements by coder B, which receive the maximal weight of 1, and the six utterances marked as Info-Requests by coder A and Checks by coder B, which are given a weight of 0.5. The calculation of expected disagreement for the weighted coefficients is shown in Table 6, and is the sum of the expected disagreement for each category pair multiplied Table 5 Unweighted coefficients for the data from Table 4. Expected agreement Chance-corrected agreement l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 S π κ 3 )2 = 1 3 = 0.4014 .46 × .52 + .44 × .32 + .1 × .16 = 0.396 3 × ( 1 + 0.10+0.16 2 + 0.44+0.32 2 0.46+0.52 2 (0.88 − 1 3 )/(1 − 1 3 ) = 0.82 (0.88 − 0.4014)/(1 − 0.4014) ≈ 0.7995 (0.88 − 0.396)/(1 − 0.396) ≈ 0.8013 569 Computational Linguistics Volume 34, Number 4 Table 6 Expected disagreement of the weighted coefficients for the data from Table 4. Dα e κw e D (46+52)×(46+52) 2×100×(2×100−1) + (46+52)×(44+32) 2×100×(2×100−1) + (46+52)×(10+16) 2×100×(2×100−1) 46×52 100×100 + 46×32 100×100 + 46×16 100×100 × 0 + 44×52 100×100 × 1 + 44×32 100×100 2 + 44×16 × 1 100×100 × 1 × 0 + (44+32)×(46+52) 2×100×(2×100−1) × 1 + (44+32)×(44+32) 2×100×(2×100−1) 2 + (44+32)×(10+16) 2×100×(2×100−1) × 1 + 10×52 × 1 100×100 2 × 0 + 10×32 × 1 100×100 2 2 + 10×16 × 0 × 1 100×100 × 1 + (10+16)×(46+52) 2×100×(2×100−1) × 0 + (10+16)×(44+32) 2×100×(2×100−1) 2 + (10+16)×(10+16) 2×100×(2×100−1) × 1 × 1 2 × 1 2 × 0 0.4879 0.49 by its weight. The value of the weighted coefficients is given by the formula 1 − Do De α ≈ 1 − 0.09 0.4879 ≈ 0.8156, and κw = 1 − 0.09 0.49 ≈ 0.8163. , so 3. Bias and Prevalence Two issues recently raised by Di Eugenio and Glass (2004) concern the behavior of agreement coefficients when the annotation data are severely skewed. One issue, which Di Eugenio and Glass call the bias problem, is that π and κ yield quite different numerical values when the annotators’ marginal distributions are widely divergent; the other issue, the prevalence problem, is the exceeding difficulty in getting high agreement values when most of the items fall under one category. Looking at these two problems in detail is useful for understanding the differences between the coefficients. 3.1 Annotator Bias The difference between π and α on the one hand and κ on the other hand lies in the interpretation of the notion of chance agreement, whether it is the amount expected from the the actual distribution of items among categories (π) or from individual coder priors (κ). 
As mentioned in Section 2.4, this difference has been the subject of much debate (Fleiss 1975; Krippendorff 1978, 2004b; Byrt, Bishop, and Carlin 1993; Zwick 1988; Hsu and Field 2003; Di Eugenio and Glass 2004; Craggs and McGee Wood 2005). A claim often repeated in the literature is that single-distribution coefficients like π and α assume that different coders produce similar distributions of items among categories, with the implication that these coefficients are inapplicable when the anno- tators show substantially different distributions. Recommendations vary: Zwick (1988) suggests testing the individual coders’ distributions using the modified χ2 test of Stuart (1955), and discarding the annotation as unreliable if significant systematic discrepan- cies are observed. In contrast, Hsu and Field (2003, page 214) recommend reporting the value of κ even when the coders produce different distributions, because it is “the only [index] . . . that could legitimately be applied in the presence of marginal hetero- geneity”; likewise, Di Eugenio and Glass (2004, page 96) recommend using κ in “the vast majority . . . of discourse- and dialogue-tagging efforts” where the individual coders’ distributions tend to vary. All of these proposals are based on a misconception: that 570 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL single-distribution coefficients require similar distributions by the individual annota- tors in order to work properly. This is not the case. The difference between the coeffi- cients is only in the interpretation of “chance agreement”: π-style coefficients calculate the chance of agreement among arbitrary coders, whereas κ-style coefficients calcu- late the chance of agreement among the coders who produced the reliability data. There- fore, the choice of coefficient should not depend on the magnitude of the divergence between the coders, but rather on the desired interpretation of chance agreement. Another common claim is that individual-distribution coefficients like κ “reward” annotators for disagreeing on the marginal distributions. For example, Di Eugenio and Glass (2004, page 99) say that κ suffers from what they call the bias problem, described as “the paradox that κCo [our κ] increases as the coders become less similar.” Similar reservations about the use of κ have been noted by Brennan and Prediger (1981) and Zwick (1988). However, the bias problem is less paradoxical than it sounds. Although it is true that for a fixed observed agreement, a higher difference in coder marginals implies a lower expected agreement and therefore a higher κ value, the conclusion that κ penalizes coders for having similar distributions is unwarranted. This is because Ao and Ae are not independent: Both are drawn from the same set of observations. What κ does is discount some of the disagreement resulting from different coder marginals by incorporating it into Ae. Whether this is desirable depends on the application for which the coefficient is used. The most common application of agreement measures in CL is to infer the reliability of a large-scale annotation, where typically each piece of data will be marked by just one coder, by measuring agreement on a small subset of the data which is annotated by multiple coders. 
In order to make this generalization, the measure must reflect the reliability of the annotation procedure, which is independent of the actual annotators used. Reliability, or reproducibility of the coding, is reduced by all disagreements—both random and systematic. The most appropriate measures of reliability for this purpose are therefore single-distribution coefficients like π and α, which generalize over the individual coders and exclude marginal disagreements from the expected agreement. This argument has been presented recently in much detail by Krippendorff (2004b) and reiterated by Craggs and McGee Wood (2005). At the same time, individual-distribution coefficients like κ provide important in- formation regarding the trustworthiness (validity) of the data on which the annotators agree. As an intuitive example, think of a person who consults two analysts when deciding whether to buy or sell certain stocks. If one analyst is an optimist and tends to recommend buying whereas the other is a pessimist and tends to recommend selling, they are likely to agree with each other less than two more neutral analysts, so overall their recommendations are likely to be less reliable—less reproducible—than those that come from a population of like-minded analysts. This reproducibility is measured by π. But whenever the optimistic and pessimistic analysts agree on a recommendation for a particular stock, whether it is “buy” or “sell,” the confidence that this is indeed the right decision is higher than the same advice from two like-minded analysts. This is why κ “rewards” biased annotators: it is not a matter of reproducibility (reliability) but rather of trustworthiness (validity). Having said this, we should point out that, first, in practice the difference between π and κ doesn’t often amount to much (see discussion in Section 4). Moreover, the difference becomes smaller as agreement increases, because all the points of agreement contribute toward making the coder marginals similar (it took a lot of experimentation to create data for Table 4 so that the values of π and κ would straddle the conventional cutoff point of 0.80, and even so the difference is very small). Finally, one would expect 571 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 the difference between π and κ to diminish as the number of coders grows; this is shown subsequently.6 We define B, the overall annotator bias in a particular set of coding data, as the difference between the expected agreement according to (multi)-π and the expected agreement according to (multi)-κ. Annotator bias is a measure of variance: If we take c to be a random variable with equal probabilities for all coders, then the annotator bias B is the sum of the variances of P(k|c) for all categories k ∈ K, divided by the number of coders c less one (see Artstein and Poesio [2005] for a proof). π B = A e − A κ e = 1 c − 1 ∑ k∈K σ2 ˆP(k|c) Annotator bias can be used to express the difference between κ and π. κ − π = Ao − (Aπ e 1 − (Aπ e − B) − B) − Ao − Aπ e 1 − Aπ e = B · (1 − Ao) (1 − Aκ e)(1 − Aπ e ) This allows us to make the following observations about the relationship between π and κ. Observation 1. 
The difference between κ and π grows as the annotator bias grows: For a constant Ao and Aπ e , a greater B implies a greater value for κ − π. Observation 2. The greater the number of coders, the lower the annotator bias B, and hence the lower the difference between κ and π, because the variance of ˆP(k|c) does not increase in proportion to the number of coders. In other words, provided enough coders are used, it should not matter whether a single-distribution or individual-distribution coefficient is used. This is not to imply that multiple coders increase reliability: The variance of the individual coders’ distributions can be just as large with many coders as with few coders, but its effect on the value of κ decreases as the number of coders grows, and becomes more similar to random noise. The same holds for weighted measures too; see the extended version of this article for definitions and proof. In an annotation study with 18 subjects, we compared α with a variant which uses individual coder distributions to calculate expected agreement, and found that the values never differed beyond the third decimal point (Poesio and Artstein 2005). We conclude with a summary of our views concerning the difference between π- style and κ-style coefficients. First of all, keep in mind that empirically the difference is small, and gets smaller as the number of annotators increases. Then instead of reporting two coefficients, as suggested by Di Eugenio and Glass (2004), the appropriate coefficient should be chosen based on the task (not on the observed differences between coder marginals). When the coefficient is used to assess reliability, a single-distribution coefficient like π or α should be used; this is indeed already the practice in CL, because Siegel and Castellan’s K is identical with (multi-)π. It is also good practice to test 6 Craggs and McGee Wood (2005) also suggest increasing the number of coders in order to overcome individual annotator bias, but do not provide a mathematical justification. 572 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL reliability with more than two coders, in order to reduce the likelihood of coders sharing a deviant reading of the annotation guidelines. 3.2 Prevalence We touched upon the matter of skewed data in Section 2.3 when we motivated the need for chance correction: If a disproportionate amount of the data falls under one category, then the expected agreement is very high, so in order to demonstrate high reliability an even higher observed agreement is needed. This leads to the so-called paradox that chance-corrected agreement may be low even though Ao is high (Cicchetti and Feinstein 1990; Feinstein and Cicchetti 1990; Di Eugenio and Glass 2004). Moreover, when the data are highly skewed in favor of one category, the high agreement also corresponds to high accuracy: If, say, 95% of the data fall under one category label, then random coding would cause two coders to jointly assign this category label to 90.25% of the items, and on average 95% of these labels would be correct, for an overall accuracy of at least 85.7%. 
This leads to the surprising result that when data are highly skewed, coders may agree on a high proportion of items while producing annotations that are indeed correct to a high degree, yet the reliability coefficients remain low. (For an illustration, see the discussion of agreement results on coding discourse segments in Section 4.3.1.) This surprising result is, however, justified. Reliability implies the ability to dis- tinguish between categories, but when one category is very common, high accuracy and high agreement can also result from indiscriminate coding. The test for reliabil- ity in such cases is the ability to agree on the rare categories (regardless of whether these are the categories of interest). Indeed, chance-corrected coefficients are sensitive to agreement on rare categories. This is easiest to see with a simple example of two coders and two categories, one common and the other one rare; to further simplify the calculation we also assume that the coder marginals are identical, so that π and κ yield the same values. We can thus represent the judgments in a contingency table with just two parameters: (cid:6) is half the proportion of items on which there is disagreement, and δ is the proportion of agreement on the Rare category. Both of these proportions are assumed to be small, so the bulk of the items (a proportion of 1 − (δ + 2(cid:6))) are labeled with the Common category by both coders (Table 7). From this table we can calculate Ao = 1 − 2(cid:6) and Ae = 1 − 2(δ + (cid:6)) + 2(δ + (cid:6))2, as well as π and κ. π, κ = 1 − 2(cid:6) − (1 − 2(δ + (cid:6)) + 2(δ + (cid:6))2) 1 − (1 − 2(δ + (cid:6)) + 2(δ + (cid:6))2) = δ δ + (cid:6) − (cid:6) 1 − (δ + (cid:6)) When (cid:6) and δ are both small, the fraction after the minus sign is small as well, so π and κ are approximately δ/(δ + (cid:6)): the value we get if we take all the items marked by one Table 7 A simple example of agreement on dialogue act tagging. CODER B COMMON RARE TOTAL CODER A COMMON 1 − (δ + 2(cid:6)) (cid:6) 1 − (δ + (cid:6)) RARE (cid:6) δ δ + (cid:6) TOTAL 1 − (δ + (cid:6)) δ + (cid:6) 1 573 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 particular coder as Rare, and calculate what proportion of those items were labeled Rare by the other coder. This is a measure of the coders’ ability to agree on the rare category. 4. Using Agreement Measures for CL Annotation Tasks In this section we review the use of intercoder agreement measures in CL since Carletta’s original paper in light of the discussion in the previous sections. We begin with a summary of Krippendorff’s recommendations about measuring reliability (Krippendorff 2004a, Chapter 11), then discuss how coefficients of agreement have been used in CL to measure the reliability of annotation schemes, focusing in particular on the types of annotation where there has been some debate concerning the most appropriate measures of agreement. 4.1 Methodology and Interpretation of the Results: General Issues Krippendorff (2004a, Chapter 11) notes with regret the fact that reliability is discussed in only around 69% of studies in content analysis. In CL as well, not all annotation projects include a formal test of intercoder agreement. 
Some of the best known annotation efforts, such as the creation of the Penn Treebank (Marcus, Marcinkiewicz, and Santorini 1993) and the British National Corpus (Leech, Garside, and Bryant 1994), do not report reliability results as they predate the Carletta paper; but even among the more recent efforts, many only report percentage agreement, as for the creation of the PropBank (Palmer, Dang, and Fellbaum 2007) or the ongoing OntoNotes annotation (Hovy et al. 2006). Even more importantly, very few studies apply a methodology as rigorous as that envisaged by Krippendorff and other content analysts. We therefore begin this discussion of CL practice with a summary of the main recommendations found in Chapter 11 of Krippendorff (2004a), even though, as we will see, we think that some of these recommendations may not be appropriate for CL. 4.1.1 Generating Data to Measure Reproducibility. Krippendorff’s recommendations were developed for the field of content analysis, where coding is used to draw conclusions from the texts. A coded corpus is thus akin to the result of a scientific experiment, and it can only be considered valid if it is reproducible—that is, if the same coded results can be replicated in an independent coding exercise. Krippendorff therefore argues that any study using observed agreement as a measure of reproducibility must satisfy the following requirements: • • • It must employ an exhaustively formulated, clear, and usable coding scheme together with step-by-step instructions on how to use it. It must use clearly specified criteria concerning the choice of coders (so that others may use such criteria to reproduce the data). It must ensure that the coders that generate the data used to measure reproducibility work independently of each other. Some practices that are common in CL do not satisfy these requirements. The first requirement is violated by the practice of expanding the written coding instructions and including new rules as the data are generated. The second requirement is often 574 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL violated by using experts as coders, particularly long-term collaborators, as such coders may agree not because they are carefully following written instructions, but because they know the purpose of the research very well—which makes it virtually impossible for others to reproduce the results on the basis of the same coding scheme (the prob- lems arising when using experts were already discussed at length in Carletta [1996]). Practices which violate the third requirement (independence) include asking coders to discuss their judgments with each other and reach their decisions by majority vote, or to consult with each other when problems not foreseen in the coding instructions arise. Any of these practices make the resulting data unusable for measuring reproducibility. Krippendorff’s own summary of his recommendations is that to obtain usable data for measuring reproducibility a researcher must use data generated by three or more coders, chosen according to some clearly specified criteria, and working indepen- dently according to a written coding scheme and coding instructions fixed in advance. 
Krippendorff also discusses the criteria to be used in the selection of the sample, from the minimum number of units (obtained using a formula from Bloch and Kraemer [1989], reported in Krippendorff [2004a, page 239]), to how to make the sample rep- resentative of the data population (each category should occur in the sample often enough to yield at least five chance agreements), to how to ensure the reliability of the instructions (the sample should contain examples of all the values for the categories). These recommendations are particularly relevant in light of the comments of Craggs and McGee Wood (2005, page 290), which discourage researchers from testing their coding instructions on data from more than one domain. Given that the reliability of the coding instructions depends to a great extent on how complications are dealt with, and that every domain displays different complications, the sample should contain sufficient examples from all domains which have to be annotated according to the instructions. 4.1.2 Establishing Significance. In hypothesis testing, it is common to test for the sig- nificance of a result against a null hypothesis of chance behavior; for an agreement coefficient this would mean rejecting the possibility that a positive value of agreement is nevertheless due to random coding. We can rely on the statement by Siegel and Castellan (1988, Section 9.8.2) that when sample sizes are large, the sampling distribu- tion of K (Fleiss’s multi-π) is approximately normal and centered around zero—this allows testing the obtained value of K against the null hypothesis of chance agreement by using the z statistic. It is also easy to test Krippendorff’s α with the interval distance metric against the null hypothesis of chance agreement, because the hypothesis α = 0 is identical to the hypothesis F = 1 in an analysis of variance. However, a null hypothesis of chance agreement is not very interesting, and demon- strating that agreement is significantly better than chance is not enough to establish reliability. This has already been pointed out by Cohen (1960, page 44): “to know merely that κ is beyond chance is trivial since one usually expects much more than this in the way of reliability in psychological measurement.” The same point has been repeated and stressed in many subsequent works (e.g., Posner et al. 1990; Di Eugenio 2000; Krippendorff 2004a): The reason for measuring reliability is not to test whether coders perform better than chance, but to ensure that the coders do not deviate too much from perfect agreement (Krippendorff 2004a, page 237). The relevant notion of significance for agreement coefficients is therefore a confi- dence interval. Cohen (1960, pages 43–44) implies that when sample sizes are large, the sampling distribution of κ is approximately normal for any true population value of κ, and therefore confidence intervals for the observed value of κ can be determined 575 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 using the usual multiples of the standard error. Donner and Eliasziw (1987) propose a more general form of significance test for arbitrary levels of agreement. 
In contrast, Krippendorff (2004a, Section 11.4.2) states that the distribution of α is unknown, so confidence intervals must be obtained by bootstrapping; a software package for doing this is described in Hayes and Krippendorff (2007). 4.1.3 Interpreting the Value of Kappa-Like Coefficients. Even after testing significance and establishing confidence intervals for agreement coefficients, we are still faced with the problem of interpreting the meaning of the resulting values. Suppose, for example, we establish that for a particular task, K = 0.78 ± 0.05. Is this good or bad? Unfortunately, deciding what counts as an adequate level of agreement for a specific purpose is still little more than a black art: As we will see, different levels of agreement may be appropriate for resource building and for more linguistic purposes. The problem is not unlike that of interpreting the values of correlation coefficients, and in the area of medical diagnosis, the best known conventions concerning the value of kappa-like coefficients, those proposed by Landis and Koch (1977) and reported in Figure 1, are indeed similar to those used for correlation coefficients, where values above 0.4 are also generally considered adequate (Marion 2004). Many medical re- searchers feel that these conventions are appropriate, and in language studies, a similar interpretation of the values has been proposed by Rietveld and van Hout (1993). In CL, however, most researchers follow the more stringent conventions from content analysis proposed by Krippendorff (1980, page 147), as reported by Carletta (1996, page 252): “content analysis researchers generally think of K > .8 as good reliability,
和 .67 < K < .8 allowing tentative conclusions to be drawn” (Krippendorff was dis- cussing values of α rather than K, but the coefficients are nearly equivalent for cate- gorical labels). As a result, ever since Carletta’s influential paper, CL researchers have attempted to achieve a value of K (more seldom, of α) above the 0.8 threshold, or, failing that, the 0.67 level allowing for “tentative conclusions.” However, the description of the 0.67 boundary in Krippendorff (1980) was actually “highly tentative and cautious,” and in later work Krippendorff clearly considers 0.8 the absolute minimum value of α to accept for any serious purpose: “Even a cutoff point of α = .800 . . . is a pretty low standard” (Krippendorff 2004a, page 242). Recent content analysis practice seems to have settled for even more stringent requirements: A recent textbook, Neuendorf (2002, page 3), analyzing several proposals concerning “acceptable” reliability, con- cludes that “reliability coefficients of .90 or greater would be acceptable to all, .80 or greater would be acceptable in most situations, and below that, there exists great disagreement.” This is clearly a fundamental issue. Ideally we would want to establish thresholds which are appropriate for the field of CL, but as we will see in the rest of this section, a decade of practical experience hasn’t helped in settling the matter. In fact, weighted coefficients, while arguably more appropriate for many annotation tasks, make the issue of deciding when the value of a coefficient indicates sufficient agreement even K = 0.0 0.2 0.4 0.6 0.8 1.0 Poor Slight Fair Moderate Substantial Perfect Figure 1 Kappa values and strength of agreement according to Landis and Koch (1977). 576 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL more complicated because of the problem of determining appropriate weights (see Section 4.4). We will return to the issue of interpreting the value of the coefficients at the end of this article. 4.1.4 Agreement and Machine Learning. In a recent article, Reidsma and Carletta (2008) point out that the goals of annotation in CL differ from those of content analysis, where agreement coefficients originate. A common use of an annotated corpus in CL is not to confirm or reject a hypothesis, but to generalize the patterns using machine-learning algorithms. Through a series of simulations, Reidsma and Carletta demonstrate that agreement coefficients are poor predictors of machine-learning success: Even highly reproducible annotations are difficult to generalize when the disagreements contain pat- terns that can be learned, whereas highly noisy and unreliable data can be generalized successfully when the disagreements do not contain learnable patterns. These results show that agreement coefficients should not be used as indicators of the suitability of annotated data for machine learning. However, the purpose of reliability studies is not to find out whether annotations can be generalized, but whether they capture some kind of observable reality. Even if the pattern of disagreement allows generalization, we need evidence that this general- ization would be meaningful. 
The decision whether a set of annotation guidelines are appropriate or meaningful is ultimately a qualitative one, but a baseline requirement is an acceptable level of agreement among the annotators, who serve as the instruments of measurement. Reliability studies test the soundness of an annotation scheme and guidelines, which is not to be equated with the machine-learnability of data produced by such guidelines. 4.2 Labeling Units with a Common and Predefined Set of Categories: The Case of Dialogue Act Tagging The simplest and most common coding in CL involves labeling segments of text with a limited number of linguistic categories: Examples include part-of-speech tagging, dialogue act tagging, and named entity tagging. The practices used to test reliability for this type of annotation tend to be based on the assumption that the categories used in the annotation are mutually exclusive and equally distinct from one another; this assumption seems to have worked out well in practice, but questions about it have been raised even for the annotation of parts of speech (Babarczy, Carroll, and Sampson 2006), let alone for discourse coding tasks such as dialogue act coding. We concentrate here on this latter type of coding, but a discussion of issues raised for POS, named entity, and prosodic coding can be found in the extended version of the article. Dialogue act tagging is a type of linguistic annotation with which by now the CL community has had extensive experience: Several dialogue-act-annotated spoken lan- guage corpora now exist, such as MapTask (Carletta et al. 1997), Switchboard (Stolcke et al. 2000), Verbmobil (Jekat et al. 1995), and Communicator (e.g., Doran et al. 2001), among others. Historically, dialogue act annotation was also one of the types of annota- tion that motivated the introduction in CL of chance-corrected coefficients of agreement (Carletta et al. 1997) and, as we will see, it has been the type of annotation that has generated the most discussion concerning annotation methodology and measuring agreement. A number of coding schemes for dialogue acts have achieved values of K over 0.8 and have therefore been assumed to be reliable: For example, K = 0.83 for the 577 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 13-tag MapTask coding scheme (Carletta et al. 1997), K = 0.8 for the 42-tag Switchboard- DAMSL scheme (Stolcke et al. 2000), K = 0.90 for the smaller 20-tag subset of the CSTAR scheme used by Doran et al. (2001). All of these tests were based on the same two assumptions: that every unit (utterance) is assigned to exactly one category (dialogue act), and that these categories are distinct. Therefore, again, unweighted measures, and in particular K, tend to be used for measuring inter-coder agreement. However, these assumptions have been challenged based on the observation that utterances tend to have more than one function at the dialogue act level (Traum and Hinkelman 1992; Allen and Core 1997; Bunt 2000); for a useful survey, see Popescu-Belis (2005). 
An assertion performed in answer to a question, for instance, typically performs at least two functions at different levels: asserting some information—the dialogue act that we called Statement in Section 2.3, operating at what Traum and Hinkelman called the “core speech act” level—and confirming that the question has been understood, a di- alogue act operating at the “grounding” level and usually known as Acknowledgment (Ack). In older dialogue act tagsets, acknowledgments and statements were treated as alternative labels at the same “level”, forcing coders to choose one or the other when an utterance performed a dual function, according to a well-specified set of instructions. By contrast, in the annotation schemes inspired from these newer theories such as DAMSL (Allen and Core 1997), coders are allowed to assign tags along distinct “dimensions” or “levels”. Two annotation experiments testing this solution to the “multi-tag” problem with the DAMSL scheme were reported in Core and Allen (1997) and Di Eugenio et al. (1998). In both studies, coders were allowed to mark each communicative function independently: That is, they were allowed to choose for each utterance one of the Statement tags (or possibly none), one of the Influencing-Addressee-Future-Action tags, and so forth—and agreement was evaluated separately for each dimension using (unweighted) K. Core and Allen found values of K ranging from 0.76 for answer to 0.42 for agreement to 0.15 for Committing-Speaker-Future-Action. Using differ- ent coding instructions and on a different corpus, Di Eugenio et al. observed higher agreement, ranging from K = 0.93 (for other-forward-function) to 0.54 (for the tag agreement). These relatively low levels of agreement led many researchers to return to “flat” tagsets for dialogue acts, incorporating however in their schemes some of the in- sights motivating the work on schemes such as DAMSL. The best known example of this type of approach is the development of the SWITCHBOARD-DAMSL tagset by Jurafsky, Shriberg, and Biasca (1997), which incorporates many ideas from the “multi-dimensional” theories of dialogue acts, but does not allow marking an utterance as both an acknowledgment and a statement; a choice has to be made. This tagset results in overall agreement of K = 0.80. Interestingly, subsequent developments of SWITCHBOARD-DAMSL backtracked on some of these decisions. For instance, the ICSI-MRDA tagset developed for the annotation of the ICSI Meeting Recorder corpus reintroduces some of the DAMSL ideas, in that annotators are allowed to assign multi- ple SWITCHBOARD-DAMSL labels to utterances (Shriberg et al. 2004). Shriberg et al. achieved a comparable reliability to that obtained with SWITCHBOARD-DAMSL, but only when using a tagset of just five “class-maps”. Shriberg et al. (2004) also introduced a hierarchical organization of tags to improve reliability. The dimensions of the DAMSL scheme can be viewed as “superclasses” of dialogue acts which share some aspect of their meaning. For instance, the dimension of Influencing-Addressee-Future-Action (IAFA) includes the two dialogue acts Open-option (used to mark suggestions) and Directive, both of which bring into 578 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . 
f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL consideration a future action to be performed by the addressee. At least in principle, an organization of this type opens up the possibility for coders to mark an utterance with the superclass (IAFA) in case they do not feel confident that the utterance satisfies the additional requirements for Open-option or Directive. This, in turn, would do away with the need to make a choice between these two options. This possibility wasn’t pursued in the studies using the original DAMSL that we are aware of (Core and Allen 1997; Di Eugenio 2000; Stent 2001), but was tested by Shriberg et al. (2004) and subsequent work, in particular Geertzen and Bunt (2006), who were specifically interested in the idea of using hierarchical schemes to measure partial agreement, and in addition experimented with weighted coefficients of agreement for their hierarchical tagging scheme, specifically κw. Geertzen and Bunt tested intercoder agreement with Bunt’s DIT++ (Bunt 2005), a scheme with 11 dimensions that builds on ideas from DAMSL and from Dynamic Interpretation Theory (Bunt 2000). In DIT++, tags can be hierarchically related: For example, the class information-seeking is viewed as consisting of two classes, yes- no question (ynq) and wh-question (whq). The hierarchy is explicitly introduced in order to allow coders to leave some aspects of the coding undecided. For example, check is treated as a subclass of ynq in which, in addition, the speaker has a weak belief that the proposition that forms the belief is true. A coder who is not certain about the dialogue act performed using an utterance may simply choose to tag it as ynq. The distance metric d proposed by Geertzen and Bunt is based on the crite- rion that two communicative functions are related (d(c1, c2) < 1) if they stand in an ancestor–offspring relation within a hierarchy. Furthermore, they argue, the magnitude of d(c1, c2) should be proportional to the distance between the functions in the hierar- chy. A level-dependent correction factor is also proposed so as to leave open the option to make disagreements at higher levels of the hierarchy matter more than disagreements at the deeper level (for example, the distance between information-seeking and ynq might be considered greater than the distance between check and positive-check). The results of an agreement test with two annotators run by Geertzen and Bunt show that taking into account partial agreement leads to values of κw that are higher than the values of κ for the same categories, particularly for feedback, a class for which Core and Allen (1997) got low agreement. Of course, even assuming that the values of κw and κ were directly comparable—we remark on the difficulty of interpreting the values of weighted coefficients of agreement in Section 4.4—it remains to be seen whether these higher values are a better indication of the extent of agreement between coders than the values of unweighted κ. This discussion of coding schemes for dialogue acts introduced issues to which we will return for other CL annotation tasks as well. There are a number of well- established schemes for large-scale dialogue act annotation based on the assumption of mutual exclusivity between dialogue act tags, whose reliability is also well known; if one of these schemes is appropriate for modeling the communicative intentions found in a task, we recommend to our readers to use it. 
They should also realize, however, that the mutual exclusivity assumption is somewhat dubious. If a multi-dimensional or hierarchical tagset is used, readers should also be aware that weighted coefficients do capture partial agreement, and need not automatically result in lower reliability or in an explosion in the number of labels. However, a hierarchical scheme may not reflect genuine annotation difficulties: For example, in the case of DIT++, one might argue that it is more difficult to confuse yes-no questions with wh-questions than with statements. We will also see in a moment that interpreting the results with weighted coefficients is difficult. We will return to both of these problems in what follows. 579 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 4.3 Marking Boundaries and Unitizing Before labeling can take place, the units of annotation, or markables, need to be identified—a process Krippendorff (1995, 2004a) calls unitizing. The practice in CL for the forms of annotation discussed in the previous section is to assume that the units are linguistic constituents which can be easily identified, such as words, utterances, or noun phrases, and therefore there is no need to check the reliability of this process. We are aware of few exceptions to this assumption, such as Carletta et al. (1997) on unitization for move coding and our own work on the GNOME corpus (Poesio 2004b). In cases such as text segmentation, however, the identification of units is as important as their labeling, if not more important, and therefore checking agreement on unit identification is essential. In this section we discuss current CL practice with reliability testing of these types of annotation, before briefly summarizing Krippendorff’s proposals concerning measuring reliability for unitizing. 4.3.1 Segmentation and Topic Marking. Discourse segments are portions of text that con- stitute a unit either because they are about the same “topic” (Hearst 1997; Reynar 1998) or because they have to do with achieving the same intention (Grosz and Sidner 1986) or performing the same “dialogue game” (Carletta et al. 1997).7 The analysis of discourse structure—and especially the identification of discourse segments—is the type of annotation that, more than any other, led CL researchers to look for ways of measuring reliability and agreement, as it made them aware of the extent of disagree- ment on even quite simple judgments (Kowtko, Isard, and Doherty 1992; Passonneau and Litman 1993; Carletta et al. 1997; Hearst 1997). Subsequent research identified a number of issues with discourse structure annotation, above all the fact that segmen- tation, though problematic, is still much easier than marking more complex aspects of discourse structure, such as identifying the most important segments or the “rhetorical” relations between segments of different granularity. As a result, many efforts to annotate discourse structure concentrate only on segmentation. The agreement results for segment coding tend to be on the lower end of the scale proposed by Krippendorff and recommended by Carletta. 
Hearst (1997), for instance, found K = 0.647 for the boundary/not boundary distinction; Reynar (1998), measuring agreement between his own annotation and the TREC segmentation of broadcast news, reports K = 0.764 for the same task; Ries (2002) reports even lower agreement of K = 0.36. Teufel, Carletta, and Moens (1999), who studied agreement on the identification of argumentative zones, found high reliability (K = 0.81) for their three main zones (own, other, background), although lower for the whole scheme (K = 0.71). For intention-based segmentation, Passonneau and Litman (1993) in the pre-K days reported an overall percentage agreement with majority opinion of 89%, but the agreement on boundaries was only 70%. For conversational games segmentation, Carletta et al. (1997) reported “promising but not entirely reassuring agreement on where games began (70%),” whereas the agreement on transaction boundaries was K = 0.59. Exceptions are two segmentation efforts carried out as part of annotations of rhetorical structure. Moser, Moore, and Glendening (1996) achieved an agreement 7 The notion of “topic” is notoriously difficult to define and many competing theoretical proposals exist (Reinhart 1981; Vallduv´ı 1993). As it is often the case with annotation, fairly simple definitions tend to be used in discourse annotation work: For example, in TDT topic is defined for annotation purposes as “an event or activity, along with all directly related events and activities” (TDT-2 Annotation Guide, http://projects.ldc.upenn.edu/TDT2/Guide/label-instr.html). 580 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL of K = 0.9 for the highest level of segmentation of their RDA annotation (Poesio, Patel, and Di Eugenio 2006). Carlson, Marcu, and Okurowski (2003) reported very high agreement over the identification of the boundaries of discourse units, the build- ing blocks of their annotation of rhetorical structure. (Agreement was measured sev- eral times; initially, they obtained K = 0.87, and in the final analysis K = 0.97.) This, however, was achieved by employing experienced annotators, and with considerable training. One important reason why most agreement results on segmentation are on the lower end of the reliability scale is the fact, known to researchers in discourse analysis from as early as Levin and Moore (1978), that although analysts generally agree on the “bulk” of segments, they tend to disagree on their exact boundaries. This phenomenon was also observed in more recent studies: See for example the discussion in Passonneau and Litman (1997), the comparison of the annotations produced by seven coders of the same text in Figure 5 of Hearst (1997, page 55), or the discussion by Carlson, Marcu, and Okurowski (2003), who point out that the boundaries between elementary discourse units tend to be “very blurry.” See also Pevzner and Hearst (2002) for similar comments made in the context of topic segmentation algorithms, and Klavans, Popper, and Passonneau (2003) for selecting definition phrases. This “blurriness” of boundaries, combined with the prevalence effects discussed in Section 3.2, also explains the fact that topic annotation efforts which were only concerned with roughly dividing a text into segments (Passonneau and Litman 1993; Carletta et al. 
1997; Hearst 1997; Reynar 1998; Ries 2002) generally report lower agree- ment than the studies whose goal is to identify smaller discourse units. When disagree- ment is mostly concentrated in one class (‘boundary’ in this case), if the total number of units to annotate remains the same, then expected agreement on this class is lower when a greater proportion of the units to annotate belongs to this class. When in addition this class is much less numerous than the other classes, overall agreement tends to depend mostly on agreement on this class. For instance, suppose we are testing the reliability of two different segmentation schemes—into broad “discourse segments” and into finer “discourse units”—on a text of 50 utterances, and that we obtain the results in Table 8. Case 1 would be a situation in which Coder A and Coder B agree that the text consists of two segments, obviously agree on its initial and final boundaries, but disagree by one position on the intermediate boundary—say, one of them places it at utterance 25, the other at utterance 26. Never- theless, because expected agreement is so high—the coders agree on the classification of 98% of the utterances—the value of K is fairly low. In case 2, the coders disagree on three times as many utterances, but K is higher than in the first case because expected agreement is substantially lower (Ae = 0.53). The fact that coders mostly agree on the “bulk” of discourse segments, but tend to disagree on their boundaries, also makes it likely that an all-or-nothing coefficient like K calculated on individual boundaries would underestimate the degree of agree- ment, suggesting low agreement even among coders whose segmentations are mostly similar. A weighted coefficient of agreement like α might produce values more in keeping with intuition, but we are not aware of any attempts at measuring agreement on segmentation using weighted coefficients. We see two main options. We suspect that the methods proposed by Krippendorff (1995) for measuring agreement on unitizing (see Section 4.3.2, subsequently) may be appropriate for the purpose of measuring agreement on discourse segmentation. A second option would be to measure agreement not on individual boundaries but on windows spanning several units, as done in the methods proposed to evaluate the performance of topic detection algorithms such as 581 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 Table 8 Fewer boundaries, higher expected agreement. Case 1: Broad segments Ao = 0.96, Ae = 0.89, K = 0.65 CODER A BOUNDARY NO BOUNDARY TOTAL BOUNDARY CODER B NO BOUNDARY TOTAL 2 1 3 1 46 47 3 47 50 Case 2: Fine discourse units Ao = 0.88, Ae = 0.53, K = 0.75 CODER A BOUNDARY NO BOUNDARY TOTAL BOUNDARY CODER B NO BOUNDARY TOTAL 16 3 19 3 28 31 19 31 50 Pk (Beeferman, Berger, and Lafferty 1999) or WINDOWDIFF (Pevzner and Hearst 2002) (which are, however, raw agreement scores not corrected for chance). 4.3.2 Unitizing (Or, Agreement on Markable Identification). It is often assumed in CL anno- tation practice that the units of analysis are “natural” linguistic objects, and therefore there is no need to check agreement on their identification. As a result, agreement is usually measured on the labeling of units rather than on the process of identifying them (unitizing, Krippendorff 1995). 
We have just seen, however, two coding tasks for which the reliability of unit identification is a crucial part of the overall reliability, and the problem of markable identification is more pervasive than is generally acknowledged. For example, when the units to be labeled are syntactic constituents, it is common practice to use a parser or chunker to identify the markables and then to allow the coders to correct the parser’s output. In such cases one would want to know how reliable the coders’ corrections are. We thus need a general method of testing relibility on markable identification. The one proposal for measuring agreement on markable identification we are aware of is the αU coefficient, a non-trivial variant of α proposed by Krippendorff (1995). A full presentation of the proposal would require too much space, so we will just present the core idea. Unitizing is conceived of as consisting of two separate steps: identifying boundaries between units, and selecting the units of interest. If a unit identified by one coder overlaps a unit identified by the other coder, the amount of disagreement is the square of the lengths of the non-overlapping segments (see Figure 2); if a unit identified by one coder does not overlap any unit of interest identified by the other coder, the amount of disagreement is the square of the length of the whole unit. This distance metric is used in calculating observed and expected disagreement, and αU itself. We refer the reader to Krippendorff (1995) for details. Krippendorff’s αU is not applicable to all CL tasks. For example, it assumes that units may not overlap in a single coder’s output, yet in practice there are many 582 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL Coder A Coder B s−✛ ✲ ✛ s ✛ ✲ ✲ s+ Figure 2 The difference between overlapping units is d(A, B) = s2 1995, Figure 4, page 61). − + s2 + (adapted from Krippendorff l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 annotation schemes which require coders to label nested syntactic constituents. For continuous segmentation tasks, αU may be inappropriate because when a segment identified by one annotator overlaps with two segments identified by another annotator, the distance is smallest when the one segment is centered over the two rather than aligned with one of them. Nevertheless, we feel that when the non-overlap assumption holds, and the units do not cover the text exhaustively, testing the reliabilty of unit identification may prove beneficial. To our knowledge, this has never been tested in CL. 4.4 Anaphora The annotation tasks discussed so far involve assigning a specific label to each category, which allows the various agreement measures to be applied in a straightforward way. Anaphoric annotation differs from the previous tasks because annotators do not assign labels, but rather create links between anaphors and their antecedents. It is therefore not clear what the “labels” should be for the purpose of calculating agreement. 
One possibility would be to consider the intended referent (real-world object) as the label, as in named entity tagging, but it wouldn’t make sense to predefine a set of “labels” applicable to all texts, because different objects are mentioned in different texts. An alternative is to use the marked antecedents as “labels”. However, we do not want to count as a disagreement every time two coders agree on the discourse entity realized by a particular noun phrase but just happen to mark different words as antecedents. Consider the reference of the underlined pronoun it in the following dialogue excerpt (TRAINS 1991 [Gross, Allen, and Traum 1993], dialogue d91-3.2).8 1.1 1.4 1.5 1.6 2.1 3.1 M: .... first thing I’d like you to do is send engine E2 off with a boxcar to Corning to pick up oranges as soon as possible S: okay M: and while it’s there it should pick up the tanker Some of the coders in a study we carried out (Poesio and Artstein 2005) indicated the noun phrase engine E2 as antecedent for the second it in utterance 3.1, whereas others indicated the immediately preceding pronoun, which they had previously marked as having engine E2 as antecedent. Clearly, we do not want to consider these coders to be in disagreement. A solution to this dilemma has been proposed by Passonneau (2004): Use the emerging coreference sets as the ‘labels’ for the purpose of calculating agreement. This requires using weighted measures for calculating agreement on such sets, and 8 ftp://ftp.cs.rochester.edu/pub/papers/ai/92.tn1.trains 91 dialogues.txt. 583 Computational Linguistics Volume 34, Number 4 consequently it raises serious questions about weighted measures—in particular, about the interpretability of the results, as we will see shortly. 4.4.1 Passonneau’s Proposal. Passonneau (2004) recommends measuring agreement on anaphoric annotation by using sets of mentions of discourse entities as labels, that is, the emerging anaphoric/coreference chains. This proposal is in line with the meth- ods developed to evaluate anaphora resolution systems (Vilain et al. 1995). But using anaphoric chains as labels would not make unweighted measures such as K a good measure for agreement. Practical experience suggests that, except when a text is very short, few annotators will catch all mentions of a discourse entity: Most will forget to mark a few, with the result that the chains (that is, category labels) differ from coder to coder and agreement as measured with K is always very low. What is needed is a coefficient that also allows for partial disagreement between judgments, when two annotators agree on part of the coreference chain but not on all of it. Passonneau (2004) suggests solving the problem by using α with a distance metric that allows for partial agreement among anaphoric chains. Passonneau proposes a dis- tance metric based on the following rationale: Two sets are minimally distant when they are identical and maximally distant when they are disjoint; between these extremes, sets that stand in a subset relation are closer (less distant) than ones that merely intersect. This leads to the following distance metric between two sets A and B. 
   dP = 0 if A = B 1/3 if A ⊂ B or B ⊂ A 2/3 if A ∩ B (cid:12)= ∅, but A (cid:12)⊂ B and B (cid:12)⊂ A 1 if A ∩ B = ∅ Alternative distance metrics take the size of the anaphoric chain into account, based on measures used to compare sets in Information Retrieval, such as the coefficient of community of Jaccard (1912) and the coincidence index of Dice (1945) (Manning and Sch ¨utze 1999). Jaccard: dJ = 1 − |A ∩ B| |A ∪ B| Dice: dD = 1 − 2 |A ∩ B| |A| + |B| In later work, Passonneau (2006) offers a refined distance metric which she called MASI (Measuring Agreement on Set-valued Items), obtained by multiplying Passonneau’s original metric dP by the metric derived from Jaccard dJ. dM = dP × dJ 4.4.2 Experience with α for Anaphoric Annotation. In the experiment mentioned previously (Poesio and Artstein 2005) we used 18 coders to test α and K under a variety of condi- tions. We found that even though our coders by and large agreed on the interpretation of anaphoric expressions, virtually no coder ever identified all the mentions of a discourse entity. As a result, even though the values of α and K obtained by using the ID of the antecedent as label were pretty similar, the values obtained when using anaphoric chains as labels were drastically different. The value of α increased, because examples where coders linked a markable to different antecedents in the same chain were no 584 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL longer considered as disagreements. However, the value of K was drastically reduced, because hardly any coder identified all the mentions of discourse entities (Figure 3). The study also looked at the matter of individual annotator bias, and as mentioned in Section 3.1, we did not find differences between α and a κ-style version of α beyond the third decimal point. This similarity is what one would expect, given the result about annotator bias from Section 3.1 and given that in this experiment we used 18 annotators. These very small differences should be contrasted with the differences resulting from the choice of distance metrics, where values for the full-chain condition ranged from α = 0.642 using Jaccard as distance metric, to α = 0.654 using Passonneau’s metric, to the value for Dice reported in Figure 3, α = 0.691. These differences raise an important issue concerning the application of α-like measures for CL tasks: Using α makes it diffi- cult to compare the results of different annotation experiments, in that a “poor” value or a “high” value might result from “too strict” or “too generous” distance metrics, making it even more important to develop a methodology to identify appropriate values for these coefficients. This issue is further emphasized by the study reported next. 4.4.3 Discourse Deixis. A second annotation study we carried out (Artstein and Poesio 2006) shows even more clearly the possible side effects of using weighted coefficients. This study was concerned with the annotation of the antecedents of references to abstract objects, such as the example of the pronoun that in utterance 7.6 (TRAINS 1991, dialogue d91-2.2). 
7.3 7.4 7.5 7.6 : so we ship one : boxcar : of oranges to Elmira : and that takes another 2 hours Previous studies of discourse deixis annotation showed that these are extremely diffi- cult judgments to make (Eckert and Strube 2000; Navarretta 2000; Byron 2002), except perhaps for identifying the type of object (Poesio and Modjeska 2005), so we simplified the task by only requiring our participants to identify the boundaries of the area of text in which the antecedent was introduced. Even so, we found a great variety in how these boundaries were marked: Exactly as in the case of discourse segmentation discussed earlier, our participants broadly agreed on the area of text, but disagreed on Chain K None Partial Full 0.628 0.563 0.480 α 0.656 0.677 0.691 0.7 0.6 0.5 0.4 α K α K α K no chain partial chain full chain Figure 3 A comparison of the values of α and K for anaphoric annotation (Poesio and Artstein 2005). 585 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 its exact boundary. For instance, in this example, nine out of ten annotators marked the antecedent of that as a text segment ending with the word Elmira, but some started with the word so, some started with we, some with ship, and some with one. We tested a number of ways to measure partial agreement on this task, and obtained widely different results. First of all, we tested three set-based distance metrics inspired by the Passonneau proposals that we just discussed: We considered discourse segments to be sets of words, and computed the distance between them using Passonneau’s metric, Jaccard, and Dice. Using these three metrics, we obtained α values of 0.55 (with Passonneau’s metric), 0.45 (with Jaccard), and 0.55 (with Dice). We should note that because antecedents of different expressions rarely overlapped, the expected disagree- ment was close to 1 (maximal), so the value of α turned out to be very close to the com- plement of the observed disagreement as calculated by the different distance metrics. Next, we considered methods based on the position of words in the text. The first method computed differences between absolute boundary positions: Each antecedent was associated with the position of its first or last word in the dialogue, and agreement was calculated using α with the interval distance metric. This gave us α values of 0.998 for the beginnings of the antecedent-evoking area and 0.999 for the ends. This is because expected disagreement is exceptionally low: Coders tend to mark discourse an- tecedents close to the referring expression, so the average distance between antecedents of the same expression is smaller than the size of the dialogue by a few orders of magnitude. The second method associated each antecedent with the position of its first or last word relative to the beginning of the anaphoric expression. This time we found extremely low values of α = 0.167 for beginnings of antecedents and 0.122 for ends— barely in the positive side. This shows that agreement among coders is not dramatically better than what would be expected if they just marked discourse antecedents at a fixed distance from the referring expression. 
The three ranges of α that we observed (middle, high, and low) show agreement on the identity of discourse antecedents, their position in the dialogue, and their position relative to referring expressions, respectively. The middle range shows variability of up to 10 percentage points, depending on the distance metric chosen. The lesson is that once we start using weighted measures we cannot anymore interpret the value of α using traditional rules of thumb such as those proposed by Krippendorff or by Landis and Koch. This is because depending on the way we measure agreement, we can report α values ranging from 0.122 to 0.998 for the very same experiment! New interpretation methods have to be developed, which will be task- and distance-metric specific. We’ll return to this issue in the conclusions. 4.5 Word Senses Word sense tagging is one of the hardest annotation tasks. Whereas in the case of part- of-speech and dialogue act tagging the same categories are used to classify all units, in the case of word sense tagging different categories must be used for each word, which makes writing a single coding manual specifying examples for all categories impossible: The only option is to rely on a dictionary. Unfortunately, different dictionaries make different distinctions, and often coders can’t make the fine-grained distinctions that trained lexicographers can make. The problem is particularly serious for verbs, which tend to be polysemous rather than homonymous (Palmer, Dang, and Fellbaum 2007). These difficulties, and in particular the difficulty of tagging senses with a fine- grained repertoire of senses such as that provided by dictionaries or by WordNet (Fellbaum 1998), have been highlighted by the three SENSEVAL initiatives. Already 586 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL during the first SENSEVAL, V´eronis (1998) carried out two studies of intercoder agreement on word sense tagging in the so-called ROMANSEVAL task. One study was concerned with agreement on polysemy—that is, the extent to which coders agreed that a word was polysemous in a given context. Six naive coders were asked to make this judgment about 600 French words (200 nouns, 200 verbs, 200 adjectives) using the repertoire of senses in the Petit Larousse. On this task, a (pairwise) percentage agreement of 0.68 for nouns, 0.74 for verbs, and 0.78 for adjectives was observed, corresponding to K values of 0.36, 0.37, and 0.67, respectively. The 20 words from each category perceived by the coders in this first experiment to be most polysemous were then used in a second study, of intercoder agreement on the sense tagging task, which involved six different naive coders. Interestingly, the coders in this second experiment were allowed to assign multiple tags to words, although they did not make much use of this possibility; so κw was used to measure agreement. In this experiment, V´eronis observed (weighted) pairwise agreement of 0.63 for verbs, 0.71 for adjectives, and 0.73 for nouns, corresponding to κw values of 0.41, 0.41, and 0.46, but with a wide variety of values when measured per word—ranging from 0.007 for the adjective correct to 0.92 for the noun d´etention. Similarly mediocre results for intercoder agreement between naive coders were reported in the subsequent editions of SENSEVAL. 
Agreement studies for SENSEVAL-2, where WordNet senses were used as tags, reported a percentage agreement for verb senses of around 70%, whereas for SENSEVAL-3 (English Lexical Sample Task), Mihalcea, Chklovski, and Kilgarriff (2004) report a percentage agreement of 67.3% and average K of 0.58. Two types of solutions have been proposed for the problem of low agreement on sense tagging. The solution proposed by Kilgarriff (1999) is to use professional lexicog- raphers and arbitration. The study carried out by Kilgarriff does not therefore qualify as a true study of replicability in the sense of the terms used by Krippendorff, but it did show that this approach makes it possible to achieve percentage agreement of around 95.5%. An alternative approach has been to address the problem of the inability of naive coders to make fine-grained distinctions by introducing coarser-grained classification schemes which group together dictionary senses (Bruce and Wiebe, 1998; Buitelaar 1998; V´eronis 1998; Palmer, Dang, and Fellbaum 2007). Hierarchical tagsets were also developed, such as HECTOR (Atkins 1992) or, indeed, WordNet itself (where senses are related by hyponymy links). In the case of Buitelaar and Palmer, Dang, and Fellbaum, the “supersenses” were identified by hand, whereas Bruce and Wiebe and V´eronis used clustering methods such as those from Bruce and Wiebe (1999) to collapse some of the initial sense distinctions.9 Palmer, Dang, and Fellbaum (2007) illustrate this practice with the example of the verb call, which has 28 fine-grained senses in WordNet 1.7: They conflate these senses into a small number of groups using various criteria—for example, four senses can be grouped in a group they call Group 1 on the basis of subcategorization frame similarities (Table 9). Palmer, Dang, and Fellbaum (2007) achieved for the English Verb Lexical Sense task of SENSEVAL-2 a percentage agreement among coders of 82% with grouped senses, as opposed to 71% with the original WordNet senses. Bruce and Wiebe (1998) found that collapsing the senses of their test word (interest) on the basis of their use by coders and merging the two classes found to be harder to distinguish resulted in an increase of 9 The methodology proposed in Bruce and Wiebe (1999) is in our view the most advanced technique to “make sense” of the results of agreement studies available in the literature. The extended version of this article contains a fuller introduction to these methods. 587 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 34, Number 4 Table 9 Group 1 of senses of call in Palmer, Dang, and Fellbaum (2007, page 149). SENSE WN1 WN3 DESCRIPTION name, call call, give a quality EXAMPLE “They nameda their son David” “She called her children lazy and ungrateful” WN19 WN22 a The verb named appears in the original WordNet example for the verb call. “I would not call her beautiful” “Call me mister” call, consider address, call HYPERNYM LABEL LABEL SEE ADDRESS the value of K from 0.874 to 0.898. Using a related technique, V´eronis (1998) found that agreement on noun word sense tagging went up from a K of around 0.45 to a K of 0.86. We should note, however, that the post hoc merging of categories is not equivalent to running a study with fewer categories to begin with. 
Attempts were also made to develop techniques to measure partial agreement with hierarchical tagsets. A first proposal in this direction was advanced by Melamed and Resnik (2000), who developed a coefficient for hierarchical tagsets that could be used in SENSEVAL for measuring agreement with tagsets such as HECTOR. Melamed and Resnik proposed to “normalize” the computation of observed and expected agreement by taking each label which is not a leaf in the tag hierarchy and distributing it down to the leaves in a uniform way, and then only computing agreement on the leaves. For example, with a tagset like the one in Table 9, the cases in which the coders used the label ‘Group 1’ would be uniformly “distributed down” and added in equal measure to the number of cases in which the coders assigned each of the four WordNet labels. The method proposed in the paper has, however, problematic properties when used to measure intercoder agreement. For example, suppose tag A dominates two sub-tags A1 and A2, and that two coders mark a particular item as A. Intuitively, we would want to consider this a case of perfect agreement, but this is not what the method proposed by Melamed and Resnik yields. The annotators’ marks are distributed over the two sub-tags, each with probability 0.5, and then the agreement is computed by summing the joint probabilities over the two subtags (Equation (4) of Melamed and Resnik 2000), with the result that the agreement over the item turns out to be 0.52 + 0.52 = 0.5 instead of 1. To correct this, Dan Melamed (personal communication) suggested replacing the product in Equation (4) with a minimum operator. However, the calculation of expected agreement (Equation (5) of Melamed and Resnik 2000) still gives the amount of agree- ment which is expected if coders are forced to choose among leaf nodes, which makes this method inappropriate for coding schemes that do not force coders to do this. One way to use Melamed and Resnik’s proposal while avoiding the discrepancy between observed and expected agreement is to treat the proposal not as a new co- efficient, but rather as a distance metric to be plugged into a weighted coefficient like α. Let A and B be two nodes in a hierarchical tagset, let L be the set of all leaf nodes in the tagset, and let P(l|T) be the probability of selecting a leaf node l given an arbitrary node T when the probability mass of T is distributed uniformly to all the nodes dominated by T. We can reinterpret Melamed’s modification of Equation (4) in Melamed and Resnik (2000) as a metric measuring the distance between nodes A and B. dM+R = 1 − ∑ l∈L min(P(l|A), P(l|B)) 588 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 4 4 5 5 5 1 8 0 8 9 4 7 / c o l i . 0 7 - 0 3 4 - r 2 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Artstein and Poesio Inter-Coder Agreement for CL This metric has the desirable properties—it is 0 when tags A and B are identical, 1 when the tags do not overlap, and somewhere in between in all other cases. If we use this metric for Krippendorff’s α we find that observed agreement is exactly the same as in Melamed and Resnik (2000) with the product operator replaced by minimum (Melamed’s modification). We can also use other distance metrics with α. For example, we could associate with each sense an extended sense—a set es(s) including the sense itself and its grouped sense—and then use set-based distance metrics from Section 4.4, for ex- ample Passonneau’s dP. 
To illustrate how this approach could be used to measure (dis)agreement on word sense annotation, suppose that two coders have to annotate the use of call in the following sentence (from the WSJ part of the Penn Treebank, section 02, text w0209):

    This gene, called "gametocide," is carried into the plant by a virus that remains active for a few days.

The standard guidelines (in SENSEVAL, say) require coders to assign a WN sense to words. Under such guidelines, if coder A classifies the use of called in the above example as an instance of WN1, whereas coder B annotates it as an instance of WN3, we would find total disagreement (d_{k_a k_b} = 1), which seems excessively harsh, as the two senses are clearly related. However, by using the broader senses proposed by Palmer, Dang, and Fellbaum (2007) in combination with a distance metric such as the one just proposed, it is possible to get more flexible and, we believe, more realistic assessments of the degree of agreement in situations such as this. For instance, if the reliability study had already been carried out under the standard SENSEVAL guidelines, the distance metric proposed above could be used to identify post hoc cases of partial agreement by adding to each WN sense its hypernyms according to the groupings proposed by Palmer, Dang, and Fellbaum. For example, A's annotation could be turned into a new set label {WN1, LABEL} and B's mark into the set label {WN3, LABEL}, which would give a distance d = 2/3, indicating a degree of overlap. The method for computing agreement proposed here could also be used to allow coders to choose either a more specific label or one of Palmer, Dang, and Fellbaum's superlabels. For example, suppose A sticks to WN1, but B decides to mark the use above using Palmer, Dang, and Fellbaum's LABEL category; then we would still find a distance d = 1/3.

An alternative way of using α for word sense annotation was developed and tested by Passonneau, Habash, and Rambow (2006). Their approach is to allow coders to assign multiple labels (WordNet synsets) for word senses, as done by Véronis (1998) and more recently by Rosenberg and Binkowski (2004) for text classification labels and by Poesio and Artstein (2005) for anaphora. These multi-label sets can then be compared using the MASI distance metric for α (Passonneau 2006).
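A minimal sketch of the set-based comparison just described: the function below implements Passonneau's dP (distance 0 for identical sets, 1/3 when one set subsumes the other, 2/3 when the sets intersect without subsumption, and 1 when they are disjoint) and applies it to the extended sense sets of the call example; the label names follow Table 9 and the groupings of Palmer, Dang, and Fellbaum (2007).

```python
# Sketch of Passonneau's d_P applied to extended sense sets for 'call'.
def d_p(a, b):
    a, b = set(a), set(b)
    if a == b:
        return 0.0          # identical label sets
    if a <= b or b <= a:
        return 1.0 / 3.0    # one set subsumes the other
    if a & b:
        return 2.0 / 3.0    # overlapping, but neither subsumes the other
    return 1.0              # disjoint label sets

# Each WordNet sense is extended with its Palmer, Dang, and Fellbaum group label.
print(d_p({"WN1", "LABEL"}, {"WN3", "LABEL"}))  # 2/3: related senses, partial credit
print(d_p({"WN1", "LABEL"}, {"LABEL"}))         # 1/3: specific sense vs. superlabel
print(d_p({"WN1"}, {"WN3"}))                    # 1.0: plain WordNet senses, total disagreement
```

The distances reproduce the figures given above and can be plugged directly into a weighted coefficient such as α; MASI plays the analogous role for the multi-label annotations of Passonneau, Habash, and Rambow (2006).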
5. Conclusions

The purpose of this article has been to expose the reader to the mathematics of chance-corrected coefficients of agreement as well as the current state of the art of using these coefficients in CL. Our hope is that readers come to view agreement studies not as an additional chore or hurdle for publication, but as a tool for analysis which offers new insights into the annotation process. We conclude by summarizing what in our view are the main recommendations emerging from ten years of experience with coefficients of agreement. These can be grouped under three main headings: methodology, choice of coefficients, and interpretation of coefficients.

5.1 Methodology

Our first recommendation is that annotation efforts should perform and report rigorous reliability testing. The last decade has already seen considerable improvement, from the absence of any tests for the Penn Treebank (Marcus, Marcinkiewicz, and Santorini 1993) or the British National Corpus (Leech, Garside, and Bryant 1994) to the central role played by reliability testing in the Penn Discourse Treebank (Miltsakaki et al. 2004) and OntoNotes (Hovy et al. 2006). But even the latter efforts only measure and report percent agreement. We believe that part of the reluctance to report chance-corrected measures is the difficulty in interpreting them. However, our experience is that chance-corrected coefficients of agreement do provide a better indication of the quality of the resulting annotation than simple percent agreement; moreover, the detailed calculations leading to the coefficients can be very revealing as to where the disagreements are located and what their sources may be.

A rigorous methodology for reliability testing does not, in our opinion, exclude the use of expert coders, and here we feel there may be a motivated difference between the fields of content analysis and CL. There is a clear tradeoff between the complexity of the judgments that coders are required to make and the reliability of such judgments, and we should strive to devise annotation schemes that are not only reliable enough to be replicated, but also sophisticated enough to be useful (cf. Krippendorff 2004a, pages 213–214). In content analysis, conclusions are drawn directly from annotated corpora, so the emphasis is more on replicability; in CL, by contrast, corpora constitute a resource which is used by other processes, so the emphasis is more towards usefulness. There is also a tradeoff between the sophistication of judgments and the availability of coders who can make such judgments. Consequently, annotation by experts is often the only practical way to get useful corpora for CL. Current practice achieves high reliability either by using professionals (Kilgarriff 1999) or through intensive training (Hovy et al. 2006; Carlson, Marcu, and Okurowski 2003); this means that results are not replicable across sites, and are therefore less reliable than annotation by naive coders adhering to written instructions. We feel that inter-annotator agreement studies should still be carried out, as they serve as an assurance that the results are replicable when the annotators are chosen from the same population as the original annotators. An important additional assurance should be provided in the form of an independent evaluation of the task for which the corpus is used (cf. Passonneau 2006).

5.2 Choosing a Coefficient

One of the goals of this article is to help authors make an informed choice regarding the coefficients they use for measuring agreement. While coefficients other than K, specifically Cohen's κ and Krippendorff's α, have appeared in the CL literature as early as Carletta (1996) and Passonneau and Litman (1996), they did not spring into general awareness until the publication of Di Eugenio and Glass (2004) and Passonneau (2004). Regarding the question of annotator bias, there is an overwhelming consensus in CL practice: K and α are used in the vast majority of the studies we reported. We agree with the view that K and α are more appropriate, as they abstract away from the bias of specific coders. But we also believe that ultimately this issue of annotator bias is of little consequence, because the differences get smaller and smaller as the number of annotators grows (Artstein and Poesio 2005).
We believe that increasing the number of annotators is the best strategy, because it reduces the chances of accidental personal biases.

However, Krippendorff's α is indispensable when the category labels are not equally distinct from one another. We think there are at least two types of coding schemes in which this is the case: (i) hierarchical tagsets and (ii) set-valued interpretations such as those proposed for anaphora. At least in the second case, weighted coefficients are almost unavoidable. We therefore recommend using α, noting however that the specific choice of weights will affect the overall numerical result.

5.3 Interpreting the Values

We view the lack of consensus on how to interpret the values of agreement coefficients as a serious problem with current practice in reliability testing, and as one of the main reasons for the reluctance of many in CL to embark on reliability studies. Unlike significance values, which report a probability (that an observed effect is due to chance), agreement coefficients report a magnitude, and it is less clear how to interpret such magnitudes. Our own experience is consistent with that of Krippendorff: Both in our earlier work (Poesio and Vieira 1998; Poesio 2004a) and in the more recent efforts (Poesio and Artstein 2005) we found that only values above 0.8 ensured an annotation of reasonable quality (Poesio 2004a). We therefore feel that if a threshold needs to be set, 0.8 is a good value. That said, we doubt that a single cutoff point is appropriate for all purposes. For some CL studies, particularly on discourse, useful corpora have been obtained while attaining reliability only at the 0.7 level. We agree therefore with Craggs and McGee Wood (2005) that setting a specific agreement threshold should not be a prerequisite for publication. Instead, as recommended by Di Eugenio and Glass (2004) and others, researchers should report in detail on the methodology that was followed in collecting the reliability data (number of coders, whether they coded independently, whether they relied exclusively on an annotation manual), whether agreement was statistically significant, and provide a confusion matrix or agreement table so that readers can find out whether overall figures of agreement hide disagreements on less common categories. For an example of good practice in this respect, see Teufel and Moens (2002). The decision whether a corpus is good enough for publication should be based on more than the agreement score; specifically, an important consideration is an independent evaluation of the results that are based on the corpus.

Acknowledgments

This work was supported in part by EPSRC grant GR/S76434/01, ARRAU. We wish to thank four anonymous reviewers and Jean Carletta, Mark Core, Barbara Di Eugenio, Ruth Filik, Michael Glass, George Hripcsak, Adam Kilgarriff, Dan Melamed, Becky Passonneau, Phil Resnik, Tony Sanford, Patrick Sturt, and David Traum for helpful comments and discussion. Special thanks to Klaus Krippendorff for an extremely detailed review of an earlier version of this article. We are also extremely grateful to the British Library in London, which made accessible to us virtually every paper we needed for this research.

References
Allen, James and Mark Core. 1997. DAMSL: Dialogue act markup in several layers. Draft contribution for the Discourse Resource Initiative, University of Rochester. Available at: http://www.cs.rochester.edu/research/cisd/resources/damsl/.

Artstein, Ron and Massimo Poesio. 2005. Bias decreases in proportion to the number of annotators. In Proceedings of FG-MoL 2005, pages 141–150, Edinburgh.

Artstein, Ron and Massimo Poesio. 2006. Identifying reference to abstract objects in dialogue. In brandial 2006: Proceedings of the 10th Workshop on the Semantics and Pragmatics of Dialogue, pages 56–63, Potsdam.

Atkins, Sue. 1992. Tools for computer-aided corpus lexicography: The Hector project. Acta Linguistica Hungarica, 41:5–71.

Babarczy, Anna, John Carroll, and Geoffrey Sampson. 2006. Definitional, personal, and mechanical constraints on part of speech annotation performance. Natural Language Engineering, 12(1):77–90.

Bartko, John J. and William T. Carpenter, Jr. 1976. On the methods and theory of reliability. Journal of Nervous and Mental Disease, 163(5):307–317.

Beeferman, Doug, Adam Berger, and John Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34(1–3):177–210.

Bennett, E. M., R. Alpert, and A. C. Goldstein. 1954. Communications through limited questioning. Public Opinion Quarterly, 18(3):303–308.

Bloch, Daniel A. and Helena Chmura Kraemer. 1989. 2 × 2 kappa coefficients: Measures of agreement or association. Biometrics, 45(1):269–287.

Brennan, Robert L. and Dale J. Prediger. 1981. Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3):687–699.

Bruce, Rebecca and Janyce Wiebe. 1998. Word-sense distinguishability and inter-coder agreement. In Proceedings of EMNLP, pages 53–60, Granada.

Bruce, Rebecca F. and Janyce M. Wiebe. 1999. Recognizing subjectivity: A case study in manual tagging. Natural Language Engineering, 5(2):187–205.

Buitelaar, Paul. 1998. CoreLex: Systematic Polysemy and Underspecification. Ph.D. thesis, Brandeis University, Waltham, MA.

Bunt, Harry C. 2000. Dynamic interpretation and dialogue theory. In Martin M. Taylor, Françoise Néel, and Don G. Bouwhuis, editors, The Structure of Multimodal Dialogue II. John Benjamins, Amsterdam, pages 139–166.

Bunt, Harry C. 2005. A framework for dialogue act specification. In Proceedings of the Joint ISO-ACL Workshop on the Representation and Annotation of Semantic Information, Tilburg. Available at: http://let.uvt.nl/research/ti/sigsem/wg/discussionnotes4.htm.

Byron, Donna K. 2002. Resolving pronominal reference to abstract entities. In Proceedings of the 40th Annual Meeting of the ACL, pages 80–87, Philadelphia, PA.

Byrt, Ted, Janet Bishop, and John B. Carlin. 1993. Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46(5):423–429.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.

Carletta, Jean, Amy Isard, Stephen Isard, Jacqueline C. Kowtko, Gwyneth Doherty-Sneddon, and Anne H. Anderson. 1997. The reliability of a dialogue structure coding scheme. Computational Linguistics, 23(1):13–32.
Carlson, Lynn, Daniel Marcu, and Mary Ellen Okurowski. 2003. Building a discourse-tagged corpus in the framework of rhetorical structure theory. In Jan C. J. van Kuppevelt and Ronnie W. Smith, editors, Current and New Directions in Discourse and Dialogue. Kluwer, Dordrecht, pages 85–112.

Cicchetti, Domenic V. and Alvan R. Feinstein. 1990. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6):551–558.

Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Cohen, Jacob. 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220.

Core, Mark G. and James F. Allen. 1997. Coding dialogs with the DAMSL annotation scheme. In Working Notes of the AAAI Fall Symposium on Communicative Action in Humans and Machines, AAAI, Cambridge, MA. Available at: http://www.cs.umd.edu/~traum/CA/fpapers.html.

Craggs, Richard and Mary McGee Wood. 2004. A two-dimensional annotation scheme for emotion in dialogue. In Papers from the 2004 AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications, Stanford, pages 44–49.

Craggs, Richard and Mary McGee Wood. 2005. Evaluating discourse and dialogue coding schemes. Computational Linguistics, 31(3):289–295.

Davies, Mark and Joseph L. Fleiss. 1982. Measuring agreement for multinomial data. Biometrics, 38(4):1047–1051.

Di Eugenio, Barbara. 2000. On the usage of Kappa to evaluate agreement on coding tasks. In Proceedings of LREC, volume 1, pages 441–444, Athens.

Di Eugenio, Barbara and Michael Glass. 2004. The kappa statistic: A second look. Computational Linguistics, 30(1):95–101.

Di Eugenio, Barbara, Pamela W. Jordan, Johanna D. Moore, and Richmond H. Thomason. 1998. An empirical investigation of proposals in collaborative dialogues. In Proceedings of the 36th Annual Meeting of the ACL, pages 325–329, Montreal.

Dice, Lee R. 1945. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302.

Donner, Allan and Michael Eliasziw. 1987. Sample size requirements for reliability studies. Statistics in Medicine, 6:441–448.

Doran, Christine, John Aberdeen, Laurie Damianos, and Lynette Hirschman. 2001. Comparing several aspects of human-computer and human-human dialogues. In Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue, Aalborg, Denmark. Available at: http://www.sigdial.org/workshops/workshop2/proceedings.

Eckert, Miriam and Michael Strube. 2000. Dialogue acts, synchronizing units, and anaphora resolution. Journal of Semantics, 17(1):51–89.

Feinstein, Alvan R. and Domenic V. Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543–549.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Fleiss, Joseph L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Fleiss, Joseph L. 1975. Measuring agreement between two judges on the presence or absence of a trait. Biometrics, 31(3):651–659.

Francis, W. Nelson and Henry Kucera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, Boston, MA.
Geertzen, Jeroen and Harry Bunt. 2006. Measuring annotator agreement in a complex hierarchical dialogue act annotation scheme. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, pages 126–133, Sydney.

Gross, Derek, James F. Allen, and David R. Traum. 1993. The Trains 91 dialogues. TRAINS Technical Note 92-1, University of Rochester Computer Science Department, Rochester, NY.

Grosz, Barbara J. and Candace L. Sidner. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204.

Hayes, Andrew F. and Klaus Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77–89.

Hearst, Marti A. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Hovy, Eduard, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of HLT-NAACL, Companion Volume: Short Papers, pages 57–60, New York.

Hsu, Louis M. and Ronald Field. 2003. Interrater agreement measures: Comments on kappa_n, Cohen's kappa, Scott's π, and Aickin's α. Understanding Statistics, 2(3):205–219.

Jaccard, Paul. 1912. The distribution of the flora in the Alpine zone. New Phytologist, 11(2):37–50.

Jekat, Susanne, Alexandra Klein, Elisabeth Maier, Ilona Maleck, Marion Mast, and J. Joachim Quantz. 1995. Dialogue acts in VERBMOBIL. VM-Report 65, Universität Hamburg, DFKI GmbH, Universität Erlangen, and TU Berlin.

Jurafsky, Daniel, Elizabeth Shriberg, and Debra Biasca. 1997. Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual, draft 13. Technical Report 97-02, University of Colorado at Boulder, Institute for Cognitive Science.

Kilgarriff, Adam. 1999. 95% replicability for manual word sense tagging. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 277–278, Bergen, Norway.

Klavans, Judith L., Samuel Popper, and Rebecca Passonneau. 2003. Tackling the internet glossary glut: Automatic extraction and evaluation of genus phrases. In Proceedings of the SIGIR-2003 Workshop on the Semantic Web, Toronto.

Kowtko, Jacqueline C., Stephen D. Isard, and Gwyneth M. Doherty. 1992. Conversational games within dialogue. Research Paper HCRC/RP-31, Human Communication Research Centre, University of Edinburgh.

Krippendorff, Klaus. 1970. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1):61–70.

Krippendorff, Klaus. 1978. Reliability of binary attribute data. Biometrics, 34(1):142–144. Letter to the editor, with a reply by Joseph L. Fleiss.

Krippendorff, Klaus. 1980. Content Analysis: An Introduction to Its Methodology, chapter 12. Sage, Beverly Hills, CA.

Krippendorff, Klaus. 1995. On the reliability of unitizing contiguous data. Sociological Methodology, 25:47–76.

Krippendorff, Klaus. 2004a. Content Analysis: An Introduction to Its Methodology, second edition, chapter 11. Sage, Thousand Oaks, CA.

Krippendorff, Klaus. 2004b. Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3):411–433.
Landis, J. Richard and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.

Leech, Geoffrey, Roger Garside, and Michael Bryant. 1994. CLAWS4: The tagging of the British National Corpus. In Proceedings of COLING 1994: The 15th International Conference on Computational Linguistics, Volume 1, pages 622–628, Kyoto.

Levin, James A. and James A. Moore. 1978. Dialogue-games: Metacommunication structures for natural language interaction. Cognitive Science, 1(4):395–420.

Manning, Christopher D. and Hinrich Schuetze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

Marcu, Daniel, Magdalena Romera, and Estibaliz Amorrortu. 1999. Experiments in constructing a corpus of discourse trees: Problems, annotation choices, issues. In Workshop on Levels of Representation in Discourse, pages 71–78, University of Edinburgh.

Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Marion, Rodger. 2004. The whole art of deduction. Unpublished manuscript. Available at: http://www.sahs.utmb.edu/PELLINORE/Intro to research/wad/wad/home.htm.

Melamed, I. Dan and Philip Resnik. 2000. Tagger evaluation given hierarchical tagsets. Computers and the Humanities, 34(1–2):79–84.

Mieskes, Margot and Michael Strube. 2006. Part-of-speech tagging of transcribed speech. In Proceedings of LREC, pages 935–938, Genoa.

Mihalcea, Rada, Timothy Chklovski, and Adam Kilgarriff. 2004. The SENSEVAL-3 English lexical sample task. In Proceedings of SENSEVAL-3, pages 25–28, Barcelona.

Miltsakaki, Eleni, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. 2004. Annotating discourse connectives and their arguments. In Proceedings of the HLT-NAACL Workshop on Frontiers in Corpus Annotation, pages 9–16, Boston, MA.

Moser, Megan G., Johanna D. Moore, and Erin Glendening. 1996. Instructions for Coding Explanations: Identifying Segments, Relations and Minimal Units. Technical Report 96-17, University of Pittsburgh, Department of Computer Science.

Navarretta, Costanza. 2000. Abstract anaphora resolution in Danish. In Proceedings of the 1st SIGdial Workshop on Discourse and Dialogue, pages 56–65, Hong Kong.

Nenkova, Ani and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of HLT-NAACL 2004, pages 145–152, Boston, MA.

Neuendorf, Kimberly A. 2002. The Content Analysis Guidebook. Sage, Thousand Oaks, CA.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2007. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13(2):137–163.

Passonneau, Rebecca J. 2004. Computing reliability for coreference annotation. In Proceedings of LREC, volume 4, pages 1503–1506, Lisbon.

Passonneau, Rebecca J. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of LREC, pages 831–836, Genoa.

Passonneau, Rebecca J., Nizar Habash, and Owen Rambow. 2006. Inter-annotator agreement on a multilingual semantic annotation task. In Proceedings of LREC, pages 1951–1956, Genoa.

Passonneau, Rebecca J. and Diane J. Litman. 1993. Intention-based segmentation: Human reliability and correlation with linguistic cues. In Proceedings of the 31st Annual Meeting of the ACL, pages 148–155, Columbus, OH.
Passonneau, Rebecca J. and Diane J. Litman. 1996. Empirical analysis of three dimensions of spoken discourse: Segmentation, coherence and linguistic devices. In Eduard H. Hovy and Donia R. Scott, editors, Computational and Conversational Discourse: Burning Issues – An Interdisciplinary Account, volume 151 of NATO ASI Series F: Computer and Systems Sciences. Springer, Berlin, chapter 7, pages 161–194.

Passonneau, Rebecca J. and Diane J. Litman. 1997. Discourse segmentation by human and automated means. Computational Linguistics, 23(1):103–139.

Pevzner, Lev and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36.

Poesio, Massimo. 2004a. Discourse annotation and semantic annotation in the GNOME corpus. In Proceedings of the 2004 ACL Workshop on Discourse Annotation, pages 72–79, Barcelona.

Poesio, Massimo. 2004b. The MATE/GNOME proposals for anaphoric annotation, revisited. In Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue, pages 154–162, Cambridge, MA.

Poesio, Massimo and Ron Artstein. 2005. The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, pages 76–83, Ann Arbor, MI.

Poesio, Massimo and Natalia N. Modjeska. 2005. Focus, activation, and this-noun phrases: An empirical study. In António Branco, Tony McEnery, and Ruslan Mitkov, editors, Anaphora Processing, volume 263 of Current Issues in Linguistic Theory. John Benjamins, Amsterdam and Philadelphia, pages 429–442.

Poesio, Massimo, A. Patel, and Barbara Di Eugenio. 2006. Discourse structure and anaphora in tutorial dialogues: An empirical analysis of two theories of the global focus. Research in Language and Computation, 4(2–3):229–257.

Poesio, Massimo and Renata Vieira. 1998. A corpus-based investigation of definite description use. Computational Linguistics, 24(2):183–216.

Popescu-Belis, Andrei. 2005. Dialogue acts: One or more dimensions? Working Paper 62, ISSCO, University of Geneva.

Posner, Karen L., Paul D. Sampson, Robert A. Caplan, Richard J. Ward, and Frederick W. Cheney. 1990. Measuring interrater reliability among multiple raters: An example of methods for nominal data. Statistics in Medicine, 9:1103–1115.

Rajaratnam, Nageswari. 1960. Reliability formulas for independent decision data when reliability data are matched. Psychometrika, 25(3):261–271.

Reidsma, Dennis and Jean Carletta. 2008. Reliability measurement without limits. Computational Linguistics, 34(3):319–326.

Reinhart, T. 1981. Pragmatics and linguistics: An analysis of sentence topics. Philosophica, 27(1):53–93.

Reynar, Jeffrey C. 1998. Topic Segmentation: Algorithms and Applications. Ph.D. thesis, University of Pennsylvania, Philadelphia.

Ries, Klaus. 2002. Segmenting conversations by topic, initiative and style. In Anni R. Coden, Eric W. Brown, and Savitha Srinivasan, editors, Information Retrieval Techniques for Speech Applications, volume 2273 of Lecture Notes in Computer Science. Springer, Berlin, pages 51–66.

Rietveld, Toni and Roeland van Hout. 1993. Statistical Techniques for the Study of Language and Language Behaviour. Mouton de Gruyter, Berlin.
Rosenberg, Andrew and Ed Binkowski. 2004. Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In Proceedings of HLT-NAACL 2004: Short Papers, pages 77–80, Boston, MA.

Scott, William A. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3):321–325.

Shriberg, Elizabeth, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. In Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue, pages 97–100, Cambridge, MA.

Siegel, Sidney and N. John Castellan, Jr. 1988. Nonparametric Statistics for the Behavioral Sciences, 2nd edition, chapter 9.8. McGraw-Hill, New York.

Stent, Amanda J. 2001. Dialogue Systems as Conversational Partners: Applying Conversation Acts Theory to Natural Language Generation for Task-Oriented Mixed-Initiative Spoken Dialogue. Ph.D. thesis, Department of Computer Science, University of Rochester.

Stevenson, Mark and Robert Gaizauskas. 2000. Experiments on sentence boundary detection. In Proceedings of the 6th ANLP, pages 84–89, Seattle, WA.

Stolcke, Andreas, Noah Coccaro, Rebecca Bates, Paul Taylor, Carol Van Ess-Dykema, Klaus Ries, Elizabeth Shriberg, Daniel Jurafsky, Rachel Martin, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.

Stuart, Alan. 1955. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42(3/4):412–416.

Teufel, Simone, Jean Carletta, and Marc Moens. 1999. An annotation scheme for discourse-level argumentation in research articles. In Proceedings of the Ninth Conference of the EACL, pages 110–117, Bergen.

Teufel, Simone and Marc Moens. 2002. Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409–445.

Traum, David R. and Elizabeth A. Hinkelman. 1992. Conversation acts in task-oriented spoken dialogue. Computational Intelligence, 8(3):575–599.

Vallduví, Enric. 1993. Information packaging: A survey. Research Paper RP-44, University of Edinburgh, HCRC.

Véronis, Jean. 1998. A study of polysemy judgments and inter-annotator agreement. In Proceedings of SENSEVAL-1, Herstmonceux Castle, England. Available at: http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS/.

Vilain, Marc, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the Sixth Message Understanding Conference, pages 45–52, Columbia, MD.

Zwick, Rebecca. 1988. Another look at interrater agreement. Psychological Bulletin, 103(3):374–378.