Evaluating Human Pairwise
Preference Judgments
Mark Dras∗
Centre for Language Technology
Macquarie University
Human evaluation plays an important role in NLP, often in the form of preference judgments.
Although there has been some use of classical non-parametric and bespoke approaches to evaluat-
ing these sorts of judgments, there is an entire body of work on this in the context of sensory
discrimination testing and the human judgments that are central to it, backed by rigorous
statistical theory and freely available software, that NLP can draw on. We investigate one
approach, Log-Linear Bradley-Terry models, and apply it to sample NLP data.
1. Introduction
Human evaluation is a key aspect of many NLP technologies. Automatic metrics that
correlate with human judgments have been developed, especially in Machine Translation,
to relieve some of the burden. Nevertheless, Callison-Burch et al. (2007) note in their
meta-evaluation that in MT they still “consider the human evaluation to be primary.”
Whereas MT has traditionally used a Likert scale score for the criteria of adequacy and
fluency, this meta-evaluation noted that these are “seemingly difficult things for judges
to agree on”; consequently, asking judges to express a preference between alternative
translations is increasingly used on the grounds of ease and intuitiveness. Further,
where the major empirical results of a paper are from automatic metrics, it is still useful
to supplement them: As two examples, Collins, Koehn, and Kucerova (2005) and Lewis
and Steedman (2013), in addition to a metric-based evaluation, present human judg-
ments of preferences for their systems with respect to a baseline (Figure 1). For results in
published work, the reader is typically left to draw inferences from the numbers. For the
data in Figure 1, is there a strong preference for the non-baseline system overall, or do
null preferences count against that? Is anything about the results statistically significant?
There has been work in various areas of NLP in assessing statistical significance of
human judgment results. However, to our knowledge, the field has not taken advantage
of a body of work dedicated to analyzing human preferences—predominantly in the
context of sensory discrimination testing, and consequent consumer behavior—which
is supported by a great deal of statistical theory. It is linked to the mixed-effect models
that are increasingly prominent in psycholinguistics and elsewhere, it has associated
freely available R software, and it permits questions like the following to be asked: Can
we say that the judges are expressing a preference at all, as opposed to no preference?
Is there an effect from judge disagreement or inconsistency?
∗ Department of Computing, Macquarie University, NSW 2109, Australia. E-mail: mark.dras@mq.edu.au.
Submission received: 10 April 2014; accepted for publication: 18 July 2014.
doi:10.1162/COLI_a_00222
© 2015 Association for Computational Linguistics
Table 1
Artificial pairwise MT data for four systems A, B, C, D. xRy represents whether x is preferred to
y (x ≻ y), the reverse (x ≺ y), or no preference (x = y), for four judges J1 . . . J4.

xy       AB           AC           AD           BC           BD           CD
xRy   ≻   =   ≺    ≻   =   ≺    ≻   =   ≺    ≻   =   ≺    ≻   =   ≺    ≻   =   ≺
J1    38   0   2   36   1   3   37   1   2   39   1   0    0   2  38    3   0  37
J2    37   2   1   36   3   1   38   1   1   37   2   1    2   2  36    1   0  39
J3    33   4   3   33   3   4    3   3  34    4   4  32    2   4  34    1   0  39
J4    37   2   1   39   0   1   35   3   2   34   2   4    2   2  36    3   1  36
3. Classical Non-Parametric Methods
A classical approach to evaluating preferences is the non-parametric sign test (Sprent
and Smeeton 2007). The first issue in applying this test here is ties, or expressions of
no preference—these are often ignored when the proportion of ties is small, but for
our typical examples of Figure 1, this is not true. Randles (2001) observes, regarding
the approach most widely recommended by textbooks of just ignoring ties, that “the
constrained number of possible p values and its ‘elimination of zeroes’ has caused
concern and controversy through the years.” Randles (2001) and Rayner and Best (2001,
chapter 2), reviewing several approaches to handling ties, both advocate splitting ties in
various ways depending on the problem setting, for (in Randles’s characterization) “it is
desirable that zeros have a conservative influence on declaring preference, but not to the
same degree as negative responses.” The key point is that modeling of ties explicitly can
be important, although there is no consensus on how this should be done; no approach
apart from ignoring ties appears to be in widespread use. The second issue with the sign
test is that of multiple judges, where data points are related (per esempio., the same items are
given to all judges). The Friedman test (Sprent and Smeeton 2007, Section 7.3.1) can be
viewed as an extension that can be applied to multiple subjects ranking multiple items
(see Bi 2006, Section 5.1.3, for an example). However, Francis, Dittrich, and Hatzinger
(2010) note that
[the Friedman test] simply examines the null hypothesis that the median ranks
for all items are equal, and does not consider any differences in ranking between
respondents. . . . Moreover, if the Friedman test rejects the null hypothesis, no
quantitative interpretation, such as the odds of preferring one item over another,
is provided. [Further, this] fail[s] both to consider the underlying psychological
mechanism for ranking, and to formulate correct statistical models for this
mechanism.
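To make the treatment of ties concrete, both variants can be run in a few lines of R; the counts below are invented for illustration, and the even split of ties is only one of the schemes that Randles (2001) and Rayner and Best (2001) review.

prefA <- 33; prefB <- 19; ties <- 8    # invented counts over 60 judgments

# The widely recommended textbook treatment: drop the ties, exact sign test.
binom.test(prefA, prefA + prefB, p = 0.5)

# One tie-splitting variant: allocate ties evenly to both sides, so that
# zeros pull the test toward the null, but less strongly than negative
# responses would.
binom.test(prefA + ties %/% 2, prefA + prefB + ties, p = 0.5)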
4. Methods in Machine Translation
Human evaluation in NLP is a pervasive issue, but here we focus on MT and its shared
tasks. The 2007 shared task (Callison-Burch et al. 2007) was the first to investigate a
range of approaches that specifically included ranking of n translations, from best to
worst, allowing ties (which were ignored); from this they defined an aggregate “rank,”
“the average number of times that a system was judged to be better than any other
system in the sentence ranking evaluation.” They assessed inter-annotator agreement,
and—with a key goal of the meta-evaluation being to find the automatic evaluation
metric that best matched human evaluations—calculated Spearman’s rank correlation
coefficient between the two types of assessment. The 2008 shared task (Callison-Burch
et al. 2008) took the same approach, but noted that in ranking, “[H]ow best to treat
these is an open discussion, and certainly warrants further thought,” in particular
because of ties “further complicating matters.” Pado et al. (2009) modified the system-
level predictions approach to become “tie-aware,” and noted that this “makes
a considerable practical difference, improving correlation figures by 5–10 points.” At
around the same time Vilar et al. (2007) examined the use of pairwise comparisons
in MT evaluation. They pose the problem as one where, given an order relationship
is-better-than between pairs of systems, the goal is to find an ordering of all the systems:
They see this as the fundamental computer science problem of sorting. They define
an aggregate evaluation score for comparing systems, estimating expected value and
standard error for hypothesis testing. Tuttavia, in aggregating this way information
about ties is lost.
Bojar et al. (2011) critique the earlier WMT evaluations, citing issues with the
ignoring of non-top ranks (noted in Section 3 herein also), with ties and also with
interannotator agreement. Lopez (2012) extends the analysis of Bojar et al. and casts the
problem as “finding the minimum feedback arc set in a tournament, a well-known NP-
complete problem.” He advocates using the pairwise rankings themselves, piuttosto che
aggregate statistics like Vilar et al. (2007), and aims to minimize the number of violations
among these. Koehn (2012) evaluates empirically the approaches of both Bojar et al.
(2011) and Lopez (2012), with a focus on determining which systems are statistically
distinguishable in terms of performance, defining confidence bounds for this purpose.
Hopkins and May (2013) recently advocated a focus on finding the extent to which
particular rankings could be trusted. They proposed a model based on Item Response
Theory (IRT), which underlies many standardized tests. They draw an analogy with
judges assessing students on the basis of an underlying distribution of the student’s
ability, with items authored by students having a quality drawn from the student’s
ability distribution. They note in passing that a Gaussian parameterization of their IRT
models resembles Thurstone and Bradley-Terry models; this leads us to the topic of
Section 5.
Overall, then, there are ongoing discussions about what kind of analysis is appro-
priate for preference judgments. Some of this involves moderately heavy-duty compu-
tation for bootstrapping; this is suitable for large-scale WMT evaluations with dozens of
competing systems, but perhaps less so for the scenarios we envisage in Section 1. More-
over, examining what techniques other fields have developed could be useful, especially
when they come with ready-made, easy-to-use tools for smaller-scale evaluation.
5. Preferences and Log-Linear Bradley-Terry Methods
The statistical analysis of human perception and preferences dates back at least to the
psychophysics work of German physiologist E. H. Weber in the nineteenth century.
A progression from the way humans perceive differences between physical stimuli to
more general analysis of human preferences has occurred particularly in the context
of investigating consumer behavior—dealing with questions like whether there is a
definite preference for a food with a particular type of ingredient, for example—and
this is now a fully fledged area of research. Sources like Lawless and Heymann (2010)
give overviews of the field and relevant statistical techniques. The earliest generally
cited models for pairwise comparisons are the Thurstone model (Thurstone 1927) and
the closely related Bradley-Terry (BT) model (Bradley and Terry 1952); these have
connections to the IRT models, widely used in analyzing responses to questionnaires,
which Hopkins and May (2013) drew on. Here we only look at BT models.
In a basic BT model, the probability that object j (Oj) is preferred to object k (Ok) from
a set of J objects in a particular pairwise comparison jk is given by p(Oj ≻ Ok | πj, πk) =
πj / (πj + πk) for all j ≠ k, where πj and πk are non-negative “worth” parameters describing the
location of the object on the scale of preferences for some attribute. For n objects, there
will be n(n − 1)/2 pairwise comparisons.
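As a worked example, with assumed worths π = (4, 2, 1) for three objects (values chosen purely for illustration), these quantities can be computed directly in R:

worth <- c(o1 = 4, o2 = 2, o3 = 1)         # assumed worth parameters
worth["o1"] / (worth["o1"] + worth["o2"])  # p(o1 preferred to o2) = 4/6
choose(length(worth), 2)                   # n(n - 1)/2 = 3 comparisons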
Log-Linear Models. It is now standard to fit BT models as log-linear models (Agresti
2007, for example), which allows them to be treated in a uniform way with much of
modern statistical analysis. Log-linear models are a variety of generalized linear models
(GLM), as is, for example, the logistic regression used throughout NLP. GLMs consist
of a random component that identifies the response variable Y and selects a probability
distribution for it; a systematic component that specifies some linear combination of
the explanatory variables xi; and a link function g(·) applied to the mean μ of Y relating
μ to this linear combination. They thus have the form g(μ) = α + β1x1 + . . . + βkxk. For
log-linear models, the response variables are counts that are assumed to follow a Pois-
son distribution, and the link function is g(μ) = log(μ) (compare logistic regression’s
g(μ) = log(μ/(1 − μ))). As an example, Y might be counts of people who hold some belief,
and the various xi might be gender, socioeconomic status, and so forth. GLMs are a key
tool for modern categorical data analysis, Agresti (2007, p. 65) noting that using models
rather than the non-parametric approaches of Section 3 has several benefits:
The structural form of the model describes the patterns of association and interaction.
The sizes of the model parameters determine the strength and importance of the effects.
Inferences about the parameters evaluate which explanatory variables affect the
response variable Y, while controlling effects of possible confounding variables. Finalmente,
the model’s predicted values smooth the data and provide improved estimates of the
mean of Y at possible explanatory variable values.
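As a minimal sketch of the Poisson case in base R (with invented counts), the belief example just described might look like the following:

y <- c(30, 18)                            # counts holding the belief, by group
gender <- factor(c("f", "m"))             # one explanatory variable
fit <- glm(y ~ gender, family = poisson)  # log link is the default
coef(fit)                                 # genderm estimate is log(18/30)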
In a log-linear model, intuitive log-odds interpretations of making one response
relative to another can be derived from the parameters. (Typically, software chooses a
reference parameter and other parameter values are relative to that.) Statistical signif-
icance scores and standard errors can be calculated for these parameters. Inoltre,
GLMs allow for testing of model fit. There are various model choices (per esempio., should
we include ties? should we include terms representing interactions?) and goodness-
of-fit tests can assess the alternatives (Vedere, per esempio., Agresti 2007, Sezione 7.2.1). The model
with a separate parameter for each cell in the associated contingency table is called
the saturated model, and fits the data perfectly, making it a suitable comparator for
alternatives. Deviance is a likelihood ratio statistic comparing a proposed model to the
saturated one, allowing a test of the hypothesis that parameters not included in the
model are zero, via goodness of fit tests; large test statistics and small p-values provide
evidence of model lack of fit.
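A sketch of such a goodness-of-fit test in base R, using an invented 2 × 2 table: the independence model omits the interaction term, and its residual deviance is compared against the (perfectly fitting) saturated model.

counts <- c(25, 10, 12, 30)                     # invented 2 x 2 contingency table
a <- factor(c(0, 0, 1, 1))
b <- factor(c(0, 1, 0, 1))
indep <- glm(counts ~ a + b, family = poisson)  # no a:b interaction term
deviance(indep)                                 # likelihood ratio vs. saturated model
pchisq(deviance(indep), df.residual(indep), lower.tail = FALSE)
# A small p-value is evidence of lack of fit, i.e., the omitted term matters.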
Models with Ties. To set out the representation of LLBT models, we follow the for-
mulation of Dittrich and Hatzinger (2009). Let n(jk) be the number of comparisons
between objects j and k; and let Y(jk)j be the number of preferences for object j with
respect to k (similarly, Y(jk)k). The outcome of a paired comparison experiment can
also be regarded as a J(J − 1)/2 × J incomplete two-dimensional contingency table: There are
J(J − 1)/2 rows of pairwise comparisons, and J columns recording choices of the jth object.
As with log-linear models in general, the distribution of random variables Y(jk)j and
Y(jk)k is assumed to be Poisson. Conditional on fixed n(jk) = Y(jk)j + Y(jk)k, (Y(jk)j, Y(jk)k)
follow a binomial (more generally, multinomial) distribution. The expected number of
preferences of object j with respect to object k is denoted m(jk)j and given by n(jk)p(jk)j,
with p(jk)j the binomial probability. So far this is only for binary preferences; there are
various ways to account for ties. We describe the approach of Davidson and Beaver
(1977), which appears quite widely used, where there is a common null preference effect
for all pairwise comparisons. Then
    log m(jk)j = μ(jk) + λO_j − λO_k
    log m(jk)k = μ(jk) − λO_j + λO_k
    log m(jk)0 = μ(jk) + γ                                            (1)
where the μ’s are “nuisance” parameters that fix the n(jk) marginal distributions, and the
λO’s represent object parameters; m(jk)0 is the expected number of null preferences for
pair (jk), and γ is the undecided effect. The object parameters are related to the worth
parameters of the original definition by log πj = 2λO_j: These represent the log-odds.
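The structure of Equations (1) can be checked directly with base R’s glm before turning to dedicated packages: for a single pairwise comparison a hand-built design matrix suffices. The counts here are invented and are not those of Figure 1; lam and tie play the roles that o1 and g1 play in the prefmod output below.

y   <- c(30, 10, 20)  # invented counts: prefer j, no preference, prefer k
lam <- c(1, 0, -1)    # +lambda_j on the j cell, -lambda_j on the k cell
tie <- c(0, 1, 0)     # gamma enters only the no-preference cell
fit <- glm(y ~ lam + tie, family = poisson)
summary(fit)
# The model is saturated (zero residual deviance): the lam estimate is
# 0.5 * log(30/20) (about 0.203), so the odds of preferring j over k are
# exp(2 * lam) = 30/20 = 1.5, mirroring log pi_j = 2 * lambda_j in the text.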
In addition to the theoretical reasons for using LLBTs for modeling pairwise com-
parisons, a key benefit is the availability of packages in R for doing the modeling. Two
candidates allowing a variety of sophisticated models are by Turner and Firth (2012)
and Hatzinger and Dittrich (2012); we use the latter as the current version of the former
does not handle ties. We first apply the model described by Equations (1) to the single
pairwise data with ties from Section 2 using R. We refer the reader to the associated data
bundle1 for the full output; we only excerpt it in the discussion below. Immediately
following is a snippet of the R output for the ModPref data from Figure 1. o1 is the
variable λO_j for the + category, o2 for the – category, g1 for the null preferences.
     Estimate Std. Error z value Pr(>|z|)
o1     0.2778     0.1060   2.620   0.0088 **
o2     0.0000         NA      NA       NA
g1    -0.6551     0.2300  -2.848   0.0044 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual deviance: -6.6614e-15 on 0 degrees of freedom
In the R output, o2 is the reference object, with parameter value set to zero; IL
negative value of the estimate for g1 combined with its statistical significance says that
there is a strong tendency for an expression of preference. The positive value of the o1
parameter and its significance indicate that the + group is strongly preferred: The odds
in favor of this group with respect to the – group is exp(2 × 0.2778) = 1.74 to 1. Relating
this to the description of the data in Section 2, then, there is a strong preference for trans-
lations by the proposed system relative to the baseline, even taking into account null
preferences. The LLBT model confirms that even small data sets like this can produce
meaningful and statistically significant results. For the other artificial preference data of
Figure 1, the parameters behave as expected: for EqualPref, parameter estimates are all
zero, signifying that they all have the same odds; for NoPref, the positive g1 indicates
a strong tendency towards no preference; for StrongPref, the negative g1 indicates a
strong tendency towards some preference, but with + or – equally likely. Note that
all of these are saturated models: there are three cells and three parameters, so the
model fits perfectly (indicated also by zero residual deviance). When we apply them
to the real count data of Figure 1 (c), the results indicate that for the Collins et al. data
there is a weak to moderate tendency not to choose (g1 estimate 0.303, p = 0.0432), but,
given that, there is a significant (0.0001) preference in favor of the reordered system.
For the Lewis and Steedman results, the model gives similar results, albeit with a much
stronger disposition to null preferences. In the data bundle we also carry out the sign
test ignoring ties for each data set for comparison; it gives the same results in each case
for the relation of + to –, but does not allow an evaluation of the effect of ties.

1 Data files and all R commands and output are at https://purl.org/NET/cl-llbt-data.
We now apply the model described by Equations (1) to the multiple pairwise data
of Table 1. In the R output, the four systems A, B, C, D correspond to objects o1, o2, o3, o4,
and g1 again to null preferences. As per the overview of the MT data in Section 2, there
is little undecidedness (large negative g1). The coefficients show that object o1 (system
UN) is most preferred, followed by o4 (D), then o2 (B) and o3 (C). Note also that in this
case, the model is not saturated: There is a non-zero residual deviance. As mentioned,
log-linear models can be compared in terms of goodness of fit: Dittrich, Hatzinger, E
Katzenbeisser (1998) and Dittrich and Hatzinger (2009) discuss this in some detail for
LLBT models. Chi-squared statistics can be used to assess goodness of fit based on the
residual deviance; the degrees of freedom (d.f.) equal the number of cell counts minus
the number of model parameters; both deviance and d.f. are given in the R output. For
these data the deviance is 30.646 on 8 d.f.; by contrast, if the ties (g1) are left out, it is
221.22 on 9 d.f. A chi-squared test would establish the goodness of fit for each model;
but even without consulting the test it can be seen that leaving out the one parameter
related to ties (1 d.f.) gives a seven-fold increase in deviance, so clearly inclusion of ties
produces a much better model.
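For reference, the chi-squared computations on the reported deviances are one-liners in R:

pchisq(30.646, df = 8, lower.tail = FALSE)  # with ties: some lack of fit remains
pchisq(221.22, df = 9, lower.tail = FALSE)  # without ties: dramatically worse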
Introducing Subject Covariates. The model can also incorporate a range of other factors,
a possibility not easily open to non-parametric methods. The one we look at here is the
notion of a categorical covariate, introduced into LLBT models in Dittrich, Hatzinger,
and Katzenbeisser (1998): This allows the objects (items) to vary with characteristics of
the subject (judge). Many types of subject covariates could be added, grouping subjects
by native language of the speaker, source of judges (e.g., Mechanical Turk, university),
and so forth. Here we add just one, the identity of the subject. (Typically in a GLM this
would be a random effect; we treat it as a covariate just for our simple illustration.)
We define our categorical covariate S to have levels l, l = 1, 2, . . . , L. Let m(jk)j|l be the
expected number of preferences for object j with respect to object k for subjects in
covariate class l. The log-linear representation is then as follows:

    log m(jk)j|l = μ(jk)l + λO_j − λO_k + λS_l + λOS_jl − λOS_kl
    log m(jk)k|l = μ(jk)l − λO_j + λO_k + λS_l − λOS_jl + λOS_kl
    log m(jk)0|l = μ(jk)l + λS_l + γ                                  (2)
As do Dittrich and Hatzinger (2009), we define a reference group, with the λO_j’s
representing the ordering for that group; the orderings for other groups are obtained
by adding the λOS_jl’s specific to group l to the λO_j’s for the reference group. The μ(jk)l’s
and λS_l’s are again “nuisance” parameters, the latter representing the main effect of the
subject covariate measured on the lth level; the λOS_jl’s are the (useful) subject-object
interaction parameters describing the effect of the subject covariate on the preference
for object j (similarly λOS_kl and object k). We apply the model described by Equations
(2) to the multiple pairwise data, with the subject covariate SUBJ with four levels (one
per judge Ji of Table 1). There are a few complexities in interpreting the output, beyond the scope
of this article to discuss but covered in Dittrich, Hatzinger, and Katzenbeisser (1998).
The broad interpretations to draw from the output are that interactions o1:SUBJ3 and
o2:SUBJ3 are large and significant, and contribute to the model, unlike any others. These
correspond to the different pairwise rankings given by judge J3 to system A (relative
to D) and to B (relative to C): This is how subject effects are indicated in these LLBT
models.
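The structure of Equations (2) can likewise be sketched with base R’s glm for a single pair and a two-level subject covariate; the counts are invented, and with only one pair the μ(jk)l and λS_l nuisance terms collapse into a per-group intercept.

y   <- c(30, 5, 5, 8, 4, 28)       # invented counts: (j, tie, k) in each group
grp <- factor(rep(1:2, each = 3))  # subject covariate with levels l = 1, 2
lam <- rep(c(1, 0, -1), 2)         # object contrast, as in Equations (1)
tie <- rep(c(0, 1, 0), 2)          # common undecided effect gamma
fit <- glm(y ~ grp + lam + tie + lam:grp, family = poisson)
summary(fit)
# lam:grp2 is the subject-object interaction (the lambda^OS term). Here it
# is large and significant because group 2 reverses the preference, just as
# o1:SUBJ3 and o2:SUBJ3 flag judge J3 above.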
There are many other extensions to these models. Cattelan (2012) gives a state-of-
the-art overview of such extensions across a range of approaches, with an emphasis on
dependent data. We only note two extensions here that are incorporated into prefmod
and relevant to NLP. With categorical object covariates, items can be grouped as well,
to investigate effects of grouping there, for example, different origins for translation
sources. With non-pairwise rankings, judges can rank over more than two elements, as
in the standard WMT evaluations, although this needs a special treatment in the models.
6. Conclusions
We have looked at the sort of (pairwise) preference data that is encountered often in
NLP. A particular characteristic of NLP data is that ties or undecided results may be
frequent, and there is often a concern with inter-judge consistency. Reviewing classical
non-parametric approaches, we note the opinion that it is important to model ties, E
also note that approaches to looking at subject (judge) effects have several issues, come
as a lack of quantitative interpretation of results. Among NLP approaches, particolarmente
within MT, new techniques are still being derived, which could benefit from views from
outside the field. What we present are techniques from the field of sensory preference
evaluation, where there has been a long history of development by statistics researchers.
Recently, log-linear models have attracted attention. Applying them to sample data, we
find that they provide the sort of information and uniform framework for analysis that
NLP researchers could find useful. Given both extensive theoretical underpinnings and
freely available statistical software, we recommend LLBT models as a potential tool.
Riferimenti
Agresti, Alan. 2007. An Introduction to
Categorical Data Analysis. John Wiley,
2nd edition.
Bi, Jian. 2006. Sensory Discrimination Tests
and Measurements: Statistical Principles,
Procedures and Tables. Blackwell,
Oxford, UK.
Bojar, Ondˇrej, Miloˇs Ercegovˇcevi´c, Martin
Popel, and Omar Zaidan. 2011. A grain of
salt for the WMT manual evaluation. In
Proceedings of WMT, pages 1–11.
Bradley, Ralph and Milton Terry. 1952. Rank
analysis of incomplete block designs, IO.
The method of paired comparisons.
Biometrika, 39:324–345.
Callison-Burch, Chris, Cameron Fordyce,
Philipp Koehn, Christof Monz, and Josh
Schroeder. 2007. (Meta-) Evaluation of
machine translation. In Proceedings of
WMT, pages 136–158.
Callison-Burch, Chris, Cameron Fordyce,
Philipp Koehn, Christof Monz,
and Josh Schroeder. 2008. Further
meta-evaluation of machine
translation. In Proceedings of WMT,
pages 70–106.
Cattelan, Manuela. 2012. Models for paired
comparison data: A review with emphasis
on dependent data. Statistical Science,
27(3):412–423.
Collins, Michael, Philipp Koehn, E
Ivona Kucerova. 2005. Clause
restructuring for statistical machine
translation. In Proceedings of ACL,
pages 531–540.
Davidson, R. R. and R. J. Beaver. 1977. On
extending the Bradley-Terry model to
incorporate within-pair order effects.
Biometrics, 33:693–702.
Dittrich, Regina and Reinhold Hatzinger.
2009. Fitting loglinear Bradley-Terry
models (LLBT) for paired comparisons
using the R package prefmod.
Psychological Science Quarterly,
51(2):216–242.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
C
o
l
io
/
l
UN
R
T
io
C
e
–
P
D
F
/
/
/
/
4
1
2
3
3
7
1
8
0
4
8
2
1
/
C
o
l
io
_
UN
_
0
0
2
2
2
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Dittrich, Regina, Reinhold Hatzinger, E
W. Katzenbeisser. 1998. Modelling the
effect of subject-specific covariates in
paired comparison studies with an
application to university rankings.
Journal of the Royal Statistical Society,
Series C: Applied Statistics,
47:511–525.
Francis, Brian, Regina Dittrich, E
Reinhold Hatzinger. 2010. Modeling
heterogeneity in ranked responses by
nonparametric maximum likelihood:
How do Europeans get their scientific
knowledge? The Annals of Applied
Statistics, 4(4):2181–2202.
Hatzinger, Reinhold and Regina Dittrich.
2012. prefmod: A package for modeling
preferences based on paired comparisons,
rankings, or ratings. Journal of Statistical
Software, 48(10):1–31.
Hopkins, Mark and Jonathan May.
2013. Models of translation
competitions. In Proceedings of ACL,
pages 1416–1424.
Koehn, Philipp. 2012. Simulating human
judgment in machine translation
evaluation campaigns. In Proceedings of
IWSLT, pages 179–184.
Lawless, Harry T. and Hildegarde Heymann.
2010. Sensory Evaluation of Food: Principles
and Practices. Springer, New York, NY
2nd edition.
Lewis, Mike and Mark Steedman. 2013.
Unsupervised induction of cross-lingual
semantic relations. In Proceedings of
EMNLP, pages 681–692.
Lopez, Adam. 2012. Putting human
assessments of machine translation
systems in order. In Proceedings of WMT,
pages 1–9.
Pado, Sebastian, Michel Galley, Dan Jurafsky,
and Christopher D. Manning. 2009. Robust
machine translation evaluation with
entailment features. In Proceedings of ACL /
AFNLP, pages 297–305.
Randles, Ronald H. 2001. On neutral
responses (zeros) in the sign test and ties
in the Wilcoxon-Mann-Whitney test. IL
American Statistician, 55(2):96–101.
Rayner, J. C. W. and D. J. Best. 2001. UN
Contingency Table Approach to
Nonparametric Testing. Chapman and
Hall/CRC, Boca Raton, FL.
Sprent, Peter and Nigel Smeeton. 2007.
Applied Nonparametric Statistical Methods.
Chapman and Hall, London, UK.
Thurstone, L. L. 1927. A law of comparative
judgment. Psychological Review,
34:278–286.
Turner, Heather and David Firth. 2012.
Bradley-Terry Models in R: IL
BradleyTerry2 Package. Journal of
Statistical Software, 48(9):1–21.
Vilar, David, Gregor Leusch, Hermann Ney,
and Rafael E. Banchs. 2007. Human
evaluation of machine translation through
binary system comparisons. In Proceedings
of WMT, pages 96–103.