Interactive Text Ranking with Bayesian Optimization: A Case Study on
Community QA and Summarization
Edwin Simpson1,2
Yang Gao1,3
Iryna Gurevych1
1Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt,
https://www.informatik.tu-darmstadt.de/
2Dept. of Computer Science, University of Bristol, edwin.simpson@bristol.ac.uk
3Dept. of Computer Science, Royal Holloway, University of London, yang.gao@rhul.ac.uk
Abstract

For many NLP applications, such as question answering and summarization, the goal is to select the best solution from a large space of candidates to meet a particular user's needs. To address the lack of user- or task-specific training data, we propose an interactive text ranking approach that actively selects pairs of candidates, from which the user selects the best. Unlike previous strategies, which attempt to learn a ranking across the whole candidate space, our method uses Bayesian optimization to focus the user's labeling effort on high-quality candidates and integrate prior knowledge to cope better with small data scenarios. We apply our method to community question answering (cQA) and extractive multidocument summarization, finding that it significantly outperforms existing interactive approaches. We also show that the ranking function learned by our method is an effective reward function for reinforcement learning, which improves the state of the art for interactive summarization.
1 Introduction
Many text ranking tasks are highly specific to an
individual user’s topic of interest, which presents
a challenge for NLP systems that have not been
trained to solve that user’s problem. Consider
ranking summaries or answers to non-factoid
questions: A good solution requires understanding
the topic and the user’s information needs (Liu and
Agichtein, 2008; López et al., 1999). We address
this by proposing an interactive text ranking
approach that efficiently gathers user feedback
and combines it with predictions from pretrained,
generic models.
To minimize the amount of effort the user must
expend to train a ranker, we learn from pairwise
preference labels, in which the user compares two
candidates and labels the best one. Pairwise pref-
erence labels can often be provided faster than
ratings or class labels (Yang and Chen, 2011;
Kingsley and Brown, 2010; Kendall, 1948), can
be used to rank candidates using learning-to-rank
(Joachims, 2002), preference learning (Thurstone,
1927), or best–worst scaling (Flynn and Marley,
2014), or to train a reinforcement learning (RL)
agent to find the optimal solution (Wirth et al.,
2017).
To reduce the number of labels a user must
provide, a common solution is active learning
(AL). AL learns a model by iteratively acquiring
labels: At each iteration, it trains a model on the
labels collected so far, then uses an acquisition
function to quantify the value of querying the user
about a particular pair of candidates. The system
then chooses the pairs with the highest values, and
instructs the user to label them. The acquisition
function implements one of many different strat-
egies to minimize the number of interaction
rounds, such as reducing uncertainty (Settles,
2012) by choosing informative labels that help
learn the model more quickly.
Many active learning strategies, such as the
pairwise preference learning method of Gao et al.
(2018), aim to learn a good ranking model for all
candidates, for example, by querying the annotator
about candidates whose rank is most uncertain.
However, we often need to find and rank only a
small set of good candidates to present to the user.
For instance, in question answering, irrelevant
answers should not be shown to the user, so their
precise ordering is unimportant and users should
not waste time ranking them. Therefore, by reducing
uncertainty for all candidates, uncertainty-based
AL strategies may waste labels on sorting poor
candidates.
Here, we propose an interactive method for
ranking texts that replaces the standard uncertainty-
based acquisition functions with acquisition func-
tions for Bayesian optimization (BO) (Močkus,
1975; Brochu et al., 2010). In general, BO aims
to find the maximum of a function while mini-
mizing the number of queries to an oracle. Here,
we use BO to maximize a ranking function that
maps text documents to scores, treating the user
as a noisy oracle. Our BO active learning strategy
minimizes the number of labels needed to find the
best candidate, in contrast to uncertainty-based
strategies that attempt to learn the entire ranking
function. This makes BO better suited to tasks
such as question answering, summarization, or
translation, where the aim is to find the best can-
didate and those with low quality can simply be
disregarded rather than ranked precisely. In this
paper, we define two BO acquisition functions for
interactive text ranking.
While our approach is designed to adapt a model
to a highly specialized task, generic models can
provide hints to help us avoid low-quality candi-
dates. Therefore, we learn the ranking function
itself using a Bayesian approach, which integrates
prior predictions from a generic model that is
not tailored to the user. Previous interactive text
ranking methods either do not exploit prior inform-
ation (Baldridge and Osborne, 2004; P.V.S and
Meyer, 2017; Lin and Parikh, 2017; Siddhant
and Lipton, 2018), combine heuristics with user
feedback after active learning is complete (Gao
et al., 2018), or require expensive re-training of
a non-Bayesian method (Peris and Casacuberta,
2018). Here, we show how BO can use prior in-
formation to expedite interactive text ranking.
The interactive learning process is shown in
Algorithm 1 and examples of our system outputs
are shown in Figures 1 and 2.
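To make the loop in Algorithm 1 concrete, the following is a minimal sketch in Python. The ranker, acquisition, and user_label interfaces are assumptions standing in for the preference learners, acquisition functions, and (simulated) users described later in the paper; this is not the released implementation.

```python
from itertools import combinations

def interactive_ranking(candidates, features, user_label, ranker, acquisition,
                        max_interactions=10, batch_size=1):
    """candidates: list of texts; features: array of feature vectors phi(x);
    user_label(a, b) returns 1 if the user prefers candidates[a] over candidates[b]."""
    data = []                                    # D: collected (a, b, label) triples
    ranker.fit(features, data)                   # initialize model m (prior only)
    while len(data) < max_interactions:
        # score every candidate pair with the acquisition function
        scores = {(a, b): acquisition(a, b, ranker)
                  for a, b in combinations(range(len(candidates)), 2)}
        batch = sorted(scores, key=scores.get, reverse=True)[:batch_size]
        for a, b in batch:                       # query the user and extend D
            data.append((a, b, user_label(a, b)))
        ranker.fit(features, data)               # retrain m on the enlarged D
    return ranker                                # rank x by the model's utilities
```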
Our contributions are (1) a Bayesian optimiza-
tion methodology for interactive text ranking that
integrates prior predictions with user feedback, (2)
acquisition functions for Bayesian optimization
with pairwise labels, and (3) empirical evaluations
on community question answering (cQA) and
extractive multidocument summarization, which
show that our method brings substantial improve-
ments in ranking and summarization performance
(e.g., for cQA, an average 25% increase in answer selection accuracy over the next-best method with 10 rounds of user interaction). We release the complete experimental software for future work.1

1 https://github.com/UKPLab/tacl2020-interactive-ranking.

Algorithm 1: Interactive text ranking process with preference learning.
    Input: candidate texts x with feature vectors φ(x)
    1  Initialize ranking model m;
    2  Set the training data D = ∅;
       while |D| < max_interactions do
    3      For each pair of texts (x_a, x_b) in x, compute v = acquisition(φ(x_a), φ(x_b), D, m);
    4      Set P_i to be the set of batch_size pairs with the highest values of v;
    5      Obtain labels y_i from the user for pairs P_i;
    6      Add y_i and P_i to D;
    7      Train model m on the training set D;
       end
    Output: Return the trained model m and/or its final ranked list of candidate texts in x.

Figure 1: Example from the Stack Exchange Cooking topic. Candidate answer A1 selected without user interaction by COALA (Rücklé et al., 2019); A2 chosen by our system (GPPL with IMP) after 10 user interactions. A2 answers the question (boldfaced text) but A1 fails.

Figure 2: Example summaries for DUC'04 produced by RL (see Section 5.4) with a reward function learnt from 100 user interactions using (a) the BT, UNC method of Gao et al. (2018) and (b) our GPPL, IMP method. (c) is a model summary written by an expert. Each color indicates a particular news event or topic, showing where it occurs in each summary. Compared to (a), summary (b) covers more of the events discussed in the reference, (c).

2 Related Work

Interactive Learning in NLP. Previous work has applied active learning to tasks involving ranking or optimising generated text, including summarization (P.V.S and Meyer, 2017), visual question answering (Lin and Parikh, 2017), and translation (Peris and Casacuberta, 2018). For summarization, Sokolov et al. (2016), Lawrence and Riezler (2018), and Singh et al. (2019) train reinforcement learners by querying the user directly for rewards, which requires on the order of 10^5 interactions. Gao et al. (2018) dramatically reduce the number of user interactions to the order of 10^2 by using active learning to learn a reward function for RL, an approach proposed by Lopes et al. (2009). These previous works use uncertainty sampling strategies, which query the user about the candidates with the most uncertain rankings to try to learn all candidates' rankings with a high degree of confidence. We instead propose to find good candidates using an optimization strategy.

Siddhant and Lipton (2018) carried out a large empirical study of uncertainty sampling for sentence classification, semantic role labeling, and named entity recognition, finding that exploiting model uncertainty estimates provided by Bayesian neural networks improved performance. Our approach also exploits Bayesian uncertainty estimates.

BO for Preference Learning. Bayesian approaches using Gaussian processes (GPs) have previously been used to reduce errors in NLP tasks involving sparse or noisy labels (Cohn and Specia, 2013; Beck et al., 2014), making them well-suited to learning from user feedback.
Gaussian process preference learning (GPPL) (Chu and Ghahramani, 2005) enables GP inference with pairwise preference labels. Simpson and Gurevych (2018) introduced scalable inference for GPPL using stochastic variational inference (SVI) (Hoffman et al., 2013), which outperformed SVM and LSTM methods at ranking arguments by convincingness. They included a study on active learning with pairwise labels, but tested GPPL only with uncertainty sampling, not BO. Here, we adapt GPPL to summarization and cQA, show how to integrate prior predictions, and propose a BO framework for GPPL that facilitates interactive text ranking.

Brochu et al. (2008) proposed a BO approach for pairwise comparisons but applied the approach only to a material design use case with a very simple feature space. González et al. (2017) proposed alternative BO strategies for pairwise preferences, but their approach requires expensive sampling to estimate the utilities, which is too slow for an interactive setting. Yang and Klabjan (2018) also propose BO with pairwise preferences, but again, inference is expensive, the method is only tested with fewer than ten features, and it uses an inferior probability of improvement strategy (see Snoek et al., 2012). Our GPPL-based framework permits much faster inference even when the input vector has more than 200 features, and hence allows rapid selection of new pairs when querying users.

Ruder and Plank (2017) use BO to select training data for transfer learning in NLP tasks such as sentiment analysis, POS tagging, and parsing. However, unlike our interactive text ranking approach, their work does not involve pairwise comparisons and is not interactive, as the optimizer learns by training and evaluating a model on the selected data. In summary, previous work has not yet devised BO strategies for GPPL or suitable alternatives for interactive text ranking.

3 Background on Preference Learning

Popular preference learning models assume that users choose a candidate from a pair with probability p, where p is a function of the candidates' utilities (Thurstone, 1927). Utility is defined as the value of a candidate to the user, that is, it quantifies how well that instance meets their needs. When candidates have similar utilities, the user's choice is close to random, while pairs with very different utilities are labeled consistently. Such models include the Bradley–Terry model (BT) (Bradley and Terry, 1952; Luce, 1959; Plackett, 1975), and the Thurstone–Mosteller model (Thurstone, 1927; Mosteller, 1951). BT defines the probability that candidate a is preferred to candidate b as follows:

    p(y_{a,b}) = (1 + exp(−(w^T φ(a) − w^T φ(b))))^{−1},    (1)

where y_{a,b} = a ≻ b is a binary preference label, φ(a) is the feature vector of a, and w is a weight parameter that must be learned. To learn the weights, we treat each pairwise label as two data points: The first point has input features x = φ(a) − φ(b) and label y, and the second point is the reverse pair, with x = φ(b) − φ(a) and label 1 − y. Then, we use standard techniques for logistic regression to find the weights w that minimize the L2-regularized cross entropy loss. The resulting linear model can be used to predict labels for any unseen pairs, or to estimate candidate utilities, f_a = w^T φ(a), which can be used for ranking.

Uncertainty (UNC). At each active learning iteration, the learner requests training labels for
candidates that maximize the acquisition function. P.V.S and Meyer (2017) proposed an uncertainty sampling acquisition function for interactive document summarization, which defines the uncertainty about a single candidate's utility, u, as follows:

    u(a | D) = p(a | D)         if p(a | D) ≤ 0.5,
               1 − p(a | D)     if p(a | D) > 0.5,    (2)

where p(a | D) = (1 + exp(−f_a))^{−1} is the probability that a is a good candidate and w is the set of BT model weights trained on the data collected so far, D, which consists of pairs of candidate texts and pairwise preference labels. For pairwise labels, Gao et al. (2018) define an acquisition function, which we refer to here as UNC, which selects the pair of candidates (a, b) with the two highest values of u(a | D) and u(b | D).

UNC is intended to focus labeling effort on candidates whose utilities are uncertain. If the learner is uncertain about whether candidate a is a good candidate, p(a | D) will be close to 0.5, so a will have a higher chance of being selected. Unfortunately, it is also possible for p(a | D) to be close to 0.5 even if a has been labeled many times, if a is a candidate of intermediate utility. Therefore, when using UNC, labeling effort may be wasted re-labeling mediocre candidates.
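As a concrete reference point for Equations 1 and 2, the sketch below implements the BT learner and the UNC pair value using scikit-learn's logistic regression. The class and function names are illustrative, and features is assumed to be a NumPy array of candidate feature vectors; this is not the authors' released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class BTRanker:
    """Bradley-Terry ranker (Eq. 1): each pairwise label (a, b, y) becomes the
    difference vectors phi(a)-phi(b) with label y and phi(b)-phi(a) with 1-y."""
    def fit(self, features, data):
        X = [features[a] - features[b] for a, b, _ in data] + \
            [features[b] - features[a] for a, b, _ in data]
        Y = [y for _, _, y in data] + [1 - y for _, _, y in data]
        model = LogisticRegression(penalty="l2", fit_intercept=False)
        self.w = model.fit(np.array(X), np.array(Y)).coef_[0]

    def utilities(self, features):
        return features @ self.w                      # f_a = w^T phi(a)

def unc_pair_value(a, b, ranker, features):
    # Eq. 2: u(c|D) = p(c|D) if p <= 0.5 else 1 - p, with p(c|D) = sigmoid(f_c);
    # UNC picks the pair whose two candidates have the highest u values.
    p = 1.0 / (1.0 + np.exp(-ranker.utilities(features[[a, b]])))
    return float(np.minimum(p, 1.0 - p).sum())
```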
The problem is that BT cannot distinguish
two types of uncertainty. The first is aleatoric
uncertainty due to the inherent unpredictability of
the phenomenon we wish to model (Senge et al.,
2014). For example, when predicting the outcome
of a coin toss, we model the outcome as random.
Similarly, given two equally preferable items, we
assume that the user assigns a preference label
randomly. It does not matter how much training
data we observe: if the items are equally good, we
are uncertain which one the user will choose.
The second type is epistemic uncertainty due
to our lack of knowledge, which can be reduced
by acquiring more training data, as this helps
us to learn the model’s parameters with higher
confidence. BT does not quantify aleatoric and
epistemic uncertainty separately, unlike Bayesian
approaches (Jaynes, 2003), so we may repeatedly
select items with similar utilities that do not
require more labels. To rectify this shortcoming,
we replace BT with a Bayesian model that both
estimates the utility of a candidate and quantifies
the epistemic uncertainty in that estimate.
                                      BT      BT     GPPL    GPPL   GPPL   GPPL        GPPL
                                      random  UNC    random  UNPA   EIG    IMP         TP
Considers epistemic uncertainty       N       Y      N       Y      Y      Y           Y
Ignores aleatoric uncertainty         N       N      N       N      Y      Y           Y
Supports warm-start                   N       N      Y       Y      Y      Y           Y
Focus on finding best candidate       N       N      N       N      N      Y (greedy)  Y (balanced)

Table 1: Characteristics of active preference learning strategies. TP balances finding best candidate with exploration.
Gaussian Process Preference Learning. Because BT does not quantify epistemic uncertainty in the utilities, we turn to a Bayesian approach, GPPL. GPPL uses a GP to provide a nonlinear mapping from document feature vectors to utilities, and assumes a Thurstone–Mosteller model for the pairwise preference labels. Whereas BT simply estimates a scalar value of f_a for each candidate, a, GPPL outputs a posterior distribution over the utilities, f, of all candidate texts, x:

    p(f | φ(x), D) = N(f̂, C),    (3)

where f̂ is a vector of posterior mean utilities and C is the posterior covariance matrix of the utilities. The entries of f̂ are predictions of f_a for each candidate given D, and the diagonal entries of C represent posterior variance, which can be used to quantify uncertainty in the predictions. Thus, GPPL provides a way to separate candidates with uncertain utility from those with middling utility but many pairwise labels. In this paper, we infer the posterior distribution over the utilities using the scalable SVI method detailed by Simpson and Gurevych (2020).
4 Interactive Learning with GPPL
We now define four acquisition functions for
GPPL that take advantage of the posterior covari-
ance, C, to account for uncertainty in the utilities.
Table 1 summarises these acquisition functions.
Pairwise Uncertainty (UNPA). Here we propose a new adaptation of uncertainty sampling to pairwise labeling with the GPPL model. Rather than evaluating each candidate individually, as in UNC, we select the pair whose label is most uncertain. UNPA selects the pair with label probability p(y_{a,b}) closest to 0.5, where, for GPPL:

    p(y_{a,b}) = Φ( (f̂_a − f̂_b) / √(1 + v) ),    (4)
    v = C_{a,a} + C_{b,b} − 2 C_{a,b},            (5)

where Φ is the probit likelihood and f̂_a is the posterior mean utility for candidate a. Through C, this function accounts for correlations between candidates' utilities and epistemic uncertainty in the utilities. However, for two items with similar expected utilities, f̂_a and f̂_b, p(y_{a,b}) is close to 0.5, that is, it has high aleatoric uncertainty. Therefore, whereas UNPA will favor candidates with uncertain utilities, it may still waste labeling effort on pairs with similar utilities but low uncertainty.

Expected Information Gain (EIG). We now define a second acquisition function for active learning with GPPL, which has previously been adapted to GPPL by Houlsby et al. (2011) from an earlier information-theoretic strategy (MacKay, 1992). EIG greedily reduces the epistemic uncertainty in the GPPL model by choosing pairs that maximize information gain, which quantifies the information a pairwise label provides about the utilities, f. Unlike UNPA, this function avoids pairs that only have high aleatoric uncertainty.

The information gain for a pairwise label, y_{a,b}, is the reduction in entropy of the distribution over the utilities, f, given y_{a,b}. Houlsby et al. (2011) note that this can be more easily computed if it is reversed using a method known as Bayesian active learning by disagreement (BALD), which computes the reduction in entropy of the label's distribution given f. Because we do not know the value of f, we take the expected information gain I with respect to f:

    I(y_{a,b}, f; D) = H(y_{a,b} | D) − E_{f|D}[H(y_{a,b} | f)],    (6)

where H is Shannon entropy. Unlike the related pure exploration strategy (González et al., 2017), Equation 6 can be computed in closed form, so does not need expensive sampling.
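The sketch below illustrates how UNPA (Equations 4 and 5) and EIG (Equation 6) can be computed from the GPPL posterior mean and covariance. The expected-entropy term follows the Gaussian approximation given by Houlsby et al. (2011) for probit likelihoods; the function names and NumPy/SciPy usage are illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def pairwise_moments(f_hat, C, a, b):
    mean = f_hat[a] - f_hat[b]                       # posterior mean of f_a - f_b
    var = C[a, a] + C[b, b] - 2 * C[a, b]            # v in Eq. 5
    return mean, var

def unpa(f_hat, C, a, b):
    mean, var = pairwise_moments(f_hat, C, a, b)
    p = norm.cdf(mean / np.sqrt(1.0 + var))          # Eq. 4: p(y_ab)
    return -abs(p - 0.5)                             # closest to 0.5 = most uncertain

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def eig(f_hat, C, a, b):
    # Eq. 6: H(y|D) minus the approximate E_f[H(y|f)] term (BALD approximation)
    mean, var = pairwise_moments(f_hat, C, a, b)
    p = norm.cdf(mean / np.sqrt(1.0 + var))
    k = np.sqrt(np.pi * np.log(2.0) / 2.0)
    expected_h = (k / np.sqrt(var + k**2)) * np.exp(-mean**2 / (2 * (var + k**2)))
    return binary_entropy(p) - expected_h
```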
Expected Improvement (IMP). The previous
acquisition functions for AL are uncertainty-
based, and spread labeling effort across all items
whose utilities are uncertain. However, for tasks
such as summarization or cQA, the goal is to
find the best candidates. While it is important to
distinguish between good and optimal candidates,
it is wasted effort to compare candidates that we
are already confident are not the best, even if
their utilities are still uncertain. We propose to
address this using an acquisition function for BO
that estimates the expected improvement (Močkus,
1975) of a candidate, UN, over our current estimated
best solution, B, given current pairwise labels, D.
Here, we provide the first closed-form acquisition
function that uses expected improvement for
pairwise preference learning.
We define improvement as the quantity max{0, f_a − f_b}, where b is our current best item and a is our new candidate. Because the values of f_a and f_b are uncertain, we compute the expected improvement as follows. First, we estimate the posterior distribution over the candidates' utilities, N(f̂, C), then find the current best utility: f̂_b = max_i{f̂_i}. For any candidate a, the difference f_a − f_b is Gaussian-distributed as it is a sum of Gaussians. The probability that this is larger than zero is given by the cumulative density function, Φ(z), where z = (f̂_a − f̂_b)/√v. We use this to derive expected improvement, which results in the following closed-form equation:

    Imp(a; D) = √v · z · Φ(z) + √v · N(z; 0, 1).    (7)

This weights the probability of finding a better solution, Φ(z), by the amount of improvement, √v z. Both terms account for how close f̂_a is to f̂_b through z, as a larger distance causes z to be more negative, which decreases both the probability Φ(z) and the density N(z; 0, 1). Expected improvement also accounts for the uncertainty in both utilities through the posterior standard deviation, √v, which scales both terms. All candidates have positive expected improvement, as there is a non-zero probability that labeling them will lead to a new best item; otherwise, the current best candidate remains, and improvement is zero.
To select pairs of items,
the IMP strategy
greedily chooses the current best item and the item
with the greatest expected improvement. Through
the consideration of posterior uncertainty, IMP
trades off exploration of unknown candidates with
exploitation of promising candidates. In contrast,
uncertainty-based strategies are pure exploration.
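A minimal sketch of the IMP computation in Equation 7, again assuming the GPPL posterior mean f_hat and covariance C as NumPy arrays; the pair-selection helper mirrors the greedy pairing described above and is illustrative only.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(f_hat, C, a):
    best = int(np.argmax(f_hat))                     # current best item b
    if a == best:
        return 0.0
    mean = f_hat[a] - f_hat[best]
    var = C[a, a] + C[best, best] - 2 * C[a, best]   # v, as in Eq. 5
    z = mean / np.sqrt(var)
    return np.sqrt(var) * z * norm.cdf(z) + np.sqrt(var) * norm.pdf(z)  # Eq. 7

def imp_select_pair(f_hat, C):
    # IMP pairs the current best item with the candidate of greatest improvement
    best = int(np.argmax(f_hat))
    scores = [expected_improvement(f_hat, C, a) if a != best else -np.inf
              for a in range(len(f_hat))]
    return best, int(np.argmax(scores))
```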
Thompson Sampling with Pairwise Labels
(TP). Expected improvement is known to over-
exploit in some cases (Qin et al., 2017): It chooses
where to sample based on the current distribution,
so if this distribution underestimates the mean and
variance of the optimum, it may never be sampled.
To increase exploration, we propose a strategy
that uses Thompson sampling (Thompson,
1933). Unlike IMP, which is deterministic, TP
introduces random exploration through sampling.
TP is similar to dueling-Thompson sampling for
continuous input domains (González et al., 2017),
but uses an information gain step (described
below) and samples from a pool of discrete
candidates.
We select an item using Thompson sampling as follows: First draw a sample of candidate utilities from their posterior distribution, f_thom ∼ N(f̂, C), then choose the item b with the highest score in the sample. This sampling step depends on a Bayesian approach to provide a posterior distribution from which to sample. Sampling means that while candidates with high expected utilities have higher values of f_thom in most samples, other candidates may also have the highest score in some samples. As the number of samples → ∞, the number of times each candidate is chosen is proportional to the probability that it is the best candidate.
To create a pair of items for preference learning,
the TP strategy computes the expected information
gain for all pairs that include candidate b, and
chooses the pair with the maximum. This strategy
is less greedy than IMP as it allows more learning
about uncertain items through both the Thompson
sampling step and the information gain step.
However, compared to EIG, the first step focuses
effort more on items with potentially high scores.
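The TP strategy can be sketched as follows, reusing the eig function from the earlier sketch; the random-number handling and the function signature are illustrative assumptions.

```python
import numpy as np

def tp_select_pair(f_hat, C, eig_fn, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    f_sample = rng.multivariate_normal(f_hat, C)         # draw f_thom ~ N(f_hat, C)
    b = int(np.argmax(f_sample))                         # sampled best item
    gains = [eig_fn(f_hat, C, a, b) if a != b else -np.inf
             for a in range(len(f_hat))]                 # info gain of each pair (a, b)
    return int(np.argmax(gains)), b
```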
Using Priors to Address Cold Start. In
previous work on summarization (Gao et al.,
2018), the BT model was trained from a cold start,
namely, with no prior knowledge or pretraining.
Then, after active learning was complete, the
predictions from the trained model were combined
with prior predictions based on heuristics by
taking an average of the normalized scores from
both methods. We propose to use such prior
predictions to determine an informative prior for
GPPL, enabling the active learner to make more
informed choices of candidates to label at the start
of the active learning process, thereby alleviating
the cold-start problem.
We integrate pre-computed predictions as follows. Given a set of prior predictions, µ, from a heuristic or pre-trained model, we set the prior mean of the Gaussian process to µ before collecting any data, so that the candidate utilities have the prior p(f | φ(x)) = N(µ, K), where K is a hyper-parameter. Given this setup, AL can now take the prior predictions into account when choosing pairs of candidates for labeling.
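The warm-start idea can be illustrated with exact GP regression and a user-supplied prior mean (plain NumPy). This is only a sketch of the principle, assuming a simple RBF kernel; the paper's models use GPPL with SVI rather than exact regression.

```python
import numpy as np

def gp_posterior_mean(X, mu, X_obs, y_obs, mu_obs, kernel, noise=0.1):
    """Posterior mean at points X given noisy observations y_obs at X_obs;
    mu and mu_obs are the prior-mean predictions at X and X_obs respectively."""
    if len(X_obs) == 0:
        return mu                                   # cold start falls back to the prior
    K = kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = kernel(X, X_obs)
    # standard GP conditioning applied to the residual y - mu
    return mu + K_s @ np.linalg.solve(K, y_obs - mu_obs)

# illustrative RBF kernel over rows of 2D feature arrays
rbf = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
```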
5 Experiments
We perform experiments on three tasks to test our
interactive text ranking approach: (1) Community
question answering (cQA)–identify the best an-
swer to a given question from a pool of candidate
answers; (2) Rating extractive multidocument
summaries according to a user’s preferences; (3)
Generating an extractive multidocument summary
by training a reinforcement learner with the rank-
ing function from 2 as a reward function. Using
interactive learning to learn the reward function
rather than the policy reduces the number of user
interactions from many thousands to 100 or less.
These tasks involve highly specialized questions
or topics where generic models could be improved
with user feedback. For the first two tasks, we
simulate the interactive process in Algorithm 1.
The final task uses the results of this process.
Datasets. Both the cQA and multidocument
summarization datasets were chosen because the
answers and candidate summaries in these datasets
are multisentence documents that take longer for
users to read compared to tasks such as factoid
question-answering. We expect our methods to
have the greatest impact in this type of long-
answer scenario by minimizing user interaction
time.
For cQA, we use datasets consisting of
questions posted on StackExchange in the
communities Apple, Cooking, and Travel, along
with their accepted answers and candidate answers
taken from related questions (Rücklé et al., 2019).
Each accepted answer was marked by the user who
posted the question, so reflects that user’s own
opinion. Dataset statistics are shown in Table 2.

cQA topics       #questions   #accepted answers   #candidate answers
Apple            1,250        1,250               125,000
Cooking          792          792                 79,200
Travel           766          766                 76,600

Summarization    #topics      #model summaries    #docs
datasets
DUC 2001         30           90                  300
DUC 2002         59           177                 567
DUC 2004         50           150                 500

Table 2: Dataset statistics for summarization and cQA.
For summarization, we use the DUC datasets,2
which contain model summaries written by experts
for collections of documents related to a narrow
topic. Each topic has three model summaries, each
written by a different expert and therefore re-
flecting different opinions about what constitutes
a good summary. Compared with single-document summarization, the challenging multidocument case is an ideal testbed for interactive approaches,
because the diversity of themes within a collection
of documents makes it difficult to identify a single,
concise summary suitable for all users.
Priors and Input Vectors. We use our
interactive approach to improve a set of prior
predictions provided by a pretrained method. For
cQA, we first choose the previous state-of-the-art
for long answer selection, COALA (Rücklé et al., 2019), which estimates the relevance of answers to a question by extracting aspects (e.g., n-grams or
syntactic structures) from the question and answer
texts using CNNs, then matching and aggregating
the aspects. For each topic, we train an instance of
COALA on the training split given by Rücklé et al.
(2019), then run the interactive process on the test
set, that is, the dataset in Table 2, to simulate
a user interactively refining the answers selected
for their question. As inputs for the BT and GPPL
models, we use the COALA feature vectors: For
each question, COALA extracts aspects from
the question and its candidate answers; each
dimension of an answer’s 50-dimensional feature
vector encodes how well the answer covers one of
the extracted aspects.
2http://duc.nist.gov/.
Next we apply our interactive approach to refine
predictions from the current state of the art (Xu
et al., 2019), which we refer to as BERT-cQA. This
method places two dense layers with 100 E 10
hidden units on top of BERT (Devlin et al., 2019).
As inputs to BERT, we concatenate the question
and candidate answer and pad sequences to 512
tokens (4% of QA pairs are over-length and are
truncated). The whole model is fine-tuned on the
StackExchange training sets, the same as COALA.
In our simulations, we use the fine-tuned, final-
layer [CLS] embeddings with 768 dimensions as
inputs to BT and GPPL for each question-answer
pair.
As prior predictions for summary ratings we
first evaluate REAPER, a heuristic evaluation
function described by Ryang and Abekawa (2012).
We obtain bigram+ feature vectors for candi-
date summaries by augmenting bag-of-bigram
embeddings with additional features proposed
by Rioux et al. (2014). The first 200 dimen-
sions of the feature vector have binary values
to indicate the presence of each of the 200
most common bigrams in each topic after tok-
enizing, stemming and applying a stop-list. The
last 5 dimensions contain the following: the frac-
tion of the 200 most common bigrams that are
present in the document (coverage ratio); the frac-
tion of the 200 most common bigrams that occur
more than once in the document (redundancy
ratio); document length divided by 100 (length
ratio); the sum over all extracted sentences of the
reciprocal of the position of the extracted sentence
in its source document (extracted sentence posi-
tion feature); a single bit to indicate if document
length exceeds the length limit of 100 tokens. The
same features are used for both tasks: (2) learning
summary ratings and (3) reinforcement learning.
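A sketch of how such a bigram+ vector could be assembled is shown below; tokenization, stemming, the stop-list, and the selection of the 200 most common bigrams are assumed to be handled upstream, and the 0-indexed sentence positions are an assumption of this sketch.

```python
import numpy as np
from collections import Counter

def bigram_plus_features(summary_tokens, sentence_positions, top_bigrams,
                         length_limit=100):
    """summary_tokens: tokenized summary; sentence_positions: 0-indexed positions of
    the extracted sentences in their source documents; top_bigrams: the topic's 200
    most common bigrams."""
    bigrams = Counter(zip(summary_tokens, summary_tokens[1:]))
    binary = np.array([bigrams[bg] > 0 for bg in top_bigrams], dtype=float)
    extra = np.array([
        binary.mean(),                                          # coverage ratio
        np.mean([bigrams[bg] > 1 for bg in top_bigrams]),       # redundancy ratio
        len(summary_tokens) / 100.0,                            # length ratio
        sum(1.0 / (p + 1) for p in sentence_positions),         # sentence position feature
        float(len(summary_tokens) > length_limit),              # over-length indicator
    ])
    return np.concatenate([binary, extra])                      # 205-dimensional vector
```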
We also test prior predictions from a state-of-
the-art summary scoring method, SUPERT (Gao
et al., 2020), which uses a variant of BERT that
has been fine-tuned on news articles to obtain
1024-dimensional contextualized embeddings of a
summary. To score a summary, SUPERT extracts
a pseudo-reference summary from the source
documents, then compares its embedding with that
of the test summary. With the SUPERT priors we
compare bigram+ feature vectors and the SUPERT
embeddings as input to BT and GPPL for task (2).
Interactive Methods. As baselines, we test BT
as our preference learner with random selection
and the UNC active learning strategy, and GPPL
as the learner with random selection. We also
combine GPPL with the acquisition functions
described in Section 4, UNPA, EIG, IMP, and TP.
For random sampling, we repeat each experiment
ten times.
Simulated Users. In tasks (1) and (2), we simulate a user's preferences with a noisy oracle based on the user-response models of Viappiani and Boutilier (2010). Given gold standard scores for two documents, g_a and g_b, the noisy oracle prefers document a with probability p(y_{a,b} | g_a, g_b) = (1 + exp((g_b − g_a)/t))^{−1}, where t is a parameter that controls the noise level.

Both datasets contain model summaries or gold answers, but no gold standard scores. We therefore estimate gold scores by computing a ROUGE score of the candidate summary or answer, a, against the model summary or gold answer, m. For cQA, we take the ROUGE-L score as a gold score, as it is a well-established metric for evaluating question answering systems (e.g., Nguyen et al., 2016; Bauer et al., 2018; Indurthi et al., 2018) and set t = 0.3, which results in annotation accuracy of 83% (the fraction of times the pairwise label corresponds to the gold ranking).

For summarization, we use t = 1, which gives noisier annotations with 66% accuracy, reflecting the greater difficulty of choosing between two summaries. This corresponds to accuracies of annotators found by Gao et al. (2019) when comparing summaries from the same datasets. As gold for summarization, we combine ROUGE scores using the following formula, previously found to correlate well with human preferences on a comparable summarization task (P.V.S and Meyer, 2017):

    g_a ≈ R_comb = ROUGE1(a, m)/0.47 + ROUGE2(a, m)/0.22 + ROUGESU4(a, m)/0.18.    (8)

Following Gao et al. (2019), we normalize the gold scores g_a to the range [0, 10].
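The simulated labeling process can be summarized in a few lines; the ROUGE values are assumed to be precomputed, and the function names are illustrative.

```python
import numpy as np

def combined_gold_score(rouge1, rouge2, rouge_su4):
    # Eq. 8: combined ROUGE score used as the gold utility for summarization
    return rouge1 / 0.47 + rouge2 / 0.22 + rouge_su4 / 0.18

def noisy_preference(g_a, g_b, t=1.0, rng=None):
    # prefer a with probability (1 + exp((g_b - g_a) / t))^-1; t controls the noise
    rng = np.random.default_rng() if rng is None else rng
    p_prefer_a = 1.0 / (1.0 + np.exp((g_b - g_a) / t))
    return 1 if rng.random() < p_prefer_a else 0
```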
5.1 Warm-start Using Prior Information
We compare two approaches to integrate the prior
predictions of utilities computed before acquiring
user feedback. As a baseline, sum applies a
weighted mean to combine the prior predictions
with posterior predictions learned using GPPL or
BT. Based on preliminary experiments, we weight
the prior and posterior predictions equally. Prior sets the prior mean of GPPL to the value of the prior predictions, as described in Section 4. Our hypothesis is that prior will provide more information at the start of the interactive learning process and help the learner to select more informative pairs.

Table 3 presents results of a comparison on a subset of strategies, showing that prior results in higher performance in many cases. Based on the results of these experiments, we apply prior to all further uses of GPPL.

Accuracy for cQA with COALA priors, #interactions=10
Strategy   Prior   Apple   Cooking   Travel
random     sum     .245    .341      .393
random     prior   .352    .489      .556
UNPA       sum     .293    .451      .423
UNPA       prior   .290    .392      .476
IMP        sum     .373    .469      .466
IMP        prior   .615    .750      .784

NDCG@1% for summarization with REAPER priors, #interactions=20
Strategy   Prior   DUC'01   DUC'02   DUC'04
random     sum     .595     .623     .652
random     prior   .562     .590     .600
UNPA       sum     .590     .628     .650
UNPA       prior   .592     .635     .648
IMP        sum     .618     .648     .683
IMP        prior   .654     .694     .702

Table 3: The effect of integrating pre-computed predictions as Bayesian priors vs. taking a weighted mean of pre-computed and posterior predictions.
5.2 Community Question Answering
We hypothesize that the prior ranking given by
COALA can be improved by incorporating a small
amount of user feedback for each question. Our
interactive process aims to find the best answer to
a specific question, rather than learning a model
that transfers to new questions, hence preferences
are sampled for questions in the test splits.
To evaluate the top-ranked answers from each method, we compute accuracy as the fraction of top answers that match the gold answers. We also compare the five highest-ranked solutions to the gold answers using normalized discounted cumulative gain (NDCG@5) with ROUGE-L as the relevance score. NDCG@k evaluates the relevance of the top k ranked items, putting more weight on higher-ranked items (Järvelin and Kekäläinen, 2002).

                         Apple          Cooking        Travel
Learner    Strategy      acc    N5      acc    N5      acc    N5
COALA                    .318   .631    .478   .696    .533   .717
COALA prior, #interactions=10
BT         random        .272   .589    .368   .614    .410   .644
BT         UNC           .233   .573    .308   .597    .347   .619
GPPL       random        .352   .642    .489   .699    .556   .722
GPPL       UNPA          .290   .591    .392   .631    .476   .656
GPPL       EIG           .302   .628    .372   .671    .469   .692
GPPL       TP            .274   .592    .353   .636    .414   .675
GPPL       IMP           .615   .714    .750   .753    .784   .774
BERT-cQA                 .401   .580    .503   .625    .620   .689
BERT-cQA prior, #interactions=10
BT         random        .170   .626    .228   .637    .315   .676
BT         UNC           .129   .580    .181   .583    .326   .618
GPPL       random        .407   .593    .510   .594    .631   .657
GPPL       EIG           .080   .559    .140   .552    .095   .526
GPPL       IMP           .614   .715    .722   .731    .792   .744

Table 4: Interactive text ranking for cQA. acc=accuracy, N5=NDCG@5.
The results in the top half of Table 4 show that
with only 10 user interactions, most methods are
unable to improve performance over pre-trained
COALA. UNC, UNPA, EIG, and TP are out-
performed by random selection and IMP (p ≪ .01 using a two-tailed Wilcoxon signed-rank test).
To see whether the methods improve given
more feedback, Figure 3 plots NDCG@5 against
number of interactions. Whereas IMP perfor-
mance increases substantially, random selection
improves only very slowly. Early interactions
cause a performance drop with UNPA, EIG, and
TP. This is unlikely to be caused by noise in the
cQA data, because preference labels are gener-
ated using ROUGE-L scores computed against
the gold answer. The drop is because uncertainty-
based methods initially sample many low-quality
candidates with high uncertainty. This increases
the predicted utility of the preferred candidate in
each pair, sometimes exceeding better candidates
that were ranked higher by the prior, pushing
them out of the top five. Performance rises once
the uncertainty of mediocre candidates has been
reduced and stronger candidates are selected. Both
BT methods start from a worse initial position but
improve consistently, as their initial samples are
not biased by the prior predictions, although UNC
remains worse than random.
Figure 3: NDCG@5 with increasing interactions, COALA prior, mean across 3 cQA topics.

Figure 4: NDCG@5 with increasing number of interactions. BERT-cQA prior. Mean across 3 cQA topics.
The bottom half of Table 4 and Figure 4 show
results for key methods with BERT-cQA priors
and embeddings. The initial predictions by BERT-
cQA have higher accuracy than COALA but
lower NDCG@5. BERT-based models better
account for question and answer semantics, leading to higher accuracy, but place less emphasis on
lexical similarity, which reduces the ROUGE-L
scores of top-ranked answers and consequently,
NDCG@5. While IMP remains the most success-
ful method, the end result is not a clear improve-
ment over COALA, with a collapse in accuracy for
the uncertainty-based EIG and both BT methods.
As with COALA, these uncertainty-based meth-
ods focus initially on middling candidates, but due
to the sparsity of the data with high-dimensional
BERT-cQA embeddings, more samples are
required to reduce their uncertainty before these
methods start to sample strong candidates. The flexibility of the GP model means that it is particularly affected by data sparsity, hence the poor performance of EIG.
5.3 Interactive Summary Rating
We apply interactive learning to refine a ranking
over candidate summaries given prior information.
For each topic, we create 10,000 candidate
summaries with fewer than 100 words each,
which are constructed by uniformly selecting
sentences at random from the input documents.
To determine whether some strategies benefit
from more samples, we test each active learning
method with between 10 and 100 user interactions
with noisy simulated users. The method is fast
enough for interactive scenarios: on a standard
Intel desktop workstation with a quad-core CPU
and no GPU, updates to GPPL at each iteration
require around one second.
We evaluate the quality of the 100 highest-
ranked summaries using NDCG@1%, and compute the Pearson correlation, r, between
the predicted utilities for all candidates and
the combined ROUGE scores (Eq. (8)). Unlike
NDCG@1%, r does not focus on higher-ranked
candidates but considers the utilities for all
candidates. Hence we do not expect that IMP or
TP, which optimize the highest-ranked candidates,
will have the highest r.
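For reference, the two evaluation measures can be sketched as follows, with the combined ROUGE scores from Equation 8 serving as graded relevance; the linear-gain form of NDCG used here is an assumption, and the inputs are assumed to be NumPy arrays.

```python
import numpy as np
from scipy.stats import pearsonr

def ndcg_at_k(predicted_utilities, gold_scores, k):
    order = np.argsort(-predicted_utilities)[:k]          # top-k by predicted utility
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(gold_scores[order] * discounts)
    ideal = np.sum(np.sort(gold_scores)[::-1][:k] * discounts)
    return dcg / ideal

def evaluate(predicted_utilities, gold_scores):
    k = max(1, len(gold_scores) // 100)                   # top 1% of candidates
    return (ndcg_at_k(predicted_utilities, gold_scores, k),
            pearsonr(predicted_utilities, gold_scores)[0])
```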
With REAPER priors, bigram+ features, and 20 interactions, the top part of Table 5 shows a clear advantage to IMP in terms of NDCG@1%, which outperforms the previous state of the art, BT-UNC (significant with p ≪ .01 on all datasets). In terms of r, IMP is out-performed by TP (significant with p ≪ .01 on all datasets), which appears more balanced between finding the best candidate and learning the ratings for all candidates. UNPA improves slightly over random sampling for both metrics, while EIG is stronger due to a better focus on epistemic uncertainty. Unlike IMP, TP does not always outperform EIG on NDCG@1%.
Figure 5 shows the progress of each method with increasing numbers of interactions on DUC'01. The slow progress of the BT baselines is clear, illustrating the advantage the Bayesian methods have as a basis for active learning by incorporating uncertainty estimates and prior predictions.
                         DUC'01         DUC'02         DUC'04
Learner    Strategy      N1     r       N1     r       N1     r
REAPER                   .539   .262    .573   .278    .597   .322
REAPER prior, bigram+ features, #interactions=20
BT         rand.         .596   .335    .626   .358    .659   .408
BT         UNC           .609   .340    .641   .365    .674   .415
GPPL       rand.         .558   .248    .590   .266    .603   .289
GPPL       UNPA          .592   .307    .635   .370    .648   .397
GPPL       EIG           .634   .327    .665   .383    .675   .404
GPPL       TP            .629   .378    .665   .403    .690   .453
GPPL       IMP           .654   .303    .694   .345    .702   .364
SUPERT                   .602   .382    .624   .400    .657   .438
SUPERT prior, bigram+ features, #interactions=20
BT         rand.         .633   .415    .654   .438    .684   .483
BT         UNC           .550   .277    .561   .287    .588   .334
GPPL       rand.         .601   .351    .630   .377    .657   .419
GPPL       EIG           .633   .365    .662   .399    .671   .435
GPPL       TP            .649   .417    .668   .437    .698   .479
GPPL       IMP           .653   .322    .696   .374    .717   .407
SUPERT prior, SUPERT embeddings, #interactions=20
GPPL       IMP           .624   .297    .630   .284    .653   .339
SUPERT prior, bigram+ features, #interactions=100
GPPL       IMP           .668   .308    .788   .466    .815   .521
SUPERT prior, SUPERT embeddings, #interactions=100
BT         rand.         .661   .466    .696   .504    .727   .543
BT         UNC           .634   .420    .656   .453    .678   .495
GPPL       rand.         .594   .354    .617   .387    .643   .415
GPPL       EIG           .611   .372    .647   .415    .682   .471
GPPL       IMP           .728   .376    .752   .407    .769   .447

Table 5: Interactive Summary Rating. N1=NDCG@1%, r=Pearson's correlation coefficient.
Figure 5: DUC'01, REAPER prior, bigram+ features, changes in NDCG@1% with increasing interactions.
Figure 6: DUC'01, SUPERT prior, changes in NDCG@1% with increasing number of interactions.
The lower part of Table 5 and Figure 6
confirm the superior NDCG@1% scores of IMP
with the stronger SUPERT priors. However,
while pretrained SUPERT outperforms REAPER,
the results after 20 rounds of interaction with
bigram+ features are almost identical, suggesting
that user feedback helps mitigate the weaker
pretrained model. With only 20 interactions,
bigram+ features work better
than SUPERT
embeddings as input to our interactive learners,
even with the best-performing method, IMP,
since there are fewer features and the model
can cope better with limited labeled data. With
100 interactions, SUPERT embeddings provide
superior performance as there are sufficient labels
to leverage the richer input embeddings.
5.4 RL for Summarization
We now investigate whether our approach also
improves performance when the ranking function
is used to provide rewards for a reinforcement
learner. Our hypothesis is that it does not matter
whether the rewards assigned to bad candidates
are correct, as long as they are distinguished from
good candidates, as this will prevent the policy
from choosing bad candidates to present to the
user.
To test
the hypothesis, we simulate a flat-
bottomed reward function for summarization on
DUC’01: Primo, for each topic, we set the rewards
5.3) A
for the 10,000 sampled summaries (Vedere
the gold standard, Rcomb (Eq. (8), normalized to
[0, 10]). Then, we set the rewards for a varying
§
percentage of the lowest-ranked summaries to 1.0 (the flat bottom). We train the reinforcement learner on the flat-bottomed rewards and plot ROUGE scores for the proposed summaries in Figure 7. The performance of the learner actually increases as candidate values are flattened, until around 90% of the summaries have the same value. This supports our hypothesis that the user's labeling effort should be spent on the top candidates.

Figure 7: Performance of RL on DUC'01 when the rewards for the bottom x% summaries are flattened to one. Dashed line = ROUGE-2, solid line = ROUGE-L.
We now use the ranking functions learned in
the previous summary rating task as rewards for
reinforcement learning. As examples, we take
the rankers learned using SUPERT priors with
bigram+ features and 20 interactions and with
SUPERT embeddings and 100 interactions. We
replicate the RL setup of Gao et al. (2018) for
interactive multidocument summarization, which
previously achieved state-of-the-art performance
using the BT learner with UNC. The RL agent
models the summarization process as follows:
there is a current state, represented by the current
draft summary; the agent uses a policy to select
a sentence to be concatenated to the current
draft summary or to terminate the summary
construction. During the learning process, the agent receives a reward after terminating, which
it uses to update its policy to maximize these
rewards. The model is trained for 5,000 episodes
(i.e., generating 5,000 summaries and receiving
their rewards), then the policy is used to produce a
summary. We compare the produced summary to a
human-generated model summary using ROUGE.
By improving the reward function, we hypothesize
that the quality of the resulting summary will also
improve.
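The episodic setup described above can be sketched as follows; the policy interface (sample, update) is hypothetical, and the update is a schematic REINFORCE-style step, standing in for the setup of Gao et al. (2018) rather than reproducing it.

```python
def run_episode(policy, sentences, reward_fn, max_len=100):
    """Build a draft summary sentence by sentence; the learned ranker provides
    the only reward, given once the draft is terminated."""
    draft, log_probs = [], []
    while True:
        action, log_p = policy.sample(draft, sentences)   # pick a sentence index or "STOP"
        log_probs.append(log_p)
        if action == "STOP" or sum(len(s.split()) for s in draft) > max_len:
            break
        draft.append(sentences[action])
    return draft, log_probs, reward_fn(" ".join(draft))   # reward only at the end

def train(policy, sentences, reward_fn, episodes=5000, lr=0.01):
    for _ in range(episodes):
        draft, log_probs, reward = run_episode(policy, sentences, reward_fn)
        policy.update(log_probs, reward, lr)              # e.g., a REINFORCE-style update
    return policy
```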
Table 6 shows that the best-performing method from the previous tasks, IMP, again produces a strong improvement over the previous state of the art, BT with UNC (significant with p ≪ 0.01 in all cases), as well as GPPL with EIG. With 20 interactions and bigram+ features, EIG also outperforms BT-UNC, indicating the benefits of the Bayesian approach, but this is less clear with SUPERT embeddings, where the high-dimensional embedding space may lead to sparsity problems for the GP. The standard deviation in performance over multiple runs of RL is <0.004 for all metrics, datasets, and methods, suggesting that the advantage gained by using IMP is robust to randomness in the RL algorithm. The results confirm that gains in NDCG@1% made by BO over uncertainty-based strategies when learning the utilities translate to better summaries produced by reinforcement learning in a downstream task.

5.5 Limitations of User Simulations

By testing our interactive process with simulated users, we were able to compare numerous methods with a fixed labeling error rate. The user labels were simulated using data from real individuals: the gold answers for cQA were chosen by the user who posed the question, and the three model summaries for each topic in the DUC datasets were each authored by a different individual. While this work shows the promise of BO, further work is needed to test specific NLP applications with real end users. Our experiments illustrate plausible applications where users compare texts of up to 100 words and gain substantial performance advantages. Other applications require a broader study of reading and labeling time versus performance benefits and user satisfaction. It may also be possible to select chunks of longer documents for the user to compare, rather than reading whole documents.

Another dimension to consider is that real users may make systematic, rather than random, errors. However, in the applications we foresee, it is accepted that their preference labels will often diverge from any established gold standard, as users adapt models to their own information needs. Future work may therefore apply interactive approaches to more subjective NLP tasks, such as adapting a summary to more personal information needs.
#inter-                                   DUC'01                   DUC'02                   DUC'04
actions  Learner  Features  Strategy      R1   R2   RL   RSU4      R1   R2   RL   RSU4      R1   R2   RL   RSU4
0        SUPERT   N/A       none          .324 .061 .252 .097      .345 .070 .270 .107      .375 .086 .293 .128
20       BT       bigram+   UNC           .335 .072 .265 .104      .364 .086 .286 .120      .390 .101 .307 .136
20       GPPL     bigram+   rand.         .324 .064 .252 .097      .358 .081 .281 .115      .383 .095 .302 .131
20       GPPL     bigram+   EIG           .346 .073 .269 .110      .377 .095 .295 .126      .394 .106 .310 .137
20       GPPL     bigram+   IMP           .355 .086 .277 .114      .385 .103 .300 .130      .419 .122 .331 .154
100      BT       SUPERT    UNC           .337 .072 .264 .104      .366 .086 .284 .118      .377 .090 .297 .128
100      GPPL     SUPERT    rand.         .317 .057 .247 .092      .344 .071 .270 .107      .372 .087 .292 .124
100      GPPL     SUPERT    EIG           .331 .070 .259 .101      .367 .088 .287 .120      .394 .103 .309 .136
100      GPPL     SUPERT    IMP           .370 .100 .293 .123      .406 .118 .316 .140      .422 .130 .337 .155

Table 6: RL for summarization: ROUGE scores of final summaries, mean over 10 repeats with different random seeds. Once the rewards are fixed, the performance of RL is stable: the standard deviation of each result is < 0.004.
6 Conclusions
We proposed a novel approach to interactive
text
ranking that uses Bayesian optimization
(BO) to identify top-ranked texts by acquiring
pairwise feedback from a user and applying
Gaussian process preference learning (GPPL).
Our experiments showed that BO significantly
improves the accuracy of answers chosen in a cQA
task with small amounts of feedback, and leads
to summaries that better match human-generated
model summaries when used to learn a reward
function for reinforcement learning.
Of two proposed Bayesian optimization strat-
egies, we found that expected improvement (IMP)
outperforms Thompson sampling (TP) if the goal
is to optimize the proposed best solution. TP may
require a larger number of interactions due to its
random sampling step. IMP is effective in both
cQA and summarization tasks, but has the stron-
gest impact on cQA with only 10 interactions. This
may be due to the greater sparsity of candidates
in cQA (100 versus 10,000 for summarization),
which allows them to be more easily discriminated
by the model, given good training examples.
Further evaluation with real users is required
to gauge the quantity of feedback needed in a
particular domain.
When using high-dimensional BERT embed-
dings as inputs, GPPL requires more labels to
achieve substantial improvements. Future work
may therefore investigate recent dimensionality
reduction methods (Raunak et al., 2019). We
found that performance improves when includ-
ing prior predictions as the GPPL prior mean but
it is unclear how best to estimate confidence in
the prior predictions—here we assume equal con-
fidence in all prior predictions. Future work could
address this by adapting the GPPL prior covari-
ance matrix to kick-start BO. The method is also
currently limited to a single set of prior predic-
tions: In future we intend to integrate predictions
from several models.
Acknowledgments
This work was supported by the German Research
Foundation (DFG) as part of the EVIDENCE
project (grant GU 798/27-1) and the PEER project
(grant GU 798/28-1). We would like to thank the
reviewers and journal editors for their extremely
helpful feedback.
References
Jason Baldridge and Miles Osborne. 2004. Active
learning and the total cost of annotation. In
Proceedings of the 2004 Conference on Empi-
rical Methods in Natural Language Processing,
pages 9–16, Barcelona, Spain. Association for
Computational Linguistics.
Lisa Bauer, Yicheng Wang, and Mohit Bansal.
2018. Commonsense for generative multi-hop
question answering tasks. In Proceedings of the
2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 4220–4230,
Brussels, Belgium. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/D18-1454
Daniel Beck, Trevor Cohn, and Lucia Specia.
2014. Joint emotion analysis via multi-task
Gaussian processes. In Proceedings of the 2014
Conference on Empirical Methods in Natural
Language Processing, pages 1798–1803. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.3115/v1/D14-1190
Ralph Allan Bradley and Milton E. Terry. 1952.
Rank analysis of incomplete block designs: I.
The method of paired comparisons. Biometrika,
39(3/4):324–345. DOI: https://doi.org
/10.1093/biomet/39.3-4.324
Eric Brochu, Vlad M. Cora, and Nando De Freitas.
2010. A tutorial on Bayesian optimization of
expensive cost functions, with application to
active user modeling and hierarchical reinforce-
ment learning. arXiv preprint arXiv:1012.2599.
Eric Brochu, Nando de Freitas, and Abhijeet
Ghosh. 2008. Active preference learning with
discrete choice data. In Advances in Neural
Information Processing Systems, pages 409–416.
Wei Chu and Zoubin Ghahramani. 2005. Prefer-
ence learning with Gaussian processes.
In
Proceedings of the 22nd International Con-
ference on Machine Learning, pages 137–144.
ACM. DOI: https://doi.org/10.1145
/1102351.1102369
Trevor Cohn and Lucia Specia. 2013. Modelling
annotator bias with multi-task Gaussian pro-
cesses: An application to machine translation
quality estimation. In Proceedings of the 51st
Annual Meeting of the Association for Com-
putational Linguistics, volume 1, pages 32–42.
Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. 2019. BERT: Pre-
training of deep bidirectional transformers for
language understanding. In Proceedings of
the 2019 Conference of the North American
Chapter of the Association for Computatio-
nal Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers),
pages 4171–4186, Minneapolis, Minnesota.
Association for Computational Linguistics.
Terry N. Flynn and A. A. J. Marley. 2014. Best–
worst scaling: Theory and methods. In Stephane
Hess and Andrew Daly, editors, Handbook of
Choice Modelling, pages 178–201. Edward
Elgar Publishing, Cheltenham, UK. DOI:
https://doi.org/10.4337/9781781003152
.00014
Yang Gao, Christian M. Meyer, and Iryna
Gurevych. 2018. APRIL: Interactively learning
to summarise by combining active prefer-
ence learning and reinforcement learning. In
Proceedings of the 2018 Conference on Empir-
ical Methods in Natural Language Processing,
pages 4120–4130, Brussels, Belgium. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/D18-1445
Yang Gao, Christian M. Meyer, and Iryna
Gurevych. 2019. Preference-based interactive
multi-document summarisation. Information
Retrieval Journal, pages 1–31.
Yang Gao, Wei Zhao, and Steffen Eger. 2020.
SUPERT: Towards new frontiers in unsuper-
vised evaluation metrics for multi-document
summarization. In Proceedings of
the 58th
Annual Meeting of the Association for Comput-
ational Linguistics, pages 1347–1354, Online.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/2020.acl-main.124
Javier González, Zhenwen Dai, Andreas Dami-
anou, and Neil D. Lawrence. 2017. Preferential
Bayesian optimization. In Proceedings of the
34th International Conference on Machine
Learning, pages 1282–1291.
Matthew D. Hoffman, David M. Blei, Chong
Wang, and John William Paisley. 2013.
Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.
Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. 2011. Bayesian active learn-
ing for classification and preference learning.
arXiv preprint arXiv:1112.5745.
Sathish Reddy Indurthi, Seunghak Yu, Seohyun Back, and Heriberto Cuayáhuitl. 2018. Cut to the chase: A context zoom-in network for reading comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 570–575, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1054
Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446. DOI: https://doi.org/10.1145/582415.582418
Edwin T. Jaynes. 2003. Probability Theory:
The Logic of Science, Cambridge University
Press. DOI: https://doi.org/10.1017
/CBO9780511790423
Thorsten Joachims. 2002. Optimizing search
engines using clickthrough data. In Proceed-
ings of the Eighth ACM SIGKDD Internatio-
nal Conference on Knowledge Discovery and
Data Mining, pages 133–142. ACM. DOI:
https://doi.org/10.1145/775047.775067
Maurice George Kendall. 1948. Rank Correlation
Methods. Griffin, Oxford, UK.
David C. Kingsley and Thomas C. Brown. 2010.
Preference uncertainty, preference refinement
and paired comparison experiments. Land Eco-
nomics, 86(3):530–544. DOI: https://doi
.org/10.3368/le.86.3.530
Carolin Lawrence and Stefan Riezler. 2018.
Improving a neural semantic parser by counter-
factual learning from human bandit feedback.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 1820–1830,
Melbourne, Australia. Association for Com-
putational Linguistics. DOI: https://doi
.org/10.18653/v1/P18-1169
Xiao Lin and Devi Parikh. 2017. Active learning
for visual question answering: An empirical
study. arXiv preprint arXiv:1711.01732.
Yandong Liu and Eugene Agichtein. 2008. You’ve
got answers: Towards personalized models
for predicting success in community question
answering. In Proceedings of the 46th Annual
Meeting of the Association for Computational
Linguistics on Human Language Technologies:
Short Papers, pages 97–100. Association for
Computational Linguistics.
Manuel Lopes, Francisco Melo,
and Luis
Montesano. 2009. Active learning for reward
estimation in inverse reinforcement learning.
In Joint European Conference on Machine
Learning and Knowledge Discovery in Data-
bases, pages 31–46. Springer. DOI: https://
doi.org/10.1007/978-3-642-04174-7_3
Manuel J. Maña López, Manuel de Buenaga Rodríguez, and José María Gómez Hidalgo. 1999. Using and evaluating user directed summaries to improve information access. In International Conference on Theory and Practice of Digital Libraries, pages 198–214. Springer. DOI: https://doi.org/10.1007/3-540-48155-9_14
R. Duncan Luce. 1959. On the possible psycho-
physical laws. Psychological Review, 66(2):
81–95. DOI: https://doi.org/10.1037
/h0043178
David J. C. MacKay. 1992. Information-based
objective functions
for active data selec-
tion. Neural Computation, 4(4):590–604. DOI:
https://doi.org/10.1162/neco.1992
.4.4.590
Jonas Močkus. 1975. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer. DOI: https://doi.org/10.1007/978-3-662-38527-2_55
Frederick Mosteller. 1951. Remarks on the method
of paired comparisons: I. The least squares
solution assuming equal standard deviations
and equal correlations. Psychometrika, 16(1):
3–9. DOI: https://doi.org/10.1007
/BF02313422
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng
Gao, Saurabh Tiwary, Rangan Majumder, and
Li Deng. 2016. MS MARCO: A Human
Generated MAchine Reading COmprehension
Dataset. arXiv preprint arXiv:1611.09268.
Álvaro Peris and Francisco Casacuberta. 2018.
Active learning for interactive neural machine
translation of data streams. In Proceedings of
the 22nd Conference on Computational Natural
Language Learning, pages 151–160. Associa-
tion for Computational Linguistics, Brussels,
Belgium. DOI: https://doi.org/10
.18653/v1/K18-1015
R. L. Plackett. 1975. The analysis of permutations.
Journal of the Royal Statistical Society, Series C
(Applied Statistics), 24(2):193–202. DOI:
https://doi.org/10.2307/2346567
Avinesh P.V.S and Christian M. Meyer. 2017.
Joint optimization of user-desired content in
multi-document summaries by learning from
user feedback. In Proceedings of
the 55th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 1353–1363, Vancouver, Canada. Assoc-
iation for Computational Linguistics.
Chao Qin, Diego Klabjan, and Daniel Russo.
2017. Improving the expected improvement
algorithm. In Advances in Neural Information
Processing Systems, pages 5381–5391.
Vikas Raunak, Vivek Gupta, and Florian Metze. 2019. Effective dimensionality reduction for word embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 235–243, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W19-4328
Cody Rioux, Sadid A. Hasan, and Yllias Chali.
2014. Fear the REAPER: A system for auto-
matic multi-document summarization with rein-
forcement learning. In Proceedings of the 2014
Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 681–690,
Doha, Qatar. Association for Computational
Linguistics. DOI: https://doi.org/10.3115
/v1/D14-1075
Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 372–382, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1038
Seonggi Ryang and Takeshi Abekawa. 2012.
Framework of automatic text summarization
using reinforcement learning. In Proceedings
of the 2012 Joint Conference on Empirical
Methods in Natural Language Processing and
Computational Natural Language Learning,
pages 256–265, Jeju Island, Korea. Association
for Computational Linguistics.
Andreas Rücklé, Nafise Sadat Moosavi, and
Iryna Gurevych. 2019. COALA: A neural
coverage-based approach for
long answer
selection with small data. In Proceedings of the
Thirty-Third AAAI Conference on Artificial
Intelligence. AAAI. DOI: https://doi
.org/10.1609/aaai.v33i01.33016932
Robin Senge, Stefan Bösner, Krzysztof Dembczyński, Jörg Haasenritter, Oliver Hirsch, Norbert Donner-Banzhoff, and Eyke Hüllermeier. 2014. Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255:16–29. DOI: https://doi.org/10.1016/j.ins.2013.07.030
Burr Settles. 2012. Active learning. Synthesis
Lectures on Artificial Intelligence and Machine
Learning, 6(1):1–114. DOI: https://doi
.org/10.2200/S00429ED1V01Y201207AIM018
Aditya Siddhant and Zachary C. Lipton. 2018. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2904–2909, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1318
Edwin Simpson and Iryna Gurevych. 2018.
Finding convincing arguments using scalable
Bayesian preference learning. Transactions of
the Association for Computational Linguis-
tics, 6:357–371. DOI: https://doi.org
/10.1162/tacl_a_00026
Edwin Simpson and Iryna Gurevych. 2020. Scal-
able Bayesian preference learning for crowds.
Machine Learning, 109(4):689–718. DOI:
https://doi.org/10.1007/s10994
-019-05867-2
Avi Singh, Larry Yang, Kristian Hartikainen,
Chelsea Finn, and Sergey Levine. 2019.
End-to-end robotic reinforcement
learning
without reward engineering. arXiv preprint
arXiv:1904.07854. DOI: https://doi.org
/10.15607/RSS.2019.XV.073
Jasper Snoek, Hugo Larochelle, and Ryan P.
Adams. 2012. Practical Bayesian optimization
of machine learning algorithms. In Advances
in Neural Information Processing Systems,
pages 2951–2959.
Artem Sokolov, Julia Kreutzer, Stefan Riezler,
and Christopher Lo. 2016. Stochastic structured
prediction under bandit feedback. In Advances
in Neural Information Processing Systems,
pages 1489–1497.
William R. Thompson. 1933. On the likelihood
that one unknown probability exceeds another
in view of the evidence of two samples. Bio-
metrika, 25(3/4):285–294. DOI: https://
doi.org/10.1093/biomet/25.3-4.285
Louis L. Thurstone. 1927. A law of compara-
tive judgment. Psychological Review, 34(4):
273–286. DOI: https://doi.org/10
.1037/h0070288
Paolo Viappiani and Craig Boutilier. 2010. Optimal Bayesian recommendation sets and myopically optimal choice query sets. In Advances in Neural Information Processing Systems, pages 2352–2360.
Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. 2017. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990.
Peng Xu, Xiaofei Ma, Ramesh Nallapati, and
Bing Xiang. 2019. Passage ranking with weak
supervision. arXiv preprint arXiv:1905.05910.
Jie Yang and Diego Klabjan. 2018. Bayesian
active learning for choice models with deep
Gaussian processes. arXiv preprint arXiv:1805.01867.
Yi-Hsuan Yang and Homer H. Chen. 2011.
Ranking-based emotion recognition for music
organization and retrieval. IEEE Transactions
on Audio, Speech, and Language Processing,
19(4):762–774. DOI: https://doi.org
/10.1109/TASL.2010.2064164