Measuring Online Debaters’ Persuasive Skill from Text over Time
Kelvin Luu1 Chenhao Tan2 Noah A. Smith1,3
1Paul G. Allen School of Computer Science & Engineering, University of Washington
2Department of Computer Science, University of Colorado Boulder
3Allen Institute for Artificial Intelligence
{kellu,nasmith}@cs.washington.edu chenhao.tan@colorado.edu
Abstract
Online debates allow people to express their
persuasive abilities and provide exciting oppor-
tunities for understanding persuasion. Prior
studies have focused on studying persuasion
in debate content, but without accounting
for each debater’s history or exploring the
progression of a debater’s persuasive ability.
We study debater skill by modeling how
participants progress over time in a collection
of debates from Debate.org. We build on
a widely used model of skill in two-player
games and augment it with linguistic features
of a debater’s content. We show that online
debaters’ skill levels do tend to improve
over time. Incorporating linguistic profiles
leads to more robust skill estimation than
winning records alone. Notably, we find that
an interaction feature combining uncertainty
cues (hedging) with terms strongly associated
with either side of a particular debate (fightin’
words) is more predictive than either feature
on its own, indicating the importance of fine-
grained linguistic features.
1 Introduction
Persuasion is an important skill with prevalent
use. Nearly every kind of social encounter, from
formal political debates to casual conversations,
can include attempts to convince others. What
linguistic phenomena are associated with higher
levels of persuasive skill? How do people develop
this skill over time? Online debate communities
offer an opportunity to investigate these questions.
These communities feature users who participate
in multiple debates over their period of engage-
ment. Such debates involve two parties who
willingly and formally present divergent opinions
before an audience. Unlike other media of
persuasion, such as letters to politicians, there is a
clear signal, a win or loss, indicating whether or
not a debater was successful against the adversary.
This work aims to quantify the skill level of
each debater in an online community and also
investigates what factors contribute to expertise.
Although persuasion has generated interest
in the natural language processing community,
most researchers have not tried to quantify the
persuasiveness of a particular speaker. Instead,
they estimate how persuasive a text is, using
linguistic features such as the author’s choice
of wording or how they interact with the audience
(Tan et al., 2014, 2016; Althoff et al., 2014;
Danescu-Niculescu-Mizil et al., 2012). Previous
research has also established that debaters’ content
and interactions both contribute to the success of
the persuader (Tan et al., 2016; Zhang et al.,
2016; Wang et al., 2017). These works have
found textual factors that contribute to a debater’s
success, but they do not emphasize the role of the
individual debater.
There has been some recent work on studying
individual debaters. Durmus and Cardie (2019)
analyze users and find that a user’s success
and improvement depend on their social network
features. We take another approach by estimating
each user’s skill level in each debate they par-
ticipate in by considering their debate history.
These estimates reveal features correlated with
skill and the importance of particular debates over
time. Our study is based on debates from an
online debate forum, Debate.org, introduced
by Durmus and Cardie (2018) and discussed in §2.
This Web site is composed of primarily text-
based debates and attracts a large number of users
to debate regularly.
Our model of skill builds on the Elo (1978)
rating system, designed for rating players in
two-player games (§3). Our preliminary analysis
using Elo scores suggests that user skill is not
static; debaters in the Debate.org forum tend
Transactions of the Association for Computational Linguistics, vol. 7, pp. 537–550, 2019. https://doi.org/10.1162/tacl_a_00281
Action Editor: Jordan Boyd-Graber. Submission batch: 11/2018; Revision batch: 4/2019; Published 9/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
to improve with practice. We extend Elo for
the debate setting using linguistic features. We
decompose our family of models into two design
choices that align with the questions we hope to
answer: 1) the features we use and 2) how we
might choose to aggregate those features from
past debates.
To validate our skill estimates, we introduce a
forecasting task (§4). Previous work predicted
the winner of a debate using the text of the
current debate. In contrast, we aim to predict
winners using our skill estimates before the
debate (ignoring the current debate’s content).
This design ensures that we are modeling skill of
the debater, as inferred from past performance,
not the idiosyncrasies of a particular debate.
We also investigate the predictive power of
our estimates through an analysis of the results
(§6). We show that our full model outperforms
the baseline Elo model, approaching the accuracy
of an oracle that does use the text of the current
debate. Our ablation studies reveal that the co-
occurrence of phrases that indicate uncertainty
or doubt (hedges) with words that are strongly
associated with one debater or the other (fightin’
words) is an effective predictor. Moreover, we
find that not all past debates are equally useful
for prediction: more recent debates are more
indicative of the user’s current level of expertise.
This adds support to our conjecture that individual
debaters tend to improve through the course of
their time debating.
Finally, we track the linguistic tendencies of
each debater over the course of their debating
history. We show that several features such as
the length of their turns and the co-occurrence of
hedges and fightin’ words increase over time for
the best debaters, but stay static for those with less
skill. These findings give further evidence that
debaters improve over time.
2 Data
We use both debate and user skill data from the
Debate.org data set introduced by Durmus and
Cardie (2018). Any registered user on the Web site
can initiate debates with others or vote on debates
conducted by others.
2.1 Mechanism of Debate.org
Registered users can create a debate under a topic
of their choosing. The person initiating the debate,
called the instigator, fixes the debate’s number
of rounds (2–10) and chooses the category (e.g.,
politics, economics, or music) at the start of the
debate. The instigator then presents an opening
statement in the first round and waits for another
user, the contender, to accept the debate and
write another opening statement to complete the
first round.1 We define a debater’s role as being
either the instigator or contender.
Figure 1: An example of the Debate.org voting
system.
To determine the debate winner, other
Debate.org users vote after the debate ends.
In this phase, voters mark who they thought
performed better in each of seven categories (see
Figure 1).2 This phase can last between 3 days and
6 months, depending on the instigator’s choice at
the debate’s creation. After this period, the debater
who received the most points wins the debate. We
record in our data set the textual information,
participants, and voting records for each debate.
2.2 Definition of Winning
The Debate.org voting system lets us define
a ‘‘win’’ in various ways (see Figure 1). In this
study, we would like to model how convincing
each debater is. One approach might be based on
Oxford-style debates like the IQ2 data set from
Zhang et al. (2016), in which scores are based on
the number of audience members who changed
their minds as a result of the debate. We found
here that the majority of voters do not deviate
from their stances before the debate, and most
voters tend to vote for the debater they already agree
with. Therefore, we count the number of times
each debater was rated as more convincing to a
voter despite presenting a viewpoint that the voter
disagreed with before the debate. The debater with
the higher count of such votes is considered the
‘‘winner’’ in the remainder of this paper.
1Although many debaters use their first round to make an
opening statement, some use it only to propose and accept
debates. If the first turn in an n-round debate is under 250
words, we merge each debater’s first two turns and treat the
debate as an (n − 1)-round debate.
2Another system of voting lets voters choose who they
thought performed better over the entire debate. Although
this appears in the data set, we do not use it in this paper.

            Completed Debates   Completed & Convincing   Full Filtered
#Users      42,424              21,753                   1,284
#Debates    77,595              29,209                   4,486

Table 1: Description of the Debate.org data
set from Durmus and Cardie (2018) and the
filtered data sets. Full Filtered is a subset of
Completed & Convincing that requires that
participants of each debate have engaged in
five or more debates. We use Full Filtered for
the remainder of our analysis.
Figure 2: Complementary cumulative distribution
functions (1 − CDF) for the total number of debates a
user finished. The blue line tracks only debaters who
engaged in and successfully concluded at least one
debate, and the red line tracks users who have finished
at least five debates in the filtered data set. The right
plot similarly shows the complementary cumulative
distribution for the number of votes given per debate
for all debates and the filtered data set.
2.3 Data Set Statistics
From an unfiltered set of 77,595 debates, we
remove all debates where a user forfeits, that lack
a winner (§2.2), or that do not have a typical
voting style with seven categories, leaving 29,209
completed debates.
We record the number of debates that each user
completes (see Figure 2, left). We find that the
quantity of debates per user follows a heavy-tailed
distribution where most users do not participate
in more than one debate. For the remainder of the
work, we focus on the 1,284 users (out of 42,424
total) who have completed five or more of the
debates described above. This leaves us with 4,486
debates where the participants have completed at
least five debates (see Table 1).
We also record the number of votes that each
debate attracted (see Figure 2, right). This number
also follows a heavy-tailed distribution.
3 Expertise Estimation
In order to explore debater expertise and discover
what contributes to a user’s expertise over their
time on Debate.org, we begin with a conventional
approach to skill estimation, the Elo rating
system, which serves as both a baseline and the
basis for our final model. In our initial data
analysis, we use Elo scores to define an upset
as a debate where a weaker debater wins over
a stronger one. By measuring the rate at which
upsets occur over a debater’s lifetime, we make an
initial observation that debaters seem to improve
with experience.
3.1 Elo Model
Elo originated as a ranking system for chess
players; it has been adapted to other domains, such
as video games. It is one of the standard methods
to rate players of a two-player, zero-sum game
(Elo, 1978).3 Elo assigns positive integer-valued
scores, typically below 3,000, with higher values
interpreted as ‘‘more skill.’’ The difference in
the scores between two debaters under a logistic
model is used as an estimate of the probability
each debater will win. For example, consider a
debate between A and B. A has an Elo rating
of RA = 1900, and B has an Elo rating of
RB = 2000. Using the Elo rating system, pA,
the probability that A wins, is4
$$p_A = \frac{1}{1 + 10^{0.0025 (R_B - R_A)}} = \frac{1}{1 + 10^{0.25}} \approx 0.36 \tag{1}$$
3The Elo model is a special case of the Bradley-Terry
model (Bradley and Terry, 1952).
4The base of the exponent, 10, and the multiplicative
factor on the difference RA − RB, 0.0025, are typically used
in chess.
Ratings are updated after every debate, with
the winner W (equal to A or B) gaining (and the
loser losing) Δ = 32(1 − pW ) points. (32 is an
arbitrary scalar; we follow non-master chess in
selecting this value.)
Note that the magnitude of the change
corresponds to how unlikely the outcome was.
Although the Elo ratings traditionally take only a
win or loss as input, there have been adjustments
to account for the magnitude of victory. One
such method would be to use the score difference
between the two players to adjust the Elo gain
(Silver, 2015). If we let SA and SB be scores for
A and B, respectively, the modified gain Δ′ is
$$\Delta' = \log(|S_A - S_B| + 1) \times \Delta \tag{2}$$
Under this model, we represent a user’s history
and skill level as a single scalar, that is, their Elo
rating. The Elo system ignores all other features,
which include the style a debater uses in the
debates and the content of their argument. We
therefore view this model as a baseline and extend
it.
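As an illustration of Equations 1 and 2, the following is a minimal sketch of the Elo win probability and rating update. The function names are hypothetical, and the natural logarithm in the margin-of-victory variant is an assumption, since the text does not state the base.

```python
import math

ELO_SCALE = 0.0025  # multiplicative factor on the rating difference (Eq. 1)
K_FACTOR = 32       # update scalar used in the text


def win_probability(r_a: float, r_b: float) -> float:
    """Probability that A beats B under Eq. 1."""
    return 1.0 / (1.0 + 10 ** (ELO_SCALE * (r_b - r_a)))


def elo_update(r_winner: float, r_loser: float, score_diff=None):
    """Return the updated (winner, loser) ratings.

    The winner gains delta = 32 * (1 - p_W) and the loser loses the same
    amount. If score_diff is given, delta is scaled by
    log(|S_A - S_B| + 1) as in Eq. 2 (natural log assumed here).
    """
    p_w = win_probability(r_winner, r_loser)
    delta = K_FACTOR * (1.0 - p_w)
    if score_diff is not None:
        delta *= math.log(abs(score_diff) + 1)
    return r_winner + delta, r_loser - delta


# Worked example from the text: R_A = 1900, R_B = 2000.
print(round(win_probability(1900, 2000), 2))  # 0.36
```

Because the gain and the loss are equal, the sum of the two ratings is unchanged by an update, and an unexpected win (small p_W) moves both ratings further than an expected one.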
3.2 Do Debaters Get Better Over Time?
Our initial data analysis uses the Elo model to
investigate whether users improve over time at all
in the first place. An increase in Elo score can
reflect either the rating system merely becoming more
accurate with another sample or an actual increase
in the player’s skill. We wish to show that there
is no fixed rating for a user, but rather that the
rating is a moving target. A counterhypothesis
is then that the users’ skill levels do not change
despite their activity on Debate.org. If true,
then we would expect their final Elo rating to be
also indicative of their skill level at the beginning
of their Debate.org activity.
To test this hypothesis, we first define an upset
as a debate where pW < τ , that is, where the
winner of the debate was estimated (under Elo)
to win with low probability (set using threshold
τ ). If users tend to have static skill levels, then
participating in many debates does not affect
debaters’ skill, but simply provides more samples
for measuring their skill. It follows that we would
expect upsets to occur at the same rate early and
late in each debater’s career. In this analysis, we
calculate pA and pB for
each debate using A’s and B’s final Elo ratings,
which we take to be the most accurate estimate of
their (presumed static) skill levels.5
5In a forecasting analysis like the one in §4, this would
be inappropriate, as it uses ‘‘future’’ information to define
upsets. This notion of upset is not used in the forecasting
task.
To operationalize ‘‘early’’ and ‘‘late,’’ we
divide a user’s debates into quintiles by time,
comparing the upset rate in different quintiles. In
this analysis, we only consider users with at least
ten debates (N = 4,420).6
Figure 3: Upset rates (aggregated across users) across
history quintiles, τ = 0.45. The error bars represent
the 95% confidence intervals.
We see a downward trend in the upset rate
(Figure 3). In particular, the first and last quintiles
show a statistically significant difference under a
paired t-test (p < .001), meaning that a user’s final
Elo score is not a good measure of skill at earlier
times. We take this finding as suggestive that
users of Debate.org adapt as they participate
in more debates.
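The upset analysis above can be sketched as follows. This is a minimal illustration under assumptions: the record format (a per-user chronological list of opponent and outcome pairs) and the function names are hypothetical, and a debate is counted as an upset whenever its winner had probability below τ under the final Elo ratings.

```python
def win_probability(r_a, r_b, scale=0.0025):
    """Elo win probability of A over B (Eq. 1)."""
    return 1.0 / (1.0 + 10 ** (scale * (r_b - r_a)))


def quintile_upset_rates(histories, final_elo, tau=0.45, min_debates=10):
    """Upset rate per history quintile, aggregated across users.

    histories: dict mapping user -> chronological list of (opponent, user_won).
    final_elo: dict mapping user -> final Elo rating (presumed static skill).
    """
    upsets = [0] * 5
    totals = [0] * 5
    for user, debates in histories.items():
        n = len(debates)
        if n < min_debates:
            continue  # only users with enough debates are considered
        for i, (opponent, user_won) in enumerate(debates):
            p_user = win_probability(final_elo[user], final_elo[opponent])
            p_winner = p_user if user_won else 1.0 - p_user
            quintile = i * 5 // n  # 0 = earliest fifth, 4 = latest fifth
            totals[quintile] += 1
            upsets[quintile] += int(p_winner < tau)
    return [u / t if t else float("nan") for u, t in zip(upsets, totals)]
```

Under the static-skill hypothesis, the five returned rates should be roughly equal; a downward trend across quintiles is what the text reports.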
4 Predicting Expertise using Earlier Debates
Our aim is to estimate a debater A’s persuasive
ability after observing a series of debates they
participated in (denoted $d^A_1, \ldots, d^A_{t-1}$ if we are
estimating ability just before the tth debate).
We wish to take into account the content of
those debates (not merely their outcomes, as in
Elo), so as to understand what factors reveal a
debater’s skill level. Drawing inspiration from
Elo’s interpretation as a score that can be used
to predict each debater’s probability of winning
(Eq. 1), we formulate a prediction task: estimate
pA for A’s tth debate, given $d^A_1, \ldots, d^A_{t-1}$ and also
the opponent’s debate history (which might be of
a different length).
6In addition to requiring 10 debates instead of 5 per user,
as in the rest of our analysis, we also do not subject the
opponents to the same requirements.
By observing debate outcomes alongside the
two participants’ histories, we can estimate the
parameters of such a probability model. Elo
provides a baseline; rather than opaque scores
associated with individual users at different times,
we seek to explain the probability of winning
through linguistic features of past debate content.
Unlike previous work (Zhang et al., 2016;
Potash and Rumshisky, 2017; Tan et al., 2018),
we do not use the content of the current debate
($d^A_t$) to predict its outcome; rather, we forecast the
outcome of the debate as if the debate has not yet
occurred. To our knowledge, this is the first work
to derive scores for current skill levels based on
observing participants’ behavior over time. In the
remainder of this section, we discuss features of
past debates and ways of aggregating them.
4.1 Incorporating a Linguistic Profile into Elo
Elo scores are based entirely on wins and
losses; they ignore debate content. We seek
to incorporate content into expertise estimation
by using linguistic features. If we modify the
exponential base in Equation 1 from 10 to e, we
can view Elo probabilities (e.g., pA) as the output
of a logistic regression model with one feature
(the score difference, RA − RB) whose weight
is 0.0025; that is, pA = σ(0.0025 · (RA − RB)).
It is straightforward to incorporate more features,
letting
$$p_A = \sigma(\mathbf{w} \cdot (\mathbf{R}_A - \mathbf{R}_B)) \tag{3}$$
where w is a vector of weights and RU is user
U ’s ‘‘profile,’’ a vector of features derived from
past debates. In this work, the linguistic profiles
are designed based on extant theory about the
linguistic markers of persuasion, and the vectors
are weighted averages of features derived from
earlier debates.
4.2 Features
We select features discussed in prior work as
the basis for our linguistic profiles (Tan et al.,
2016, 2018; Zhang et al., 2016). We extract these
measurements from each of the user’s debates.
For a given debate and user, we calculate these
values over the rounds written by the user. For
example, if we were interested in a debate by Alex
as the instigator, we would only calculate features
from the instigator rounds of that debate (since
the contender rounds of that debate were written
by his opponent). Table 2 shows the full list of
features.

Feature               Description                                          Significance
Elo score             Traditional Elo score calculated and updated.        n/a
                      Updated traditionally, not averaged as in §4.3.
Length                Number of words this user uttered in the debate.     ↑↑↑
Part of speech        Count of each noun, verb, adjective, preposition,    Nouns (↑↑↑),
                      adverb, or pronoun from the participant in the       Adjectives (↑↑↑)
                      entire debate.
Flesch reading ease   Measure of readability given the number of           ↑↑
                      sentences in a document and the number of words
                      in each sentence (Kincaid et al., 1975).
Emotional words       Cues that indicate a positive or negative emotion    Pos (↑↑↑),
                      (Tausczik and Pennebaker, 2010).                     Neg (↑↑↑)
Links                 Links to external websites outside of debate.org.
                      This feature operationalizes the number of
                      sources a debater used.
Questions             The number of questions the user asked in the
                      debate.
Quotations            The number of quotations the user included in
                      the debate.
Hedging               The number of phrases that soften a statement by
                      adding uncertainty (Hyland, 1996; Hanauer et al.,
                      2012).
Fightin’ words        The number of instances of words most strongly
                      associated with either debater (Monroe et al.,
                      2008).
H∧FW                  The number of co-occurrences of hedging and
                      fightin’ words, described in §4.2.
Significance entries for the rows from Links onward appear in the source in the
order ↓↓↓, ↑↑↑, ↑↑; their assignment to individual rows is not recoverable here.

Table 2: Debate-level features used in estimating
skill levels. Aside from Elo, the features are a
part of the user’s linguistic profile. The third column
represents statistical significance levels in comparing
winners and losers’ features (independently) with
Bonferroni correction: ↑ is p < 0.05, ↑↑ is p < 0.01,
↑↑↑ is p < 0.001.
Hedging with fightin’ words. We introduce
one novel feature for our work: hedging with
fightin’ words. ‘‘Fightin’ words’’ refer to words
found using a method, introduced by Monroe et al.
(2008), which seeks to identify words (or phrases)
most strongly associated with one side or another
in a debate or other partisan discourse.7 We
are interested in situations where debaters evoke
fightin’ words (their own, or their opponents’)
with an element of uncertainty or doubt. We use
each debater’s top 20 fightin’ words (unigrams
or bigrams) as features, following Zhang et al.
(2016), who found this feature useful in predicting
winners of Oxford-style televised debates. We
also count co-occurrences of fightin’ words with
hedge phrases like ‘‘it could be possible that’’ or
‘‘it seems that.’’ An example of this conjoined
feature is found in the utterance ‘‘Could you
give evidence that supports the idea that married
couples are more likely to be committed to [other
tasks]?’’, where hedge phrases are emboldened
and brackets denote fightin’ words (which are
selected separately within each debate). We use a
list of hedging cues curated by Tan et al. (2016)
and derived from Hyland (1996) and Hanauer
et al. (2012). The conjoined feature is the count of
the user’s sentences in a debate in which a fightin’
word co-occurs with a hedge phrase.
7The method estimates log-odds of words given a side,
with Dirichlet smoothing, and returns the words with the
highest log-odds for each side.
4.3 Aggregating Earlier Debates
Because we consider the full history of a debater
when estimating their skill level, we opt to
aggregate the textual features over each debate. We
do so by taking a weighted sum of the feature
vectors of the previous debates. We consider
four weighting schemes, none of which have free
parameters, to preserve interpretability. Let f be
any one of the features in the linguistic profile, a
function from a single debate to a scalar.
1. Exponential growth: The most recent debates
are most indicative of skill,
$\sum_{i=1}^{t-1} f(d^A_i) / 2^{t-i}$. We take this to be the most
intuitive choice, experimentally comparing
against the alternatives below to confirm this
intuition.
2. Simple average: Each earlier debate’s
feature vector is weighted equally,
$\frac{1}{t-1} \sum_{i=1}^{t-1} f(d^A_i)$.
3. Exponential decay: The first debates are
most indicative of skill,
$\sum_{i=1}^{t-1} f(d^A_i) / 2^{i}$.
4. Last only: Only the single most recent
debate matters, $f(d^A_{t-1})$ (an extreme version
of ‘‘exponential growth’’).
In each variation of our method, some or all of
the linguistic profile features are aggregated (using
one of the four weighted averages), then applied
to predict debate outcomes through logistic re-
gression. We note that if our sole aim were to
maximize predictive accuracy, we might explore
much richer linguistic profiles, perhaps learning
word embeddings for the task and combining them
using neural networks and enabling interactions
between a debate’s two participants’ profiles.8
In this study, we seek to estimate skill but also
to understand it, so our focus remains on linear
models.
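The four weighting schemes and the prediction rule of Equation 3 can be sketched as below. This is an illustration only: the scheme names and function names are hypothetical, a non-empty history is assumed, and the weight vector is taken as given rather than fit (in the setting described above it would be learned by logistic regression).

```python
import math


def aggregate(values, scheme):
    """Aggregate one feature over past debates.

    values: f(d_1), ..., f(d_{t-1}) in chronological order (non-empty).
    """
    if not values:
        raise ValueError("at least one earlier debate is required")
    n = len(values)  # n = t - 1
    if scheme == "exp_growth":      # sum_i f(d_i) / 2^(t - i)
        return sum(v / 2 ** (n + 1 - i) for i, v in enumerate(values, start=1))
    if scheme == "simple_average":  # (1 / (t - 1)) * sum_i f(d_i)
        return sum(values) / n
    if scheme == "exp_decay":       # sum_i f(d_i) / 2^i
        return sum(v / 2 ** i for i, v in enumerate(values, start=1))
    if scheme == "last_only":       # f(d_{t-1})
        return values[-1]
    raise ValueError(f"unknown scheme: {scheme}")


def predict_win_probability(weights, profile_a, profile_b):
    """p_A = sigma(w . (R_A - R_B)), as in Eq. 3."""
    z = sum(w * (a - b) for w, a, b in zip(weights, profile_a, profile_b))
    return 1.0 / (1.0 + math.exp(-z))
```

For a feature history of 1, 2, 4 (oldest first), exponential growth gives 1/8 + 2/4 + 4/2 = 2.625, whereas exponential decay gives 1/2 + 2/4 + 4/8 = 1.5, which shows how the two schemes emphasize opposite ends of the history.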
5 Experimental Setup
We validate our skill models with a binary
classification task, specifically forecasting which
of two debaters will win a debate (without looking
at the content of the debate). Here we remove
any debates where someone forfeits or there is
no winner. We then consider debates where
each debater has completed at least five of the re-
maining debates. As discussed in §2.2, the winner
is taken to be the debater receiving the most ‘‘more
convincing argument’’ votes from observers who
did not initially agree with them. We create four
training/evaluation splits of the data, using debates
from 2013, 2014, 2015, and 2016 as evaluation
sets (i.e., development and test) and debates prior
to the evaluation year for training. Figure 4 shows
the number of debates in each split. We note that
our training sets are cumulative. For example, if
we were to test on 2015 data, we would use the
2012, 2013, and 2014 training splits as training data.
Because there are few debates before 2012 and after
2016, we do not test on 2012 data or train on 2016
data: we treat the whole of 2012 as training data
and the whole of 2016 as development and test data.
We report the accuracy for each run.
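The cumulative, year-based splits can be sketched as follows. The record format (a dict with a `year` key) and the function name are assumptions made for illustration, and the sketch simplifies the setup by not dividing each evaluation year further into development and test portions.

```python
def yearly_splits(debates, eval_years=(2013, 2014, 2015, 2016)):
    """Yield (eval_year, train, evaluation) with cumulative training sets.

    debates: list of dicts, each with a 'year' key (hypothetical format).
    Training data for a given evaluation year is every debate from an
    earlier year; the evaluation set is that year's debates.
    """
    for year in eval_years:
        train = [d for d in debates if d["year"] < year]
        evaluation = [d for d in debates if d["year"] == year]
        yield year, train, evaluation
```

Keeping the training set strictly earlier than the evaluation year matches the forecasting framing: no information from the evaluated debates or later ones reaches the model.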
We compare several predictors:9
• Full model: Our model with all features, as
described in §4, and (except where otherwise
stated) the exponential growth weighting.
This model combines linguistic profiles from
earlier debates with a conventional Elo score.
• Full model with point difference: Our full
model as described above, but we scale the
Elo gain by the point difference as described
in Equation 2.
• Linguistic profile only: Our model with
exponential growth weighting (except where
otherwise stated), but ablating the Elo feature.
This model is most similar to those found in
prior literature (Zhang et al., 2016; Tan et al.,
2018; Wang et al., 2017).
• Elo: The prediction is based solely on the
Elo score calculated just before the debate.
This is equivalent to ablating the linguistic
profiles from our model.
• Final Elo oracle: The prediction is based
solely on the two debaters’ final Elo scores
(i.e., using all debates from the past, present,
and future).
• Current debate text oracle: A model that
uses the linguistic profile derived just from
the current debate. Although this model is
most similar to previous work, it is not
a fair estimate of skill (because it ignores
past performance). We therefore view it as
another oracle.
• Majority choice: A baseline that always
predicts that the contender will win.10
Figure 4: We split data into training/development/test
based on year. This chart shows the number of debates
in each subset of the data. We note that, for training,
we use all the training debates from previous years
(e.g., if we were to test on 2015, we would train using
the training splits from 2012, 2013, and 2014). Each
instance in this figure corresponds to a debater.
8Indeed, in preliminary experiments we did explore using
a recurrent neural network instead of a fixed weighted
average, but it did not show any benefit, perhaps owing
to the relatively small size of our data set.
9We use ℓ2 regularization in our models with the linguistic
profile features.
10In this data set, contenders win nearly 59% of the
time, a fact frequently discussed in the Debate.org
community; see http://ddo.wikia.com/wiki/Contender_Advantage.
The contender advantage is sometimes attributed to having
the ‘‘final say,’’ or to the fact that contenders choose the
instigators they wish to debate.
Figure 5: Our results for the prediction task. Our full
model outperforms the Elo baseline and approaches the
current debate text oracle.
6 Results
In this section, we first show that the expertise of a
debater can be better estimated with the linguistic
profile, and then analyze the contribution of
different components. We further examine the
robustness of our results by controlling for additional
variables.
6.1 Prediction Performance
We first present our results with what we consider
our best model, that is, our full model (with point
differences), which consists of all features and
uses the exponentially growing weight.
Importance of linguistic features. We see from
Figure 5 that the full model outperforms the
standard Elo baseline. The gap between the two
models suggests that the addition of the linguistic
profile contributes to the performance of the
model and therefore plays a useful role in skill
estimation. Moreover, the linguistic profile only model
shows that the linguistic profile features are not
only useful, but have at least as strong predictive
power as Elo alone. By only using the linguistic
features aggregated over the course of a debater’s
history, without knowing winning records, we can
forecast at least as well as the Elo baseline.
Figure 6: Our bootstrap on the feature ablations. We
record the average drop in performance across 100
iterations, testing on the 2016 test set. Higher means
a larger drop in performance.
Importance of multiple debates. We also note
that our full model only performs slightly worse
than the current debate text oracle despite the
current debate text model directly observing the
content of the debate. This result implies that
using information from only previous debates has
at least similar predictive strength to information
from the debate at hand. Moreover, the large gap
between the final Elo and current debate text
oracles implies that a user’s skill is evidenced by
more than the content of a single debate. These
results further demonstrate the importance of
accounting for debaters’ prior history.
Magnitude of victory might matter. Our full
model with the point-difference-scaled Elo gain
does roughly as well as or slightly better than our
normal full model. As the focus of this paper is on
incorporating linguistic profiles, we use the full
model without the point difference scaling for
analyses in the rest of the paper.
6.2 Feature Ablations
We inspect the contribution of each feature by
removing each one from the model. Then, for
each feature, we perform a bootstrap test over
the last year of data (trained on 2012–2015 data;
tested on 2016). At each iteration, we sample
1,000 training examples to train on, but fix the
test set across iterations. We then train our full
model alongside several other models, each with
a feature ablated, on the sample. We track the
drop in performance between our full model and
each of our other models. We record the average
performance over 100 iterations for comparison.
Figure 7: Comparison in performance for the four ways
we aggregate features over time.
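The bootstrap ablation procedure described in this section can be sketched as below. The names are hypothetical: `fit_and_score` stands in for whatever routine trains a model on a sample with a given feature set and returns its accuracy on the fixed test set. Sampling with replacement is an assumption, as is usual for a bootstrap.

```python
import random


def bootstrap_ablation(train, test, features, fit_and_score,
                       n_iter=100, sample_size=1000, seed=0):
    """Average drop in test accuracy when each feature is ablated.

    fit_and_score(sample, test, feature_list) -> accuracy is a
    caller-supplied (hypothetical) training and evaluation routine.
    """
    rng = random.Random(seed)
    total_drop = {f: 0.0 for f in features}
    for _ in range(n_iter):
        # Resample the training data; the test set stays fixed.
        sample = [rng.choice(train) for _ in range(sample_size)]
        full_score = fit_and_score(sample, test, list(features))
        for f in features:
            kept = [g for g in features if g != f]
            total_drop[f] += full_score - fit_and_score(sample, test, kept)
    return {f: drop / n_iter for f, drop in total_drop.items()}
```

A larger returned value for a feature means that removing it hurts the full model more, which is the quantity plotted in Figure 6.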
From Figure 6, we find that removing Elo re-
sults in the most severe drop in performance
(5.8%). Ablating part-of-speech, negative emotion,
and length from our model had a moderate ef-
fect on performance. Surprisingly, we find that,
although the H∧FW feature is the overlap between
hedge cues and fightin’ words, the latter two
features contribute very little to the performance
of our model compared with H∧FW.
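The bootstrap ablation procedure described in this section can be sketched roughly as follows. This is a minimal illustration rather than our actual implementation: `train_and_score` is a hypothetical callback that stands in for fitting a model on a sample (with one feature removed, or none for the full model) and returning its accuracy on the fixed test set, and sampling with replacement is an assumption of the sketch.

```python
import random
from statistics import mean

def bootstrap_ablation(train, features, train_and_score,
                       n_iter=100, sample_size=1000, seed=0):
    """Average accuracy drop per ablated feature over bootstrap samples.

    train_and_score(sample, ablated) is a hypothetical callback: it fits a
    model on `sample` without the feature `ablated` (None = full model) and
    returns accuracy on a test set that stays fixed across iterations.
    """
    rng = random.Random(seed)
    drops = {f: [] for f in features}
    for _ in range(n_iter):
        # Assumption: training examples are resampled with replacement.
        sample = rng.choices(train, k=sample_size)
        full_acc = train_and_score(sample, None)
        for f in features:
            # Positive values mean the model got worse without feature f.
            drops[f].append(full_acc - train_and_score(sample, f))
    return {f: mean(d) for f, d in drops.items()}
```

Training the full model and every ablated model on the same sample in each iteration keeps the comparison paired, so the recorded drop reflects the missing feature rather than sampling noise.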
6.3 Combining Prior Debate Features
As described in §4.3, we explore several ways
of combining features over the past debates.
By inspecting how these aggregation functions
differ in performance, we hope to find out whether
some debates are more important than others,
whether recency matters, and whether history
matters at all. We do so by 1) giving the last
debates more weight (exponential growth),
2) giving all debates equal weight (simple
average), 3) giving the first debates more weight
(exponential decay), and 4) giving all the weight
to the last debate (last debate only). The
rest of this section discusses these results, shown
in Figure 7.
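The four weighting schemes can be illustrated with a short sketch. The exact functional form is given in §4.3; here we simply assume geometric weights with a hypothetical rate `base`, normalized to sum to one, and the function names are ours for illustration only.

```python
def aggregation_weights(n, scheme, base=2.0):
    """Normalized weights over n past debates, ordered oldest to newest.

    `base` is a hypothetical growth/decay rate, not a value from the paper.
    """
    if n < 1:
        raise ValueError("need at least one past debate")
    if scheme == "growth":        # most recent debates weigh most
        raw = [base ** i for i in range(n)]
    elif scheme == "average":     # every debate weighs the same
        raw = [1.0] * n
    elif scheme == "decay":       # earliest debates weigh most
        raw = [base ** -i for i in range(n)]
    elif scheme == "last":        # only the most recent debate counts
        raw = [0.0] * (n - 1) + [1.0]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw)
    return [w / total for w in raw]

def aggregate(feature_history, scheme, base=2.0):
    """Weighted combination of one feature's values across past debates."""
    weights = aggregation_weights(len(feature_history), scheme, base)
    return sum(w * x for w, x in zip(weights, feature_history))
```

For a feature that rises over a debater's history, the growth scheme yields a larger aggregate than the simple average, which in turn exceeds the decay scheme; the last-only scheme returns the most recent value unchanged.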
More debates help. When using only the
last debate’s features and ignoring all previous
debates, our performance is initially very good in
the years 2013 and 2014 (when our training sets are
smaller and histories shorter). However, as debate
histories become richer in the later years,
last-only’s performance drops in comparison to that
of the growing weight aggregation. These results
match our intuition: with more experienced users,
considering more debates gives us a better gauge
of how a debater performs, and a debater’s more
recent debates give a better snapshot of the user’s
current skill level.
Later debates matter more. As hinted in our
last-only results, giving the last debates more
weight does best in all four years. In light of its
performance compared with the simple mean and
decaying weight settings, our results imply that
not all debates contribute equally to a debater’s
skill under our model. Indeed, our results show
that the most recent debates are also the most
important for estimating a debater’s skill rating.
6.4 Impact of the Length of History
We finally examine how the amount of history our
models have seen for a user impacts performance.
Compared with the final Elo and current debate
text oracle models, both the full and the Elo
models suffer when both debaters have no prior
history; these examples devolve into having to
guess the majority class.
Therefore, we further analyze these models by
inspecting the performance on subsets of the test
set pertaining to instances where 1) both users are
new, 2) there is only one new user, and 3) both
debaters have been seen before. We expect that we
will have better performance as our debates have
more complete debate histories. We complete the
analysis by looking at the 2016 test set.
Figure 8: Our results separated by the number of times
we have seen each participant before in training for
the 2016 test set. The parenthetical numbers show how
many debates are in the subset.
Figure 8 shows that both models vastly improve
when removing debates where users have no
history and continue to improve as we require
users to have longer histories. These results are
consistent with our intuition that longer debate
histories allow our models to better infer a
debater’s skill. Moreover, our model generally
outperforms the Elo baseline across the different
history lengths, implying that our model’s
performance is consistent. We do note, however,
that the gap (between the full model and Elo
baseline) is consistent between the one-new-user
and no-new-users cases. This could imply that our
model, in addition to using longer histories more
effectively, is more capable of estimating a
debater’s skill against an unknown user compared
with the normal Elo baseline. The slightly higher
performance in the one-new-user case, compared
with no-new-users, is likely due to the small size
of each subset.
Figure 9: Errors of the full model broken down by the
history sizes of the users. The overall distribution is to
the left and the error distribution is to the right.
We extend our analysis of the effect of history
lengths on skill estimation by inspecting the
errors that our model makes. We are particularly
interested in characterizing the mistakes that our
model makes by the participant with the shorter
debate history. For each debate that the full model
predicts incorrectly, we keep count of the history
length of the user with the lower history count
(e.g., if a debate has a user with a debate history
of length 5 against a user with a debate history
of length 1, we would treat the debate as having a
history size of 1). We find in Figure 9 that a large
percentage of the errors (51% across all debates)
come from debates where one user had a history
length of 0 at test time. Moreover, the debates with
at least one user with no history tend to contribute
proportionally more to the error rate relative to the
overall distribution. Conversely, the debates with
users who have deeper histories of three debates
or more tend to contribute less to the error rate.
6.5 Controlling for Other Variables
We also inspect how well our model generalizes
by measuring our model’s accuracy on subsets of
the test set. In particular, we hope to see if
high-activity users or the topic of a debate affect
our model’s performance.
Controlling for prolific users. As seen in
Figure 2, debater activity follows a heavy-tailed
distribution where a minority of debaters perform
the majority of debates. In our data set, the 18.9%
of debaters who have engaged in more than 10
debates participate in 66.9% of the debates. Thus,
in order to validate that our model does not simply
recognize the most active users, we test on a
subset of the 2016 data where we cap the number
of debates a user can participate in at 10 (142
debates).
Figure 10: Our results for the prediction task on a
restricted subset of the test set comprising only debates
that are within both participants’ first 10 debates.
Although our model does suffer in accuracy under
this setting, it still outperforms both baselines and
approaches the text-only oracle.
Figure 11: Our results for the prediction task broken
down by topic of debate. Our full model maintains a
consistent performance across all categories.
From Figure 10, we see that the ordering of the
models is roughly the same even when we restrict the test set to
require that both users have participated in fewer
than 10 debates at the time of the debate. This
implies that our model generalizes relatively well
to most debaters regardless of their experience.
We do note, however, that most of the models
incur a large decrease in performance whereas
the final Elo oracle maintains roughly the same
accuracy.
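One way to build such a restricted subset is to replay the debates in chronological order and count how many earlier debates each participant has at the time of each debate. The sketch below does this under assumptions of our own: debates are `(user_a, user_b)` tuples already sorted by time, `cap=10` mirrors the threshold used above, and all function names are hypothetical. The same counts also give the per-debate history size used in the error analysis of §6.4 (the smaller of the two participants' counts).

```python
from collections import defaultdict

def prior_counts(debates):
    """Yield (debate, prior_a, prior_b) for chronologically sorted debates.

    Each debate is a (user_a, user_b) tuple; prior_x is the number of
    earlier debates in which that user took part.
    """
    seen = defaultdict(int)
    for user_a, user_b in debates:
        yield (user_a, user_b), seen[user_a], seen[user_b]
        # Update the counts only after reporting them for this debate.
        seen[user_a] += 1
        seen[user_b] += 1

def low_activity_subset(debates, cap=10):
    """Keep debates where both users had fewer than `cap` prior debates."""
    return [d for d, a, b in prior_counts(debates) if a < cap and b < cap]

def shorter_history(debates):
    """History size of each debate: the smaller of the two prior counts."""
    return [min(a, b) for _, a, b in prior_counts(debates)]
```

In practice the replay would run over the training and test years together, keeping only the test-year debates that pass the filter, so that histories accumulated during training are counted.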
Our observations are invariant to debate topic.
We also break down our test set by debate topic.
We use the three most popular topics (politics,
religion, and philosophy) and aggregate the rest
of the categories as ‘other’.
We see from Figure 11 that our model consistently
outperforms both baselines and typically falls
below the oracle models, with the exception of
‘other’. These results are consistent with those
found in Figure 5 and show that our model performs
consistently across different debate topics.
7 Language Change over Time: The Best
Improve and the Worst Stagnate
Our findings from §3.2 and §6.3 imply that users
tend to experience some change over time and
that these changes are helpful for forecasting who
in a debate is more likely to win. In this section,
we explore whether debaters’ linguistic tendencies
change over time by tracking their use of features
over the course of their debate history.
Figure 12: Feature measurements averaged across each history quintile. We see that there is a general trend for
those who eventually become the best debaters to improve on these measures while the bottom users stagnate. The
error bars represent the 95% confidence intervals.
We examine how language use changes over
time for the best and worst users. To do so, we
divide each user’s debate history into quintiles as
in §3.2. For each quintile, we average the features
from the linguistic profile used in our model
(Table 2). We take the top 100
and bottom 100 users ranked by our model to see
how the trajectories change over time.
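The quintile averaging can be sketched as follows. This is an illustrative reading of the procedure, not a definitive implementation: we assume each user's history is a list of per-debate feature values ordered oldest first, that histories are split into five near-equal chunks by integer boundaries, and that users with fewer than five debates are skipped. The function names are hypothetical.

```python
def quintile_means(values):
    """Mean of a per-debate feature within each fifth of a user's history.

    `values` holds one feature value per debate, oldest first. Users with
    fewer than five debates are skipped (an assumption of this sketch).
    """
    n = len(values)
    if n < 5:
        return None
    means = []
    for q in range(5):
        # Integer boundaries split the history into five near-equal chunks.
        start, end = q * n // 5, (q + 1) * n // 5
        chunk = values[start:end]
        means.append(sum(chunk) / len(chunk))
    return means

def group_trajectory(histories):
    """Average the quintile means across a group of users (e.g., top 100)."""
    per_user = [m for m in map(quintile_means, histories) if m is not None]
    if not per_user:
        return None
    return [sum(u[q] for u in per_user) / len(per_user) for q in range(5)]
```

Averaging within each user first and then across users gives every debater equal weight in the trajectory, regardless of how many debates they have.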
Figure 12 shows that the best and worst users
have different linguistic preferences even from
the beginning of their debating activities. The
best debaters have a higher feature count in every
case except for the Flesch reading ease score.
However, the best users do seem to improve over
time in length, H∧FW, use of emotional cues, and
link use (p < 0.05 after Bonferroni correction),
which mostly agrees with our feature ablation in
Figure 6. In contrast, the worst users do not
seem to experience any significant change over
time except for the length of their rounds and
negative emotional cue use. In those cases, the
worst users seem to worsen over time.
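For reference, the Bonferroni correction mentioned above compares each p-value against the significance level divided by the number of tests. The sketch below shows only this thresholding step; the feature names and p-values in the usage note are made up for illustration, and the underlying significance test is not specified here.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at family-wise level `alpha`.

    `p_values` maps a test name to its uncorrected p-value.
    """
    m = len(p_values)
    # Each individual test must clear the stricter threshold alpha / m.
    threshold = alpha / m
    return {name: p < threshold for name, p in p_values.items()}
```

With three hypothetical tests, such as `{"length": 0.001, "links": 0.02, "hedges": 0.2}`, the per-test threshold is 0.05 / 3, so only the first would be reported as significant.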
8 Related Work
In addition to the most relevant studies mentioned
so far, our work is related to three broad areas: skill
estimation, argumentation mining, and studies of
online debates.
Skill estimation. Ranking player strength has
been studied extensively for sports and for online
matchmaking in games such as Halo (Herbrich
et al., 2007). The Bradley-Terry models (of which
Elo is an example) serve as a basis for much of the
research in learning from pairwise comparisons
(Bradley and Terry, 1952; Elo, 1978). Another
rating system used for online matchmaking is
Microsoft’s TrueSkill rating system (Herbrich
et al., 2007), which assumes performance is nor-
mally distributed. Neural networks have recently
been explored (Chen and Joachims, 2016; Menke
and Martinez, 2008; Delalleau et al., 2012), incor-
porating player or other contextual game features
from previous games at the cost of interpretability
of those features.
Argumentation and persuasion. Past studies
have noted the persuasiveness of stylistic effects
such as phrasing or linguistic accommodation. For
example, Danescu-Niculescu-Mizil et al. (2012)
showed that, in a pool of people vying to become
an administrator of a Web site, those who were
promoted tended to coordinate more than those
who were not. Similarly, other work defines and
discusses power relations over discussion threads
such as emails (Prabhakaran and Rambow, 2014,
2013). Additionally, Tan et al. (2018) explored
how debate quotes are selected by news media.
They found that linguistic and interactive factors
of an utterance are predictive of whether or not
it would be quoted. Prabhakaran et al. (2014)
also studied political debates and found that a
debater’s tendency to switch topics correlates with
their public perception. Argumentation has also
been studied extensively in student persuasive
essays and web discourse (Persing and Ng, 2015;
Ong et al., 2014; Song et al., 2014; Stab and
Gurevych, 2014; Habernal and Gurevych, 2017;
Lippi and Torroni, 2016). Most relevant to our
work on how users improve over time, Zhang et al.
(2017) study how one document may improve over
time through annotated revisions. Where our work
examines users’ linguistic change across multiple
debates, they focus on how a user improves a
single document over multiple revisions.
Online debates. There has also been recent work
in characterizing specific arguments in online
settings in contrast to our focus on the debaters
themselves. For example, Somasundaran and
Wiebe (2009), Walker et al. (2012), Qiu et al.
(2015), and Sridhar et al. (2015) built systems
for identifying the stances users take in online
debate forums. Lukin et al. (2017) studied how
persuasiveness of arguments depends on person-
ality factors of the audience.
Other researchers have focused on annotation
tasks. For example, Park and Cardie (2014) anno-
tated online user comments to identify and clas-
sify different propositions. Hidey et al. (2017)
annotated comments from the ChangeMyView
subreddit, a community where participants ask the
community to change a view they hold. Likewise,
Anand et al. (2011) annotated online blogs with
a classification of persuasive tactics. Inspired
by Aristotle’s three modes of persuasion (ethos,
pathos, and logos) their work annotates claims
and premises within the comments. Habernal
and Gurevych (2016) used crowdsourcing to
study what makes an argument more convincing.
They paired two similar arguments and asked
annotators which one was more convincing. This
framework allowed them to study the flaws in the
less convincing arguments. The annotations they
produced offer a rich understanding of arguments
which, though costly to obtain, could be useful in future work.
9 Conclusion
In this work, we introduced a method that uses
a linguistic profile derived from a debater’s
history of past debates to model their skill
level as it changes over time. Using data from
Debate.org, we formulate a forecasting task
around predicting which of two debaters will be
more convincing to observers predisposed to be
unconvinced. We find that linguistic profiles on
their own are about as predictive as the classic
Elo method, which does not parameterize
skill according to attributes of a participant or
their behavior, but only models wins and losses.
Moreover, we show that our findings are robust
to topic of debate and frequency of user activity.
Further, a model combining linguistic profiles
with Elo achieves predictive accuracy nearly
on par with an oracle based on the content of
the debate itself. A particular feature combining
hedging with fightin’ words is notably important
in our model, and consistent with evidence that
debaters improve with practice, more recent de-
bates appear to provide better estimates of skill
than earlier ones. To verify our hypothesis that
users improve, we explicitly track the feature
use of debaters over the course of their debating
activity to show that the best users improve while
the bottom users stagnate. Our approach sets the
stage for future explorations of the role of history
profile, discourse, more fine-grained sentiment
features, or notions of topic in persuasion.
Acknowledgments
We thank Esin Durmus for her help with the DDO
corpus. We also thank the anonymous reviewers
and the action editor for their helpful comments
and suggestions that helped improve the paper.
This research was supported in part by a University
of Washington Innovation Award to the third
author.
References
Tim Althoff, Cristian Danescu-Niculescu-Mizil,
and Dan Jurafsky. 2014. How to ask for a favor:
A case study on the success of altruistic re-
quests. In Proceedings of ICWSM.
Pranav Anand, Joseph King, Jordan Boyd-Graber,
Earl Wagner, Craig Martell, Doug Oard, and
Philip Resnik. 2011. Believe me? We can do
this! Annotating persuasive acts in blog text. In
Proceedings of the Workshops at AAAI.
Ralph Allan Bradley and Milton E. Terry. 1952.
Rank analysis of incomplete block designs: I.
the method of paired comparisons. Biometrika,
39(3/4):324–345.
Shuo Chen and Thorsten Joachims. 2016.
Predicting matchups and preferences in context. In
Proceedings of SIGKDD.
Cristian Danescu-Niculescu-Mizil, Lillian Lee,
Bo Pang, and Jon Kleinberg. 2012. Echoes of
power: Language effects and power differences
in social interaction. In Proceedings of WWW.
Olivier Delalleau, Emile Contal, Eric Thibodeau-
Laufer, Raul Chandias Ferrari, Yoshua Bengio,
and Frank Zhang. 2012. Beyond skill rating:
Advanced matchmaking in Ghost Recon Online.
IEEE Transactions on Computational Intelligence
and AI in Games, 4(3):167–177.
Esin Durmus and Claire Cardie. 2018. Exploring
the role of prior beliefs for argument persuasion.
In Proceedings of NAACL-HLT.
Esin Durmus and Claire Cardie. 2019. Modeling
the factors of user success in online debate. In
Proceedings of WWW.
Arpad E. Elo. 1978. The Rating of Chessplayers,
Past and Present. Arco Publishers.
Ivan Habernal and Iryna Gurevych. 2016.
What makes a convincing argument? Empirical
analysis and detecting attributes of convincingness
in Web argumentation. In Proceedings of
EMNLP.
Ivan Habernal and Iryna Gurevych. 2017.
Argumentation mining in user-generated Web
discourse. Computational Linguistics, 43(1):125–179.
David A. Hanauer, Yang Liu, Qiaozhu Mei,
Frank J. Manion, Ulysses J. Balis, and Kai
Zheng. 2012. Hedging their mets: The use of
uncertainty terms in clinical documents and
its potential implications when sharing the
documents with patients. In Proceedings of
AMIA.
Ralf Herbrich, Tom Minka, and Thore Graepel.
2007. TrueSkill™: A Bayesian skill rating
system. In Proceedings of NIPS.
Christopher Hidey, Elena Musi, Alyssa Hwang,
Smaranda Muresan, and Kathy McKeown.
2017. Analyzing the semantic types of claims
and premises in an online persuasive forum.
In Proceedings of the Workshop on Argument
Mining.
Ken Hyland. 1996. Writing without conviction?
Hedging in science research articles. Applied
Linguistics, 17(4):433–454.
J Peter Kincaid, Robert P. Fishburne Jr, Richard
L. Rogers, and Brad S. Chissom. 1975.
Derivation of new readability formulas (Automated
Readability Index, Fog Count and Flesch Reading
Ease Formula) for Navy enlisted personnel.
CNTECHTRA Research Branch Report 8-75.
Marco Lippi and Paolo Torroni. 2016.
Argumentation mining: State of the art and emerging
trends. ACM Transactions on Internet
Technology (TOIT), 16(2):10.
Stephanie M. Lukin, Pranav Anand, Marilyn
Walker, and Steve Whittaker. 2017. Argument
strength is in the eye of the beholder: Audience
effects in persuasion. In Proceedings of EACL.
Joshua E. Menke and Tony R. Martinez. 2008.
A Bradley–Terry artificial neural network
model for individual ratings in group com-
petitions. Neural Computing and Applications,
17(2):175–186.
Burt L. Monroe, Michael P. Colaresi, and
Kevin M. Quinn. 2008. Fightin’ words: Lexical
feature selection and evaluation for identify-
ing the content of political conflict. Political
Analysis, 16(4):372–403.
Nathan Ong, Diane Litman, and Alexandra
Brusilovsky. 2014. Ontology-based argument
mining and automatic essay scoring. In Pro-
ceedings of the Workshop on Argumentation
Mining.
Joonsuk Park and Claire Cardie. 2014. Identifying
appropriate support for propositions in online
user comments. In Proceedings of the Workshop
on Argumentation Mining.
Isaac Persing and Vincent Ng. 2015. Modeling
argument strength in student essays. In Pro-
ceedings of ACL.
Peter Potash and Anna Rumshisky. 2017. To-
wards debate automation: A recurrent model
for predicting debate winners. In Proceedings
of EMNLP.
Vinodkumar Prabhakaran, Ashima Arora, and
Owen Rambow. 2014. Staying on topic: An
indicator of power in political debates. In
Proceedings of EMNLP.
Vinodkumar Prabhakaran and Owen Rambow.
2013. Written dialog and social power: Mani-
festations of different types of power in dialog
behavior. In Proceedings of IJCNLP.
Vinodkumar Prabhakaran and Owen Rambow.
2014. Predicting power relations between
participants in written dialog from a single
thread. In Proceedings of ACL.
Minghui Qiu, Yanchuan Sim, Noah A. Smith, and
Jing Jiang. 2015. Modeling user arguments,
interactions, and attributes for stance prediction
in online debate forums. In Proceedings of
SDM.
Nate Silver. 2015. How our NFL predictions
work. https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/.
Accessed: 2019-02-30.
Swapna Somasundaran and Janyce Wiebe. 2009.
Recognizing stances in online debates. In Pro-
ceedings of ACL-IJCNLP.
Yi Song, Michael Heilman, Beata Beigman
Klebanov, and Paul Deane. 2014. Applying
argumentation schemes for essay scoring. In
Proceedings of the Workshop on Argumentation
Mining.
Dhanya Sridhar, James Foulds, Bert Huang,
Lise Getoor, and Marilyn Walker. 2015. Joint
models of disagreement and stance in online
debate. In Proceedings of ACL.
Christian Stab and Iryna Gurevych. 2014. Iden-
tifying argumentative discourse structures in
persuasive essays. In Proceedings of EMNLP.
Chenhao Tan, Lillian Lee, and Bo Pang. 2014.
The effect of wording on message propagation:
Topic- and author-controlled. In Proceedings of
ACL.
Chenhao Tan, Vlad Niculae, Cristian Danescu-
Niculescu-Mizil, and Lillian Lee. 2016.
Winning arguments: Interaction dynamics and
persuasion strategies in good-faith online
discussions. In Proceedings of WWW.
Chenhao Tan, Hao Peng, and Noah A. Smith.
2018. You are no Jack Kennedy: On media
selection of highlights from presidential de-
bates. In Proceedings of WWW.
Yla R. Tausczik and James W. Pennebaker.
2010. The psychological meaning of words:
LIWC and computerized text analysis methods.
Journal of Language and Social Psychology,
29(1):24–54.
Marilyn A. Walker, Pranav Anand, Robert Abbott,
and Ricky Grant. 2012. Stance classification
using dialogic properties of persuasion. In Pro-
ceedings of NAACL-HLT.
Lu Wang, Nick Beauchamp, Sarah Shugars, and
Kechen Qin. 2017. Winning on the merits: The
joint effects of content and style on debate
outcomes. Transactions of the Association for
Computational Linguistics, 5:219–232.
Fan Zhang, Homa B. Hashemi, Rebecca Hwa,
and Diane Litman. 2017. A corpus of annotated
revisions for studying argumentative writing. In
Proceedings of ACL.
Justine Zhang, Ravi Kumar, Sujith Ravi, and
Cristian Danescu-Niculescu-Mizil. 2016. Con-
versational flow in Oxford-style debates. In
Proceedings of NAACL-HLT.