Computational Constancy Measures of
Texts—Yule's K and Rényi's Entropy

Kumiko Tanaka-Ishii∗,∗∗
Kyushu University and JST-PRESTO

Shunsuke Aihara†
Gunosy Inc.

This article presents a mathematical and empirical verification of computational constancy mea-
sures for natural language text. A constancy measure characterizes a given text by having an in-
variant value for any size larger than a certain amount. The study of such measures has a 70-year
history dating back to Yule’s K, with the original intended application of author identification.
We examine various measures proposed since Yule and reconsider reports made so far, thus
overviewing the study of constancy measures. We then explain how K is essentially equiva-
lent to an approximation of the second-order Rényi entropy, thus indicating its signification
within language science. We then empirically examine constancy measure candidates within
this new, broader context. The approximated higher-order entropy exhibits stable convergence
across different languages and kinds of text. We also show, however, that it cannot identify
authors, contrary to Yule’s intention. Lastly, we apply K to two unknown scripts, the Voynich
manuscript and Rongorongo, and show how the results support previous hypotheses about these
scripts.

1. Introduction

A constancy measure for a natural language text is defined, in this article, as a com-
putational measure that converges to a value for a certain amount of text and remains
invariant for any larger size. Because such a measure exhibits the same value for any size
of text larger than a certain amount, its value could be considered as a text characteristic.
The concept of such a text constancy measure was introduced by Yule (1944) in
the form of his measure K. Since Yule, there has been a continuous quest for such
measures, and various formulae have been proposed. They can be broadly categorized
into three types, namely, those measuring (1) repetitiveness, (2) power law character,
and (3) complexity.

∗ Kyushu University, 744 Motooka Nishiku, Fukuoka City, Fukuoka, Japan.

E-mail: kumiko@ait.kyushu-u.ac.jp.

∗∗ JST-PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan.
† Gunosy Inc., 6-10-1 Roppongi, Minato-ku, Tokyo, Japan.

Submission received: 11 July 2013; revised version received: 17 February 2015; accepted for publication:
18 March 2015.

doi:10.1162/COLI_a_00228

© 2015 Association for Computational Linguistics


Yule’s original intention for K’s utility lay in author identification, assuming that it
would differ for texts written by different authors. State-of-the-art multivariate machine
learning techniques are powerful, however, for solving such language engineering
tasks, in which Yule’s K is used only as one variable among many, as reported in
Stamatatos, Fakotakis, and Kokkinakis (2001) and Stein, Lipka, and Prettenhofer (2010).
We believe that constancy measures today, however, have greater importance in
understanding the mathematical nature of language. Although mathematical models of
language have been studied in the computational linguistics milieu, via Markov models
(Manning and Schuetze 1999), Zipf’s law and its modifications (Mandelbrot 1953; Zipf
1965; Bell, Cleary, and Witten 1990), and Pitman-Yor models (Teh 2006) more recently,
the true mathematical model of linguistic processes is ultimately unknown. Therefore,
the convergence of a constancy measure must be examined through empirical verifi-
cation. Because some constancy measures have a mathematical theory of convergence
for a known process, discrepancies in the behavior of real linguistic data from such a
theory would shed light on the nature of linguistic processes and give hints towards
improving the mathematical models. Furthermore, as one application, a convergent
measure would allow for comparison of different texts through a common, stable norm,
provided that the measure converges for a sufficiently small amount of text. One of
our goals is to discover a non-trivial measure with a certain convergence speed that
distinguishes the different natures of texts.

The objective of this article is thus to provide a potential explanation of what the
study of constancy measures over 70 years has been about, by answering the three
following questions mathematically and empirically:

Question 1 Does a measure exhibit constancy?

Question 2 If so, how fast is the convergence speed?

Question 3 How discriminatory is the measure?

We seek answers by first showing the meaning of Yule's K in relation to the Rényi
higher-order entropy, and by then empirically examining constancy across large-scale
texts of different kinds. We finally provide an application by considering the natures of
two unknown scripts, the Voynich manuscript and Rongorongo, in order to show the
possible utility of a constancy measure.

The most important and closest previous work was reported in Tweedie and Baayen
(1998), the first paper to have examined the empirical behavior of constancy measures
on real texts. The authors used English literary texts to test constancy measure candi-
dates proposed prior to their work. Today, the coverage and abundance of language
corpora allow us to conduct a larger-scale investigation across multiple languages.
Recently, Golcher (2007) tested his measure V (discussed later in this paper) with Indo-
European languages and also programming language sources. Our papers (Kimura
and Tanaka-Ishii 2011, 2014) also precede this one, presenting results preliminary to
this article but with only part of our data, and neither of those provides mathematical
analysis with respect to the Rényi entropy. Compared with these previous reports, our
contribution here can be summarized as follows:

- Our work elucidates the mathematical relation of Yule's K to Rényi's
  higher-order entropy and explains why K converges.

- Our work vastly extends the corpora used for empirical examination in
  terms of both size and language.

- Our work compares the convergent values for these corpora.

- Our work also presents results for unknown language data, specifically
  from the Voynich manuscript and Rongorongo.

We start by summarizing the potential constancy measures proposed so far.

2. Constancy Measures

The measures proposed so far can broadly be categorized into three types, calculating
the repetitiveness, power-law distribution, or complexity of text. This section mathe-
matically analyzes these measures and summarizes them.

2.1 Measures Based on Repetitiveness

The study of text constancy started with proposals for simple text measures of vocab-
ulary repetitiveness. The representative example is Yule’s K (Yule 1944), while Golcher
recently proposed V as another candidate (Golcher 2007).

2.1.1 Yule’s K. To the best of our knowledge, the oldest mention of constancy values was
made by Yule with his notion of K (Yule 1944). Let N be the total number of words in a
text, V(N) be the number of distinct words, V(m, N) be the number of words appearing
m times in the text, and m_max be the largest frequency of a word. Yule's K is then
defined as follows, through the first and second moments of the vocabulary population
distribution of V(m, N), where S_1 = N = \sum_m m V(m, N) and S_2 = \sum_m m^2 V(m, N)
(Yule 1944; Herdan 1964):

K = C \frac{S_2 - S_1}{S_1^2} = C \left( -\frac{1}{N} + \sum_{m=1}^{m_{max}} V(m, N) \left( \frac{m}{N} \right)^2 \right)    (1)

where C is a constant enlarging the value of K, defined by Yule as C = 10^4. K is
designed to measure the vocabulary richness of a text: The larger Yule's K, the less rich
the vocabulary is. The formula can be intuitively understood from the main term of the
sum in the formula. Because the square (m/N)^2 indicates the degree of recurrence of a
word, the sum of such degrees for all words is small if the vocabulary is rich, or large in
the opposite case. Another simple example can be given in terms of S_2 in this formula.
Suppose a text is 10 words long: if each of the 10 tokens is distinct (high diversity), then
S_2 = 1 × 1 × 10 = 10; whereas, if each of the 10 tokens is identical (low diversity), then
S_2 = 10 × 10 × 1 = 100.
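As a concrete illustration, the following minimal Python sketch computes K from a list of
word tokens exactly as in Equation (1); the function name and the toy text are ours, and
C = 10^4 follows Yule's choice.

```python
from collections import Counter

def yules_k(tokens, c=10_000):
    """Yule's K as in Equation (1): K = C * (S2 - S1) / S1^2,
    with S1 = N (the text length) and S2 the sum of squared frequencies."""
    n = len(tokens)
    s2 = sum(f * f for f in Counter(tokens).values())
    return c * (s2 - n) / (n * n)

tokens = "the cat sat on the mat and the dog sat on the mat".split()
print(yules_k(tokens))   # larger values indicate a less rich vocabulary
```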
Measures that are slightly different but essentially equivalent to Yule’s K have
appeared here and there. For example, Herdan defined Vm as follows (Herdan 1964,
pp. 67, 79):

V_m = \sqrt{ \sum_{m=1}^{m_{max}} V(m, N) \left( \frac{m}{N} \right)^2 - \frac{1}{V(N)} }


Likewise, Simpson (1949) derived the following formula as a measure to capture the
diversity of a population:

D = \sum_{m=1}^{m_{max}} V(m, N) \, \frac{m}{N} \cdot \frac{m - 1}{N - 1}
which is equivalent to Yule’s K, as Simpson noted.
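The equivalence is easy to confirm numerically; a short sketch (reusing the hypothetical
yules_k helper above) shows that D equals K/C up to the factor N/(N − 1), which approaches
1 for long texts.

```python
from collections import Counter

def simpsons_d(tokens):
    # D = sum_x f_x (f_x - 1) / (N (N - 1)), with f_x the frequency of token x
    n = len(tokens)
    return sum(f * (f - 1) for f in Counter(tokens).values()) / (n * (n - 1))

tokens = "the cat sat on the mat and the dog sat on the mat".split()
n = len(tokens)
print(simpsons_d(tokens))                        # Simpson's D
print(yules_k(tokens) / 10_000 * n / (n - 1))    # identical value
```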

2.1.2 Other Measures Based on Simple Text Statistics. Apart from Yule’s K, various mea-
sures have been proposed from simple statistical observation of text, as detailed in
Tweedie and Baayen (1998). One genre is based on the so-called token-type relation
(i.e., the ratio of the vocabulary size V(N) and the text size N, in log) as formulated by
Guiraud (1954) and Herdan (1964) as a law. Because this simple ratio is not stable, the
measure was modified numerous times to formulate Herdan’s C (Herdan 1964),
Dugast’s k and U (Dugast 1979), Maas’ a2 (Maas 1972), Tuldava’s LN (Tuldava 1977),
and Brunet’s W (Brunet 1978).

Another genre of measures concerns the proportion of hapax legomena, that is
V(1, N). Honoré noted that V(1, N) increases linearly with respect to the log of a text's
vocabulary size V(N) (Honoré 1979). Another ratio, of V(2, N) to V(N), was proposed
as a text characteristic by Sichel (1975) and Maas (1972).

Each of these values, however, was found not to be convergent according to the
extensive study conducted by Tweedie and Baayen (1998). In common with Yule’s inten-
tion to apply such measures for author identification, they examined all of the measures
discussed here, in addition to two measures explained later: Orlov’s Z, and the Shannon
entropy upper bound obtained from the relative frequencies of unigrams. They exam-
ined these measures with English novels (such as Alice’s Adventures in Wonderland) and
empirically found that only Yule’s K and Orlov’s Z were convergent. Given their report,
we consider K the only true candidate among the constancy measures examined so far.

2.1.3 Golcher’s V. Golcher’s V is a string-based measure calculated on the suffix tree of a
text (Golcher 2007). Letting the length of the string be N and the number of inner nodes
of the (Patricia) suffix tree (Gusfield 1997) be k, V is defined as:

V = \frac{k}{N}    (2)

Golcher empirically showed how this measure converges to almost the same value
across Indo-European languages for about 30 megabytes of data. He also showed how
the convergent values differ from those calculated for programming language texts.
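The sketch below is only a rough, quadratic-time illustration of Equation (2) on a short
string: it counts the distinct right-branching substrings (substrings followed by at least
two different symbols, with a sentinel for the text end), which correspond to the inner
nodes of a compacted suffix tree. Details such as how the root and the terminator are
treated vary across implementations, so this is not a reproduction of Golcher's setup.

```python
from collections import defaultdict

def golcher_v(text):
    """Rough stand-in for Equation (2), V = k / N: k is approximated by the
    number of right-branching substrings, which correspond to the inner
    nodes of a (Patricia) suffix tree. O(N^2) substrings, so toy-sized only."""
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            nxt = text[j] if j < n else "$"   # sentinel marking the text end
            followers[text[i:j]].add(nxt)
    k = sum(1 for nexts in followers.values() if len(nexts) >= 2)
    return k / n

print(golcher_v("abracadabra"))
```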

Golcher explains in his paper that the possibility of constancy of V does not yet
have mathematical grounding and has only been shown empirically. He does not report
values for texts larger than about 30 megabytes nor for those of non-Indo-European
languages. A simple conjecture on this measure is that because a suffix tree for a string
of length N has at most N − 1 inner nodes, V must end up at some value 0 ≤ V < 1, for
any given text.

Our group tested V with larger-scale data and concluded that V could be a constancy
measure, although we admitted to observing a gradual increase (Kimura and Tanaka-Ishii
2014). Because V requires further verification on larger-scale data before ruling it out,
we include it as a constancy measure candidate.

2.2 Measures Based on Power Law Distributions

Since Zipf (1965), power laws have been reported as an underlying statistical
characteristic of text. The famous Zipf's law is defined as:

f(n) \propto n^{-\gamma}    (3)

where γ ≈ 1, and f(n) is the frequency of the nth most frequent word in a text. Various
studies have sought to explain mathematically how the exponent could differ depending
on the kind of text. To the best of our knowledge, however, there has been a limited
number of reports related to text constancy.

An exception is the study on Orlov's Z (Orlov and Chitashvili 1983). Orlov and
Chitashvili attempted to obtain explicit mathematical forms for V(N) and V(m, N) by
more finely considering the long tails of vocabulary distributions for which Zipf's law
does not hold. They obtained these forms through a parameter Z, defined as the
potential text length minimizing the square error of the estimated V(m, N), with its
actual value as follows:

Z = \arg\min_N \frac{1}{m_{max}} \sum_{m=1}^{m_{max}} \left\{ \frac{E[V(m, N)] - V(m, N)}{V(N)} \right\}^2    (4)

Thus defining Z, they mathematically deduced for V(N) the following formula:

V(N) = \frac{Z}{\log(m_{max} Z)} \, \frac{N}{N - Z} \, \log\left( \frac{N}{Z} \right)    (5)

Two ways to obtain Z can be formulated through approximation: one through
Good-Turing smoothing (Good 1953), which assumes Zipf's law to hold, and the other
using Newton's method. Tweedie and Baayen showed how the value of Z is stable at the
size of an English novel by a single author and thus suggested that it could form a text
characteristic. The empirical results, however, were not significantly convergent with
respect to text size, and, moreover, Tweedie and Baayen provided their results without
giving an estimation method (Tweedie and Baayen 1998). Calculation using Good-Turing
smoothing, which is derived directly from Zipf's law, would cause Z to converge, but
this does not take Orlov's original intention into consideration. Alternatively, our
group (Kimura and Tanaka-Ishii 2014) verified Z through Newton's method by setting
g(Z) = 0, where g(Z) is the following function:

g(Z) = \frac{Z}{\log(m_{max} Z)} \, \frac{N}{N - Z} \, \log\left( \frac{N}{Z} \right) - V(N)    (6)

We also showed how the value of Z increases rapidly when the text size is larger than
10 megabytes.

The major problem with measures based on power laws lies in the skewed head and
tail of the vocabulary population distribution. Because these exceptions constitute
important parts of the population, parameter estimation by fitting to Equation (3) is
sensitive to the estimation method. For example, the estimated value of the exponent
for Zipf's law depends on the method used for dealing with these exceptions. We tested
several simple methods of estimating the Zipf law's exponent γ with different ways of
handling the head and tail of a distribution.
There were settings that led to convergence, but the convergence depended on the
settings. Such difficulty could be one reason why there has been no direct proposal for
γ as a text constancy measure. Hence, due care must be taken in relating text constancy
to a power law. We chose another path by considering text constancy through a random
Zipf distribution, as described later in the experimental section.

2.3 Measures Based on Complexity

With respect to measures based on complexity, multiple reports have already examined
the Shannon entropy (Shannon 1948; Cover and Thomas 2006). In addition, we introduce
the Rényi higher-order entropy (Rényi 1960) as another possible measure.

2.3.1 Shannon Entropy Upper Bound. Let X be the random variable of a sequence
X = X_1, X_2, . . . , X_i, . . . , where X_i represents the ith element of X: X_i = x ∈ X, and
where X represents a given set (e.g., a set of words or characters) whose members
constitute the sequence. Let X_i^j (i < j) denote the random variable indicating its
subsequence X_i, X_{i+1}, X_{i+2}, . . . , X_j. Let P(X) indicate the probability function of a
sequence X. The Shannon entropy is then defined as:

H(X) = - \sum_X P(X) \log P(X)    (7)

Tweedie and Baayen directly calculated an approximation of this formula in terms of
the relative frequencies (for P) of unigrams (for X), and they concluded that the measure
would continue increasing with respect to text size and would not converge for short,
literary texts (Tweedie and Baayen 1998). Because we are interested in the measure's
behavior on a larger scale, we replicated their experiment, as discussed later in the
section on empirical constancy. We denote this measure as H1 in this article.

Apart from that report, many have studied the entropy rate, defined as:

h^* = \lim_{n \to \infty} \frac{H(X_1^n)}{n}    (8)

Theoretically, the behavior of the entropy rate with respect to text size has been
controversial. On the one hand, there have been indications of entropy rate constancy
(Genzel and Charniak 2002; Levy and Jaeger 2007). These reports argue that the entropy
rate of natural language could be constant. Due to the inherent difficulty in obtaining
the true value of h∗ from a text, however, these arguments are based only on indirect
clues with respect to convergence. On the other hand, Hilberg conjectured a decrease in
the human conditional entropy, as follows (Hilberg 1990):

H(X_n | X_1^{n-1}) \propto n^{-1+\beta}

He obtained this through an examination of Shannon's original experimental data and
suggested that β ≈ 0.5. From this formula, Dębowski induces that H(X_1^n) ∝ n^β and
that the entropy rate can be formulated generally as follows (Dębowski 2014):

\frac{H(X_1^n)}{n} \approx A n^{-1+\beta} + h^*    (9)

Note that at the limit of n → ∞, this rate goes to h∗, a constant, provided that β < 1.0.
Hilberg's conjecture is deemed compatible with entropy rate constancy at its asymptotic
limit, provided that h∗ > 0. We are therefore interested in whether this h∗ forms a text
characteristic, and if so, whether h∗ > 0 holds.1

Empirically, many have attempted to calculate the upper bound of the entropy
rate. Brown's report (Brown et al. 1992) is representative in showing a good estimation
of the entropy rate for English from texts, as compared with values obtained from
humans (Cover and King 1978). Subsequently, there have been important studies on
calculating the entropy rate, as reported thoroughly in Schümann and Grassberger
(1996). The questions related to h∗, however, remain unsolved. Recently, Dębowski
used a Lempel-Ziv compressor and examined Hilberg's conjecture for texts by single
authors (Dębowski 2013). He showed an exponential decrease in the entropy rate
with respect to text size, supporting the validity of Equation (9). Following these
previous works, we examine the entropy rate by using an algorithm proposed by
Grassberger (1989) and later on by Farach et al. (1995). This method is based on
universal coding. The algorithm has a theoretical background of convergence to the
true h∗, provided the sequence is stationary, but has been proved by Shields (1992)
to be inconsistent—that is, it does not converge to the entropy rate for certain non-
Markovian processes. We still chose to apply this method, because it requires no arbi-
trary parameters for calculation and is applicable to large-scale data within a reasonable
time.

The Grassberger algorithm (Grassberger 1989; Farach et al. 1995) can be summa-
rized as follows. Consider a sequence X of length N. The maximum matching length Li
is defined as:

L_i = \max \{ k : X_j^{j+k} = X_i^{i+k} \}

for j ∈ {1, . . . , i − 1}, 1 ≤ j ≤ j + k ≤ i − 1. In other words, L_i is the maximum common
subsequence before and after i. If \bar{L} is the average length of L_i, given by

\bar{L} = \frac{1}{N} \sum_{i=1}^{N} L_i

then the method obtains the entropy rate h1 as

h_1 = \frac{\log_2 N}{\bar{L}}    (10)

Given the true entropy rate h∗, convergence has been mathematically proven for a
stationary process, such that |h∗ − h1| = O(1) when N → ∞. In this article, we consider
this entropy rate h1 as a constancy measure candidate.
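A direct, unoptimized reading of this procedure can be written as below (the names are
ours; real implementations use suffix trees or universal coding to avoid the quadratic
search, and the treatment of positions with no earlier match varies):

```python
import math

def max_match_length(seq, i):
    """L_i: length of the longest prefix of seq[i:] that also starts at some
    j < i and lies entirely within seq[:i]."""
    best = 0
    for j in range(i):
        k = 0
        while j + k < i and i + k < len(seq) and seq[j + k] == seq[i + k]:
            k += 1
        best = max(best, k)
    return best

def entropy_rate_h1(seq):
    """h1 = log2(N) / L_bar, with L_bar the mean of the L_i (Equation (10))."""
    n = len(seq)
    lengths = [max_match_length(seq, i) for i in range(1, n)]
    l_bar = sum(lengths) / len(lengths)
    return math.log2(n) / max(l_bar, 1e-9)   # guard against an all-distinct sequence

print(entropy_rate_h1(list("abracadabra" * 20)))
```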

1 According to Dębowski (2009), h∗ = 0 suggests that the next element of a linguistic process is
deterministic, that is, a function of the corpus observed before, under the two conditions that (1) the
number of possible choices for the element is finite, and (2) the corpus observed before is infinite. In
reality, the finiteness of linguistic sequences has the opposite tendency (i.e., the size of the observed
corpus is finite, and the possible vocabulary size is infinite).


2.3.2 Approximation of Rényi Entropy Hα. The Rényi entropy is a generalization of the
Shannon entropy, defined as follows (Rényi 1960; Rényi 1970; Cover and Thomas 2006;
Bromiley, Thacker, and Bouhova-Thacker 2010):

H_\alpha(X) = \frac{1}{1 - \alpha} \log \left( \sum_X P^\alpha(X) \right)    (11)

where α ≥ 0, α ≠ 1. Hα(X) represents different ideas of sequence complexity for differ-
ent α. For example:

- When α = 0, H0(X) indicates the number of distinct occurrences of X.

- When the limit α → 1 is taken, Equation (11) reduces to the Shannon
  entropy.

The formula for α = 0 becomes equivalent to the so-called topological entropy (hence,
it is another notion of entropy) for certain probability functions (Kitchens 1998; Cover
and Thomas 2006). Note that the number of distinct tokens (i.e., the cardinality of a
set) has been used widely as a rough approximation of complexity in computational
linguistics. Indeed, in Section 2.1.2, we saw how some candidate constancy measures are
based on a token-type relation, such that the number of types is related to the complexity
of a text. For texts, note also that the value grows with respect to the text size, unless X
is considered, for example, in terms of unigrams of a phonographic alphabet.

For α → 1, there is controversy regarding convergence, as noted in the previous
section. Such difficulty in convergence for these α values lies in the nature of linguistic
processes, in which the vocabulary set evolves.

This view motivates us to consider α > 1 for Hα(X), since the formula captures
complexity by considering linguistic hapax legomena to a lesser degree, thus giving the
possibility of convergence. In fact, an approximation of the probability by the relative
frequencies of unigrams at α = 2 immediately shows the essential equivalence to Yule’s
K, since K from Equation (1) can be rewritten as follows:

\sum_{m=1}^{m_{max}} V(m, N) \left( \frac{m}{N} \right)^2 = \sum_{x \in X} \left( \frac{freq(x)}{N} \right)^2

where freq(x) is the frequency of x ∈ X. Therefore, Yule’s K has significance within the
context of complexity.

This relation of Yule’s K to the R´enyi entropy H2 is reported for the first time here, to
the best of our knowledge. This mathematical relation clarifies both why Yule’s K should
converge and what the convergent value means; specifically, the value represents the
gross complexity underlying the language system. As noted earlier, the higher-order
entropy considers hapax legomena to a lesser degree and calculates the gross entropy
only from the representative vocabulary population. This simple argument shows that
Yule’s K captures not only the simple repetitiveness of vocabulary but also the more
profound signification of its equivalence with the approximated second-order entropy.
Because K has been previously reported as a stable text constancy measure, we consider
it here once again, but this time within the broader context of Hα.
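To make the relation concrete, a small sketch (ours, not from the paper) computes the
approximated Hα from unigram relative frequencies and recovers K from H2 via
K = C(2^{-H2} − 1/N), which is just Equation (1) rewritten:

```python
import math
from collections import Counter

def renyi_entropy(tokens, alpha):
    """Approximated Renyi entropy H_alpha (Equation (11)), with P taken as
    the relative frequencies of unigrams."""
    n = len(tokens)
    probs = [f / n for f in Counter(tokens).values()]
    if alpha == 1:                                   # Shannon limit (H1)
        return -sum(p * math.log2(p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1 - alpha)

tokens = "the cat sat on the mat and the dog sat on the mat".split()
h2 = renyi_entropy(tokens, 2)
n = len(tokens)
print(h2, renyi_entropy(tokens, 3), renyi_entropy(tokens, 4))
print(10_000 * (2 ** -h2 - 1 / n))   # equals Yule's K with C = 10^4
```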


2.4 Summary of Constancy Measure Candidates

Based on the previous reports (Tweedie and Baayen 1998; Kimura and Tanaka-Ishii
2014) and the discussion so far, we consider the following four measures as candidates
for text constancy measures.

- Repetitiveness-based measures:
  Yule's K (Equation (1)); and Golcher's V (Equation (2)).

- Complexity-based measures:
  The Shannon entropy upper bound (h1 as the entropy rate (Equations (10)
  and (8)) and H1 (Equation (7), with X in terms of unigrams and the
  probability function in terms of relative frequencies); and the
  approximated Rényi entropy, denoted as Hα (α > 1) (Equation (11),
  again with X and the probability function in terms of unigrams and
  relative frequencies, respectively).

In addition, we empirically consider how these measures can be understood in the
context of the power-law feature of language. As noted in the Introduction, for the
convergent measures the speed of attaining convergence with respect to text size is
examined as well. Among the candidates, K and H1 have been previously applied in
a word-based manner, whereas V is string based. The Shannon entropy rate h1 has been
considered in both ways. Because we should be able to consider a text in terms of both
words and characters, we examine the constancy of each measure in both ways.

Furthermore, because we have seen the mathematical equivalence of Yule’s K and
H2, in the following we only consider H2. As for Hα, we consider α = 3, 4 only in
comparison with H2. Because H1 is based on relative frequencies and can be considered
together with H2, we first focus on the convergence of the three measures V, h1, and H2,
and then we consider H1 in comparison with H2, H3, and H4.

3. Data

3.1 Real Texts

Table 1 lists the data used in our experimental examination. The table indicates the data
identifier (by which we refer to the data in the rest of the article), language, source,
number of distinct tokens, data length by total number of tokens, and size in bytes. The
first block contains relatively large-scale natural language corpora consisting of texts
written by multiple authors, and the second block contains smaller corpora consisting
of texts by single authors. The third block contains programming language corpora, and
the fourth block contains corpora of unknown scripts, which we examine at the end of
this article in Section 4.3.

For the large-scale natural language data, we considered five languages: English,
Japanese, Chinese, Arabic, and Thai. These languages were chosen to represent different
language families and writing systems. The large-scale corpora in English, Japanese,
and Chinese consist of newspapers in chronological order, and the Thai and Arabic
corpora include other kinds of texts. The markers ‘w’, ‘c’, and ‘cr’ appearing at the
end of every identifier in Table 1 (e.g., Enews-c, Enews-w, and Jnews-cr) indicate text
processed through words, characters, and transliterated Roman characters, respectively.


Table 1
Our data.

Identifier | Language kind | Source | Number of distinct tokens | Data length by tokens | Data size in bytes

Large scale corpora
Enews-c    | English  | WSJ Corpus (1987)             | 87        | 112,868,099   | 108MB
Enews-w    | English  | WSJ Corpus (1987)             | 137,466   | 22,679,512    |
Jnews-c    | Japanese | 2000–2009 Mainichi Newspaper  | 5,758     | 475,101,506   | 1.3GB
Jnews-cr   | Japanese | 2000–2009 Mainichi Newspaper  | 94        | 1,087,919,430 |
Jnews-w    | Japanese | 2000–2009 Mainichi Newspaper  | 468,818   | 289,032,862   |
Cnews-c    | Chinese  | 1995 People's Daily Newspaper | 5,963     | 24,696,511    | 67MB
Cnews-cr   | Chinese  | 1995 People's Daily Newspaper | 88        | 68,325,519    |
Cnews-w    | Chinese  | 1995 People's Daily Newspaper | 144,336   | 14,965,501    |
Atext-c    | Arabic   | Watan-2004 corpus             | 59        | 42,174,262    | 73MB
Atext-w    | Arabic   | Watan-2004 corpus             | 298,370   | 7,450,442     |
Ttext-c    | Thai     | NECTEC corpus                 | 159       | 1,444,536     | 3.9MB
Ttext-w    | Thai     | NECTEC corpus                 | 16,291    | 280,602       |

Small scale corpora
Ebook1-w   | English  | Ulysses                    | 34,359 | 325,692 | 1.5MB
Ebook2-w   | English  | Les Miserables             | 25,994 | 677,163 | 3MB
Fbook-w    | French   | Les Miserables             | 31,956 | 691,407 | 3MB
Gbook-w    | German   | Kritik der reinen Vernunft | 10,604 | 215,299 | 1.3MB
Jbook-w    | Japanese | Dohyo                      | 19,179 | 502,137 | 2MB
Cbook-w    | Chinese  | Hong Lou Meng              | 18,450 | 701,255 | 2.5MB
Abook-w    | Arabic   | Quran                      | 16,121 | 75,185  | 728KB
Sbook-w    | Sanskrit | Ramayana                   | 62,318 | 213,736 | 1.9MB

Corpora of programming languages
Python-w   | Python      | python library sources | 1,517,424 | 48,704,374 | 214MB
Cplus-w    | C++         | C++ library sources    | 127,332   | 15,617,801 | 64MB
Lisp-w     | Common Lisp | sbcl and Clozure CL    | 164,248   | 2,326,270  | 32MB

Corpora of unknown scripts
VoynichA-c | Unknown | Voynich Manuscript | 22    | 44,360  | 44KB
VoynichB-c | Unknown | Voynich Manuscript | 25    | 117,105 | 115KB
VoynichA-w | Unknown | Voynich Manuscript | 2,628 | 7,460   | 44KB
VoynichB-w | Unknown | Voynich Manuscript | 4,609 | 18,495  | 115KB
RongoA-c   | Unknown | Rongorongo script  | 3,546 | 10,376  | 60KB
RongoB-c   | Unknown | Rongorongo script  | 656   | 14,003  | 60KB

As for the small-scale corpora in the second block, the texts were only considered
in terms of words, since verification via characters produced findings consistent with
those obtained with the large-scale corpora. The texts were chosen because each was
written by a single author but is relatively large.


Here, we summarize our preprocessing procedures. For the annotated Thai
NECTEC corpus, texts were tokenized according to the annotation. The preprocessing
methods for the other corpora were as follows:

- English: NLTK2 was used to tokenize text into words.

- Japanese: Mecab3 was used for tokenization, and KAKASI4 was used for
  romanization.

- Chinese: ICTCLAS20135 was used for tokenization, and the pinyin Python
  library was used for pinyin romanization.

- Other European Languages: PunktWordTokenizer6 was used for
  tokenization.

All the other natural language corpora were tokenized simply using spaces.

Following Golcher (2007), who first suggested testing constancy on programming
languages, we also collected program sources from different languages (third block in
Table 1). The programs were also considered solely in terms of words, not characters.
C++ and Python were chosen to represent different abstraction levels, and Lisp was
chosen because of its different ordering for function arguments. Source code was col-
lected from language libraries. The programming language texts were preprocessed as
follows. Comments in natural language were eliminated (although strings remained
in the programs, where each was a literal token). Identical files and copies of sources
in large chunks were carefully eliminated, although this process did not completely
eliminate redundancy since most programs reuse some previous code. Finally, the
programs were tokenized according to the language specifications.7

The last block of the table lists two corpora of unknown scripts. We consider these
scripts at the end of this article in Section 4.3, through Figure 5, to show one possible
application of the text constancy measures. The first unknown script is that of the
Voynich manuscript, a famous text that is undeciphered but hypothesized to have been
written in natural language. This corpus is considered in terms of both characters and
words, where words were defined via the white space separation in the original text.
Given the common understanding that the manuscript seems to have two different
parts (Reddy and Knight 2011), we separated it into two parts according to the Currier
annotation (identified as A and B, respectively). The second corpus of unknown text
consists of the Rongorongo script of Easter Island (Daniels and Bright 1996, Section 13;
Orliac 2005; Barthel 2013). This script’s status as natural language is debatable, but
if so, it is considered to possess characteristics of both phonographs and ideograms
(Pozdniakov and Pozdniakov 2007). Because there are several ways to consider what
constitutes a character in this script (Barthel 2013), we calculate values for the two most
extreme cases as follows. For corpus RongoA-c, we consider a character inclusive of all
adjoining parts (i.e., including accents and ornamental parts). On the other hand, for

2 http://nltk.org.
3 http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html.
4 http://kakasi.namazu.org.
5 http://ictclas.nlpir.org.
6 http://nltk.org.
7 With respect to the Lisp programming language, its culture favors long, hyphenated variable names that

can be almost as long as a sentence. For this work, therefore, Lisp variable names were tokenized by
splitting at the hyphens.


corpus RongoB-c, we separate parts as reasonably as possible, among multiple possible
separation methods. Because the unit of word in this script is unknown, the Rongorongo
script is only considered in terms of characters.

3.2 Random Data

The empirical verification of convergence for real data is controversial. We must first
note that it does not conform with the standard approach to statistical testing. In the
domain of statistics, it is a common understanding that “convergence” cannot be tested.
A statistical test raises two contrasting hypotheses—called the null and alternative
hypotheses—and calculates a p-value indicating the probability of the null hypothesis
to occur. When this p-value is smaller than a certain value, the null hypothesis is
considered unable to occur and is thus rejected. For convergence, the null hypothesis
corresponds to “not converging,” and the alternative hypothesis, to “converging.” The
problem here is that the null hypothesis is always related to the alternative hypothesis
to a certain extent, because the difference between convergence and non-convergence
is merely a matter of degree. In other words, the notion of convergence for a constancy
measure does not conform with the philosophy of statistical testing. Convergence is
therefore considered in terms of the distance from convergent values, or in terms of
the error with respect to some parameter (such as data size). Such a distance cannot be
calculated for real data, however, since the underlying mathematical model is unknown.
To sum up, verification of the convergence of real data must be considered by some
other means. Our proposal is to consider convergence in comparison to a set of random
data whose process is known. For this random data, we considered two kinds.

The first kind is used to examine data convergence in Section 4.1. This random
data was generated from real data by shuffling the original text with respect to certain
linguistic units. Tweedie and Baayen (1998) presented results by shuffling words, where
the original texts were literary texts by single authors. Here, we generated random
data by shuffling (1) words/characters, (2) sentences, or (3) documents. Because these
options greatly increased the number of combinations of results, we mainly present the
results with option (1) for large-scale data in this article. There are three reasons for this:
Convergence must be verified especially at large scale; the most important convergence
findings for randomized small-scale data were already reported in Tweedie and Baayen
(1998); and the results for options (2) and (3) were situated within the range of option
(1) and the original texts.

Randomization of the words and characters of original texts will destroy various
linguistic characteristics, such as n-grams and long-range correlation. The convergence
properties of the three measures V, h1, and H2 are as follows. The convergence of V
is unknown, because it lacks a mathematical background. Even if the value of V did
converge, the convergent value for randomized data would differ from that of the
original text, since the measure is based on repeated n-grams in the text. h1 converges to
the entropy rate of the randomized text, if the data size suffices. This is supported by the
mathematical background of the algorithm, which converges to the true entropy rate for
stationary data. Even when h1 converges for random data, the convergent value will be
larger than that of the original text, because h1 considers the probabilities of n-grams.
Lastly, H2 converges to the same point for a randomized text and the original text,
because it is the approximated higher-order entropy, such that words and characters
are considered to occur independently.

The second kind of random data is used to compare the convergent values of
different texts for a constancy measure, as considered in Section 4.2. Random corpora


were generated according to four different distributions: one uniform, and the other
three following Zipf distributions with exponents of γ = 0.8, 1.0, and 1.3, respectively,
for Equation (3). Because each set of real data consists of different numbers of distinct
tokens, ranging from tens to billions, random data sets consisting of 2^n distinct tokens
for every n = 4, . . . , 19, were randomly generated for each of the four distributions. We
only consider the measures H2 and H0 for these data sets. Both of these measures have
convergent values, given a sufficient data size.
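A sketch of how such random corpora can be generated (our helper, with numpy assumed;
the paper does not specify its sampling code): token ranks are drawn independently with
probabilities proportional to n^{-γ}, and γ = 0 gives the uniform case.

```python
import numpy as np

def random_zipf_corpus(n_distinct, length, gamma, seed=0):
    """Independent tokens whose unigram distribution follows Equation (3):
    p(rank n) proportional to n ** (-gamma)."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, n_distinct + 1)
    probs = ranks.astype(float) ** (-gamma)
    probs /= probs.sum()
    return rng.choice(n_distinct, size=length, p=probs)

# e.g., 2**10 distinct tokens, gamma = 1.0, sampled to 10**5 tokens
corpus = random_zipf_corpus(n_distinct=2 ** 10, length=10 ** 5, gamma=1.0)
```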

4. Experimental Results

From the previous discussion, we applied the three measures V, h1, and H2 with five
large-scale and eight small-scale natural language corpora, three programming lan-
guage corpora, and two unknown script corpora, in terms of words and characters.
Because there were many results for different combinations of measure, data, and token
(word or character), this section is structured so that it best highlights our findings.

4.1 Empirical Constancy

Figures 1, 2, and 3 in this section can be examined in the following manner. The hori-
zontal axis indicates the text size of each corpus, in terms of the number of tokens, on a
log scale. Chunks of different text sizes were always taken from the head of the corpus.8
The vertical axis indicates the values of the different measures: V, h1, or H2. Each figure
contains multiple lines, each corresponding to a corpus, as indicated in the legends.

First, we consider the results for the large-scale data. Figure 1 shows the different
measures for words (left three graphs) and characters (right three graphs). We can see
that V increased for both words and characters (top two graphs). Golcher tested his
measure on up to 30 megabytes of text in terms of characters (Golcher 2007). We also
observed a stable tendency up to around 10^7 characters. The increase in V became
apparent, however, for larger text sizes. Thus, it is difficult to consider V as a constancy
measure.

As for the results for h1 (middle graphs), both graphs show a gradual decrease. The
tendency was clearer for words than for characters. For some corpora, especially for
characters, it was possible to observe some values converging towards h∗. The overall
tendency, however, could not be concluded as converging. This result suggests the
difficulty in attaining convergence of the entropy rate, even with gigabyte-scale data.
From the theoretical background of the Grassberger algorithm, the values would pos-
sibly converge with larger-scale data. The continued decrease could be due to multiple
reasons, including the possibility of requiring far larger data than that used here, or a
discrepancy between linguistic processes and the mathematical model assumed for the
Grassberger algorithm.

We tried to estimate h∗ by fitting Equation (9). For the corpora with good fitting,
all of the estimated values were larger than zero, but many of the results could not

8 For real data, this was done without any randomization of the order of texts for all corpora besides

Atext-w and Atext-c. The Watan corpus is distributed not in the chronological order of the publishing
dates, but as a set of articles grouped into categories (i.e., all articles of one category, then all articles of
another category, and so on). Because of this, there is a large skew in the vocabulary distribution,
depending on the section of the corpus. We thus randomly reshuffled the articles by categories for the
whole corpus before taking chunks of different sizes (always from the beginning) to generate our results.
Apart from this, we avoided any arbitrary randomization with respect to the original data summarized in
Table 1.

Figure 1
V, h1, and H2 values in terms of words and characters for the large-scale corpora.

be fitted easily, and the estimated values were unstable due to fluctuation of the lines.
Whether a value for h∗ is reached asymptotically and also whether h∗ > 0 remain impor-
tant questions requiring separate, more extensive mathematical and empirical studies.
In contrast, H2 (or Yule’s K, bottom graphs) showed convergence, already at the
level of 10^5 tokens, for both words and characters. From the previous verification of
Yule’s K, we can conclude that H2 is convergent. The final convergent values, however,
differed for the various writing systems. We return to this issue in the next section.

To better understand the convergence, Figure 2 shows the results for the cor-
responding randomized data. As mentioned in Section 3.2, the original texts were

Figure 2
V, h1, and H2 values in terms of words and characters for the randomized large-scale corpora.

randomized by shuffling words and characters for the data examined by words and
characters, respectively. Therefore, all n-gram characteristics existing in the text were
destroyed, and what remained were the different words and characters appearing in a
random order. Here, we see how the random data’s behavior has some of the theoretical
properties of convergence, as summarized in Section 3.2.

As mentioned previously, because V has no mathematical background, its behavior
even for uniform random data is unknown, and even if it converged, the convergent
value would be smaller than that of the original text. The top two graphs in Figure 2
exhibit some oscillation, especially for randomized Chinese (Cnews-c,w). Such peculiar


oscillation was already reported by Golcher himself (Golcher 2007) for uniformly
random data. This was easy to replicate, as reported in Kimura and Tanaka-Ishii
(2014), for uniformly random data with the number of distinct tokens up to a hundred.
Because the word distribution almost follows Zipf’s law, the vocabulary is not
uniformly distributed, yet oscillating results occur for some randomized data in the top
left figure. Moreover, the values seem to increase for Japanese and English for words
at a larger scale. Although the plots for some scripts seem convergent (top right graph),
these convergent values are theoretically different from those of the original texts, if
they exist, and this stability is not universal across the different data sets. Given this
result, it is doubtful that V is convergent across languages.

In contrast, h1 is mathematically proven to be convergent given infinite-length
randomized data, but to larger values than those of the original texts, as mentioned in
Section 3.2. The middle two graphs of Figure 2 show the results for h1. The majority of
the plots do not reach convergence even at the largest data sizes, but for certain results
with characters, especially in the Roman alphabet, the plots seem to go to a convergent
value (middle right). All the plots can be extrapolated to converge to a certain entropy
rate above zero, although these values are larger than the convergent values—if they
ever exist—of the real data. These results confirm the difficulty of judging whether the
entropy rates of the original texts are convergent and whether they remain above zero.
Lastly, it is easy to see that H2 is convergent for a randomized text (bottom two
graphs), and the convergent values are the same for the cases with and without ran-
domization. In fact, the plots converge to exactly the same points faster and more stably,
which shows the effect of randomization.

As for the other randomization options, by sentences and documents, the
findings—both the tendencies of the lines and the changes in the values—can be
situated in the middle of what we have seen so far. The plots should increasingly
fluctuate more like the real data because of the incomplete randomization, in the order
of sentences and then documents.

Returning to inspection of the remaining real data, Figure 3 shows V, h1, and H2
in terms of words for the small-scale corpora (left column) and for the programming
language texts (right column). For the small-scale corpora, in general, the plots are
bunched together, and the results shared the tendencies noted previously for the large-
scale corpora. V again showed an increase, while h1 showed a tendency to decrease. H2
converged rapidly and was already almost stable at 10^4 tokens. This again shows how
H2 exhibits stable constancy, especially with texts written by single authors.

As for the programming language results, the plots fluctuate more than for the
natural language texts because of the redundancy within the program sources. Still,
the global tendencies noted so far were just discernible. V had relatively larger values
but h1 and H2 had smaller values for programs, as compared to the natural language
texts. The differences in value indicate the larger degree of repetitiveness in programs.
Lastly, Figure 4 shows the Hα results for the Wall Street Journal in terms of words
in unigrams (Enews-w). The horizontal axis indicates the corpus size, and the vertical
axis indicates the approximated entropy value. The different lines represent the results
for Hα with α = 1, 2, 3, 4. The two H1 plots represent calculations with and without
Laplace smoothing (Manning and Schuetze 1999). We can see that without smoothing,
H1 increased, as Tweedie and Baayen (1998) reported, but in contrast to their conclusion,
we observe a tendency of convergence for larger-scale data. The increase was due to the
influence of low-frequency vocabulary pushing up the entropy. The opposite tendency
to decrease was observed for the smoothed probabilities, with the plot eventually
converging to the same point as that for the unsmoothed H1 values. The convergence

Figure 3
V, h1, and H2 values for the small-scale corpora and programming language texts in terms of
words.

was by far slower for H1 as compared with that for H2, H3, and H4, which all had
attained convergence already at 10^2 tokens. The convergence values naturally decreased
for larger α, although the amount of decrease itself rapidly decreased with larger α.

In answer to Questions 1 and 2 raised in the Introduction—which measures show
constancy, with sufficient convergence speed—the empirical conclusion from our data
is that Hα with α > 1 showed stable constancy when the values were approximated
using relative frequencies. For H1, the convergence was much slower because of the
strong influence of low-frequency words. Consequently, the constancy of Hα with α > 1
is attained by representing the gross complexity underlying a text.


Figure 4
Hα with α = 1, . . . , 4 for The Wall Street Journal (Enews-w).

4.2 Discriminatory Power of H2

Now we turn to Question 3 raised in the Introduction and examine the discriminatory
power of H2. As Yule intended, does H2 identify authors? Given the influence of differ-
ent writing systems, as seen previously in Figure 1, we examine the relation between H2
and the number of distinct tokens (the alphabet/vocabulary size). Note that because
this number corresponds to H0 in Equation (11), this analysis effectively considers texts
on the H0-H2 plane. Since H0 grows according to the text size, unlike H2, the same text
size must be used for all corpora in order to meaningfully compare H0 values.9 Given
that H2 converges fast, we chose a size of 10^4 tokens to handle all of the small- and
large-scale corpora.

For each of the corpora listed in Table 1 and the second kind of random corpora
explained at the end of Section 3.2, Figure 5 plots the values of H2 (vertical axis) and
the number of distinct tokens H0 (horizontal) measured for each corpus at a size of 10^4
tokens. The three large circles are groupings of points. The leftmost group represents
news sources in alphabetic characters. All of the romanized Chinese, Japanese, and
Arabic texts are located almost at the same vertical location as the English text. This
indicates the difficulty for H2 to distinguish natural languages if measured in terms of
alphabetic characters. The middle group represents the programming language texts
in terms of words. This group is located separately (vertically lower than the natural
language corpora in terms of words), so H2 is likely to distinguish between natural
languages and programming languages. The rightmost group represents the small-scale
corpora. Considering the proximity of these points despite the variety of the content, it
is unlikely that H2 can distinguish authors, in contrast to Yule’s hope. Still, these points
are located lower than those for news text. Therefore, H2 has the potential to distinguish
genre or maybe writing style.

9 Because H0 is not convergent, the horizontal locations remain unstable, unless the tokens are of a

phonographic alphabet. In other words, for all word-based results and character-based results not based
on a phonographic alphabet, the resulting horizontal locations are changed by increasing the corpus size.
As for the random data, the H0 values are convergent, because these data sets have a finite number of
distinct tokens. Since H0 is measured only for the first 10^4 tokens, however, the horizontal locations are
underestimated, especially for random data following a Zipf distribution.


Figure 5
Convergent H2 values with respect to the number of distinct tokens for each corpus.

The natural language texts located near the line for a Zipf exponent of 0.8 are
those of the non-alphabetic writing systems.10 Note that Chinese characters have
morphological features, and the Arabic and Thai languages also have flexibility in
terms of which units are considered words and morphemes. In other words, the plots
closer to the random data with a smaller Zipf exponent are for language corpora of
morphemic sequences. The group of plots measured for phonographic scripts is located
near the line for a Zipf exponent of 1.0 (the grouping of points in the leftmost circle),
which could suggest that morphemes are more randomized units than words.

4.3 Application to Unknown Scripts: Voynich Manuscript and Rongorongo Script

The nature of unknown scripts can also be considered through our understanding thus
far. Figure 5 includes plots for the Voynich manuscript in terms of words and characters,
and for the Rongorongo script in terms of characters. Like all the data seen in this figure,
the points are placed at the H2 values (vertically) for the number of distinct tokens
(horizontally) at the specified size of 10^4 tokens, with the exception of Voynich-A in
terms of words. Because this corpus consists of fewer than 10^4 words (refer to the data
length by tokens listed for VoynichA-w in Table 1), its point is located horizontally at
the vocabulary size corresponding to the corpus’ maximum size.

For the two Voynich manuscript parts, the plots in terms of words appear near the
Arabic corpus for words (Abook-w). For characters, on the other hand, the plots are
at the leftmost end of the figure. This was due to overestimation of the total number

10 Note that here we use the values of the Zipf exponent for the random data, and not the estimated

exponents for the real data. The rank-frequency distributions of characters, especially for phonetic
alphabets, often do not follow a power law.


of characters for the alphabetic texts (e.g., both English and other, romanized language
texts), since all ASCII characters, such as colons, periods, and question marks, are
counted. Still, the H2 values are located almost at the same position as for the other
romanized texts, indicating that the Voynich manuscript has approximately similar com-
plexity. These results suggest the possibility that the Voynich manuscript could have
been generated from a source in natural language, possibly written in some script of
the abjad type. This supports previous findings (Reddy and Knight 2011; Montemurro
and Zanette 2013), which reported the possibility of the Voynich manuscript being in a
natural language and the coincidence of its word length distribution with that of Arabic.
On the other hand, the plots for the Rongorongo script appear near the line for a
Zipf exponent of 0.8, with RongoA near Arabic in terms of words but RongoB somewhat
further down from Japanese in terms of characters. The status of Rongorongo as a natural
language has been controversial (Pozdniakov and Pozdniakov 2007). Both points in the
graph, however, are near many other natural language texts (and not widely separated),
making it reasonable to hypothesize that Rongorongo is indeed a natural language. The
characters can be deemed morphologically rich, because both plots are close to the line
for a Zipf exponent of 0.8. In the case of RongoA, for which a character was considered
inclusive of all parts (i.e., including accents and ornamental parts), the morphological
richness is comparable to that of the words of an abjad script. On the other hand,
when considering the different character parts as distinct (RongoB), the location drifts
towards the plot for Thai, a phonographic script, in terms of characters. Therefore, the
Rongorongo script could be considered basically morphemic, with some parts function-
ing phonographically. This conclusion again supports a previous hypothesis proposed
by a domain specialist (Pozdniakov and Pozdniakov 2007).

This analysis of two unknown scripts supports previous conjectures. Our results,
however, add only a small piece of evidence to those conjectures; clearly, reaching a
firm conclusion would require further study. Moreover, the analysis of unknown
scripts introduced here could provide another possible application of text constancy
measures in a broader context.

5. Conclusion

We have discussed text constancy measures, whose values for a given text are invariant
across different text sizes. Such measures have a 70-year history, since Yule originally
proposed K as a text characteristic, potentially with language engineering utility for
problems such as author identification. We consider text constancy measures today to
have scientific importance in understanding language universals from a computational
view.

After overviewing measures proposed so far and previous studies on text con-
stancy, we explained how K essentially has a mathematical equivalence to the Rényi
higher-order entropy. We then empirically examined various measures across different
languages and kinds of corpora. Our results showed that only the approximated higher-
order Rényi entropy exhibits stable, rapid constancy. Examining the nature of the con-
vergent values revealed that K does not possess the discriminatory power for author
identification that Yule had hoped for. We also applied our understanding to two unknown
scripts, the Voynich manuscript and Rongorongo, and showed how our constancy
results support previous hypotheses about each of these scripts.
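
For reference, the equivalence mentioned above can be stated compactly. The following is a sketch using the standard definition of K over the empirical token frequencies f_i in a text of N tokens; the scaling constant 10^4 and the choice of the natural logarithm are assumptions of this restatement rather than necessarily the article's exact formulation:

K = 10^4 \, \frac{\sum_i f_i^2 - N}{N^2}
  = 10^4 \left( \sum_i \hat{p}_i^{\,2} - \frac{1}{N} \right),
\qquad
H_2 = -\log \sum_i p_i^2 .

With the empirical estimate \sum_i \hat{p}_i^{\,2} \approx e^{-H_2}, this gives K \approx 10^4 (e^{-H_2} - 1/N), which tends to 10^4 e^{-H_2} for large N, so a convergent K and a convergent H_2 carry essentially the same information.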

Our future work will include application of K to other kinds of data besides natural
language. There, too, we will consider the questions raised in the Introduction, of
whether K converges and of how discriminatory it is. We are especially interested in
considering the relation between the value of K and the meaningfulness of data.

Acknowledgments
This research was supported by JST’s
PRESTO program.

References
Barthel, T. 2013. The Rongorongo of Easter Island: Thomas Barthel's transliteration system. Available at http://kohaumotu.org/rongorongo org/corpus/codes.html. Accessed June 2015.
Bell, T. C., J. G. Cleary, and I. H. Witten. 1990. Text Compression. Prentice Hall.
Bromiley, P. A., N. A. Thacker, and E. Bouhova-Thacker. 2010. Shannon entropy, Renyi entropy, and information. Available at http://www.tina-vision.net/docs/memos/2004-004.pdf. Accessed June 2015.
Brown, P. F., S. A. Della Pietra, V. J. Della Pietra, J. C. Lai, and R. L. Mercer. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.
Brunet, E. 1978. Vocabulaire de Jean Giraudoux: Structure et Evolution. Slatkine.
Cover, T. and R. King. 1978. A convergent gambling estimate of the entropy of English. IEEE Transactions on Information Theory, 24(4):413–421.
Cover, T. M. and J. A. Thomas. 2006. Elements of Information Theory. Wiley-Interscience.
Daniels, P. T. and W. Bright, editors. 1996. The World's Writing Systems. Oxford University Press.
Dębowski, Ł. 2009. A general definition of conditional information and its application to ergodic decomposition. Statistics and Probability Letters, 79(9):1260–1268.
Dębowski, Ł. 2013. Empirical evidence for Hilberg's conjecture in single author texts. In Methods and Applications of Quantitative Linguistics: Selected Papers of the 8th International Conference on Quantitative Linguistics (Qualico), pages 143–151, Belgrade.
Dębowski, Ł. 2014. The relaxed Hilberg conjecture: A review and new experimental support. Available at http://www.ipipan.waw.pl/ldebowsk/. Accessed June 2015.
Dugast, D. 1979. Vocabulaire et Stylistique. I Théâtre et Dialogue. Travaux de Linguistique Quantitative. Slatkine-Champion.
Farach, M., M. Noordewier, S. Savari, L. Shepp, A. Wyner, and J. Ziv. 1995. On the entropy of DNA: Algorithms and measurements based on memory and rapid convergence. In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 48–57, San Francisco, CA.
Genzel, D. and E. Charniak. 2002. Entropy rate constancy in text. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 199–206, Philadelphia, PA.
Golcher, F. 2007. A stable statistical constant specific for human language texts. In Recent Advances in Natural Language Processing, Borovets.
Good, I. J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3–4):237–264.
Grassberger, P. 1989. Estimating the information content of symbol sequences and efficient codes. IEEE Transactions on Information Theory, 35:669–675.
Guiraud, H. 1954. Les Caractères Statistiques du Vocabulaire. Presses Universitaires de France.
Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.
Herdan, G. 1964. Quantitative Linguistics. Butterworths.
Hilberg, W. 1990. Der bekannte Grenzwert der redundanzfreien Information in Texten – eine Fehlinterpretation der Shannonschen Experimente? [The well-known limit of redundancy-free information in texts: a misinterpretation of Shannon's experiments?]. Frequenz, 44(9–10):243–248.
Honoré, A. 1979. Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7:172–177.
Kimura, D. and K. Tanaka-Ishii. 2011. A study on constants of natural language texts. Journal of Natural Language Processing, 18(2):119–137.
Kimura, D. and K. Tanaka-Ishii. 2014. A study on constants of natural language texts. Journal of Natural Language Processing, 21:877–895. Special issue of awarded papers. [English translation of the article published in Japanese in 2011.]
Kitchens, B. 1998. Symbolic Dynamics: One-sided, Two-sided and Countable State Markov Shifts. Springer.

Levy, R. and T. F. Jaeger. 2007. Speakers optimize information density through syntactic reduction. In Annual Conference on Neural Information Processing Systems, pages 1–8, Vancouver.
Maas, H. D. 1972. Zusammenhang zwischen Wortschatzumfang und Länge eines Textes [Relationship between vocabulary and text length]. Zeitschrift für Literaturwissenschaft und Linguistik, 8:73–70.
Mandelbrot, B. 1953. An informational theory of the statistical structure of language. Communication Theory, 486–500.
Manning, C. and H. Schuetze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
Montemurro, M. and D. Zanette. 2013. Keywords and co-occurrence patterns in the Voynich Manuscript: An information-theoretic analysis. PLOS One. doi: 10.1371/journal.pone.0066344.
Orliac, C. 2005. The Rongorongo tablets from Easter Island: Botanical identification and 14C dating. Archaeology in Oceania, 40(3):115–119.
Orlov, J. K. and R. Y. Chitashvili. 1983. Generalized z-distribution generating the well-known 'rank-distributions'. Bulletin of the Academy of Sciences of Georgia, 110:269–272.
Pozdniakov, K. and I. Pozdniakov. 2007. Rapanui writing and the Rapanui language: Preliminary results of a statistical analysis. Forum for Anthropology and Culture, 3:3–36.
Reddy, S. and K. Knight. 2011. What we know about the Voynich Manuscript. In ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Portland, OR.
Rényi, A. 1960. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability, pages 547–561, Berkeley, CA.
Rényi, A. 1970. Foundations of Probability. Dover Publications.
Schümann, T. and P. Grassberger. 1996. Entropy estimation of symbol sequences. Chaos, 6(3):414–427.
Shannon, C. 1948. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656.
Shields, P. C. 1992. Entropy and prefixes. Annals of Probability, 20(1):403–409.
Sichel, H. S. 1975. On a distribution law for word frequencies. Journal of the American Statistical Association, 70(351):542–547.
Simpson, E. H. 1949. Measurement of diversity. Nature, 163:688.
Stamatatos, E., N. Fakotakis, and G. Kokkinakis. 2001. Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4):471–495.
Stein, B., N. Lipka, and P. Prettenhofer. 2010. Intrinsic plagiarism analysis. Language Resources and Evaluation, 45(1):63–82.
Teh, Y. W. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 985–992, Sydney.
Tuldava, J. 1977. Quantitative relations between the size of the text and lexical richness. SMIL Quarterly, Journal of Linguistic Calculus, 4:28–35.
Tweedie, F. J. and R. H. Baayen. 1998. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32:323–352.
Yule, G. U. 1944. The Statistical Study of Literary Vocabulary. Cambridge University Press.
Zipf, G. K. 1965. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Hafner, New York.
