Computational Constancy Measures of
Texts—Yule’s K and R´enyi’s Entropy
Kumiko Tanaka-Ishii∗,∗∗
Kyushu University and JST-PRESTO
Shunsuke Aihara
Gunosy Inc.
†
This article presents a mathematical and empirical verification of computational constancy mea-
sures for natural language text. A constancy measure characterizes a given text by having an in-
variant value for any size larger than a certain amount. The study of such measures has a 70-year
history dating back to Yule’s K, with the original intended application of author identification.
We examine various measures proposed since Yule and reconsider reports made so far, daher
overviewing the study of constancy measures. We then explain how K is essentially equiva-
lent to an approximation of the second-order R´enyi entropy, thus indicating its signification
within language science. We then empirically examine constancy measure candidates within
this new, broader context. The approximated higher-order entropy exhibits stable convergence
across different languages and kinds of text. We also show, Jedoch, that it cannot identify
authors, contrary to Yule’s intention. zuletzt, we apply K to two unknown scripts, the Voynich
manuscript and Rongorongo, and show how the results support previous hypotheses about these
scripts.
1. Einführung
A constancy measure for a natural language text is defined, in this article, as a com-
putational measure that converges to a value for a certain amount of text and remains
invariant for any larger size. Because such a measure exhibits the same value for any size
of text larger than a certain amount, its value could be considered as a text characteristic.
The concept of such a text constancy measure was introduced by Yule (1944) In
the form of his measure K. Since Yule, there has been a continuous quest for such
Maßnahmen, and various formulae have been proposed. They can be broadly categorized
into three types, nämlich, those measuring (1) repetitiveness, (2) power law character,
Und (3) complexity.
∗ Kyushu University, 744 Motooka Nishiku, Fukuoka City, Fukuoka, Japan.
Email: kumiko@ait.kyushu-u.ac.jp.
∗∗ JST-PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan.
† Gunosy Inc., 6-10-1 Roppongi, Minato-ku, Tokio, Japan.
Einreichung erhalten: 11 Juli 2013; revised version received: 17 Februar 2015; zur Veröffentlichung angenommen:
18 Marsch 2015.
doi:10.1162/COLI a 00228
© 2015 Verein für Computerlinguistik
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 41, Nummer 3
Yule’s original intention for K’s utility lay in author identification, assuming that it
would differ for texts written by different authors. State-of-the-art multivariate machine
learning techniques are powerful, Jedoch, for solving such language engineering
tasks, in which Yule’s K is used only as one variable among many, as reported in
Stamatatos, Fakotakis, and Kokkinakis (2001) and Stein, Lipka, and Prettenhofer (2010).
We believe that constancy measures today, Jedoch, have greater importance in
understanding the mathematical nature of language. Although mathematical models of
language have been studied in the computational linguistics milieu, via Markov models
(Manning and Schuetze 1999), Zipf’s law and its modifications (Mandelbrot 1953; Zipf
1965; Glocke, Cleary, and Witten 1990), and Pitman-Yor models (Teh 2006) more recently,
the true mathematical model of linguistic processes is ultimately unknown. daher,
the convergence of a constancy measure must be examined through empirical verifi-
cation. Because some constancy measures have a mathematical theory of convergence
for a known process, discrepancies in the behavior of real linguistic data from such a
theory would shed light on the nature of linguistic processes and give hints towards
improving the mathematical models. Außerdem, as one application, a convergent
measure would allow for comparison of different texts through a common, stable norm,
provided that the measure converges for a sufficiently small amount of text. One of
our goals is to discover a non-trivial measure with a certain convergence speed that
distinguishes the different natures of texts.
The objective of this article is thus to provide a potential explanation of what the
study of constancy measures over 70 years has been about, by answering the three
following questions mathematically and empirically:
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
Question 1 Does a measure exhibit constancy?
Question 2 Wenn ja, how fast is the convergence speed?
Question 3 How discriminatory is the measure?
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
We seek answers by first showing the meaning of Yule’s K in relation to the R´enyi
higher-order entropy, and by then empirically examining constancy across large-scale
texts of different kinds. We finally provide an application by considering the natures of
two unknown scripts, the Voynich manuscript and Rongorongo, in order to show the
possible utility of a constancy measure.
The most important and closest previous work was reported in Tweedie and Baayen
(1998), the first paper to have examined the empirical behavior of constancy measures
on real texts. The authors used English literary texts to test constancy measure candi-
dates proposed prior to their work. Heute, the coverage and abundance of language
corpora allow us to conduct a larger-scale investigation across multiple languages.
Kürzlich, Golcher (2007) tested his measure V (discussed later in this paper) with Indo-
European languages and also programming language sources. Our papers (Kimura
and Tanaka-Ishii 2011, 2014) also precede this one, presenting results preliminary to
this article but with only part of our data, and neither of those provides mathematical
analysis with respect to the R´enyi entropy. Compared with these previous reports, unser
contribution here can be summarized as follows:
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Our work elucidates the mathematical relation of Yule’s K to R´enyi’s
higher-order entropy and explains why K converges.
Our work vastly extends the corpora used for empirical examination in
terms of both size and language.
(cid:2)
(cid:2)
482
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
(cid:2)
(cid:2)
Our work compares the convergent values for these corpora.
Our work also presents results for unknown language data, specifically
from the Voynich manuscript and Rongorongo.
We start by summarizing the potential constancy measures proposed so far.
2. Constancy Measures
The measures proposed so far can broadly be categorized into three types, calculating
the repetitiveness, power-law distribution, or complexity of text. This section mathe-
matically analyzes these measures and summarizes them.
2.1 Measures Based on Repetitiveness
The study of text constancy started with proposals for simple text measures of vocab-
ulary repetitiveness. The representative example is Yule’s K (Yule 1944), while Golcher
recently proposed V as another candidate (Golcher 2007).
2.1.1 Yule’s K. To the best of our knowledge, the oldest mention of constancy values was
made by Yule with his notion of K (Yule 1944). Let N be the total number of words in a
Text, V(N) be the number of distinct words, V(M, N) be the number of words appearing
m times in the text, and mmax be the largest frequency of a word. Yule’s K is then
defined as follows, through the first and second moments of the vocabulary population
(cid:2)
m m2V(M, N) (Yule
distribution of V(M, N), where S1
1944; Herdan 1964):
m mV(M, N), and S2
= N =
(cid:2)
=
K = C
S2
− S1
S2
1
(cid:3)
= C
− 1
N
+
mmax(cid:4)
m=1
(cid:5)
V(M, N)( M
N
)2
(1)
where C is a constant enlarging of the value of K, defined by Yule as C = 104. K is
designed to measure the vocabulary richness of a text: The larger Yule’s K, the less rich
the vocabulary is. The formula can be intuitively understood from the main term of the
N )2 indicates the degree of recurrence of a
sum in the formula. Because the square of ( M
word, the sum of such degrees for all words is small if the vocabulary is rich, or large in
the opposite case. Another simple example can be given in terms of S2 in this formula.
Suppose a text is 10 words long: if each of the 10 tokens is distinct (high diversity), Dann
= 1 × 1 × 10 = 10; wohingegen, if each of the 10 tokens is identical (low diversity), Dann
S2
= 10 × 10 × 1 = 100.
S2
Measures that are slightly different but essentially equivalent to Yule’s K have
appeared here and there. Zum Beispiel, Herdan defined Vm as follows (Herdan 1964,
S. 67, 79):
(cid:6)
(cid:7)
(cid:7)
(cid:8)
mmax(cid:4)
m=1
=
Vm
V(M, N)( M
N
)2 − 1
V(N)
483
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 41, Nummer 3
Likewise, Simpson (1949) derived the following formula as a measure to capture the
diversity of a population:
D =
mmax(cid:4)
m=1
V(M, N) M
N
m − 1
N − 1
which is equivalent to Yule’s K, as Simpson noted.
2.1.2 Other Measures Based on Simple Text Statistics. Apart from Yule’s K, various mea-
sures have been proposed from simple statistical observation of text, as detailed in
Tweedie and Baayen (1998). One genre is based on the so-called token-type relation
(d.h., the ratio of the vocabulary size V(N) and the text size N, in log) as formulated by
Guiraud (1954) and Herdan (1964) as a law. Because this simple ratio is not stable, Die
measure was modified numerous times to formulate Herdan’s C (Herdan 1964),
Dugast’s k and U (Dugast 1979), Maas’ a2 (Maas 1972), Tuldava’s LN (Tuldava 1977),
and Brunet’s W (Brunet 1978).
Another genre of measures concerns the proportion of hapax legomena, das ist
V(1, N). Honor´e noted that V(1, N) increases linearly with respect to the log of a text’s
vocabulary size V(N) (Honor´e 1979). Another ratio, of V(2, N) to V(N), was proposed
as a text characteristic by Sichel (1975) and Maas (1972).
Each of these values, Jedoch, was found not to be convergent according to the
extensive study conducted by Tweedie and Baayen (1998). In common with Yule’s inten-
tion to apply such measures for author identification, they examined all of the measures
discussed here, in addition to two measures explained later: Orlov’s Z, and the Shannon
entropy upper bound obtained from the relative frequencies of unigrams. They exam-
ined these measures with English novels (such as Alice’s Adventures in Wonderland) Und
empirically found that only Yule’s K and Orlov’s Z were convergent. Given their report,
we consider K the only true candidate among the constancy measures examined so far.
2.1.3 Golcher’s V. Golcher’s V is a string-based measure calculated on the suffix tree of a
Text (Golcher 2007). Letting the length of the string be N and the number of inner nodes
of the (Patricia) suffix tree (Gusfield 1997) be k, V is defined as:
V = k
N
(2)
Golcher empirically showed how this measure converges to almost the same value
across Indo-European languages for about 30 megabytes of data. He also showed how
the convergent values differ from those calculated for programming language texts.
Golcher explains in his paper that the possibility of constancy of V does not yet
have mathematical grounding and has only been shown empirically. He does not report
values for texts larger than about 30 megabytes nor for those of non-Indo-European
languages. A simple conjecture on this measure is that because a suffix tree for a string
of length N has at most N − 1 inner nodes, V must end up at some value 0 ≤ V < 1, for
any given text.
Our group tested V with larger-scale data and concluded that V could be a con-
stancy measure, although we admitted to observing a gradual increase (Kimura and
Tanaka-Ishii 2014). Because V requires further verification on larger-scale data before
ruling it out, we include it as a constancy measure candidate.
484
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
c
o
l
i
_
a
_
0
0
2
2
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
2.2 Measures Based on Power Law Distributions
Since Zipf (1965), power laws have been reported as an underlying statistical character-
istic of text. The famous Zipf’s law is defined as:
f (n) ∝ n
−γ
(3)
where γ ≈ 1, and f (n) is the frequency of the nth most frequent word in a text. Various
studies have sought to explain mathematically how the exponent could differ depend-
ing on the kind of text. To the best of our knowledge, however, there has been a limited
number of reports related to text constancy.
An exception is the study on Orlov’s Z (Orlov and Chitashvili 1983). Orlov and
Chitashvili attempted to obtain explicit mathematical forms for V(N) and V(m, N) by
more finely considering the long tails of vocabulary distributions for which Zipf’s
law does not hold. They obtained these forms through a parameter Z, defined as the
potential text length minimizing the square error of the estimated V(m, N), with its
actual value as follows:
Z = arg min
N
1
mmax
mmax(cid:4)
m=1
{ E[V(m, N)] − V(m, N)
V(N)
}2
Thus defining Z, they mathematically deduced for V(N) the following formula:
V(N) =
Z
log(mmaxZ)
N
N − Z
log
N
Z
(cid:9)
(cid:10)
(4)
(5)
Two ways to obtain Z can be formulated through approximation: one through Good-
Turing smoothing (Good 1953), which assumes Zipf’s law to hold, and the other using
Newton’s method. Tweedie and Baayen showed how the value of Z is stable at the
size of an English novel by a single author and thus suggested that it could form a
text characteristic. The empirical results, however, were not significantly convergent
with respect to text size, and, moreover, Tweedie and Baayen provided their results
without giving an estimation method (Tweedie and Baayen 1998). Calculation using
Good-Turing smoothing, which is derived directly from Zipf’s law, would cause
Z to converge, but this does not take Orlov’s original intention into consideration.
Alternatively, our group (Kimura and Tanaka-Ishii 2014) verified Z through Newton’s
method by setting g(Z) = 0, where g(Z) is the following function:
g(Z) =
Z
log(mmaxZ)
N
N − Z
log
(cid:10)
(cid:9)
N
Z
− V(N)
(6)
We also showed how the value of Z increases rapidly when the text size is larger than
10 megabytes.
The major problem with measures based on power laws lies in the skewed head
and tail of the vocabulary population distribution. Because these exceptions constitute
important parts of the population, parameter estimation by fitting to Equation (3) is
sensitive to the estimation method. For example, the estimated value of the exponent
for Zipf’s law depends on the method used for dealing with these exceptions. We
tested several simple methods of estimating the Zipf law’s exponent γ with different
ways of handling the head and tail of a distribution. There were settings that led to
485
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
c
o
l
i
_
a
_
0
0
2
2
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Computational Linguistics
Volume 41, Number 3
convergence, but the convergence depended on the settings. Such difficulty could be
one reason why there has been no direct proposal for γ as a text constancy measure.
Hence, due care must be taken in relating text constancy to a power law. We chose
another path by considering text constancy through a random Zipf distribution, as
described later in the experimental section.
2.3 Measures Based on Complexity
With respect to measures based on complexity, multiple reports have already examined
the Shannon entropy (Shannon 1948; Cover and Thomas 2006). In addition, we intro-
duce the R´enyi higher-order entropy (R´enyi 1960) as another possible measure.
2.3.1 Shannon Entropy Upper Bound. Let X be the random variable of a sequence X =
X1, X2, . . . , Xi, . . . , where Xi represents the ith element of X: Xi
= x ∈ X, and where X
represents a given set (e.g., a set of words or characters) whose members constitute
the sequence. Let Xj
i (i < j) denote the random variable indicating its subsequence
Xi, Xi+1, Xi+2, . . . , Xj. Let P(X) indicate the probability function of a sequence X. The
Shannon entropy is then defined as:
H(X) = −
(cid:4)
X
P(X) log P(X)
(7)
Tweedie and Baayen directly calculated an approximation of this formula in terms
of the relative frequencies (for P) of unigrams (for X), and they concluded that the
measure would continue increasing with respect to text size and would not converge
for short, literary texts (Tweedie and Baayen 1998). Because we are interested in the
measure’s behavior on a larger scale, we replicated their experiment, as discussed later
in the section on empirical constancy. We denote this measure as H1 in this article.
Apart from that report, many have studied the entropy rate, defined as:
h
∗ = lim
n→∞
H(Xn
1 )
n
(8)
Theoretically, the behavior of the entropy rate with respect to text size has been contro-
versial. On the one hand, there have been indications of entropy rate constancy (Genzel
and Charniak 2002; Levy and Jaeger 2007). These reports argue that the entropy rate of
natural language could be constant. Due to the inherent difficulty in obtaining the true
value of h
from a text, however, these arguments are based only on indirect clues with
respect to convergence. On the other hand, Hilberg conjectured a decrease in the human
conditional entropy, as follows (Hilberg 1990):
∗
H(Xn
|Xn−1
1
) ∝ n
−1+β
He obtained this through an examination of Shannon’s original experimental data and
suggested that β ≈ 0.5. From this formula, De¸bowski induces that H(Xn
and that
the entropy rate can be formulated generally as follows (De¸bowski 2014):
1 ) ∝ n
β
H(Xn
1 )
n
≈ An
−1+β + h
∗
(9)
486
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
c
o
l
i
/
l
a
r
t
i
c
e
-
p
d
f
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
c
o
l
i
_
a
_
0
0
2
2
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
Note that at the limit of n → ∞, this rate goes to h
, a constant, provided that β < 1.0.
Hilberg’s conjecture is deemed compatible with entropy rate constancy at its asymptotic
limit, provided that h
forms
a text characteristic, and if so, whether h
∗ > 0 holds.1 We are therefore interested in whether this h
∗ > 0.
∗
∗
∗
Empirically, many have attempted to calculate the upper bound of the entropy
rate. Brown’s report (Brown et al. 1992) is representative in showing a good estimation
of the entropy rate for English from texts, as compared with values obtained from
humans (Cover and King 1978). Subsequently, there have been important studies on
calculating the entropy rate, as reported thoroughly in Sch ¨umann and Grassberger
(1996). The questions related to h
, Jedoch, remain unsolved. Kürzlich, De¸bowski
used a Lempel-Ziv compressor and examined Hilberg’s conjecture for texts by single
authors (De¸bowski 2013). He showed an exponential decrease in the entropy rate
with respect to text size, supporting the validity of Equation (9). Following these
previous works, we examine the entropy rate by using an algorithm proposed by
Grassberger (1989) and later on by Farach et al. (1995). This method is based on
universal coding. The algorithm has a theoretical background of convergence to the
true h
, provided the sequence is stationary, but has been proved by Shields (1992)
to be inconsistent—that is, it does not converge to the entropy rate for certain non-
Markovian processes. We still chose to apply this method, because it requires no arbi-
trary parameters for calculation and is applicable to large-scale data within a reasonable
Zeit.
∗
The Grassberger algorithm (Grassberger 1989; Farach et al. 1995) can be summa-
rized as follows. Consider a sequence X of length N. The maximum matching length Li
is defined as:
= max{k : Xj+k
J
= Xi+k
ich
}
Li
for j ∈ {1, . . . , i − 1}, 1 ≤ j ≤ j + k ≤ i − 1. Mit anderen Worten, Li is the maximum common
subsequence before and after i. If ¯L is the average length of Li, given by
¯L = 1
N
i=N(cid:4)
i=1
Li
then the method obtains the entropy rate h1 as
h1
= log2N
¯L
(10)
Given the true entropy rate h
stationary process, such that |H
this entropy rate h1 as a constancy measure candidate.
, convergence has been mathematically proven for a
∗ − h1
| = O(1) when N → ∞. In diesem Artikel, we consider
∗
1 According to De¸bowski (2009), H
∗ = 0 suggests that the next element of a linguistic process is
deterministic, das ist, a function of the corpus observed before, under the two conditions that (1) Die
number of possible choices for the element is finite, Und (2) the corpus observed before is infinite. In
Wirklichkeit, the finiteness of linguistic sequences has the opposite tendency (d.h., the size of the observed
corpus is finite, and the possible vocabulary size is infinite).
487
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 41, Nummer 3
2.3.2 Approximation of R´enyi Entropy Hα. The R´enyi entropy is a generalization of the
Shannon entropy, defined as follows (R´enyi 1960; R´enyi 1970; Cover and Thomas 2006;
Bromiley, Thacker, and Bouhova-Thacker 2010):
Hα(X) = 1
1 − α log(
(cid:4)
X
α
P
(X))
(11)
where α ≥ 0, α (cid:9)= 1. Hα(X) represents different ideas of sequence complexity for differ-
ent α. Zum Beispiel:
(cid:2)
(cid:2)
When α = 0, H0(X) indicates the number of distinct occurrences of X.
When the limit α → 1 is taken, Gleichung (11) reduces to the Shannon
entropy.
The formula for α = 0 becomes equivalent to the so-called topological entropy (somit,
it is another notion of entropy) for certain probability functions (Kitchens 1998) (Cover
and Thomas 2006). Note that the number of distinct tokens (d.h., the cardinality of a
set) has been used widely as a rough approximation of complexity in computational
linguistics. In der Tat, in Section 2.1.2, we saw how some candidate constancy measures are
based on a token-type relation, such that the number of types is related to the complexity
of a text. For texts, note also that the value grows with respect to the text size, unless X
is considered, Zum Beispiel, in terms of unigrams of a phonographic alphabet.
For α → 1, there is controversy regarding convergence, as noted in the previous
section. Such difficulty in convergence for these α values lies in the nature of linguistic
processes, in which the vocabulary set evolves.
This view motivates us to consider α > 1 for Hα(X), since the formula captures
complexity by considering linguistic hapax legomena to a lesser degree, thus giving the
possibility of convergence. Tatsächlich, an approximation of the probability by the relative
frequencies of unigrams at α = 2 immediately shows the essential equivalence to Yule’s
K, since K from Equation (1) can be rewritten as follows:
mmax(cid:4)
m=1
V(M, N)( M
N
)2 =
(cid:4)
x∈X
(
freq(X)
N
)2
where freq(X) is the frequency of x ∈ X. daher, Yule’s K has significance within the
context of complexity.
This relation of Yule’s K to the R´enyi entropy H2 is reported for the first time here, Zu
the best of our knowledge. This mathematical relation clarifies both why Yule’s K should
converge and what the convergent value means; specifically, the value represents the
gross complexity underlying the language system. As noted earlier, the higher-order
entropy considers hapax legomena to a lesser degree and calculates the gross entropy
only from the representative vocabulary population. This simple argument shows that
Yule’s K captures not only the simple repetitiveness of vocabulary but also the more
profound signification of its equivalence with the approximated second-order entropy.
Because K has been previously reported as a stable text constancy measure, we consider
it here once again, but this time within the broader context of Hα.
488
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
2.4 Summary of Constancy Measure Candidates
Based on the previous reports (Tweedie and Baayen 1998; Kimura and Tanaka-Ishii
2014) and the discussion so far, we consider the following four measures as candidates
for text constancy measures.
(cid:2)
(cid:2)
Repetitiveness-based measures:
Yule’s K (Gleichung (1)); and Golcher’s V (Gleichung (2)).
Complexity-based measures:
The Shannon entropy upper bound (h1 as the entropy rate (Equations (10)
Und (8)) and H1 (Gleichung (7), with X in terms of unigrams and the
probability function in terms of relative frequencies); und das
approximated R´enyi entropy, denoted as Hα (α > 1) (Gleichung (11),
again with X and the probability function in terms of unigrams and
relative frequencies, jeweils).
Zusätzlich, we empirically consider how these measures can be understood in the
context of the power-law feature of language. As noted in the Introduction, für die
convergent measures the speed of attaining convergence with respect to text size is
examined as well. Among the candidates, K and H1 have been previously applied in
a word-based manner, whereas V is string based. The Shannon entropy rate h1 has been
considered in both ways. Because we should be able to consider a text in terms of both
words and characters, we examine the constancy of each measure in both ways.
Außerdem, because we have seen the mathematical equivalence of Yule’s K and
H2, in the following we only consider H2. As for Hα, we consider α = 3, 4 only in
comparison with H2. Because H1 is based on relative frequencies and can be considered
together with H2, we first focus on the convergence of the three measures V, h1, and H2,
and then we consider H1 in comparison with H2, H3, and H4.
3. Data
3.1 Real Texts
Tisch 1 lists the data used in our experimental examination. The table indicates the data
identifier (by which we refer to the data in the rest of the article), Sprache, source,
number of distinct tokens, data length by total number of tokens, and size in bytes. Der
first block contains relatively large-scale natural language corpora consisting of texts
written by multiple authors, and the second block contains smaller corpora consisting
of texts by single authors. The third block contains programming language corpora, Und
the fourth block contains corpora of unknown scripts, which we examine at the end of
this article in Section 4.3.
For the large-scale natural language data, we considered five languages: English,
Japanese, Chinese, Arabic, and Thai. These languages were chosen to represent different
language families and writing systems. The large-scale corpora in English, Japanese,
and Chinese consist of newspapers in chronological order, and the Thai and Arabic
corpora include other kinds of texts. The markers ‘w’, ‘c’, and ‘cr’ appearing at the
end of every identifier in Table 1 (z.B., Enews-c, Enews-w, and Jnews-cr) indicate text
processed through words, Figuren, and transliterated Roman characters, jeweils.
489
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 41, Nummer 3
Tisch 1
Our data.
Identifier
Enews-c
Enews-w
Language
kind
Quelle
Nummer
of distinct
tokens
Data length
by tokens
Data size
in bytes
Large scale corpora
English
WSJ Corpus(1987)
Jnews-c
Japanese
2000–2009 Mainichi
Newspaper
Jnews-cr
Jnews-w
87
137,466
112,868,099
22,679,512
108MB
5,758
475,101,506
1.3GB
94
468,818
1,087,919,430
289,032.862
Cnews-c
Chinese
1995 People’s
Daily Newspaper
5,963
24,696,511
67MB
Cnews-cr
Cnews-w
Atext-c
Atext-w
Ttext-c
Ttext-w
Ebook1-w
Ebook2-w
Fbook-w
Gbook-w
Jbook-w
Cbook-w
Abook-w
Sbook-w
Arabic
Watan-2004 corpus
Thai
NECTEC corpus
English
English
French
Deutsch
Japanese
Chinese
Arabic
Sanskrit
Small scale corpora
Ulysses
Les Miserables
Les Miserables
Kritik der reinen
Vernunft
Dohyo
Hong Lou Meng
Quaran
Ramayana
88
144,336
59
298,370
159
16,291
34,359
25,994
31,956
10,604
19,179
18,450
16,121
62,318
68,325,519
14,965,501
42,174,262
7,450,442
1,444,536
280,602
325,692
677,163
691,407
215,299
502,137
701,255
75,185
213,736
73MB
3.9MB
1.5MB
3MB
3MB
1.3MB
2MB
2.5MB
728KB
1.9MB
Corpora of programming languages
Python-w
Cplus-w
Lisp-w
Python
C++
Common Lisp
python library sources
C++ library sources
sbcl and Clozure CL
1,517,424
127,332
164,248
48,704,374
15,617,801
2,326,270
214MB
64MB
32MB
Corpora of Unknown scripts
Unknown
VoynichA-c
VoynichB-c
Unknown
VoynichA-w Unknown
VoynichB-w Unknown
Unknown
RongoA-c
Unknown
RongoB-c
Voynich Manuscript
Voynich Manuscript
Voynich Manuscript
Voynich Manuscript
Rongorongo script
Rongorongo script
22
25
2,628
4,609
3,546
656
44,360
117,105
7,460
18,495
10,376
14,003
44KB
115KB
44KB
115KB
60KB
60KB
As for the small-scale corpora in the second block, the texts were only considered
in terms of words, since verification via characters produced findings consistent with
those obtained with the large-scale corpora. The texts were chosen because each was
written by a single author but is relatively large.
490
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
Hier, we summarize our preprocessing procedures. For the annotated Thai
NECTEC corpus, texts were tokenized according to the annotation. The preprocessing
methods for the other corpora were as follows:
(cid:2)
(cid:2)
(cid:2)
(cid:2)
English: NLTK2 was used to tokenize text into words.
Japanese: Mecab3 was used for tokenization, and KAKASI4 was used for
romanization.
Chinese: ICTCLAS20135 was used for tokenization, and the pinyin Python
library was used for pinyin romanization.
Other European Languages: PunktWordTokenizer6 was used for
tokenization.
All the other natural language corpora were tokenized simply using spaces.
Following Golcher (2007), who first suggested testing constancy on programming
languages, we also collected program sources from different languages (third block in
Tisch 1). The programs were also considered solely in terms of words, not characters.
C++ and Python were chosen to represent different abstraction levels, and Lisp was
chosen because of its different ordering for function arguments. Source code was col-
lected from language libraries. The programming language texts were preprocessed as
follows. Comments in natural language were eliminated (although strings remained
in the programs, where each was a literal token). Identical files and copies of sources
in large chunks were carefully eliminated, although this process did not completely
eliminate redundancy since most programs reuse some previous code. Endlich, Die
programs were tokenized according to the language specifications.7
The last block of the table lists two corpora of unknown scripts. We consider these
scripts at the end of this article in Section 4.3, through Figure 5, to show one possible
application of the text constancy measures. The first unknown script is that of the
Voynich manuscript, a famous text that is undeciphered but hypothesized to have been
written in natural language. This corpus is considered in terms of both characters and
Wörter, where words were defined via the white space separation in the original text.
Given the common understanding that the manuscript seems to have two different
Teile (Reddy and Knight 2011), we separated it into two parts according to the Currier
annotation (identified as A and B, jeweils). The second corpus of unknown text
consists of the Rongorongo script of Easter Island (Daniels and Bright 1996, Abschnitt 13;
Orliac 2005; Barthel 2013). This script’s status as natural language is debatable, Aber
if so, it is considered to possess characteristics of both phonographs and ideograms
(Pozdniakov and Pozdniakov 2007). Because there are several ways to consider what
constitutes a character in this script (Barthel 2013), we calculate values for the two most
extreme cases as follows. For corpus RongoA-c, we consider a character inclusive of all
adjoining parts (d.h., including accents and ornamental parts). Andererseits, für
2 http://nltk.org.
3 http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html.
4 http://kakasi.namazu.org.
5 http://ictclas.nlpir.org.
6 http://nltk.org.
7 With respect to the Lisp programming language, its culture favors long, hyphenated variable names that
can be almost as long as a sentence. For this work, daher, Lisp variable names were tokenized by
splitting at the hyphens.
491
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 41, Nummer 3
corpus RongoB-c, we separate parts as reasonably as possible, among multiple possible
separation methods. Because the unit of word in this script is unknown, the Rongorongo
script is only considered in terms of characters.
3.2 Random Data
The empirical verification of convergence for real data is controversial. We must first
note that it does not conform with the standard approach to statistical testing. Im
domain of statistics, it is a common understanding that “convergence” cannot be tested.
A statistical test raises two contrasting hypotheses—called the null and alternative
hypotheses—and calculates a p-value indicating the probability of the null hypothesis
to occur. When this p-value is smaller than a certain value, the null hypothesis is
considered unable to occur and is thus rejected. For convergence, the null hypothesis
corresponds to “not converging,” and the alternative hypothesis, to “converging.” The
problem here is that the null hypothesis is always related to the alternative hypothesis
to a certain extent, because the difference between convergence and non-convergence
is merely a matter of degree. Mit anderen Worten, the notion of convergence for a constancy
measure does not conform with the philosophy of statistical testing. Convergence is
therefore considered in terms of the distance from convergent values, or in terms of
the error with respect to some parameter (such as data size). Such a distance cannot be
calculated for real data, Jedoch, since the underlying mathematical model is unknown.
To sum up, verification of the convergence of real data must be considered by some
other means. Our proposal is to consider convergence in comparison to a set of random
data whose process is known. For this random data, we considered two kinds.
The first kind is used to examine data convergence in Section 4.1. This random
data was generated from real data by shuffling the original text with respect to certain
linguistic units. Tweedie and Baayen (1998) presented results by shuffling words, Wo
the original texts were literary texts by single authors. Hier, we generated random
data by shuffling (1) words/characters, (2) Sätze, oder (3) documents. Because these
options greatly increased the number of combinations of results, we mainly present the
results with option (1) for large-scale data in this article. There are three reasons for this:
Convergence must be verified especially at large scale; the most important convergence
findings for randomized small-scale data were already reported in Tweedie and Baayen
(1998); and the results for options (2) Und (3) were situated within the range of option
(1) and the original texts.
Randomization of the words and characters of original texts will destroy various
linguistic characteristics, such as n-grams and long-range correlation. The convergence
properties of the three measures V, h1, and H2 are as follows. The convergence of V
is unknown, because it lacks a mathematical background. Even if the value of V did
converge, the convergent value for randomized data would differ from that of the
original text, since the measure is based on repeated n-grams in the text. h1 converges to
the entropy rate of the randomized text, if the data size suffices. This is supported by the
mathematical background of the algorithm, which converges to the true entropy rate for
stationary data. Even when h1 converges for random data, the convergent value will be
larger than that of the original text, because h1 considers the probabilities of n-grams.
zuletzt, H2 converges to the same point for a randomized text and the original text,
because it is the approximated higher-order entropy, such that words and characters
are considered to occur independently.
The second kind of random data is used to compare the convergent values of
different texts for a constancy measure, as considered in Section 4.2. Random corpora
492
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
were generated according to four different distributions: one uniform, und der andere
three following Zipf distributions with exponents of γ = 0.8, 1.0, Und 1.3, jeweils,
for Equation (3). Because each set of real data consists of different numbers of distinct
tokens, ranging from tens to billions, random data sets consisting of 2n distinct tokens
for every n = 4 . . . 19, were randomly generated for each of the four distributions. Wir
only consider the measures H2 and H0 for these data sets. Both of these measures have
convergent values, given a sufficient data size.
4. Experimental Results
From the previous discussion, we applied the three measures V, h1, and H2 with five
large-scale and eight small-scale natural language corpora, three programming lan-
guage corpora, and two unknown script corpora, in terms of words and characters.
Because there were many results for different combinations of measure, Daten, and token
(word or character), this section is structured so that it best highlights our findings.
4.1 Empirical Constancy
Figuren 1, 2, Und 3 in this section can be examined in the following manner. The hori-
zontal axis indicates the text size of each corpus, in terms of the number of tokens, on a
log scale. Chunks of different text sizes were always taken from the head of the corpus.8
The vertical axis indicates the values of the different measures: V, h1, or H2. Each figure
contains multiple lines, each corresponding to a corpus, as indicated in the legends.
Erste, we consider the results for the large-scale data. Figur 1 shows the different
measures for words (left three graphs) and characters (right three graphs). We can see
that V increased for both words and characters (top two graphs). Golcher tested his
measure on up to 30 megabytes of text in terms of characters (Golcher 2007). We also
observed a stable tendency up to around 107 Figuren. The increase in V became
apparent, Jedoch, for larger text sizes. Daher, it is difficult to consider V as a constancy
messen.
As for the results for h1 (middle graphs), both graphs show a gradual decrease. Der
tendency was clearer for words than for characters. For some corpora, especially for
Figuren, it was possible to observe some values converging towards h
. The overall
tendency, Jedoch, could not be concluded as converging. This result suggests the
difficulty in attaining convergence of the entropy rate, even with gigabyte-scale data.
From the theoretical background of the Grassberger algorithm, the values would pos-
sibly converge with larger-scale data. The continued decrease could be due to multiple
Gründe dafür, including the possibility of requiring far larger data than that used here, or a
discrepancy between linguistic processes and the mathematical model assumed for the
Grassberger algorithm.
∗
We tried to estimate h
by fitting the Equation (9). For the corpora with good fitting,
all of the estimated values were larger than zero, but many of the results could not
∗
8 For real data, this was done without any randomization of the order of texts for all corpora besides
Atext-w and Atext-c. The Watan corpus is distributed not in the chronological order of the publishing
dates, but as a set of articles grouped into categories (d.h., all articles of one category, then all articles of
another category, und so weiter). Because of this, there is a large skew in the vocabulary distribution,
depending on the section of the corpus. We thus randomly reshuffled the articles by categories for the
whole corpus before taking chunks of different sizes (always from the beginning) to generate our results.
Apart from this, we avoided any arbitrary randomization with respect to the original data summarized in
Tisch 1.
493
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 41, Nummer 3
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
Figur 1
V, h1, and H2 values in terms of words and characters for the large-scale corpora.
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
∗
is reached asymptotically and also whether h
be fitted easily, and the estimated values were unstable due to fluctuation of the lines.
∗ > 0 remain impor-
Whether a value for h
tant questions requiring separate, more extensive mathematical and empirical studies.
Im Gegensatz, H2 (or Yule’s K, bottom graphs) showed convergence, already at the
level of 105 tokens, for both words and characters. From the previous verification of
Yule’s K, we can conclude that H2 is convergent. The final convergent values, Jedoch,
differed for the various writing systems. We return to this issue in the next section.
To better understand the convergence, Figur 2 shows the results for the cor-
responding randomized data. As mentioned in Section 3.2, the original texts were
494
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Figur 2
V, h1, and H2 values in terms of words and characters for the randomized large-scale corpora.
randomized by shuffling words and characters for the data examined by words and
Figuren, jeweils. daher, all n-gram characteristics existing in the text were
destroyed, and what remained were the different words and characters appearing in a
random order. Hier, we see how the random data’s behavior has some of the theoretical
properties of convergence, as summarized in Section 3.2.
As mentioned previously, because V has no mathematical background, its behavior
even for uniform random data is unknown, and even if it converged, the convergent
value would be smaller than that of the original text. The top two graphs in Figure 2
exhibit some oscillation, especially for randomized Chinese (Cnews-c,w). Such peculiar
495
Computerlinguistik
Volumen 41, Nummer 3
oscillation was already reported by Golcher himself (Golcher 2007) for uniformly
random data. This was easy to replicate, as reported in Kimura and Tanaka-Ishii
(2014), for uniformly random data with the number of distinct tokens up to a hundred.
Because the word distribution almost follows Zipf’s law, the vocabulary is not
uniformly distributed, yet oscillating results occur for some randomized data in the top
left figure. Darüber hinaus, the values seem to increase for Japanese and English for words
at a larger scale. Although the plots for some scripts seem convergent (top right graph),
these convergent values are theoretically different from those of the original texts, Wenn
they exist, and this stability is not universal across the different data sets. Given this
Ergebnis, it is doubtful that V is convergent across languages.
Im Gegensatz, h1 is mathematically proven to be convergent given infinite-length
randomized data, but to larger values than those of the original texts, as mentioned in
Abschnitt 3.2. The middle two graphs of Figure 2 show the results for h1. The majority of
the plots do not reach convergence even at the largest data sizes, but for certain results
with characters, especially in the Roman alphabet, the plots seem to go to a convergent
value (middle right). All the plots can be extrapolated to converge to a certain entropy
rate above zero, although these values are larger than the convergent values—if they
ever exist—of the real data. These results confirm the difficulty of judging whether the
entropy rates of the original texts are convergent and whether they remain above zero.
zuletzt, it is easy to see that H2 is convergent for a randomized text (bottom two
graphs), and the convergent values are the same for the cases with and without ran-
domization. Tatsächlich, the plots converge to exactly the same points faster and more stably,
which shows the effect of randomization.
Die
As for the other randomization options, by sentences and documents,
findings—both the tendencies of the lines and the changes in the values—can be
situated in the middle of what we have seen so far. The plots should increasingly
fluctuate more like the real data because of the incomplete randomization, in the order
of sentences and then documents.
Returning to inspection of the remaining real data, Figur 3 shows V, h1, and H2
in terms of words for the small-scale corpora (left column) and for the programming
language texts (right column). For the small-scale corpora, in general, the plots are
bunched together, and the results shared the tendencies noted previously for the large-
scale corpora. V again showed an increase, while h1 showed a tendency to decrease. H2
converged rapidly and was already almost stable at 104 tokens. This again shows how
H2 exhibits stable constancy, especially with texts written by single authors.
As for the programming language results, the plots fluctuate more than for the
natural language texts because of the redundancy within the program sources. Trotzdem,
the global tendencies noted so far were just discernible. V had relatively larger values
but h1 and H2 had smaller values for programs, as compared to the natural language
texts. The differences in value indicate the larger degree of repetitiveness in programs.
zuletzt, Figur 4 shows the Hα results for the Wall Street Journal in terms of words
in unigrams (Enews-w). The horizontal axis indicates the corpus size, and the vertical
axis indicates the approximated entropy value. The different lines represent the results
for Hα with α = 1, 2, 3, 4. The two H1 plots represent calculations with and without
Laplace smoothing (Manning and Schuetze 1999). We can see that without smoothing,
H1 increased, as Tweedie and Baayen (1998) reported, but in contrast to their conclusion,
we observe a tendency of convergence for larger-scale data. The increase was due to the
influence of low-frequency vocabulary pushing up the entropy. The opposite tendency
to decrease was observed for the smoothed probabilities, with the plot eventually
converging to the same point as that for the unsmoothed H1 values. The convergence
496
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Figur 3
V, h1, and H2 values for the small-scale corpora and programming language texts in terms of
Wörter.
was by far slower for H1 as compared with that for H2, H3, and H4, which all had
attained convergence already at 102 tokens. The convergence values naturally decreased
for larger α, although the amount of decrease itself rapidly decreased with larger α.
In answer to Questions 1 Und 2 raised in the Introduction—which measures show
constancy, with sufficient convergence speed—the empirical conclusion from our data
is that Hα with α > 1 showed stable constancy when the values were approximated
using relative frequencies. For H1, the convergence was much slower because of the
strong influence of low-frequency words. Folglich, the constancy of Hα with α > 1
is attained by representing the gross complexity underlying a text.
497
Computerlinguistik
Volumen 41, Nummer 3
Figur 4
Hα with α = 1, . . . , 4 for The Wall Street Journal (Enews-w).
4.2 Discriminatory Power of H2
Now we turn to Question 3 raised in the Introduction and examine the discriminatory
power of H2. As Yule intended, does H2 identify authors? Given the influence of differ-
ent writing systems, as seen previously in Figure 1, we examine the relation between H2
and the number of distinct tokens (the alphabet/vocabulary size). Note that because
this number corresponds to H0 in Equation (11), this analysis effectively considers texts
on the H0-H2 plane. Since H0 grows according to the text size, unlike H2, the same text
size must be used for all corpora in order to meaningfully compare H0 values.9 Given
that H2 converges fast, we chose a size of 104 tokens to handle all of the small- Und
large-scale corpora.
For each of the corpora listed in Table 1 and the second kind of random corpora
explained at the end of Section 3.2, Figur 5 plots the values of H2 (vertical axis) Und
the number of distinct tokens H0 (horizontal) measured for each corpus at a size of 104
tokens. The three large circles are groupings of points. The leftmost group represents
news sources in alphabetic characters. All of the romanized Chinese, Japanese, Und
Arabic texts are located almost at the same vertical location as the English text. Das
indicates the difficulty for H2 to distinguish natural languages if measured in terms of
alphabetic characters. The middle group represents the programming language texts
in terms of words. This group is located separately (vertically lower than the natural
language corpora in terms of words), so H2 is likely to distinguish between natural
languages and programming languages. The rightmost group represents the small-scale
corpora. Considering the proximity of these points despite the variety of the content, Es
is unlikely that H2 can distinguish authors, in contrast to Yule’s hope. Trotzdem, these points
are located lower than those for news text. daher, H2 has the potential to distinguish
genre or maybe writing style.
9 Because H0 is not convergent, the horizontal locations remain unstable, unless the tokens are of a
phonographic alphabet. Mit anderen Worten, for all word-based results and character-based results not based
on a phonographic alphabet, the resulting horizontal locations are changed by increasing the corpus size.
As for the random data, the H0 values are convergent, because these data sets have a finite number of
distinct tokens. Since H0 is measured only for the first 104 tokens, Jedoch, the horizontal locations are
underestimated, especially for random data following a Zipf distribution.
498
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
Figur 5
Convergent H2 values with respect to the number of distinct tokens for each corpus.
The natural language texts located near the line for a Zipf exponent of 0.8 Sind
those of the non-alphabetic writing systems.10 Note that Chinese characters have
morphological features, and the Arabic and Thai languages also have flexibility in
terms of which units are considered words and morphemes. Mit anderen Worten, the plots
closer to the random data with a smaller Zipf exponent are for language corpora of
morphemic sequences. The group of plots measured for phonographic scripts is located
near the line for a Zipf exponent of 1.0 (the grouping of points in the leftmost circle),
which could suggest that morphemes are more randomized units than words.
4.3 Application to Unknown Scripts: Voynich Manuscript and Rongorongo Script
The nature of unknown scripts can also be considered through our understanding thus
far. Figur 5 includes plots for the Voynich manuscript in terms of words and characters,
and for the Rongorongo script in terms of characters. Like all the data seen in this figure,
the points are placed at the H2 values (vertically) for the number of distinct tokens
(horizontally) at the specified size of 104 tokens, with the exception of Voynich-A in
terms of words. Because this corpus consists of fewer than 104 Wörter (refer to the data
length by tokens listed for VoynichA-w in Table 1), its point is located horizontally at
the vocabulary size corresponding to the corpus’ maximum size.
For the two Voynich manuscript parts, the plots in terms of words appear near the
Arabic corpus for words (Abook-w). For characters, andererseits, the plots are
at the leftmost end of the figure. This was due to overestimation of the total number
10 Note that here we use the values of the Zipf exponent for the random data, and not the estimated
exponents for the real data. The rank-frequency distributions of characters, especially for phonetic
alphabets, often do not follow a power law.
499
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 41, Nummer 3
of characters for the alphabetic texts (z.B., both English and other, romanized language
texts), since all ASCII characters, such as colons, periods, and question marks, Sind
counted. Trotzdem, the H2 values are located almost at the same position as for the other
romanized texts, indicating that the Voinich manuscript has approximately similar com-
plexity. These results suggest the possibility that the Voynich manuscript could have
been generated from a source in natural language, possibly written in some script of
the abjad type. This supports previous findings (Reddy and Knight 2011; Montemurro
and Zanette 2013), which reported the possibility of the Voynich manuscript being in a
natural language and the coincidence of its word length distribution with that of Arabic.
Andererseits, the plots for the Rongorongo script appear near the line for a
Zipf exponent of 0.8, with RongoA near Arabic in terms of words but RongoB somewhat
further down from Japanese in terms of characters. The status of Rongorongo as natural
language has been controversial (Pozdniakov and Pozdniakov 2007). Both points in the
graph, Jedoch, are near many other natural language texts (and not widely separated),
making it reasonable to hypothesize that Rongorongo is indeed natural language. Der
characters can be deemed morphologically rich, because both plots are close to the line
for a Zipf exponent of 0.8. In the case of RongoA, for which a character was considered
inclusive of all parts (d.h., including accents and ornamental parts), the morphological
richness is comparable to that of the words of an abjad script. Andererseits,
when considering the different character parts as distinct (RongoB), the location drifts
towards the plot for Thai, a phonographic script, in terms of characters. daher, Die
Rongorongo script could be considered basically morphemic, with some parts function-
ing phonographically. This conclusion again supports a previous hypothesis proposed
by a domain specialist (Pozdniakov and Pozdniakov 2007).
This analysis of two unknown scripts supports previous conjectures. Our results,
Jedoch, only add a small bit of evidence to those conjectures; clearly, reaching a
reasonable conclusion would require further study. Darüber hinaus, the analysis of unknown
scripts introduced here could provide another possible application of text constancy
Maßnahmen, from a broader context.
5. Abschluss
We have discussed text constancy measures, whose values are invariant across different
sizes of text, for a given text. Such measures have a 70-year history, since Yule originally
proposed K as a text characteristic, potentially with language engineering utility for
problems such as author identification. We consider text constancy measures today to
have scientific importance in understanding language universals from a computational
view.
After overviewing measures proposed so far and previous studies on text con-
stancy, we explained how K essentially has a mathematical equivalence to the R´enyi
higher-order entropy. We then empirically examined various measures across different
languages and kinds of corpora. Our results showed that only the approximated higher-
order R´enyi entropy exhibits stable, rapid constancy. Examining the nature of the con-
vergent values revealed that K does not possess the discriminatory power of author
identification as Yule had hoped. We also applied our understanding to two unknown
scripts, the Voynich manuscript and Rongorongo, and showed how our constancy
results support previous hypotheses about each of these scripts.
Our future work will include application of K to other kinds of data besides natural
Sprache. Dort, zu, we will consider the questions raised in the Introduction, von
500
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Tanaka-Ishii and Aihara
Computational Constancy Measures of Texts
whether K converges and of how discriminatory it is. We are especially interested in
considering the relation between the value of K and the meaningfulness of data.
Danksagungen
This research was supported by JST’s
PRESTO program.
Verweise
Barthel, T. 2013. The Rongorongo of Easter
Island: Thomas Barthel’s transliteration
System. Available at http://kohaumotu.
org/rongorongo org/corpus/
codes.html. Accessed June 2015.
Farach, M., M. Noordewier, S. Savari, L.
Shepp, A. Wyner, and J. Ziv. 1995. Auf der
entropy of DNA: Algorithms and
measurements based on memory and
rapid convergence. In Proceedings of the
Sixth Annual ACM-SIAM Symposium on
Discrete Algorithms, pages 48–57,
San Francisco, CA.
Genzel, D. and E. Charniak. 2002. Entropy
rate constancy in text. In Annual Meeting of
the Association for the ACL, pages 199–206,
Philadelphia, PA.
Glocke, T. C., J. G. Cleary, and I. H. Witten. 1990.
Golcher, F. 2007. A stable statistical constant
Text Compression. Prentice Hall.
Bromiley, P. A., N. A. Thacker, and E.
Bouhova-Thacker. 2010. Shannon entropy,
Renyi entropy, and information. Verfügbar
at http://www.tina-vision.net/
docs/memos/2004-004.pdf. Zugriff
Juni 2015.
specific for human language texts. In
Recent Advances in Natural Language
Processing, Borovets.
Good, ICH. J. 1953. The population frequencies
of species and the estimation of
population parameters. Biometrika,
40(3–4):237–264.
Braun, P. F., S. A. Della Pietra, V. J. Della
Grassberger, P. 1989. Estimating the
Pietra, J. C. Lai, and R. L. Mercer. 1992. Ein
estimate of an upper bound for the
entropy of English. Rechnerisch
Linguistik, 18(1):31–40.
Brunet, E. 1978. Vocabulaire de Jean Giraudoux:
Structure et Evolution. Slatkine.
Cover, T. and R. King. 1978. A convergent
gambling estimate of the entropy of
English. IEEE Transactions on Information
Theory, 24(4):413–421.
Cover, T. M. and J. A. Thomas. 2006. Elements
of Information Theory. Wiley-Interscience.
Daniels, P. T. and W. Bright, editors. 1996. Der
World’s Writing Systems. Oxford University
Drücken Sie.
De¸bowski, Ł. 2009. A general definition of
conditional information and its
application to ergodic decomposition.
Statistics and Probability Letters,
79(9):1260–1268.
De¸bowski, Ł. 2013. Empirical evidence for
Hilberg’s conjecture in single author texts.
In Methods and Applications of Quantitative
Linguistik: Selected papers of the
8th International Conference on
Quantitative Linguistics (Qualico),
pages 143–151, Belgrade.
De¸bowski, Ł. 2014. The relaxed Hilberg
conjecture: A review and new
experimental support. Verfügbar um
http://www.ipipan.waw.pl/ldebowsk/.
Accessed June 2015.
Dugast, D. 1979. Vocabulaire et Stylistique.
I Th´eˆatre et Dialogue. Slatkine-Champion.
Travaux de Linguistique Quantitative.
information content of symbol sequences
and efficient codes. IEEE Transactions on
Information Theory, 35:669–675.
Guiraud, H. 1954. Les Charact`eres Statistique
du Vocabulaire. Universitaires de France
Drücken Sie.
Gusfield, D. 1997. Algorithms on Strings, Und
Sequences: Computer Science and
Computational Biology. Cambridge
Universitätsverlag.
Herdan, G. 1964. Quantitative Linguistics.
Butterworths.
Hilberg, W. 1990. Der bekannte grenzwert
der redundanzfreien information
in texten eine fehlinterpretation der
shannonschen experimente? Frequenz,
44(9–10):243–248.
Honor´e, A. 1979. Some simple measures of
richness of vocabulary. Association for
Literary and Linguistic Computing Bulletin,
7:172–177.
Kimura, D. and K. Tanaka-Ishii. 2011. A
study on constants of natural language
texts. Journal of Natural Language Processing,
18(2):119–137.
Kimura, D. and K. Tanaka-Ishii. 2014. A
study on constants of natural language
texts. Journal of Natural Language Processing,
21:877–895. Special issue of awarded
Papiere. [The English translated version
of the article appeared in 2011 In
Japanese].
Kitchens, B. 1998. Symbolic Dynamics:
One-sided, Two-sided and Countable State
Markov Shifts. Springer.
501
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Computerlinguistik
Volumen 41, Nummer 3
Erheben, R. and T. F. Jaeger. 2007. Speakers
optimize information density through
information density through
syntactic reduction. In Annual
Conference on Neural Information
Processing Systems, pages 1–8,
Vancouver.
Maas, H. D. 1972. Zusammenhang zwischen
wortschatzumfang und l¨ange eines textes
[Relationship between vocabulary and text
Länge]. Zeitschrift f ¨ur Literaturwissenschaft
und Linguistik, 8:73–70.
Mandelbrot, B. 1953. An informational
theory of the statistical structure of
Sprache. Communication Theory,
486–500.
Manning, C. and H. Schuetze. 1999.
Foundations of Statistical Natural Language
Processing. MIT Press.
Montemurro, M. and D. Zanette. 2013.
Keywords and co-occurrence patterns
in the Voynich Manuscript: Ein
information-theoretic analysis. PLOS One.
doi: 10.1371/zeitschrift.pone.0066344.
Orliac, C. 2005. The Rongorongo tablets from
Easter Island: Botanical identification
and 14c dating. Archaeology in Oceania,
40(3):115–119.
Orlov, J. K. and R. Y. Chitashvili. 1983.
Generalized z-distribution generating the
well-known ‘rank-distributions’. Bulletin of
the Academy of Sciences of Georgia,
110:269–272.
Pozdniakov, K. and I. Pozdniakov. 2007.
Rapanui writing and the Rapanui
Sprache: Preliminary results of a
statistical analysis. Forum for Anthropology
and Culture, 3:3–36.
Reddy, S. and K. Ritter. 2011. What we
know about the Voynich Manuscript. In
ACL Workshop on Language Technology for
Cultural Heritage, Social Sciences, Und
Humanities, Portland, OR.
R´enyi, A. 1960. On measures of entropy and
Information. In Proceedings of the Fourth
Berkeley Symposium on Mathematics,
Statistics and Probability, pages 547–561,
Berkeley, CA.
R´enyi, A. 1970. Foundations of Probability.
Dover Publications.
Sch ¨umann, T. and P. Grassberger. 1996.
Entropy estimation of symbol sequences.
Chaos, 6(3):414–427.
Shannon, C. 1948. A mathematical theory of
communication. Bell System Technical
Zeitschrift, 27:379–423, 623–656.
Shields, P. C. 1992. Entropy and prefixes.
Annals of Probability, 20(1):403–409.
Sichel, H. S. 1975. On a distribution law for
word frequencies. Journal of the
American Statistical Association,
70(351):542–547.
Simpson, E. H. 1949. Measurement of
Diversität. Natur, 163:688.
Stamatatos, E., N. Fakotakis, and G.
Kokkinakis. 2001. Automatic text
categorization in terms of genre and
author. Computerlinguistik,
26(4):471–495.
Stein, B., N. Lipka, and P. Prettenhofer. 2010.
Intrinsic plagiarism analysis. Language
Resources and Evaluation, 45(1):63–82.
Teh, Y. W. 2006. A hierarchical Bayesian
language model based on Pitman-Yor
processes. In Proceedings of the 21st
International Conference On Computational
Linguistics and 44th Annual Meeting of the
ACL, pages 985–992, Sydney.
Tuldava, J. 1977. Quantitative relations
between the size of the text and lexical
richness. SMIL Quarterly, Zeitschrift für
Linguistic Calculus, 4:28–35.
Tweedie, F. J. and Baayen, R. H. 1998. Wie
variable may a constant be? Measures of
lexical richness in perspective. Computers
and the Humanities, 32:323–352.
Yule, G. U. 1944. The Statistical Study of
Literary Vocabulary. Cambridge University
Drücken Sie.
Zipf, G. K. 1965. Human Behavior and the
Principle of Least Effort: An Introduction to
Human Ecology. Hafner, New York.
502
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
C
Ö
l
ich
/
l
A
R
T
ich
C
e
–
P
D
F
/
/
/
/
4
1
3
4
8
1
1
8
0
6
4
9
4
/
C
Ö
l
ich
_
A
_
0
0
2
2
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3