A Cross-Linguistic Pressure for
Uniform Information Density in Word Order
Thomas Hikaru Clark1 Clara Meister2 Tiago Pimentel3 Michael Hahn4
Ryan Cotterell2 Richard Futrell5 Roger Levy1
1MIT, USA 2ETH Zürich, Switzerland 3University of Cambridge, UK
4Saarland University, Germany 5UC Irvine, USA
thclark@mit.edu meistecl@inf.ethz.ch tp472@cam.ac.uk
mhahn@lst.uni-saarland.de ryan.cotterell@inf.ethz.ch
rfutrell@uci.edu rplevy@mit.edu
Abstract
While natural languages differ widely in both canonical word order
and word order flexibility, their word orders still follow shared
cross-linguistic statistical patterns, often attributed to functional
pressures. In the effort to identify these pressures, prior work has
compared real and counterfactual word orders. However, one functional
pressure has been overlooked in such investigations: the uniform
information density (UID) hypothesis, which holds that information
should be spread evenly throughout an utterance. Here, we ask whether
a pressure for UID may have influenced word order patterns
cross-linguistically. To this end, we use computational models to test
whether real orders lead to greater information uniformity than
counterfactual orders. In our empirical study of 10 typologically
diverse languages, we find that: (i) among SVO languages, real word
orders consistently have greater uniformity than reverse word orders,
and (ii) only linguistically implausible counterfactual orders
consistently exceed the uniformity of real orders. These findings are
compatible with a pressure for information uniformity in the
development and usage of natural languages.1
1 Introduction
Human languages differ widely in many respects,
yet there are patterns that appear to hold consis-
tently across languages. Identifying explanations
for these patterns is a fundamental goal of linguistic
typology. Furthermore, such explanations
may shed light on the cognitive pressures under-
lying and shaping human communication.
1Code for reproducing our experiments is available at
https://github.com/thomashikaru/word-order-uid.
This work studies the uniform information density (UID) hypothesis
as an explanatory principle for word order patterns (Fenk and Fenk,
1980; Genzel and Charniak, 2002; Aylett and Turk, 2004; Jaeger, 2010;
Meister et al., 2021). The UID hypothesis posits a communicative
pressure to avoid spikes in information within an utterance, thereby
keeping the information profile of an utterance relatively close to
uniform over time. While the UID hypothesis has been proposed as an
explanatory principle for a range of linguistic phenomena, e.g.,
speakers’ choices when faced with lexical and syntactic alternations
(Levy and Jaeger, 2006), its relationship to word order patterns has
received limited attention, with the notable exception of Maurits
et al. (2010).
Our work investigates the relationship between UID and word order
patterns, differing from prior work in several ways. We (i) use
Transformer language models (LMs) (Vaswani et al., 2017) to estimate
information-theoretic operationalizations of information uniformity;
(ii) analyze large-scale naturalistic datasets of 10 typologically
diverse languages; and (iii) compare a range of theoretically
motivated counterfactual grammar variants.
Experimentally, we find that among SVO lan-
guages, the real word order has a more uniform
information density than nearly all counterfac-
tual word orders; the only orders that consistently
exceed real orders in uniformity are generated
using an implausibly strong bias for uniformity,
at the cost of expressivity. Further, we find that
counterfactual word orders that place verbs be-
fore objects are more uniform than ones that place
objects before verbs in nearly every language.
Our findings suggest that a tendency for uniform
information density may exist in human language,
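Transactions of the Association for Computational Linguistics, vol. 11, pp. 1048–1065, 2023. https://doi.org/10.1162/tacl_a_00589
Action Editor: Mark-Jan Nederhof. Submission batch: 9/2022; Revision batch: 1/2023; Published 8/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.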
with two potential sources: (i) word order rules, with SVO order
generally being more uniform than SOV; and (ii) choices made by
speakers, who use the flexibility present in real languages to
structure information more uniformly at a global level (and not only
in a small number of isolated constructions).
2 Functional Pressures in Language
2.1 Linguistic Optimizations
A number of linguistic theories link cross-linguistic patterns to
functional pressures. For example, both the grammatical rules of a
language and speakers’ choices (within the space of grammatically
acceptable utterances) are posited to reflect a trade-off between
effort and robustness: Shorter and simpler structures are easier to
produce and comprehend, but longer and more complex utterances can
encode more information (Gabelentz, 1901; Zipf, 1935; Hawkins, 1994,
2004, 2014; Haspelmath, 2008). Another such functional pressure
follows from the principle of dependency length minimization (DLM),
which holds that, in order to minimize working memory load during
comprehension, word orders should place words in direct dependency
relations close to each other (Rijkhoff, 1986, 1990; Hawkins, 1990,
1994, 2004, 2014; Grodner and Gibson, 2005; Gibson, 1998, 2000;
Bartek et al., 2011; Temperley and Gildea, 2018; Futrell et al., 2020).
A growing body of work has turned to information theory, the
mathematical theory of communication (Shannon, 1948), to formalize
principles that explain linguistic phenomena (Jaeger and Tily, 2011;
Gibson et al., 2019; Pimentel et al., 2021c). One such principle is
that of uniform information density.
2.2 Uniform Information Density
According to the UID hypothesis, speakers tend to spread information
evenly throughout an utterance; large fluctuations in the per-unit
information content of an utterance can impede communication by
increasing the processing load on the listener. Speakers may modulate
the information profile of an utterance by selectively producing
linguistic units such as optional complementizers in English (Levy
and Jaeger, 2006; Jaeger, 2010). A pressure for UID in speaker
choices has also been studied in specific constructions in other
languages, though with mixed conclusions (Zhan and Levy, 2018; Clark
et al., 2022).
Formally, the information conveyed by a linguistic signal $y$, e.g.,
an utterance or piece of text, is quantified in terms of its
surprisal $s(\cdot)$, which is defined as $y$’s negative
log-probability: $s(y) \stackrel{\text{def}}{=} -\log p_\ell(y)$.
Here, $p_\ell$ is the underlying probability distribution over
sentences $y$ for a language $\ell$. Note that we do not have access
to the true distribution $p_\ell$, and typically rely on a language
model with learned parameters $\theta$ to estimate surprisal values
with a second distribution $p_\theta$.
Surprisal can be additively decomposed over the units that comprise
a signal. Explicitly, for a signal $y$ that can be expressed as a
series of linguistic units $\langle y_1, \ldots, y_N \rangle$, where
$y_n \in \mathcal{V}$ and $\mathcal{V}$ is a set vocabulary of words
or morphemes, the surprisal of a unit $y_n$ is its negative
log-probability given prior context:
$s(y_n) = -\log p_\ell(y_n \mid y_{<n})$.
[…]
which disproportionately increases in the pres-
ence of larger surprisal values.4 Note that for all
of these operationalizations, lower values corre-
spond to greater uniformity.5
4This metric suggests a super-linear processing cost for
surprisal.
5We note that, while a fully uniform language would
have value 0 for $\mathrm{UID}_v$ and $\mathrm{UID}_{lv}$, it would not for $\mathrm{UID}_p(y)$,
so the metrics are not directly comparable.
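As a rough illustration of how sentence-level uniformity metrics can be computed from per-unit surprisals, the sketch below assumes three definitions that are not spelled out in the surviving text of this section: a variance around the sentence mean for UIDv, a mean squared difference between adjacent units for UIDlv, and a mean of surprisals raised to a power k > 1 for UIDp. The function names and the default k = 1.25 are illustrative assumptions, not values taken from this section.

```python
from typing import Sequence


def uid_variance(surprisals: Sequence[float]) -> float:
    """Assumed UID_v: mean squared deviation from the sentence mean."""
    n = len(surprisals)
    mean = sum(surprisals) / n
    return sum((s - mean) ** 2 for s in surprisals) / n


def uid_local_variance(surprisals: Sequence[float]) -> float:
    """Assumed UID_lv: mean squared change between adjacent units."""
    diffs = [(b - a) ** 2 for a, b in zip(surprisals, surprisals[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


def uid_power(surprisals: Sequence[float], k: float = 1.25) -> float:
    """Assumed UID_p: mean of surprisal**k; k > 1 penalizes spikes."""
    return sum(s ** k for s in surprisals) / len(surprisals)


flat = [2.0, 2.0, 2.0, 2.0]    # 8 bits spread evenly over four units
spiky = [0.0, 0.0, 0.0, 8.0]   # the same 8 bits concentrated in one unit

for name, metric in [("UID_v", uid_variance),
                     ("UID_lv", uid_local_variance),
                     ("UID_p", uid_power)]:
    print(name, metric(flat), metric(spiky))
```

Under these assumed definitions the evenly spread sentence scores lower on all three metrics, and only the power-based metric is non-zero for it, which matches the remark in footnote 5.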
3 Counterfactual Language Paradigm
Following prior work that has used counterfactual languages to study
the functional pressures at play in word order patterns, we
investigate to what degree a language’s word order shows signs of
optimization for UID. In this approach, a corpus of natural language
is compared against a counterfactual corpus containing minimally
changed versions of the same sentences, where the changes target an
attribute of interest, e.g., the language’s word order. For example,
several studies of DLM have compared syntactic dependency lengths in
real and counterfactual corpora, generated by permuting the
sentences’ word order either randomly (Ferrer-i-Cancho, 2004; Liu,
2008) or deterministically by applying a counterfactual grammar
(Gildea and Temperley, 2010; Gildea and Jaeger, 2015; Futrell et al.,
2015b, 2020). Similarly, we will compare measures of UID in real and
counterfactual corpora to investigate whether real languages’ word
orders exhibit more uniform information density than alternative
realizations.
3.1 Formal Definition
We build on the counterfactual generation procedure introduced by
Hahn et al. (2020) to create parallel corpora. This procedure
operates on sentences’ dependency parses. Formally, a dependency
parse $t$ of a sentence $y$ is a directed tree with one node for
every word, where each word in $y$, with the exception of a
designated root word, is the child of its (unique) syntactic head;
see Zmigrod et al. (2020) for a discussion of the role of the root
constraint in dependency tree annotation. Each edge in the tree is
annotated with the syntactic relationship between the words connected
by that edge; see Figure 1 for an example. Here we use the set of
dependency relations defined by the Universal Dependencies (UD)
paradigm (de Marneffe et al., 2021), although
we follow Hahn et al. (2020) in transforming dependency trees such
that function words are treated as heads, leading to representations
closer to those of standard syntactic theories; see also Gerdes
et al. (2018).
Tree Linearization. While syntactic relationships are naturally
described hierarchically, sentences are produced and processed as
linear strings of words. Importantly, there are many ways to
linearize a dependency parse $t$’s nodes into a string $y$.
Concretely, a grammar under our formalism is defined by an ordering
function (see Kuhlmann, 2010) $g(\cdot, \cdot)$ which takes as
arguments a dependency parse and a specific node in it, and returns
an ordering of the node and its dependents. For each node, its
dependents are arranged from left to right according to this
ordering; any node without dependents is trivially an ordered set on
its own. This process proceeds recursively to arrive at a final
ordering of all nodes in a dependency tree, yielding the final string
$y$. Pseudo-code for the linearization of a tree $t$ based on an
ordering function $g$ is given in Figure 2.
Figure 2: Pseudo-code to linearize a dependency tree according to a
grammar’s ordering function $g$. In this code, each node contains a
word and its syntactic dependents.
Simplifying Assumptions. One consequence of this formalism is that
all counterfactual orders correspond to projective trees, i.e., trees
with no crossing dependencies. While projectivity is a well-attested
cross-linguistic tendency, human languages do not obey it absolutely
(Ferrer-i-Cancho et al., 2018; Yadav et al., 2021). Within the space
of projective word order interventions allowed by this formalism, the
grammars which we borrow from Hahn et al. (2020) enforce two
additional simplifying constraints. First, the relative positioning
(left or right) between the head and dependent of a particular
relation is fixed. Second, the relative ordering of different
relations on the same side of a head is also fixed. We denote
grammars which satisfy both constraints as consistent. Notably,
natural languages violate both of these assumptions to varying
degrees. For example, even in English, a language with relatively
strict word order, adverbs can generally appear before or after their
head. While these simplifications mean that the formalism cannot
perfectly describe natural languages, it provides a computationally
well-defined method for intervening on many features of word order.
In particular, the consistent grammars of Hahn et al. (2020) are
parameterized by a set of scalar weights corresponding to each
possible syntactic relation; the ordering function thus reduces to
sorting each head’s dependents based on their weight values. Notably,
Hahn et al. (2020) also introduced a method for optimizing these
grammars for various objective functions by performing stochastic
gradient descent on a probabilistic relaxation of the grammar
formalism; we use several of these grammars (described in §3.2) in
our subsequent analysis.
Creating Counterfactual Word Orderings. The above paradigm equips us
with the tools necessary for systematically altering sentences’ word
orderings, which, in turn, enables us to create counterfactual
corpora. Notably, the large corpora we use in this study contain
sentences as strings, not as their dependency parses. We therefore
define our counterfactual grammar intervention as the output of a
(deterministic) word re-ordering function
$f : \mathcal{Y} \to \mathcal{Y}$, where
$\mathcal{Y} \stackrel{\text{def}}{=} \mathcal{V}^*$ is the set of
all possible sentences that can be constructed using a language’s
vocabulary $\mathcal{V}$.6 This function takes as input a sentence
from our original language and outputs a sentence with the
counterfactual word order defined by a given ordering function $g$.
We decompose this function into two steps:
$$ f(y) = \mathrm{linearize}(\mathrm{parse}(y), g) \tag{4} $$
6For notational brevity, we leave the dependency of $\mathcal{V}$
on $\ell$ implicit as it should be clear from context.
Figure 3: The same source sentence according to 4 real and counterfactual orderings.
We use a state-of-the-art parser (Straka and Straková, 2017) to
implement $\mathrm{parse} : \mathcal{Y} \to \mathcal{T}$, where
$\mathcal{T}$ is the set of all dependency parses. Specifically, we
define $\mathrm{parse}(y) = \operatorname{argmax}_{t \in \mathcal{T}} p(t \mid y)$
for a learned conditional probability distribution over possible
parses $p(\cdot \mid y)$. We then obtain the linearized form of the
resulting tree by supplying it and the ordering function $g$ to
linearize, as defined above. Collectively, the outputs of this
process (parallel datasets differing only in word order) are referred
to as variants. Importantly, $f$ here is a deterministic function;
one could instead consider $f$ to be probabilistic in nature, with
each sentence $y$ having a distribution over tree structures $t$. We
discuss the implications of this choice in §4.
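The recursive linearization step described in this section can be sketched as follows. This is a minimal illustration, not the procedure of Hahn et al. (2020): the `Node` class, the weight values, and the sign convention (negative weight places a dependent before its head, non-negative after) are all assumptions made for the example, and the toy tree uses plain UD-style relations rather than the function-word-headed transformation mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    word: str
    relation: str = "root"          # relation linking this node to its head
    children: List["Node"] = field(default_factory=list)


def linearize(node: Node, weights: Dict[str, float]) -> List[str]:
    """Recursively order a head and its dependents.

    Assumed convention (hypothetical): dependents whose relation has a
    negative weight precede the head, the rest follow it, and each side
    is sorted by weight, so every relation has a fixed direction.
    """
    ordered = sorted(node.children, key=lambda c: weights[c.relation])
    left = [c for c in ordered if weights[c.relation] < 0]
    right = [c for c in ordered if weights[c.relation] >= 0]
    words: List[str] = []
    for child in left:
        words.extend(linearize(child, weights))
    words.append(node.word)
    for child in right:
        words.extend(linearize(child, weights))
    return words


# Toy parse of "the dog ate a bone" (hand-built, not parser output).
tree = Node("ate", children=[
    Node("dog", "nsubj", [Node("the", "det")]),
    Node("bone", "obj", [Node("a", "det")]),
])
svo = {"nsubj": -1.0, "obj": 1.0, "det": -0.5}    # object after the verb
sov = {"nsubj": -2.0, "obj": -1.0, "det": -0.5}   # object before the verb

print(" ".join(linearize(tree, svo)))  # the dog ate a bone
print(" ".join(linearize(tree, sov)))  # the dog a bone ate
```

Because each subtree is emitted as one contiguous block, every output of this sketch is projective, in line with the simplifying assumptions of the formalism.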
3.2 Counterfactual Grammar Specifications
In addition to the original REAL word order,
we explore the following theoretically motivated
counterfactual grammars for each language. Ex-
ample sentences from several of these grammars
are shown in Figure 3.
Consistent Approximation to Real Order. AP-
PROX is a consistent approximation to the real
word order within our formalism; it uses an order-
ing function parameterized by weights that were
fitted to maximize the likelihood of observed
word orders for each language, as reported by
Hahn et al. (2020). This variant captures most of
the word order features of a real language while
allowing for a fair comparison to deterministic
counterfactual grammars that do not model the
flexibility of real language. From the perspective
of the UID hypothesis, we expect this variant
to be less uniform than REAL because it has less
flexibility to accommodate speakers’ choices that
optimize for UID.
Consistent Random Grammars. We include
variants RANDOM1 through RANDOM5, which use
ordering functions parameterized by randomly
assigned weights. This means that for a given
random grammar, each dependency relation has
a fixed direction (left or right), but that the di-
rections of these relations lack the correlations
observed in natural language (Greenberg, 1963).
Random grammars with the same numerical in-
dex share weights across languages.
Consistent Grammars Optimized for Effi-
ciency. We include two consistent grammars
that are optimized for the joint objective of parseability
(how much information an utterance provides about its
underlying syntactic structure) and sentence-internal
predictability, as reported by Hahn et al. (2020), one with
OV order (EFFICIENT-OV) and one with VO order (EFFICIENT-VO).
For example, the EFFICIENT-OV grammar for
English would give a plausible version of a con-
sistent and efficient grammar in the counterfac-
tual world where English has verbs after objects.
Grammars Optimized for Dependency Length
Minimization. From the same work we also
take consistent grammars that are optimized for DLM, denoted as
MIN-DL-OPT. While linearizations produced by these grammars are not
guaranteed to minimize dependency length for any particular sentence,
they minimize the expected average dependency length of a large
sample of sentences in a language. In addition, we include
MIN-DL-LOC, an inconsistent grammar that applies the projective
dependency-length minimization algorithm of Gildea and Temperley
(2007) at the sentence level, leading to sentences with minimal DL
but without the constraint of consistency.
Frequency-sorted Grammars. SORT-FREQ is an inconsistent grammar which
orders words in a sentence from highest to lowest frequency, ignoring
dependency structure altogether. We use this ordering as a heuristic
baseline for which we expect UID to hold relatively strongly:
low-frequency elements, which tend to have higher surprisal even if
solely from their less frequent usage (Ellis, 2002), are given more
context, and thus should have smaller surprisals than if they
occurred early; more conditioning context tends to reduce the
surprisal of the next word (Luke and Christiansen, 2016). We also
test SORT-FREQ-REV, ordering words from least to most frequent, which
for analogous reasons we expect to perform poorly in terms of UID.
However, both of these orderings lead to massive syntactic ambiguity
by introducing many string collisions: any two sentences containing
the same words in different orders would be linearized identically.
This eliminates word order as a mechanism for expressing distinctions
in meaning, so these orders are implausible as alternatives to
natural languages (Mahowald et al., 2022).
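The string collisions mentioned above are easy to see on a toy corpus. The sketch below is illustrative only: the three-sentence corpus, the `sort_freq` helper, and the alphabetical tie-breaking rule are assumptions made for the example.

```python
from collections import Counter
from typing import List

corpus = [
    "the dog bit the man".split(),
    "the man bit the dog".split(),
    "the dog slept".split(),
]
# Corpus-level word frequencies (toy counts for illustration only).
freq = Counter(w for sent in corpus for w in sent)


def sort_freq(sentence: List[str], reverse: bool = False) -> List[str]:
    """Order words from most to least frequent (least to most if reverse).

    Ties are broken alphabetically, an assumption made here so that the
    output depends only on the multiset of words in the sentence.
    """
    sign = 1 if reverse else -1
    return sorted(sentence, key=lambda w: (sign * freq[w], w))


print(sort_freq(corpus[0]))                 # ['the', 'the', 'dog', 'bit', 'man']
print(sort_freq(corpus[1]))                 # identical: the two meanings collide
print(sort_freq(corpus[0], reverse=True))   # ['bit', 'man', 'dog', 'the', 'the']
```

The first two sentences differ in who bit whom, yet they receive the same frequency-sorted string, which is why such orderings cannot use word order to signal meaning.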
Reverse Grammar. Finally, we also include the REVERSE variant, where
the words in each sentence appear in the reverse order of the
original. This variant preserves all pairwise distances between words
within sentences and has the same dependency lengths as the original
order, thus isolating the effect of linear order on information
density from other potential influences. Notably, if the original
language happens to be perfectly consistent, then REVERSE will also
satisfy consistency; in practice, this is unlikely to hold with
natural languages.
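The claim that reversal leaves dependency lengths unchanged can be checked directly. The following is a small sketch under assumed conventions: arcs are stored as (head position, dependent position) pairs, and the example parse is hand-built rather than parser output.

```python
from typing import List, Tuple

# Each arc is (head_position, dependent_position), 0-indexed.
Arc = Tuple[int, int]


def total_dependency_length(arcs: List[Arc]) -> int:
    """Sum of linear distances between each head and its dependent."""
    return sum(abs(h - d) for h, d in arcs)


def reverse_sentence(words: List[str], arcs: List[Arc]) -> Tuple[List[str], List[Arc]]:
    """Reverse the word order and remap arc positions accordingly."""
    n = len(words)
    return words[::-1], [(n - 1 - h, n - 1 - d) for h, d in arcs]


words = ["the", "dog", "ate", "a", "bone"]
arcs = [(1, 0), (2, 1), (4, 3), (2, 4)]   # toy parse, not parser output
rev_words, rev_arcs = reverse_sentence(words, arcs)

print(rev_words)                                    # ['bone', 'a', 'ate', 'dog', 'the']
print(total_dependency_length(arcs),
      total_dependency_length(rev_arcs))            # 5 5
```

Position i maps to n - 1 - i, so every absolute difference between two positions is preserved, and the total dependency length is the same in both orders.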
3.3 UID and Counterfactual Grammars
Let $p_\ell(y)$ be the probability distribution over sentences $y$
for a language of interest $\ell$. We can define a language’s UID
score as the expected value of its sentences’ UID scores, where we
overload the UID function to take either a sentence $y$ or an entire
language $\ell$:
$$ \mathrm{UID}(\ell) \stackrel{\text{def}}{=} \sum_{y \in \mathcal{Y}} p_\ell(y) \, \mathrm{UID}(y) \tag{5} $$
where sentence-level UID can be $\mathrm{UID}_v(y)$,
$\mathrm{UID}_{lv}(y)$, or $\mathrm{UID}_p(y)$. In practice, we
estimate this language-level UID score using a Monte Carlo estimator,
taking the mean sentence-level UID score across a held-out test set
$S_\ell$ of sentences $y$ in language $\ell$, where we assume
$y \sim p_\ell$:
$$ \widehat{\mathrm{UID}}(\ell) \stackrel{\text{def}}{=} \frac{1}{|S_\ell|} \sum_{y \in S_\ell} \mathrm{UID}(y) \tag{6} $$
Similarly, the expected surprisal (or Shannon entropy, $H$) of this
language is computed as:
$$ H(\ell) \stackrel{\text{def}}{=} -\sum_{y \in \mathcal{Y}} p_\ell(y) \log p_\ell(y) \tag{7} $$
We evaluate how well a language model $p_\theta$ approximates
$p_\ell$ by its cross-entropy:
$$ H(p_\ell, p_\theta) = -\sum_{y \in \mathcal{Y}} p_\ell(y) \log p_\theta(y) \tag{8} $$
where a smaller value of $H$ implies a better model. Again using a
Monte Carlo estimator, we measure cross-entropy using the held-out
test set $S_\ell$:
$$ \widehat{H}(p_\ell, p_\theta) = -\frac{1}{|S_\ell|} \sum_{y \in S_\ell} \log p_\theta(y) \tag{9} $$
This is simply the mean surprisal that the model assigns to a corpus
of naturalistic data.
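The two Monte Carlo estimators above amount to simple averages over the test set. The sketch below illustrates this with made-up numbers; the function names, the stand-in `variance` metric, and the surprisal values are all hypothetical.

```python
from typing import Callable, List, Sequence


def mc_uid(test_set: Sequence[List[float]],
           uid: Callable[[List[float]], float]) -> float:
    """Eq. (6): mean sentence-level UID score over a held-out test set.

    Each sentence is represented by its list of per-unit surprisals.
    """
    return sum(uid(s) for s in test_set) / len(test_set)


def mc_cross_entropy(sentence_log_probs: Sequence[float]) -> float:
    """Eq. (9): negative mean of log p_theta(y) over the test set."""
    return -sum(sentence_log_probs) / len(sentence_log_probs)


def variance(xs: List[float]) -> float:
    """Stand-in sentence-level metric (variance of surprisals)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Hypothetical per-unit surprisals (in bits) for three test sentences.
test_set = [[1.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 3.0]]
# A sentence's log-probability is minus the sum of its unit surprisals.
log_probs = [-sum(s) for s in test_set]

print(mc_uid(test_set, variance))     # 19/27, about 0.704
print(mc_cross_entropy(log_probs))    # 8/3, about 2.667 bits per sentence
```

Any sentence-level metric with the same signature can be passed as `uid`, so the same averaging code serves all three operationalizations.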
These computations can also be applied to counterfactual variants of
a language. Let $\ell_f$ stand for a language identical to $\ell$,
but where its strings have been transformed by $f$; this language’s
distribution over sentences would be
$p_{\ell_f}(y) = \sum_{y' \in \mathcal{Y}} p_\ell(y') \, \mathbb{1}\{y = f(y')\}$.
Since entropy is non-increasing over function transformations (by
Jensen’s inequality), it follows that:
$$ H(\ell) \geq H(\ell_f) \tag{10} $$
Further, if our counterfactual generation function $f$ is a bijection
(meaning that each input string gets mapped to a distinct output
string and each output string has an input that maps to it), then we
can create a second function $f^{-1} : \mathcal{Y} \to \mathcal{Y}$,
which would generate $\ell$ from $\ell_f$. Then, the following holds:
$$ H(\ell) \geq H(\ell_f) \geq H(\ell_{f^{-1} \circ f}) = H(\ell) \tag{11} $$
i.e., it must be that $H(\ell) = H(\ell_f)$. Reversing a sentence is
an example of a bijective function, and thus Equation (11) holds
necessarily for the pair of REAL and REVERSE variants; the
counterfactual generation procedure thus should not produce
differences in mean surprisal between these variants. At the same
time, bijectivity does not necessarily hold for our other
counterfactual transformations and is violated to a large degree when
mapping to SORT-FREQ and SORT-FREQ-REV. Thus, in general, we can only
guarantee Inequality 10.
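The relationship between entropy and string transformations can be verified numerically on a small distribution. This sketch uses a made-up four-string language; `pushforward`, `reverse`, and `sort_chars` are names chosen for the example, with character sorting standing in for an order-destroying, non-bijective transformation.

```python
import math
from collections import defaultdict
from typing import Callable, Dict


def entropy(p: Dict[str, float]) -> float:
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)


def pushforward(p: Dict[str, float], f: Callable[[str], str]) -> Dict[str, float]:
    """Distribution of f(y) when y ~ p: colliding strings pool their mass."""
    out: Dict[str, float] = defaultdict(float)
    for y, q in p.items():
        out[f(y)] += q
    return dict(out)


def reverse(y: str) -> str:
    """Bijective: distinct strings stay distinct."""
    return y[::-1]


def sort_chars(y: str) -> str:
    """Not injective: 'ab' and 'ba' both map to 'ab'."""
    return "".join(sorted(y))


p = {"ab": 0.25, "ba": 0.25, "ac": 0.25, "ca": 0.25}

print(entropy(p))                           # 2.0
print(entropy(pushforward(p, reverse)))     # 2.0 (unchanged under a bijection)
print(entropy(pushforward(p, sort_chars)))  # 1.0 (collisions lose information)
```

The bijective map leaves entropy untouched, while the colliding map strictly lowers it, which is the pattern stated for REVERSE on the one hand and the frequency-sorted variants on the other.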
Crucially, however, the transformation $f$ might change the UID score
of such a language, allowing us to evaluate the impact of word order
on information uniformity. As a simple example, consider the language
$\ell_1$ that places a uniform distribution over only four strings:
aw, ax, by, and bz. In this language, the first and second symbols
always have 1 bit of surprisal, and the end of the string has 0 bits
of surprisal. If the counterfactual language $\ell_2$ is the reverse
of $\ell_1$, we have a uniform distribution over the strings wa, xa,
yb, and zb. Here, the first symbol always has 2 bits of surprisal,
and the second symbol and end of sentence always have zero bits, as
their values are deterministic for a given initial symbol. While the
mean surprisal per symbol is the same for $\ell_1$ and $\ell_2$,
$\ell_1$ has more uniform information density than $\ell_2$.
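The four-string example can be reproduced with a few lines of code. The sketch below computes exact per-symbol surprisals under a uniform distribution over each toy language; the `EOS` marker and the use of variance as the uniformity measure are choices made for this illustration.

```python
import math
from collections import Counter
from typing import List

EOS = "$"   # end-of-string symbol (name chosen for this sketch)


def surprisals(string: str, language: List[str]) -> List[float]:
    """Per-symbol surprisal in bits, including end-of-string, under a
    uniform distribution over the strings in `language`."""
    seqs = [s + EOS for s in language]
    target = string + EOS
    out = []
    for i, sym in enumerate(target):
        # Strings that share the prefix observed so far.
        matching = [s for s in seqs if s[:i] == target[:i]]
        nxt = Counter(s[i] for s in matching)
        out.append(math.log2(len(matching) / nxt[sym]))
    return out


def variance(xs: List[float]) -> float:
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


l1 = ["aw", "ax", "by", "bz"]
l2 = [s[::-1] for s in l1]       # wa, xa, yb, zb

s1 = surprisals("aw", l1)
s2 = surprisals("wa", l2)
print(s1, variance(s1))   # [1.0, 1.0, 0.0], variance 2/9
print(s2, variance(s2))   # [2.0, 0.0, 0.0], variance 8/9
```

Both strings carry 2 bits in total, so the mean surprisal per symbol is identical, but the reversed language concentrates all of it on the first symbol and therefore has the higher variance.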
4 Limitations
4.1 Use of Counterfactual Grammars
Real Word Orders Are not Consistent. The consistent grammars borrowed
from Hahn et al. (2020) assume that the direction of each syntactic
relation, as well as the relative ordering of dependents on the same
side of a head, are fixed. This is not generally true of natural
languages. We address this difference by including the variant APPROX
as a comparison to the counterfactual variants, which are constrained
by consistency, and by including REVERSE as a comparison to REAL,
neither of which is constrained by consistency.
Automatic Parsing Errors. Another issue is that the dependency parses
extracted for each original sentence as part of the counterfactual
generation pipeline may contain parsing errors. These errors may
introduce noise into the counterfactual datasets that is not present
in the original sentences, and may cause deviations from the
characteristics that we assume our counterfactual grammars should
induce. For example, MIN-DL-LOC only produces sentences with
minimized dependency length if the automatic parse is correct.
Deterministic Parsing. Finally, our counterfactual generation
procedure assumes a deterministic mapping from sentences to
dependency trees as one of its steps. However, multiple valid parses
of sentences are possible in the presence of syntactic ambiguity. In
this case, we always select the most likely structure according to
the parser, which learns these probabilities based on its training
data. Therefore, this design choice could lead to underrepresentation
of certain syntactic structures when applying a transformation.
However, we note that the variants REAL, REVERSE, SORT-FREQ, and
SORT-FREQ-REV do not depend on dependency parses and so are
unaffected by this design choice.
4.2 Choice of Dataset
Properties of language can vary across genres and domains. When
drawing conclusions about human language in general, no single
dataset will be completely representative. Due to the amount of data
required to train LMs, we use written corpora in this work, and use
the term speaker loosely to refer to any language producer regardless
of modality. To address potential concerns about the choice of
dataset in this study, we conducted a supplementary analysis on a
subset of languages using a different web corpus, which we report in
§7.5.
4.3 Errors and Inductive Biases
Model Errors. Language model quality could
impact the estimated values of our UID metrics
$\mathrm{UID}_v$, $\mathrm{UID}_p$, and $\mathrm{UID}_{lv}$. To see
why, consider a model $p_\theta$ that, rather than providing unbiased
estimates of $p_\ell$, is a smoothed interpolation between $p_\ell$
and the uniform distribution:
p(yn | y