Helpful Neighbors:
Leveraging Neighbors in Geographic Feature Pronunciation
Llion Jones† Richard Sproat† Haruko Ishikawa† Alexander Gutkin‡
†Google Japan
‡Google UK
{llion,rws,ishikawa,agutkin}@google.com
Abstract

If one sees the place name Houston Mercer Dog Run in New York, how does one know how to pronounce it? Assuming one knows that Houston in New York is pronounced HOW-stən and not like the Texas city (HYOO-stən), then one can probably guess that the same pronunciation is also used in the name of the dog park. We present a novel architecture that learns to use the pronunciations of neighboring names in order to guess the pronunciation of a given target feature. Applied to Japanese place names, we demonstrate the utility of the model for finding and proposing corrections for errors in Google Maps. To demonstrate the utility of this approach to structurally similar problems, we also report on an application to a totally different task: cognate reflex prediction in comparative historical linguistics. A version of the code has been open-sourced.1
1 Introduction
In many parts of the world, pronunciation of toponyms and establishments can require local knowledge. Many visitors to New York, for example, get tripped up by Houston Street, which they assume is pronounced the same as the city in Texas. If they do not know how to pronounce Houston Street, they would likely also not know how to pronounce the nearby Houston Mercer Dog Run. But if someone knows one, that can (usually) be used as a clue to how to pronounce the other.
Before we proceed further, a bit of terminology. Technically, the term toponym refers to the name of a geographical or administrative feature, such as a river, lake, city, or state. In most of what follows, we will use the term feature to refer to these and other entities such as roads, buildings, schools, and so forth. In practice we will not make a major distinction between the two, but since there is a sense in which toponyms are more basic, and the names of the more general features are often derived from a toponym (as in the Houston Mercer Dog Run example above), we will retain the distinction where it is needed.

1https://github.com/google-research/google-research/tree/master/cognate_inpaint_neighbors
While features cause not infrequent problems in the US, they become a truly serious issue in Japan. Japan is notorious for having toponyms whose pronunciation is so unexpected that even native speakers may not know how to pronounce a given case. Most toponyms in Japanese are written in kanji (Chinese characters) with a possible intermixing of one of the two syllabaries, hiragana or katakana. Thus 上野 Ueno is entirely in kanji; 虎ノ門 Tora no mon has two kanji and one katakana symbol (the second); and 吹割の滝 Fukiwari Waterfalls has three kanji and one hiragana symbol (the third). Features more generally tend to have more characters in one of the syllabaries—especially katakana if, for example, the feature is a building that includes the name of a company as part of its name.
The syllabaries are basically phonemic scripts, so there is generally no ambiguity in how to pronounce those portions of names, but kanji present a serious problem in that the pronunciation of a kanji string in a toponym is frequently something one just has to know. To take the example 上野 Ueno above, that pronunciation (for the well-known area in Tokyo) is indeed the most common one, but there are places in Japan with the same spelling but with pronunciations such as Uwano, Kamino, Wano, among others.2 It is well known that many kanji have both a native (kun) Japanese pronunciation (e.g., 山 yama 'mountain') as well as one or more Chinese-derived on pronunciations (e.g., 山 san 'mountain'), but the issue with toponyms goes

2Different pronunciations of kanji are often referred to as readings, but in this paper we will use the more general term pronunciation.
well beyond this, since there are nanori pronunciations of kanji that are only found in names (Ogihara, 2021): a kanji may, for example, also have the nanori pronunciation taka. The kun-on-nanori variants relate to an important property of how kanji are used in Japanese: among all modern writing systems, the Japanese use of kanji comes closest to being semasiographic—that is, representing meaning rather than specific morphemes. The common toponym component kawa 'river' is usually written 川, but can also be written as 河, which also means 'river'. That kanji in turn has other pronunciations, such as kō, a Sino-Japanese word for 'river'. This freedom to spell words with a range of kanji that have the same meaning, or to read kanji with any of a number of morphemes having the same meaning, is a particular characteristic of Japanese. Thus, while reading place names can be tricky in many parts of the world, the problem is particularly acute in Japan.
Since the variation is largely unpredictable, one therefore simply needs to know for a given toponym what the pronunciation is. But once one knows, for example, that a name written 上野 is read as Uwano, then, as with the Houston case, one ought to be able to deduce that 上野 in the name of the local 'Uwano First Public Park' is also read as Uwano and not Ueno. If one's digital assistant is reading this name to you, or needs to understand your pronunciation of the name, it needs to know the correct pronunciation. While one might expect a complete and correct maps database to have all of this information correctly entered, in practice maps data contain many errors, especially for less frequently accessed features.
In this paper we propose a model that learns to use information from the geographical context to guide the pronunciation of features. We demonstrate its application to detecting and correcting errors in Google Maps. Additionally, in Section 8 we show that the model can be applied to a different but structurally similar problem, namely, the problem of cognate reflex prediction in comparative historical linguistics. In this case the 'neighbors' are related word forms in a set of languages from a given language family, and the pronunciation to be predicted is the corresponding form in a language from the same family.
2 Background

Pronouncing written geographical feature names involves a combination of text normalization (if the names contain expressions such as numbers or abbreviations) and word pronunciation, often termed ''grapheme-to-phoneme conversion''. Both of these are typically cast as sequence-to-sequence problems, and neural approaches to both are now common. Neural approaches to grapheme-to-phoneme conversion are used by some researchers (Yao and Zweig, 2015; Rao et al., 2015; Toshniwal and Livescu, 2016; Peters et al., 2017; Yolchuyeva et al., 2019), and others use a text normalization approach (Sproat and Jaitly, 2017; Zhang et al., 2019; Yolchuyeva et al., 2018; Pramanik and Hussain, 2019; Mansfield et al., 2019; Kawamura et al., 2020; Tran and Bui, 2021). For languages that use the Chinese script,
grapheme-to-phoneme conversion may benefit from the fact that Chinese characters can mostly be decomposed into a component that relates to the meaning of the character and another that relates to the pronunciation. The latter information is potentially useful, in particular in Chinese and in the Sino-Japanese readings of characters in Japanese. Recent neural models that have taken advantage of this include Dai and Cai (2017) and Nguyen et al. (2020). On the other hand, it should be pointed out that other, more 'brute force' decompositions of characters also seem to be useful. Thus Yu et al. (2020) propose a byte decomposition of (UTF-8) character encodings for a model that covers a wide variety of languages, including Chinese and Japanese.
The above approaches generally treat the problem in isolation, in the sense that the task is cast as predicting a pronunciation independent of context. Different pronunciations for the same string in different linguistic contexts come under the rubric of homograph disambiguation, and there is a long tradition of work in this area; for an early example see Yarowsky (1996), and for a recent incarnation see Gorman et al. (2018). Not surprisingly, there has been recent interest in neural models for predicting homograph pronunciations: see Park and Lee (2020) and Shi et al. (2021) for recent examples focused on Mandarin.

The present task is different, since what disambiguates the possible pronunciations of Japanese features is not generally linguistic, but geographical context, which can be thought of as a way of biasing the decision as to which pronunciation to use, given evidence from the local context. Our approach is similar in spirit to that of Pundak et al.
(2018), who propose the use of a bias encoder in a ''listen-attend-and-spell'' (Chan et al., 2016) Automatic Speech Recognition architecture. The bias encoder takes a set of ''bias phrases'', which can be used to guide the model towards a particular decoding. Pundak et al.'s (2018) model is shown schematically in Figure 1.

Figure 1: The biasing LAS model from Pundak et al. (2018), Figure 1a.

3 Data

Features in Google Maps are stored in a data representation that includes a variety of information about each feature, including: its location as a bounding box in latitude-longitude; the type of the feature—street, building, municipality, topographic feature, and so forth; name(s) of the feature in the native language as well as in many (mostly automatically generated) transliterations; an address if there is an address associated with this feature; road signs that may be associated; and so forth. Each feature is identified with a unique hexadecimal feature id. Features may have additional names besides the primary names. For example in English, street names are often abbreviated (Main St.) and these abbreviations are typically expanded (Main Street) as an additional name. Many Japanese features have pronunciations of the names added as additional names in katakana. Some of these have been carefully hand curated, but many were generated automatically and are therefore potentially errorful, as we will see. Since the katakana version is used as the basis for transliterations into other languages, localized pronunciations for text-to-speech, as well as search suggestions, it is important that it be correct.
We started by extracting from the database all features of a broad (but not exhaustive) set of feature types from a bounding box that covers the four main islands of Japan. We then extracted feature summaries for names that included both original kanji names and katakana renditions. These summaries include the feature name, the hiragana version of the name converted from katakana, and the bounding box for the feature. We then find, for each feature in the feature summaries, a bucket of other features that are within a given radius (10 kilometers in our experiments). Then, for each feature in each bucket, we designate that feature a target feature, and we build neighborhoods around that feature. We attempt, for each feature, to find interesting neighboring features whose names share a kanji bigram with the target feature's name. The intuition here is that a feature that is likely to be useful in determining the pronunciation of another feature should be nearby geographically, and should share at least some of the name. In any case we cap the number of 'non-interesting' neighbors at a limit—5 in our experiments. This means that some neighborhoods will have target features that lack useful neighbors; this is a realistic situation in that while it is often the case that one can find hints for a name's pronunciation in the immediate neighbors, it is not always the case. While such neighborhoods are not useful from the point of view of neighbor-based evidence for a target feature's pronunciation, they still provide useful data for training the target sequence-to-sequence model.
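The construction just described can be summarized in a short sketch. This is a minimal illustration, not the production pipeline: the Feature record, the haversine distance helper, and the kanji Unicode-range test are our own simplifications; the 10 km radius and the cap of 5 'non-interesting' neighbors follow the text.

    from dataclasses import dataclass
    import math

    @dataclass
    class Feature:
        name: str   # original written form (kanji/kana mix)
        pron: str   # hiragana pronunciation
        lat: float
        lng: float

    def distance_km(a: Feature, b: Feature) -> float:
        """Great-circle (haversine) distance between two features."""
        r = 6371.0
        p1, p2 = math.radians(a.lat), math.radians(b.lat)
        dp, dl = p2 - p1, math.radians(b.lng - a.lng)
        h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(h))

    def kanji_bigrams(name: str) -> set:
        """Character bigrams restricted to kanji (CJK unified ideographs)."""
        def is_kanji(ch):
            return '\u4e00' <= ch <= '\u9fff'
        return {name[i:i + 2] for i in range(len(name) - 1)
                if is_kanji(name[i]) and is_kanji(name[i + 1])}

    def build_neighborhood(target, features, radius_km=10.0, max_boring=5):
        """Neighbors within the radius; 'interesting' neighbors share a kanji
        bigram with the target name, and the rest are capped at max_boring."""
        tgt = kanji_bigrams(target.name)
        interesting, boring = [], []
        for f in features:
            if f is target or distance_km(target, f) > radius_km:
                continue
            (interesting if kanji_bigrams(f.name) & tgt else boring).append(f)
        return interesting + boring[:max_boring]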
Our final dataset consists of about 2.7M feature neighborhoods, including the information from the summary for each target feature as described above, the associated neighboring features and their summaries, along with the distance (in kilometers) from the target feature. Figure 2 shows parts of one such neighborhood.

Figure 2: A small example of a neighborhood. The store, circled on the map, has a pronunciation listed as C'est la Vie Sorimachi, but the neighboring areas are Tanmachi and Kamitanmachi. Sorimachi is therefore wrong.
4 Model

Despite the differences noted above, the problem we are interested in can still be characterized at its core as a sequence-to-sequence problem. The input is a sequence of tokens representing the feature name in its original Japanese written form. The output is a sequence of hiragana characters representing the correct pronunciation. The difference between this and a more conventional sequence-to-sequence problem is that we provide additional biasing information in the form of geographical neighbors, such as their pronunciations and geographical locations. The neighbor information is provided as additional input sequences to aid the model in making its prediction. In our experiments, we limit the number of neighbors to at most 30 (it is usually much less than this), each consisting of two sequences, namely, the neighbor's name and the corresponding pronunciation.

4.1 Model Architecture

Due to many recent successes in other NLP applications, we experiment with a transformer model (Vaswani et al., 2017). Our transformer model (Figure 3) uses a standard encoder-decoder architecture as the backbone. The inputs to the model are the input name with unknown pronunciation x_inp, the neighbor names x_name (of length name_len) and associated pronunciations x_pron (of length pron_len). First, these input tokens are embedded with size emb_size. The embed-
dings are then shared between the feature names and the pronunciations. That is, the same embeddings are used for the input name tokens and the neighbor name tokens, and similarly between the target pronunciation (decoder output) and the neighbors' pronunciations:

  emb_inp = Embed_name(x_inp),
  emb_name = Embed_name(x_name),
  emb_pron = Embed_pron(x_pron).

These embedded tokens are then processed separately by the respective encoders. No parameters are shared between these encoders, or with the decoder:

  h_inp = Encoder_inp(emb_inp),
  h_name = Encoder_name(emb_name),
  h_pron = Encoder_pron(emb_pron).

Since each example has n_neigh neighbors, h_inp is of shape [inp_size, emb_size], but the processed neighbor spelling and pronunciation inputs are of size [n_neigh, name_len, emb_size] and [n_neigh, pron_len, emb_size].
One of the simplest ways to incorporate the neighboring information is to concatenate the neighbor name and pronunciation embeddings into the main input sequence, allowing the transformer to attend directly to all the relevant information. Unfortunately, this is not possible with a vanilla transformer with a quadratic attention mechanism if we want to attend to, say, 30 neighbors. In our experiments name_len is set to 20 and pron_len is set to 40, yielding (20 + 40) × 30 = 1800 input tokens, far too many for a vanilla transformer decoder to attend to. To mitigate this we average the encoder outputs to give a single vector per neighbor to attend to:

  s_name = Ave(h_name),
  s_pron = Ave(h_pron),
  c = Concat(h_inp, s_name, s_pron).

The vectors are concatenated along the neighbor dimension to give a sequence of size [inp_len + 2*n_neigh, emb_size]. Optionally, if embeddings representing the latitudinal and longitudinal position of the feature (which we refer to as Lat-Long embeddings, discussed later) are used, then these are also concatenated
here. This input sequence is then concatenated to the encoder output and is attended over by the transformer decoder. There are no positional embeddings added to this sequence, so the entries are unordered from the point of view of decoder attention. Therefore, we help the decoder match the neighbor names to their corresponding neighbor pronunciations by adding source tokens (Johnson et al., 2017) to the sequence. The same source token is added to matching name and pronunciation inputs. The specific hyperparameters used for all the transformer stacks are shown in Table 1.3

Figure 3: The transformer model, showing how the main feature and neighbor features are encoded. Colors for the embeddings and encoders reflect the shared parameters for the transformer model. The example shown is メゾン日本橋 mezon nipponbashi, with some neighboring features 日本橋 nipponbashi, 日本橋西 nipponbashi nishi, and 日本橋東 nipponbashi higashi.
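To make the shapes concrete, here is a sketch of how the neighbor summaries are combined with the encoded target name, following the equations above. The encoders themselves are elided (zeros stand in for their outputs); the dimensions follow the text and Table 1.

    import numpy as np

    emb_size, inp_len, name_len, pron_len, n_neigh = 256, 20, 20, 40, 30

    # Stand-ins for the encoder outputs h_inp, h_name, h_pron.
    h_inp = np.zeros((inp_len, emb_size))
    h_name = np.zeros((n_neigh, name_len, emb_size))
    h_pron = np.zeros((n_neigh, pron_len, emb_size))

    # Average over the token axis: one summary vector per neighbor.
    s_name = h_name.mean(axis=1)   # [n_neigh, emb_size]
    s_pron = h_pron.mean(axis=1)   # [n_neigh, emb_size]

    # Concatenate along the sequence axis; the decoder attends over this.
    c = np.concatenate([h_inp, s_name, s_pron], axis=0)
    assert c.shape == (inp_len + 2 * n_neigh, emb_size)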
Table 1: Hyperparameters for the transformer stacks.

  Beam search size:       8
  Number of layers:       4
  # attention heads:      8
  Token embedding size:   256
  Hidden size:            256
  Dropout:                0.1
  Label smoothing:        0.2
  Lat-Long grid size:     100
  Input vocab size:       4,710
  Output vocab size:      427

3Most feature names can be covered by 3,000 characters (Satō, 1985), so an input vocabulary of 4,710 kanji and other characters is a reasonable size for an industrial-scale maps database.

To combat overfitting, several types of dropout were employed. As in Vaswani et al. (2017) we use input-dropout, where entire input embeddings can be dropped. We further use ReLU-dropout,
dropping activations in the feed-forward layer after applying the ReLU non-linearity. Finally, we use attention-dropout, which is applied to the output of the attention layers. Additionally, dropout is applied to the auxiliary neighbor information, which means that a given neighbor's name or pronunciation, as well as the Lat-Long embedding, has a 10% chance of being dropped entirely in a training example. The model can be configured to use neighbor information or not. We show below that the model benefits from neighbor information if it is available.
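As a sketch of the auxiliary-input dropout just described: each neighbor name, each neighbor pronunciation, and the Lat-Long embedding is dropped wholesale with probability 0.1 during training. The data representation here is our own simplification.

    import random

    def drop_auxiliary(neighbors, lat_long_emb, p=0.1, training=True):
        """neighbors: list of (name, pron) token sequences. Each field is
        dropped wholesale with probability p during training."""
        if not training:
            return neighbors, lat_long_emb
        kept = [(None if random.random() < p else name,
                 None if random.random() < p else pron)
                for name, pron in neighbors]
        return kept, (None if random.random() < p else lat_long_emb)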
4.2 Lat-Long Embeddings

Some neighborhoods lack clues to the pronunciation of the target feature. However, the pronunciation of names is to some extent influenced by region, so the model might be able to deduce the pronunciation if given the latitude/longitude coordinates of the main feature. We thus added embeddings to represent this information. An n by n grid was placed over Japan. Simply assigning a separate embedding to each square would require many embeddings and might slow the training. Also, due to Japan's shape, many embeddings would be in the sea and thus unused. Thus, rather than having n² embeddings, we treated each dimension separately, resulting in 2n embeddings, each of size emb_size/2. The separate longitude and latitude embeddings for a given square are then concatenated together and given to the decoder as an additional auxiliary input. Experiments showed that this configuration both trained faster and reduced overfitting.
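A minimal sketch of this factorized Lat-Long embedding, assuming the grid size n = 100 and emb_size = 256 from Table 1; the bounding-box constants are placeholders roughly covering Japan, not values from the paper.

    import numpy as np

    n, emb_size = 100, 256
    lat_emb = np.random.randn(n, emb_size // 2)   # one vector per latitude band
    lng_emb = np.random.randn(n, emb_size // 2)   # one vector per longitude band

    # Placeholder bounding box roughly covering the four main islands.
    LAT_MIN, LAT_MAX, LNG_MIN, LNG_MAX = 30.0, 46.0, 128.0, 146.0

    def lat_long_embedding(lat, lng):
        """Bucket a coordinate into the n-by-n grid and concatenate the
        per-axis embeddings into a single emb_size vector."""
        i = min(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * n), n - 1)
        j = min(int((lng - LNG_MIN) / (LNG_MAX - LNG_MIN) * n), n - 1)
        return np.concatenate([lat_emb[i], lng_emb[j]])   # shape (emb_size,)

This uses 2n small vectors rather than n² full-size ones, mirroring the design choice described above.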
4.3 Overfitting

One of the main challenges with training the model was overfitting. The reason for this was that it was known that there are incorrect pro-
nunciations in the data and since we wanted to
use the model to find errors, including ones in
the training data, 100% accuracy on the training
set was actually undesirable. A few techniques
were used to combat overfitting. As well as the
heavy use of dropout, label smoothing was set at
0.2, encouraging the model to be less confident
about outliers. Since source tokens were added
to the neighbor information, this made it easier
for the model to memorize locations from their
neighbor arrangements, so to mitigate against this
the neighbors were shuffled within a batch before
being processed by the model.4 Also, care had to
be taken to balance the size of the lat-long grid,
between providing a useful clue to location, Und
allowing memorization of the location if the grid
was too fine.
To assess potential overfitting during training, we created a small golden set of 2,008 high-confidence pronunciations from the human evaluations that we ran while developing the model (Section 6). The distribution of these examples is very skewed with respect to the training data as a whole, since these were all examples where earlier versions of our model disagreed with the pronunciations in the training data. With heavy dropout and label smoothing as described, early stopping was not required: in particular we did not observe the accuracy on the golden set dropping towards the end of training. In contrast, without such techniques the model would usually start to overfit at about 250K steps, whereas with them the models train to a million steps without overfitting, and still get higher accuracies.
5 Experiments and Evaluation

The various configurations of the model, with and without neighbors, were trained on 2,397,154 neighborhoods, for 1 million steps. Before reporting overall performance results, we illustrate the operation of the with-neighbors transformer model with an example in which the model detects a case where the data is incorrect. The feature, メゾン日本橋 Mezon Nipponbashi, is an apartment building in the Nipponbashi district of Osaka. The problem is that 日本橋 is also a part of Tokyo, pronounced Nihonbashi, and being more famous, is arguably the ''default'' pronunciation.
4This is not to be confused with the shuffling of neighborhoods introduced below in Section 5.2.
The pronunciation of this feature was presumably originally populated by a method that did not take geographical context into account. In Figure 4 we show the feature, the pronunciation as found in the database, the hypothesized (correct) pronunciation, and the neighbors that the model attended to when hypothesizing the feature's pronunciation. The example introduced in Figure 2 is also correctly predicted by the model as seravi tanmachi.

Figure 4: An error in the original data: the Mezon Nipponbashi apartments in Osaka, circled in red on the map. Highlighted in green shaded areas are the neighboring features Nipponbashi Higashi and Nipponbashi Nishi.
In the remainder of this section we present two types of evaluation. First we introduce a non-neural baseline (Section 5.1). In Section 5.2, we present error rates on held-out data for several versions of the model, the non-neural baseline, and a separate RNN model that has been used for more general text-normalization applications. We show that the with-neighbors transformer model has by far the best performance. In Section 5.3 we delve a bit deeper into the effect of Lat-Long features, as well as details of the performance on the golden set. However, it is also important to show that the model is indeed learning to attend to relevant features in the neighborhood. We present evidence of this in Section 5.4. In Section 5.5 we discuss the important question: how often does a prediction error make the target name incomprehensible?

Moving beyond Section 5, as noted earlier the maps data has errors, meaning that in a small percentage of the cases where the hypothesis of the model differs from what is in the database, the database is in fact incorrect. The main practical application of the model is finding and correcting these sorts of errors. Determining which discrepancies are errors and which are not requires human evaluation, and we report results on this in Section 6. Finally, since manual evaluation is expensive, we would like to be able to decide automatically when we can be confident that a discrepancy should be judged in the model's favor: this is the topic of Section 7. In Section 8, we demonstrate an application of the model to a totally different problem.

5.1 Baseline System

As a baseline for comparison we used a proprietary state-of-the-art Japanese text-normalization system to produce pronunciations. The system constructs a lattice using a dictionary and rules, and uses linear models to score paths and Viterbi search to select the best path through the lattice.

This system converts an input feature name to its reading and does not make use of neighbor information. To simulate the use of neighbor information, we first aligned neighbor names with their readings using a kanji-to-hiragana aligner that is part of the text-normalization system introduced above. For example, a neighbor name would be aligned to its hiragana reading kanji by kanji: shika, gai, michi, shita. We then collect statistics on all kanji substrings and their hiragana readings, and keep the most common reading of each substring. Finally, we find the longest span(s) in the target name that match against the substrings collected from the neighbors, and replace the corresponding portion of the name's reading, as computed by the text-normalizer, with the reading found from the neighbors. Thus if the text-normalizer produces for a target name the incorrect reading shishi kai michi ue, the method might correct that to shika gai michi ue.
System                    ±Neigh   Shuffled   Unshuffled   Golden   Params   Steps
Baseline                  −        0.199      0.198        0.502    n/a      n/a
Baseline                  +        0.179      0.179        0.396    n/a      n/a
RoadRuNNer                −        0.129      0.131        0.442    7.6M     1M
Transformer               −        0.102      0.103        0.381    6.56M    1M
Transformer               +30      0.0862     0.088        0.367    9.74M    1M
Transformer + Lat-Long    −        0.0892     n/a          0.332    6.58M    1M
Transformer + Lat-Long    +30      0.0867     n/a          0.341    9.76M    1M

Table 2: Error rates for the non-neural baselines, RoadRuNNer (without neighbors), the without-neighbors transformer model, and the with-neighbors transformer model on the test data sets. For the with-neighbors transformer, 30 neighbors were used, hence +30 in the table.
5.2 Quantitative Evaluation

We evaluated the model on a held-out test set consisting of about 138K neighborhoods, comparing four models: the non-neural baseline (Section 5.1), the with-neighbors and without-neighbors transformer models, and another sequence-to-sequence model, the RoadRuNNer RNN-based neural text normalization system (Zhang et al., 2019), trained on the same data. For RoadRuNNer, the checkpoint with the best string error rate on training was used in evaluation. Note that the RoadRuNNer system has no access to the neighbor information, and thus serves as a baseline sanity check for a sequence-to-sequence model for pronouncing the feature names in the absence of any information about other names in the geographical neighborhood. We also analyze the effects of including Lat-Long embeddings.
We prepared the train-test split in two different ways; in the first, which we refer to as shuffled, we sample features uniformly across Japan when constructing the two sets. In the second, which we refer to as unshuffled, the held-out set is drawn from non-overlapping areas of Japan, such that features in the test set are from areas that the model will not have seen during training. Clearly the Lat-Long embeddings cannot be used in the latter case, since the embeddings for the test area would not be trained. Here, the point was to verify that the model is still able to generalize by making use of neighbors, in neighborhoods from parts of the country the model will not have seen before. This provides further evidence, in addition to what we discuss in Section 5.4, that the model is learning to use the neighbor information. In practice we use the shuffled set for training and generating corrections in the data (Section 6).
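A sketch of the two splits, reusing the Feature record from the earlier sketch; the test fraction and the geographic cell size are illustrative assumptions, not values from the paper.

    import random

    def shuffled_split(features, test_frac=0.05, rng=random.Random(0)):
        """Sample features uniformly: shuffle, then cut off a test fraction."""
        feats = list(features)
        rng.shuffle(feats)
        cut = int(len(feats) * (1 - test_frac))
        return feats[:cut], feats[cut:]

    def unshuffled_split(features, test_frac=0.05, cell_deg=0.5,
                         rng=random.Random(0)):
        """Hold out whole geographic cells so test areas are unseen in training."""
        def cell(f):
            return (int(f.lat / cell_deg), int(f.lng / cell_deg))
        cells = sorted(set(cell(f) for f in features))
        rng.shuffle(cells)
        held = set(cells[:max(1, int(len(cells) * test_frac))])
        train = [f for f in features if cell(f) not in held]
        test = [f for f in features if cell(f) in held]
        return train, test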
Again, when we speak of error rates on this dataset, we know, as discussed above, that there are incorrect transcriptions, and that therefore there are some cases where the model actually predicts the correct transcription, but is penalized because the ground ''truth'' contains an error. Nevertheless, while these errors are frequent enough to be worth using our method to correct them (Section 6), they are still in the minority of cases; the majority of the time, what is in the data set is correct, which in turn means that one can usefully compare different methods.
Error rates are given in Table 2. For the shuffled data, the error rate of the without-neighbors baseline system (Section 5.1) was 19.9%, which is quite high but reflects the difficulty of the task of reading names of geographical features in Japanese, for which the system was not particularly tuned. Using neighbors (see, again, Section 5.1) we can reduce this to 17.9%, a 2-point absolute reduction. While this reinforces the point that neighbors are useful for predicting the pronunciation of a target name, the overall error rates are high. RoadRuNNer outperforms the baseline, with 12.9% error on the shuffled data. The without-neighbors version of the transformer model (10.2%) outperforms RoadRuNNer by 2.7 points absolute, with the with-neighbors transformer reducing the error rate by a further 1.6 points.

Is this reduction significant? Given the Central Limit Theorem for Bernoulli trials (Grinstead
and Snell, 1997, p. 330), the 95% confidence interval is given as ±√(p(1−p)/N), where N is the number of trials. With N = 132,753 and p = 0.102 for the without-neighbors transformer model and p = 0.0862 for the with-neighbors transformer model, the confidence intervals are [0.1012, 0.1028] and [0.0854, 0.0870], respectively. These do not overlap, suggesting that the differences are significant. We further compared the two models using paired bootstrap resampling (Koehn, 2004), where for each of the 10,000 trials we randomly drew, with replacement, N/2 elements from the original test set and computed accuracies. This method also indicates the superiority of the with-neighbors model for the nominal significance level α = 0.05 with p < α and non-overlapping 95% confidence intervals [0.100, 0.104] for the without-neighbors and [0.084, 0.088] for the with-neighbors models. Finally, we also confirm the statistical significance by performing the paired permutation test (Good, 2000) using a t-statistic, which for 5,000 permutations yields p = 0.0003 for α = 0.05, where p < α.
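The interval computation above is easy to verify; a quick sketch using the half-width √(p(1−p)/N) around the observed error rate p:

    import math

    def interval(p, n):
        half = math.sqrt(p * (1 - p) / n)
        return round(p - half, 4), round(p + half, 4)

    N = 132_753
    print(interval(0.102, N))    # without neighbors: (0.1012, 0.1028)
    print(interval(0.0862, N))   # with neighbors:    (0.0854, 0.087)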
As expected, all the models perform worse on the unshuffled data, because in that case the test data is more dissimilar to the training data, since it is drawn from different regions of the country. Still, the with-neighbors transformer model gives a significant drop in error rate, reinforcing the point that the model uses neighbors when available.

5.3 Lat-Long and Golden Set

Figure 5: Model accuracies for different numbers of max neighbors, with and without latitudinal and longitudinal embeddings (shuffled test set).

Figure 5 shows the effect of adding Lat-Long embeddings for different numbers of neighbors. We see that for the zero-neighbor (= without-neighbors) model the Lat-Long embeddings give a significant boost to the accuracy, as one might expect, but as we add more neighbors the benefit appears to diminish, and after about 10 neighbors it seems to hurt performance. This is likely due to overfitting, as the extra information makes it easier to memorize a location.

Surprisingly, for the golden set (Table 2, column 5), despite neighbors lowering the error, and the addition of the Lat-Long embeddings lowering it further, the lowest error is achieved by adding Lat-Long embeddings only. We believe that this is due to the different distribution of the examples in the dataset and, again, the effect of more information allowing overfitting. In practice we keep both the Lat-Long embeddings and the neighbor information for decoding potential corrections, since the results seem qualitatively better.

5.4 The Model Attends to Neighbors

For further confirmation that the model attends to the neighbors, we created artificial data using features containing seven name spellings that have (at least) two pronunciations. To create these seven test sets, we started with real neighborhoods and manipulated them as follows. We focused here on the two most common pronunciations and designated the more common pronunciation as 'P1' and the other as 'P2'. For example, consider toponyms spelled 神戸, primary pronunciation (P1) koube, secondary pronunciation (P2) koudo. Features containing that kanji spelling include 神戸電池 koube denchi 'Kobe Battery', and 神戸大橋 koudo ōhashi 'Koudo Bridge'. We decoded our set under three conditions: (1) leaving the pronunciations of the neighbors alone; (2) changing all relevant portions of a neighbor's name, for example 神戸 in 神戸電池, to have a P1 pronunciation (koube) no matter what the original pronunciation was; (3) similarly changing all relevant portions to P2 (koudo) no matter what the original pronunciation was. The with-neighbors transformer model was then used to decode the target feature, and we measured the proportion of times P1 was decoded under the various (possibly artificial) conditions. The results of this experiment are shown in Table 3. As can be seen in the table, the proportion of P1 is always affected by artificially manipulating the neighbors, though more dramatically so in some cases than others. The signal for the pronunciation yamato for 大和 is evidently very strong compared to taiwa, so that it is very hard to override it with evidence from the neighbors. On the other hand, 日本橋 nihon/nipponbashi is easily influenced by the pronunciations of the neighbors. In all cases the neighbors influence the results in the expected direction. This small experiment thus provides further evidence that the model is paying attention to the pronunciations of the neighbors in computing its decision on the pronunciation of a target feature.
Name     # exx.   P1           P2            Original   Neigh. → P1   Neigh. → P2
日本橋   1,110    nihonbashi   nipponbashi   0.86       1.0           0.04
三郷     790      misato       sangou        0.79       0.83          0.06
佐伯     420      saeki        saiki         0.66       0.86          0.58
小平     780      kodaira      obira         0.91       0.93          0.62
神戸     4,639    koube        koudo         0.96       0.97          0.83
渋谷     1,360    shibuya      shibutani     0.98       0.99          0.85
大和     4,670    yamato       taiwa         0.86       0.86          0.85

Table 3: Synthetic examples demonstrating that the system pays attention to the neighbors. Columns: relevant kanji spelling of the target feature; number of target features; primary pronunciation; secondary pronunciation; proportion of decodings of the primary pronunciation with unchanged data; proportion of decodings of the primary pronunciation when the data are changed as in (2) in the text; proportion of decodings of the primary pronunciation when the data are changed as in (3).
Further evidence can be seen in visualizations of the transformer attention to neighbors' pronunciations. Figure 6 shows average attention weights over all layers and attention heads. The neighbors from Figure 2 correspond to neighbors 5 and 6 here. When decoding the last four characters in tanmachi, the model is attending to the neighbors that contain this sequence.

Figure 6: Visualization of transformer attention for the example in Figure 2, with neighbor positions 5 and 6 corresponding to the two neighbors highlighted in that figure. Note the higher attention (darker blue) in the lower right, corresponding to tanmachi in the neighbors.
5.5 Detailed Error Analysis: How Bad Are the Errors When the Model Gets It Wrong?

As a reviewer for an earlier version of this paper pointed out, reading 日本橋 as Nihonbashi as part of a feature name in Osaka (correct pronunciation: Nipponbashi) is wrong, but the hearer would likely still be able to understand the intended feature. It should be no worse than reading Houston (Street) in New York with the Texas pronunciation. A reasonable question is what proportion of the errors that the model makes are similarly 'recoverable', in the sense that the hearer will be able to understand the intended referent. To that end we took a random sample of 60 errors made by the best performing model (the with-neighbors transformer model trained on shuffled data; Table 2, row 6) and compared them to the reference transcriptions from the maps database. The third author, a native speaker of Japanese, evaluated how many of these seemed recoverable in the sense above. Of the 60, 48 were deemed to be recoverable,
whereas the other 12 either seemed not to be recoverable or were unclear. An example of a recoverable error is a feature where the reference transcription is Hachimanda whereas the model predicts Hachimanden. This hinges on the pronunciation of the final kanji using the native (ta/da) pronunciation versus the Sino-Japanese (ten/den) pronunciation; both pronunciations are in principle possible. Another recoverable example is a feature where the reference is Hirachō, but the model predicts Hiramachi. Again this hinges on the native (machi) versus Sino-Japanese (chō) pronunciation of the final character. This latter case is particularly hard even for native speakers to get right, since the pronunciation of 町 'town' as chō or machi is not predictable and must be memorized for each place name.
An example of an unrecoverable error is Rōje Asao,5 which the model predicts as Rōje Asabu. In this case, the difference hinges on two native ways to read the same kanji, but here the predicted Asabu is potentially confusing. While the feature in question is an apartment building in Sapporo, a hearer familiar with Tokyo is likely to confuse it with 麻布 Azabu, a well-known area of Tokyo. A more dramatic example is a part of Kyoto where the correct pronunciation is Hitsujisaruchō, whereas the model predicted Konchō. Once again this hinges on a native (hitsujisaru) versus Sino-Japanese (kon) pronunciation, in this case for the first character.

So in 80% of the cases, even though the model picks an inappropriate pronunciation, the result is still recoverable. For the remaining 20%, the model did not produce random unrelated pronunciations, but rather theoretically possible pronunciations—indeed, errors that a person not familiar with the area might make—but where the pronunciation was deemed too far off to be recoverable. However, we want to stress that in general whether a possible but incorrect pronunciation of a Japanese place name is recoverable or not is an issue that can only be properly answered by a more rigorous study of users in real-life situations.

5Apparently for L'Osier Asao.

6 Finding Mistakes in Maps Data

An important application of the model is to find potential errors in the database, and flag them
for possible human correction. To that end, using the with-neighbors transformer model trained with all features, we ran decoding on the entire data set, including the training and held-out portions, and identified cases where the model hypothesized a different pronunciation from what was in the reference transcription. In order to focus on the cases of interest, we further filtered these by considering only neighborhoods where some neighbors have spellings that share substrings with the target feature's spelling, and pronunciations that share substrings with the hypothesized pronunciation. This yielded a set of 18,898 neighborhoods that had some discrepancy per the model. Especially for the training portion, it is likely that the model learned whatever pronunciation was in the database, even if it was wrong, so we are likely missing a lot of neighborhoods that have errors: we do not, therefore, know the recall of the system. In what follows, we consider the precision, based on a manual analysis by human raters.

Preliminary analysis of the output revealed that many of the discrepancies involved establishments, which include buildings and other man-made features, including things like bus stops. These often contain a location name as part of the name. For example, a Family Mart convenience store might be named Family Mart Kobashi Station Square Store, with the issue being whether the location name Kobashi is correctly pronounced in the establishment name.

Three raters6 manually checked pronunciations for 1,056 features, including 555 establishment features. Raters were given links to the feature on Google Maps, and were asked to verify which pronunciation was correct, or give an alternative if neither was correct. Evaluators had to provide 'proof' of their answers, of which the following were considered acceptable: (a) the official website of the location, or the Japan Post website; (b) a screenshot from Street View showing the pronunciation (e.g., from a road sign); (c) a Wikipedia page with sufficient appropriate references. Raters were asked not to use other sources.

6All raters employed in this study are paid linguistic consultants hired through a third-party vendor.

Overall, the raters found that the model correctly detected that there was a potential problem with the reference data 63% of the time. The
remaining 37% of the features were actually correct, despite the model having hypothesized a different pronunciation. The 63% of cases with problems broke down as follows: in 36% (absolute) of the cases, the hypothesized replacement pronunciation was correct, in 11% both were wrong (meaning that the model detected a problem, but found the wrong solution), and in 15% of the cases, the rater was unable to verify the answer (which suggests that the feature may need to be checked further). In some categories, such as 'compound building', the hypothesized pronunciation was correct (and the reference pronunciation wrong) 80.7% of the time.

Figure 7 shows the results of a manual analysis of the 555 establishments by an independent rater. The establishments represent a range of ''impressions'', with some appearing frequently in searches, others less so. The rater found that for the majority of cases (55.6%), the data in maps was incorrect: 39.2% where the hypothesized alternative is the correct one; and 16.4% where both what is in the data and the hypothesis are wrong, but where the system has detected a problem with the data. The rater was unsure about a further 9.2%, constituting a further set that should be checked by an expert. (A small percentage of the establishments have closed since the database was created.) Thus only about a third of the establishments selected were correct in the database.

Figure 7: Results of a manual evaluation of 555 establishment features. See the text for an explanation.

Mixed sample
        N     +Maps   +Maps/N   +Hyp   +Hyp/N
  Trn   401   149     0.37      137    0.34
  Tst   100   49      0.49      24     0.24

Establishments only
        N     +Maps   +Maps/N   +Hyp   +Hyp/N
  Trn   473   161     0.34      189    0.40
  Tst   82    29      0.35      35     0.43

Table 4: Comparison of two sets of hand-checked features, showing the cases where either the maps data (+Maps) or the hypothesis (+Hyp) was correct, broken down into whether the feature in question was in the training (Trn) versus the held-out (Tst) data. N is the total size of each set.
Table 4 gives a breakdown of the two hand-checked samples, considering only cases where either the data already in maps was deemed correct, or the hypothesized replacement was deemed correct. In general the hypothesized corrections had higher accuracies, and the maps data lower accuracies, in the establishment set than in the mixed set. Also, the model seems to be making better predictions for the training portion than the held-out portion. Indeed, for the establishments, the hypothesis is more often right for the training portion of the data than what is in the original training data. While the model probably memorizes aspects of the training data, it can still notice discrepancies even in neighborhoods it has seen before.
One point that will be clear from the above is that just because there is a discrepancy between the pronunciations of a target feature and the neighboring features does not mean that the target is wrong. Indeed, there are systematic types of features that frequently involve such discrepancies. One such class is train stations, which are notoriously difficult in that they are frequently pronounced differently from the name of the town in which they are located (Imao, 2020). Thus the station that serves Kowakudani is Kowakidani eki. Station names were often established during the Meiji Period, and reflect older pronunciations for nearby toponyms.
7 Automatic Data Correction
Unfortunately the model
is not yet accurate
enough to use it to automatically fix discrepancies
for all features. Among the 1,056 manually analyzed features, the original data was correct 37% of the time, and the model 35%, meaning that simply substituting the model's hypothesis would result in a small net loss in accuracy. However, we also saw that the model was more accurate than the reference data for some classes of features, meaning on average the accuracy should increase if we replaced the data in those cases. We have also investigated filtering the data based on metrics extracted from the model itself. For example, we considered decoding entropy as a measure of confidence, the log likelihoods of the beam search outputs, and the relative amount of attention that the attention layers were giving the neighbor summary. Thus far, the most informative measure is the difference between the top two beam search decoding log likelihoods. Our interpretation of this is that if there is a large difference in confidence between the two beams then there is little ambiguity in how the model thinks the name should be pronounced, and thus we can be more confident in the top candidate being correct. Figure 8 shows the ROC curve for applying this metric to features from the golden set. It shows an area under the curve of almost 0.7, a clear positive signal, and the Precision-Recall curve shows that we are able to achieve an accuracy of about 80% for about 50% of the data, which still represents a large number of high-confidence corrections.

Figure 8: ROC curve (top) and Precision/Recall curve (bottom), for thresholding results on the difference between beam search scores.
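A sketch of this confidence filter, assuming we have the beam-search log likelihoods for each candidate correction; the threshold would be tuned against a curve like the one in Figure 8.

    def beam_margin(log_likelihoods):
        """log_likelihoods: beam-search scores for one feature's hypotheses."""
        if len(log_likelihoods) < 2:
            return float('inf')
        top = sorted(log_likelihoods, reverse=True)
        return top[0] - top[1]

    def confident_corrections(candidates, threshold):
        """candidates: (feature, hypothesis, beam_scores) triples where the
        hypothesis disagrees with the database entry."""
        return [(feat, hyp) for feat, hyp, scores in candidates
                if beam_margin(scores) >= threshold]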
8 Cognate Reflex Prediction Task
List et al. (2022) present the ACL SIGTYP shared
task on the problem of cognate reflex prediction.
Cognate reflex prediction is best understood by
example. English, Dutch, and German are closely
related West Germanic languages that share many
cognate words. For example, English dream
corresponds to droom in Dutch and Traum in
German. If one now considers English tree, the
words that correspond in meaning to this in Dutch
and German are boom and Baum, respectively.
These are apparently from the same etymon, but
the English word is not. What should an English
cognate look like? On analogy with dream, one
would predict the form to be beam. Indeed, while
beam’s meaning has shifted, it is in fact related
to boom and Baum. In the SIGTYP task, par-
ticipants were presented with data from several
language families, where the task was to recon-
struct what the cognate forms for particular etyma
would be, given examples in a subset of the sister
languages.
Kirov et al. (2022) report the results of applying
two models to this task, one being a model based
on image inpainting (Liu et al., 2018), and the
second being a variant of the neighbors model
presented in this paper. The cognate reflex predic-
tion problem is similar in spirit to the geographical
feature reading task, where we replace ‘‘neighbor
reading’’ with the form of a cognate in a related
language, and ‘‘target reading’’ with the form to
be predicted. As for the ‘‘spellings’’, we replace
these with a string representing the name of the
language associated with each of the neighboring
cognates and with the target. Table 5 summarizes
the parallels between the two tasks. The model
used by Kirov et al. (2022) differed slightly from
the version reported above in that the language
identifiers and cognate forms are interleaved and
then concatenated together and attended to di-
rectly by the decoder without any averaging, and
source token ids are added to each cognate in the
set.

Cognate reflex           Geo. name task
target form           ←  main feature 1 pron.
target lang. L1       ←  main feature 1 name
related form          ←  neigh. feature 2 pron.
related lang. L2      ←  neigh. feature 2 name
related form          ←  neigh. feature 3 pron.
related lang. L3      ←  neigh. feature 3 name
related form          ←  neigh. feature 4 pron.
related lang. L4      ←  neigh. feature 4 name
...                      ...

Table 5: Parallels between the cognate reflex prediction and the geographical name reading prediction tasks. ''L1'' and so forth in the first column represent the names of the languages in the set.

This allows the model to better attend to
the individual cognate and to copy (portions of)
the cognate as needed. Also, since the data sets
for the cognate reconstruction task are small, a
smaller transformer configuration was used. Even
so, the provided data sets were too small, so Kirov
et al. (2022) augmented the data in two ways.
First, the data were augmented by copying neigh-
borhoods while randomly removing neighbors,
thus making new neighborhoods for the same
cognate set. Second, synthetic cognate sets were
generated for each of the ‘‘neighbor’’ languages
and the target using simple n-gram models trained
on the provided data.
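A sketch of the first augmentation strategy (copying cognate sets while randomly removing related forms); the dictionary layout, copy count, and drop rate are illustrative assumptions, not the settings used by Kirov et al. (2022).

    import random

    def augment(cognate_set, n_copies=3, drop_p=0.3, rng=random.Random(0)):
        """cognate_set: {'target': (lang, form), 'related': {lang: form}}.
        Copy the set while randomly deleting related forms, yielding new
        training 'neighborhoods' for the same target."""
        copies = []
        for _ in range(n_copies):
            kept = {lang: form for lang, form in cognate_set['related'].items()
                    if rng.random() > drop_p}
            if kept:   # keep at least one related form
                copies.append({'target': cognate_set['target'], 'related': kept})
        return copies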
The two systems developed by Kirov et al.
(2022) achieved the top ranking in the shared
task. In general, the better performing of the two
was the inpainting model, but on some language
families, such as Semitic, the neighbors model out-
performed the inpainting model. Table 6, adapted
from List et al.’s (2022) Table 4, shows the re-
sults for two baselines, the inpainting model, three
versions of the neighbors model—30K, 35K, and
100K training steps—and three other competing
systems. The rank in the final column is aggre-
gated over the normalized edit distance (NED),
B-cubed F-scores, and BLEU. The inpainting
model and the neighbors model were the only
two systems that overall outperformed the SVM
baseline. The fact that the 30K neighbors model
worked better than higher numbers of training
steps can likely be attributed to overtraining.
These results suggest that the model we have
presented in this paper has potential applications outside the main task we have reported here.

System            Rank   NED   B-Cubes   BLEU   Aggregated
Inpainting        1      1     1.2       1      1.1 ± 0.3
Neighbors 30K     2      2.6   3         2.6    2.7 ± 0.4
Neighbors 35K     3      2.4   4         2.4    2.9 ± 0.9
SVM Baseline      4      5.2   4         5      4.7 ± 1.9
Neighbors 100K    5      4.6   6.6       4.6    5.3 ± 1.3
System 2          6      6     7         6.2    6.4 ± 1.1
System 3          7      7.6   4         7.6    6.4 ± 2.5
CORPAR Baseline   8      6.8   6.2       6.8    6.6 ± 0.8
System 4          9      8.8   9         8.8    8.9 ± 0.4

Table 6: Average ranks of systems in the SIGTYP 2022 Shared Task along with aggregated ranks.
9 Discussion
In this paper we have presented a novel architec-
ture for the difficult task of pronouncing Japanese
geographical features that learns to use pronun-
ciations of features in a local neighborhood as
hints. We have shown via various means that the
model pays attention to the neighboring features,
and that therefore the model has learned what we
intended it to learn: That in order to pronounce a
name, it is often useful to consider how neighbors
are pronounced. We also conducted manual eval-
uations showing that for some classes of features,
model hypotheses differing from pronunciations
in the database could be as high as 80% correct.
Our results are currently being used to correct
errors in Google Maps. In future work we also
plan to extend the coverage of the model beyond
Japan. While Japanese place names are particu-
larly difficult, we noted in the Introduction that
there are similar problems in other regions. One
problem that comes up in the United States, for
example, is nonce abbreviations for certain fea-
tures. For example if one looks in Google Maps
in Shreveport, LA, one will run across the weirdly
abbreviated Sprt Bkdl Hwy Srv Dr. Out of context
this is virtually uninterpretable, but if one looks
at nearby features one will find the Shreveport
Barksdale Hwy. From this and other information
one can deduce that the mysteriously named fea-
ture must be the Shreveport Barksdale Highway
Service Drive.
Besides geographical names, there are other
problems to which a similar approach can be ap-
plied. The neighbor model can be thought of as an
auxiliary memory, to be consulted or not depend-
ing on the decision being made. We discussed
one possible application of this conceptualiza-
tion to the task of cognate reflex prediction in
Section 8.
A further extension of the idea is joint correction of features in a neighborhood. If most of the pronunciations of 日本橋 in a neighborhood are nipponbashi, then one could consider correcting all cases in the neighborhood where the pronunciation is listed as nihonbashi, not just the main feature. Note that this is somewhat similar in spirit to work on collective classification (Sen et al., 2008).
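A sketch of this joint-correction heuristic under a simplified representation, where each feature in a neighborhood carries the reading it assigns to the shared spelling; the 80% dominance threshold is an illustrative assumption.

    from collections import Counter

    def flag_minority_readings(features, min_share=0.8):
        """features: (name, reading) pairs for one neighborhood, where
        'reading' is the pronunciation assigned to the shared spelling.
        Flag readings that disagree with a dominant majority reading."""
        if not features:
            return []
        readings = Counter(r for _, r in features)
        top, count = readings.most_common(1)[0]
        if count / len(features) < min_share:
            return []   # no dominant reading; nothing to flag
        return [(n, r) for n, r in features if r != top]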
Finally, it is also worth noting that while our
work has been with a proprietary maps database,
there are open-source maps datasets such as
OpenStreetMap (Haklay and Weber, 2008), which
likely have at least as many problematic issues as
the database we used. The techniques we de-
scribe in this paper could be applied to improving
such data.
Acknowledgments
We thank three anonymous reviewers of previous
versions of this paper for detailed feedback. We
also thank Jesse Rosenstock for help with the code
that extracts neighborhoods.
References
William Chan, Navdeep Jaitly, Quoc V. Le, and
Oriol Vinyals. 2016. Listen, attend and spell:
A neural network for large vocabulary conver-
sational speech recognition. In Proceedings of
2016 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP),
pages 4960–4964, Shanghai, China.
IEEE.
https://doi.org/10.1109/ICASSP.2016
.7472621
Falcon Dai and Zheng Cai. 2017. Glyph-aware
embedding of Chinese characters. In Proceed-
ings of the First Workshop on Subword and
Character Level Models in NLP, pages 64–69,
Copenhagen, Denmark. Association for Com-
putational Linguistics.
Phillip Good. 2000. Permutation Tests: A Practi-
cal Guide to Resampling Methods for Testing
Hypotheses, 2nd edition. Springer Series in
Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-3235-1_3
Kyle Gorman, Gleb Mazovetskiy, and Vitaly
Nikolaev. 2018. Improving homograph disam-
biguation with supervised machine learning. In
Proceedings of the Eleventh International Con-
ference on Language Resources and Evalua-
tion (LREC 2018), pages 1349–1352, Miyazaki,
Japan. European Language Resources Associa-
tion (ELRA).
Charles Grinstead and J. Laurie Snell. 1997.
Introduction to Probability, 2nd edition. Amer-
ican Mathematical Society, Providence, RI.
Mordechai Haklay and Patrick Weber. 2008.
OpenStreetMap: User-generated street maps.
IEEE Pervasive Computing,
7(4):12–18.
https://doi.org/10.1109/MPRV.2008.80
Keisuke Imao. 2020. Ekimei-gaku Nyūmon (An Introduction to the Study of Station Names). Chuokoron-Shinsha, Tokyo.
Melvin Johnson, Mike Schuster, Quoc V.
Le, Maxim Krikun, Yonghui Wu, Zhifeng
Chen, Nikhil Thorat, Fernanda Viégas, Martin
Wattenberg, Greg Corrado, Macduff Hughes,
and Jeffrey Dean. 2017. Google’s multi-
lingual neural machine translation system:
Enabling zero-shot translation. Transactions of
the Association for Computational Linguistics,
5:339–351. https://doi.org/10.1162
/tacl_a_00065
Riku Kawamura, Tatsuya Aoki, Hidetaka
Kamigaito, Hiroya Takamura, and Manabu
Okumura. 2020. Neural
text normalization
leveraging similarities of strings and sounds.
In Proceedings of
the 28th International
Conference on Computational Linguistics,
pages 2126–2131, Barcelona, Spain (Online).
International Committee on Computational
Linguistics. https://doi.org/10.18653/v1
/2020.coling-main.192
Christo Kirov, Richard Sproat, and Alexander
Gutkin. 2022. Mockingbird at
the SIGTYP
2022 shared task: Two types of models for
the prediction of cognate reflexes. In Pro-
ceedings of the 4th Workshop on Research
in Computational Typology and Multilin-
gual NLP (ACL SIGTYP), pages 70–79,
Seattle, Washington. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2022.sigtyp-1.9
Philipp Koehn. 2004. Statistical significance tests
for machine translation evaluation. In Pro-
ceedings of the 2004 Conference on Empiri-
cal Methods in Natural Language Processing,
pages 388–395, Barcelona, Spain. Association
for Computational Linguistics.
Johann-Mattis List, Ekaterina Vylomova, Robert
Forkel, Nathan Hill, and Ryan Cotterell. 2022.
The SIGTYP 2022 shared task on the predic-
tion of cognate reflexes. In Proceedings of the
4th Workshop on Research in Computational
Linguistic Typology and Multilingual NLP,
pages 52–62, Seattle, Washington. Association
for Computational Linguistics.
Guilin Liu, Fitsum A. Reda, Kevin J. Shih,
Ting-Chun Wang, Andrew Tao, and Bryan
Catanzaro. 2018. Image inpainting for irregular
holes using partial convolutions. In Proceedings of the 15th European Conference on
Computer Vision (ECCV 2018), pages 89–105,
Munich, Germany. Springer
International
Publishing. https://doi.org/10.1007
/978-3-030-01252-6_6
Courtney Mansfield, Ming Sun, Yuzong Liu,
Ankur Gandhe, and Björn Hoffmeister. 2019.
Neural text normalization with subword units.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 2 (Industry Pa-
pers), pages 190–196, Minneapolis, Minnesota.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N19
-2024
Minh Nguyen, Gia H. Ngo, and Nancy F.
Chen. 2020. Hierarchical character embeddings:
Learning phonological and semantic represen-
tations in languages of logographic origin using
recursive neural networks. IEEE/ACM Trans-
actions on Audio, Speech, and Language Pro-
cessing, 28:461–473. https://doi.org/10
.1109/TASLP.2019.2955246
Yuji Ogihara. 2021. I know the name well, but
cannot read it correctly: Difficulties in read-
ing recent Japanese names. Humanities and
Social Sciences Communications, 8(1):1–7.
https://doi.org/10.1057/s41599-021
-00810-0
Kyubyong Park and Seanie Lee. 2020. g2pM: A
neural grapheme-to-phoneme conversion pack-
age for Mandarin Chinese based on a new open
benchmark dataset. In Proceedings of Inter-
speech 2020, pages 1723–1727, Shanghai,
China. International Speech Communication
Association. https://doi.org/10.21437
/Interspeech.2020-1094
Ben Peters, Jon Dehdari, and Josef van Genabith.
2017. Massively multilingual neural grapheme-
to-phoneme conversion. In Proceedings of the
First Workshop on Building Linguistically
Generalizable NLP Systems, pages 19–26,
Copenhagen, Denmark. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/W17-5403
Subhojeet Pramanik and Aman Hussain. 2019.
Text normalization using memory augmented
neural networks. Speech Communication,
109:15–23. https://doi.org/10.1016
/j.specom.2019.02.003
Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao.
2018. Deep context: End-to-end contextual
speech recognition. In Proceedings of 2018
IEEE Spoken Language Technology Work-
shop (SLT), pages 418–425, Athens, Greece.
IEEE. https://doi.org/10.1109/SLT
.2018.8639034
Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-
phoneme conversion using Long Short-Term
Memory recurrent neural networks. In Proceed-
ings of 2015 IEEE International Conference
on Acoustics, Speech and Signal Processing
(ICASSP), pages 4225–4229, South Brisbane,
Australia. IEEE.
Norihiko Satō. 1985. Chimei hyōki to JIS kanji (On the expression of place names and JIS Chinese characters). Suirobu Kenkyū Hōkoku (Hydrographic Department Research Report), 20:167–180.
Prithviraj Sen, Galileo Namata, Mustafa Bilgic,
Lise Getoor, Brian Gallagher, and Tina Eliassi-
Rad. 2008. Collective classification in network
data. AI Magazine, 29(3):93–106. https://
doi.org/10.1609/aimag.v29i3.2157
Yi Shi, Congyi Wang, Yu Chen, and Bin Wang. 2021. Polyphone disambiguation
in Mandarin Chinese with semi-supervised
learning. In Proceedings of Interspeech 2021,
pages 4109–4113, Brno, Czech Republic. International Speech Communication Association (ISCA). https://doi.org/10.21437
/Interspeech.2021-502
Richard Sproat and Navdeep Jaitly. 2017. An
RNN model of text normalization. In Proceed-
ings of Interspeech, pages 754–758, Stockholm,
Sweden. International Speech Communication Association (ISCA). https://doi
.org/10.21437/Interspeech.2017-35
Shubham Toshniwal and Karen Livescu. 2016.
Jointly learning to align and convert graphemes
to phonemes with neural attention models.
In 2016 IEEE Spoken Language Technology
Workshop (SLT), pages 76–82, San Diego, CA,
USA. IEEE. https://doi.org/10.1109
/SLT.2016.7846248
Oanh Thi Tran and Viet The Bui. 2021. Neu-
ral text normalization in speech-to-text systems
with rich features. Applied Artificial Intelli-
gence, 35(3):193–205. https://doi.org
/10.1080/08839514.2020.1842108
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural
Information Processing Systems (NIPS 2017),
pages 5998–6008, Long Beach, CA, USA.
Curran Associates Inc.
Kaisheng Yao and Geoffrey Zweig. 2015.
Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. In Proceedings of Interspeech, pages 3330–3334, Dresden, Germany. International Speech Communication Association (ISCA).
David Yarowsky. 1996. Homograph disambigua-
tion in text-to-speech synthesis. In Jan van
Santen, Richard Sproat, Joseph Olive, and Julia
Hirschberg, editors, Progress in Speech Synthe-
sis, pages 157–172, New York, NY. Springer.
https://doi.org/10.1007/978-1-4612-1894-4_12
Sevinj Yolchuyeva, Géza Németh, and Bálint Gyires-Tóth. 2018. Text normalization with convolutional neural networks. International Journal of Speech Technology, 21:589–600. https://doi.org/10.1007/s10772-018-9521-x
Sevinj Yolchuyeva, Géza Németh, and Bálint Gyires-Tóth. 2019. Grapheme-to-phoneme conversion with convolutional neural networks. Applied Sciences, 9(6). https://doi.org/10.3390/app9061143
Mingzhi Yu, Hieu Duy Nguyen, Alex Sokolov,
Jack Lepird, Kanthashree Mysore Sathyendra,
Samridhi Choudhary, Athanasios Mouchtaris,
and Siegfried Kunzmann. 2020. Multilingual grapheme-to-phoneme conversion with byte representation. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8234–8238, Barcelona, Spain. IEEE.
Hao Zhang, Richard Sproat, Axel Ng, Felix
Stahlberg, Xiaochang Peng, Kyle Gorman, and
Brian Roark. 2019. Neural models of text nor-
malization for speech applications. Computa-
tional Linguistics, 45(2):293–337. https://
doi.org/10.1162/coli_a_00349