Instance-Based Neural Dependency Parsing
Hiroki Ouchi1,3
Jun Suzuki2,3 Sosuke Kobayashi2,4 Sho Yokoi2,3
Tatsuki Kuribayashi2,5 Masashi Yoshikawa2,3 Kentaro Inui2,3
1 NAIST, Japan 2 Tohoku University, Japan 3 RIKEN, Japan
4 Preferred Networks, Inc., Japan 5 Langsmith, Inc., Japan
hiroki.ouchi@is.naist.jp, sosk@preferred.jp,
{jun.suzuki, yokoi, kuribayashi, yoshikawa, inui}@tohoku.ac.jp
Abstract
Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; therefore, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve competitive accuracy with standard neural models and yield reasonably plausible instance-based explanations.
1 Introduction
While deep neural networks have improved pre-
diction accuracy in various tasks, rationales under-
lying the predictions have been more challenging
for humans to understand (Lei et al., 2016). In
practical applications, interpretable rationales play
a crucial role in driving humans’ decisions and
promoting human–machine cooperation (Ribeiro
et al., 2016). From this perspective, the utility
of instance-based learning (Aha et al., 1991),
a traditional machine learning method, has been
realized again (Papernot and McDaniel, 2018).
Instance-based learning is a method that learns
similarities between training instances and infers
a value or class for a test instance on the basis of
similarities against the training instances. On the
one hand, standard neural models encode all the
knowledge in the parameters, making it challeng-
ing to determine what knowledge is stored and
used for predictions (Guu et al., 2020). On the
other hand, models with instance-based inference
explicitly use training instances for predictions
and can exhibit the instances that significantly
contribute to the predictions. The instances play
a role in an explanation to the question: why did
the model make such a prediction? This type of
explanation is called instance-based explana-
tion (Caruana et al., 1999; Baehrens et al., 2010;
Plumb et al., 2018), which facilitates the users’
understandings of model predictions and allows
users to make decisions with higher confidence
(Kolodneer, 1991; Ribeiro et al., 2016).
It is not trivial to combine neural networks with
instance-based inference processes while keep-
ing high prediction accuracy. Recent studies in
image recognition seek to develop such meth-
ods (Wang et al., 2014; Hoffer and Ailon, 2015;
Liu et al., 2017; Wang et al., 2018; Deng et al.,
2019). This paradigm is called deep metric learn-
ing. Compared to image recognition, there are
far fewer studies on deep metric learning in
natural language processing (NLP). As a few ex-
ceptions, Wiseman and Stratos (2019) and Ouchi
et al. (2020) developed neural models that have
an instance-based inference process for sequence
labeling tasks. They reported that their models
have high explainability without sacrificing the
prediction accuracy.
As a next step from targeting consecutive to-
kens, we study instance-based neural models for
relations between discontinuous elements. To cor-
rectly recognize relations, systems need to capture
associations between elements. As an example
of relation recognition, we address dependency
parsing, where systems seek to recognize binary
relations between tokens (hereafter edges). Tradi-
tionally, dependency parsers have been a useful
tool for text analysis. An unstructured text of in-
terest is parsed, and its structure leads users to a
deeper understanding of the text. By successfully
introducing instance-based models to dependency
parsing, users can extract dependency edges along
with similar edges as a rationale for the parse,
which further helps the process of text analysis.
Transactions of the Association for Computational Linguistics, vol. 9, pp. 1493–1507, 2021. https://doi.org/10.1162/tacl_a_00439
Action Editor: Yue Zhang. Submission batch: 7/2021; Revision batch: 8/2021; Published 12/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
In this paper, we develop new instance-based neural models for dependency parsing, equipped with two inference modes: (i) the explainable mode and (ii) the fast mode. In the explainable mode, our models make use of similarities between the candidate edge and each edge in a training set. By looking at the similarities, users can quickly check which training edges significantly contribute to the prediction. In the fast mode, our models run as fast as standard neural models, whereas general instance-based models are much slower than standard neural models because of the dependence on the number of training instances. The fast mode is motivated by the actual situation: in many cases, users want only predictions, and only when the predictions seem suspicious do they want to check the rationales. Thus, the fast mode does not offer rationales, but instead enables faster parsing that outputs exactly the same predictions as the explainable mode. Users can freely switch between the explainable and fast modes according to their purposes. This property is realized by taking advantage of the linearity of score computation in our models, which avoids comparing a candidate edge to each training edge one by one when computing the score at test time (see Section 4.4 for details).
Our experiments on multilingual datasets show that our models can achieve competitive accuracy with standard neural models. Additionally, we shed light on the plausibility of instance-based explanations, which has been underinvestigated in dependency parsing. We verify whether our models meet a minimal requirement related to the plausibility (Hanawa et al., 2021). Additional analysis reveals the existence of hubs (Radovanovic et al., 2010), a small number of specific training instances that often appear as nearest neighbors, and that hubs have a detrimental effect on the plausibility. Our main contributions are as follows:

• This is the first work to develop and study instance-based neural models1 for dependency parsing (Section 4);

• Our empirical results show that our instance-based models achieve competitive accuracy with standard neural models (Section 6.1);
1Our code is publicly available at https://github.com/hiroki13/instance-based-dependency-parsing.
• Our analysis reveals that L2-normalization for edge representations suppresses the hubs' occurrence and, as a result, succeeds in improving the plausibility of instance-based explanations (Sections 6.2 and 6.3).
2 Related Work
2.1 Dependency Parsing
There are two major paradigms for dependency
parsing (Kübler et al., 2009): (i) the transition-based paradigm (Nivre, 2003; Yamada and Matsumoto, 2003) and (ii) the graph-based paradigm (McDonald et al., 2005). Recent literature often adopts the graph-based paradigm and achieves high accuracy (Dozat and Manning, 2017; Zhang et al., 2017; Hashimoto et al., 2017; Clark et al., 2018; Ji et al., 2019; Zhang et al., 2020). The first-order edge-factored models under this paradigm factorize the score of a dependency tree into independent scores of single edges (McDonald et al., 2005). The score of each edge is computed on the basis of its edge feature. This decomposable property is preferable for our work because we want to model similarities between single edges. Thus, we adopt the basic framework of the first-order edge-factored models for our instance-based models.
2.2 Instance-Based Methods in NLP
Traditionally, instance-based methods (memory-
based learning) have been applied to a variety of
NLP tasks (Daelemans and Van den Bosch, 2005),
such as part of speech tagging (Daelemans et al.,
1996), NER (Tjong Kim Sang, 2002; De Meulder
and Daelemans, 2003; Hendrickx and van den
Bosch, 2003), partial parsing (Daelemans et al.,
1999; Sang, 2002), phrase-structure parsing
(Lebowitz, 1983; Scha et al., 1999; K¨ubler, 2004;
Bod, 2009), word sense disambiguation (Veenstra
et al., 2000), semantic role labeling (Akbik and
Li, 2016), and machine translation (MT) (Nagao,
1984; Sumita and Iida, 1991).
Nivre et al. (2004) proposed an instance-based
(memory-based) method for transition-based de-
pendency parsing. The subsequent actions of a
transition-based parser are selected at each step by
comparing the current parser configuration to each
of the configurations in the training set. Here, each
parser configuration is treated as an instance and
plays a role of rationales for predicted actions but
not for predicted edges. Generally, parser config-
urations are not directly mapped to each predicted
edge one by one, so it is troublesome to interpret
which configurations significantly contribute to
edge predictions. In contrast, since we adopt the
graph-based one, our models can naturally treat
each edge as an instance and exhibit similar edges
as rationales for edge predictions.
2.3 Instance-Based Neural Methods in NLP
Most of the studies above were published be-
fore the current deep learning era. Very recently,
instance-based methods have been revisited and
combined with neural models in language model-
ing (Khandelwal et al., 2019), MT (Khandelwal
et al., 2020), and question answering (Lewis et al.,
2020). They augment a main neural model with
a non-parametric sub-module that retrieves auxil-
iary objects, such as similar tokens and documents.
Guu et al. (2020) proposed to parameterize and
learn the sub-module for a target task.
These studies assume a different setting from
ours. There is no ground-truth supervision sig-
nal for retrieval in their setting, so they adopt
non-parametric approaches or indirectly train the
sub-module to help a main neural model from
the supervision signal of the target task. In our
setting, the main neural model plays a role in
retrieval and is directly trained with ground-truth
objects (annotated dependency edges). Thus, our
findings and insights are orthogonal to theirs.
2.4 Deep Metric Learning
Our work can be categorized into deep metric
learning research in terms of the methodologi-
cal perspective. Although the origins of metric
learning can be traced back to some earlier work
(Short and Fukunaga, 1981; Friedman et al., 1994;
Hastie and Tibshirani, 1996), the pioneering work
is Xing et al. (2002).2 Since then, many methods
using neural networks for metric learning have
been proposed and studied.
Deep metric learning methods can be cate-
gorized into two classes from the training loss
perspective (Sun et al., 2020): (i) learning with
class-level labels and (ii) learning with pair-wise
labels. Given class-level labels, the first one learns to classify each training instance to its target class with a classification loss, for example,
2If you would like to know the history of metric learning
in more detail, please read Bellet et al. (2013).
Neighbourhood Component Analysis
(NCA)
(Goldberger et al., 2005), L2-constrained softmax
loss (Ranjan et al., 2017), SphereFace (Liu et al.,
2017), CosFace (Wang et al., 2018), and ArcFace
(Deng et al., 2019). Given pair-wise labels, Die
second one learns pair-wise similarity (the simi-
larity between a pair of instances), for example,
contrastive loss (Hadsell et al., 2006), triplet loss
(Wang et al., 2014; Hoffer and Ailon, 2015),
N-pair loss (Sohn, 2016), and multi-similarity loss
(Wang et al., 2019). Our method is categorized
into the first group because it adopts a classifica-
tion loss (Section 4).
2.5 Neural Models Closely Related to Ours
Among the metric learning methods above, NCA
(Goldberger et al., 2005) shares the same spirit
as our models. In this framework, models learn
to map instances with the same label
to the
neighborhood in a feature space. Wiseman and
Stratos (2019) and Ouchi et al. (2020) developed
NCA-based neural models for sequence labeling.
We discuss the differences between their models
and ours later in more detail (Section 4.5).
3 Dependency Parsing Framework
We adopt a two-stage approach (McDonald et al.,
2006; Zhang et al., 2017): we first identify depen-
dency edges (unlabeled dependency parsing) Und
then classify the identified edges (labeled depen-
dency parsing). More precisely, we solve edge
identification as head selection and solve edge
classification as multi-class classification.3
3.1 Edge Identification
To identify unlabeled edges, we adopt the head
selection approach (Zhang et al., 2017), in which
a model learns to select the correct head of each to-
ken in a sentence. This simple approach enables us
to train accurate parsing models in a GPU-friendly
Weg. We learn the representation for each edge to
be discriminative for identifying correct heads.
Let x = (x0, x1, . . . , xT ) denote a tokenized
input sentence, where x0 is a special ROOT token
3Although some previous studies adopt multi-task learn-
ing methods for edge identification and classification tasks
(Dozat and Manning, 2017; Hashimoto et al., 2017), we inde-
pendently train a model for each task because the interaction
effects produced by multi-task learning make it challenging
to analyze models’ behaviors.
and x_1, . . . , x_T are the original T tokens, and let ⟨x_j, x_i⟩ denote an edge from head token x_j to dependent token x_i. We define the probability of token x_j being the head of token x_i in the sentence x as:

P(x_j \mid x_i) = \frac{\exp(s_{\mathrm{head}}(x_j, x_i))}{\sum_{k=0}^{T} \exp(s_{\mathrm{head}}(x_k, x_i))}.   (1)

Here, the scoring function s_head can be any neural network-based scoring function (see Section 4.1). At inference time, we choose the most likely head ŷ_i for each token x_i:4

\hat{y}_i = \arg\max_{x_k : 0 \le k \le T} P(x_k \mid x_i).   (2)

At training time, we minimize the negative log-likelihood of the training data:

L = -\sum_{n=1}^{|D|} \sum_{i=1}^{T^{(n)}} \log P(y_i^{(n)} \mid x_i^{(n)}).   (3)

Here, D = \{x^{(n)}, y^{(n)}, r^{(n)}\}_{n=1}^{|D|} is a training set, where x_i^{(n)} \in x^{(n)} is each input token, y_i^{(n)} \in y^{(n)} is the gold (ground-truth) head for token x_i^{(n)}, and r_i^{(n)} \in r^{(n)} is the label for edge ⟨y_i^{(n)}, x_i^{(n)}⟩.
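The head-selection formulation above can be made concrete with a short sketch. The following NumPy snippet (our own illustration, not the authors' released implementation) computes Eqs. 1–3 for one sentence; the score matrix stands in for any scorer s_head, and all names and toy values are hypothetical.

    import numpy as np

    def head_probabilities(scores):
        # scores[i, k] = s_head(x_k, x_{i+1}): score of candidate head k (0 = ROOT)
        # for dependent token i+1.  Eq. 1 is a softmax over all candidate heads.
        e = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stable
        return e / e.sum(axis=1, keepdims=True)

    def predict_heads(scores):
        # Eq. 2: greedily choose the most likely head for every token.
        return head_probabilities(scores).argmax(axis=1)

    def head_nll(scores, gold_heads):
        # Eq. 3 for a single sentence: negative log-likelihood of the gold heads.
        probs = head_probabilities(scores)
        return -np.log(probs[np.arange(len(gold_heads)), gold_heads]).sum()

    # Toy example: 3 dependent tokens, 4 candidate heads (ROOT + 3 tokens).
    scores = np.array([[0.2, 0.0, 3.0, 0.1],
                       [4.0, 0.5, 0.0, 0.2],
                       [0.1, 0.3, 2.5, 0.0]])
    gold = np.array([2, 0, 2])
    print(predict_heads(scores))        # -> [2 0 2]
    print(round(head_nll(scores, gold), 3))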
3.2 Label Classification
We adopt a simple multi-class classification approach for labeling each unlabeled edge. We define the probability that each of all possible labels r ∈ R will be assigned to the edge ⟨x_j, x_i⟩:

P(r \mid x_j, x_i) = \frac{\exp(s_{\mathrm{label}}(r, x_j, x_i))}{\sum_{r' \in R} \exp(s_{\mathrm{label}}(r', x_j, x_i))}.   (4)

Here, the scoring function s_label can be any neural network-based scoring function (see Section 4.2). At inference time, we choose the most likely class label from the set of all possible labels R:

\hat{r} = \arg\max_{r \in R} P(r \mid \hat{y}_i, x_i).   (5)

Here, ŷ_i is the head token identified by a head selection model.
4While this greedy formulation has no guarantee to pro-
duce well-formed trees, we can produce well-formed ones by
using the Chu-Liu-Edmonds algorithm in the same way as
Zhang et al. (2017). In this work, we would like to focus on
the representation for each edge and evaluate the goodness
of the learned edge representation one by one. With such a
motivation, we adopt the greedy formulation.
At training time, we minimize the negative log-likelihood of the training data:

L = -\sum_{n=1}^{|D|} \sum_{i=1}^{T^{(n)}} \log P(r_i^{(n)} \mid y_i^{(n)}, x_i^{(n)}).   (6)

Here, r_i^{(n)} \in R is the gold (ground-truth) relation label for the gold edge ⟨y_i^{(n)}, x_i^{(n)}⟩.
4 Instance-Based Scoring Methods
For the scoring functions in Eqs. 1 and 4, we
describe our proposed instance-based models.
4.1 Edge Scoring
We would like to assign a higher score to the correct edge than to other candidates (Eq. 1). Here, we compute similarities between each candidate edge and the ground-truth edges in a training set (hereafter training edges). By summing the similarities, we then obtain a score that indicates how likely the candidate edge is to be the correct one.
Specifically, we first construct a set of training edges, called the support set, A(D):

A(D) = \{⟨y_i, x_i⟩ \mid x_i \in x,\; y_i \in y,\; (x, y, r) \in D\}.   (7)

Here, y_i is the ground-truth head token of token x_i. We then compute and sum similarities between a candidate edge and each edge in the support set:

s_{\mathrm{head}}(x_j, x_i) = \sum_{⟨x_\ell, x_k⟩ \in A(D)} \mathrm{sim}(h_{⟨j,i⟩}, h_{⟨\ell,k⟩}).   (8)

Here, h_{⟨j,i⟩}, h_{⟨\ell,k⟩} \in \mathbb{R}^d are d-dimensional edge representations (Section 4.3), and sim is a similarity function. Following recent studies of deep metric learning, we adopt the dot product and the cosine similarity:

\mathrm{sim}_{\mathrm{dot}}(a, b) = a^{\top} b,
\mathrm{sim}_{\mathrm{cos}}(a, b) = \tau\, a_u^{\top} b_u, \quad a_u = a / \lVert a \rVert,\; b_u = b / \lVert b \rVert.

As you can see, the cosine similarity is the same as the dot product between two unit vectors: that is, \lVert a_u \rVert = \lVert b_u \rVert = 1. As we will discuss later in Section 6.3, this property suppresses the occurrence of hubs, compared with the dot product between unnormalized vectors. Note that, following the previous studies of deep metric learning
(Wang et al., 2018; Deng et al., 2019), we rescale
the cosine similarity by using the scaling fac-
tor τ (0 ≤ τ ), which works as the temperature
parameter in the softmax function.5
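As a rough illustration of Eqs. 7–8, the sketch below (our own, not the released code) scores one candidate edge against a small support set with either similarity function; the array names are hypothetical, and τ = 64 simply follows the rescaling factor listed in Table 2.

    import numpy as np

    def sim_dot(h_edge, support):
        # Dot-product similarity between one candidate edge and every support edge.
        return support @ h_edge

    def sim_cos(h_edge, support, tau=64.0):
        # Rescaled cosine similarity: dot product of L2-normalized vectors, times tau.
        q = h_edge / np.linalg.norm(h_edge)
        s = support / np.linalg.norm(support, axis=1, keepdims=True)
        return tau * (s @ q)

    def s_head(h_edge, support, use_cos=True):
        # Eq. 8: the edge score is the sum of similarities over the support set A(D).
        sims = sim_cos(h_edge, support) if use_cos else sim_dot(h_edge, support)
        return sims.sum()

    rng = np.random.default_rng(0)
    support = rng.normal(size=(5, 4))   # 5 support edges, d = 4
    candidate = rng.normal(size=4)      # one candidate edge representation
    print(s_head(candidate, support, use_cos=True))
    print(s_head(candidate, support, use_cos=False))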
4.2 Label Scoring
Similarly to the scoring function above, we also design our instance-based label scoring function s_label in Eq. 4. We first construct a support set A(D; r) for each relation label r ∈ R:

A(D; r) = \{⟨y_i, x_i⟩ \mid x_i \in x,\; y_i \in y,\; r_i \in r,\; (x, y, r) \in D,\; r_i = r\}.   (9)

Here, only the edges with label r are collected from the training set. We then compute and sum similarities between a candidate edge and each edge of the support set:

s_{\mathrm{label}}(r, x_j, x_i) = \sum_{⟨x_\ell, x_k⟩ \in A(D; r)} \mathrm{sim}(h_{⟨j,i⟩}, h_{⟨\ell,k⟩}).   (10)

Here is the intuition: if the edge is more similar to the edges with label r than to those with other labels, the edge is more likely to have the label r.
4.3 Edge Representation
In the proposed models (Eqs. 8 and 10), we use d-dimensional edge representations. We define the representation for each edge ⟨x_j, x_i⟩ as follows:

h_{⟨j,i⟩} = f(h_i^{\mathrm{dep}}, h_j^{\mathrm{head}}).   (11)

Here, h^{\mathrm{dep}}, h^{\mathrm{head}} \in \mathbb{R}^d are d-dimensional feature vectors for the dependent and head, respectively. These vectors are created from a neural encoder (Section 5.2). When designing f, it is desirable to capture the interaction between the two vectors. By referring to the insights into feature representations of relations on knowledge bases (Bordes et al., 2013; Yang et al., 2015; Nickel et al., 2016), we adopt a multiplicative composition, a major composition technique for two vectors:6

f(h_i^{\mathrm{dep}}, h_j^{\mathrm{head}}) := W (h_i^{\mathrm{dep}} \odot h_j^{\mathrm{head}}).
5In our preliminary experiments, we set τ by selecting
a value from {16, 32, 64, 128}. As a result, whichever we
chose, the prediction accuracy was stably better than τ = 1.
6In our preliminary experiments, we also tried an additive composition and the concatenation of the two vectors. The accuracies obtained by these techniques for unlabeled dependency parsing, however, were both about 20%, which is much inferior to that obtained by the multiplicative composition.
Here, the interaction between h_i^{\mathrm{dep}} and h_j^{\mathrm{head}} is captured by element-wise multiplication \odot. These are composed into one vector, which is then transformed by a weight matrix W \in \mathbb{R}^{d \times d} into h_{⟨j,i⟩}.
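A minimal sketch of Eq. 11 with this multiplicative composition, assuming nothing beyond the definitions above (all sizes and values are illustrative):

    import numpy as np

    def edge_representation(h_dep, h_head, W):
        # Eq. 11 with the multiplicative composition: element-wise multiplication of
        # the dependent and head vectors, followed by a linear transformation W.
        return W @ (h_dep * h_head)

    d = 8
    rng = np.random.default_rng(1)
    W = rng.normal(size=(d, d))
    h_dep = rng.normal(size=d)
    h_head = rng.normal(size=d)
    print(edge_representation(h_dep, h_head, W).shape)   # (8,)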
4.4 Fast Mode
Do users want rationales for all the predictions?
Maybe not. In many cases, all they want to do is
to parse sentences as fast as possible. Only when
they find a suspicious prediction will they check
the rationale for it. To fulfill this demand, our parser provides two modes: (i) the explainable mode and (ii) the fast mode. The explainable mode, as described in the previous subsections, enables exhibiting similar training instances as rationales, but its time complexity depends on the size of the training set. In contrast, the fast mode does not provide rationales, but instead enables faster parsing than the explainable mode and outputs exactly the same predictions as the explainable mode. Thus, at test time, users can freely switch between the modes: for example, they first use the fast mode, and if they find a suspicious prediction, they then use the explainable mode to obtain the rationale for it.
Formally, if using the dot product or the cosine similarity as the similarity function in Eq. 8, the explainable mode can be rewritten as the fast mode:

s_{\mathrm{head}}(x_j, x_i) = \sum_{⟨x_\ell, x_k⟩ \in A(D)} h_{⟨j,i⟩}^{\top} h_{⟨\ell,k⟩}
                           = h_{⟨j,i⟩}^{\top} \sum_{⟨x_\ell, x_k⟩ \in A(D)} h_{⟨\ell,k⟩}
                           = h_{⟨j,i⟩}^{\top} h^{\mathrm{sum}}_{A(D)},   (12)

where h^{\mathrm{sum}}_{A(D)} := \sum_{⟨x_\ell, x_k⟩ \in A(D)} h_{⟨\ell,k⟩}. In this way, once you sum all the vectors h_{⟨\ell,k⟩} (⟨x_\ell, x_k⟩ \in A(D)) in the training set, you can reuse the summed vector without searching the training set again. At test time, you can precompute this summed vector h^{\mathrm{sum}}_{A(D)} before running the model on a test set, which reduces the exhaustive similarity computation over the training set to the simple dot product between the two vectors, h_{⟨j,i⟩}^{\top} h^{\mathrm{sum}}_{A(D)}.7
4.5 Relations to Existing Models
The Closest Models to Ours. Wiseman and
Stratos (2019) and Ouchi et al. (2020) proposed
7In the same way as Eq. 12, we can transform slabel in
Eq. 10 to the fast mode.
an instance-based model using Neighbourhood
Components Analysis (NCA) (Goldberger et al.,
2005) for sequence labeling. Given an input sen-
tence of T tokens, x = (x1, . . . , xT ), the model
first computes the probability that a token (oder
Spanne) xi ∈ x in the sentence selects each of all the
tokens in the training set xj ∈ xD as its neighbor:
P(x_j \mid x_i) = \frac{\exp(\mathrm{sim}(x_j, x_i))}{\sum_{x_k \in x_D} \exp(\mathrm{sim}(x_k, x_i))}.

The model then constructs a set of only the tokens x_j associated with a label y: X(x_D; y) = \{x_j \mid x_j \in x_D, y_j = y\}, and computes the probability that each token x_i will be assigned a label y:

P(y \mid x_i) = \sum_{x_j \in X(x_D; y)} P(x_j \mid x_i).
The point is that while our models first sum
the similarities (Eq. 8) and then put the summed
score into exponential form as exp(shead(xj, xi)),
their model puts each similarity into exponential
form as exp(sim(xk, xi)) before the summation.
The different order of using exponential function
makes it impossible to rewrite their model as the
fast mode, so their model always has to compare a
token xi to each of the training set xj ∈ xD. Das
is the biggest difference between their model and
ours. While we leave the performance compari-
son between the NCA-based models and ours for
future work, our models have an advantage over
the NCA-based models in that our models offer
two options, the explainable and fast modes.
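The difference in the order of summation and exponentiation can be sketched as follows (a schematic comparison under simplified assumptions, not either model's actual implementation): our formulation sums raw similarities per candidate before the softmax, whereas the NCA-style formulation exponentiates each similarity first, which is why only the former collapses into the fast mode.

    import numpy as np

    def our_distribution(sims_per_candidate):
        # Ours: sum raw similarities per candidate (Eq. 8), then apply the softmax (Eq. 1).
        scores = np.array([s.sum() for s in sims_per_candidate])
        e = np.exp(scores - scores.max())
        return e / e.sum()

    def nca_style_distribution(sims_per_candidate):
        # NCA-style: exponentiate every individual similarity before summing, so the
        # support set can never be collapsed into a single precomputed vector.
        all_sims = np.concatenate(sims_per_candidate)
        m = all_sims.max()                    # shared shift keeps ratios intact
        sums = np.array([np.exp(s - m).sum() for s in sims_per_candidate])
        return sums / sums.sum()

    rng = np.random.default_rng(3)
    # sims_per_candidate[k]: similarities between one query edge and the support
    # edges associated with candidate k (a candidate head or a label).
    sims_per_candidate = [rng.normal(size=100) for _ in range(4)]
    print(our_distribution(sims_per_candidate))
    print(nca_style_distribution(sims_per_candidate))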
Standard Models Using Weights. Typically,
neural models use the following scoring functions:
s_{\mathrm{head}}(x_j, x_i) = w^{\top} h_{⟨j,i⟩},   (13)
s_{\mathrm{label}}(r, x_j, x_i) = w_r^{\top} h_{⟨j,i⟩}.   (14)
Hier, w ∈ Rd is a learnable weight vector and
wr ∈ Rd is a learnable weight vector associ-
ated with label r ∈ R. In previous work (Zhang
et al., 2017), this form is used for dependency
parsing. We call such models weight-based mod-
els. Caruana et al. (1999) proposed combining
weight-based models with instance-based infer-
enz: At inference time, the weights are discarded,
and only the trained encoder is used to extract fea-
Language   Treebank    Family   Order   Train
Arabic     PADT        non-IE   VSO      6.1k
Basque     BDT         non-IE   SOV      5.4k
Chinese    GSD         non-IE   SVO      4.0k
English    EWT         IE       SVO     12.5k
Finnish    TDT         non-IE   SVO     12.2k
Hebrew     HTB         non-IE   SVO      5.2k
Hindi      HDTB        IE       SOV     13.3k
Italian    ISDT        IE       SVO     13.1k
Japanese   GSD         non-IE   SOV      7.1k
Korean     GSD         non-IE   SOV      4.4k
Russian    SynTagRus   IE       SVO     48.8k
Swedish    Talbanken   IE       SVO      4.3k
Turkish    IMST        non-IE   SOV      3.7k

Table 1: Dataset information. ‘‘Family’’ indicates whether the language is Indo-European (IE) or not. ‘‘Order’’ indicates dominant word order according to WALS (Haspelmath et al., 2005). ‘‘Train’’ is the number of training sentences.
ture representations for instance-based inference.
Such a combination has been reported to be ef-
fective for image recognition (Ranjan et al., 2017;
Liu et al., 2017; Wang et al., 2018; Deng et al.,
2019). In dependency parsing, it has not yet been investigated. Since such a combination can be a promising method, we investigate its utility (Section 6).
5 Experimental Setup
5.1 Data
We use English PennTreebank (PTB) (Marcus
et al., 1993) and Universal Dependencies (UD)
(McDonald et al., 2013). Following previous stud-
ies (Kulmizev et al., 2019; Smith et al., 2018;
de Lhoneux et al., 2017), we choose a variety of
13 languages8 from the UD v2.7. Table 1 shows
information about each dataset. We follow the
standard training-development-test splits.
5.2 Neural Encoder Architecture
To compute hdep and hhead (in Eq. 11), we adopt
the encoder architecture proposed by Dozat and
Manning (2017). First, we map the input sequence
8These languages have been selected by considering the
perspectives of different language families, different mor-
phological complexity, different training sizes, and domains.
T
0
T
0:T = (htoken
x = (x0, . . . , xT )9 to a sequence of token repre-
Sendungen, htoken
, . . . , htoken
), jeder von
which is htoken
= [et; ct; bt], where et, ct, and bt
are computed by word embeddings,10 character-
level CNN, and BERT (Devlin et al., 2019),11
jeweils. Zweite, the sequence htoken
is fed
0:T
to bidirectional LSTM (BiLSTM) (Graves et al.,
2013) for computing contextual ones: hlstm
0:T =
(hlstm
0:T ). Endlich,
0
hlstm
t = W dephlstm
T
, where W dep ∈ Rd×2d
and hhead
T
and W head ∈ Rd×2d are parameter matrices.
, . . . , hlstm
∈ R2d is transformed as hdep
t = W headhlstm
T ) = BiLSTM(htoken
T
5.3 Mini-Batching
We train models with the mini-batch stochastic
gradient descent method. To make the current
mini-batch at each time step, we follow a stan-
dard technique for training instance-based models
(Hadsell et al., 2006; Oord et al., 2018).
At training time, we make the mini-batch that
consists of query and support sentences at each
time step. A model encodes the sentences and the
edge representations used for computing similar-
ities between each candidate edge in the query
sentences and each gold edge in the support sen-
tences. Here, due to the memory limitation of GPUs, we randomly sample a subset from the training set at each time step: that is, (x^{(n)}, y^{(n)}, r^{(n)}) \sim \mathrm{Uniform}(D). In edge identification, for query sentences, we randomly sample a subset D'_q of N sentences from D. For support sentences, we randomly sample a subset D'_s of M sentences from D, and construct and use the support set A(D'_s) instead of A(D) in Eq. 7. In label classification, we would like to guarantee that the support set in every mini-batch always contains at least one edge for each label. To do so, we randomly sample a subset A'(D; r) of U edges from the support set for each label r: that is, ⟨y_i^{(n)}, x_i^{(n)}⟩ \sim \mathrm{Uniform}(A(D; r)) in Eq. 9. Note that each edge ⟨y_i^{(n)}, x_i^{(n)}⟩ is in the n-th sentence x^{(n)} in the training set D, so we put the sentence
9We use the gold tokenized sequences in PTB and UD.
10For PTB, we use 300-dimensional GloVe (Pennington et al., 2014). For UD, we use 300-dimensional fastText (Grave et al., 2018). During training, we fix them.
11We first conduct subword segmentation for each token
of the input sequence. Then, the BERT encoder takes as input the subword-segmented sequences and computes the representation for each subword. Here, we use the (last
layer) representation of the first subword within each token
as its token representation. For PTB, we use ‘‘BERT-Base,
Cased.’’ For UD, we use ‘‘BERT-Base, Multilingual Cased.’’
Name                 Value
Word embedding       GloVe (PTB) / fastText (UD)
BERT                 BERT-Base
CNN window size      3
CNN filters          30
BiLSTM layers        2
BiLSTM units         300 dimensions
Optimization         Adam
Learning rate        0.001
Rescaling factor τ   64
Dropout ratio        {0.1, 0.2, 0.3}

Table 2: Hyperparameters used in the experiments.
x^{(n)} into the mini-batch to compute the representation for ⟨y_i^{(n)}, x_i^{(n)}⟩. In practice, we use N = 32
query sentences in both edge identification and
label classification, M = 10 support sentences
in edge identification,12 and U = 1 support edge
(Satz) for each label in label classification.13
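A minimal sketch of how one training step's mini-batch could be assembled for edge identification under this scheme (the sentence objects and sizes are placeholders; the real implementation also pairs each sentence with its gold edges):

    import random

    def make_minibatch(train_sentences, n_query=32, m_support=10, rng=None):
        # One edge-identification training step (Section 5.3): N query sentences and
        # M support sentences are drawn uniformly from the training set; the support
        # sentences define the support set A(D'_s) used in place of A(D) in Eq. 7.
        rng = rng or random.Random(0)
        query = rng.sample(train_sentences, n_query)
        support = rng.sample(train_sentences, m_support)
        return query, support

    train_sentences = [f"sent-{i}" for i in range(1000)]   # placeholder sentence ids
    query, support = make_minibatch(train_sentences)
    print(len(query), len(support))                        # 32 10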
At test time, we encode each test (query) sentence and compute the representation for each candidate edge on-the-fly. The representation is then compared to the precomputed support edge representation, h^{\mathrm{sum}}_{A(D)} in Eq. 12. To precompute h^{\mathrm{sum}}_{A(D)}, we first encode all the training sentences and obtain the edge representations. Then, in edge identification, we sum all of them and obtain one support edge representation h^{\mathrm{sum}}_{A(D)}. In label classification, similarly to h^{\mathrm{sum}}_{A(D)}, we sum only the edge representations with label r and obtain one support representation for each label, h^{\mathrm{sum}}_{A(D;r)}.14
5.4 Training Configuration
Table 2 lists the hyperparameters. To optimize the
Parameter, we use Adam (Kingma and Ba, 2014)
with β1 = 0.9 and β2 = 0.999. The initial learning
rate is η0 = 0.001 and is updated on each epoch
as ηt = η0/(1 + ρt), where ρ = 0.05 and t is
the epoch number completed. A gradient clipping
value is 5.0 (Pascanu et al., 2013). The number
of training epochs is 100. We save the parameters
that achieve the best score on each development
set and evaluate them on each test set. It takes less
than one day to train on a single GPU, NVIDIA
DGX-1 with Tesla V100.
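The learning-rate schedule above is simple enough to state as code (a direct transcription of the formula, not taken from the released implementation):

    def learning_rate(epoch, eta0=0.001, rho=0.05):
        # eta_t = eta_0 / (1 + rho * t), with t the number of completed epochs.
        return eta0 / (1.0 + rho * epoch)

    for t in (0, 1, 10, 100):
        print(t, round(learning_rate(t), 6))   # 0.001, 0.000952, 0.000667, 0.000167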
12As a result, the whole mini-batch size is 32 + 10 = 42.
13When U = 1, the whole mini-batch size is 32 + |R|.
14The total number of the support edge representations is
equal to the size of the label set |R|.
Learning                      Weight-based          Weight-based          Instance-based
Inference                     Weight-based          Instance-based        Instance-based
Similarity       dot          dot        cos        dot        cos        dot        cos
System ID     Kulmizev+'19    WWd        WWc        WId        WIc        IId        IIc

PTB-English       –           96.4/95.3  96.4/95.3  96.4/94.4  93.0/91.8  96.4/95.3  96.4/95.3
UD-Average      – /84.9       89.0/85.6  89.0/85.6  89.0/85.2  83.0/79.5  89.3/85.7  89.0/85.5
UD-Arabic       – /81.8       87.8/82.1  87.8/82.1  87.8/81.6  84.9/79.0  88.0/82.1  87.6/81.9
UD-Basque       – /79.8       84.9/81.1  84.9/80.9  84.9/80.6  82.0/77.9  85.1/80.9  85.0/80.8
UD-Chinese      – /83.4       85.6/82.3  85.8/82.4  85.7/81.6  80.9/77.3  86.3/82.8  85.9/82.5
UD-English      – /87.6       90.9/88.1  90.7/88.0  90.9/87.8  88.1/85.3  91.1/88.3  91.0/88.2
UD-Finnish      – /83.9       89.4/86.6  89.1/86.3  89.3/86.1  84.1/81.2  89.6/86.6  89.4/86.4
UD-Hebrew       – /85.9       89.4/86.4  89.5/86.5  89.4/85.9  82.7/79.7  89.8/86.7  89.6/86.6
UD-Hindi        – /90.8       94.8/91.7  94.8/91.7  94.8/91.4  91.4/88.0  94.9/91.8  94.9/91.6
UD-Italian      – /91.7       94.1/92.0  94.2/92.1  94.1/91.9  91.5/89.4  94.3/92.2  94.1/92.0
UD-Japanese     – /92.1       94.3/92.8  94.5/93.0  94.3/92.7  92.5/90.9  94.6/93.1  94.4/92.8
UD-Korean       – /84.2       88.0/84.4  87.9/84.3  88.0/84.2  84.3/80.4  88.1/84.4  88.2/84.5
UD-Russian      – /91.0       94.2/92.7  94.1/92.7  94.2/92.4  57.7/56.5  94.3/92.8  94.1/92.6
UD-Swedish      – /86.9       90.3/87.6  90.3/87.5  90.4/87.1  88.6/85.8  90.5/87.5  90.4/87.5
UD-Turkish      – /64.9       73.0/65.3  73.2/65.4  73.1/64.5  69.9/61.9  73.7/65.5  72.9/64.7

Table 3: Comparison between weight-based and instance-based systems. Cells show unlabeled attachment scores (UAS) before the slash and labeled attachment scores (LAS) after the slash on each test set. System IDs stand for the first letters of the options: e.g., WId stands for ‘‘W’’eight-based learning and ‘‘I’’nstance-based inference using the ‘‘d’’ot product. The system ID Kulmizev+'19 is the graph-based parser with BERT in Kulmizev et al. (2019).
6 Results and Discussion
6.1 Prediction Accuracy on
Benchmark Tests
We report averaged unlabeled attachment scores
(UAS) and labeled attachment scores (LAS) across
three different runs of the model training with ran-
dom seeds. We compare six systems, each of which consists of two models for edge identification and label classification, respectively. For reference,
we list the results by the graph-based parser with
BERT in Kulmizev et al. (2019), whose architec-
ture is the most similar to ours.
Table 3 shows UAS and LAS by these sys-
Systeme. The systems WWd and WWc are the standard
ones that consistently use the weight-based scores
(Eqs. 13 Und 14) during learning and inference.
Between these systems, the difference of the sim-
ilarity functions does not make a gap in the accura-
cies. Mit anderen Worten, the dot product and the cosine
similarity are on par in terms of the accuracies.
The systems WId and WIc use the weight-based
scores during learning and the instance-based ones
during inference. While the system WId using dot
achieved competitive UAS and LAS to those by
the standard weight-based system WWd, the sys-
tem WIc using cos achieved lower accuracies
than those by the system WWc. The systems
IId and IIc consistently use the instance-based
scores during learning and inference. Both of
them succeeded in keeping competitive accura-
cies with those by the standard weight-based ones
WWd and WWc.
Out-of-Domain Robustness. We evaluate the robustness of our instance-based models in out-of-domain settings by using the five domains
of UD-English: we train each model on the train-
ing set of the source domain ‘‘Yahoo! Answers’’
and test it on each test set of the target domains,
Emails, Newsgroups, Reviews, and Weblogs. As Table 4 shows, the out-of-domain
robustness of our instance-based models is com-
parable to the weight-based models. This tendency
is observed when using different source domains.
Sensitivity of M for Inference.
In the experi-
ments above, we used all the training sentences for
support sentences at test time. What if we reduce
the number of support sentences? Here, in the
                Weight-based          Instance-based
                WWd (dot)  WWc (cos)  IId (dot)  IIc (cos)
Emails            81.7       81.7       81.6       81.4
Newsgroups        83.1       83.3       83.1       82.9
Reviews           88.5       88.7       88.7       88.8
Weblogs           81.9       80.9       80.9       81.9
Average           83.8       83.7       83.6       83.8

Table 4: UAS in out-of-domain settings, where each model is trained on the source domain ‘‘Yahoo! Answers’’ and tested on each of the four target domains.
              M = 1   M = 10   M = 100   ALL
Emails         81.5    81.4      81.5    81.5
Newsgroups     82.8    83.0      82.9    82.9
Reviews        88.7    88.7      88.8    88.8
Weblogs        81.8    82.1      82.0    81.9
Average        83.7    83.8      83.8    83.8

Table 5: UAS by the instance-based system using the cosine similarity (IIc) and randomly sampled M support training sentences.
same out-of-domain settings above, we evaluate
the instance-based system using the cosine sim-
ilarity IIc with M support sentences randomly
sampled at each time step. Intuitively, if using
a smaller number of randomly sampled support sentences (e.g., M = 1), the prediction accuracies
would drop. Surprisingly, however, Table 5 shows
that the accuracies do not drop even if reducing M .
This tendency is observed when using the other
three systems WId, WIc, and IId. One possible
reason for it is that the feature space is appropri-
ately learned: that is, because positive edges are
close to each other and far from negative edges in
the feature space, the accuracies do not drop even
if randomly sampling a single support sentence
and using the edges.
6.2 Sanity Check for Plausible Explanations
It is an open question how to evaluate the ‘‘plausi-
bility’’ of explanations: that is, whether or not the
retrieved instances as explanations are convincing
for humans. As a reasonable compromise, Hanawa
et al. (2021) designed the identical subclass test
for evaluating the plausibility. This test is based on
a minimal requirement that interpretable models
Figure 1: Valid (✓) and invalid (✗) examples of unlabeled edges for the identical subclass test.
should at least satisfy: training instances to be pre-
sented as explanations should belong to the same
latent (sub)class as the test instance. Consider the
examples in Figure 1. The predicted unlabeled
edge ‘‘wrote → novels’’ in the test sentence has
Die (unobserved) latent label, obj. To this edge,
two training instances are given as explanations:
The above one seems more convincing than the
below one because ‘‘published → books’’ has
the same latent label, obj, as that of ‘‘wrote →
novels’’ while ‘‘novels → the’’ has the different
eins, det. As these show, the agreement between
the latent classes are likely to correlate with plau-
sibility. Note that this test is not perfect for the
plausibility assessment, but it works as a sanity
check for verifying whether models make obvious
violations in terms of plausibility.
This test can be used for assessing unlabeled
parsing models because the (unobserved) relation
labels can be regarded as the latent subclasses of
positive unlabeled edges. We follow three steps:
(ich) identifying unlabeled edges in a development
set; (ii) retrieving the nearest training edge for
each identified edge; Und (iii) calculating LAS,
that is, if the labels of the query and retrieved
edges are identical, we regard them as correct.15
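The three steps can be summarized in a small sketch (our own simplification with toy one-dimensional ‘‘representations’’; the real test uses the learned edge representations and the parser's output):

    def identical_subclass_las(queries, support_reprs, support_labels, sim):
        # Steps (i)-(iii): for each query edge (representation, gold label, and a flag
        # for whether the unlabeled edge itself was parsed correctly), retrieve the
        # nearest support edge and count a hit when the gold labels match.
        correct = 0
        for q_repr, q_label, parsed_correctly in queries:
            if not parsed_correctly:      # a wrongly parsed edge is counted as incorrect
                continue
            nearest = max(range(len(support_reprs)),
                          key=lambda k: sim(q_repr, support_reprs[k]))
            correct += int(support_labels[nearest] == q_label)
        return correct / len(queries)

    sim = lambda a, b: -abs(a - b)                      # toy similarity (1-D vectors)
    support_reprs, support_labels = [0.1, 0.9], ["det", "obj"]
    queries = [(0.2, "det", True), (0.8, "obj", True), (0.5, "det", False)]
    print(identical_subclass_las(queries, support_reprs, support_labels, sim))  # ~0.667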
Table 6 shows LAS on PTB and UD-English.
The systems using instance-based inference with
the cosine similarity, WIc and IIc, succeeded in
retrieving the support training edges with the same
label as the queries. Surprisingly, the system IIc
achieved over 70% LAS on PTB without label
supervision. The results suggest that systems using
instance-based inference with the cosine similarity
15If the parsed edge is incorrect, we regard it as incorrect.
              Weight-based          Instance-based
System ID     WId (dot)  WIc (cos)  IId (dot)  IIc (cos)
PTB-English      1.8       67.5        7.0       71.6
UD-English      16.4       51.5        3.9       54.0

Table 6: Results of the identical subclass test. Each cell indicates labeled attachment scores (LAS) on each development set. All the models are trained with head selection supervision and without labeling supervision.
Table 7: Examples of support edges retrieved by the instance-based system using the cosine similarity (IIc). The first query-support pair has the same (unobserved) label mark. The query of the second pair has nmod:poss although the support has det.
meet the minimal requirement, and the retrieved
edges are promising as plausible explanations.
To facilitate the intuitive understanding of
model behaviors, we show actual examples of
the retrieved support edges in Table 7. As the first
query-support pair shows, for query edges whose
head or dependent is a function word (z.B., Wenn ), Die
training edges with the same (unobserved) label
tend to be retrieved. Andererseits, as the
second pair shows, for queries whose head is a
noun (z.B., appeal), the edges whose head is also
a noun (z.B., food) tend to be retrieved regardless
of different latent labels.
6.3 Geometric Analysis on Feature Spaces
The identical subclass test suggests a big differ-
ence between the feature spaces learned by using
the dot product and the cosine similarity. Here we
look into them in more detail.
6.3.1 Observation of Nearest Neighbors
Erste, we look into training edges retrieved as near-
est support ones. Specifically, we use the edges
Table 8: Examples of unlabeled support training edges retrieved by the WId system (weight-based learning and instance-based inference with the dot product) for each query. Regardless of the very different queries, the same support edge was retrieved.
in the UD-English development set as queries and
retrieve the top k similar support edges in the
UD-English training set. Table 8 shows the examples retrieved by the WId system. Here, the same support edge, ⟨ROOT, find⟩, was retrieved for the different queries ⟨Jennifer, Anderson⟩ and ⟨all, after⟩. As this indicates, when using the dot
product as the similarity function, a small num-
ber of specific edges are extremely often selected
as support ones for any queries. Such edges are
called hubs (Radovanovic et al., 2010). This phe-
nomenon is not desirable for users in terms of the
plausible interpretation of predictions. If a system
always exhibits the same training instance(S) als
rationales for predictions, users are likely to doubt
the system’s validity.
6.3.2 Quantitative Measurement of Hubness
Zweite, we quantitatively measure the hubness
of each system. Specifically, for the hubness, we measure the k-occurrences of instance x, N_k(x) (Radovanovic et al., 2010; Schnitzer et al., 2012). In the case of our dependency parsing experiments, N_k(x) indicates the number of times each
support training edge x occurs among the k near-
est neighbors of all the query edges. The support
training edges with an extremely high Nk value
can be regarded as hubs. In this study, we set
k = 10 and measure N10(X) of unlabeled sup-
port training edges. For query edges, we use the
UD-English development set that contains 25,148
edges. For support edges, we use the UD-English
training set that contains 204,585 edges.
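A sketch of how N_k(x) can be computed (illustrative NumPy code with random vectors; the planted large-norm support vector mimics how a hub can dominate dot-product neighborhoods):

    import numpy as np
    from collections import Counter

    def k_occurrences(query_reprs, support_reprs, k=10):
        # N_k(x): how many times each support edge appears among the k nearest
        # neighbors (here, by dot product) of the query edges.
        counts = Counter()
        for q in query_reprs:
            top_k = np.argsort(-(support_reprs @ q))[:k]
            counts.update(top_k.tolist())
        return counts

    rng = np.random.default_rng(4)
    support = rng.normal(size=(200, 8))
    support[0] *= 20.0                      # one support edge with an unusually large norm
    queries = rng.normal(size=(50, 8))
    print(k_occurrences(queries, support).most_common(3))   # the large-norm edge acts as a hub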
Table 9 shows the highest N10 support training edges. In the case of the system WId, the unlabeled support edge ⟨ROOT, find⟩ appeared 19,407
System ID   sim     N10      Instance
WId         dot   19,407     ⟨ROOT, find⟩
WIc         cos       82     ⟨help, time⟩
IId         dot   22,493     ⟨said, airlifted⟩
IIc         cos       34     ⟨force, Israel⟩

Table 9: Examples of the highest N10 unlabeled support training edges in UD-English.
Figur 2: Ranking of the 100 highest N10 unlabeled
support training edges in UD-English.
times in the 10 nearest neighbors of the 25,148
query edges. A similar tendency was observed in
the instance-based system using the dot product
IId. In contrast, in the case of the systems using
the cosine similarity, WIc and IIc, it was not ob-
served that specific support edges were retrieved
so often. In Figure 2, we plot the top 100 support
training edges in terms of N10 with log10 scale.
The N10 distributions of the systems using the dot
product, WId and IId, look very skewed; that is,
hubs emerge. This indicates that when using the
dot product, a small number of specific support
training edges appear in the nearest neighbors so
often, regardless of query edges.
To sum up, systems using instance-based infer-
ence and the dot product are often in trouble with
hubs and have difficulty retrieving plausible sup-
port edges for predictions. The occurrence of hubs
is likely to be related to the norms of edge rep-
resentations since L2-normalization for the edges
in the cosine similarity tends to suppress hubs’
occurrence. We leave a more detailed analysis of
the cause of hubs’ occurrence for future work.
7 Conclusion
We have developed instance-based neural dependency parsing systems, each of which consists of our edge identification model and our label classification model (Section 4). We have ana-
lyzed them from the perspectives of the prediction
accuracy and the explanation plausibility. Der erste
analysis shows that our instance-based systems achieve competitive accuracy with weight-based neural ones (Section 6.1). The second indicates that our instance-based systems using the cosine similarity (L2-normalization for edge representations) meet the minimal requirement of plausible explanations (Section 6.2). The additional analysis reveals that when using the dot product, hubs emerge, which degrades the plausibility (Section 6.3). One interesting future direc-
tion is investigating the cause of hubs’ occurrence
in more detail. Another direction is using the
learned edge representations in downstream tasks,
such as semantic textual similarity.
Acknowledgments
The authors are grateful to the anonymous re-
viewers and the Action Editor who provided
many insightful comments that improved the paper. Special thanks also go to the members of
Tohoku NLP Laboratory for the interesting com-
ments and energetic discussions. The work of H.
Ouchi was supported by JSPS KAKENHI grant
number 19K20351. The work of J. Suzuki was
supported by JST Moonshot R&D grant num-
ber JPMJMS2011 (fundamental research) Und
JSPS KAKENHI grant number 19H04162. Der
work of S. Yokoi was supported by JST ACT-X
grant number JPMJAX200S, Japan. The work of
T. Kuribayashi was supported by JSPS KAK-
ENHI grant number 20J22697. The work of M.
Yoshikawa was supported by JSPS KAKENHI
grant number 20K23314. The work of K. Inui
was supported by JST CREST grant number
JPMJCR20D2, Japan.
References
David W. Aha, Dennis Kibler, and Marc K. Albert.
1991. Instance-based learning algorithms. Ma-
chine Learning, 6(1):37–66. https://doi
.org/10.1007/BF00153759, https://
doi.org/10.1023/A:1022689900470
Alan Akbik and Yunyao Li. 2016. K-SRL:
Instance-based learning for semantic role labeling.
In Proceedings of COLING, pages 599–608.
David Baehrens, Timon Schroeter, Stefan
Harmeling, Motoaki Kawanabe, Katja Hansen,
and Klaus-Robert Müller. 2010. How to
explain individual classification decisions.
Journal of Machine Learning Research,
11(Jun):1803–1831.
Aur´elien Bellet, Amaury Habrard, and Marc
Sebban. 2013. A survey on metric learning
for feature vectors and structured data. arXiv
preprint arXiv:1306.6709.
Rens Bod. 2009. From exemplar to grammar: A
probabilistic analogy-based model of language
learning. Cognitive Science, 33(5):752–793.
https://doi.org/10.1111/j.1551-6709
.2009.01031.X, Pubmed: 21585486
Antoine Bordes, Nicolas Usunier, Alberto
Garcia-Duran,
Jason Weston, and Oksana
Yakhnenko. 2013. Translating embeddings for
modeling multi-relational data. Proceedings of
NIPS, 26, 2787–2795.
Rich Caruana, Hooshang Kangarloo, John David
Dionisio, Usha Sinha, and David Johnson.
1999. Case-based explanation of non-case-
based learning methods. In Proceedings of the
AMIA Symposium, page 212.
Kevin Clark, Minh-Thang Luong, Christopher D.
Manning, and Quoc Le. 2018. Semi-supervised
sequence modeling with cross-view training. In
Proceedings of EMNLP, pages 1914–1925.
Walter Daelemans and Antal Van den Bosch.
2005. Memory-based Language Processing,
Cambridge University Press.
Walter Daelemans, Sabine Buchholz, and Jorn
Veenstra. 1999. Memory-based shallow pars-
ing. In EACL 1999: CoNLL-99 Computational
Natural Language Learning.
Walter Daelemans, Jakub Zavrel, Peter Berck, Und
Steven Gillis. 1996. MBT: A memory-based
part of speech tagger-generator. In Proceedings
of Fourth Workshop on Very Large Corpora.
Fien De Meulder and Walter Daelemans. 2003.
Memory-based named entity recognition us-
In Proceedings of
ing unannotated data.
HLT-NAACL, pages 208–211. https://doi
.org/10.3115/1119176.1119211
Jiankang Deng, Jia Guo, Niannan Xue, Und
Stefanos Zafeiriou. 2019. Arcface: Additive
angular margin loss for deep face recognition.
In Proceedings of CVPR, pages 4690–4699.
https://doi.org/10.1109/CVPR.2019
.00482
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Und
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
Verständnis. In Proceedings of NAACL-HLT,
pages 4171–4186.
Timothy Dozat and Christopher D. Manning.
2017. Deep biaffine attention for neural depen-
dency parsing. In Proceedings of ICLR.
Jerome H. Friedman. 1994. Flexible metric nearest
neighbor classification. Technical report, Stan-
ford University. http://citeseerx.ist
.psu.edu/viewdoc/summary?doi=10.1
.1.31.2959
Jacob Goldberger, Geoffrey E. Hinton, Sam T.
Roweis, and Ruslan R. Salakhutdinov. 2005.
Neighbourhood components analysis. In Pro-
ceedings of NIPS, pages 513–520.
Edouard Grave, Piotr Bojanowski, Prakhar Gupta,
Armand Joulin, and Tomas Mikolov. 2018.
Learning word vectors for 157 languages. In
Proceedings of LREC.
Alan Graves, Navdeep Jaitly, and Abdel-rahman
Mohamed. 2013. Hybrid speech recognition
with deep bidirectional LSTM. In Proceedings
of Automatic Speech Recognition and Un-
derstanding (ASRU), 2013 IEEE Workshop.
https://doi.org/10.1109/ASRU.2013
.6707742
Kelvin Guu, Kenton Lee, Zora Tung, Panupong
Pasupat, and Ming-Wei Chang. 2020. Realm:
Retrieval-augmented language model pre-
Ausbildung. arXiv preprint arXiv:2002.08909.
Raia Hadsell, Sumit Chopra, and Yann LeCun.
2006. Dimensionality reduction by learning an
invariant mapping. In Proceedings of CVPR,
Volumen 2, pages 1735–1742. IEEE.
Kazuaki Hanawa, Sho Yokoi, Satoshi Hara, Und
Kentaro Inui. 2021. Evaluation of similarity-
based explanations. In Proceedings of ICLR.
Kazuma Hashimoto, Caiming Xiong, Yoshimasa
Tsuruoka, and Richard Socher. 2017. A joint
many-task model: Growing a neural network for
multiple NLP tasks. In Proceedings of EMNLP,
pages 1923–1933. https://doi.org/10
.18653/v1/D17-1206
Martin Haspelmath, Matthew S. Dryer, David Gil,
and Bernard Comrie. 2005. The World Atlas of
Language Structures. Oxford University Press.
Trevor Hastie and Robert Tibshirani. 1996.
Discriminant adaptive nearest neighbor classifi-
cation and regression. In Proceedings of NIPS,
pages 409–415.
Iris Hendrickx and Antal van den Bosch. 2003.
Memory-based one-step named-entity recog-
Nation: Effects of seed list features, classifier
stacking, and unannotated data. In Proceedings
of CoNLL, pages 176–179. https://doi
.org/10.3115/1119176.1119203
Elad Hoffer and Nir Ailon. 2015. Deep metric
learning using triplet network. In International
Workshop on Similarity-Based Pattern Recog-
Nation, pages 84–92. Springer. https://doi
.org/10.1007/978-3-319-24261-3 7
Tao Ji, Yuanbin Wu, and Man Lan. 2019.
Graph-based dependency parsing with graph neural networks. In Proceedings of ACL, pages 2475–2485. https://doi.org/10
.18653/v1/P19-1237
Urvashi Khandelwal, Angela Fan, Dan Jurafsky,
Luke Zettlemoyer, and Mike Lewis. 2020.
Nearest neighbor machine translation. arXiv
preprint arXiv:2010.00710.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky,
Luke Zettlemoyer, and Mike Lewis. 2019.
Generalization through memorization: Nearest
neighbor language models. In Proceedings of
ICLR.
D. P. Kingma and J. Ba. 2014. Adam: A method
for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Janet L. Kolodneer. 1991. Improving human
decision making through case-based decision
aiding. AI Magazine, 12(2):52–52.
Sandra Kübler. 2004. Memory-based Parsing,
Volumen 7. https://doi.org/10.1075
/nlp.7
Sandra Kübler, Ryan McDonald, and Joakim
Nivre. 2009. Dependency parsing. Synthesis
Lectures on Human Language Technologies,
1(1):1–127. https://doi.org/10.2200
/S00169ED1V01Y200901HLT002
Artur Kulmizev, Miryam de Lhoneux, Johannes
Gontrum, Elena Fano, and Joakim Nivre.
2019. Deep contextualized word embeddings in transition-based and graph-based dependency parsing - a tale of two parsers revisited. In Proceedings of EMNLP-IJCNLP,
pages 2755–2768. https://doi.org/10
.18653/v1/D19-1277
Michael Lebowitz. 1983. Memory-based pars-
ing. Artificial
Intelligence, 21(4):363–404.
https://doi.org/10.1016/S0004-3702
(83)80019-8
Tao Lei, Regina Barzilay, and Tommi Jaakkola.
2016. Rationalizing neural predictions. In Proceedings of EMNLP, pages 107–117.
Patrick Lewis, Ethan Perez, Aleksandara Piktus,
Fabio Petroni, Vladimir Karpukhin, Naman
Goyal, Heinrich K¨uttler, Mike Lewis, Wen-tau
Yih, Tim Rockt¨aschel, Sebastian Riedel, Und
Douwe Kiela. 2020. Retrieval-augmented gen-
eration for knowledge-intensive NLP tasks.
arXiv preprint arXiv:2005.11401.
Miryam de Lhoneux, Sara Stymne, and Joakim
Nivre. 2017. Old school vs. new school: Com-
paring transition-based parsers with and without
neural network enhancement. In The 15th Tree-
banks and Linguistic Theories Workshop (TLT),
pages 99–110.
Weiyang Liu, Yandong Wen, Zhiding Yu,
Ming Li, Bhiksha Raj, and Le Song. 2017.
Sphereface: Deep hypersphere embedding for
face recognition. In Proceedings of CVPR,
pages 212–220.
Mitchell P. Marcus, Beatrice Santorini, Und
Mary Ann Marcinkiewicz. 1993. Building
a large annotated corpus of English: Der
Penn Treebank. Computerlinguistik,
19(2):313–330. https://doi.org/10.21236
/ADA273556
Ryan McDonald, Kevin Lerman, and Fernando
Pereira. 2006. Multilingual dependency anal-
ysis with a two-stage discriminative parser.
In Proceedings of CoNLL-X, pages 216–220.
https://doi.org/10.3115/1596276
.1596317
Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal Dependency
annotation for multilingual parsing. In Pro-
ceedings of ACL, pages 92–97.
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of HLT-EMNLP, pages 523–530. https://doi.org/10.3115/1220575.1220641
Makoto Nagao. 1984. A framework of a mechan-
ical translation between Japanese and English
by analogy principle.
Maximilian Nickel, Lorenzo Rosasco,
Und
Tomaso Poggio. 2016. Holographic embed-
dings of knowledge graphs. In Proceedings of
AAAI, Volumen 30.
Joakim Nivre. 2003. An efficient algorithm for
projective dependency parsing. In Proceedings
von
the Eighth International Conference on
Parsing Technologies, pages 149–160.
Joakim Nivre, Johan Hall, and Jens Nilsson.
2004. Memory-based dependency parsing. In
Proceedings of CoNLL, pages 49–56.
Aaron van den Oord, Yazhe Li, and Oriol
Vinyals. 2018. Representation learning with
contrastive predictive coding. arXiv preprint
arXiv:1807.03748.
Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi,
Sho Yokoi, Tatsuki Kuribayashi, Ryuto Konno,
and Kentaro Inui. 2020. Instance-based learning
of span representations: A case study through
named entity recognition. In Proceedings of
ACL, pages 6452–6459. https://doi.org
/10.18653/v1/2020.acl-main.575
Nicolas Papernot and Patrick McDaniel. 2018.
Deep k-nearest neighbors: Towards confident,
interpretable and robust deep learning. arXiv
preprint arXiv:1803.04765.
Razvan Pascanu, Tomas Mikolov, and Yoshua
Bengio. 2013. On the difficulty of training
recurrent neural networks. In Proceedings of
ICML, pages 1310–1318.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543. https://doi.org/10.3115/v1/D14-1162
Gregory Plumb, Denali Molitor, and Ameet S.
Talwalkar. 2018. Model agnostic supervised
Milos Radovanovic, Alexandros Nanopoulos, Und
Mirjana Ivanovic. 2010. Hubs in space: Pop-
ular nearest neighbors in high-dimensional
Daten. Journal of Machine Learning Research,
11(sept):2487–2531.
Rajeev Ranjan, Carlos D Castillo, and Rama
Chellappa. 2017. L2-constrained softmax loss
for discriminative face verification. arXiv
preprint arXiv:1703.09507.
Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2016. Why should I trust you?: Ex-
plaining the predictions of any classifier. In
Proceedings of KDD, pages 1135–1144.
Erik F. Tjong Kim Sang. 2002. Memory-based
shallow parsing. Journal of Machine Learning
Forschung, 2:559–594.
Remko Scha, Rens Bod, and Khalil Sima’An.
1999. A memory-based model of syntac-
tic analysis: data-oriented parsing. Zeitschrift für
Experimental & Theoretical Artificial Intelli-
gence, 11(3):409–440. https://doi.org
/10.1080/095281399146481
Dominik Schnitzer, Arthur Flexer, Markus Schedl,
and Gerhard Widmer. 2012. Local and global
scaling reduce hubs in space. Journal of Ma-
chine Learning Research, 13(10).
R. Short and Keinosuke Fukunaga. 1981. Der
optimal distance measure for nearest neighbor
classification. IEEE transactions on Informa-
tion Theory, 27(5):622–627. https://doi
.org/10.1109/TIT.1981.1056403
Aaron Smith, Miryam de Lhoneux, Sara Stymne,
and Joakim Nivre. 2018. An investigation of
the interactions between pre-trained word em-
beddings, character models and pos tags in de-
pendency parsing. In Proceedings of EMNLP,
pages 2711–2720. https://doi.org/10
.18653/v1/D18-1291
Kihyuk Sohn. 2016. Improved deep metric learn-
ing with multi-class n-pair loss objective. In
Proceedings of NIPS, pages 1857–1865.
Eiichiro Sumita and Hitoshi Iida. 1991. Ex-
periments and prospects of example-based
maschinelle Übersetzung. In Proceedings of ACL,
pages 185–192.
Yifan Sun, Changmao Cheng, Yuhan Zhang,
Chi Zhang, Liang Zheng, Zhongdao Wang,
and Yichen Wei. 2020. Circle loss: A unified
perspective of pair similarity optimization.
In Proceedings of CVPR, pages 6398–6407.
https://doi.org/10.1109/CVPR42600
.2020.00643
Erik F. Tjong Kim Sang. 2002. Memory-
based named entity recognition. In Proceedings
of CoNLL. https://doi.org/10.3115
/1118853.1118878
Jorn Veenstra, Antal Van den Bosch, Sabine Buchholz, Walter Daelemans, and Jakub Zavrel. 2000. Memory-based word sense disambiguation. Computers and the Humanities, 34(1):171–177.
Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji,
Dihong Gong, Jingchao Zhou, Zhifeng Li, Und
Wei Liu. 2018. Cosface: Large margin cosine
loss for deep face recognition. In Proceedings
of CVPR, pages 5265–5274.
Jiang Wang, Yang Song, Thomas Leung, Chuck
Rosenberg, Jingbin Wang, James Philbin, Bo
Chen, and Ying Wu. 2014. Learning fine-
grained image similarity with deep ranking.
In Proceedings of CVPR, pages 1386–1393.
https://doi.org/10.1109/CVPR.2014
.180
Xun Wang, Xintong Han, Weilin Huang,
Dengke Dong, and Matthew R. Scott. 2019.
Multi-similarity loss with general pair weight-
ing for deep metric learning. In Proceedings
of CVPR, pages 5022–5030. https://doi
.org/10.1109/CVPR.2019.00516
Sam Wiseman and Karl Stratos. 2019. Label-agnostic sequence labeling by copying nearest neighbors. In Proceedings of ACL, pages 5363–5369. https://doi.org/10.18653/v1/P19-1533
Eric Xing, Michael Jordan, Stuart J. Russell,
and Andrew Ng. 2002. Distance metric learn-
ing with application to clustering with side-
Information. In Proceedings of NIPS, Volumen
15, pages 521–528.
Hiroyasu Yamada and Yuji Matsumoto. 2003.
Statistical dependency analysis with support
vector machines. In Proceedings of the Eighth
International Conference on Parsing Tech-
nologies, pages 195–206.
Bishan Yang, Wen-tau Yih, Xiaodong He,
Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. arXiv:1412.6575.
Xingxing Zhang, Jianpeng Cheng, and Mirella
Lapata. 2017. Dependency parsing as head selec-
tion. In Proceedings of EACL, pages 665–676.
Yu Zhang, Zhenghua Li, and Min Zhang. 2020.
Efficient second-order TreeCRF for neural de-
pendency parsing. In Proceedings of ACL,
pages 3295–3305. https://doi.org/10
.18653/v1/2020.acl-main.302