
Instance-Based Neural Dependency Parsing

Hiroki Ouchi1,3

Jun Suzuki2,3 Sosuke Kobayashi2,4 Sho Yokoi2,3

Tatsuki Kuribayashi2,5 Masashi Yoshikawa2,3 Kentaro Inui2,3
1 NAIST, Japan 2 Tohoku University, Japan 3 RIKEN, Japan
4 Preferred Networks, Inc., Japan 5 Langsmith, Inc., Japan
hiroki.ouchi@is.naist.jp, sosk@preferred.jp,
{jun.suzuki, yokoi, kuribayashi, yoshikawa, inui}@tohoku.ac.jp

Abstract

Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve competitive accuracy with standard neural models and have reasonable plausibility of instance-based explanations.

1 Introduction

While deep neural networks have improved prediction accuracy in various tasks, rationales underlying the predictions have been more challenging for humans to understand (Lei et al., 2016). In practical applications, interpretable rationales play a crucial role in driving humans' decisions and promoting human–machine cooperation (Ribeiro et al., 2016). From this perspective, the utility of instance-based learning (Aha et al., 1991), a traditional machine learning method, has been realized again (Papernot and McDaniel, 2018).

Instance-based learning is a method that learns similarities between training instances and infers a value or class for a test instance on the basis of similarities against the training instances. On the one hand, standard neural models encode all the knowledge in the parameters, making it challenging to determine what knowledge is stored and used for predictions (Guu et al., 2020). On the other hand, models with instance-based inference explicitly use training instances for predictions and can exhibit the instances that significantly contribute to the predictions. The instances play a role in an explanation to the question: why did the model make such a prediction? This type of explanation is called instance-based explanation (Caruana et al., 1999; Baehrens et al., 2010; Plumb et al., 2018), which facilitates users' understanding of model predictions and allows users to make decisions with higher confidence (Kolodneer, 1991; Ribeiro et al., 2016).

It is not trivial to combine neural networks with instance-based inference processes while keeping high prediction accuracy. Recent studies in image recognition seek to develop such methods (Wang et al., 2014; Hoffer and Ailon, 2015; Liu et al., 2017; Wang et al., 2018; Deng et al., 2019). This paradigm is called deep metric learning. Compared to image recognition, there are much fewer studies on deep metric learning in natural language processing (NLP). As a few exceptions, Wiseman and Stratos (2019) and Ouchi et al. (2020) developed neural models that have an instance-based inference process for sequence labeling tasks. They reported that their models have high explainability without sacrificing prediction accuracy.

As a next step from targeting consecutive tokens, we study instance-based neural models for relations between discontinuous elements. To correctly recognize relations, systems need to capture associations between elements. As an example of relation recognition, we address dependency parsing, where systems seek to recognize binary relations between tokens (hereafter edges). Traditionally, dependency parsers have been a useful tool for text analysis. An unstructured text of interest is parsed, and its structure leads users to a deeper understanding of the text. By successfully introducing instance-based models to dependency parsing, users can extract dependency edges along with similar edges as a rationale for the parse, which further helps the process of text analysis.


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

In this paper, we develop new instance-based neural models for dependency parsing, equipped with two inference modes: (i) explainable mode and (ii) fast mode. In the explainable mode, our models make use of similarities between the candidate edge and each edge in a training set. By looking at the similarities, users can quickly check which training edges significantly contribute to the prediction. In the fast mode, our models run as fast as standard neural models, while general instance-based models are much slower than standard neural models because of the dependence on the number of training instances. The fast mode is motivated by the actual situation: In many cases, users want only predictions, and when the predictions seem suspicious, they want to check the rationales. Thus, the fast mode does not offer rationales, but instead enables faster parsing that outputs exactly the same predictions as the explainable mode. Users can freely switch between the explainable and fast modes according to their purposes. This property is realized by taking advantage of the linearity of score computation in our models and avoids comparing a candidate edge to each training edge one by one for computing the score at test time (see Section 4.4 for details).

Our experiments on multilingual datasets show that our models can achieve competitive accuracy with standard neural models. Moreover, we shed light on the plausibility of instance-based explanations, which has been underinvestigated in dependency parsing. We verify whether our models meet a minimal requirement related to the plausibility (Hanawa et al., 2021). Additional analysis reveals the existence of hubs (Radovanovic et al., 2010), a small number of specific training instances that often appear as nearest neighbors, and that hubs have a terrible effect on the plausibility. Our main contributions are as follows:

• This is the first work to develop and study instance-based neural models1 for dependency parsing (Section 4);

• Our empirical results show that our instance-based models achieve competitive accuracy with standard neural models (Section 6.1);

• Our analysis reveals that L2-normalization for edge representations suppresses the hubs' occurrence and, as a result, succeeds in improving the plausibility of instance-based explanations (Sections 6.2 and 6.3).

1Our code is publicly available at https://github.com/hiroki13/instance-based-dependency-parsing.

2 Related Work

2.1 Dependency Parsing

There are two major paradigms for dependency parsing (Kübler et al., 2009): (i) the transition-based paradigm (Nivre, 2003; Yamada and Matsumoto, 2003) and (ii) the graph-based paradigm (McDonald et al., 2005). Recent literature often adopts the graph-based paradigm and achieves high accuracy (Dozat and Manning, 2017; Zhang et al., 2017; Hashimoto et al., 2017; Clark et al., 2018; Ji et al., 2019; Zhang et al., 2020). The first-order edge-factored models under this paradigm factorize the score of a dependency tree into independent scores of single edges (McDonald et al., 2005). The score of each edge is computed on the basis of its edge feature. This decomposable property is preferable for our work because we want to model similarities between single edges. Thus, we adopt the basic framework of the first-order edge-factored models for our instance-based models.

2.2 Instance-Based Methods in NLP

Traditionally, instance-based methods (memory-based learning) have been applied to a variety of NLP tasks (Daelemans and Van den Bosch, 2005), such as part-of-speech tagging (Daelemans et al., 1996), NER (Tjong Kim Sang, 2002; De Meulder and Daelemans, 2003; Hendrickx and van den Bosch, 2003), partial parsing (Daelemans et al., 1999; Sang, 2002), phrase-structure parsing (Lebowitz, 1983; Scha et al., 1999; Kübler, 2004; Bod, 2009), word sense disambiguation (Veenstra et al., 2000), semantic role labeling (Akbik and Li, 2016), and machine translation (MT) (Nagao, 1984; Sumita and Iida, 1991).

Nivre et al. (2004) proposed an instance-based (memory-based) method for transition-based dependency parsing. The subsequent actions of a transition-based parser are selected at each step by comparing the current parser configuration to each of the configurations in the training set. Here, each parser configuration is treated as an instance and plays a role of rationales for predicted actions but


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

not for predicted edges. Generally, parser configurations are not directly mapped to each predicted edge one by one, so it is troublesome to interpret which configurations significantly contribute to edge predictions. By contrast, since we adopt the graph-based paradigm, our models can naturally treat each edge as an instance and exhibit similar edges as rationales for edge predictions.

2.3 Instance-Based Neural Methods in NLP

Most of the studies above were published before the current deep learning era. Very recently, instance-based methods have been revisited and combined with neural models in language modeling (Khandelwal et al., 2019), MT (Khandelwal et al., 2020), and question answering (Lewis et al., 2020). They augment a main neural model with a non-parametric sub-module that retrieves auxiliary objects, such as similar tokens and documents. Guu et al. (2020) proposed to parameterize and learn the sub-module for a target task.

These studies assume a different setting from ours. There is no ground-truth supervision signal for retrieval in their setting, so they adopt non-parametric approaches or indirectly train the sub-module to help a main neural model from the supervision signal of the target task. In our setting, the main neural model plays a role in retrieval and is directly trained with ground-truth objects (annotated dependency edges). Thus, our findings and insights are orthogonal to theirs.

2.4 Deep Metric Learning

Our work can be categorized into deep metric
learning research in terms of the methodologi-
cal perspective. Although the origins of metric
learning can be traced back to some earlier work
(Short and Fukunaga, 1981; Friedman et al., 1994;
Hastie and Tibshirani, 1996), the pioneering work
is Xing et al. (2002).2 Since then, many methods
using neural networks for metric learning have
been proposed and studied.

Deep metric learning methods can be categorized into two classes from the training loss perspective (Sun et al., 2020): (i) learning with class-level labels and (ii) learning with pair-wise labels. Given class-level labels, the first one learns to classify each training instance to its target class with a classification loss, for example, Neighbourhood Component Analysis (NCA) (Goldberger et al., 2005), L2-constrained softmax loss (Ranjan et al., 2017), SphereFace (Liu et al., 2017), CosFace (Wang et al., 2018), and ArcFace (Deng et al., 2019). Given pair-wise labels, the second one learns pair-wise similarity (the similarity between a pair of instances), for example, contrastive loss (Hadsell et al., 2006), triplet loss (Wang et al., 2014; Hoffer and Ailon, 2015), N-pair loss (Sohn, 2016), and multi-similarity loss (Wang et al., 2019). Our method is categorized into the first group because it adopts a classification loss (Section 4).

2If you would like to know the history of metric learning in more detail, please read Bellet et al. (2013).

2.5 Neural Models Closely Related to Ours

Among the metric learning methods above, NCA (Goldberger et al., 2005) shares the same spirit as our models. In this framework, models learn to map instances with the same label to the neighborhood in a feature space. Wiseman and Stratos (2019) and Ouchi et al. (2020) developed NCA-based neural models for sequence labeling. We discuss the differences between their models and ours later in more detail (Section 4.5).

3 Dependency Parsing Framework

We adopt a two-stage approach (McDonald et al.,
2006; Zhang et al., 2017): we first identify depen-
dency edges (unlabeled dependency parsing) y
then classify the identified edges (labeled depen-
dency parsing). More specifically, we solve edge
identification as head selection and solve edge
classification as multi-class classification.3

3.1 Edge Identification

To identify unlabeled edges, we adopt the head selection approach (Zhang et al., 2017), in which a model learns to select the correct head of each token in a sentence. This simple approach enables us to train accurate parsing models in a GPU-friendly way. We learn the representation for each edge to be discriminative for identifying correct heads.

Let x = (x0, x1, . . . , xT ) denote a tokenized
input sentence, where x0 is a special ROOT token

3Although some previous studies adopt multi-task learn-
ing methods for edge identification and classification tasks
(Dozat and Manning, 2017; Hashimoto et al., 2017), we inde-
pendently train a model for each task because the interaction
effects produced by multi-task learning make it challenging
to analyze models’ behaviors.


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

and x1, . . . , xT are the original T tokens, and ⟨xj, xi⟩ denotes an edge from head token xj to dependent token xi. We define the probability of token xj being the head of token xi in the sentence x as:

    P(x_j \mid x_i) = \frac{\exp\bigl(s_{\mathrm{head}}(x_j, x_i)\bigr)}{\sum_{k=0}^{T} \exp\bigl(s_{\mathrm{head}}(x_k, x_i)\bigr)}.    (1)

Here, the scoring function s_head can be any neural network-based scoring function (see Section 4.1). At inference time, we choose the most likely head ŷ_i for each token x_i:4

    \hat{y}_i = \arg\max_{x_k : 0 \le k \le T} P(x_k \mid x_i).    (2)

At training time, we minimize the negative log-likelihood of the training data:

    L = - \sum_{n=1}^{|D|} \sum_{i=1}^{T^{(n)}} \log P\bigl(y_i^{(n)} \mid x_i^{(n)}\bigr).    (3)

Here, D = \{x^{(n)}, y^{(n)}, r^{(n)}\}_{n=1}^{|D|} is a training set, where x_i^{(n)} ∈ x^{(n)} is each input token, y_i^{(n)} ∈ y^{(n)} is the gold (ground-truth) head for token x_i^{(n)}, and r_i^{(n)} ∈ r^{(n)} is the label for edge ⟨y_i^{(n)}, x_i^{(n)}⟩.
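As an illustration of the head-selection formulation, the following sketch (our own, not the authors' released code) computes the probabilities of Eq. 1 and the per-sentence loss of Eq. 3 from a precomputed score matrix; the names score_matrix and gold_heads are hypothetical stand-ins for the s_head values and the gold annotations.

    import numpy as np

    def head_selection_nll(score_matrix, gold_heads):
        """Negative log-likelihood for head selection (Eqs. 1 and 3).

        score_matrix: (T+1, T+1) array; score_matrix[i, j] = s_head(x_j, x_i),
                      the score of token x_j being the head of token x_i.
        gold_heads:   length-T list; gold_heads[i-1] is the gold head index of x_i.
        """
        # Softmax over all candidate heads (including ROOT) for each dependent (Eq. 1).
        shifted = score_matrix - score_matrix.max(axis=1, keepdims=True)
        probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
        # Sum the negative log-probabilities of the gold heads (Eq. 3, one sentence).
        return -sum(np.log(probs[i, h]) for i, h in enumerate(gold_heads, start=1))

    # Toy usage: a 3-token sentence plus ROOT (index 0).
    scores = np.random.randn(4, 4)
    print(head_selection_nll(scores, gold_heads=[0, 1, 1]))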

3.2 Label Classification

We adopt a simple multi-class classification approach for labeling each unlabeled edge. We define the probability that each of all possible labels r ∈ R will be assigned to the edge ⟨xj, xi⟩:

    P(r \mid x_j, x_i) = \frac{\exp\bigl(s_{\mathrm{label}}(r, x_j, x_i)\bigr)}{\sum_{r' \in R} \exp\bigl(s_{\mathrm{label}}(r', x_j, x_i)\bigr)}.    (4)

Here, the scoring function s_label can be any neural network-based scoring function (see Section 4.2). At inference time, we choose the most likely class label from the set of all possible labels R:

    \hat{r} = \arg\max_{r \in R} P(r \mid \hat{y}_i, x_i).    (5)

Here, ŷ_i is the head token identified by a head selection model.

4While this greedy formulation has no guarantee to produce well-formed trees, we can produce well-formed ones by using the Chu-Liu-Edmonds algorithm in the same way as Zhang et al. (2017). In this work, we would like to focus on the representation for each edge and evaluate the goodness of the learned edge representations one by one. With such a motivation, we adopt the greedy formulation.

At training time, we minimize the negative log-likelihood of the training data:

    L = - \sum_{n=1}^{|D|} \sum_{i=1}^{T^{(n)}} \log P\bigl(r_i^{(n)} \mid y_i^{(n)}, x_i^{(n)}\bigr).    (6)

Here, r_i^{(n)} ∈ R is the gold (ground-truth) relation label for gold edge ⟨y_i^{(n)}, x_i^{(n)}⟩.

4 Instance-Based Scoring Methods

For the scoring functions in Eqs. 1 and 4, we describe our proposed instance-based models.

4.1 Edge Scoring

We would like to assign a higher score to the correct edge than to other candidates (Eq. 1). Here, we compute similarities between each candidate edge and ground-truth edges in a training set (hereafter training edges). By summing the similarities, we then obtain a score that indicates how likely the candidate edge is to be the correct one.

Specifically, we first construct a set of training edges, called the support set, A(D):

    A(D) = \{\langle y_i, x_i \rangle \mid x_i \in x,\ y_i \in y,\ (x, y, r) \in D\}.    (7)

Here, y_i is the ground-truth head token of token x_i. We then compute and sum similarities between a candidate edge and each edge in the support set:

    s_{\mathrm{head}}(x_j, x_i) = \sum_{\langle x_\ell, x_k \rangle \in A(D)} \mathrm{sim}\bigl(h_{\langle j,i \rangle}, h_{\langle \ell,k \rangle}\bigr).    (8)

Here, h_{⟨j,i⟩}, h_{⟨ℓ,k⟩} ∈ R^d are d-dimensional edge representations (Section 4.3), and sim is a similarity function. Following recent studies of deep metric learning, we adopt the dot product and the cosine similarity:

    \mathrm{sim}_{\mathrm{dot}}(a, b) = a^\top b, \qquad \mathrm{sim}_{\cos}(a, b) = \tau\, a_u^\top b_u, \quad a_u = a / \lVert a \rVert, \ b_u = b / \lVert b \rVert.

The cosine similarity is the same as the dot product between two unit vectors: that is, ‖a_u‖ = ‖b_u‖ = 1. As we will discuss later in Section 6.3, this property suppresses the occurrence of hubs, compared with the dot product between unnormalized vectors. Note that, following the previous studies of deep metric learning


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

(Wang y cols., 2018; Deng et al., 2019), we rescale
the cosine similarity by using the scaling fac-
tor τ (0 ≤ τ ), which works as the temperature
parameter in the softmax function.5

4.2 Label Scoring

Similarly to the scoring function above, we also design our instance-based label scoring function s_label in Eq. 4. We first construct a support set A(D; r) for each relation label r ∈ R:

    A(D; r) = \{\langle y_i, x_i \rangle \mid x_i \in x,\ y_i \in y,\ r_i \in r,\ (x, y, r) \in D,\ r_i = r\}.    (9)

Here, only the edges with label r are collected from the training set. We then compute and sum similarities between a candidate edge and each edge of the support set:

    s_{\mathrm{label}}(r, x_j, x_i) = \sum_{\langle x_\ell, x_k \rangle \in A(D; r)} \mathrm{sim}\bigl(h_{\langle j,i \rangle}, h_{\langle \ell,k \rangle}\bigr).    (10)

The intuition is as follows: if the edge is more similar to the edges with label r than to those with other labels, the edge is more likely to have the label r.
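To make the instance-based scoring concrete, here is a minimal sketch (ours, with hypothetical variable names) of Eqs. 8 and 10: the edge and label scores are sums of similarities between a candidate edge vector and the support edge vectors, using either the dot product or the rescaled cosine similarity.

    import numpy as np

    def sim(a, b, kind="cos", tau=64.0):
        """Dot-product or rescaled cosine similarity between two edge vectors."""
        if kind == "dot":
            return float(a @ b)
        return float(tau * (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)))

    def edge_score(h_query, support_vectors, kind="cos"):
        """s_head in Eq. 8: sum of similarities to every edge in A(D)."""
        return sum(sim(h_query, h_k, kind) for h_k in support_vectors)

    def label_scores(h_query, support_by_label, kind="cos"):
        """s_label in Eq. 10: one summed similarity per relation label r."""
        return {r: sum(sim(h_query, h_k, kind) for h_k in vecs)
                for r, vecs in support_by_label.items()}

    # Toy usage with random 8-dimensional edge representations.
    rng = np.random.default_rng(0)
    support = [rng.normal(size=8) for _ in range(5)]
    by_label = {"obj": support[:3], "det": support[3:]}
    query = rng.normal(size=8)
    print(edge_score(query, support), label_scores(query, by_label))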

4.3 Edge Representation

In the proposed models (Eqs. 8 and 10), we use d-dimensional edge representations. We define the representation for each edge ⟨xj, xi⟩ as follows:

    h_{\langle j,i \rangle} = f\bigl(h_i^{\mathrm{dep}}, h_j^{\mathrm{head}}\bigr).    (11)

Here, h^{dep}, h^{head} ∈ R^d are d-dimensional feature vectors for the dependent and head, respectively. These vectors are created from a neural encoder (Section 5.2). When designing f, it is desirable to capture the interaction between the two vectors. By referring to the insights into feature representations of relations on knowledge bases (Bordes et al., 2013; Yang et al., 2015; Nickel et al., 2016), we adopt a multiplicative composition, a major composition technique for two vectors:6

    f\bigl(h_i^{\mathrm{dep}}, h_j^{\mathrm{head}}\bigr) := W \bigl(h_i^{\mathrm{dep}} \odot h_j^{\mathrm{head}}\bigr).
5In our preliminary experiments, we set τ by selecting a value from {16, 32, 64, 128}. As a result, whichever we chose, the prediction accuracy was stably better than with τ = 1.
6In our preliminary experiments, we also tried an additive composition and the concatenation of the two vectors. The accuracies achieved by these techniques for unlabeled dependency parsing, however, were both about 20%, which is much inferior to that of the multiplicative composition.

Here, the interaction between h_i^{dep} and h_j^{head} is captured by the element-wise multiplication ⊙. The two vectors are composed into one vector, which is then transformed by a weight matrix W ∈ R^{d×d} into h_{⟨j,i⟩}.
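A minimal sketch of Eq. 11 with the multiplicative composition (our illustration; W, h_dep, and h_head are randomly initialized stand-ins for learned parameters and encoder outputs):

    import numpy as np

    def edge_representation(h_dep, h_head, W):
        """Eq. 11: h_<j,i> = W (h_dep element-wise-times h_head)."""
        return W @ (h_dep * h_head)

    d = 8
    rng = np.random.default_rng(1)
    W = rng.normal(size=(d, d))                            # learned in practice
    h_dep, h_head = rng.normal(size=d), rng.normal(size=d)
    print(edge_representation(h_dep, h_head, W).shape)     # (8,)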

4.4 Fast Mode

Do users want rationales for all the predictions? Maybe not. In many cases, all they want to do is to parse sentences as fast as possible. Only when they find a suspicious prediction will they check the rationale for it. To fulfill this demand, our parser provides two modes: (i) explainable mode and (ii) fast mode. The explainable mode, as described in the previous subsections, enables exhibiting similar training instances as rationales, but its time complexity depends on the size of the training set. By contrast, the fast mode does not provide rationales, but instead enables faster parsing than the explainable mode and outputs exactly the same predictions as the explainable mode. Thus, at test time, users can freely switch between the modes: for example, they first use the fast mode, and if they find a suspicious prediction, then they use the explainable mode to obtain the rationale for it.

Formally, if using the dot product or cosine similarity as the similarity function in Eq. 8, the explainable mode can be rewritten as the fast mode:

    s_{\mathrm{head}}(x_j, x_i)
      = \sum_{\langle x_\ell, x_k \rangle \in A(D)} h_{\langle j,i \rangle}^\top h_{\langle \ell,k \rangle}
      = h_{\langle j,i \rangle}^\top \sum_{\langle x_\ell, x_k \rangle \in A(D)} h_{\langle \ell,k \rangle}
      = h_{\langle j,i \rangle}^\top h^{\mathrm{sum}}_{A(D)},    (12)

where h^{sum}_{A(D)} := \sum_{\langle x_\ell, x_k \rangle \in A(D)} h_{\langle \ell,k \rangle}. In this way, once you sum all the vectors in the training set h_{⟨ℓ,k⟩} (⟨x_ℓ, x_k⟩ ∈ A(D)), you can reuse the summed vector without searching the training set again. At test time, you can precompute this summed vector h^{sum}_{A(D)} before running the model on a test set, which reduces the exhaustive similarity computation over the training set to a simple dot product between two vectors, h_{⟨j,i⟩}^⊤ h^{sum}_{A(D)}.7
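The following sketch (ours, assuming dot-product similarity; for the cosine case the support vectors would be pre-normalized) illustrates why the two modes give identical predictions: the support vectors are summed once, and the fast-mode score is a single dot product with that sum (Eq. 12).

    import numpy as np

    def precompute_support_sum(support_vectors):
        """h_sum_A(D): sum all support edge vectors once, before test time."""
        return support_vectors.sum(axis=0)

    def explainable_head_score(h_query, support_vectors):
        """Explainable mode (Eq. 8 with dot product): keeps per-edge similarities."""
        sims = support_vectors @ h_query     # one similarity per training edge
        return sims.sum(), sims

    def fast_head_score(h_query, h_support_sum):
        """Fast mode (Eq. 12): one dot product instead of |A(D)| comparisons."""
        return h_query @ h_support_sum

    rng = np.random.default_rng(2)
    support = rng.normal(size=(1000, 8))     # 1,000 training edge vectors
    query = rng.normal(size=8)
    total, sims = explainable_head_score(query, support)
    assert np.allclose(total, fast_head_score(query, precompute_support_sum(support)))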

4.5 Relations to Existing Models

The Closest Models to Ours. Wiseman and
Stratos (2019) and Ouchi et al. (2020) propuesto

7In the same way as Eq. 12, we can transform slabel in

ecuación. 10 to the fast mode.


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

an instance-based model using Neighbourhood Components Analysis (NCA) (Goldberger et al., 2005) for sequence labeling. Given an input sentence of T tokens, x = (x_1, . . . , x_T), the model first computes the probability that a token (or span) x_i ∈ x in the sentence selects each of all the tokens in the training set x_j ∈ x_D as its neighbor:

    P(x_j \mid x_i) = \frac{\exp\bigl(\mathrm{sim}(x_j, x_i)\bigr)}{\sum_{x_k \in x_D} \exp\bigl(\mathrm{sim}(x_k, x_i)\bigr)}.

The model then constructs a set of only the tokens x_j associated with a label y, X(x_D; y) = \{x_j \mid x_j \in x_D,\ y_j = y\}, and computes the probability that each token x_i will be assigned a label y:

    P(y \mid x_i) = \sum_{x_j \in X(x_D; y)} P(x_j \mid x_i).

The point is that while our models first sum the similarities (Eq. 8) and then put the summed score into exponential form as exp(s_head(x_j, x_i)), their model puts each similarity into exponential form as exp(sim(x_k, x_i)) before the summation. The different order of applying the exponential function makes it impossible to rewrite their model as the fast mode, so their model always has to compare a token x_i to each of the training set x_j ∈ x_D. This is the biggest difference between their model and ours. While we leave the performance comparison between the NCA-based models and ours for future work, our models have an advantage over the NCA-based models in that our models offer two options, the explainable and fast modes.

Standard Models Using Weights. Typically, neural models use the following scoring functions:

    s_{\mathrm{head}}(x_j, x_i) = w^\top h_{\langle j,i \rangle},    (13)
    s_{\mathrm{label}}(r, x_j, x_i) = w_r^\top h_{\langle j,i \rangle}.    (14)

Here, w ∈ R^d is a learnable weight vector and w_r ∈ R^d is a learnable weight vector associated with label r ∈ R. In previous work (Zhang et al., 2017), this form is used for dependency parsing. We call such models weight-based models. Caruana et al. (1999) proposed combining weight-based models with instance-based inference: At inference time, the weights are discarded, and only the trained encoder is used to extract

Language    Treebank     Family    Order   Train
Arabic      PADT         non-IE    VSO      6.1k
Basque      BDT          non-IE    SOV      5.4k
Chinese     GSD          non-IE    SVO      4.0k
English     EWT          IE        SVO     12.5k
Finnish     TDT          non-IE    SVO     12.2k
Hebrew      HTB          non-IE    SVO      5.2k
Hindi       HDTB         IE        SOV     13.3k
Italian     ISDT         IE        SVO     13.1k
Japanese    GSD          non-IE    SOV      7.1k
Korean      GSD          non-IE    SOV      4.4k
Russian     SynTagRus    IE        SVO     48.8k
Swedish     Talbanken    IE        SVO      4.3k
Turkish     IMST         non-IE    SOV      3.7k

Table 1: Dataset information. "Family" indicates whether the language is Indo-European (IE) or not. "Order" indicates dominant word order according to WALS (Haspelmath et al., 2005). "Train" is the number of training sentences.

feature representations for instance-based inference. Such a combination has been reported to be effective for image recognition (Ranjan et al., 2017; Liu et al., 2017; Wang et al., 2018; Deng et al., 2019). In dependency parsing, there has been no investigation of it. Since such a combination can be a promising method, we investigate its utility (Section 6).

5 Experimental Setup

5.1 Data

We use the English Penn Treebank (PTB) (Marcus et al., 1993) and Universal Dependencies (UD) (McDonald et al., 2013). Following previous studies (Kulmizev et al., 2019; Smith et al., 2018; de Lhoneux et al., 2017), we choose a variety of 13 languages8 from UD v2.7. Table 1 shows information about each dataset. We follow the standard training-development-test splits.

5.2 Neural Encoder Architecture

To compute h^dep and h^head (in Eq. 11), we adopt the encoder architecture proposed by Dozat and Manning (2017). First, we map the input sequence

8These languages have been selected by considering the
perspectives of different language families, different mor-
phological complexity, different training sizes, and domains.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3


x = (x_0, . . . , x_T)9 to a sequence of token representations, h^{token}_{0:T} = (h^{token}_0, . . . , h^{token}_T), each of which is h^{token}_t = [e_t; c_t; b_t], where e_t, c_t, and b_t are computed by word embeddings,10 a character-level CNN, and BERT (Devlin et al., 2019),11 respectively. Second, the sequence h^{token}_{0:T} is fed to a bidirectional LSTM (BiLSTM) (Graves et al., 2013) for computing contextual ones: h^{lstm}_{0:T} = (h^{lstm}_0, . . . , h^{lstm}_T) = BiLSTM(h^{token}_{0:T}). Finally, h^{lstm}_t ∈ R^{2d} is transformed as h^{dep}_t = W^{dep} h^{lstm}_t and h^{head}_t = W^{head} h^{lstm}_t, where W^{dep} ∈ R^{d×2d} and W^{head} ∈ R^{d×2d} are parameter matrices.
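As a small illustration of only the final step of this encoder (our sketch; the BiLSTM state below is a random stand-in), each 2d-dimensional contextual vector is projected into separate d-dimensional dependent and head feature vectors:

    import numpy as np

    def dep_head_vectors(h_lstm_t, W_dep, W_head):
        """Project one 2d-dimensional BiLSTM state into the d-dimensional
        dependent and head feature vectors used in Eq. 11."""
        return W_dep @ h_lstm_t, W_head @ h_lstm_t

    d = 8
    rng = np.random.default_rng(3)
    W_dep, W_head = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, 2 * d))
    h_lstm_t = rng.normal(size=2 * d)        # stand-in for a BiLSTM output state
    h_dep_t, h_head_t = dep_head_vectors(h_lstm_t, W_dep, W_head)
    print(h_dep_t.shape, h_head_t.shape)     # (8,) (8,)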

5.3 Mini-Batching

We train models with the mini-batch stochastic
gradient descent method. To make the current
mini-batch at each time step, we follow a stan-
dard technique for training instance-based models
(Hadsell et al., 2006; Oord et al., 2018).

At training time, we make a mini-batch that consists of query and support sentences at each time step. A model encodes the sentences and the edge representations used for computing similarities between each candidate edge in the query sentences and each gold edge in the support sentences. Here, due to the memory limitation of GPUs, we randomly sample a subset from the training set at each time step: that is, (x^{(n)}, y^{(n)}, r^{(n)}) ∼ Uniform(D). In edge identification, for query sentences, we randomly sample a subset D'_q of N sentences from D. For support sentences, we randomly sample a subset D'_s of M sentences from D, and construct and use the support set A(D'_s) instead of A(D) in Eq. 7. In label classification, we would like to guarantee that the support set in every mini-batch always contains at least one edge for each label. To do so, we randomly sample a subset A'(D; r) of U edges from the support set for each label r: that is, ⟨y_i^{(n)}, x_i^{(n)}⟩ ∼ Uniform(A(D; r)) in Eq. 9. Note that each edge ⟨y_i^{(n)}, x_i^{(n)}⟩ is in the n-th sentence x^{(n)} in the training set D, so we put the sentence

9We use the gold tokenized sequences in PTB and UD.
10For PTB, we use 300-dimensional GloVe (Pennington et al., 2014). For UD, we use 300-dimensional fastText (Grave et al., 2018). During training, we fix them.
11We first conduct subword segmentation for each token of the input sequence. Then, the BERT encoder takes as input the subword-segmented sequences and computes the representation for each subword. Here, we use the (last-layer) representation of the first subword within each token as its token representation. For PTB, we use "BERT-Base, Cased." For UD, we use "BERT-Base, Multilingual Cased."

Name                  Value
Word Embedding        GloVe (PTB) / fastText (UD)
BERT                  BERT-Base
CNN window size       3
CNN filters           30
BiLSTM layers         2
BiLSTM units          300 dimensions
Optimization          Adam
Learning rate         0.001
Rescaling factor τ    64
Dropout ratio         {0.1, 0.2, 0.3}

Mesa 2: Hyperparameters used in the experiments.

x^{(n)} into the mini-batch to compute the representation for ⟨y_i^{(n)}, x_i^{(n)}⟩. In fact, we use N = 32 query sentences in both edge identification and label classification, M = 10 support sentences in edge identification,12 and U = 1 support edge (sentence) for each label in label classification.13

At test time, we encode each test (query) sentence and compute the representation for each candidate edge on-the-fly. The representation is then compared to the precomputed support edge representation, h^{sum}_{A(D)} in Eq. 12. To precompute h^{sum}_{A(D)}, we first encode all the training sentences and obtain the edge representations. Then, in edge identification, we sum all of them and obtain one support edge representation h^{sum}_{A(D)}. In label classification, similarly to h^{sum}_{A(D)}, we sum only the edge representations with label r and obtain one support representation for each label, h^{sum}_{A(D;r)}.14

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

5.4 Training Configuration

Table 2 lists the hyperparameters. To optimize the parameters, we use Adam (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.999. The initial learning rate is η0 = 0.001 and is updated on each epoch as ηt = η0/(1 + ρt), where ρ = 0.05 and t is the number of epochs completed. The gradient clipping value is 5.0 (Pascanu et al., 2013). The number of training epochs is 100. We save the parameters that achieve the best score on each development set and evaluate them on each test set. It takes less than one day to train on a single GPU, an NVIDIA DGX-1 with Tesla V100.

12As a result, the whole mini-batch size is 32 + 10 = 42.
13When U = 1, the whole mini-batch size is 32 + |R|.
14The total number of support edge representations is equal to the size of the label set |R|.


Learning      Weight-based   Weight-based  Weight-based  Weight-based    Weight-based    Instance-based  Instance-based
Inference     Weight-based   Weight-based  Weight-based  Instance-based  Instance-based  Instance-based  Instance-based
Similarity    dot            cos           dot           cos             dot             cos             dot
System ID     Kulmizev+'19   WWc           WWd           WIc             WId             IIc             IId
PTB-English   –              96.4/95.3     96.4/95.3     96.4/94.4       93.0/91.8       96.4/95.3       96.4/95.3
UD-Average    – /84.9        89.0/85.6     89.0/85.6     89.0/85.2       83.0/79.5       89.3/85.7       89.0/85.5
UD-Arabic     – /81.8        87.8/82.1     87.8/82.1     87.8/81.6       84.9/79.0       88.0/82.1       87.6/81.9
UD-Basque     – /79.8        84.9/81.1     84.9/80.9     84.9/80.6       82.0/77.9       85.1/80.9       85.0/80.8
UD-Chinese    – /83.4        85.6/82.3     85.8/82.4     85.7/81.6       80.9/77.3       86.3/82.8       85.9/82.5
UD-English    – /87.6        90.9/88.1     90.7/88.0     90.9/87.8       88.1/85.3       91.1/88.3       91.0/88.2
UD-Finnish    – /83.9        89.4/86.6     89.1/86.3     89.3/86.1       84.1/81.2       89.6/86.6       89.4/86.4
UD-Hebrew     – /85.9        89.4/86.4     89.5/86.5     89.4/85.9       82.7/79.7       89.8/86.7       89.6/86.6
UD-Hindi      – /90.8        94.8/91.7     94.8/91.7     94.8/91.4       91.4/88.0       94.9/91.8       94.9/91.6
UD-Italian    – /91.7        94.1/92.0     94.2/92.1     94.1/91.9       91.5/89.4       94.3/92.2       94.1/92.0
UD-Japanese   – /92.1        94.3/92.8     94.5/93.0     94.3/92.7       92.5/90.9       94.6/93.1       94.4/92.8
UD-Korean     – /84.2        88.0/84.4     87.9/84.3     88.0/84.2       84.3/80.4       88.1/84.4       88.2/84.5
UD-Russian    – /91.0        94.2/92.7     94.1/92.7     94.2/92.4       57.7/56.5       94.3/92.8       94.1/92.6
UD-Swedish    – /86.9        90.3/87.6     90.3/87.5     90.4/87.1       88.6/85.8       90.5/87.5       90.4/87.5
UD-Turkish    – /64.9        73.0/65.3     73.2/65.4     73.1/64.5       69.9/61.9       73.7/65.5       72.9/64.7

Table 3: Comparison between weight-based and instance-based systems. Cells show unlabeled attachment scores (UAS) before the slash and labeled attachment scores (LAS) after the slash on each test set. System IDs stand for the first letters of the options: e.g., WId stands for "W"eight-based learning and "I"nstance-based inference using the "d"ot product. The system ID Kulmizev+'19 is the graph-based parser with BERT in Kulmizev et al. (2019).

6 Results and Discussion

6.1 Prediction Accuracy on
Benchmark Tests

We report averaged unlabeled attachment scores (UAS) and labeled attachment scores (LAS) across three different runs of the model training with random seeds. We compare six systems, each of which consists of two models for edge identification and label classification, respectively. For reference, we list the results by the graph-based parser with BERT in Kulmizev et al. (2019), whose architecture is the most similar to ours.

Table 3 shows UAS and LAS for these systems. The systems WWd and WWc are the standard ones that consistently use the weight-based scores (Eqs. 13 and 14) during learning and inference. Between these systems, the difference in the similarity functions does not make a gap in the accuracies. In other words, the dot product and the cosine similarity are on par in terms of accuracy.
The systems WIc and WId use the weight-based scores during learning and the instance-based ones during inference. While the system WIc using cos achieved competitive UAS and LAS with those of the standard weight-based system WWc, the system WId using dot achieved lower accuracies than those of the system WWd. The systems IId and IIc consistently use the instance-based scores during learning and inference. Both of them succeeded in keeping competitive accuracies with those of the standard weight-based ones, WWd and WWc.

Out-of-Domain Robustness. We evaluate the robustness of our instance-based models in out-of-domain settings by using the five domains of UD-English: we train each model on the training set of the source domain "Yahoo! Answers" and test it on each test set of the target domains, Emails, Newsgroups, Reviews, and Weblogs. As Table 4 shows, the out-of-domain robustness of our instance-based models is comparable to that of the weight-based models. This tendency is observed when using different source domains.

Sensitivity of M for Inference. In the experiments above, we used all the training sentences as support sentences at test time. What if we reduce the number of support sentences? Here, in the


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

              Weight-Based       Instance-Based
              dot      cos       dot      cos
              WWd      WWc       IId      IIc
Emails        81.7     81.7      81.6     81.4
Newsgroups    83.1     83.3      83.1     82.9
Reviews       88.5     88.7      88.7     88.8
Weblogs       81.9     80.9      80.9     81.9
Average       83.8     83.7      83.6     83.8

Table 4: UAS in out-of-domain settings, where each model is trained on the source domain "Yahoo! Answers" and tested on each of the four target domains.

              M = 1    M = 10   M = 100   ALL
Emails        81.5     81.4     81.5      81.5
Newsgroups    82.8     83.0     82.9      82.9
Reviews       88.7     88.7     88.8      88.8
Weblogs       81.8     82.1     82.0      81.9
Average       83.7     83.8     83.8      83.8

Table 5: UAS by the instance-based system using the cosine similarity (IIc) and randomly sampled M support training sentences.

same out-of-domain settings above, we evaluate the instance-based system using the cosine similarity, IIc, with M support sentences randomly sampled at each time step. Intuitively, if using a smaller number of randomly sampled support sentences (e.g., M = 1), the prediction accuracies would drop. Surprisingly, however, Table 5 shows that the accuracies do not drop even when reducing M. This tendency is observed when using the other three systems, WId, WIc, and IId. One possible reason for this is that the feature space is appropriately learned: that is, because positive edges are close to each other and far from negative edges in the feature space, the accuracies do not drop even if randomly sampling a single support sentence and using its edges.

6.2 Sanity Check for Plausible Explanations

It is an open question how to evaluate the ‘‘plausi-
bility’’ of explanations: eso es, whether or not the
retrieved instances as explanations are convincing
for humans. As a reasonable compromise, Hanawa
et al. (2021) designed the identical subclass test
for evaluating the plausibility. This test is based on
a minimal requirement that interpretable models

Figure 1: Valid (✓) and invalid (✗) examples of unlabeled edges for the identical subclass test.

should at least satisfy: training instances to be presented as explanations should belong to the same latent (sub)class as the test instance. Consider the examples in Figure 1. The predicted unlabeled edge "wrote → novels" in the test sentence has the (unobserved) latent label obj. For this edge, two training instances are given as explanations: The above one seems more convincing than the below one because "published → books" has the same latent label, obj, as that of "wrote → novels," while "novels → the" has a different one, det. As these examples show, the agreement between the latent classes is likely to correlate with plausibility. Note that this test is not perfect for the plausibility assessment, but it works as a sanity check for verifying whether models make obvious violations in terms of plausibility.

This test can be used for assessing unlabeled parsing models because the (unobserved) relation labels can be regarded as the latent subclasses of positive unlabeled edges. We follow three steps: (i) identifying unlabeled edges in a development set; (ii) retrieving the nearest training edge for each identified edge; and (iii) calculating LAS, that is, if the labels of the query and retrieved edges are identical, we regard them as correct.15
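A minimal sketch of this procedure (ours; the vectors and labels below are toy stand-ins for parsed development edges and gold-labeled training edges) retrieves the nearest support edge by cosine similarity and counts a hit when the two edges share a relation label:

    import numpy as np

    def identical_subclass_score(query_vecs, query_labels, support_vecs, support_labels):
        """Fraction of query edges whose nearest support edge has the same label."""
        q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
        s = support_vecs / np.linalg.norm(support_vecs, axis=1, keepdims=True)
        nearest = (q @ s.T).argmax(axis=1)          # index of the nearest support edge
        hits = sum(ql == support_labels[j] for ql, j in zip(query_labels, nearest))
        return hits / len(query_labels)

    rng = np.random.default_rng(4)
    q_vecs, s_vecs = rng.normal(size=(6, 8)), rng.normal(size=(20, 8))
    q_labels = ["obj", "det", "obj", "nsubj", "det", "obj"]
    s_labels = [str(rng.choice(["obj", "det", "nsubj"])) for _ in range(20)]
    print(identical_subclass_score(q_vecs, q_labels, s_vecs, s_labels))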

Table 6 shows LAS on PTB and UD-English. The systems using instance-based inference with the cosine similarity, WIc and IIc, succeeded in retrieving support training edges with the same label as the queries. Surprisingly, the system IIc achieved over 70% LAS on PTB without label supervision. The results suggest that systems using instance-based inference with the cosine similarity
15If the parsed edge is incorrect, we regard it as incorrect.


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

              Weight-Based      Instance-Based
              dot      cos      dot      cos
System ID     WId      WIc      IId      IIc
PTB-English   1.8      67.5     7.0      71.6
UD-English    16.4     51.5     3.9      54.0

Table 6: Results of the identical subclass test. Each cell indicates labeled attachment scores (LAS) on each development set. All the models are trained with head selection supervision and without labeling supervision.

Table 7: Examples of support edges retrieved by the instance-based system using the cosine similarity (IIc). The first query-support pair has the same (unobserved) label, mark. The query of the second pair has nmod:poss, although the support has det.

meet the minimal requirement, and the retrieved edges are promising as plausible explanations.

To facilitate an intuitive understanding of model behaviors, we show actual examples of the retrieved support edges in Table 7. As the first query-support pair shows, for query edges whose head or dependent is a function word (e.g., if), the training edges with the same (unobserved) label tend to be retrieved. On the other hand, as the second pair shows, for queries whose head is a noun (e.g., appeal), the edges whose head is also a noun (e.g., food) tend to be retrieved regardless of different latent labels.

6.3 Geometric Analysis on Feature Spaces

The identical subclass test suggests a big differ-
ence between the feature spaces learned by using
the dot product and the cosine similarity. Here we
look into them in more detail.

6.3.1 Observation of Nearest Neighbors

First, we look into training edges retrieved as nearest support ones. Specifically, we use the edges

Table 8: Examples of unlabeled support training edges retrieved by the WId system (weight-based learning and instance-based inference with the dot product) for each query. Regardless of the very different queries, the same support edge was retrieved.

in the UD-English development set as queries and retrieve the top-k similar support edges from the UD-English training set. Table 8 shows examples retrieved by the WId system. Here, the same support edge, ⟨ROOT, find⟩, was retrieved for the different queries ⟨Jennifer, Anderson⟩ and ⟨all, after⟩. As this indicates, when using the dot product as the similarity function, a small number of specific edges are extremely often selected as support ones for any queries. Such edges are called hubs (Radovanovic et al., 2010). This phenomenon is not desirable for users in terms of the plausible interpretation of predictions. If a system always exhibits the same training instance(s) as rationales for predictions, users are likely to doubt the system's validity.

6.3.2 Quantitative Measurement of Hubness

Second, we quantitatively measure the hubness of each system. Specifically, for the hubness, we measure the k-occurrences of an instance x, N_k(x) (Radovanovic et al., 2010; Schnitzer et al., 2012). In the case of our dependency parsing experiments, N_k(x) indicates the number of times each support training edge x occurs among the k nearest neighbors of all the query edges. The support training edges with an extremely high N_k value can be regarded as hubs. In this study, we set k = 10 and measure N_10(x) of unlabeled support training edges. For query edges, we use the UD-English development set, which contains 25,148 edges. For support edges, we use the UD-English training set, which contains 204,585 edges.
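A minimal sketch (ours, with random stand-ins for the query and support edge vectors) of how N_k(x) can be computed: count how often each support edge falls in the k nearest neighbors of the query edges.

    import numpy as np

    def k_occurrences(query_vecs, support_vecs, k=10):
        """N_k(x): number of times each support edge appears among the
        k nearest neighbors (here by dot product) of the query edges."""
        sims = query_vecs @ support_vecs.T                 # (n_query, n_support)
        topk = np.argsort(-sims, axis=1)[:, :k]            # k nearest per query
        return np.bincount(topk.ravel(), minlength=support_vecs.shape[0])

    rng = np.random.default_rng(5)
    queries, supports = rng.normal(size=(100, 8)), rng.normal(size=(500, 8))
    n10 = k_occurrences(queries, supports, k=10)
    print(n10.max(), n10.argmax())   # an extremely large maximum indicates a hub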

Table 9 shows the support training edges with the highest N_10. In the case of the system WId, the unlabeled support edge ⟨ROOT, find⟩ appeared 19,407

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

System ID   sim   N10       Instance
WId         dot   19,407    ⟨ROOT, find⟩
WIc         cos   82        ⟨help, time⟩
IId         dot   22,493    ⟨said, airlifted⟩
IIc         cos   34        ⟨force, Israel⟩

Table 9: Examples of the highest-N10 unlabeled support training edges in UD-English.

Cifra 2: Ranking of the 100 highest N10 unlabeled
support training edges in UD-English.

times among the 10 nearest neighbors of the 25,148 query edges. A similar tendency was observed in the instance-based system using the dot product, IId. By contrast, in the case of the systems using the cosine similarity, WIc and IIc, it was not observed that specific support edges were retrieved so often. In Figure 2, we plot the top 100 support training edges in terms of N_10 on a log10 scale. The N_10 distributions of the systems using the dot product, WId and IId, look very skewed; that is, hubs emerge. This indicates that when using the dot product, a small number of specific support training edges appear in the nearest neighbors very often, regardless of query edges.

To sum up, systems using instance-based inference and the dot product are often in trouble with hubs and have difficulty retrieving plausible support edges for predictions. The occurrence of hubs is likely to be related to the norms of edge representations, since L2-normalization for the edges in the cosine similarity tends to suppress hubs' occurrence. We leave a more detailed analysis of the cause of hubs' occurrence for future work.

7 Conclusion

We have developed instance-based neural dependency parsing systems, each of which consists of our edge identification model and our label classification model (Section 4). We have analyzed them from the perspectives of prediction accuracy and explanation plausibility. The first analysis shows that our instance-based systems achieve competitive accuracy with weight-based neural ones (Section 6.1). The second indicates that our instance-based systems using the cosine similarity (L2-normalization for edge representations) meet the minimal requirement of plausible explanations (Section 6.2). The additional analysis reveals that when using the dot product, hubs emerge, which degrades the plausibility (Section 6.3). One interesting future direction is investigating the cause of hubs' occurrence in more detail. Another direction is using the learned edge representations in downstream tasks, such as semantic textual similarity.

Acknowledgments

The authors are grateful to the anonymous reviewers and the Action Editor, who provided many insightful comments that improved the paper. Special thanks also go to the members of the Tohoku NLP Laboratory for the interesting comments and energetic discussions. The work of H. Ouchi was supported by JSPS KAKENHI grant number 19K20351. The work of J. Suzuki was supported by JST Moonshot R&D grant number JPMJMS2011 (fundamental research) and JSPS KAKENHI grant number 19H04162. The work of S. Yokoi was supported by JST ACT-X grant number JPMJAX200S, Japan. The work of T. Kuribayashi was supported by JSPS KAKENHI grant number 20J22697. The work of M. Yoshikawa was supported by JSPS KAKENHI grant number 20K23314. The work of K. Inui was supported by JST CREST grant number JPMJCR20D2, Japan.

References

David W. Aha, Dennis Kibler, and Marc K. Albert. 1991. Instance-based learning algorithms. Machine Learning, 6(1):37–66. https://doi.org/10.1007/BF00153759, https://doi.org/10.1023/A:1022689900470


Alan Akbik and Yunyao Li. 2016. K-SRL:
Instance-based learning for semantic role labeling.
In Proceedings of COLING, pages 599–608.


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. 2010. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831.

Aur´elien Bellet, Amaury Habrard, and Marc
Sebban. 2013. A survey on metric learning
for feature vectors and structured data. arXiv
preprint arXiv:1306.6709.

Rens Bod. 2009. From exemplar to grammar: A probabilistic analogy-based model of language learning. Cognitive Science, 33(5):752–793. https://doi.org/10.1111/j.1551-6709.2009.01031.x, PubMed: 21585486

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of NIPS, 26, pages 2787–2795.

Rich Caruana, Hooshang Kangarloo, John David Dionisio, Usha Sinha, and David Johnson. 1999. Case-based explanation of non-case-based learning methods. In Proceedings of the AMIA Symposium, page 212.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of EMNLP, pages 1914–1925.

Walter Daelemans and Antal Van den Bosch. 2005. Memory-Based Language Processing. Cambridge University Press.

Walter Daelemans, Sabine Buchholz, and Jorn
Veenstra. 1999. Memory-based shallow pars-
En g. In EACL 1999: CoNLL-99 Computational
Natural Language Learning.

Walter Daelemans, Jakub Zavrel, Peter Berck, and Steven Gillis. 1996. MBT: A memory-based part of speech tagger-generator. In Proceedings of the Fourth Workshop on Very Large Corpora.

Fien De Meulder and Walter Daelemans. 2003. Memory-based named entity recognition using unannotated data. In Proceedings of HLT-NAACL, pages 208–211. https://doi.org/10.3115/1119176.1119211

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of CVPR, pages 4690–4699. https://doi.org/10.1109/CVPR.2019.00482

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Timothy Dozat and Christopher D. Manning.
2017. Deep biaffine attention for neural depen-
dency parsing. In Proceedings of ICLR.

Jerome H. Friedman. 1994. Flexible metric nearest neighbor classification. Technical report, Stanford University. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.2959

Jacob Goldberger, Geoffrey E. Hinton, Sam T. Roweis, and Ruslan R. Salakhutdinov. 2005. Neighbourhood components analysis. In Proceedings of NIPS, pages 513–520.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of LREC.

Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). https://doi.org/10.1109/ASRU.2013.6707742

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.

Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In Proceedings of CVPR, volume 2, pages 1735–1742. IEEE.

Kazuaki Hanawa, Sho Yokoi, Satoshi Hara, y
Kentaro Inui. 2021. Evaluation of similarity-
based explanations. In Proceedings of ICLR.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa
Tsuruoka, and Richard Socher. 2017. A joint
many-task model: Growing a neural network for
multiple NLP tasks. In Proceedings of EMNLP,
pages 1923–1933. https://doi.org/10
.18653/v1/D17-1206


yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Martin Haspelmath, Matthew S. Dryer, David Gil, and Bernard Comrie. 2005. The World Atlas of Language Structures. Oxford University Press.

Trevor Hastie and Robert Tibshirani. 1996.
Discriminant adaptive nearest neighbor classifi-
cation and regression. In Proceedings of NIPS,
pages 409–415.

Iris Hendrickx and Antal van den Bosch. 2003. Memory-based one-step named-entity recognition: Effects of seed list features, classifier stacking, and unannotated data. In Proceedings of CoNLL, pages 176–179. https://doi.org/10.3115/1119176.1119203

Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer. https://doi.org/10.1007/978-3-319-24261-3_7

Tao Ji, Yuanbin Wu, and Man Lan. 2019. Graph-based dependency parsing with graph neural networks. In Proceedings of ACL, pages 2475–2485. https://doi.org/10.18653/v1/P19-1237

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Nearest neighbor machine translation. arXiv preprint arXiv:2010.00710.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2019. Generalization through memorization: Nearest neighbor language models. In Proceedings of ICLR.

D. PAG. Kingma and J. Ba. 2014. Adán: A method
for stochastic optimization. arXiv preprint
arXiv:1412.6980.

Janet L. Kolodneer. 1991. Improving human
decision making through case-based decision
aiding. AI Magazine, 12(2):52–52.

Sandra Kübler. 2004. Memory-based Parsing, volume 7. https://doi.org/10.1075/nlp.7

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies, 1(1):1–127. https://doi.org/10.2200/S00169ED1V01Y200901HLT002

Artur Kulmizev, Miryam de Lhoneux, Johannes Gontrum, Elena Fano, and Joakim Nivre. 2019. Deep contextualized word embeddings in transition-based and graph-based dependency parsing – a tale of two parsers revisited. In Proceedings of EMNLP-IJCNLP, pages 2755–2768. https://doi.org/10.18653/v1/D19-1277

Michael Lebowitz. 1983. Memory-based parsing. Artificial Intelligence, 21(4):363–404. https://doi.org/10.1016/S0004-3702(83)80019-8

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of EMNLP, pages 107–117.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401.

Miryam de Lhoneux, Sara Stymne, and Joakim
Nivre. 2017. Old school vs. new school: Com-
paring transition-based parsers with and without
neural network enhancement. In The 15th Tree-
banks and Linguistic Theories Workshop (TLT),
pages 99–110.

Weiyang Liu, Yandong Wen, Zhiding Yu,
Ming Li, Bhiksha Raj, and Le Song. 2017.
Sphereface: Deep hypersphere embedding for
face recognition. In Proceedings of CVPR,
pages 212–220.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330. https://doi.org/10.21236/ADA273556

Ryan McDonald, Kevin Lerman, and Fernando Pereira. 2006. Multilingual dependency analysis with a two-stage discriminative parser. In Proceedings of CoNLL-X, pages 216–220. https://doi.org/10.3115/1596276.1596317

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal Dependency
annotation for multilingual parsing. In Proceedings of ACL, pages 92–97.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of HLT-EMNLP, pages 523–530. https://doi.org/10.3115/1220575.1220641

Makoto Nagao. 1984. A framework of a mechan-
ical translation between Japanese and English
by analogy principle.

Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. 2016. Holographic embeddings of knowledge graphs. In Proceedings of AAAI, volume 30.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the Eighth International Conference on Parsing Technologies, pages 149–160.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of CoNLL, pages 49–56.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Ryuto Konno, and Kentaro Inui. 2020. Instance-based learning of span representations: A case study through named entity recognition. In Proceedings of ACL, pages 6452–6459. https://doi.org/10.18653/v1/2020.acl-main.575

Nicolas Papernot and Patrick McDaniel. 2018.
Deep k-nearest neighbors: Towards confident,
interpretable and robust deep learning. arXiv
preprint arXiv:1803.04765.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of ICML, pages 1310–1318.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543. https://doi.org/10.3115/v1/D14-1162

Gregory Plumb, Denali Molitor, and Ameet S. Talwalkar. 2018. Model agnostic supervised local explanations. In Proceedings of NIPS, pages 2515–2524.

Milos Radovanovic, Alexandros Nanopoulos, y
Mirjana Ivanovic. 2010. Hubs in space: Pop-
ular nearest neighbors in high-dimensional
datos. Journal of Machine Learning Research,
11(Sep):2487–2531.

Rajeev Ranjan, Carlos D. Castillo, and Rama
Chellappa. 2017. L2-constrained softmax loss
for discriminative face verification. arXiv
preprint arXiv:1703.09507.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of KDD, pages 1135–1144.

Erik F. Tjong Kim Sang. 2002. Memory-based
shallow parsing. Journal of Machine Learning
Investigación, 2:559–594.

Remko Scha, Rens Bod, and Khalil Sima'an. 1999. A memory-based model of syntactic analysis: data-oriented parsing. Journal of Experimental & Theoretical Artificial Intelligence, 11(3):409–440. https://doi.org/10.1080/095281399146481

Dominik Schnitzer, Arthur Flexer, Markus Schedl,
and Gerhard Widmer. 2012. Local and global
scaling reduce hubs in space. Journal of Ma-
chine Learning Research, 13(10).

R. Short and Keinosuke Fukunaga. 1981. The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, 27(5):622–627. https://doi.org/10.1109/TIT.1981.1056403

Aaron Smith, Miryam de Lhoneux, Sara Stymne, and Joakim Nivre. 2018. An investigation of the interactions between pre-trained word embeddings, character models and POS tags in dependency parsing. In Proceedings of EMNLP, pages 2711–2720. https://doi.org/10.18653/v1/D18-1291

Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of NIPS, pages 1857–1865.

Eiichiro Sumita and Hitoshi Iida. 1991. Ex-
periments and prospects of example-based
machine translation. In Proceedings of ACL,
pages 185–192.

Yifan Sun, Changmao Cheng, Yuhan Zhang,
Chi Zhang, Liang Zheng, Zhongdao Wang,
and Yichen Wei. 2020. Circle loss: A unified
perspective of pair similarity optimization.
In Proceedings of CVPR, pages 6398–6407.
https://doi.org/10.1109/CVPR42600
.2020.00643

Erik F. Tjong Kim Sang. 2002. Memory-based named entity recognition. In Proceedings of CoNLL. https://doi.org/10.3115/1118853.1118878

Jorn Veenstra, Antal van den Bosch, Sabine Buchholz, Walter Daelemans, and Jakub Zavrel. 2000. Memory-based word sense disambiguation. Computers and the Humanities, 34(1):171–177.

Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. 2018. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of CVPR, pages 5265–5274.

Jiang Wang, Yang Song, Thomas Leung, Chuck
Rosenberg, Jingbin Wang, James Philbin, Bo
Chen, and Ying Wu. 2014. Learning fine-
grained image similarity with deep ranking.
In Proceedings of CVPR, pages 1386–1393.
https://doi.org/10.1109/CVPR.2014
.180

Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R. Scott. 2019. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of CVPR, pages 5022–5030. https://doi.org/10.1109/CVPR.2019.00516

Sam Wiseman and Karl Stratos. 2019. Label-agnostic sequence labeling by copying nearest neighbors. In Proceedings of ACL, pages 5363–5369. https://doi.org/10.18653/v1/P19-1533

Eric Xing, Michael Jordan, Stuart J. Russell, and Andrew Ng. 2002. Distance metric learning with application to clustering with side-information. In Proceedings of NIPS, volume 15, pages 521–528.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the Eighth International Conference on Parsing Technologies, pages 195–206.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.

Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2017. Dependency parsing as head selection. In Proceedings of EACL, pages 665–676.

Yu Zhang, Zhenghua Li, and Min Zhang. 2020.
Efficient second-order TreeCRF for neural de-
pendency parsing. In Proceedings of ACL,
pages 3295–3305. https://doi.org/10
.18653/v1/2020.acl-main.302

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
3
9
1
9
7
9
2
5
5

/

/
t

yo

a
C
_
a
_
0
0
4
3
9
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1507Instance-Based Neural Dependency Parsing image

Descargar PDF