Unsupervised Compositionality Prediction of
Nominal Compounds

Silvio Cordeiro
Federal University of Rio Grande do Sul
and Aix Marseille University, CNRS, LIS
silvioricardoc@gmail.com

Aline Villavicencio
University of Essex and
Federal University of Rio Grande do Sul
alinev@gmail.com

Marco Idiart
Federal University of Rio Grande do Sul
marco.idiart@gmail.com

Carlos Ramisch
Aix Marseille University, CNRS, LIS
carlos.ramisch@lis-lab.fr

Nominal compounds such as red wine and nut case display a continuum of compositionality,
with varying contributions from the components of the compound to its semantics. This article
proposes a framework for compound compositionality prediction using distributional semantic
models, evaluating to what extent they capture idiomaticity compared to human judgments. For
evaluation, we introduce data sets containing human judgments in three languages: English,
French, and Portuguese. The results obtained reveal a high agreement between the model
predictions and human judgments, suggesting that these models are able to incorporate
information about idiomaticity. We also present an in-depth evaluation of various factors that
can affect prediction, such as model and corpus parameters and compositionality operations.
General crosslingual analyses reveal the impact of morphological variation and corpus size on
the ability of the models to predict compositionality, and show that a uniform combination of
the components yields the best results.

Submission received: 4 December 2017; revised version received: 22 June 2018; accepted for publication:
8 August 2018.

doi:10.1162/COLI_a_00341

© 2019 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license

1. Introduction

It is a universally acknowledged assumption that the meaning of phrases, expressions,
or sentences can be determined by the meanings of their parts and by the rules used


to combine them. Part of the appeal of this principle of compositionality1 is that it
implies that a meaning can be assigned even to a new sentence involving an unseen
combination of familiar words (Goldberg 2015). Indeed, for natural language processing
(NLP), this is an attractive way of linearly deriving the meaning of larger units from
their components, performing the semantic interpretation of any text.

For representing the meaning of individual words and their combinations in com-
putational systems, distributional semantic models (DSMs) have been widely used.
DSMs are based on Harris’ distributional hypothesis that the meaning of a word can be
inferred from the context in which it occurs (Harris 1954; Firth 1957). In DSMs, words
are usually represented as vectors that, to some extent, capture cooccurrence patterns
in corpora (Lin 1998; Landauer, Foltz, and Laham 1998; Mikolov et al. 2013; Baroni,
Dinu, and Kruszewski 2014). Evaluation of DSMs has focused on obtaining accurate
semantic representations for words, and state-of-the-art models are already capable of
obtaining a high level of agreement with human judgments for predicting synonymy or
similarity between words (Freitag et al. 2005; Camacho-Collados, Pilehvar, and Navigli
2015; Lapesa and Evert 2017) and for modeling syntactic and semantic analogies be-
tween word pairs (Mikolov, Yih, and Zweig 2013). These representations for individual
words can also be combined to create representations for larger units such as phrases,
sentences, and even whole documents, using simple additive and multiplicative vector
operations (Mitchell and Lapata 2010; Reddy, McCarthy, and Manandhar 2011; Mikolov
et al. 2013; Salehi, Cook, and Baldwin 2015), syntax-based lexical functions (Socher et al.
2012), or matrix and tensor operations (Baroni and Lenci 2010; Bride, Van de Cruys,
and Asher 2015). However, it is not clear to what extent this approach is adequate in
the case of idiomatic multiword expressions (MWEs). MWEs fall into a wide spectrum
of compositionality; that is, some MWEs are more compositional (e.g., olive oil) while
others are more idiomatic (Sag et al. 2002; Baldwin and Kim 2010). In the latter case, the
meaning of the MWE may not be straightforwardly related to the meanings of its parts,
creating a challenge for the principle of compositionality (e.g., snake oil as a product of
questionable benefit, not necessarily an oil and certainly not extracted from snakes).

In this article, we discuss approaches for automatically detecting to what extent
the meaning of an MWE can be directly computed from the meanings of its compo-
nent words, represented using DSMs. We evaluate how accurately DSMs can model
the semantics of MWEs with various levels of compositionality compared to human
judgments. Since MWEs encompass a large amount of related but distinct phenomena,
we focus exclusively on a subcategory of MWEs: nominal compounds. They represent
an ideal case study for this work, thanks to their relatively homogeneous syntax (as
opposed to other categories of MWEs such as verbal idioms) and their pervasiveness
in language. We assume that models able to predict the compositionality of nominal
compounds could be generalized to other MWE categories by addressing their vari-
ability in future work. Furthermore, to determine to what extent these approaches are
also adequate cross-lingually, we evaluate them in three languages: English, French, and
Portuguese.

Given that MWEs are frequent in languages (Sag et al. 2002), identifying idiomatic-
ity and producing accurate semantic representations for compositional and idiomatic
cases is of relevance to NLP tasks and applications that involve some form of semantic
processing, including semantic parsing (Hwang et al. 2010; Jagfeld and van der Plas
2015), word sense disambiguation (Finlayson and Kulkarni 2011; Schneider et al. 2016),

1 Attributed to Frege (1892/1960).


and machine translation (Ren et al. 2009; Carpuat and Diab 2010; Cap et al. 2015;
Salehi et al. 2015). Moreover, the evaluation of DSMs on tasks involving MWEs, such
as compositionality prediction, has the potential to drive their development towards
new directions.

The main hypothesis of our work is that, if the meaning of a compositional nominal
compound can be derived from a combination of its parts, this translates in DSMs
as similar vectors for a compositional nominal compound and for the combination of
the vectors of its parts using some vector operation, which we refer to as the composition
function. Conversely, we can use the lack of similarity between the nominal compound
vector representation and a combination of its parts to detect idiomaticity. Further-
more, we hypothesize that accuracy in predicting compositionality depends both on
the characteristics of the DSMs used to represent expressions and their components
and on the composition function adopted. Therefore, we have built 684 DSMs and
performed an extensive evaluation, involving over 9,072 analyses, investigating various
types of DSMs, their configurations, the corpora used to train them, and the composition
function used to build vectors for expressions.2

This article is structured as follows. Section 2 presents related work on distributional
semantics, compositionality prediction, and nominal compounds. Section 3 presents the
data sets created for our evaluation. Section 4 describes the compositionality prediction
framework, along with the composition functions which we evaluate. Section 5 spec-
ifies the experimental setup (corpora, DSMs, parameters, and evaluation measures).
Section 6 presents the overall results of the evaluated models. Sections 7 and 8 evaluate
the impact of DSM and corpus parameters, and of composition functions on composi-
tionality prediction. Section 9 discusses system predictions through an error analysis.
Section 10 summarizes our conclusions. Appendix A contains a glossary, Appendix B
presents extra sanity-check experiments, Appendix C contains the questionnaire used
for data collection, and Appendices D, E, and F list the compounds in the data sets.

2. Related Work

The literature on distributional semantics is extensive (Lin 1998; Turney and Pantel 2010;
Baroni and Lenci 2010; Mohammad and Hirst 2012), so we provide only a brief introduc-
tion here, underlining their most relevant characteristics to our framework (Section 2.1).
Then, we define compositionality prediction and discuss existing approaches, focusing
on distributional techniques for multiword expressions (Section 2.2). Our framework is
evaluated on nominal compounds, and we discuss their relevant properties (Section 2.3)
along with existing data sets for evaluating compositionality prediction (Section 2.4).

2 This article significantly extends and updates previous publications:

1. We consolidate the description of the data sets introduced in Ramisch et al. (2016) and
Ramisch, Cordeiro, and Villavicencio (2016) by adding details about data collection, filtering,
and results of a thorough analysis studying the correlation between compositionality and
related variables.

2. We extend the compositionality prediction framework described in Cordeiro, Ramisch, and
Villavicencio (2016) by adding and evaluating new composition functions and DSMs.

3. We extend the evaluation reported in Cordeiro et al. (2016) not only by adding Portuguese,
but also by evaluating additional parameters: corpus size, composition functions, and new
DSMs.


2.1 Distributional Semantic Models

Distributional semantic models (DSMs) use context information to represent the mean-
ing of lexical units as vectors. These vectors are built assuming the distributional
hypothesis, whose central idea is that the meaning of a word can be learned based
on the contexts where it appears—or, as popularized by Firth (1957), “you shall know a
word by the company it keeps.”

Formally, a DSM attempts to encode the meaning of each target word wi of a
vocabulary V as a vector of real numbers v(wi) in R|V|. Each component of v(wi) is a
function of the co-occurrence between wi and the other words in the vocabulary (its
contexts wc). This function can be simply a co-occurrence count c(wi, wc), or some mea-
sure of the association between wi and each wc, such as pointwise mutual information
(PMI, Church and Hanks [1990], Lin [1999]) or positive PMI (PPMI, Baroni, Dinu, and
Kruszewski [2014]; Levy, Goldberg, and Dagan [2015]).
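To make the association-based representation concrete, the following minimal sketch (an
illustration, not code from any of the cited works) turns raw co-occurrence counts into sparse
PPMI vectors; the toy counts and the min_count parameter are placeholders:

```python
import math
from collections import Counter

def ppmi_vectors(cooc, min_count=1):
    """Map raw co-occurrence counts c(w_i, w_c) to sparse PPMI vectors.

    `cooc` maps (target, context) pairs to counts; the result maps each
    target word to a {context: ppmi} dictionary (zero entries are omitted).
    """
    total = sum(cooc.values())
    target_marg = Counter()   # marginal counts of target words
    context_marg = Counter()  # marginal counts of context words
    for (t, c), n in cooc.items():
        target_marg[t] += n
        context_marg[c] += n

    vectors = {}
    for (t, c), n in cooc.items():
        if n < min_count:
            continue
        pmi = math.log2(n * total / (target_marg[t] * context_marg[c]))
        if pmi > 0:  # positive PMI keeps only above-chance associations
            vectors.setdefault(t, {})[c] = pmi
    return vectors

# toy counts, as if extracted with a sliding window over a corpus
cooc = Counter({("wine", "red"): 8, ("wine", "drink"): 5, ("wine", "case"): 1,
                ("nut", "case"): 4, ("nut", "tree"): 3})
print(ppmi_vectors(cooc)["wine"])
```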

In DSMs, co-occurrence can be defined as two words co-occurring in the same
document, sentence, or sentence fragment in a corpus. Intrasentential models are often
based on a sliding window; that is, a context word wc co-occurs within a certain window
of W words around the target wi. Alternatively, co-occurrence can also be based on
syntactic relations obtained from parsed corpora, where a context word wc appears
within specific syntactic relations with wi (Lin 1998; Padó and Lapata 2007; Lapesa and
Evert 2017).

The set of all vectors v(wi), ∀wi ∈ V can be represented as a sparse co-occurrence
matrix V × V → R. Given that most word pairs in this matrix co-occur rarely (if ever),
a threshold on the number of co-occurrences is often applied to discard irrelevant pairs.
Additionally, co-occurrence vectors can be transformed to have a significantly smaller
number of dimensions, converting vectors in R|V| into vectors in Rd, with d ≪ |V|.3
Two solutions are commonly employed in the literature. The first one consists in using
context thresholds, where all target–context pairs that do not belong to the top-d most
relevant pairs are discarded (Salehi, Cook, and Baldwin 2014; Padró et al. 2014b). The
second solution consists in applying a dimensionality reduction technique such as
singular value decomposition on the co-occurrence matrix where only the d largest
singular values are retained (Deerwester et al. 1990). Similar techniques focus on the
factorization of the logarithm of the co-occurrence matrix (Pennington, Socher, and
Manning 2014) and on alternative factorizations of the PPMI matrix (Salle, Villavicencio,
and Idiart 2016).

Alternatively, DSMs can be constructed by training a neural network to predict
target–context relationships. For instance, a network can be trained to predict a target
word wi among all possible words in V given as input a window of surrounding
context words. This is known as the continuous bag-of-words model. Conversely, the
network can try to predict context words for a target word given as input, and this is
known as the skip-gram model (Mikolov et al. 2013). In both cases, the network training
procedure allows encoding in the hidden layer semantic information about words as a
side effect of trying to solve the prediction task. The weight parameters that connect
the unity representing wi with the d-dimensional hidden layer are taken as its vector
representation v(wi).

There are a number of factors that may influence the ability of a DSM to accurately
learn a semantic representation. These include characteristics of the training corpus such

3 After dimensionality reduction, nowadays word vectors are often called word embeddings.


as size (Mikolov, Yih, and Zweig 2013) as well as frequency thresholds and filters (Ferret
2013; Padró et al. 2014b), genre (Lapesa and Evert 2014), preprocessing (Padó and
Lapata 2003, 2007), and type of context (window vs. syntactic dependencies) (Agirre
et al. 2009; Lapesa and Evert 2017). Characteristics of the model include the choice of
association and similarity measures (Curran and Moens 2002), dimensionality reduction
strategies (Van de Cruys et al. 2012), and the use of subsampling and negative sampling
techniques (Mikolov, Yih, and Zweig 2013). However, the particular impact of these
factors on the quality of the resulting DSM may be heterogeneous and depends on the
task and model (Lapesa and Evert 2014). Because there is no consensus about a single
optimal model that works for all tasks, we compare a variety of models (Section 5) to
determine which are best suited for our compositionality prediction framework.

2.2 Compositionality Prediction

Before adopting the principle of compositionality to determine the meaning of a larger
unit, such as a phrase or multiword expression (MWE), it is important to determine
whether it is idiomatic or not.4 This problem, known as compositionality prediction,
can be solved using methods that measure directly the extent to which an expression
is constructed from a combination of its parts, or indirectly via language-dependent
properties of MWEs linked to idiomaticity like the degree of determiner variability and
morphological flexibility (Fazly, Cook, and Stevenson 2009; Tsvetkov and Wintner 2012;
Salehi, Cook, and Baldwin 2015; Köper and Schulte im Walde 2016). In this article, we
focus on direct prediction methods in order to evaluate the target languages under sim-
ilar conditions. Nonetheless, this does not exclude the future integration of information
used by indirect prediction methods, as a complement to the methods discussed here.

For direct prediction methods, three ingredients are necessary. First, we need vector
representations of single-word meanings, such as those built using DSMs (Section 2.1).
Second, we need a mathematical model of how the compositional meaning of a phrase is
calculated from the meanings of its parts. Third, we need the compositionality measure
itself, which estimates the similarity between the compositionally constructed meaning
of a phrase and its observed meaning, derived from corpora. There are a number of
alternatives for each of the ingredients, and throughout this article we call a specific
choice of the three ingredients a compositionality prediction configuration.

Regarding the second ingredient, that is, the mathematical model of compositional
meaning, the most natural choice is the additive model (Mitchell and Lapata 2008). In
the additive model, the compositional meaning of a phrase w1w2 . . . wn is calculated as
a linear combination of the word vectors of its components: ∑i βi v(wi), where v(wi) is a
d-dimensional vector for each word wi, and the βi coefficients assign different weights
to the representation of each word (Reddy, McCarthy, and Manandhar 2011; Schulte
im Walde, Müller, and Roller 2013; Salehi, Cook, and Baldwin 2015). These weights
can capture the asymmetric contribution of each of the components to the semantics
of the whole phrase (Bannard, Baldwin, and Lascarides 2003; Reddy, McCarthy, and
Manandhar 2011). For example, in flea market, it is the head (market) that has a clear
contribution to the overall meaning, whereas in couch potato it is the modifier (couch).
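A minimal sketch of this weighted additive combination (the vectors and β weights below are
placeholders, not values estimated in any of the cited works):

```python
import numpy as np

def additive_composition(vectors, betas):
    """Compositional vector of w1 ... wn as the weighted sum sum_i beta_i * v(w_i)."""
    return sum(b * v for b, v in zip(betas, vectors))

rng = np.random.default_rng(0)
v_flea, v_market = rng.random(300), rng.random(300)  # stand-ins for DSM word vectors
# asymmetric weights: the head (market) is assumed to contribute more than the modifier
v_flea_market = additive_composition([v_flea, v_market], betas=[0.3, 0.7])
```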

The additive model can be generalized to use a matrix of multiplicative coefficients,
which can be estimated through linear regression (Guevara 2011). This model can be

4 The task of determining whether a phrase is compositional is closely related to MWE discovery (Constant

et al. 2017), which aims to automatically extract MWE lists from corpora.


further modified to learn polynomial projections of higher degree, with quadratic pro-
jections yielding particularly promising results (Yazdani, Farahmand, and Henderson
2015). These models come with the caveat of being supervised, requiring some amount
of pre-annotated data in the target language. Because of these requirements, our study
focuses on unsupervised compositionality prediction methods only, based exclusively
on automatically POS-tagged and lemmatized monolingual corpora.

Alternatives to the additive model include the multiplicative model and its vari-
ants (Mitchell and Lapata 2008). However, results suggest that this representation
is inferior to the one obtained through the additive model (Reddy, McCarthy, and
Manandhar 2011; Salehi, Cook, and Baldwin 2015). Recent work on predicting intra-
compound semantics also supports that additive models tend to yield better results
than multiplicative models (Hartung et al. 2017).

The third ingredient is the measure of similarity between the compositionally
constructed vector and its actual corpus-based representation. Cosine similarity is the
most commonly used measure for compositionality prediction in the literature (Schone
and Jurafsky 2001; Reddy, McCarthy, and Manandhar 2011; Schulte im Walde, Müller,
and Roller 2013; Salehi, Cook, and Baldwin 2015). Alternatively, one can calculate the
overlap between the distributional neighbors of the whole phrase and those of the
component words (McCarthy, Keller, and Carroll 2003), or the number of single-word
distributional neighbors of the whole phrase (Riedl and Biemann 2015).

2.3 Nominal Compounds

Instead of covering compositionality prediction for MWEs in general, we focus on a
particular category of phenomena represented by nominal compounds. We define a
nominal compound as a syntactically well-formed and conventionalized noun phrase
containing two or more content words, whose head is a noun.5 They are convention-
alized (or institutionalized) in the sense that their particular realization is statistically
idiosyncratic, and their constituents cannot be replaced by synonyms (Sag et al. 2002;
Baldwin and Kim 2010; Farahmand, Smith, and Nivre 2015). Their semantic interpre-
tation may be straightforwardly compositional, with contributions from both elements
(e.g., climate change), partly compositional, with contribution mainly from one of the
elements (e.g., grandfather clock), or idiomatic (e.g., cloud nine) (Nakov 2013).

The syntactic realization of nominal compounds varies across languages. In English,
they are often expressed as a sequence of two nouns, with the second noun as the
syntactic head, modified by the first noun. This is the most frequently annotated POS-
tag pattern in the MWE-annotated DiMSUM English corpus (Schneider et al. 2016).
In French and Portuguese, they often assume the form of adjective–noun or noun–
adjective pairs, where the adjective modifies the noun. Examples of such constructions
include the adjective–noun compound FR petite annonce (lit. small announcement ‘classi-
fied ad’) and the noun–adjective compound PT buraco negro (lit. hole black ‘black hole’).6
Additionally, compounds may also involve prepositions linking the modifier with the
head, as in the case of FR cochon d’Inde (lit. pig of India ‘guinea pig’) and PT dente de leite
(lit. tooth of milk ‘milk tooth’). Because prepositions are highly polysemous and their
representation in DSMs is tricky, we do not include compounds containing prepositions

5 The terms noun compound and compound noun are usually reserved for nominal compounds formed by
sequences of nouns only, typical of Germanic languages but not frequent in Romance languages.

6 In this article, examples are preceded by their language codes: EN for English, FR for French, and PT for
Brazilian Portuguese. In the absence of a language code, English is implied.


in this article. Hence, we focus on 2-word nominal compounds of the form noun1–noun2
(in English), and noun–adjective and adjective–noun (in the three languages).

Regarding the meaning of nominal compounds, the implicit relation between the
components of compositional compounds can be described in terms of free paraphrases
involving verbs, such as flu virus as virus that causes/creates flu (Nakov 2008),7 or prepo-
sitions, such as olive oil as oil from olives (Lauer 1995). These implicit relations can often
be seen explicitly in the equivalent expressions in other languages (e.g., FR huile d’olive
and PT azeite de oliva for EN olive oil).

Alternatively, the meaning of compositional nominal compounds can be described
using a closed inventory of relations which make the role of the modifier explicit with
respect to the head noun, including syntactic tags such as subject and object, and seman-
tic tags such as instrument and location (Girju et al. 2005). The degree of compositionality
of a nominal compound can also be represented using numerical scores (Section 2.4)
to indicate to what extent the component words allow predicting the meaning of the
whole (Reddy, McCarthy, and Manandhar 2011; Roller, Schulte im Walde, and Scheible
2013; Salehi et al. 2015). The latter is the representation that we adopted in this article.

2.4 Numerical Compositionality Data Sets

The evaluation of compositionality prediction models can be performed extrinsically or
intrinsically. In extrinsic evaluation, compositionality information can be used to decide
how a compound should be treated in NLP systems such as machine translation or text
simplification. For instance, for machine translation, idiomatic compounds need to be
treated as atomic phrases, as current methods of morphological compound processing
cannot be applied to them (Stymne, Cancedda, and Ahrenberg 2013; Cap et al. 2015).

Although potentially interesting, extrinsic evaluation is not straightforward, as
results may be influenced both by the compositionality prediction model and by the
strategy for integration of compositionality information into the NLP system. Therefore,
most related work focuses on an intrinsic evaluation, where the compositionality scores
produced by a model are compared to a gold standard, usually a data set where
nominal compound semantics have been annotated manually. Intrinsic evaluation thus
requires the existence of data sets where each nominal compound has one (or several)
numerical scores associated with it, indicating its compositionality. Annotations can be
provided by expert linguist annotators or by crowdsourcing, often requiring that several
annotators judge the same compound to reduce the impact of subjectivity on the scores.
Relevant compositionality data sets of this type are listed below, some of which were
used in our experiments.

Reddy, McCarthy, and Manandhar (2011) collected judgments for a set of
90 English noun–noun (e.g., zebra crossing) and adjective–noun (e.g., sacred
cow) compounds, in terms of three numerical scores: the compositionality
of the compound as a whole and the literal contribution of each of its parts
individually, using a scale from 0 to 5. The data set was built through
crowdsourcing, and the final scores are the average of 30 judgments per
compound. This data set will be referred to as Reddy in our experiments.

7 Nakov (2008) also proposes a method for automatically extracting paraphrases from the web to classify
nominal compounds. This was extended in a SemEval 2013 task, where participants had to rank free
paraphrases according to the semantic relations in the compounds (Hendrickx et al. 2013).


Farahmand, Smith, and Nivre (2015) collected judgments for 1,042 English
noun–noun compounds. Each compound has binary judgments regarding
non-compositionality and conventionalization given by four expert
annotators (both native and non-native speakers). A hard threshold is
applied so that compounds are considered as noncompositional if at least
two annotators say so (Yazdani, Farahmand, and Henderson 2015), and
the total compositionality score is given by the sum of the four binary
judgments. This data set will be referred to as Farahmand in our
experiments.

Kruszewski and Baroni (2014) built the Norwegian Blue Parrot data set,
containing judgments for modifier-head phrases in English. The
judgments consider whether the phrase is (1) an instance of the concept
denoted by the head (e.g., dead parrot and parrot) and (2) a member of the
more general concept that includes the head (e.g., dead parrot and pet),
along with typicality ratings, with 5,849 judgments in total.

Roller, Schulte im Walde, and Scheible (2013) collected judgments for a set
of 244 German noun–noun compounds, each compound with an average
of around 30 judgments on a compositionality scale from 1 to 7, obtained
through crowdsourcing. The resource was later enriched with feature
norms (Roller and Schulte im Walde 2014).

Schulte im Walde et al. (2016) collected judgments for a set of 868 German
noun–noun compounds, including human judgments of compositionality
on a scale of 1 to 6. Compounds are judged by multiple annotators, and
the final compositionality score is the average across annotators. The data
set is also annotated for in-corpus frequency, productivity, and ambiguity,
and a subset of 180 compounds has been selected for balancing these
variables. The annotations were performed by the authors, linguists, and
through crowdsourcing. For the balanced subset of 180 compounds,
compositionality annotations were performed by experts only, excluding
the authors.

For a multilingual evaluation, in this work, we construct two data sets, one for
French and one for Portuguese compounds, and extend the Reddy data set for English
using the same protocol as Reddy, McCarthy, and Manandhar (2011).

3. Creation of a Multilingual Compositionality Data Set

In Section 3.1, we describe the construction of data sets of 180 compounds for French
(FR-comp) and Portuguese (PT-comp). For English, the complete data set contains 280
compounds, of which 190 are new and 90 come from the Reddy data set. We use 180
of these (EN-comp) for cross-lingual comparisons (90 from the original Reddy data set
combined with 90 new ones from EN-comp90), and 100 new compounds as held-out data
(EN-compExt), to evaluate the robustness of the results obtained (Section 6.3). These data
sets containing compositionality scores for 2–word nominal compounds are used to
evaluate our framework (Section 4), and we discuss their characteristics in Section 3.2.8

8 For English, only EN-comp90 and EN-compExt (90 and 100 new compounds, respectively) are considered.
Reddy (included in EN-comp) is analyzed in Reddy, McCarthy, and Manandhar (2011).


3.1 Data Collection

For each of the target languages, we collected, via crowdsourcing, a set of numerical
scores corresponding to the level of compositionality of the target nominal compounds.
We asked non-expert participants to judge each compound considering three sentences
where the compound occurred. After reading the sentences, participants assess the
degree to which the meaning of the compound is related to the meanings of its parts.
This follows from the assumption that a fully compositional compound will have an
interpretation whose meaning stems from both words (e.g., lime tree as a tree of limes),
while a fully idiomatic compound will have a meaning that is unrelated to its compo-
nents (e.g., nut case as an eccentric person).

Our work follows the protocol proposed by Reddy, McCarthy, and Manandhar
(2011), where compositionality is explained in terms of the literality of the individual
parts. This type of indirect annotation does not require expert linguistic knowledge,
and still provides reliable data, as we show later. For each language, data collection
involved four steps: compound selection, sentence selection, questionnaire design, and
data aggregation.

Compound Selection. For each data set, we manually selected nominal compounds from
dictionaries, corpus searches, and by linguistic introspection, maintaining an equal pro-
portion of compounds that are compositional, partly compositional, and idiomatic.9 We
considered them to be compositional if their semantics are related to both components
(e.g., benign tumor), partly compositional if their semantics are related to only one of
the components (e.g., grandfather clock), and idiomatic if they are not directly related to
either (e.g., old flame). This preclassification was used only to select a balanced set of
compounds and was not shown to the participants nor used at any later stage. For
all languages, all compounds are required to have a head that is unambiguously a
noun, and additionally for French and Portuguese, all compounds have an adjective
as modifier.

Sentence Selection. Compounds may be polysemous (e.g., FR bras droit may mean most
reliable helper or literally right arm). To avoid any potential sense uncertainty, each
compound was presented to the participants with the same sense in three sentences.
These sentences were manually selected from the WaC corpora: ukWaC (Baroni et al.
2009), frWaC, and brWaC (Boos, Prestes, and Villavicencio 2014), presented in detail in
Section 5.

Questionnaire Design. For each compound, after reading three sentences, participants are
asked to:

provide synonyms for the compound in these sentences. The synonyms
are used as additional validation of the quality of the judgments:
if unrelated words are provided, the answers are discarded.

assess the contribution of the head noun to the meaning of the compound
(e.g., is a busy bee always literally a bee?)

9 We have not attempted to select compounds that are translations of each other, as a compound in a given
language may be realized differently in the other languages.


assess the contribution of the modifier noun or adjective to the meaning of
the compound (e.g., is a busy bee always literally busy?)

assess the degree to which the compound can be seen as a combination of its
parts (e.g., is a busy bee always literally a bee that is busy?)

Participants answer the last three items using a Likert scale from 0 (idiomatic/non-
literal) to 5 (compositional/literal), following Reddy, McCarthy, and Manandhar (2011).
To qualify for the task, participants had to submit demographic information confirming
that they are native speakers, and to undergo training in the form of four example
questions with annotated answers in an external form (see Appendix C for details).

Data Aggregation. For English and French, we collected answers using Amazon Mechan-
ical Turk (AMT), manually removing answers that were not from native speakers or
where the synonyms provided were unrelated to the target compound sense. Because
AMT has few Brazilian Portuguese native speakers, we developed an in-house web inter-
face for the questionnaire, which was sent out to Portuguese-speaking NLP mailing
lists.

For a given compound and question, we calculate aggregated scores as the arith-
metic averages of all answers across participants. We will refer to these averaged
scores as human compositionality scores (hc). We average the answers to the three
questions independently, generating three scores: hcH for the head noun, hcM for the
modifier, and hcHM for the whole compound. In our framework, we try to predict hcHM
automatically (Section 5). To assess the variability of the answers (Section 3.2.1), we also
calculate the standard deviation across participants for each question (σH, σM, and σHM).
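A small sketch of this aggregation step (hypothetical field names; each answer holds one
participant's 0–5 ratings for the head, modifier, and whole-compound questions):

```python
import statistics

def aggregate(answers):
    """answers: list of dicts such as {"H": 4, "M": 2, "HM": 3}, one per participant.
    Returns averaged hc scores and per-question standard deviations."""
    result = {}
    for q in ("H", "M", "HM"):
        ratings = [a[q] for a in answers]
        result["hc" + q] = statistics.mean(ratings)
        # sample standard deviation; the article does not specify which variant was used
        result["sigma" + q] = statistics.stdev(ratings)
    return result

print(aggregate([{"H": 5, "M": 1, "HM": 2},
                 {"H": 4, "M": 0, "HM": 1},
                 {"H": 5, "M": 2, "HM": 2}]))
```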
The list of compounds, their translations, glosses, and compositionality scores are
given in Appendices D (EN-comp90 and EN-compExt), E (FR-comp), and F (PT-comp).10

3.2 Data Set Analysis

In this section, we present different measures of agreement among participants (Sec-
tion 3.2.1) and examine possible correlations between compositionality scores, familiar-
ity, and conventionalization (Section 3.2.2) in the data sets created for this article.

3.2.1 Measuring Data Set Quality. To assess the quality of the collected human composi-
tionality scores, we use standard deviation and inter-annotator agreement scores.

Standard Deviation (σ̄ and Pσ>1.5). The standard deviation (σ) of the participants’ an-
swers can be used as an indication of their agreement: for each compound and for each
of the three questions, small σ values suggest greater agreement. In addition, if the
instructions are clear, σ can also be seen as an indication of the level of difficulty of
the task. In other words, all other things being equal, compounds with larger σ can
be considered intrinsically harder to analyze by the participants. For each data set, we
consider two aggregated metrics based on σ (computed as in the sketch below):

σ̄ — The average of σ in the data set.

Pσ>1.5 — The proportion of compounds whose σ is higher than 1.5.
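A minimal sketch of these two data set-level metrics, given the per-compound standard
deviations (toy values):

```python
def sigma_metrics(sigmas, threshold=1.5):
    """sigmas: per-compound standard deviations for one question (H, M, or HM).
    Returns the data set average and the proportion of compounds above the threshold."""
    avg_sigma = sum(sigmas) / len(sigmas)
    p_high = sum(1 for s in sigmas if s > threshold) / len(sigmas)
    return avg_sigma, p_high

avg_sigma, p_high = sigma_metrics([1.1, 0.4, 1.8, 1.6, 0.9])
print(f"average sigma = {avg_sigma:.2f}, P(sigma > 1.5) = {p_high:.0%}")
```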

10 Freely available at: http://pageperso.lis-lab.fr/~carlos.ramisch/?page=downloads/compounds


Table 1
Average number of answers per compound n, average standard deviation σ, and proportion of high
standard deviation Pσ>1.5, for the compound (HM), head (H), and modifier (M).

Data set      n      σHM    σH     σM     PσHM>1.5   PσH>1.5   PσM>1.5
FR-comp       14.9   1.15   1.08   1.21   22.78%     24.44%    30.56%
PT-comp       31.8   1.22   1.09   1.20   14.44%     17.22%    19.44%
EN-comp90     18.8   1.17   1.05   1.18   18.89%     16.67%    27.78%
EN-compExt    22.6   1.21   1.27   1.16   17.00%     29.00%    18.00%

Reddy         28.4   0.99   0.94   0.89    5.56%     11.11%     8.89%

Table 1 presents the result of these metrics when applied to our in-house data sets,
as well as to the original Reddy data set. The column n indicates the average number of
answers per compound, while the other six columns present the values of σ and Pσ>1.5
for compound (HM), head-only (H), and modifier-only (M) scores.

These values are below what would be expected for random decisions (σrand ≈ 1.71, the
standard deviation of uniformly random answers on the 0–5 Likert scale). Although our
data sets exhibit higher variability than Reddy,
this may be partly due to the application of filters done by Reddy, McCarthy, and
Manandhar (2011) to remove outliers.11 These values could also be due to the collec-
tion of fewer answers per compound for some of the data sets. However, there is no
clear tendency in the variation of the standard deviation of the answers and the num-
ber of participants n. The values of σ are quite homogeneous, ranging from 1.05 for
EN-comp90 (head) to 1.27 for EN-compExt (head). The low agreement for modifiers may be
related to a greater variability in semantic relations between modifiers and compounds:
these include material (e.g., brass ring), attribute (e.g., black cherry), and time (e.g., night
owl).

Figure 1(a) shows standard deviation (σHM, σH, and σM) for each compound of
FR-comp as a function of its average compound score hcHM.12 For all three languages,
greater agreement was found for compounds at the extremes of the compositionality
scale (fully compositional or fully idiomatic) for all scores. These findings can be partly
explained by end-of-scale effects, that result in greater variability for the intermedi-
ate scores in the Likert scale (from 1 to 4) that correspond to the partly composi-
tional cases. Hence, we expect that it will be easier to predict the compositionality of
idiomatic/compositional compounds than of partly compositional ones.

Inter-Annotator Agreement (α). To measure inter-annotator agreement of multiple partici-
pants, taking into account the distance between the ordinal ratings of the Likert scale, we
adopt the α score (Artstein and Poesio 2008). The α score is more appropriate for ordinal
data than traditional agreement scores for categorical data, such as Cohen’s and Fleiss’
κ (Cohen 1960; Fleiss and Cohen 1973). However, due to the use of crowdsourcing,
most participants rated only a small number of compounds with very limited chance
of overlap among them: the average number of answers per participant is 13.6 for
EN-comp90, 10.2 for EN-compExt, 33.7 for FR-comp, and 53.5 for PT-comp. Because the

11 Participants with negative correlation with the mean, and answers farther than ±1.5 from the mean.
12 Only FR-comp is shown as the other data sets display similar patterns.


Figure 1
Left: Standard deviations (σH, σM, and σHM) as a function of hcHM in FR-comp. Right: Average
compositionality (hcH, hcM, and hcHM) as a function of hcHM in FR-comp.

α score assumes that each participant rates all the items, we focus on the answers
provided by three of the participants, who rated the whole set of 180 compounds in
PT-comp.

Using a linear distance schema between the answers,13 we obtain an agreement of
α = .58 for head-only, α = .44 for modifier-only, and α = .44 for the whole compound.
To further assess the difficulty of this task, we also calculate α for a single expert
annotator, judging the same set of compounds after an interval of one month. The scores
were α = .69 for the head and α = .59 for both the compound and for the modifier. The
Spearman correlation between these two annotations performed by the same expert
is ρ = 0.77 for hcHM. This can be seen as a qualitative upper bound for automatic
compositionality prediction on PT-comp.

3.2.2 Compositionality, Familiarity, and Conventionalization. Figure 1(b) shows the average
scores (hcHM, hcH, and hcM) for the compounds ranked according to the average com-
pound score hcHM. Although this figure is for FR-comp, similar patterns were found for
the other data sets. For all three languages, the human compositionality scores provide
additional confirmation that the data sets are balanced, with the compound scores
(hcHM) being distributed linearly along the scale. Furthermore, we have calculated the
average hcHM values separately for the compounds in each of the three compositionality
classes used for compound selection: idiomatic, partly compositional and compositional
(Section 3.1). These averages are, respectively, 1.0, 2.4, and 4.0 for EN-comp90; 1.1, 2.4,
and 4.2 for EN-compExt; 1.3, 2.7, and 4.3 for FR-comp; and 1.3, 2.5, and 3.9 for PT-comp,
indicating that our attempt to select a balanced number of compounds from each class
is visible in the collected hcHM scores.

Additionally, the human scores also suggest an asymmetric impact of the non-literal
parts over the compound: whenever participants judged an element of the compound
as non-literal, the whole compound was also rated as idiomatic. Thus, most head and
modifier scores (hcH and hcM) are close to or above the diagonal line in Figure 1(b).
In other words, a component of the compound is seldom rated as less literal than the
compositionality of the whole compound hcHM, although the opposite is more common.

13 A disagreement between answers a and b is weighted |a − b|.


Figure 2
Relation between hcH ⊗ hcM and hcHM in FR-comp, using arithmetic and geometric means.

Table 2
Spearman ρ correlation between compositionality, frequency, and PMI for the data sets in the
three languages.

Data set      frequency            PMI
FR-comp        0.598 (p < 10−18)    0.109 (p > 0.1)
PT-comp        0.164 (p > 0.01)     0.076 (p > 0.1)
EN-comp90      0.305 (p < 10−2)    −0.024 (p > 0.1)
EN-compExt     0.384 (p < 10−5)     0.138 (p > 0.1)

To evaluate if it is possible to predict hcHM from the hcH and hcM, we calculate the
arithmetic and geometric means between hcH and hcM for each compound. Figure 2
shows the linear regression of both measures for FR-comp. The goodness of fit is
r²arith = .93 for the arithmetic mean, and r²geom = .96 for the geometric mean, confirming
that they are good predictors of hcHM.14 Thus, we assume that hcHM summarizes hcH
and hcM, and focus on predicting hcHM instead of hcH and hcM separately. These find-
ings also inspired the pcarith and pcgeom compositionality prediction functions (Section 4).
To examine whether there is an effect of the familiarity of a compound on hc
scores, in particular if more idiomatic compounds need to be more familiar, we also
calculated the correlation between the compositionality score for a compound hcHM
and its frequency in a corpus, as a proxy for familiarity. In this case we used the WaC
corpora and calculated the frequencies based on the lemmas. The results, in Table 2,
show a statistically significant positive Spearman correlation of ρ = 0.305 for EN-comp90,
ρ = 0.384 for EN-compExt, and ρ = 0.598 for FR-comp, indicating that, contrary to our
expectations, compounds that are more frequent tend to be assigned higher composi-
tionality scores. However, frequency alone is not enough to predict compositionality,
and further investigation is needed to determine if compositionality and frequency
are also correlated with other factors.
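This kind of correlation analysis is straightforward to reproduce with SciPy; the values below
are toy placeholders, not the actual data set figures:

```python
from scipy.stats import spearmanr

# hypothetical per-compound values: human compositionality score and corpus frequency
hc_hm = [4.3, 1.2, 2.5, 4.8, 0.7, 3.1]
frequency = [15200, 830, 2100, 22100, 640, 5400]

rho, p_value = spearmanr(hc_hm, frequency)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```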


14 r²arith and r²geom are .91 and .96 in PT-comp, .90 and .96 in EN-comp90, and .92 and .95 in EN-compExt.


We also analyzed the correlation between compositionality and conventionaliza-
tion to determine if more idiomatic compounds correspond to more conventionalized
ones. We use PMI (Church and Hanks 1990) as a measure of conventionalization, as it
indicates the strength of association between the components (Farahmand, Smith, and
Nivre 2015). We found no statistically significant correlation between compositionality
and PMI.

4. Compositionality Prediction Framework

We propose a compositionality prediction framework15 including the following ele-
ments: a DSM, created from corpora using existing state-of-the-art models that gen-
erate corpus-derived vectors16 for compounds w1w2 and for their components w1 and
w2; a composition function; and a set of predicted compositionality scores (pc). The
framework, shown in Figure 3, is evaluated by measuring the correlation between
the scores predicted by the models (pc) and the human compositionality scores (hc)
for the list of compounds in our data sets (Section 3). The predicted compositionality
scores are obtained from the cosine similarity between the corpus-derived vector of the
compound, v(w1w2), and the compositionally constructed vector, vβ(w1, w2):

pcβ(w1w2) = cos( v(w1w2), vβ(w1, w2) ).

For vβ(w1, w2), we use the additive model (Mitchell and Lapata 2008), in which the
composition function is a weighted linear combination:

vβ(w1, w2) = β · v(whead)/||v(whead)|| + (1 − β) · v(wmod)/||v(wmod)||,

where whead (or wmod) indicates the head (or modifier) of the compound w1w2, || · || is the
Euclidean norm, and β ∈ [0, 1] is a parameter that controls the relative importance of
the head to the compound’s compositionally constructed vector. The normalization of
both vectors allows taking only their directions into account, regardless of their norms,
which are usually proportional to their frequency and irrelevant to meaning.
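A minimal numpy sketch of pcβ, assuming the three corpus-derived vectors have already been
produced by a DSM (an illustration only, not the mwetoolkit implementation mentioned in
footnote 15):

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)

def composed_vector(v_head, v_mod, beta):
    """v_beta(w1, w2) = beta * v(w_head)/||v(w_head)|| + (1 - beta) * v(w_mod)/||v(w_mod)||"""
    return beta * unit(v_head) + (1 - beta) * unit(v_mod)

def pc_beta(v_compound, v_head, v_mod, beta):
    """Predicted compositionality: cosine between the corpus-derived compound vector
    and the compositionally constructed vector."""
    composed = composed_vector(v_head, v_mod, beta)
    return float(np.dot(unit(v_compound), unit(composed)))
```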

We define six compositionality scores based on pcβ. Three of them, pchead(w1w2),
pcmod(w1w2), and pcuniform(w1w2), correspond to different assumptions about how we
model compositionality: whether it depends on the head (β = 1, e.g., crocodile tears), on
the modifier (β = 0, e.g., busy bee), or in equal measure on the head and modifier
(β = 1/2, e.g., graduate student). The fourth score is based on the assumption that
compositionality may be distributed differently between head and modifier for different
compounds. We implement this idea by setting individually for each compound the

15 Implemented as feat compositionality.py in the mwetoolkit: http://mwetoolkit.sf.net.
16 Except when explicitly indicated, the term vector refers to corpus-derived vectors output by DSMs.


Figure 3
Schema of a compositionality prediction configuration based on a composition function. Thick
arrows indicate corpus-based vectors of two-word compounds treated as a single token. The
schema also covers the evaluation of the compositionality prediction configuration (top right).

value for β that yields maximal similarity in the predicted compositionality score,
that is:17

pcmaxsim(w1w2) = max0≤β≤1 pcβ(w1w2)

Two other scores are not based on the additive model and do not require a compo-
sition function. Instead, they are based on the intuitive notion that compositionality is
related to the average similarity between the compound and its components:

pcavg(w1w2) = avg(pchead(w1w2), pcmod(w1w2))

We test two possibilities: the arithmetic mean pcarith(w1w2) considers that composition-
ality is linearly related to the similarity of each component of the compound, whereas
the geometric mean pcgeom(w1w2) reflects the tendency found in human annotations to
assign compound scores hcHM closer to the lowest score between that for the head hcH
and for the modifier hcM (Section 3.2).
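The six scores can be sketched together as follows (self-contained, so the small helpers from the
previous sketch are repeated; the closed-form β implements footnote 17, and clipping β to [0, 1]
and clamping negative cosines before the geometric mean are assumptions made for the sketch,
not choices stated in the article):

```python
import math
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def cosine(u, v):
    return float(np.dot(unit(u), unit(v)))

def pc_b(v_comp, v_head, v_mod, beta):
    return cosine(v_comp, beta * unit(v_head) + (1 - beta) * unit(v_mod))

def pc_scores(v_comp, v_head, v_mod):
    a = cosine(v_comp, v_head)   # cos(w1w2, w_head) = pc_head
    b = cosine(v_comp, v_mod)    # cos(w1w2, w_mod)  = pc_mod
    c = cosine(v_head, v_mod)    # cos(w_head, w_mod)
    # closed-form beta maximizing pc_beta (footnote 17), clipped to [0, 1];
    # no guard against degenerate cases such as c = 1
    beta_star = min(max((a - b * c) / ((a + b) * (1 - c)), 0.0), 1.0)
    return {
        "pc_head": a,
        "pc_mod": b,
        "pc_uniform": pc_b(v_comp, v_head, v_mod, 0.5),
        "pc_maxsim": pc_b(v_comp, v_head, v_mod, beta_star),
        "pc_arith": (a + b) / 2,
        "pc_geom": math.sqrt(max(a, 0.0) * max(b, 0.0)),
    }
```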

5. Experimental Setup

This section describes the common setup used for evaluating compositionality pre-
diction, such as corpora (Section 5.1), DSMs (Section 5.2), and evaluation metrics
(Section 5.3).

17 In practice, for the special case of two words, we do not need to perform parameter search for β, which
has a closed form obtained by solving the equation ∂pcβ(w1w2)/∂β = 0:
β = [cos(w1w2, w1) − cos(w1w2, w2) · cos(w1, w2)] / [(cos(w1w2, w1) + cos(w1w2, w2)) · (1 − cos(w1, w2))].


5.1 Corpora

In this work we used the lemmatized and POS-tagged versions of the WaC corpora not
only for building DSMs, but also as sources of information about the target compounds
for the analyses performed (e.g., in Sections 3.2.2, 9.1, and 9.2):

for English, the ukWaC (Baroni et al. 2009), with 2.25 billion tokens, parsed
with MaltParser (Nivre, Hall, and Nilsson 2006);

for French, the frWaC with 1.61 billion tokens preprocessed with
TreeTagger (Schmid 1995); and

for Brazilian Portuguese, a combination of brWaC (Boos, Prestes, and
Villavicencio 2014), Corpus Brasileiro,18 and all Wikipedia entries,19 with a
total of 1.91 billion tokens, all parsed with PALAVRAS (Bick 2000).

For all compounds contained in our data sets, we transformed their occur-
rences into single tokens by joining their component words with an underscore (e.g.,
EN monkey business → monkey_business and FR belle-mère → belle_mère).20,21 To han-
dle POS-tagging and lemmatization irregularities, we retagged the compounds’ com-
ponents using the gold POS and lemma in our data sets (e.g., for EN sitting duck,
sit/verb duck/noun → sitting/adjective duck/noun). We also simplified all POS tags using
coarse-grained labels (e.g., verb instead of vvz). All forms are then lowercased (surface
forms, lemmas, and POS tags); and noisy tokens, with special characters, numbers, or
punctuation, are removed. Additionally, ligatures are normalized for French (e.g., œ →
oe) and a spellchecker22 is applied to normalize words across English spelling variants
(e.g., color → colour).
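A simplified sketch of this retokenization (regex-based and over raw strings; the actual pipeline
operates on POS-tagged, lemmatized corpora):

```python
import re

def join_compounds(sentence, compounds):
    """Rewrite each known compound as a single token, e.g. 'monkey business' -> 'monkey_business'."""
    for compound in compounds:
        parts = re.split(r"[ -]", compound)
        # match the compound as whole words, allowing a space or hyphen between its parts
        pattern = r"\b" + r"[ -]".join(map(re.escape, parts)) + r"\b"
        sentence = re.sub(pattern, "_".join(parts), sentence)
    return sentence

print(join_compounds("that was pure monkey business", ["monkey business"]))
# -> that was pure monkey_business
print(join_compounds("sa belle-mère est arrivée", ["belle-mère"]))
# -> sa belle_mère est arrivée
```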

To evaluate the influence of preprocessing on compositionality prediction (Sec-
tion 7.3), we generated four versions of each corpus, with different levels of linguistic
information. We expect lemmatization to reduce data sparseness by merging morpho-
logically inflected variants of the same lemma:

1. surface+: the original raw corpus with no preprocessing, containing
surface forms.

2. surface: stopword removal, generating a corpus of surface forms of content
words.

3. lemmaPoS: stopword removal, lemmatization,23 and POS-tagging;
generating a corpus of content words distinguished by POS tags,
represented as lemma/POS-tag.

4. lemma: stopword removal and lemmatization; generating a corpus
containing only lemmas of content words.

18 http://corpusbrasileiro.pucsp.br/cb/Inicial.html
19 Wikipedia articles downloaded in June 2016.
20 Hyphenated compounds are also re-tokenized with an underscore separator.
21 Therefore, in Section 5.2, the terms target/context words may actually refer to compounds.
22 https://hunspell.github.io
23 In the lemmatized corpora, the lemmas of proper names are replaced by placeholders.


5.2 DSMs

In this section, we describe the state-of-the-art DSMs used for compositionality
prediction.

Positive Pointwise Mutual Information (PPMI). In the models based on the PPMI matrix,
the representation of a target word is a vector containing the PPMI association scores
between the target and its contexts (Bullinaria and Levy 2012). The contexts are nouns
and verbs, selected in a symmetric sliding window of W words to the left/right and
weighted linearly according to their distance D to the target (Levy, Goldberg, and Dagan
2015).24 We consider three models that differ in how the contexts are selected:

In PPMI–thresh, the vectors are |V|-dimensional but only the top d contexts
with highest PPMI scores for each target word are kept, while the others
are set to zero (Padró et al. 2014a).25 A sketch of this selection strategy and of
PPMI–TopK appears after this list.

In PPMI–TopK, the vectors are d-dimensional, and each of the d
dimensions corresponds to a context word taken from a fixed list of k
contexts, identical for all target words. We chose k as the 1,000 most
frequent words in the corpus after removing the top 50 most frequent
words (Salehi, Cook, and Baldwin 2015).

In PPMI–SVD, singular value decomposition is used to factorize the PPMI
matrix and reduce its dimensionality from |V| to d.26 We set the value of
the context distribution smoothing factor to 0.75, and the negative
sampling factor to 5 (Levy, Goldberg, and Dagan 2015). We use the default
minimum word count threshold of 5.
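As announced above, a small sketch contrasting the two sparse context-selection strategies,
assuming sparse {context: PPMI} vectors such as those in the Section 2.1 sketch (parameter
values are illustrative):

```python
def ppmi_thresh(vector, d=750):
    """PPMI-thresh: keep only the d contexts with highest PPMI for this target."""
    top = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:d]
    return dict(top)

def ppmi_topk(vector, context_list):
    """PPMI-TopK: represent the target over a fixed context list shared by all targets."""
    return [vector.get(c, 0.0) for c in context_list]

vector = {"red": 3.2, "drink": 2.1, "case": 0.4, "glass": 1.7}  # one target's sparse vector
print(ppmi_thresh(vector, d=2))           # {'red': 3.2, 'drink': 2.1}
print(ppmi_topk(vector, ["red", "oil"]))  # [3.2, 0.0]
```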


Word2vec (w2v). Word2vec27 relies on a neural network to predict target/context pairs
(Mikolov et al. 2013). We use its two variants: continuous bag-of-words (w2v–cbow)
and skip-gram (w2v–sg). We adopt the default configurations recommended in the
documentation, except for: no hierarchical softmax, 25 negative samples, frequent-word
down-sampling rate of 10−6, execution of 15 training iterations, and minimum word
count threshold of 5.
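The article uses the original word2vec tool; purely as an illustration, roughly equivalent settings
could be written with the gensim re-implementation (this mapping is an assumption, and the
tiny corpus below is a placeholder for the preprocessed WaC corpora):

```python
from gensim.models import Word2Vec  # assumes gensim >= 4

# placeholder corpus: in practice, an iterator over tokenized sentences of the
# preprocessed corpus (lemma, lemmaPoS, surface, or surface+)
sentences = [["red_wine", "be", "serve", "with", "cheese"],
             ["nut_case", "behave", "strangely"]] * 50

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram (sg=0 gives continuous bag-of-words)
    hs=0,             # no hierarchical softmax
    negative=25,      # 25 negative samples
    sample=1e-6,      # frequent-word down-sampling rate
    epochs=15,        # 15 training iterations
    min_count=5,      # minimum word count threshold
    vector_size=500,  # one of the DIMENSION values in Table 3
    window=4,         # one of the WINDOWSIZE values in Table 3
)
vector = model.wv["red_wine"]  # corpus-derived vector for a joined compound token
```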

Global Vectors (glove). GloVe28 implements a factorization of the logarithm of the posi-
tional co-occurrence count matrix (Pennington, Socher, and Manning 2014). We adopt
the default configurations from the documentation, except for: internal cutoff parameter
xmax = 75 and processing of the corpus in 15 iterations. For the corpora versions lemma
and lemmaPoS (Section 5.1), we use the minimum word count threshold of 5. For surface
and surface+, due to the larger vocabulary sizes, we use thresholds of 15 and 20.29


24 In previous work adjectives and adverbs were also included as contexts, but the results obtained with

only verbs and nouns were better (Padró et al. 2014a).

25 Vectors still have |V| dimensions but we use d as a shortcut to represent the fact that we only retain the

most relevant target-context pairs for each target word.

26 https://bitbucket.org/omerlevy/hyperwords
27 https://code.google.com/archive/p/word2vec/
28 https://nlp.stanford.edu/projects/glove/
29 Thresholds were selected so as to not use more than 128 GB of RAM.

Table 3
Summary of DSMs, their parameters, and evaluated parameter values. The combination of
these DSMs and their parameter values leads to 228 DSM configurations evaluated per language
(1 × 1 × 4 × 3 = 12 for PPMI–TopK, plus 6 × 3 × 4 × 3 = 216 for the other models).

DSM                                DIMENSION            WORDFORM                              WINDOWSIZE

PPMI–TopK                          d = 1000             surface+, surface, lemma, lemmaPoS    W = 1+1, W = 4+4, W = 8+8

PPMI–thresh, PPMI–SVD, w2v–cbow,   d = 250, d = 500,    surface+, surface, lemma, lemmaPoS    W = 1+1, W = 4+4, W = 8+8
w2v–sg, glove, lexvec              d = 750

Lexical Vectors (lexvec). The LexVec model30 factorizes the PPMI matrix in a way that
penalizes errors on frequent words (Salle, Villavicencio, and Idiart 2016). We adopt
the default configurations in the documentation, except for: 25 negative samples,
sub-sampling rate of 10⁻⁶, and processing of the corpus in 15 iterations. Due to the
vocabulary sizes, we use a word count threshold of 10 for lemma and lemmaPoS, and 100 for
surface and surface+.31

5.2.1 DSM Parameters. In addition to model-specific parameters, the DSMs described
above have some shared DSM parameters. We construct multiple DSM configurations
by varying the values of these parameters. These combinations produce a total of
228 DSMs per language (see Table 3). In particular, we evaluate the influence of the
following parameters on compositionality prediction:

• WINDOWSIZE: Number of context words to the left/right of the target
word when searching for target-context co-occurrence pairs. The
assumption is that larger windows are better for capturing semantic
relations (Jurafsky and Martin 2009) and may be more suitable for
compositionality prediction. We use window sizes of 1+1, 4+4, and 8+8.32

• DIMENSION: Number of dimensions of each vector. The underlying
hypothesis is that the higher the number of dimensions, the more accurate
the representation of the context is going to be. We evaluate our
framework with vectors of 250, 500, and 750 dimensions.

• WORDFORM: One of the four word-form and stopword removal variants
used to represent a corpus, described in Section 5.1: surface+, surface, lemma, and
lemmaPoS. They represent different levels of specificity in the informational
content of the tokens, and may have a language-dependent impact on the
performance of compositionality prediction.
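
To make the size of this search space explicit, the snippet below enumerates the 228
configurations per language summarized in Table 3; the list and tuple layout are
illustrative, not the framework's actual data structures.

    # Sketch: enumerating the 228 DSM configurations per language described in Table 3.
    from itertools import product

    WORDFORMS = ["surface+", "surface", "lemma", "lemmaPoS"]
    WINDOWS = ["1+1", "4+4", "8+8"]
    DIMENSIONS = [250, 500, 750]
    OTHER_DSMS = ["PPMI-thresh", "PPMI-SVD", "w2v-cbow", "w2v-sg", "glove", "lexvec"]

    # PPMI-TopK uses a single, fixed dimension of 1000.
    configs = [("PPMI-TopK", 1000, wf, win) for wf, win in product(WORDFORMS, WINDOWS)]
    # The remaining six DSMs vary over all three dimensions.
    configs += list(product(OTHER_DSMS, DIMENSIONS, WORDFORMS, WINDOWS))

    assert len(configs) == 12 + 216 == 228  # 1*1*4*3 + 6*3*4*3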

30 https://github.com/alexandres/lexvec
31 This is in line with the authors’ threshold suggestions (Salle, Villavicencio, and Idiart 2016).
32 Common window sizes are between 1+1 and 10+10, but a few works adopt larger sizes like 16+16 or

20+20 (Kiela and Clark 2014; Lapesa and Evert 2014).


5.3 Evaluation Metrics

To evaluate a compositionality prediction configuration, we calculate Spearman's ρ
rank correlation between the predicted compositionality scores (pc) and the human
compositionality scores (hc) for the compounds that appear in the evaluation data
set. We mostly use rank correlation instead of linear (Pearson) correlation because
we are interested in the framework's ability to order compounds from least to most
compositional, regardless of the actual predicted values.

For English, besides the evaluation data sets presented in Section 3, we also use
Reddy and Farahmand (see Section 2.4) to enable comparison with related work. Since
Farahmand contains binary judgments33 instead of graded compositionality scores, its
results are reported using the best F1 (BF1) score: the highest F1 score obtained when
the top n compounds are classified as noncompositional, for varying values of n
(Yazdani, Farahmand, and Henderson 2015). For Reddy, we sometimes present Pearson
scores to enable comparison with related work.

Because of the large number of compositionality prediction configurations evaluated,
we report only the best performance for each configuration over all possible
DSM parameter values. The generalization of these analyses is then ensured using
cross-validation and held-out data. To determine whether the difference between two
prediction results is statistically significant, we use the nonparametric Wilcoxon
sign-rank test.
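
Both evaluation measures are straightforward to compute. The sketch below is a minimal
illustration using scipy and scikit-learn; the function names are ours, and the exhaustive
sweep over n is a simple, unoptimized way of obtaining BF1.

    # Sketch of the two evaluation measures: Spearman's rho for graded scores and
    # best-F1 (BF1) for binary (non)compositionality judgments.
    from scipy.stats import spearmanr
    from sklearn.metrics import f1_score

    def spearman_eval(predicted, human):
        """Rank correlation between predicted (pc) and human (hc) scores."""
        return spearmanr(predicted, human).correlation

    def best_f1(predicted, is_noncompositional):
        """Highest F1 over all cutoffs n: the n lowest-scored compounds are
        labeled noncompositional, the rest compositional."""
        order = sorted(range(len(predicted)), key=lambda i: predicted[i])
        best = 0.0
        for n in range(1, len(order) + 1):
            flagged = set(order[:n])
            pred_labels = [i in flagged for i in range(len(predicted))]
            best = max(best, f1_score(is_noncompositional, pred_labels))
        return best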

6. Overall Results

In this section, we present the overall results obtained on the Reddy, Farahmand,
EN-comp, FR-comp, and PT-comp data sets, comparing all possible configurations
(Section 6.1). To determine their robustness, we also report evaluations for all
languages using cross-validation (Section 6.2) and, for English, using the held-out
data set EN-compExt (Section 6.3). All results reported in this section use the
pcuniform function.

6.1 Distributional Semantic Models

Table 4 shows the highest overall values obtained for each DSM (columns) on each
data set (rows). For English (Reddy, EN-comp, and Farahmand), the highest results for
the compounds found in the corpus were obtained with w2v and PPMI–thresh, shown
as the first value in each pair in Table 4. Not all compounds in the English data sets are
present in our corpus. Therefore, we also report results adopting a fallback strategy (the
second value). Because its impact depends on the data set, and the relative performance
of the models is similar with or without it, for the remainder of the article we discuss
only the results without fallback.34

The best w2v–cbow and w2v–sg configurations are not significantly different from
each other, but both are different from PPMI–thresh (p < 0.05). In a direct comparison
with related work, our best result for the Reddy data set (Spearman ρ = .812, Pearson
r = .814) improves upon the best correlation reported by Reddy, McCarthy, and Manandhar
(2011) (ρ = .714), and by Salehi, Cook, and Baldwin (2015) (r = .796). For Farahmand,
these results are comparable to those reported by Yazdani, Farahmand, and Henderson
(2015) (BF1 = .487), but our work adopts an unsupervised approach for compositionality
prediction.

For both FR-comp and PT-comp, the w2v models are outperformed by PPMI–thresh, whose
predictions are significantly different from those of the other models (p < 0.05).

In short, these results suggest language-dependent trends for DSMs, by which w2v models
perform better for the English data sets, and PPMI–thresh for French and Portuguese.
While this may be due to the level of morphological inflection in these languages, it may
also be due to differences in corpus size or to particular DSM parameters used in each
case. In Section 7, we analyze the impact of individual DSM and corpus parameters to
better understand this language dependency.

33 A compound is considered as noncompositional if at least 2 out of 4 annotators annotate it as
noncompositional.
34 This refers to 5 out of 180 in EN-comp and 129 out of 1,042 in Farahmand. For these, the fallback
strategy assigns the average compositionality score (Salehi, Cook, and Baldwin 2015). Although fallback
produces slightly better results for EN-comp, it does the opposite for Farahmand, which contains a larger
proportion of missing compounds (2.8% vs. 12.4%).

Table 4
Highest results for each DSM, using BF1 for the Farahmand data set, Pearson r for Reddy (r), and
Spearman ρ for all the other data sets. For English, in each pair of values, the first is for the
compounds found in the corpus, and the second uses fallback for missing compounds.

Data set    PPMI–SVD    PPMI–TopK   PPMI–thresh  glove       lexvec      w2v–cbow    w2v–sg
Farahmand   .487/.424   .435/.376   .472/.404    .400/.358   .449/.431   .512/.471   .507/.468
Reddy (r)   .738/.726   .732/.717   .762/.768    .783/.787   .787/.787   .803/.798   .814/.814
Reddy (ρ)   .743/.743   .706/.716   .791/.803    .754/.759   .774/.773   .796/.796   .812/.812
EN-comp     .655/.666   .624/.632   .688/.704    .638/.651   .646/.658   .716/.730   .726/.741
FR-comp     .584        .550        .702         .680        .677        .652        .653
PT-comp     .530        .519        .602         .555        .570        .588        .586

6.2 Cross-Validation

Table 4 reports the best configurations for the EN-comp, FR-comp, and PT-comp data sets.
However, to determine whether the Spearman scores obtained are robust and generalizable,
in this section we report an evaluation using cross-validation. For each data set, we
partition the 180 compounds into 5 folds of 36 compounds (f1, f2, ..., f5). Then, for each
fold fi, we exhaustively look for the best configuration (values of WINDOWSIZE, DIMENSION,
and WORDFORM) on the union of the other folds (∪j≠i fj), and predict the 36 compositionality
scores for fi using this configuration. The predicted scores for the 5 folds are then
grouped into a single set of predictions, which is evaluated against the 180 human
judgments.
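
The fold-wise selection just described can be sketched as follows; configs, predict, and
spearman are assumed placeholders for the set of DSM configurations, a function returning
predicted scores, and the rank-correlation measure, so this is an outline of the protocol
rather than the released implementation.

    # Sketch of the 5-fold protocol: for each fold, select the configuration with the
    # highest Spearman correlation on the other folds, then predict the held-out fold.
    import random

    def cross_validate(compounds, gold, configs, predict, spearman, k=5, seed=0):
        idx = list(range(len(compounds)))
        random.Random(seed).shuffle(idx)              # random fold partition
        folds = [idx[i::k] for i in range(k)]         # 5 folds of 36 for 180 compounds
        predictions = {}
        for i, test in enumerate(folds):
            train = [j for n, f in enumerate(folds) if n != i for j in f]
            best = max(configs, key=lambda c: spearman(
                predict(c, [compounds[j] for j in train]),
                [gold[j] for j in train]))
            scores = predict(best, [compounds[j] for j in test])
            predictions.update(zip(test, scores))
        order = sorted(predictions)
        return spearman([predictions[j] for j in order], [gold[j] for j in order])
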
The partition of compounds into folds is performed automatically, based on random
shuffling.35 To avoid relying on a single arbitrary fold partition, we run cross-validation
10 times, with different fold partitions each time. This process generates 10 Spearman
correlations, for which we calculate the average value and a 95% confidence interval.

35 We have also considered separating folds so as to be balanced regarding their compositionality
scores. The results were similar to the ones reported here.

Figure 4
Results with highest Spearman for oracle and cross-validation, the latter with a confidence
interval of 95%; (a) top left: overall Spearman correlations per DSM and language, (b) top
right: different WORDFORM values and DSMs for English, (c) bottom left: different DIMENSION
values and DSMs for French, and (d) bottom right: different WINDOWSIZE values and DSMs for
Portuguese.

We have calculated cross-validation scores for a wide range of configurations, focusing on
the following DSMs: PPMI–thresh, w2v–cbow, and w2v–sg. Figure 4 presents the average
Spearman correlations of the cross-validation experiments compared with the best results
reported in the previous section, referred to as oracle. In the top-left panel, Figure 4(a),
the x-axis indicates the DSMs for each language using the best oracle configuration. In the
other panels, it indicates the best oracle configuration for a specific DSM and a fixed
parameter for a given language. We present only a sample of the results for fixed
parameters, as they are stable across languages. Results are presented in ascending order of
oracle Spearman correlation. For each oracle data point, the associated average Spearman
from cross-validation is presented along with the 95% confidence interval.

The Spearman correlations obtained through cross-validation are comparable to the ones
obtained by the oracle. Moreover, the results are quite stable: increasingly better oracle
configurations tend to be correlated with increasingly better cross-validation scores.
Indeed, the Pearson r correlation between the 9 oracle points and the 9 cross-validation
points in the top-left panel is 0.969, attesting to the correlation between cross-validation
and oracle scores.

For PT-comp, the confidence intervals are quite wide, meaning that prediction quality is
sensitive to the choice of compounds used to estimate the best configurations. Probably a
larger data set would be required to stabilize the cross-validation results. Nonetheless,
the other two data sets seem representative enough, so that the small confidence intervals
show that, even if we fix the value of a given parameter (e.g., d = 750), the results using
cross-validation are stable and very similar to the oracle.

The confidence intervals overlapping with oracle data points also indicate that most
cross-validation results are not statistically different from the oracle. This suggests that
the highest-Spearman oracle configurations could be trusted as reasonable approximations of
the best configurations for other data sets collected for the same language and constructed
using similar guidelines.

6.3 Evaluation on Held-Out Data

As an additional test of the robustness of the results obtained, we calculated the
performance of the best models obtained for one of the data sets (EN-comp) on a separate
held-out data set (EN-compExt). The latter contains 100 compounds balanced for
compositionality, not included in EN-comp (that is, not used in any of the preceding
experiments). The results obtained on EN-compExt are shown in Table 5. They are comparable
to, and mostly better than, those for the oracle and for cross-validation. As the items are
different in the two data sets, a direct comparison of the results is not possible, but the
equivalent performances confirm the robustness of the models and configurations for
compositionality prediction. Moreover, these results are obtained in an unsupervised manner,
as the compositionality scores are not used to train any of the models. The scores are used
only for comparative purposes, for determining the impact of various factors on the ability
of these DSMs to predict compositionality.

7. Influence of DSM Parameters

In this section, we analyze the influence of DSM parameters on compositionality prediction.
We consider different window sizes (Section 7.1), numbers of vector dimensions (Section 7.2),
types of corpus preprocessing (Section 7.3), and corpus sizes (Section 7.4). For each
parameter, we analyze all possible values of the other parameters. In other words, we report
the best results obtained by fixing a value and considering all possible configurations of
the other parameters. Results reported in this section use the pcuniform function.

Table 5
Configurations with best performances on EN-comp and on EN-compExt.
Best performances are measured on EN-comp and the corresponding configurations are applied
to EN-compExt.

DSM           WORDFORM    WINDOWSIZE   DIMENSION   ρ EN-comp   ρ EN-compExt
PPMI–SVD      surface     1+1          250         0.655       0.692
PPMI–TopK     lemmaPoS    8+8          1,000       0.624       0.680
PPMI–thresh   lemmaPoS    8+8          750         0.688       0.675
glove         lemmaPoS    8+8          500         0.637       0.670
lexvec        lemmaPoS    8+8          250         0.646       0.685
w2v–cbow      surface+    1+1          750         0.716       0.731
w2v–sg        surface+    1+1          750         0.726       0.733

Figure 5
Best results for each DSM and WINDOWSIZE (1+1, 4+4, and 8+8), using BF1 for Farahmand, and
Spearman ρ for the other data sets. Thin bars indicate the use of fallback in English.
Differences between the two highest Spearman correlations for each model are statistically
significant (p < 0.05), except for PPMI–SVD, according to Wilcoxon's sign-rank test.

7.1 Window Size

DSMs build the representation of every word based on the frequency of other words that
appear in its context. Our hypothesis is that larger window sizes result in higher scores,
as the additional data allows a better representation of word-level semantics. However, as
some of these models adopt different weight decays for larger windows,36 variation in their
behavior related to window size is to be expected.

Contrary to our expectations, for the best models in each language, large windows did not
lead to better compositionality prediction. Figure 5 shows the best results obtained for
each window size.37 For English, w2v is the best model, and its performance does not seem to
depend much on the size of the window, although there is a small trend for smaller sizes to
be better. For French and Portuguese, PPMI–thresh is the best model only for the minimal
window size, and there is a large gap in performance for PPMI–thresh as the window size
increases, such that for larger windows it is outperformed by other models.

36 For PPMI–SVD with WINDOWSIZE=8+8, a context word at distance D from its target word is weighted
(8 − D)/8. For glove, the decay happens much faster, with a weight of 8/D, which allows the model to
look farther away without being affected by potential noise introduced by distant contexts.
37 Henceforth, we omit results for EN-comp90 and Reddy, as they are included in EN-comp.

To assess which of these differences are statistically significant, we have performed
Wilcoxon's sign-rank test on the two highest Spearman values for each DSM in each language.
All differences are statistically significant (p < 0.05), with the exception of PPMI–SVD.
The appropriate choice of window size has been shown to be task-specific (Lapesa and Evert
2017), and the results above suggest that, for compositionality prediction, it also depends
on the DSM used. Overall, the trend is for smaller windows to lead to better compositionality
prediction.

7.2 Dimension

When creating corpus-derived vectors with a DSM, the question is whether additional
dimensions can be informative in compositionality prediction.
Our hypothesis is that the larger the number of dimensions, the more precise the representations, and the more accurate the compositionality prediction. The results shown in Figure 6 for each of the comparable data sets confirm this trend in the case of the best DSMs: w2v and PPMI–thresh. Moreover, the effect of changing the vector dimensions for the best models seems to be consistent across these languages. The results for PPMI–SVD, lexvec, and glove are more varied, but they are never among l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Figure 6 Best results for each DSM and DIMENSION, using BF1 for Farahmand data set, and Spearman ρ for all the other data sets. For English, the thin bars indicate results using fallback. Differences between two highest Spearman correlations for each model are statistically significant (p < 0.05), except for PPMI–SVD for FR-comp, according to Wilcoxon’s sign-rank test. 24 Cordeiro et al. Unsupervised Compositionality Prediction of Nominal Compounds the best models for compositionality prediction in any of the languages.38 All differences between the two highest Spearman correlations are statistically significant (p < 0.05), with the exception of PPMI–SVD for FR-comp, according to Wilcoxon’s sign-rank test. 7.3 Type of Preprocessing In related work, DSMs are constructed from corpora with various levels of pre- processing (Bullinaria and Levy 2012; Mikolov et al. 2013; Pennington, Socher, and Manning 2014; Kiela and Clark 2014; Levy, Goldberg, and Dagan 2015; Salle, Villavicencio, and Idiart 2016). In this work, we compare four levels: WORDFORM= surface+, surface, lemmaPoS and lemma, described in Section 5.1, corresponding to decreas- ing amounts of information. Testing different varieties of corpus preprocessing allows us to explore the trade-off between informational content and the statistical significance related to data sparsity for compositionality prediction. Figure 7 presents the impact of different types of corpus preprocessing on the quality of compositionality prediction. In EN-comp, all differences between the two highest Spearman values for each DSM were significant, according to Wilcoxon’s sign- rank test, except for PPMI–thresh, whereas in FR-comp and PT-comp they were significant only for PPMI–TopK and lexvec. However, note that the top two results are often both obtained on representations based on lemmas. If we compare the highest lemma-based result with the highest surface-based result for the same DSM, we find a statistically significant difference in every single case (p < 0.05). When considering the results themselves, although the results for English are het- erogeneous, for French and Portuguese, the lemma-based representations consistently allow a better prediction of compositionality scores. This may be explained by the fact that these two languages are morphologically richer than English, and lemma-based representations reduce the sparsity in the data, allowing more information to be gath- ered from the same amount of data. Moreover, adding POS information (lemmaPoS vs. lemma) does not seem to bring consistent improvements that are statistically significant. This suggests that words that share the same lemma are semantically close enough that any gains from disambiguation are masked by the sparsity of a higher vocabulary size. 
Finally, the impact of stopword removal is also inconclusive (surface vs. surface+), considering the best models for each language. 7.4 Corpus Size If we assume that the bigger the corpus, the better the DSM, this could explain why the results for English are better than those for French and Portuguese, although it does not explain why Portuguese is behind French.39 In this section, we examine the impact of corpus size on prediction quality by incrementally increasing the amount of data used to generate the DSMs while monitoring the Spearman correlation (ρ) with the human annotations. We use only the best DSMs for these languages, PPMI–thresh and w2v–sg, with the configurations that produced highest Spearman scores for each full corpus. As expected, the results in Figure 8 show a smooth, roughly monotonic increase of the ρ values with corpus size, for PPMI–thresh and w2v–sg for each language and 38 For PPMI–SVD and lexvec, this behavior might be related to the fact that both methods perform a factorization of the PPMI matrix. 39 As the characteristics of Farahmand are different from the other data sets, in this analysis we only use the other more comparable data sets. 25 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 45, Number 1 Figure 7 Best results for each DSM and WORDFORM, using BF1 for Farahmand data set, and Spearman ρ for all the other data sets. For English, the thin bars indicate results using fallback. In EN-comp all differences between the two highest Spearman values for each DSM were significant, according to Wilcoxon’s sign-rank test, except for PPMI–thresh, while in FR-comp and PT-comp they were only significant for PPMI–TopK and lexvec. data set.40 In all cases there is a clear saturation behavior, so that we can safely say that after one billion tokens, the quality of the predictions reaches a plateau and additional corpus fragments do not bring improvements. This suggests that differences in compo- sitionality prediction performance for these languages cannot be totally explained by differences in corpus sizes. 8. Influence of Compositionality Prediction Function Up to this point, the predicted compositionality scores for the compounds were calcu- lated using a uniform function that assumes that each component contributes 50% to 40 For PPMI–thresh, eight different samplings of corpus fragments were performed (for a total of 800 DSMs per language), with each y-axis data point presenting the average and standard deviation of the ρ obtained from those samplings. For w2v–sg, since it is much more time-consuming, a single sampling was used, and thus only one execution was performed for each datapoint (for a total of 100 DSMs per language). 26 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Cordeiro et al. Unsupervised Compositionality Prediction of Nominal Compounds Figure 8 Spearman’s ρ for increasing corpus sizes for PPMI–thresh (left) and w2v–sg (right) for EN-comp in red, FR-comp in blue, and PT-comp in green. Corpus sizes are in the x-axis in billion words. Curves for PPMI–thresh show average and standard deviation (error bars) across 8 samplings of the corpus. the meaning of the compound (pcuniform). 
However, this might not accurately capture a faithful representation of compounds whose meaning is more semantically related to one of the components (e.g., crocodile tears, which is semantically closer to the head tears; and night owl, which is semantically closer to the modifier night). As this may have an impact on the success of compositionality prediction, in this section we evaluate how different compositionality prediction functions model these compounds. In particular, we proposed pcmaxsim, (Section 4) for dynamically determining weights that assign maximal similarity between the compound and each of its components. We have also proposed pcgeom, which favors idiomatic readings through the geometric mean of the similarities between a compound and its components. Our hypotheses are that pcmaxsim will be better correlated with human scores for compositional and partly compositional compounds, while pcgeom can better capture the semantics of idiomatic ones (Section 8.1). First, to verify whether other prediction functions improve results obtained for the best pcuniform configurations reported up to now, we have evaluated every strategy on all DSM configurations. Table 6 shows that the functions that combine both components (columns pcuniform to pcarith) generate better compositionality predictions than functions that ignore one of the individual components (columns pchead and pcmod). There is some variation among the combined scores, with the best score indicated in bold. Every best score is statistically different from all other scores in its row (p < 0.05). The results for pcarith and pcuniform are very similar, reflecting their similar formulations.41 Here we focus on the issue of adjusting β in the compositionally constructed vector; that is, we consider the use of pcmaxsim instead of pcuniform. This score seems to be beneficial in the case of English (EN-comp), but not in the case of French or Portuguese. 41 The Pearson correlations (averaged across 7 DSMs) between pcarith and pcuniform are r = .972 for EN-comp, r = .991 for FR-comp, and r = .969 for PT-comp, confirming their similar results. 27 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 0.20.40.60.81.01.21.41.61.82.02.2Corpus size (in billions of words)0.00.20.40.60.81.0Average Spearman ρ (±σ)PPMI-threshEN-compFR-compPT-comp0.20.40.60.81.01.21.41.61.82.02.2Corpus size (in billions of words)0.00.20.40.60.81.0Spearman ρw2v-sgEN-compFR-compPT-comp Computational Linguistics Volume 45, Number 1 Table 6 Spearman ρ for the proposed compositionality prediction scores, using the best DSM configuration for each score. Data set pcuniform pcmaxsim pcgeom pcarith pchead pcmod EN-comp FR-comp PT-comp .726 .702 .602 .730 .693 .590 .677 .699 .580 .718 .703 .598 .555 .617 .558 .677 .645 .486 Table 7 presents the best pcmaxsim model for each data set, along with the average weights assigned to head and modifier for every compound in the data set. Before analyzing the results in Table 7, we have to verify whether the data sets are balanced for the influence of each component to the meaning of the whole, or if there is any bias towards heads/modifiers. The influence of the head, estimated as the average of hcH/(hcH + hcM) over all compounds of a data set, is 0.50 for EN-comp, 0.52 for FR- comp, and 0.52 for PT-comp. 
This indicates that the data sets are balanced in terms of the influence of each component,
and neither head nor modifier predominates as more compositional or idiomatic than the
other.

As for the average β weights in pcmaxsim, while the weights that maximize compositionality
are fairly similar for EN-comp, they strongly favor the head for both FR-comp and PT-comp.
This may be explained by the fact that, for the latter, the modifiers are all adjectives,
while EN-comp has mostly nouns as modifiers. Surprisingly, this seemingly more realistic
weighting of the compound components for French and Portuguese is not reflected in better
compositionality scores, and does not correspond to the average influence of modifiers in
these data sets, estimated as 0.48 on average. One possible explanation could be that, in
these cases, the adjectives may be contributing to some specific, more idiomatic meaning
that is not found in isolated occurrences of the adjective itself, such as FR beau (lit.
beautiful), which is used in the translation of most in-law family members, such as FR
beau-frère (lit. beautiful-brother 'brother-in-law'). In the next section, we investigate
which compounds are affected the most by these different scores.

Table 7
DSM and Spearman ρ of pcmaxsim, as well as the average weights for the head (β) and for the
modifier (1 − β) on each data set.

Data set   DSM           ρmaxsim   β (head)   1 − β (mod.)
EN-comp    w2v–sg        .730      .55        .45
FR-comp    PPMI–thresh   .693      .68        .32
PT-comp    w2v–sg        .590      .68        .32

Figure 9
Distribution of improvmaxsim (y-axis) as a function of rkhuman (x-axis). Outliers are
indicated by numbers 1–8 (positive improvement) and letters A–H (negative improvement).

8.1 Rank Improvement Analysis

To better evaluate the effect of adjusting β for the individual compounds with respect to
the pcuniform score, we define the rank improvement as:

improvf(w1w2) = |rkuniform(w1w2) − rkhuman(w1w2)| − |rkf(w1w2) − rkhuman(w1w2)|,

where rk indicates the rank of the compound w1w2 in the data set when ordered according to
pcuniform, human annotations hcHM, or the compositionality prediction function f. For
instance, when f = maxsim, positive improvmaxsim values indicate that pcmaxsim yields a
better approximation of the ranks assigned by hcHM than pcuniform, whereas negative values
indicate that pcuniform provides a better ranking.

We perform a cross-lingual analysis, grouping the hcHM scores of EN-comp, FR-comp, and
PT-comp into a unique data set (henceforth ALL-comp), containing 540 compounds. Figure 9
presents the values of rank improvement for the best PPMI–thresh and w2v–sg configurations,
ranked according to hcHM (rkhuman): compounds that are better predicted by pcmaxsim have
positive rank movements (above the 0 line).42 The density of movement on either side of the
0 (no movement) line appears to be similar for both models, with pcmaxsim performing as well
as pcuniform.

Figure 9 also marks the outlier compounds with the highest improvements (numbers from 1 to
8) and those with the lowest improvements (letters from A to H), and Table 8 shows their
improvement scores. In the case of these outliers, the adjustment seems to be more
beneficial to compositional compounds than to idiomatic cases.
This is confirmed by a linear regression of the movement of the 8+8 outliers as a function of the compositionality scores hcHM, where we obtain a positive coefficient of r = 0.73 and r = 0.72 for PPMI–thresh and w2v–sg, respectively. There are more outlier com- pounds for Portuguese and French (particularly the former), suggesting that pcmaxsim has a stronger impact on those languages than on English. Moreover, some compounds had a similar improvement under both DSMs, with, for example, high improvement 42 We focus on one representative of PPMI-based DSMs and one representative of word-embedding models. Similar results were observed for the best configurations of other DSMs. 29 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 060120180240300360420480540Compounds ranked by hcHM15010050050100150Improvement from uniform to maxsim12345678ABCDEFGHPPMI-thresh060120180240300360420480540Compounds ranked by hcHM15010050050100150Improvement from uniform to maxsim12345678ABCDEFGHw2v-sg Computational Linguistics Volume 45, Number 1 Table 8 Outlier compounds with extreme positive/negative improvmaxsim values. Example identifiers correspond to numbers/letters shown in Figure 9. ID improv hcHM Compound ‘translation’ (gloss) improvmaxsim for PPMI–thresh 1 2 3 4 5 6 7 8 H G F E D C B A +90 +88 +86 +67 +63 +58 +53 +48 −42 −44 −44 −46 −52 −55 −81 −83 2.82 2.90 2.89 1.92 3.19 3.14 2.90 3.00 1.52 2.84 0.54 1.29 2.87 1.43 0.79 1.06 FR premier plan ‘foreground’ (lit. first plan) FR mati´ere premi´ere ‘raw material’ (lit. matter primary) PT amigo oculto ‘secret Santa’ (lit. friend hidden) FR premi´ere dame ‘first lady’ (lit. first lady) PT caixa forte ‘safe, vault’ (lit. box strong) PT prato feito ‘blue-plate special’ (lit. plate ready-made) FR id´ee re¸cue ‘popular belief’ (lit. idea received) FR mar´ee noire ‘oil spill’ (lit. tide black) PT alta costura ‘haute couture’ (lit. high sewing) EN half sister EN melting pot FR berger allemand ‘German shepherd’ (lit. shepherd German) PT mar aberto ‘open sea’ (lit. sea open) PT febre amarela ‘yellow fever’ (lit. fever yellow) PT livro aberto ‘open book’ (lit. book open) PT cora¸c˜ao partido ‘broken heart’ (lit. heart broken) improvmaxsim for w2v–sg ID improv hcHM Compound ‘translation’ (gloss) 1 2 3 4 5 6 7 8 H G F E D C B A +138 +126 +116 +107 +100 +95 +79 +69 −68 −70 −71 −82 −85 −86 −109 −128 3.58 3.67 3.19 2.03 3.97 4.11 4.47 3.64 0.40 1.52 3.66 1.35 1.10 2.84 1.43 1.06 PT cerca viva ‘hedge’ (lit. fence living) FR coffre fort ‘safe, vault’ (lit. chest/box strong) PT caixa forte ‘safe, vault’ (lit. chest/box strong) PT golpe baixo ‘low blow’ (lit. punch low) PT primeira necessidade ‘first necessity’ (lit. first necessity) EN role model FR bonne pratique ‘good practice’ (lit. good practice) PT carta aberta ‘open letter’ (lit. letter open) ‘most important helper/assistant’ (lit. arm right) FR bras droit PT alta costura ‘haute couture’ (lit. high sewing) PT carne vermelha ‘red meat’ (lit. meat red) PT alto mar ‘high seas’ (lit. high sea) PT mesa redonda ‘round table’ (lit. table round) EN half sister PT febre amarela ‘yellow fever’ (lit. fever yellow) PT cora¸c˜ao partido ‘broken heart’ (lit. heart broken) for PT caixa forte literally box strong ‘safe’ and low improvement for PT cora¸c˜ao partido ‘broken heart’. 
In addition, pcmaxsim also affected some equivalent compounds in differ- ent languages, as in the case of PT caixa forte and FR coffre fort. Overall, pcmaxsim does not present a considerable impact on the predictions, obtaining an average improvement of improvmaxsim = +0.41 across all compounds in ALL-comp. Figure 10 shows the same analysis for f = geom, showing the improvement score of pcgeom over pcuniform. We hypothesized that pcgeom should more accurately represent 30 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Cordeiro et al. Unsupervised Compositionality Prediction of Nominal Compounds Figure 10 Distribution of improvgeom (y-axis) as a function of rkhuman (x-axis). Outliers are indicated by numbers 1–8 (positive improvement) and letters A–H (negative improvement). idiomatic compounds. From the previous sections, we know that pcgeom has lower performance than pcuniform when used to estimate the compositionality of the entire data sets (cf. Table 6). This is confirmed by an average score of improvgeom = −7.87. As in Figure 9, Figure 10 shows a random distribution of improvements. However, the outliers have the opposite pattern, indicating that large reclassifications due to pcgeom tend to favor idiomatic instead of compositional compounds. The linear regression of the movement of the outliers as a function of the compositionality scores results in r = −0.73 and r = −0.82 for PPMI–thresh and w2v–sg, respectively. These confirm our hypothesis for the behavior of pcgeom. Table 9 lists the outlier compounds indicated in Figure 10 along with their improve- ment values. Here again, the majority of the outliers belong to PT-comp. Some of the compounds that were found as outliers in pcmaxsim re-appear as outliers for pcgeom with inverted polarity in the improvement score, such as the ranks predicted by PPMI– thresh for PT prato feito literally plate made ‘blue-plate special’ (improvmaxsim = +58, improvgeom = −234) and by w2v–sg for FR bras droit literally arm right ‘assistant’ (improvmaxsim = −68, improvgeom = +228). This suggests that, as future work, we should consider combining both approaches into a single prediction that decides which score to use for each compound as a function of pcuniform. 9. Characterization of the Predicted Compositionality In the previous sections, we examined the performance of the compositionality predic- tion framework in terms of the correlation between automatic predictions and human judgments across languages. We now investigate the relation between predicted scores and other variables that may have an impact on results, such as familiarity (Section 9.1) and conventionalization (Section 9.2). We also compare the predicted compositionality scores with trends previously found in human scores (Section 9.3). The experiments focus on the ALL-comp data set, which groups the predicted scores from the best config- urations on EN-comp, FR-comp, and PT-comp (cf. Table 4). 31 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . 
f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 060120180240300360420480540Compounds ranked by hcHM4003002001000100200300400Improvement from uniform to geom12345678ABCDEFGHPPMI-thresh060120180240300360420480540Compounds ranked by hcHM4003002001000100200300400Improvement from uniform to geom12345678ABCDEFGHw2v-sg Computational Linguistics Volume 45, Number 1 Table 9 Outlier compounds with extreme positive/negative improvgeom values. Example identifiers correspond to numbers/letters shown on Figure 10. improvgeom for PPMI–thresh ID improv hcHM Compound ‘translation’ (gloss) 1 2 3 4 5 6 7 8 H G F E D C B A +157 +110 +109 +104 +93 +85 +82 +79 −190 −202 −202 −234 −292 −327 −370 −376 1.31 3.43 2.83 1.35 2.63 3.32 2.62 1.18 2.44 3.67 3.57 3.14 3.64 3.64 4.08 1.69 EN snail mail FR guerre civile ‘civil war’ (lit. war civil) FR disque dur ‘hard drive’ (lit. disk hard) PT alto mar ‘high seas’ (lit. high sea) PT ˆonibus executivo ‘minibus’ (lit. bus executive) EN search engine PT carro forte ‘armored car’ (lit. car strong) EN noble gas ‘safe, vault’ (lit. chest/box strong) PT ar condicionado ‘air conditioning’ (lit. air conditioned) FR coffre fort FR bon sens ‘common sense’ (lit. good sense) PT prato feito ‘blue-plate special’ (lit. plate ready-made) FR baie vitr´ee ‘open glass window’ (lit. opening glassy) PT carta aberta ‘open letter’ (lit. letter open) PT vinho tinto ‘red wine’ (lit. wine dark-red) PT circuito integrado ‘short circuit’ (lit. short circuit) improvgeom for w2v–sg ID improv hcHM Compound ‘translation’ (gloss) 1 2 3 4 5 6 7 8 H G F E D C B A +228 +158 +127 +104 +89 +75 +73 +72 −151 −169 −190 −238 −256 −260 −266 −370 0.40 1.40 1.35 0.10 1.24 1.60 0.65 3.32 2.76 4.63 2.62 2.83 2.84 3.64 4.47 4.25 ‘most important helper/assistant’ (lit. arm right) FR bras droit PT lua nova ‘new moon’ (lit. moon new) PT alto mar ‘high seas’ (lit. high sea) PT p´e direito ‘ceiling height’ (lit. foot right) EN carpet bombing PT lista negra ‘black list’ (lit. list black) PT arma branca ‘cold weapon’ (lit. weapon white) EN search engine PT disco r´ıgido ‘hard drive’ (lit. disk rigid) EN subway system PT carro forte ‘armored car’ (lit. car strong) FR disque dur ‘hard drive’ (lit. disk hard) EN half sister PT carta aberta ‘open letter’ (lit. letter open) FR bonne pratique ‘good practice’ (lit. good practice) EN end user 9.1 Predicted Compositionality and Familiarity Results from Section 3.2.2 show that the familiarity of compounds measured as fre- quency in large corpora is associated with the compositionality scores assigned by humans. We would like to know whether this correlation also holds true to system predictions: Are the most frequent compounds being predicted as more compositional? As expected, the rank correlation between frequency and pcuniform shows medium to 32 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Cordeiro et al. Unsupervised Compositionality Prediction of Nominal Compounds Table 10 Spearman ρ correlations between different variables. We consider the set of predicted scores (pc), the set of human–prediction differences (diff), the compound frequencies (freq), and the compound PMI. The predicted scores are the ones from the best configurations of each sub–data set in ALL-comp. Correlations are indicated only when significant (p < 0.05). 
DSM ρ(pc,freq) ρ(diff,freq) ρ(pc, PMI) ρ(diff, PMI) PPMI–SVD PPMI–thresh glove lexvec PPMI–TopK w2v–cbow w2v–sg 0.36 0.46 0.68 0.54 0.28 0.51 0.50 0.17 0.22 −0.19 * 0.15 * * 0.28 0.13 0.26 0.26 * 0.17 0.17 * * * −0.12 * * * strong correlation (see Table 10, column ρ[pc,freq]), though the level of correlation is somewhat DSM-dependent, are in line with the correlation observed between frequency and human scores, and with the high correlation between predicted and human scores. Another hypothesis we test is whether frequent compounds are easier to model. A first intuition would be that this hypothesis is true, as a higher number of occurrences is associated with a larger amount of data, from which more representative vectors can be built. To test this hypothesis, we define a compound’s difficulty as the difference between the predicted score and the normalized human score, diff = |pc − (hcHM/5)|, where high values indicate a compound whose compositionality is harder to predict.43 We found a weak (though statistically significant) correlation between frequency and difficulty for some of the DSMs (Table 10, column ρ[diff,freq]). They are mostly positive, indicating that frequency is correlated with difficulty, which is a surprising result, as it implies that the compositionality of rarer compounds was mildly easier to predict for these systems, disproving the hypothesis above. These results either point to an overall lack of correlation between frequency and difficulty, or indicate mild DSM- specific behavior, which should be investigated in further research. 9.2 Predicted Compositionality and Conventionalization PMI is not only a well-known estimator of the level of conventionalization of a multi- word expression (Church and Hanks 1990; Evert 2004; Farahmand, Smith, and Nivre 2015), but it is also used in some DSMs as a way to estimate the strength of associ- ation between target and context words. To assess if what our models are implicitly measuring is the association between the component words of a compound rather than compositionality, we now examine the correlation between compositionality scores and PMI. We found only a weak but statistically significant correlation between predicted compositionality and PMI (Table 10, column ρ[pc, PMI]), which suggests that these DSMs preserve some information regarding conventionalization. However, given that no significant correlation between PMI and human compositionality scores was found 43 We linearly normalize predicted scores to be between 0 and 1. However, given that negative scores are rare in practice, unreported correlation with non-normalized pc are similar to the ones reported. 33 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 45, Number 1 (Section 3.2.2) and as DSM predictions are strongly correlated to human predictions, these results indicate that our models capture more than conventionalization. They may also be a feature of this particular set of compounds, as even the compositional cases are also conventional to some extent (e.g., white/?yellow wine). Therefore, further investigation of possible links between idiomaticity and conventionalization is needed. 
We also calculated the correlation between PMI and the human–prediction differ- ence (diff), to determine if DSMs build less precise vectors for less conventionalized compounds (approximated as those with lower PMI). However, no statistically signifi- cant correlation was found for most DSMs (Table 10, column ρ[diff, PMI]). 9.3 Range-Based Analysis of Predicted Compositionality Spearman correlation assesses the performance of a given configuration by providing a single numerical value. This facilitates the comparison between configurations, but it hides the internal distribution of predictions. By splitting the data sets into ranges, we obtain a more fine-grained view of possible patterns linked to compositionality prediction. To determine if the compounds that humans agree more on are also more accurately predicted, we divided ALL-comp into three equally sized subsets, according to the stan- dard deviation among human annotators (low, mid-range, and high values of standard deviation, σHM). As high standard deviation indicates disagreement among annotators, it may be an indicator of the difficulty of the annotation. Table 11 presents the best DSMs, according to Spearman’s ρ evaluated separately on each of the subsets. Indeed, for the compounds that had low σHM, the Spearman values were the highest (between 0.73 and 0.75), while for those with high σHM, the Spearman correlation with human judgments was the lowest (between 0.35 and 0.43). These results confirm that higher scores are achieved for the compounds for which humans agree more, and suggest that part of the difficulty of this task for automatic systems is also related to difficulties for humans. To determine if compositional compounds would be more precisely predicted than idiomatic compounds, we divide ALL-comp into three equally sized subsets based on the level of human compositionality scores (low, mid-range, and high values of hcHM). Table 11 presents the correlation obtained on each subset for the best configuration of each DSM. The more idiomatic compounds have the lowest Spearman values (from 0.16 to 0.29) while the more compositional have the highest ones (from 0.32 to 0.37). These results confirm that the predictions are better for compositional than for idiomatic compounds. Moreover, these scores are much lower than those from the full data set Table 11 Spearman’s ρ of best pcuniform models, separated into 3 ranges according to σHM and according to hcHM, all with p < 0.05. DSM full data set Ranges of σHM Ranges of hcHM PPMI–thresh glove lexvec w2v–sg 0.66 0.63 0.64 0.66 low 0.75 0.73 0.73 0.73 mid 0.58 0.54 0.54 0.58 high 0.40 0.35 0.36 0.43 low 0.29 0.27 0.18 0.16 mid 0.24 0.26 0.20 0.24 high 0.37 0.35 0.37 0.32 34 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Cordeiro et al. Unsupervised Compositionality Prediction of Nominal Compounds (from 0.63 to 0.66), suggesting that it may be harder to make fine-grained distinctions (e.g., between two compositional compounds like access road and subway system) than to make inter-range distinctions (e.g., between idiomatic and compositional compounds like ivory tower and access road). However, further investigation would be needed to verify this hypothesis. 10. 
Conclusions We proposed a framework for compositionality prediction of multiword expressions, focusing on nominal compounds and using DSMs for meaning representation. We investigated how accurately DSMs capture idiomaticity compared to human judgments and examined the impact of several variables in the accuracy of the predictions. In order to determine how language dependent the results are, we evaluated the com- positionality prediction framework in English, French, and Portuguese, using data sets containing human-rated compositionality scores, some of which were specifically constructed as part of this work.44 Using these data sets, we presented a large-scale evaluation involving 228 DSMs for each language, and we evaluated more than 9,000 framework configurations to determine the impact of possible factors that may influ- ence compositionality prediction. Our experiments confirmed that our framework is able to capture idiomaticity accurately, obtaining a strong correlation with human judgments for all three languages. Comparing the performance of different DSMs, the particular choice of DSM had a noticeable impact on the results, with differences over 0.10 Spearman ρ points for all lan- guages. For the comparable data sets (EN-comp, FR-comp, and PT-comp), the best models were w2v and PPMI–thresh.45 Results differed according to language: although for English w2v were the best models, for French and Portuguese, PPMI–thresh outper- formed the other models. Moreover, the results for the three languages varied con- siderably, with those for English outperforming by 0.10 and 0.20 Spearman ρ points those for French and Portuguese, respectively. The latter are morphologically richer than the former, and a closer examination of the type of preprocessing adopted for best results reveals that both languages benefit from less sparse representations resulting from lemmatization and stopword removal, while for English no preprocessing was particularly beneficial. Although corpus size is often assumed to play a fundamental role in the quality of DSMs, so that the bigger the corpus the better the results, prediction quality stabilized at around one billion tokens for all languages. This may reflect the point where the minimum frequency was reached for producing reliable representations for all com- pounds in these data sets, even the rare cases, and larger corpora did not lead to better predictions. Moreover, for the best models in each language, DSMs with more dimensions resulted in more accurate predictions confirming our hypothesis. We also found a trend for small window sizes leading to better results for the best models in all three languages, contrary to our hypothesis. A typically good configuration used vectors of 750 dimensions built from minimal context windows of one word to each side of the target. DSMs were also robust regarding the choice of compositionality prediction func- tion, with a uniform combination of the head and modifier producing the best results 44 The resulting data sets and framework implementation are freely available to the community. 45 As Farahmand is considerably different from the other data sets, a direct comparison is not possible. 35 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 45, Number 1 for all languages. 
Other functions like pcmaxsim and pcgeom, which modify these scores to account for different contributions of each component, produced at best similar results. A deeper analysis of the predicted compositionality scores revealed that, similarly to human-rated scores, familiarity measured as frequency was positively correlated with predicted compositionality. In the case of conventionalization measured as PMI, no correlation was found with human-rated scores and only a mild correlation was found with some predicted scores, suggesting that our models capture more than compound conventionalization, as they have a strong agreement with human scores. Intra-compound standard deviation on human scores was also found to be related to predicted scores, indicating that DSMs have difficulties on those compounds that humans also found difficult. Moreover, predictions were found to be more accurate for compositional compounds. Although there are many questions that still need to be solved regarding compo- sitionality, we believe that the results presented here advance significantly its under- standing and computational modeling. Furthermore, the proposed framework opens important avenues of research that are ready to be pursued. First, the role of morpholog- ical inflection could be clarified by extending this investigation to even more inflected languages, such as Turkish. Moreover, other categories of MWEs such as verb+noun expressions should be evaluated to determine the interplay between compositionality prediction and syntactic flexibility of MWEs. The ultimate test would be to use predicted compositionality scores in downstream applications and tasks involving some degree of semantic processing, ranging from MWE identification to parsing, and word-sense disambiguation. In particular, it would be interesting to predict compositionality in context, in order to distinguish idiomatic from literal usages in sentences. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 5 1 1 1 8 0 9 6 8 8 / c o l i _ a _ 0 0 3 4 1 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Appendix A. Glossary composition function is a function that takes as input a sequence of vectors v(wi) to v(wj) and outputs a compositionally constructed vector v⊕(wi . . . wj) representing the compositional meaning of the sequence, where ⊕ indicates the function used to compose the vectors. Example: ||v(w1 )|| + (1 − β) v(w2 ) vβ(w1, w2) = β v(w1 ) ||v(w2 )|| . 1, 23 compositionality prediction configuration is the combination of a particular DSM configuration with a given compositionality prediction function, fully specifying how a predicted compositionality score is calculated for a given word sequence wi . . . wj. 1 compositionality prediction framework is the set of all possible compositionality prediction configurations available. 1, 22 compositionality prediction function is a function that takes as input corpus-based vectors for a sequence of words v(wi . . . wj) and for the individual words composing that sequence v(wi) . . . v(wj), and outputs a predicted compositionality score, usually proportional to the similarity between the corpus-based vector v(wi . . . wj) and a compositionally constructed vector v(wi) to v(wj) derived from v(wi) . . . v(wj) using a composition function. Example: maxsim. 1 36 Cordeiro et al. 
compositionally constructed vector is the output of a composition function, that is, a vector v⊕(wi . . . wj) derived from the individual words' corpus-derived vectors v(wi) to v(wj). 1, 23

corpus-derived vector is the output of a DSM for a given element wi of the vocabulary V, that is, a corpus-derived D-dimensional real-numbered vector v(wi) that represents the meaning of wi. A corpus-derived vector of a word sequence v(wi . . . wj) is built by treating the sequence as a single token in the corpus. 1, 22

distributional semantic model (DSM) is a function that takes as input a vocabulary V and a (large) corpus, and outputs a corpus-derived vector v(wi) for each element wi of V based on the distributional profile of wi's occurrences in the corpus. The vocabulary V can be automatically derived from the input corpus. Example: w2v–cbow. 1

DSM configuration is a set of DSM parameters and their values, fully specifying how corpus-derived vectors are built from a given corpus. Example: w2v–cbow using lemmaPoS.W8.d750. 1, 28

DSM parameter is a variable in a DSM whose value influences the way corpus-derived vectors are built from a corpus. Example: WORDFORM. 1, 28

human compositionality score (hc) is a real value representing the compositionality assigned by human annotators to a word sequence wi . . . wj. The correlation between predicted compositionality (pc) and human compositionality (hc) scores is used to evaluate a compositionality prediction configuration. When subscripted, it indicates the question used to obtain the score. Example: hcH. 1, 17, 29

predicted compositionality score (pc) is the output of a compositionality prediction function, that is, a real value representing the predicted compositionality of a word sequence wi . . . wj. The correlation between predicted compositionality (pc) and human compositionality (hc) scores is used to evaluate a compositionality prediction configuration. When subscripted, it indicates the compositionality prediction function used to obtain the score. Example: pcuniform. 1, 29

Appendix B. Sanity Checks

The number of possible DSM configurations grows exponentially with the number of internal variables in a DSM, forestalling the possibility of an exhaustive search over every possible parameter. We have evaluated in this article the set of variables that are most often manually tuned in the literature, but a reasonable question would be whether these results can be further improved through the modification of some other often-ignored model-specific parameters. We thus perform some sanity checks through a local search of such parameters around the highest-Spearman configuration of each DSM.

B.1 Number of Iterations

Some of the DSMs under consideration in this article are iterative: they re-read and re-process the same corpus multiple times. For those DSMs, we present the results of running their best configuration, but using a higher number of iterations. This higher number of iterations is inspired by the models found in parts of the literature, where, for example, the number of glove iterations can be as high as 50 (Salle, Villavicencio, and Idiart 2016) or even 100 (Pennington, Socher, and Manning 2014). The intuition is that most models will lose some information (due to their probabilistic sampling), which could be regained at the cost of a higher number of iterations.

Table 12
Results using a higher number of iterations.

Model (FR-comp)    ρbase   ρiter=100   Difference (%)
w2v–cbow           .660    .640        (−2.0)
w2v–sg             .672    .636        (−3.7)
glove              .680    .677        (−0.3)
lexvec             .677    .671        (−0.6)

Model (Reddy)      ρbase   ρiter=100   Difference (%)
w2v–cbow           .809    .766        (−4.3)
w2v–sg             .821    .777        (−4.4)
glove              .764    .746        (−1.8)
lexvec             .774    .757        (−1.7)

Model (PT-comp)    ρbase   ρiter=100   Difference (%)
w2v–cbow           .588    .558        (−3.0)
w2v–sg             .586    .551        (−3.6)
glove              .555    .464        (−9.1)
lexvec             .570    .561        (−0.9)

Table 12 presents a comparison between the baseline ρ for 15 iterations and the ρ obtained when 100 iterations are performed. For all DSMs, we see that the increase in the number of iterations does not improve the quality of the vectors, with the relatively small number of 15 iterations yielding better results. This may suggest that a small number of iterations can already sample enough distributional information, with further iterations accruing additional noise from low-frequency words. The extra iterations could also be responsible for overfitting of the DSM to particularities of the corpus, which would reduce the quality of the underlying vectors. Given the extra cost of running more iterations,46 we refrained from building further models with as many iterations in the rest of the article.

46 The running time grows linearly with the number of iterations.

B.2 Minimum Count Threshold

Minimum-count thresholds are often neglected in the literature, with a default configuration of 0, 1, or 5 presumably used by most authors. An exception to this trend is the threshold of 100 occurrences used by Levy, Goldberg, and Dagan (2015), whose toolkit we use in PPMI–SVD. No explicit justification has been found for this higher word-count threshold. A reasonable hypothesis would be that higher thresholds improve the quality of the data, as they filter rare words more aggressively.

Table 13
Results for a higher minimum threshold of word count.

Model (FR-comp)    ρbase   ρmincount=50   Difference (%)
w2v–cbow           .660    .610           (−5.0)
w2v–sg             .672    .613           (−5.9)
glove              .680    .673           (−0.7)
PPMI–SVD           .584    .258           (−32.6)
lexvec             .677    .653           (−2.4)

Model (Reddy)      ρbase   ρmincount=50   Difference (%)
w2v–cbow           .809    .778           (−3.1)
w2v–sg             .821    .776           (−4.5)
glove              .764    .672           (−9.2)
PPMI–SVD           .743    .515           (−22.8)
lexvec             .774    .738           (−3.6)

Model (PT-comp)    ρbase   ρmincount=50   Difference (%)
w2v–cbow           .588    .580           (−0.8)
w2v–sg             .586    .575           (−1.1)
glove              .555    .540           (−1.5)
PPMI–SVD           .530    .418           (−11.1)
lexvec             .570    .566           (−0.4)

Table 13 presents the results from the highest-Spearman configurations along with the results for an identical configuration with a higher occurrence threshold of 50.47 The results unanimously agree that a higher threshold does not contribute to the removal of any extra noise. In particular, for PPMI–SVD, it seems to discard enough useful information to considerably reduce the quality of the compositionality prediction measure. The results strongly contradict the default configuration used for PPMI–SVD, suggesting that a lower word-count threshold might yield better results for this task.

47 The threshold used for ρbase depends on the DSM, and is described in Section 5.2.

B.3 Windows of Size 2+2

For many models, the best window size found was either WINDOWSIZE = 1+1 or WINDOWSIZE = 4+4 (see Section 7.1). It is possible that a higher score could be obtained by a configuration in between. While a full exhaustive search would be the ideal solution, an initial approximation of the best 2+2 configuration can be obtained by running the experiments on the highest-Spearman configurations with the window size replaced by 2+2.

Table 14
Results using a window of size 2+2.

Model (FR-comp)    ρbase   ρwin=2+2   Difference (%)
PPMI–SVD           .584    .397       (−18.7)
PPMI–thresh        .702    .678       (−2.4)
glove              .680    .657       (−2.3)
lexvec             .677    .671       (−0.6)
w2v–cbow           .660    .644       (−1.6)
w2v–sg             .672    .639       (−3.3)

Model (Reddy)      ρbase   ρwin=2+2   Difference (%)
PPMI–SVD           .743    .583       (−16.0)
lexvec             .774    .757       (−1.7)
w2v–cbow           .809    .777       (−3.2)
w2v–sg             .821    .784       (−3.7)

Model (PT-comp)    ρbase   ρwin=2+2   Difference (%)
PPMI–SVD           .530    .446       (−8.4)
PPMI–thresh        .602    .561       (−4.1)
lexvec             .570    .564       (−0.6)

Results shown in Table 14 for a window size of 2+2 are consistently worse than those of the base model, indicating that the optimal configuration is likely the one obtained with a window size of 1+1 or 4+4. This is further confirmed by the fact that most DSMs had their best configuration with a window size of 1+1 or 8+8, with few cases of 4+4 as the best model, which suggests that the quality of most configurations in the space of models is either monotonically increasing or decreasing with regard to these window sizes, thus favoring the configurations with more extreme WINDOWSIZE parameters.

B.4 Higher Number of Dimensions

As seen in Section 7.2, some DSMs obtain better results when moving from 250 to 500 dimensions, and this trend continues when moving to 750 dimensions. This behavior is notably stronger for PPMI–thresh, which suggests that an even higher number of dimensions could have better predictive power. Table 15 presents the results of running PPMI–thresh for increasing values of the DIMENSION parameter. The baseline configuration (indicated as ⋆ in Table 15) was the highest-scoring configuration found in Section 7.2: lemmaPoS.W1.d750 for PT-comp and FR-comp, and surface.W8.d750 for Reddy.

Table 15
Results for higher numbers of dimensions (PPMI–thresh).

Model (FR-comp)    ρdim=X   Difference (%)
dim = 250          .671     (−3.1)
dim = 500          .695     (−0.7)
dim = 750          .702⋆    (0.0)
dim = 1,000        .694     (−0.8)
dim = 2,000        .645     (−5.8)
dim = 5,000        .636     (−6.7)
dim = 30,000       .552     (−15.1)
dim = 999,999      .539     (−16.3)

Model (Reddy)      ρdim=X   Difference (%)
dim = 250          .764     (−2.7)
dim = 500          .782     (−1.0)
dim = 750          .791⋆    (0.0)
dim = 1,000        .784     (−0.7)
dim = 2,000        .760     (−3.1)
dim = 5,000        .744     (−4.7)
dim = 30,000       .700     (−9.1)
dim = 999,999      .566     (−22.5)

Model (PT-comp)    ρdim=X   Difference (%)
dim = 250          .543     (−5.9)
dim = 500          .546     (−5.6)
dim = 750          .602⋆    (0.0)
dim = 1,000        .609     (+0.7)
dim = 2,000        .601     (−0.1)
dim = 5,000        .505     (−9.7)
dim = 30,000       .532     (−7.0)
dim = 999,999      .500     (−10.2)

As seen in Section 7.2, results for 250 and 500 dimensions have lower scores than the results for 750 dimensions. Results for 1,000 dimensions were mixed: they are slightly worse for FR-comp and EN-comp, and slightly better for PT-comp. Increasing the number of dimensions further generates models that are progressively worse. These results suggest that the maximum vector quality is achieved between 750 and 1,000 dimensions.

B.5 Random Initialization

The word vectors generated by the glove and w2v models have some level of non-determinism caused by random initialization and random sampling techniques. A reasonable concern would be whether the results presented for different parameter variations are close enough to the scores obtained by an average model. To assess the variability of these models, we evaluated three different runs of every DSM configuration (the original execution ρ1, used elsewhere in this article, along with two other executions ρ2 and ρ3) for glove, w2v–cbow, and w2v–sg. We then calculate the average ρavg of these three executions for every model.

Table 16 reports the highest-Spearman configurations of ρavg for the Reddy and EN-comp data sets. When comparing ρavg to the results of the original execution ρ1, we see that the variability across different executions of the same configuration is minimal. This is further confirmed by the low sample standard deviation48 obtained from the scores of the three executions. Given the high stability of these models, results in the rest of the article were calculated and reported as ρ1 for all data sets.

48 The low standard deviation is not a unique property of high-ranking configurations: the average of the deviations for all models was .004 for EN-comp and .006 for Reddy.

Table 16
Configurations with highest ρavg for nondeterministic models.

Data set   DSM        Configuration        ρ1     ρ2     ρ3     ρavg   stddev
Reddy      glove      lemmaPoS.W8.d250     .759   .760   .753   .757   .004
Reddy      w2v–cbow   surface.W1.d500      .796   .807   .799   .801   .006
Reddy      w2v–sg     surface.W1.d750      .812   .788   .812   .804   .014
EN-comp    glove      lemmaPoS.W8.d500     .651   .646   .650   .649   .003
EN-comp    w2v–cbow   surface+.W1.d750     .730   .732   .728   .730   .002
EN-comp    w2v–sg     surface+.W1.d750     .741   .732   .721   .731   .010
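The checks in B.1–B.5 all amount to retraining the same DSM while varying a single training option (number of iterations, minimum count threshold, window size, dimensionality, or random seed). Purely as an illustrative sketch of how such a local search could be scripted, assuming the gensim implementation of word2vec (recent versions) and a toy corpus in place of the billion-token corpora used in this article (this is not the toolkit or code used here):

from itertools import product
from gensim.models import Word2Vec

# Toy corpus standing in for the preprocessed corpora of the article.
corpus = [["red", "wine", "is", "a", "kind", "of", "wine"],
          ["a", "nut", "case", "is", "not", "a", "case", "for", "nuts"]] * 100

# Local search around one configuration: vary dimensionality (B.4),
# window size (B.3), number of iterations (B.1), and random seed (B.5),
# keeping the minimum count threshold (B.2) fixed.
for dims, window, epochs, seed in product([250, 750], [1, 2, 4], [15, 100], [1, 2, 3]):
    model = Word2Vec(corpus, vector_size=dims, window=window, min_count=1,
                     epochs=epochs, sg=1, seed=seed)
    # The resulting vectors (model.wv) would then feed the compositionality
    # prediction sketched in Appendix A, and each run would be scored by its
    # Spearman correlation with the human judgments.
    print(dims, window, epochs, seed, model.wv["wine"][:3])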

B.6 Data Filtering

Along with the verification of parameters, we also evaluate whether data set variations could yield better results. In particular, we consider the use of filtering techniques, which are used in the literature as a method of guaranteeing data set quality. As per Roller, Schulte im Walde, and Scheible (2013), we consider two strategies of data removal: (1) removing individual outlier compositionality judgments through z-score filtering; and (2) removing all annotations from outlier human judges. A compositionality judgment is considered an outlier if it stands more than z standard deviations away from the mean; a human judge is deemed an outlier if their Spearman correlation with the average of the other judges, ρoth, is lower than a given threshold R.49 These methods allow us to remove accidentally erroneous annotations, as well as annotators whose responses deviated too much from the mean (in particular, spammers and non-native speakers).
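For concreteness, the two removal strategies can be sketched as follows. This is an illustrative sketch only, not the filtering code used to build the data sets; the judgment matrix is invented, and the default thresholds simply echo footnote 49.

import numpy as np
from scipy.stats import spearmanr

# Rows = compounds, columns = judges; NaN marks a removed judgment.
# Toy judgments on the 0-5 scale of the questionnaire (Appendix C).
scores = np.array([[4.0, 5.0, 4.0, 1.0],
                   [1.0, 0.0, 1.0, 5.0],
                   [3.0, 3.0, 2.0, 0.0]], dtype=float)

def zscore_filter(scores, z=2.2):
    # Strategy (1): drop judgments more than z std devs from the compound mean.
    mean = np.nanmean(scores, axis=1, keepdims=True)
    std = np.nanstd(scores, axis=1, keepdims=True)
    out = scores.copy()
    out[np.abs(scores - mean) > z * std] = np.nan
    return out

def judge_filter(scores, R=0.5):
    # Strategy (2): drop judges whose Spearman correlation with the
    # average of the other judges falls below R.
    out = scores.copy()
    for j in range(scores.shape[1]):
        others = np.nanmean(np.delete(scores, j, axis=1), axis=1)
        rho, _ = spearmanr(scores[:, j], others, nan_policy="omit")
        if rho < R:
            out[:, j] = np.nan
    return out

print(zscore_filter(scores))
print(judge_filter(scores))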

Table 17 presents the evaluation of the raw and filtered data sets with regard to two quality measures: the average of the standard deviations for all NCs (σ), and the proportion of NCs in the data set whose standard deviation is higher than 1.5 (Pσ>1.5), as per Reddy, McCarthy, and Manandhar (2011). The results suggest that filtering techniques can improve the overall quality of the data sets, as seen in the reduction of the proportion of NCs with high standard deviation, as well as in the reduction of the average standard deviation itself. We additionally present the data retention rate (DRR), which is the proportion of NCs that remained in the data set after filtering. While the DRR does indicate a reduction in the amount of data, this reduction may be considered acceptable in light of the improvement suggested by the quality measures.

Table 17
Intrinsic quality measures for the raw and filtered data sets.

Data set     σ raw   σ filtered   Pσ>1.5 raw   Pσ>1.5 filtered   DRR
FR-comp      1.15    0.94         22.78%       13.89%            87.34%
PT-comp      1.22    1.00         14.44%       6.11%             87.81%
EN-comp90    1.17    0.87         18.89%       3.33%             83.61%
Reddy        0.99    –            5.56%        –                 –
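The intrinsic measures in Table 17 can likewise be computed directly from a judgment matrix. The sketch below makes the same assumptions as the previous one (invented data; not the article's evaluation code), and it approximates the DRR as the share of compounds that keep at least one judgment after filtering, which may differ from the exact retention criterion used for the data sets.

import numpy as np

def intrinsic_measures(raw, filtered):
    # sigma: mean per-compound standard deviation of judgments;
    # P>1.5: share of compounds whose standard deviation exceeds 1.5;
    # DRR (approximation): share of compounds retained after filtering.
    def sigma(scores):
        return float(np.nanmean(np.nanstd(scores, axis=1)))
    def p_sigma(scores, threshold=1.5):
        return float(np.mean(np.nanstd(scores, axis=1) > threshold))
    kept = ~np.all(np.isnan(filtered), axis=1)
    return {"sigma_raw": sigma(raw),
            "sigma_filtered": sigma(filtered[kept]),
            "P>1.5_raw": p_sigma(raw),
            "P>1.5_filtered": p_sigma(filtered[kept]),
            "DRR": float(np.mean(kept))}

raw = np.array([[4.0, 5.0, 4.0, 1.0], [1.0, 0.0, 1.0, 5.0], [3.0, 3.0, 2.0, 0.0]])
filtered = raw.copy()
filtered[:, 3] = np.nan   # pretend the fourth judge was removed
print(intrinsic_measures(raw, filtered))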

On a more detailed analysis, we have verified that the improvement in these quality measures is heavily tied to the use of z-score filtering, with similar results obtained when it is considered alone. The application of R-filtering by itself, on the other hand, did not show any noticeable improvement in the quality measures for reasonable amounts of DRR. This is the opposite of what was found by Roller, Schulte im Walde, and Scheible (2013) on their German data set, where only R-filtering was found to improve results under these quality measures. We present our findings in more detail in Ramisch, Cordeiro, and Villavicencio (2016).

49 The judgment threshold we adopted was z = 2.2 for EN-comp90, z = 2.2 for PT-comp, and z = 2.5 for FR-comp. The human judge threshold was R = 0.5.

Table 18
Extrinsic quality measures for the raw and filtered data sets.

                  EN-comp90          FR-comp            PT-comp
Model             raw      filtered  raw      filtered  raw      filtered
PPMI–SVD          .604     .601      .584     .579      .530     .526
PPMI–TopK         .564     .571      .550     .545      .519     .516
PPMI–thresh       .602     .607      .702     .700      .602     .601
glove             .538     .544      .680     .676      .555     .552
lexvec            .567     .572      .677     .676      .570     .568
w2v–cbow          .669     .665      .651     .651      .588     .587
w2v–sg            .665     .661      .653     .654      .586     .584

We then consider whether filtering can have an impact on the performance of predicted compositionality scores. For each of the 228 model configurations that were constructed for each language, we launched an evaluation on the filtered EN-comp90, FR-comp, and PT-comp data sets (using z-score filtering only, as it was responsible for most of the improvement in the quality measures). Overall, no improvement was observed in the results of the prediction (values of Spearman ρ) when we compare raw and filtered data sets. Looking more specifically at the best configurations for each DSM (see Table 18), we can see that most results do not change significantly when the evaluation is performed on the raw or the filtered data sets. This suggests that the number of judgments collected for each compound greatly offsets any irregularity caused by outliers, making the use of filtering techniques superfluous.
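The extrinsic comparison in Table 18 boils down to correlating the same predicted scores with human means computed before and after filtering. A minimal sketch of that comparison, under the same assumptions as the previous sketches (invented values, not the article's code):

import numpy as np
from scipy.stats import spearmanr

# Per-compound predicted scores and judgment matrices (invented values).
predicted = np.array([0.81, 0.12, 0.70, 0.25])
raw = np.array([[4, 5, 4, 1], [1, 0, 1, 5], [3, 4, 4, 0], [0, 1, 1, 4]], dtype=float)
filtered = raw.copy()
filtered[:, 3] = np.nan          # pretend the fourth judge was removed

for name, judgments in [("raw", raw), ("filtered", filtered)]:
    hc = np.nanmean(judgments, axis=1)   # human compositionality per compound
    rho, _ = spearmanr(predicted, hc)
    print(name, round(rho, 3))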

Appendix C. Questionnaire

The questionnaire was structured in five subtasks, presented to the annotators through these instructions:

1. Read the compound itself.

2. Read 3 sentences containing the compound.

3. Provide 2 to 3 synonym expressions for the target compound seen in the sentences, preferably involving one of the words in the compound. We ask annotators to prioritize short expressions, with 1 to 3 words each, and to try to include the MWE components in their reply (eliciting a paraphrase).

4. Using a Likert scale from 0 to 5, judge how much of the meaning of the compound comes from the modifier and the head separately. Figure 11 shows an example for the judgment of the head.

5. Using a Likert scale from 0 to 5, judge how much of the meaning of the compound comes from its components.


Figure 11
Evaluating compositionality of a compound regarding its head.

We require answers on an even-numbered scale (there are 6 possibilities between 0 and 5), as otherwise participants could be biased toward the middle score. In order to help participants visualize the meaning of their reply, whenever their mouse hovers over a particular score, we present a guiding tooltip, as can be seen in Figure 11.

The order of the subtasks has also been taken into account. During a pilot test, we found that presenting the multiple-choice questions (subtasks 4–5) before asking for synonyms (subtask 3) yielded lower agreement, as users were often less self-consistent in the multiple-choice questions (e.g., replying "non-compositional" for subtask 4 but "compositional" for subtask 5), even if they had carefully selected their synonyms in response to subtask 3.

The request for synonyms before the multiple-choice questions prompts the participants to focus on the meaning of the compound. These synonyms can then also be taken into account when considering the semantic contribution of each element of the compound; we leave this for future work.

Appendix D. List of English Compounds

We present below the 90 nominal compounds in EN-comp90 and the 100 nominal compounds in EN-compExt, along with their human-rated compositionality scores. We refer to Reddy, McCarthy, and Manandhar (2011) for the other 90 compounds belonging to Reddy, which, together with the former two sets, represent 280 nominal compounds in total.

D.1 Compounds in EN-comp90

Compounds

hcHM

Compounds

hcHM

1.95
1.33
3.94
0.62
4.69
0.85
4.60
3.11
4.25
2.65
0.88
1.24
3.78
1.59

ancient history
armchair critic
baby buggy
bad hat
benign tumour
big fish
birth rate
black cherry
bow tie
brain teaser
busy bee
carpet bombing
cellular phone
close call


closed book
computer program
con artist
cooking stove
cotton candy
critical review
dead end
dirty money
dirty word
disc jockey
divine service
dry land
dry wall
dust storm

0.68
4.50
2.10
4.68
1.79
4.06
1.32
2.21
2.48
1.25
3.11
3.95
3.33
3.85


Compounds

hcHM

Compounds

hcHM

eager beaver
economic aid
elbow grease
elbow room
entrance hall
eternal rest
fish story
flower child
food market
foot soldier
front man
goose egg
grey matter
guinea pig
half sister
half wit
health check
high life
inner circle
inner product
insane asylum
insurance company
insurance policy
iron collar
labour union
life belt
life vest
lime tree
loan shark
loose woman
mail service

0.36
4.33
0.56
0.61
4.17
3.25
1.68
0.50
3.82
1.95
1.64
0.48
2.39
0.45
2.84
1.16
4.17
1.67
1.56
3.00
3.95
5.00
4.15
3.88
4.76
2.84
3.44
4.61
1.00
2.53
4.69

market place
mental disorder
middle school
milk tooth
mother tongue
narrow escape
net income
news agency
noble gas
nut case
old flame
old hat
old timer
phone book
pillow slip
pocket book
prison guard
prison term
private eye
record book
research lab
sex bomb
silver lining
sound judgement
sparkling water
street girl
subway system
tennis elbow
top dog
wet blanket
word painting

3.00
4.89
3.84
1.43
0.59
1.75
2.94
4.39
1.18
0.44
0.58
0.35
0.89
4.25
3.70
1.42
4.89
4.79
0.82
3.70
4.75
0.53
0.35
3.39
3.14
3.16
4.63
2.50
1.05
0.21
1.62

D.2 Compounds in EN-compExt

Compounds

hcHM

Compounds

hcHM

academy award
arcade game
baby blues
backroom boy
bad apple
banana republic
bankruptcy proceeding
basket case
beauty sleep
best man
big cheese
big picture
big wig
biological clock
black box
black operation
blind alley
blood bath

3.52
3.80
2.88
1.48
1.13
0.86
4.78
0.42
2.96
3.12
0.36
1.48
0.60
2.42
1.29
1.39
1.14
1.38

blue blood
blue print
box office
brain drain
bull market
cable car
calendar month
civil marriage
cocoa butter
computer expert
contact lenses
copy cat
crime rate
damp squib
dark horse
day shift
disability insurance
double cross

0.58
1.04
0.88
2.08
1.23
2.68
4.23
3.13
3.23
4.46
3.64
0.74
4.39
0.95
0.65
4.54
4.45
1.14


Compounds

hcHM

Compounds

hcHM

double dutch
double whammy
dream ticket
dutch courage
fair play
fairy tale
fall guy
field work
football season
fresh water
freudian slip
ghost town
glass ceiling
grass root
hard drive
hard shoulder
head hunter
health care
heavy cross
hen party
home run
honey trap
hot potato
incubation period
information age
injury time
insider trading
jet lag
job fair
leap year
love song
low profile

0.29
2.48
1.32
1.00
2.59
1.68
1.36
2.10
4.04
4.20
2.35
1.50
0.81
0.86
2.17
1.52
1.50
4.47
1.17
1.05
2.86
1.22
0.56
3.92
3.40
3.20
3.88
2.67
3.50
2.38
4.58
2.10

marketing consultant
medical procedure
music festival
music journalist
noise complaint
pain killer
peace conference
peace talk
pipe dream
poison pill
radioactive material
radioactive waste
rainy season
rice paper
shelf life
skin tone
smoke screen
social insurance
speed trap
stag night
sugar daddy
tear gas
time difference
traffic control
traffic jam
travel guide
wedding anniversary
wedding day
white noise
white spirit
winter solstice
world conference

4.00
4.83
4.58
4.54
4.52
2.17
4.46
4.13
0.91
0.96
4.61
4.58
4.23
4.00
1.30
3.88
1.11
2.83
3.71
1.44
0.44
3.27
4.41
3.69
3.62
4.38
4.86
4.94
1.17
1.31
4.55
3.96

Appendix E. List of French Compounds

We present below the 180 nominal compounds in FR-comp, along with their human-rated compositionality scores.

Compounds

hcHM

Translation (Gloss)

activit´e physique
ann´ee scolaire
art contemporain
baie vitr´ee
bas c ˆot´e
beau fr`ere
beau p`ere
belle m`ere
berger allemand
bon sens
bon vent
bon vivant
bonne humeur

4.93
3.60
4.60
3.64
1.31
0.67
1.18
0.80
1.29
3.57
0.87
2.57
4.53

‘physical activity’ (lit. activity physical)
‘school year’ (lit. year scholar)
‘contemporary art’ (lit. art contemporary)
‘open glass window’ (lit. opening glassy)
‘aisle/roadside’ (lit. low side)
‘brother-in-law’ (lit. beautiful brother)
‘father-in-law’ (lit. beautiful father)
‘mother-in-law’ (lit. beautiful mother)
‘German shepherd’ (lit. shepherd German)
‘common sense’ (lit. good sense)
‘good luck’ (lit. good/fair wind)
‘bon vivant’ (lit. good living)
‘good mood’ (lit. good mood)


Compounds

hcHM

Translation (Gloss)

bonne poire
bonne pratique
bouc ´emissaire
bras cass´e
bras droit
brebis galeuse
carte blanche
carte bleue
carte grise
carte vitale
carton plein
casque bleu
centre commercial
cercle vicieux
cerf volant
chambre froide
changement climatique
chapeau bas
charge sociale
chauve souris
chute libre
club priv´e
coffre fort
communaut´e urbaine
conseil municipal
coup dur
coup franc
courrier ´electronique
court circuit
court m´etrage
cr`eme fraˆıche
cr`eme glac´ee
dernier cri
dernier mot
directeur g´en´eral
disque dur
douche froide
droit fondamental
d´eveloppement ´economique
eau chaude
eau douce
eau min´erale
eau potable
eau vive
eau forte
eaux us´ees
effet sp´ecial
exp´erience professionnelle
fait divers
famille nombreuse
faux ami
faux cul
faux pas
faux semblant
feu rouge
feu vert
fil conducteur

0.42
4.47
0.23
0.57
0.40
0.55
0.20
1.94
3.08
1.70
0.78
1.85
3.93
2.15
0.64
4.27
4.79
0.64
3.00
0.33
3.64
4.58
3.67
4.57
4.00
2.40
1.71
4.57
1.69
2.36
3.73
4.75
0.67
3.09
3.87
2.83
1.18
4.27
4.46
5.00
2.33
4.00
5.00
3.44
0.90
4.54
3.67
4.86
3.69
4.90
1.25
0.31
1.82
3.57
2.60
0.71
1.25

‘sucker, soft touch’ (lit. good pear)
‘good practice’ (lit. good practice)
‘scapegoat’ (lit. goat emissary)
‘lame duck’ (lit. arm broken)
‘most important helper/assistant’ (lit. arm right)
‘black sheep’ (lit. sheep scabby)
‘carte blanche’ (lit. card white)
‘bank card’ (lit. card blue)
‘vehicle registration’ (lit. card grey)
‘healthcare card’ (lit. card vital)
‘clean sweep’ (lit. cardboard full)
‘UN peacekeeper’ (lit. helmet blue)
‘shopping center’ (lit. center commercial)
‘vicious circle’ (lit. circle vicious)
‘kite’ (lit. deer flying)
‘cold chamber’ (lit. chamber cold)
‘climate change’ (lit. change climatic)
‘bravo’ (lit. hat low)
‘social security contribution’ (lit. charge social)
‘bat’ (lit. bald mouse)
‘free fall’ (lit. fall free)
‘private club (sexual connotation)’ (lit. club private)
‘safe, vault’ (lit. chest/box strong)
‘urban community’ (lit. community urban)
‘city council’ (lit. council municipal)
‘problem, difficulty’ (lit. blow hard)
‘free kick (soccer)’ (lit. blow free/frank)
‘e-mail’ (lit. mail electronic)
‘short circuit’ (lit. short circuit)
‘short film’ (lit. short length)
‘French sour cream’ (lit. cream fresh)
‘ice cream’ (lit. cream icy)
‘something trendy’ (lit. last scream)
‘final say’ (lit. last word)
‘chief executive officer’ (lit. director general)
‘hard drive’ (lit. disk hard)
‘damper/frustration’ (lit. shower cold)
‘fundamental right’ (lit. right fundamental)
‘economic development’ (lit. development economic)
‘hot water’ (lit. water hot)
‘fresh water’ (lit. water soft/sweet)
‘mineral water’ (lit. water mineral)
‘drinking water’ (lit. water potable)
‘jellyfish’ (lit. water lively)
‘etching’ (lit. water strong)
‘sewage’ (lit. waters used)
‘special effect’ (lit. effect special)
‘professional experience’ (lit. experience professional)
‘news story’ (lit. fact diverse)
‘large family’ (lit. family numerous)
‘false friend’ (lit. false friend)
‘hypocrite’ (lit. false arse)
‘blunder’ (lit. false step)
‘false pretence’ (lit. false appearance)
‘red traffic light’ (lit. fire red)
‘green light, permission’ (lit. fire green)
‘underlying theme’ (lit. thread conductor)


Compounds

hcHM

Translation (Gloss)

0.45
4.54
2.33
1.33
1.07
2.17
3.14
4.54
3.14
3.58
1.40
1.87
3.43
1.83
2.54
4.13
4.00
2.25
2.90
4.27

4.36
4.64
4.50
4.85
3.00
2.46
5.00
2.15
2.90
2.38
2.21
1.08
4.79
3.23
2.73
1.07
1.50
4.20
4.90
3.00
0.50
4.33
4.88
2.69
0.80
0.86
1.64
2.27
1.00
4.14
1.15
2.50
2.79
0.92
0.50
2.69

‘sentimental’ (lit. flower blue)
‘foie gras’ (lit. liver fatty)
‘giggle’ (lit. crazy laughter)
‘outdoors’ (lit. big air)
‘broad daylight’ (lit. big day)
‘move forward’ (lit. big leap)
‘silver screen’ (lit. big screen)
‘big company’ (lit. big company)
‘department store’ (lit. big surface)
‘avian flu’ (lit. flu avian)
‘swearword’ (lit. large/fat word)
‘close-up’ (lit. large/fat plan)
‘civil war’ (lit. war civil)
‘loudspeaker’ (lit. loud/high speaker)
‘high seas’ (lit. high sea)
‘high mountains’ (lit. high mountain)
‘overtime hour’ (lit. hour extra)
‘essential oil’ (lit. oil essential)
‘popular belief’ (lit. idea received)
‘professional integration, employability’
(lit. insertion professional)
‘general interest’ (lit. interest general)
‘young girl, maiden’ (lit. young girl)
‘official gazette’ (lit. newspaper official)
‘French language’ (lit. language French)
‘oil spill’ (lit. tide black)
‘draw, stalemate’ (lit. match null)
‘fat’ (lit. matter greasy)
‘grey matter’ (lit. matter grey)
‘raw material’ (lit. matter primary)
‘bad faith’ (lit. bad faith)
‘gossiper’ (lit. bad tongue)
‘roller coaster’ (lit. mountains Russian)
‘historical monument’ (lit. monument historical)
‘stillborn’ (lit. dead born)
‘New World, Americas’ (lit. new world)
‘sleepless night’ (lit. night white)
‘toll-free number’ (lit. number green)
‘household waste’ (lit. garbage domestic)
‘trade union’ (lit. organisation of-trade-union)
‘yellow pages’ (lit. pages yellow)
‘golden parachute’ (lit. parachute golden)
‘nature park’ (lit. park natural)
‘political party’ (lit. party political)
‘bias’ (lit. party taken)
‘orgy’ (lit. party fine/delicate)
‘boyfriend’ (lit. small friend)
‘butter biscuit’ (lit. small butter)
‘breakfast’ (lit. small lunch)
‘amateur’ (lit. small player)
‘pea’ (lit. small pea)
‘salted pork’ (lit. small salty)
‘television’ (lit. small screen)
‘grandchild’ (lit. small child)
‘type of pastry’ (lit. small oven)
‘pidgin or ‘badly spoken’ French’ (lit. little black-person)
‘classified ad’ (lit. small announcement)

fleur bleue
foie gras
fou rire
grand air
grand jour
grand saut
grand ´ecran
grande entreprise
grande surface
grippe aviaire
gros mot
gros plan
guerre civile
haut parleur
haute mer
haute montagne
heure suppl´ementaire
huile essentielle
id´ee rec¸ue
insertion professionnelle

int´erˆet g´en´eral
jeune fille
journal officiel
langue franc¸aise
mar´ee noire
match nul
mati`ere grasse
mati`ere grise
mati`ere premi`ere
mauvaise foi
mauvaise langue
montagnes russes
monument historique
mort n´e
nouveau monde
nuit blanche
num´ero vert
ordure m´enag`ere
organisation syndicale
pages jaunes
parachute dor´e
parc naturel
parti politique
parti pris
partie fine
petit ami
petit beurre
petit d´ejeuner
petit joueur
petit pois
petit sal´e
petit ´ecran
petit enfant
petit four
petit n`egre
petite annonce


Compounds

hcHM

Translation (Gloss)

petite nature
pied noir
pi`ece mont´ee
pleine lune
poids lourd
point faible
point mort
pot pourri
poule mouill´ee
poup´ee russe
premier ministre
premier plan
premi`ere dame
prince charmant
pr´evision m´et´eorologique
recherche scientifique
ressources humaines
rond point
roulette russe
r´echauffement climatique
r´egion parisienne
r´eseau social
sang froid
second degr´e
second r ˆole
septi`eme ciel
service public
site officiel
soir´ee priv´ee
sucre roux
s´ecurit´e routi`ere
s´ecurit´e sociale
table basse
table ronde
tapis rouge
temps fort
temps mort
temps partiel
temps plein
temps r´eel
travaux publics
trou noir
trou normand
t´el´ephone arabe
t´el´ephone portable
valeur s ˆure
vie associative
vie quotidienne
vieille fille
vin blanc
vin rouge
yeux rouges
´ecole primaire
´etoile filante

0.47
0.13
2.47
3.54
2.08
2.46
1.00
0.40
0.00
3.75
3.67
2.82
1.92
2.00
4.70
4.92
3.91
3.18
0.87
4.40
4.43
4.09
0.47
1.40
3.64
0.21
4.71
4.85
4.53
4.31
4.55
3.67
4.79
1.46
3.31
1.87
2.07
3.62
3.08
3.00
4.09
2.58
0.78
0.23
5.00
3.64
4.00
4.31
2.42
3.80
4.69
4.36
3.92
3.20

‘sensitive/fragile person’ (lit. small nature)
‘French expats from Algeria’ (lit. foot black)
‘tiered cake’ (lit. piece assembled)
‘full moon’ (lit.full moon)
‘truck’ (lit. weight heavy)
‘weak point’ (lit. point weak)
‘standstill’ (lit. point dead)
‘medley’ (lit. pot/jar rotten)
‘coward’ (lit. chicken wet)
‘Russian nesting doll’ (lit. doll Russian)
‘prime minister’ (lit. first minister)
‘foreground’ (lit. first plan)
‘first lady’ (lit. first lady)
‘prince charming’ (lit. prince charming)
‘weather forecast’ (lit. forecast meteorological)
‘scientific research’ (lit. research scientific)
‘human resources’ (lit. resources human)
‘roundabout’ (lit. round point)
‘Russian roulette’ (lit. roulette Russian)
‘global warming’ (lit. warming climatic)
‘Paris region’ (lit. region Parisian)
‘social network’ (lit. network social)
‘cold blood, self-control’ (lit. blood cold)
‘irony, tongue-in-cheek’ (lit. second degree)
‘supporting role’ (lit. second role)
‘cloud nine’ (lit. seventh heaven)
‘public service’ (lit. service public)
‘official website’ (lit. website official)
‘private party’ (lit. party private)
‘brown sugar’ (lit. sugar ginger-colored)
‘road safety’ (lit. safety of-road)
‘social security’ (lit. security social)
‘coffee table’ (lit. table low)
‘round table, discussion’ (lit. table round)
‘red carpet, luxurious welcoming’ (lit. carpet red)
‘key moment, highlight’ (lit. time strong)
‘wasted time, idleness’ (lit. time dead)
‘part-time (work)’ (lit. time partial)
‘full-time (work)’ (lit. time full)
‘real time’ (lit. time real)
‘public works’ (lit. works public)
‘black hole’ (lit. hole black)
‘palate cleanser’ (lit. hole Norman)
‘Chinese whispers’ (lit. telephone Arabic)
‘cellphone’ (lit. telephone portable)
‘safe bet’ (lit. value safe/sure)
‘community life’ (lit. life associative)
‘everyday life’ (lit. life daily)
‘spinster’ (lit. old girl/maid)
‘white wine’ (lit. wine white)
‘red wine’ (lit. wine red)
‘red eyes’ (lit. eyes red)
‘primary school’ (lit. school primary)
‘shooting star’ (lit. star slipping)


Appendix F. List of Portuguese Compounds

We present below the 180 nominal compounds in PT-comp, along with their human-rated compositionality scores.

Compounds

hcHM

Translation (Gloss)

4.42
4.82
4.58
3.24
1.28
2.04
1.52
1.35
0.88
2.89
3.11
3.91
4.29
2.44
1.95
0.65
3.50
2.19
4.24
5.00
0.47
0.57
2.88
2.70
3.19
0.94
3.43
2.85
3.66
2.62
3.64
3.68
3.43
3.58
0.67
4.52
2.67
2.45
4.88
4.11
3.11
2.71
1.06
1.31
2.32
1.96
4.65
1.68
2.17
2.39

‘earthquake’ (lit. shock seismic)
‘military camp’ (lit. camp military)
‘secret agent’ (lit. agent secret)
‘false alarm’ (lit. alarm false)
‘cotton candy’ (lit. cotton sweet)
‘high season’ (lit. high season)
‘haute couture’ (lit. high sewing)
‘high seas’ (lit. high sea)
‘loudspeaker’ (lit. loud/high speaker)
‘secret Santa’ (lit. friend hidden)
‘secret Santa’ (lit. friend secret)
‘self-esteem’ (lit. love own)
‘new year’ (lit. year new)
‘air conditioning’ (lit. air conditioned)
‘open air’ (lit. air free)
‘cold weapon’ (lit. weapon white)
‘Freudian slip’ (lit. act faulty)
‘Turkish bath’ (lit. bath Turkish)
‘sweet potato’ (lit. potato sweet)
‘alcoholic drink’ (lit. drink alcoholic)
‘scapegoat’ (lit. goat expiatory)
‘right arm’ (lit. arm right)
‘black hole’ (lit. hole black/dark)
‘afternoon tea’ (lit. breakfast colonial)
‘safe, vault’ (lit. box strong)
‘black box’ (lit. box black)
‘traveling salesman’ (lit. clerk traveling)
‘white meat’ (lit. meat white)
‘red meat’ (lit. meat red)
‘armored car’ (lit. car strong)
‘open letter’ (lit. letter open)
‘shopping mall’ (lit. center commercial)
‘Spiritualist center’ (lit. center spiritualist)
‘hedge’ (lit. fence living)
‘parsley’ (lit. smell green)
‘integrated circuit’ (lit. circuit integrated)
‘business class’ (lit. class executive)
‘gossip column’ (lit. column social)
‘military high-school’ (lit. high-school military)
‘homemade food’ (lit. food homemade)
‘airline’ (lit. company aerial)
‘checking account’ (lit. account current)
‘broken heart’ (lit. heart broken)
‘tightrope, bad situation’ (lit. rope wobbly)
‘vocal chords’ (lit. chords vocal)
‘short circuit’ (lit. short circuit)
‘cold chamber’ (lit. chamber cold)
‘outdoors, open air’ (lit. sky open)
‘vicious circle’ (lit. circle vicious)
‘virtuous circle’ (lit. circle virtuous)

abalo s´ısmico
acampamento militar
agente secreto
alarme falso
algod˜ao doce
alta temporada
alta costura
alto mar
alto falante
amigo oculto
amigo secreto
amor pr ´oprio
ano novo
ar condicionado
ar livre
arma branca
ato falho
banho turco
batata doce
bebida alco ´olica
bode expiat ´orio
brac¸o direito
buraco negro
caf´e colonial
caixa forte
caixa preta
caixeiro viajante
carne branca
carne vermelha
carro forte
carta aberta
centro comercial
centro esp´ırita
cerca viva
cheiro verde
circuito integrado
classe executiva
coluna social
col´egio militar
comida caseira
companhia a´erea
conta corrente
corac¸ ˜ao partido
corda bamba
cordas vocais
curto circuito
cˆamara fria
c´eu aberto
c´ırculo vicioso
c´ırculo virtuoso


Compounds

hcHM

Translation (Gloss)

deputado federal
desfile militar
direitos humanos
disco r´ıgido
disco voador
efeitos especiais
elefante branco
escada rolante
estrela cadente
exame cl´ınico
exames laboratoriais
farinha integral
febre amarela
ficha limpa
fila indiana
fio condutor
forc¸a bruta
gatos pingados
gelo seco
golpe baixo
governo federal
gripe avi´aria
gripe su´ına
guarda florestal
jogo duro
ju´ızo final
leite integral
lista negra
livre-docente
livro aberto
longa data
longa-metragem
lua cheia
lua nova
lugar comum
magia negra
mar aberto
mar´e alta
mar´e baixa
massa cinzenta
mau contato
mau humor
mau olhado
mercado negro
mesa redonda
montanha russa
m´a f´e
m´aquina virtual
m˜ao fechada
navio negreiro
novo mundo
novo rico
n ´o cego
n ´ucleo at ˆomico
olho gordo
olho m´agico
olho nu

4.92
4.93
3.86
2.76
2.94
3.37
0.16
3.85
2.52
4.75
4.90
4.72
1.43
2.97
1.17
1.58
3.33
0.00
2.33
2.03
4.97
3.11
2.48
4.16
1.13
3.60
4.67
1.60
2.63
0.79
1.63
0.96
3.52
1.40
1.52
1.72
2.87
4.03
4.18
1.69
2.84
4.29
1.97
1.06
1.10
0.31
1.62
3.76
1.06
3.52
2.29
3.62
0.74
4.93
0.28
0.27
2.15

‘federal deputy’ (lit. deputy federal)
‘military parade’ (lit. parade military)
‘human rights’ (lit. rights human)
‘hard drive’ (lit. disk rigid)
‘flying saucer’ (lit. disk flying)
‘special effects’ (lit. effects special)
‘white elephant’ (lit. elephant white)
‘escalator’ (lit. stair rolling)
‘shooting star’ (lit. star falling)
‘clinical examination’ (lit. examination clinical)
‘laboratory tests’ (lit. examinations laboratory)
‘wholemeal flour’ (lit. flour integral)
‘yellow fever’ (lit. fever yellow)
‘clean criminal records’ (lit. file clean)
‘single file’ (lit. queue Indian)
‘underlying theme’ (lit. thread conductor)
‘brute force’ (lit. force brute)
‘a few people’ (lit. cats dropped)
‘dry ice’ (lit. ice dry)
‘low blow’ (lit. punch low)
‘federal government’ (lit. government federal)
‘avian flu’ (lit. flu avian)
‘swine flu’ (lit. flu swine)
‘forest ranger’ (lit. guard forest)
‘rough play’ (lit. game hard)
‘doomsday’ (lit. judgement final)
‘whole milk’ (lit. milk integral)
‘black list’ (lit. list black)
‘professor’ (lit. free lecturer)
‘open book’ (lit. book open)
‘longtime’ (lit. date long)
‘feature film’ (lit. long length/footage)
‘full moon’ (lit. moon full)
‘new moon’ (lit. moon new)
‘clich´e’ (lit. place common)
‘black magic’ (lit. magic black)
‘open sea’ (lit. sea open)
‘high tide’ (lit. tide high)
‘low tide’ (lit. tide low)
‘grey matter’ (lit. mass grey)
‘faulty contact’ (lit. bad contact)
‘bad mood’ (lit. bad humour)
‘evil eye’ (lit. bad glance)
‘black market’ (lit. black market)
‘round table’ (lit. table round)
‘roller coaster’ (lit. mountain Russian)
‘bad faith’ (lit. bad faith)
‘virtual machine’ (lit. machine virtual)
‘stingy’ (lit. hand closed)
‘slave ship’ (lit. ship black-slave)
‘new world’ (lit. new world)
‘new rich, new money’ (lit. new rich)
‘difficult situation’ (lit. knot blind)
‘atomic nucleus’ (lit. nucleus atomic)
‘evil eye’ (lit. eye fat)
‘peephole’ (lit. eye magic)
‘naked eye’ (lit. eye naked)


Compounds

hcHM

Translation (Gloss)

0.45
4.27
1.47
0.90
0.30
0.80
0.53
0.90
0.74
1.92
1.51
2.27
3.29
3.14
3.70
0.71
3.97
1.52
2.87
2.00
4.78
2.76
1.72
1.55
0.12
0.09
0.10
0.23
2.87
2.94
3.48
1.00
3.27
4.00
4.92
2.12
1.12
4.20
0.29
0.37
4.47
4.52
0.15
0.52
0.87
2.52
2.11
1.55
4.67
1.40
1.39
4.36
2.19
3.76
5.00
4.96
2.81

‘black sheep’ (lit. sheep black)
‘toilet paper’ (lit. paper hygienic)
‘tax haven’ (lit. paradise fiscal)
‘German shepherd’ (lit. shepherd German)
‘subservient, stooge’ (lit. stick ordered)
‘short-tempered’ (lit. fuse short)
‘careful research’ (lit. comb thin)
‘dead weight’ (lit. weight dead)
‘floor plan’ (lit. plant short)
‘blind spot’ (lit. point blind)
‘strong point’ (lit. point strong)
‘weak point’ (lit. point weak)
‘magic potion’ (lit. potion magic)
‘blue-plate special’ (lit. plate ready-made)
‘early childhood’ (lit. first infancy)
‘first hand’ (lit. first hand)
‘first necessity’ (lit. first necessity)
‘first lady’ (lit. first dame)
‘first minister’ (lit. first minister)
‘forefront’ (lit. first plan)
‘selection process’ (lit. process selective)
‘first-aid posts’ (lit. ready aid)
‘prince charming’ (lit. prince enchanted)
‘pure blood’ (lit. pure blood)
‘stingy’ (lit. bread hard)
‘lucky’ (lit. foot hot)
‘ceiling height’ (lit. foot right)
‘unlucky’ (lit. foot cold)
‘water polo’ (lit. aquatic pole/polo)
‘blackboard’ (lit. board black)
‘free fall’ (lit. fall free)
‘second-rate’ (lit. fifth category)
‘social network’ (lit. network social)
‘political system’ (lit. regime political)
‘analog clock’ (lit. clock analog)
‘biological clock’ (lit. clock biological)
‘final stretch’ (lit. straight line final)
‘Ferris wheel’ (lit. wheel giant)
‘Russian roulette’ (lit. roulette Russian)
‘tight spot’ (lit. skirt tight)
‘operating room’ (lit. room surgical)
‘parish hall’ (lit. hall parish)
‘blue-blooded’ (lit. blood blue)
‘cold-blooded’ (lit. blood cold)
‘hot-blooded’ (lit. blood hot)
‘answering machine’ (lit. secretary electronic)
‘ulterior motives’ (lit. second intentions)
‘aside, in the background’ (lit. second plan)
‘court ruling’ (lit. sentence judicial)
‘sixth sense’ (lit. sixth sense)
‘green lights’ (lit. signal green)
‘political system’ (lit. system political)
‘seventh art’ (lit. seventh art)
‘red carpet’ (lit. carpet red)
‘sea turtle’ (lit. turtle marine)
‘flat screen TV’ (lit. screen flat)
‘real time’ (lit. time real)

ovelha negra
papel higiˆenico
para´ıso fiscal
pastor alem˜ao
pau mandado
pavio curto
pente fino
peso morto
planta baixa
ponto cego
ponto forte
ponto fraco
poc¸ ˜ao m´agica
prato feito
primeira infˆancia
primeira-m˜ao
primeira necessidade
primeira-dama
primeiro-ministro
primeiro plano
processo seletivo
pronto socorro
pr´ıncipe encantado
puro sangue
p˜ao-duro
p´e quente
p´e-direito
p´e frio
p ´olo aqu´atico
quadro negro
queda livre
quinta categoria
rede social
regime pol´ıtico
rel ´ogio anal ´ogico
rel ´ogio biol ´ogico
reta final
roda gigante
roleta russa
saia justa
sala cir ´urgica
sal˜ao paroquial
sangue azul
sangue frio
sangue quente
secret´aria eletr ˆonica
segundas intenc¸ ˜oes
segundo plano
sentenc¸a judicial
sexto sentido
sinal verde
sistema pol´ıtico
s´etima arte
tapete vermelho
tartaruga marinha
tela plana
tempo real


Compounds

hcHM

Translation (Gloss)

terceira idade
terceira pessoa
tiro livre
trabalho brac¸al
trabalho escravo
vaca louca
vinho branco
vinho tinto
vista grossa
viva voz
voto secreto
v ˆoo dom´estico
v ˆoo internacional
´agua doce
´agua mineral
ˆonibus executivo

1.70
2.00
1.58
3.55
4.24
1.23
3.40
4.08
0.50
1.70
4.82
3.41
4.96
1.45
4.21
2.63

‘elder’ (lit. third age)
‘third person’ (lit. third person)
‘free kick (soccer)’ (lit. shot free)
‘manual labor’ (lit. work arm)
‘slave work’ (lit. work slave)
‘mad cow’ (lit. cow crazy/mad)
‘white wine’ (lit. wine white)
‘red wine’ (lit. wine dark-red)
‘turn a blind eye’ (lit. vision thick)
‘aloud’ (lit. live voice)
‘secret ballot’ (lit. vote secret)
‘domestic flight’ (lit. flight domestic)
‘international flight’ (lit. flight international)
‘fresh water’ (lit. water sweet)
‘mineral water’ (lit. water mineral)
‘minibus’ (lit. bus executive)

Acknowledgments
This work has been partly funded by
projects PARSEME (Cost Action IC1207),
PARSEME-FR (ANR-14-CERA-0001),
AIM-WEST (FAPERGS-INRIA 1706-2551/
13-7), CNPq (312114/2015-0, 423843/2016-8)
“Simplificação Textual de Expressões
Complexas,” sponsored by Samsung
Eletrônica da Amazônia Ltda. under the
terms of Brazilian federal law No. 8.248/91.
We would like to thank the anonymous
reviewers who provided numerous helpful
suggestions, Alexis Nasr for reviewing
earlier versions of this article, Rodrigo
Wilkens and Leonardo Zilio for contributing
to the data set creation, and all anonymous
annotators who judged the compositionality
of compounds.

References
Agirre, Eneko, Enrique Alfonseca, Keith B.
Hall, Jana Kravalova, Marius Pasca, and
Aitor Soroa. 2009. A study on similarity
and relatedness using distributional and
wordnet-based approaches. In Human
Language Technologies: Conference of the
North American Chapter of the Association of
Computational Linguistics, Proceedings, May
31–June 5, 2009, pages 19–27, Boulder, CO.

Artstein, Ron, and Massimo Poesio. 2008.

Inter-coder agreement for computational
linguistics. Computational Linguistics,
34(4):555–596.

Baldwin, Timothy, and Su Nam Kim. 2010.

Multiword expressions. In Nitin
Indurkhya and Fred J. Damerau, editors,
Handbook of Natural Language Processing,

2nd edition. CRC Press, Taylor and Francis
Group, Boca Raton, FL, pages 267–292.
Bannard, Colin, Timothy Baldwin, and Alex
Lascarides. 2003. A statistical approach to
the semantics of verb-particles. In
Proceedings of the ACL 2003 Workshop on
Multiword Expressions: Analysis, Acquisition
and Treatment (Volume 18), pages 65–72,
Stroudsburg, PA.

Baroni, Marco, Silvia Bernardini, Adriano
Ferraresi, and Eros Zanchetta. 2009. The
wacky wide web: A collection of very large
linguistically processed web-crawled
corpora. Language Resources and Evaluation,
43(3):209–226.

Baroni, Marco, Georgiana Dinu, and Germ´an
Kruszewski. 2014. Don’t count, predict! A
systematic comparison of context-counting
vs. context-predicting semantic vectors. In
Proceedings of the 52nd Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers), pages 238–247,
Baltimore.

Baroni, Marco, and Alessandro Lenci. 2010.

Distributional memory: A general
framework for corpus-based semantics.
Computational Linguistics, 36(4):673–721.

Bick, Eckhard. 2000. The Parsing System

“palavras”: Automatic Grammatical Analysis
of Portuguese in a Constraint Grammar
Framework. Ph.D. thesis, University of
Aarhus.

Boos, Rodrigo, Kassius Prestes, and Aline
Villavicencio. 2014. Identification of
multiword expressions in the brWaC. In
Proceedings of the Conference on Language
Resources and Evaluation 2014,
pages 728–735, ELRA. ACL Anthology
Identifier: L14–1429.


Bride, Antoine, Tim Van de Cruys, and

Nicholas Asher. 2015. A generalisation of
lexical functions for composition in
distributional semantics. In Association for
Computational Linguistics (1), pages 281–291.
Bullinaria, John A., and Joseph P. Levy. 2012.
Extracting semantic representations from
word co-occurrence statistics: Stop-lists,
stemming, and SVD. Behavior Research
Methods, 44(3):890–907.

Camacho-Collados, Jos´e, Mohammad Taher

Pilehvar, and Roberto Navigli. 2015.
A framework for the construction of
monolingual and cross-lingual word
similarity datasets. In Proceedings of the
53rd Annual Meeting of the Association for
Computational Linguistics and the 7th
International Joint Conference on Natural
Language Processing (Volume 2: Short
Papers), pages 1–7, Beijing.

Cap, Fabienne, Manju Nirmal, Marion

Weller, and Sabine Schulte im Walde. 2015.
How to account for idiomatic German
support verb constructions in
statistical machine translation. In
Proceedings of the 11th Workshop on
Multiword Expressions, pages 19–28,
Association for Computational Linguistics,
Denver.

Carpuat, Marine, and Mona Diab. 2010.
Task-based evaluation of multiword
expressions: A pilot study in statistical
machine translation. In Proceedings of
NAACL/HLT 2010, pages 242–245,
Los Angeles.

Church, Kenneth Ward, and Patrick Hanks.
1990. Word association norms, mutual
information, and lexicography.
Computational Linguistics, 16(1):22–29.

Cohen, Jacob. 1960. A coefficient of

agreement for nominal scales. Educational
and Psychological Measurement, 20(1):37–46.
Constant, Mathieu, G ¨uls¸en Eryi ˘git, Johanna
Monti, Lonneke Van Der Plas, Carlos
Ramisch, Michael Rosner, and Amalia
Todirascu. 2017. Multiword expression
processing: A survey. Computational
Linguistics, 43(4):837–892.

Cordeiro, Silvio, Carlos Ramisch, Marco
Idiart, and Aline Villavicencio. 2016.
Predicting the compositionality of nominal
compounds: Giving word embeddings a
hard time. In Proceedings of the 54th Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers),
pages 1986–1997, Berlin.

Cordeiro, Silvio, Carlos Ramisch, and Aline
Villavicencio. 2016. mwetoolkit+sem:
Integrating word embeddings in the
mwetoolkit for semantic MWE processing.


In Proceedings of the Tenth International
Conference on Language Resources and
Evaluation (LREC 2016), pages 1221–1225,
European Language Resources
Association (ELRA), Paris.

Curran, James R., and Marc Moens. 2002.

Scaling context space. In Proceedings of the
40th Annual Meeting of the Association for
Computational Linguistics, pages 231–238.

Deerwester, Scott, Susan T. Dumais,

George W. Furnas, Thomas K. Landauer,
and Richard Harshman. 1990. Indexing by
latent semantic analysis. Journal of the
American Society for Information Science,
41(6):391.

Evert, Stefan. 2004. The Statistics of Word

Cooccurrences: Word Pairs and Collocations.
Ph.D. thesis, Institut f ¨ur maschinelle
Sprachverarbeitung, University of
Stuttgart, Stuttgart, Germany.

Farahmand, Meghdad, Aaron Smith, and

Joakim Nivre. 2015. A multiword
expression data set: Annotating
non-compositionality and
conventionalization for English noun
compounds. In Proceedings of the 11th
Workshop on Multiword Expressions,
pages 29–33, Association for
Computational Linguistics, Denver.
Fazly, Afsaneh, Paul Cook, and Suzanne

Stevenson. 2009. Unsupervised type and
token identification of idiomatic
expressions. Computational Linguistics,
35(1):61–103.

Ferret, Olivier. 2013. Identifying bad
semantic neighbors for improving
distributional thesauri. In Association
for Computational Linguistics (1),
pages 561–571.

Finlayson, Mark, and Nidhi Kulkarni. 2011.

Detecting multi-word expressions
improves word sense disambiguation.
In Proceedings of the Association for
Computational Linguistics 2011
Workshop on MWEs, pages 20–24,
Portland, OR.

Firth, John R. 1957. A synopsis of linguistic
theory, 1930–1955. In F. R. Palmer, ed.,
Selected Papers of J. R. Firth, pages 168–205,
Longman, London.

Fleiss, Joseph L., and Jacob Cohen. 1973.
The equivalence of weighted kappa
and the intraclass correlation coefficient
as measures of reliability. Educational
and Psychological Measurement,
33(3):613–619.

Frege, Gottlob. 1892/1960. ¨Uber sinn und
bedeutung. Zeitschrift f ¨ur Philosophie und
philosophische Kritik, 100:25–50. Translated,
as ‘On Sense and Reference,’ by Max Black.


Freitag, Dayne, Matthias Blume, John

Byrnes, Edmond Chow, Sadik Kapadia,
Richard Rohwer, and Zhiqiang Wang.
2005. New experiments in distributional
representations of synonymy. In
Proceedings of the Ninth Conference on
Computational Natural Language Learning,
pages 25–32.

Girju, Roxana, Dan Moldovan, Marta Tatu,

and Daniel Antohe. 2005. On the semantics
of noun compounds. Computer Speech &
Language, 19(4):479–496.

Goldberg, Adele E. 2015. Compositionality,
Chapter 24. Routledge, Amsterdam.

Guevara, Emiliano. 2011. Computing

semantic compositionality in distributional
semantics. In Proceedings of the Ninth
International Conference on Computational
Semantics, IWCS ’11, pages 135–144,
Association for Computational Linguistics,
Stroudsburg, PA.

Harris, Zellig. 1954. Distributional structure.

Word, 10:146–162.

Hartung, Matthias, Fabian Kaupmann,

Soufian Jebbara, and Philipp Cimiano.
2017. Learning compositionality functions
on word embeddings for modelling
attribute meaning in adjective-noun
phrases. In Proceedings of the 15th Meeting of
the European Chapter of the Association for
Computational Linguistics (Volume 1),
pages 54–64.

Hendrickx, Iris, Zornitsa Kozareva, Preslav

Nakov, Diarmuid ´O S´eaghdha, Stan
Szpakowicz, and Tony Veale. 2013.
Semeval-2013 task 4: Free paraphrases of
noun compounds. In Proceedings of *SEM
2013 (Volume 2 — SemEval), pages 138–143,
Association for Computational
Linguistics.

Hwang, Jena D., Archna Bhatia, Clare Bonial,

Aous Mansouri, Ashwini Vaidya,
Nianwen Xue, and Martha Palmer. 2010.
Propbank annotation of multilingual light
verb constructions. In Proceedings of the
LAW 2010, pages 82–90, Association for
Computational Linguistics.

Jagfeld, Glorianna, and Lonneke van der Plas. 2015. Towards a better semantic role labelling of complex predicates. In Proceedings of NAACL Student Research Workshop, pages 33–39, Denver.

Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing, 2nd Edition. Prentice-Hall, Inc., Upper Saddle River, NJ.

Kiela, Douwe, and Stephen Clark. 2014. A systematic study of semantic vector space model parameters. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) at EACL, pages 21–30.

K ¨oper, Maximilian, and Sabine Schulte im
Walde. 2016. Distinguishing literal and
non-literal usage of German particle verbs.
In HLT-NAACL, pages 353–362.

Kruszewski, Germ´an, and Marco Baroni.
2014. Dead parrots make bad pets:
Exploring modifier effects in noun
phrases. In Proceedings of the Third Joint
Conference on Lexical and Computational
Semantics, *SEM@COLING 2014, August
23-24, 2014, pages 171–181, The *SEM 2014
Organizing Committee, Dublin.

Landauer, Thomas K., Peter W. Foltz, and
Darrell Laham. 1998. An introduction to
latent semantic analysis. Discourse
Processes, 25(2-3):259–284.

Lapesa, Gabriella, and Stefan Evert. 2014.

A large scale evaluation of distributional
semantic models: Parameters, interactions
and model selection. Transactions of the
Association for Computational Linguistics,
2:531–545.

Lapesa, Gabriella, and Stefan Evert. 2017.

Large-scale evaluation of
dependency-based DSMs: Are they
worth the effort? In EACL 2017,
pages 394–400.

Lauer, Mark. 1995. How much is enough?:
Data requirements for statistical NLP.
CoRR, abs/cmp-lg/9509001.

Levy, Omer, Yoav Goldberg, and Ido Dagan.
2015. Improving distributional similarity
with lessons learned from word
embeddings. Transactions of the Association
for Computational Linguistics, 3:211–225.
Lin, Dekang. 1998. Automatic retrieval and
clustering of similar words. In Proceedings
of the 17th International Conference on
Computational Linguistics (Volume 2),
pages 768–774.

Lin, Dekang. 1999. Automatic identification

of non-compositional phrases. In
Proceedings of the 37th Annual Meeting of the
Association for Computational Linguistics on
Computational Linguistics, pages 317–324.

Jagfeld, Glorianna, and Lonneke van der

McCarthy, Diana, Bill Keller, and John

Plas. 2015. Towards a better semantic role
labelling of complex predicates. In
Proceedings of NAACL Student Research
Workshop, pages 33–39, Denver.

Jurafsky, Daniel, and James H. Martin. 2009.

Speech and Language Processing, 2nd
Edition, Prentice-Hall, Inc., Upper Saddle
River, NJ.

Kiela, Douwe, and Stephen Clark. 2014. A

systematic study of semantic vector space
model parameters. In Proceedings of the 2nd
Workshop on Continuous Vector Space Models

Carroll. 2003. Detecting a continuum of
compositionality in phrasal verbs. In
Proceedings of the Association for
Computational Linguistics 2003 Workshop on
Multiword Expressions: Analysis, Acquisition
and Treatment, pages 73–80, Association
for Computational Linguistics, Sapporo,
Japan.

Mikolov, Tomas, Ilya Sutskever, Kai Chen,
Greg S. Corrado, and Jeff Dean. 2013.
Distributed representations of words and
phrases and their compositionality. In

Advances in Neural Information Processing
Systems, pages 3111–3119.

Mikolov, Tomas, Wen-tau Yih, and Geoffrey
Zweig. 2013. Linguistic regularities in
continuous space word representations.
In HLT-NAACL, pages 746–751.

Mitchell, Jeff, and Mirella Lapata. 2008.
Vector-based models of semantic
composition. In Association for
Computational Linguistics, pages 236–244.

Mitchell, Jeff, and Mirella Lapata. 2010.

Composition in distributional models of
semantics. Cognitive Science,
34(8):1388–1429.

Mohammad, Saif, and Graeme Hirst. 2012.
Distributional measures of semantic
distance: A survey. CoRR, abs/1203.1858.

Nakov, Preslav. 2008. Paraphrasing verbs
for noun compound interpretation. In
Proceedings of the LREC Workshop Towards a
Shared Task for MWEs, pages 46–49.

Nakov, Preslav. 2013. On the interpretation of
noun compounds: Syntax, semantics, and
entailment. Natural Language Engineering,
19:291–330.

Nivre, Joakim, Johan Hall, and Jens Nilsson.

2006. MaltParser: A data-driven
parser-generator for dependency parsing.
In Proceedings of the Conference on Language
Resources and Evaluation (Volume 6),
pages 2216–2219.

Padó, Sebastian, and Mirella Lapata. 2003.

Constructing semantic space models from
parsed corpora. In Proceedings of the 41st
Annual Meeting of the Association for
Computational Linguistics (Volume 1),
pages 128–135.

Padó, Sebastian, and Mirella Lapata. 2007.

Dependency-based construction of
semantic space models. Computational
Linguistics, 33(2):161–199.

Padró, Muntsa, Marco Idiart, Aline

Villavicencio, and Carlos Ramisch. 2014a.
Comparing similarity measures for
distributional thesauri. In Proceedings of the
Ninth International Conference on Language
Resources and Evaluation (LREC 2014),
pages 2964–2971, European Language
Resources Association, Reykjavik.

Padró, Muntsa, Marco Idiart, Aline

Villavicencio, and Carlos Ramisch. 2014b.
Nothing like good old frequency: Studying
context filters for distributional thesauri.
In Proceedings of the Conference on Empirical
Methods in Natural Language Processing
(Short Papers), pages 419–424, Doha, Qatar.

Pennington, Jeffrey, Richard Socher, and

Christopher Manning. 2014. GloVe: Global
vectors for word representation. In
Proceedings of the 2014 Conference on

Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543,
Association for Computational Linguistics,
Doha, Qatar.

Ramisch, Carlos, Silvio Cordeiro, Leonardo
Zilio, Marco Idiart, Aline Villavicencio,
and Rodrigo Wilkens. 2016. How naked is
the naked truth? A multilingual lexicon
of nominal compound compositionality.
In The 54th Annual Meeting of the
Association for Computational Linguistics,
pages 156–161.

Ramisch, Carlos, Silvio Ricardo Cordeiro,

and Aline Villavicencio. 2016. Filtering and
measuring the intrinsic quality of human
compositionality judgments. In Proceedings
of the 12th Workshop on Multiword
Expressions (MWE 2016), pages 32–37,
Berlin.

Reddy, Siva, Diana McCarthy, and Suresh

Manandhar. 2011. An empirical study on
compositionality in compound nouns. In
Proceedings of the 5th International Joint
Conference on Natural Language Processing
2011 (IJCNLP 2011), pages 210–218,
Chiang Mai, Thailand.

Ren, Zhixiang, Yajuan Lü, Jie Cao, Qun Liu,

and Yun Huang. 2009. Improving
statistical machine translation using
domain bilingual multiword expressions.
In Proceedings of the ACL 2009 Workshop on
MWEs, pages 47–54, Singapore.

Riedl, Martin, and Chris Biemann. 2015.
A single word is not enough: Ranking
multiword expressions using
distributional semantics. In Proceedings of
the 2015 Conference on Empirical
Methods in Natural Language Processing,
pages 2430–2440, Association for
Computational Linguistics.

Roller, Stephen, and Sabine Schulte im

Walde. 2014. Feature norms of German
noun compounds. In Proceedings of the 10th
Workshop on Multiword Expressions (MWE),
pages 104–108, Association for
Computational Linguistics.

Roller, Stephen, Sabine Schulte im Walde,

and Silke Scheible. 2013. The (un)expected
effects of applying standard cleansing
models to human ratings on
compositionality. In Proceedings of the
9th Workshop on Multiword Expressions,
pages 32–41, Association for
Computational Linguistics.

Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002.
Multiword expressions: A pain in the neck
for NLP. In Computational Linguistics and
Intelligent Text Processing. Springer,
New York, pages 1–15.

Salehi, Bahar, Paul Cook, and Timothy
Baldwin. 2014. Using distributional
similarity of multi-way translations to
predict multiword expression
compositionality. In Proceedings of the 14th
Conference of the European Chapter of the
Association for Computational Linguistics,
pages 472–481, Gothenburg, Sweden.
Salehi, Bahar, Paul Cook, and Timothy
Baldwin. 2015. A word embedding
approach to predicting the
compositionality of multiword
expressions. In Proceedings of the 2015
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies,
pages 977–983, Denver.

Salehi, Bahar, Nitika Mathur, Paul Cook, and
Timothy Baldwin. 2015. The impact of
multiword expression compositionality on
machine translation evaluation. In Proceedings
of the 11th Workshop on Multiword
Expressions, pages 54–59, Association for
Computational Linguistics, Denver.
Salle, Alexandre, Aline Villavicencio, and
Marco Idiart. 2016. Matrix factorization
using window sampling and negative
sampling for improved word
representations. In Proceedings of the 54th
Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 419–424, Berlin.

Schmid, Helmut. 1995. TreeTagger: A language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart,
43:28.

Schneider, Nathan, Dirk Hovy, Anders

Johannsen, and Marine Carpuat. 2016.
SemEval 2016 task 10: Detecting minimal
semantic units and their meanings
(DiMSUM). In Proceedings of SemEval,
pages 546–559, San Diego.

Schone, Patrick, and Daniel Jurafsky. 2001.

Is knowledge-free induction of multiword
unit dictionary headwords a solved
problem? In Proceedings of Empirical
Methods in Natural Language Processing,
pages 100–108, Pittsburgh.

Schulte im Walde, Sabine, Anna Hätty, Stefan

Bott, and Nana Khvtisavrishvili. 2016.
GhoSt-NN: A representative gold standard
of German noun-noun compounds.
In Proceedings of the Conference on
Language Resources and Evaluation,
pages 2285–2292.

Schulte im Walde, Sabine, Stefan Müller, and
Stefan Roller. 2013. Exploring vector space
models to predict the compositionality
of German noun-noun compounds. In
Proceedings of *SEM 2013 (Volume 1),
pages 255–265. Association for
Computational Linguistics.
Socher, Richard, Brody Huval,

Christopher D. Manning, and Andrew Y.
Ng. 2012. Semantic compositionality
through recursive matrix-vector spaces.
In Proceedings of the 2012 Joint Conference
on Empirical Methods in Natural Language
Processing and Computational Natural
Language Learning, pages 1201–1211.
Stymne, Sara, Nicola Cancedda, and Lars

Ahrenberg. 2013. Generation of compound
words in statistical machine translation
into compounding languages.
Computational Linguistics, 39(4):1067–1108.

Tsvetkov, Yulia, and Shuly Wintner. 2012.

Extraction of multi-word expressions from
small parallel corpora. Natural Language
Engineering, 18(04):549–573.

Turney, Peter D., and Patrick Pantel. 2010.

From frequency to meaning: vector space
models of semantics. Journal of Artificial
Intelligence Research, 37(1):141–188.

Van de Cruys, Tim, Laura Rimell, Thierry
Poibeau, and Anna Korhonen. 2012.
Multiway tensor factorization for
unsupervised lexical acquisition. In
COLING 2012, pages 2703–2720.
Yazdani, Majid, Meghdad Farahmand,

and James Henderson. 2015. Learning
semantic composition to detect
non-compositionality of multiword
expressions. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 1733–1742,
Association for Computational Linguistics,
Lisbon.
