When the Whole Is Less Than the Sum of Its
Parts: How Composition Affects PMI Values
in Distributional Semantic Vectors
Denis Paperno∗
University of Trento
Marco Baroni∗
University of Trento
Distributional semantic models, deriving vector-based word representations from patterns of
word usage in corpora, have many useful applications (Turney and Pantel 2010). Recently, there
has been interest in compositional distributional models, which derive vectors for phrases from
representations of their constituent words (Mitchell and Lapata 2010). Often, the values of distri-
butional vectors are pointwise mutual information (PMI) scores obtained from raw co-occurrence
counts. In this article we study the relation between the PMI dimensions of a phrase vector and
its components in order to gain insights into which operations an adequate composition model
should perform. We show mathematically that the difference between the PMI dimension of a
phrase vector and the sum of PMIs in the corresponding dimensions of the phrase’s parts is an
independently interpretable value, namely, a quantification of the impact of the context associated
with the relevant dimension on the phrase’s internal cohesion, as also measured by PMI. We then
explore this quantity empirically, through an analysis of adjective–noun composition.
1. Introduction
Dimensions of a word vector in distributional semantic models contain a function
of the co-occurrence counts of the word with contexts of interest. A popular and
effective option (Bullinaria and Levy 2012) is to transform counts into pointwise
mutual information (PMI) scores, which are given, for any word a and context c, by
PMI(a, c) = log(P(a | c) / P(a)).1 There are various proposals for deriving phrase representations
∗ Center for Mind/Brain Sciences, University of Trento, Palazzo Fedrigotti, corso Bettini 31, 38068 Rovereto
(TN), Italy. E-mails: denis.paperno@unitn.it; marco.baroni@unitn.it.
1 PMI is also used to measure the tendency of two phrase constituents to be combined in a particular
syntactic configuration (e.g., to assess the degree of lexicalization of the phrase). We use PMI(ab) to refer
to this “phrase cohesion” PMI. PMI(ab) = log(P(ab) / (P(a) · P(b))) = log(P(ab | a) / P(b)) depends on how often words a and
b form a phrase of the relevant kind, e.g., how often adjective a modifies noun b, while PMI(a, b) is based
on the probability of two words co-occurring in the relevant co-occurrence context (e.g., within an
n-word window). Quantifying phrase cohesion with PMI(ab) has been the most common usage of PMI in
computational linguistics outside distributional semantics ever since the seminal work on collocations by
Church and Hanks (1990).
Submission received: 18 June 2015; revised submission received: 16 November 2015; accepted for publication:
3 February 2016.
doi:10.1162/COLI_a_00250
© 2016 Association for Computational Linguistics
Table 1
Composition methods. ⊙ is pointwise multiplication; a and b are the constituent word vectors;
α and β are scalar parameters; matrices X and Y represent syntactic relation slots (e.g.,
Adjective and Noun); matrix A stands for a functional word (e.g., the adjective in an
adjective–noun construction); and c is a constant vector.

model                phrase vector
additive             a + b
multiplicative       a ⊙ b
weighted additive    αa + βb
full additive        Xa + Yb
lexical function     Ab
shifted additive     a + b + c
by composing word vectors, ranging from simple, parameter-free vector addition to
fully supervised deep-neural-network-based systems. We focus here on the models
illustrated in Table 1; see Dinu, Pham, and Baroni (2013) for the original model refer-
ences. As an empirical test case, we consider adjective–noun composition.
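For concreteness, the composition functions of Table 1 can be sketched in a few lines of numpy; the shapes, parameter values, and function names below are our own illustration, not the DISSECT implementation used later in the article.

```python
import numpy as np

def additive(a, b):
    return a + b                     # plain vector sum

def multiplicative(a, b):
    return a * b                     # pointwise (Hadamard) product

def weighted_additive(a, b, alpha, beta):
    return alpha * a + beta * b      # alpha, beta: learned scalars

def full_additive(a, b, X, Y):
    return X @ a + Y @ b             # X, Y: matrices for the syntactic slots

def lexical_function(A, b):
    return A @ b                     # A: matrix encoding the functor word (the adjective)

def shifted_additive(a, b, shift):
    return a + b + shift             # shift: learned constant correction vector

# toy usage with 4-dimensional vectors (values are arbitrary)
rng = np.random.default_rng(0)
a, b, shift = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
X, Y, A = (rng.normal(size=(4, 4)) for _ in range(3))
print(weighted_additive(a, b, 0.6, 0.8), lexical_function(A, b))
```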
2. A General Result on the PMI Dimensions of Phrases
An ideal composition model should be able to reconstruct, at least for sufficiently
frequent phrases, the corpus-extracted vector of the phrase ab from vectors of its parts
a, b. When vector dimensions encode PMI values, for each context c, the composition
model has to predict PMI(ab, c) between phrase ab and context c. Equation (1) shows
that there is a mathematical relation between PMI(ab, c) and the PMI values of the
phrase components PMI(a, c) and PMI(b, c):
$$
\begin{aligned}
\mathrm{PMI}(ab, c) &= \log\frac{P(ab \mid c)}{P(ab)} = \log\frac{P(a \mid c) \cdot P(ab \mid a \wedge c)}{P(a) \cdot P(ab \mid a)} \\
 &= \log\left(\frac{P(a \mid c) \cdot P(ab \mid a \wedge c)}{P(a) \cdot P(ab \mid a)} \cdot \frac{P(b \mid c) \cdot P(b)}{P(b \mid c) \cdot P(b)}\right) \\
 &= \log\left(\frac{P(a \mid c)}{P(a)} \cdot \frac{P(b \mid c)}{P(b)} \cdot \frac{P(ab \mid a \wedge c)}{P(b \mid c)} \cdot \frac{P(b)}{P(ab \mid a)}\right) \\
 &= \log\frac{P(a \mid c)}{P(a)} + \log\frac{P(b \mid c)}{P(b)} + \log\frac{P(ab \mid a \wedge c)}{P(b \mid c)} - \log\frac{P(ab \mid a)}{P(b)} \\
 &= \mathrm{PMI}(a, c) + \mathrm{PMI}(b, c) + \mathrm{PMI}(ab \mid c) - \mathrm{PMI}(ab) \qquad (1)
\end{aligned}
$$
To make sense of this derivation, observe that P(ab) and P(ab | c) pertain to a
phrase ab where a and b are linked by a specific syntactic relation. Now, whenever
the phrase ab occurs, a must also occur, and thus P(ab) = P(ab ∧ a), and similarly
P(ab | c) = P(ab ∧ a | c). This connects the PMI of a phrase (based on counts of ab
linked by a syntactic relation) to the PMI of the constituents (based on counts of the
constituents in all contexts). Consequently, we can meaningfully relate PMI(ab, c)
(as computed to calculate phrase vector dimensions) to PMI(a, c) and PMI(b, c) (as
computed to calculate single word dimensions).
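The identity in Equation (1) can also be checked numerically: the sketch below builds an arbitrary toy joint distribution over the events "word a occurs", "word b occurs", "the phrase ab occurs", and "context c is present" (all counts are made up) and verifies that the four PMI terms add up exactly.

```python
import numpy as np
from itertools import product

# Toy joint counts over four binary events; the only constraint imposed is
# that the phrase ab entails the occurrence of its parts a and b.
rng = np.random.default_rng(1)
counts = {}
for a_in, b_in, ab_in, c_in in product([0, 1], repeat=4):
    impossible = ab_in and not (a_in and b_in)
    counts[(a_in, b_in, ab_in, c_in)] = 0 if impossible else int(rng.integers(1, 100))
total = sum(counts.values())

def p(pred):
    # probability of the outcomes satisfying pred
    return sum(n for ev, n in counts.items() if pred(*ev)) / total

P_a, P_b, P_ab = p(lambda a, b, ab, c: a), p(lambda a, b, ab, c: b), p(lambda a, b, ab, c: ab)
P_c = p(lambda a, b, ab, c: c)
P_a_c  = p(lambda a, b, ab, c: a and c)  / P_c
P_b_c  = p(lambda a, b, ab, c: b and c)  / P_c
P_ab_c = p(lambda a, b, ab, c: ab and c) / P_c

pmi_a_c, pmi_b_c = np.log(P_a_c / P_a), np.log(P_b_c / P_b)
pmi_ab_c = np.log(P_ab_c / P_ab)                     # phrase-context PMI
pmi_ab = np.log(P_ab / (P_a * P_b))                  # phrase cohesion PMI
pmi_ab_given_c = np.log(P_ab_c / (P_a_c * P_b_c))    # cohesion PMI within context c

lhs = pmi_ab_c
rhs = pmi_a_c + pmi_b_c + pmi_ab_given_c - pmi_ab
assert np.isclose(lhs, rhs)                          # Equation (1) holds exactly
print(lhs, rhs)
```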
Equation (1) unveils a systematic relation between the PMI value in a phrase
vector dimension and the value predicted by the additive approach to composition.
Indeed, PMI(ab, c) equals PMI(a, c) + PMI(b, c), shifted by a correction ∆cPMI(ab) =
PMI(ab | c) − PMI(ab), measuring how the context changes the tendency of two words
a, b to form a phrase. ∆cPMI(ab) includes any non-trivial effects of composition arising from the
interaction between the occurrence of words a, b, c. Absence of non-trivial interaction
of this kind is a reasonable null hypothesis, under which the association of phrase
components with each other is not affected by context at all: PMI(ab | c) = PMI(ab).
Under this null hypothesis, addition should accurately predict PMI values for phrases.
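In other words, the only quantity a composition model needs to estimate beyond plain addition is this residual. A trivial helper makes this explicit (the function name is ours; the example values are the sample averages reported below in Section 3.2):

```python
def delta_c_pmi(pmi_phrase_c, pmi_word1_c, pmi_word2_c):
    """Deviation of the observed phrase-context PMI from the additive prediction.

    Under the null hypothesis that context does not affect the association
    between the two words, this is zero for every context dimension c."""
    return pmi_phrase_c - (pmi_word1_c + pmi_word2_c)

# with the average values reported in Section 3.2 the residual is negative:
print(delta_c_pmi(0.80, 0.55, 0.63))   # -0.38: the whole is less than the sum
```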
3. Empirical Observations
We have shown that vector addition should perfectly predict phrase vectors under
the idealized assumption that the context’s effect on the association between words
in the phrase, ∆cPMI(ab) = PMI(ab | c) − PMI(ab), is negligible. ∆cPMI(ab) equals the
deviation of the actual PMI(ab, c) from the additive ideal, which any vector composition
model is essentially trying to estimate. Let us now investigate how well actual vectors
of English phrases fit the additive ideal, and, if they do not fit, how good the existing
composition methods are at predicting deviations from the ideal.
3.1 Experimental Setup
We focus on adjective–noun (AN) phrases as a representative case. We used 2.8 billion
tokens comprising ukWaC, Wackypedia, and the British National Corpus,2 extracting the
12.6K ANs that occurred at least 1K times. As contexts, we collected sentence-internal
co-occurrence counts with the 872 nouns3 occurring at least 150K times in the corpus.
PMI values were computed by standard maximum-likelihood estimation.
We separated a random subset of 6K ANs to train composition models. We consider
two versions of the corresponding constituent vectors as input to composition: plain
PMI vectors (with zero co-occurrence rates conventionally converted to 0 instead of
−∞) and positive PMI (PPMI) vectors (all non-positive PMI values converted
A 0). The latter transformation is common in the literature. Model parameters were
estimated using DISSECT (Dinu, Pham, and Baroni 2013), whose training objective is
to approximate corpus-extracted phrase vectors, a criterion especially appropriate for
our purposes.
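The vector-building step can be sketched as follows, following the conventions just described (maximum-likelihood estimation, zero counts mapped to 0, non-positive values clipped for PPMI); the function and the toy counts are our own illustration, not the actual pipeline.

```python
import numpy as np

def pmi_matrix(counts, positive=False):
    """Rows are targets (words or phrases), columns are context nouns.

    PMI is estimated by maximum likelihood from the raw counts; cells with zero
    co-occurrence are conventionally set to 0 rather than -inf, and with
    positive=True all non-positive values are clipped to 0 (PPMI)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_joint = counts / total
    p_target = counts.sum(axis=1, keepdims=True) / total
    p_context = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        pmi = np.log(p_joint / (p_target * p_context))
    pmi[counts == 0] = 0.0               # zero counts -> 0 instead of -inf
    if positive:
        pmi = np.maximum(pmi, 0.0)       # PPMI
    return pmi

# toy example: 3 targets x 4 context nouns
toy_counts = [[10, 0, 3, 7],
              [ 2, 5, 0, 1],
              [ 8, 1, 4, 0]]
print(pmi_matrix(toy_counts))
print(pmi_matrix(toy_counts, positive=True))
```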
We report results based on the 1.8 million positive PMI dimensions of the 4.7K
phrase vectors that were not used for training.4 On average a phrase had non-zero
co-occurrence with 84.8% of the context nouns, over half of which gave positive PMI
values. We focus on positive dimensions because negative association values are harder
to interpret and noisier; furthermore, −∞ cases must be set to some arbitrary value, E
most practical applications set all negative values to 0 anyway (PPMI). We also repeated
the experiments including negative observed values, with a similar pattern of results.
3.2 Divergence from Additive
We first verify how the observed PMI values of phrases depart from those predicted
by addition: In other words, how much ∆cPMI(ab) ≠ 0 in practice. We observe that
2 http://wacky.sslmit.unibo.it/, http://www.natcorp.ox.ac.uk/.
3 Only nouns were used to avoid adding the context word’s part of speech as a parameter of the analysis.
The number of contexts used was restricted by the consideration that training the lexical function model
for larger dimensionalities is problematic.
4 About 1.9K ANs containing adjectives occurring with fewer than 5 context nouns in the training set were
removed from the test set at this point, because we would not have had enough data to train the
corresponding lexical function model for those adjectives.
PMI(AN, c) has a strong tendency to be lower than the sum of the PMIs of the phrase's parts
with respect to the same context. In our sample, average PMI(AN, c) was 0.80, and average
PMI(A, c) and PMI(N, c) were 0.55 and 0.63, respectively.5 Over 70% of positive PMI
values in our sample are lower than additive (PMI(AN, c) < PMI(A, c) + PMI(N, c)); a
vast majority of phrases (over 92%) have on average a negative divergence from the
additive prediction:

$$\frac{1}{|C|}\sum_{c \in C}\big[\mathrm{PMI}(AN, c) - (\mathrm{PMI}(A, c) + \mathrm{PMI}(N, c))\big] < 0$$

The tendency for phrases to have lower PMI than predicted by the additive idealization
is quite robust. It holds whether or not we restrict the data to items with positive PMI of
constituent words (PMI(A, c) > 0, PMI(N, c) > 0), if we convert all negative PMI values of
constituents to 0, and also if we extend the test set to include negative PMI values of
phrases (PMI(AN, c) < 0).
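Statistics of this kind are straightforward to compute once the phrase and constituent PMI matrices are row-aligned; the following sketch (our own variable names, toy data standing in for the real matrices) returns the share of dimensions below the additive prediction and the share of phrases whose average divergence is negative.

```python
import numpy as np

def divergence_stats(pmi_an, pmi_a, pmi_n, mask):
    """pmi_an, pmi_a, pmi_n: row-aligned (phrases x contexts) PMI matrices for the
    AN phrases, their adjectives, and their nouns; mask selects the cells under
    study (here, the cells where the phrase PMI is positive)."""
    deviation = pmi_an - (pmi_a + pmi_n)                  # Delta_c PMI(AN) per cell
    share_cells_below_additive = (deviation[mask] < 0).mean()
    per_phrase = np.array([row[m].mean()                  # mean deviation per phrase
                           for row, m in zip(deviation, mask) if m.any()])
    share_phrases_negative = (per_phrase < 0).mean()
    return share_cells_below_additive, share_phrases_negative

# toy stand-ins for the real matrices
rng = np.random.default_rng(2)
pmi_a, pmi_n = rng.normal(0.5, 1.0, (5, 8)), rng.normal(0.6, 1.0, (5, 8))
pmi_an = pmi_a + pmi_n + rng.normal(-0.4, 0.5, (5, 8))
print(divergence_stats(pmi_an, pmi_a, pmi_n, mask=pmi_an > 0))
```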
A possible reason for the mostly negative deviation from addition comes from the
information-theoretic nature of PMI. Recall that PMI(ab) measures how informative
phrase components a, b are about each other. The negative deviation from addition
∆cPMI(ab) means that context diminishes the mutual information of a and b. And
indeed it is only natural that the context itself is usually informative. Concretely, it
can be informative in multiple ways. In one typical scenario, the two words being
composed (and the phrase) share the context topic (e.g., logical and operator in the context
of calculus, connected by the topic of mathematical logic). In this case there is little
additional PMI gained by composing such words because they share a large amount
of co-occurring contexts. Take the idealized case in which the shared underlying topic
increases the probability of A, N, and AN by a constant factor k, so that PMI(A, c) = PMI(N, c) =
PMI(AN, c) = log k. Then the association (PMI) of AN decreases by log k in the presence
of topic-related words, ∆cPMI(AN) = PMI(AN, c) − (PMI(A, c) + PMI(N, c)) = − log k.
The opposite case of negative association between context and AN is not symmetric to
the positive association just discussed (if it were, it would have produced a positive
deviation from the additive model). Negative association is in general less pronounced
than positive association: In our sample, positive PMI values cover over half the co-
occurrence table; furthermore, positive PMIs are on average greater in absolute value
than negative ones. Importantly, two words in a phrase will often disambiguate each
other, making the phrase less probable in a given context than expected from the
probabilities of its parts: logical operator is very unlikely in the context of automobile even
though operator in the sense of a person operating a machine and logical in the non-
technical sense are perfectly plausible in the same context. Such disambiguation cases,
we believe, largely account for negative deviation from addition in the case of negative
components.
One can think of minimal adjustments to the additive model correcting for system-
atic PMI overestimation. Here, we experiment with a shifted additive model obtained
by subtracting a constant vector from the summed PMI vector. Specifically, we obtained
shifted vectors by computing, for each dimension, the average deviation from the
additive model in the training data.
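A minimal sketch of this estimation, under the assumption that training phrases and their constituents are stored as row-aligned PMI matrices (variable and function names are ours):

```python
import numpy as np

def fit_shift(pmi_phrase_train, pmi_a_train, pmi_b_train):
    # per-dimension average deviation from the additive prediction on training phrases
    return (pmi_phrase_train - (pmi_a_train + pmi_b_train)).mean(axis=0)

def compose_shifted_additive(pmi_a, pmi_b, shift):
    # additive composition corrected by the learned constant vector
    return pmi_a + pmi_b + shift

# toy usage: 100 training phrases, 12 context dimensions, true shift around -0.4
rng = np.random.default_rng(3)
A, B = rng.normal(size=(100, 12)), rng.normal(size=(100, 12))
AN = A + B - 0.4 + rng.normal(scale=0.1, size=(100, 12))
shift = fit_shift(AN, A, B)
print(shift.round(2))                    # each dimension is roughly -0.4
print(compose_shifted_additive(A[0], B[0], shift))
```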
3.3 Approximation to Empirical Phrase PMI by Composition Models
We have seen that addition would be a reasonable approximation to PMI vector com-
position if the influence of context on the association between parts of the phrase
5 We cannot claim this divergence on unattested phrase-context co-occurrences because those should give
rise to very small, probably negative, PMI values.
348
Paperno and Baroni
When the Whole Is Less Than the Sum of Its Parts
turned out to be negligible. Empirically, phrase–context PMI deviates systematically and
negatively from the sum of the word–context PMIs. Crucially, an adequate vector composition
method should capture this deviation from the additive ideal. The next step is to test
existing vector composition models on how well they achieve this goal.
To assess approximation quality, we compare the PMI(AN, c) values predicted by
each composition model to the ones directly derived from the corpus, using mean
squared error as the figure of merit. Besides the full test set (all in Table 2), we consider some
informative subsets. The pos subset includes the 40K (AN, c) pairs with the largest positive
error with respect to the additive prediction (above 1.264). The neg subset includes the
40K dimensions with the largest negative error with respect to additive (below −1.987).
Finally, the near-0 subset includes the 20K items with the smallest positive errors and
the 20K items with the smallest negative errors with respect to additive (between −0.026
and 0.023). Each of the three subsets constitutes about 2% of the all data set.
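The evaluation thus reduces to a mean squared error over (AN, c) cells, computed on the full set and on subsets selected by the error of the additive prediction; the sketch below uses the thresholds reported above, with simulated data in place of the real values and our own function names.

```python
import numpy as np

def mse(predicted, observed):
    return np.mean((predicted - observed) ** 2)

def subset_masks(observed, additive_pred):
    # error of the observed phrase PMI with respect to the additive prediction
    err = observed - additive_pred
    pos    = err > 1.264                        # largest positive deviations
    neg    = err < -1.987                       # largest negative deviations
    near_0 = (err > -0.026) & (err < 0.023)     # smallest deviations
    return pos, neg, near_0

# simulated stand-ins for observed PMI(AN, c) values and model predictions
rng = np.random.default_rng(4)
observed      = rng.normal(0.8, 1.5, size=20_000)
additive_pred = observed - rng.normal(-0.4, 1.0, size=20_000)
model_pred    = observed + rng.normal(0.0, 0.5, size=20_000)

pos, neg, near_0 = subset_masks(observed, additive_pred)
for name, mask in [("all", slice(None)), ("pos", pos), ("neg", neg), ("near-0", near_0)]:
    print(name, round(mse(model_pred[mask], observed[mask]), 3))
```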
By looking at Table 2, we observe first of all that addition’s tendency to overestimate
phrase PMI values puts it behind other models in the all and neg test sets, even behind
the multiplicative method, which, unlike others, has no theoretical motivation. The
relatively good result of the multiplicative model can be explained through the patterns
observed earlier: PMI(AN,c) is typically just above PMI(A,c) and PMI(N,c) for each
of the phrase components (median values 0.66, 0.5, and 0.56, respectively). Adding
PMI(A,c) and PMI(N,c) makes the prediction further above the observed PMI(AN,c)
than their product is below it (when applied to median values, we obtain deviations
of |0.66 − (0.5 × 0.56)| = 0.38 for multiplication and |0.66 − (0.5 + 0.56)| = 0.4 for addi-
tion). As one could expect, shifted addition is on average closer to actual PMI values
than plain addition. However, weighted addition provides better approximations to the
observed values. Shifted addition behaves too conservatively with respect to addition,
providing a good fit when observed PMI is close to additive (near-0 subset), but only
bringing about a small improvement in the all-important negative subset. Weighted ad-
dition, on the other hand, brings about large improvements in approximating precisely
the negative subset. Weighted addition is the best model overall, outperforming the
parameter-rich full additive and lexical function models (the former only by a small
margin). Confirming the effectiveness of the non-negative transform, PPMI-trained
models are more accurate than PMI-trained ones, although the latter provide the best fit
for the extreme negative subset, where component negative values are common.
As discussed before, the observed deviation from additive PMI is mostly negative,
due partly to the shared underlying topic effect and partly to the disambiguation effect
discussed in Section 3.2.
Table 2
Mean squared error of different models' predictions, trained on PMI (left) vs. PPMI vectors (right).

                      PMI                                 PPMI
model                 all    pos    neg    near-0         all    pos    neg    near-0
additive              0.75   3.11   5.86   (≈0.00)        0.71   1.96   5.88   (0.02)
multiplicative        0.61   2.50   3.38   0.55           0.59   5.92   3.40   0.62
weighted additive     0.39   2.52   0.41   0.35           0.32   3.01   0.62   0.27
full additive         0.56   3.02   0.68   0.54           0.34   1.93   0.63   0.29
lexical function      0.73   2.92   0.74   0.68           0.45   2.01   0.68   0.37
shifted additive      0.66   4.66   3.01   0.39           0.48   2.16   3.52   0.18
In both cases, whenever the PMI of the constituents (PMI(a, c) and/or
PMI(b, c)) is larger, the deviation from additive (PMI(ab, c) − (PMI(a, c) + PMI(b, c))) is
likely to become smaller. Weighted addition captures this, setting the negative cor-
rection of the additive model to be a linear function of the PMI values of the phrase
components. The full additive model, which also showed competitive results overall,
might perform better with more training data or with lower vector dimensionality
(in the current set-up, there were just about three training examples for each parameter
to set).
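One way to see why weighted addition can capture such a linear correction is that its two scalars can be fit by least squares against the observed phrase vectors, which is in the spirit of DISSECT's objective of approximating corpus-extracted phrase vectors; the sketch below uses synthetic data and our own function name.

```python
import numpy as np

def fit_weighted_additive(pmi_a, pmi_b, pmi_phrase):
    """Least-squares alpha, beta such that alpha*a + beta*b best approximates the
    observed phrase PMI values over all training (phrase, context) cells."""
    X = np.column_stack([pmi_a.ravel(), pmi_b.ravel()])
    (alpha, beta), *_ = np.linalg.lstsq(X, pmi_phrase.ravel(), rcond=None)
    return alpha, beta

# synthetic data: phrase PMI is a damped sum of the constituents' PMIs plus noise
rng = np.random.default_rng(5)
A, N = rng.normal(0.5, 1.0, (200, 30)), rng.normal(0.6, 1.0, (200, 30))
AN = 0.7 * A + 0.8 * N + rng.normal(scale=0.1, size=(200, 30))
print(fit_weighted_additive(A, N, AN))   # approximately (0.7, 0.8)
```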
4. Conclusions
We have shown, based on the mathematical definition of PMI, that addition is a sys-
tematic component of PMI vector composition. The remaining component is also an
interpretable value, measuring the impact of context on the phrase’s internal PMI. In
practice, this component is typically negative. Empirical observations about adjective-
noun phrases show that systematic deviations from addition are largely accounted for
by a negative shift ∆cPMI(ab), which might be proportional to the composed vectors’
dimensions (as partially captured by the weighted additive method). Further studies
should consider other constructions and types of context to confirm the generality of
our results.
Acknowledgments
We would like to thank the Computational
Linguistics editor and reviewers: Yoav
Goldberg, Omer Levy, Katya Tentori,
Germán Kruszewski, Nghia Pham, and
the other members of the Composes team
for useful feedback. Our work is funded
by ERC 2011 Starting Independent
Research Grant n. 283554 (COMPOSES).
References
Bullinaria, John and Joseph Levy. 2012.
Extracting semantic representations from
word co-occurrence statistics: Stop-lists,
stemming and SVD. Behavior Research
Methods, 44:890–907.
Church, Kenneth and Peter Hanks. 1990.
Word association norms, mutual
information, and lexicography.
Computational Linguistics,
16(1):22–29.
Dinu, Georgiana, Nghia The Pham, and
Marco Baroni. 2013. General estimation
and evaluation of compositional
distributional semantic models. In
Proceedings of ACL Workshop on
Continuous Vector Space Models and
their Compositionality, pages 50–58,
Sofia.
Mitchell, Jeff and Mirella Lapata. 2010.
Composition in distributional models of
semantics. Cognitive Science,
34(8):1388–1429.
Turney, Peter and Patrick Pantel. 2010. From
frequency to meaning: Vector space
models of semantics. Journal of Artificial
Intelligence Research, 37:141–188.