When the Whole Is Less Than the Sum of Its
Parts: How Composition Affects PMI Values
in Distributional Semantic Vectors
Denis Paperno∗
University of Trento
Marco Baroni∗
University of Trento
Distributional semantic models, deriving vector-based word representations from patterns of
word usage in corpora, have many useful applications (Turney and Pantel 2010). Recently, there
has been interest in compositional distributional models, which derive vectors for phrases from
representations of their constituent words (Mitchell and Lapata 2010). Often, the values of distri-
butional vectors are pointwise mutual information (PMI) scores obtained from raw co-occurrence
counts. In this article we study the relation between the PMI dimensions of a phrase vector and
its components in order to gain insights into which operations an adequate composition model
should perform. We show mathematically that the difference between the PMI dimension of a
phrase vector and the sum of PMIs in the corresponding dimensions of the phrase’s parts is an
independently interpretable value, namely, a quantification of the impact of the context associated
with the relevant dimension on the phrase’s internal cohesion, as also measured by PMI. We then
explore this quantity empirically, through an analysis of adjective–noun composition.
1. Introduction
Dimensions of a word vector in distributional semantic models contain a function
of the co-occurrence counts of the word with contexts of interest. A popular and
effective option (Bullinaria and Levy 2012) is to transform counts into pointwise
mutual information (PMI) scores, which are given, for any word a and context c, by
PMI(a, c) = log(P(a | c) / P(a)).1 There are various proposals for deriving phrase representations
∗ Center for Mind/Brain Sciences, University of Trento, Palazzo Fedrigotti, corso Bettini 31, 38068 Rovereto
(TN), Italy. E-mails: denis.paperno@unitn.it; marco.baroni@unitn.it.
1 PMI is also used to measure the tendency of two phrase constituents to be combined in a particular
syntactic configuration (e.g., to assess the degree of lexicalization of the phrase). We use PMI(ab) to refer
to this “phrase cohesion” PMI. PMI(ab) = log(P(ab) / (P(a) · P(b))) = log(P(ab | a) / P(b)) depends on how often words a and
b form a phrase of the relevant kind, e.g., how often adjective a modifies noun b, while PMI(a, b) is based
on the probability of two words co-occurring in the relevant co-occurrence context (e.g., within an
n-word window). Quantifying phrase cohesion with PMI(ab) has been the most common usage of PMI in
computational linguistics outside distributional semantics ever since the seminal work on collocations by
Church and Hanks (1990).
Submission received: 18 June 2015; revised submission received: 16 November 2015; accepted for publication:
3 February 2016.
doi:10.1162/COLI_a_00250
© 2016 Association for Computational Linguistics
Table 1
Composition methods. ⊙ is pointwise multiplication; a and b are the constituent word vectors;
α and β are scalar parameters; matrices X and Y represent syntactic relation slots (e.g.,
Adjective and Noun); matrix A stands for a functional word (e.g., the adjective in an
adjective–noun construction); and c is a constant vector.

model                phrase vector
additive             a + b
multiplicative       a ⊙ b
weighted additive    αa + βb
full additive        Xa + Yb
lexical function     Ab
shifted additive     a + b + c
by composing word vectors, ranging from simple, parameter-free vector addition to
fully supervised deep-neural-network-based systems. We focus here on the models
illustrated in Table 1; see Dinu, Pham, and Baroni (2013) for the original model refer-
ences. As an empirical test case, we consider adjective–noun composition.
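For concreteness, the composition functions of Table 1 can be sketched in a few lines of numpy; the shapes, parameter values, and function names below are our own illustration, not the DISSECT implementation used later in the article.

```python
import numpy as np

def additive(a, b):
    return a + b                     # plain vector sum

def multiplicative(a, b):
    return a * b                     # pointwise (Hadamard) product

def weighted_additive(a, b, alpha, beta):
    return alpha * a + beta * b      # alpha, beta: learned scalars

def full_additive(a, b, X, Y):
    return X @ a + Y @ b             # X, Y: matrices for the syntactic slots

def lexical_function(A, b):
    return A @ b                     # A: matrix encoding the functor word (the adjective)

def shifted_additive(a, b, shift):
    return a + b + shift             # shift: learned constant correction vector

# toy usage with 4-dimensional vectors (values are arbitrary)
rng = np.random.default_rng(0)
a, b, shift = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
X, Y, A = (rng.normal(size=(4, 4)) for _ in range(3))
print(weighted_additive(a, b, 0.6, 0.8), lexical_function(A, b))
```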
2. A General Result on the PMI Dimensions of Phrases
An ideal composition model should be able to reconstruct, at least for sufficiently
frequent phrases, the corpus-extracted vector of the phrase ab from vectors of its parts
a, b. When vector dimensions encode PMI values, for each context c, the composition
model has to predict PMI(ab, c) between phrase ab and context c. Equation (1) shows
that there is a mathematical relation between PMI(ab, c) and the PMI values of the
phrase components PMI(a, c) and PMI(b, c):
$$
\begin{aligned}
\mathrm{PMI}(ab, c) &= \log\frac{P(ab \mid c)}{P(ab)} = \log\frac{P(a \mid c) \cdot P(ab \mid a \wedge c)}{P(a) \cdot P(ab \mid a)} \\
 &= \log\left(\frac{P(a \mid c) \cdot P(ab \mid a \wedge c)}{P(a) \cdot P(ab \mid a)} \cdot \frac{P(b \mid c) \cdot P(b)}{P(b \mid c) \cdot P(b)}\right) \\
 &= \log\left(\frac{P(a \mid c)}{P(a)} \cdot \frac{P(b \mid c)}{P(b)} \cdot \frac{P(ab \mid a \wedge c)}{P(b \mid c)} \cdot \frac{P(b)}{P(ab \mid a)}\right) \\
 &= \log\frac{P(a \mid c)}{P(a)} + \log\frac{P(b \mid c)}{P(b)} + \log\frac{P(ab \mid a \wedge c)}{P(b \mid c)} - \log\frac{P(ab \mid a)}{P(b)} \\
 &= \mathrm{PMI}(a, c) + \mathrm{PMI}(b, c) + \mathrm{PMI}(ab \mid c) - \mathrm{PMI}(ab) \qquad (1)
\end{aligned}
$$
To make sense of this derivation, observe that P(ab) and P(ab | c) pertain to a
phrase ab where a and b are linked by a specific syntactic relation. Now, whenever
the phrase ab occurs, a must also occur, and thus P(ab) = P(ab ∧ a), and similarly
P(ab | c) = P(ab ∧ a | c). This connects the PMI of a phrase (based on counts of ab
linked by a syntactic relation) to the PMI of the constituents (based on counts of the
constituents in all contexts). Consequently, we can meaningfully relate PMI(ab, c)
(as computed to calculate phrase vector dimensions) to PMI(a, c) and PMI(b, c) (as
computed to calculate single word dimensions).
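The identity in Equation (1) can also be checked numerically: the sketch below builds an arbitrary toy joint distribution over the events "word a occurs", "word b occurs", "the phrase ab occurs", and "context c is present" (all counts are made up) and verifies that the four PMI terms add up exactly.

```python
import numpy as np
from itertools import product

# Toy joint counts over four binary events; the only constraint imposed is
# that the phrase ab entails the occurrence of its parts a and b.
rng = np.random.default_rng(1)
counts = {}
for a_in, b_in, ab_in, c_in in product([0, 1], repeat=4):
    impossible = ab_in and not (a_in and b_in)
    counts[(a_in, b_in, ab_in, c_in)] = 0 if impossible else int(rng.integers(1, 100))
total = sum(counts.values())

def p(pred):
    # probability of the outcomes satisfying pred
    return sum(n for ev, n in counts.items() if pred(*ev)) / total

P_a, P_b, P_ab = p(lambda a, b, ab, c: a), p(lambda a, b, ab, c: b), p(lambda a, b, ab, c: ab)
P_c = p(lambda a, b, ab, c: c)
P_a_c  = p(lambda a, b, ab, c: a and c)  / P_c
P_b_c  = p(lambda a, b, ab, c: b and c)  / P_c
P_ab_c = p(lambda a, b, ab, c: ab and c) / P_c

pmi_a_c, pmi_b_c = np.log(P_a_c / P_a), np.log(P_b_c / P_b)
pmi_ab_c = np.log(P_ab_c / P_ab)                     # phrase-context PMI
pmi_ab = np.log(P_ab / (P_a * P_b))                  # phrase cohesion PMI
pmi_ab_given_c = np.log(P_ab_c / (P_a_c * P_b_c))    # cohesion PMI within context c

lhs = pmi_ab_c
rhs = pmi_a_c + pmi_b_c + pmi_ab_given_c - pmi_ab
assert np.isclose(lhs, rhs)                          # Equation (1) holds exactly
print(lhs, rhs)
```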
Equation (1) unveils a systematic relation between the PMI value in a phrase
vector dimension and the value predicted by the additive approach to composition.
Indeed, PMI(ab, c) equals PMI(a, c) + PMI(b, c), shifted by a correction ∆cPMI(ab) =
PMI(ab | c) − PMI(ab), measuring how the context changes the tendency of two words
a, b to form a phrase. ∆cPMI(ab) includes any non-trivial effects of composition arising from the
interaction between the occurrence of words a, b, c. Absence of non-trivial interaction
of this kind is a reasonable null hypothesis, under which the association of phrase
components with each other is not affected by context at all: PMI(ab | c) = PMI(ab).
Under this null hypothesis, addition should accurately predict PMI values for phrases.
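In other words, the only quantity a composition model needs to estimate beyond plain addition is this residual. A trivial helper makes this explicit (the function name is ours; the example values are the sample averages reported below in Section 3.2):

```python
def delta_c_pmi(pmi_phrase_c, pmi_word1_c, pmi_word2_c):
    """Deviation of the observed phrase-context PMI from the additive prediction.

    Under the null hypothesis that context does not affect the association
    between the two words, this is zero for every context dimension c."""
    return pmi_phrase_c - (pmi_word1_c + pmi_word2_c)

# with the average values reported in Section 3.2 the residual is negative:
print(delta_c_pmi(0.80, 0.55, 0.63))   # -0.38: the whole is less than the sum
```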
3. Empirical Observations
We have shown that vector addition should perfectly predict phrase vectors under
the idealized assumption that the context’s effect on the association between words
in the phrase, ∆cPMI(ab) = PMI(ab | c) − PMI(ab), is negligible. ∆cPMI(ab) equals the
deviation of the actual PMI(ab, c) from the additive ideal, which any vector composition
model is essentially trying to estimate. Let us now investigate how well actual vectors
of English phrases fit the additive ideal, and, if they do not fit, how good the existing
composition methods are at predicting deviations from the ideal.
3.1 Experimental Setup
We focus on adjective–noun (AN) phrases as a representative case. We used 2.8 billion
tokens comprising ukWaC, Wackypedia, and the British National Corpus,2 extracting the
12.6K ANs that occurred at least 1K times. As contexts, we collected sentence-internal
co-occurrence counts with the 872 nouns3 occurring at least 150K times in the corpus.
PMI values were computed by standard maximum-likelihood estimation.
We separated a random subset of 6K ANs to train composition models. We consider
two versions of the corresponding constituent vectors as input to composition: plain
PMI vectors (with zero co-occurrence rates conventionally converted to 0 instead of
−∞) and positive PMI (PPMI) vectors (all non-positive PMI values converted
A 0). The latter transformation is common in the literature. Model parameters were
estimated using DISSECT (Dinu, Pham, and Baroni 2013), whose training objective is
to approximate corpus-extracted phrase vectors, a criterion especially appropriate for
our purposes.
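The vector-building step can be sketched as follows, following the conventions just described (maximum-likelihood estimation, zero counts mapped to 0, non-positive values clipped for PPMI); the function and the toy counts are our own illustration, not the actual pipeline.

```python
import numpy as np

def pmi_matrix(counts, positive=False):
    """Rows are targets (words or phrases), columns are context nouns.

    PMI is estimated by maximum likelihood from the raw counts; cells with zero
    co-occurrence are conventionally set to 0 rather than -inf, and with
    positive=True all non-positive values are clipped to 0 (PPMI)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_joint = counts / total
    p_target = counts.sum(axis=1, keepdims=True) / total
    p_context = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        pmi = np.log(p_joint / (p_target * p_context))
    pmi[counts == 0] = 0.0               # zero counts -> 0 instead of -inf
    if positive:
        pmi = np.maximum(pmi, 0.0)       # PPMI
    return pmi

# toy example: 3 targets x 4 context nouns
toy_counts = [[10, 0, 3, 7],
              [ 2, 5, 0, 1],
              [ 8, 1, 4, 0]]
print(pmi_matrix(toy_counts))
print(pmi_matrix(toy_counts, positive=True))
```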
We report results based on the 1.8 million positive PMI dimensions of the 4.7K
phrase vectors that were not used for training.4 On average a phrase had non-zero
co-occurrence with 84.8% of the context nouns, over half of which gave positive PMI
values. We focus on positive dimensions because negative association values are harder
to interpret and noisier; furthermore, −∞ cases must be set to some arbitrary value, E
most practical applications set all negative values to 0 anyway (PPMI). We also repeated
the experiments including negative observed values, with a similar pattern of results.
3.2 Divergence from Additive
We first verify how the observed PMI values of phrases depart from those predicted
by addition: In other words, how much ∆cPMI(ab) ≠ 0 in practice. We observe that
2 http://wacky.sslmit.unibo.it/, http://www.natcorp.ox.ac.uk/.
3 Only nouns were used to avoid adding the context word’s part of speech as a parameter of the analysis.
The number of contexts used was restricted by the consideration that training the lexical function model
for larger dimensionalities is problematic.
4 About 1.9K ANs containing adjectives occurring with fewer than 5 context nouns in the training set were
removed from the test set at this point, because we would not have had enough data to train the
corresponding lexical function model for those adjectives.
PMI(AN, c) has a strong tendency to be lower than the sum of the PMIs of the phrase's parts
with respect to the same context. In our sample, average PMI(AN, c) was 0.80, and average
PMI(A, c) and PMI(N, c) were 0.55 and 0.63, respectively.5 Over 70% of positive PMI
values in our sample are lower than additive (PMI(AN, c) < PMI(A, c) + PMI(N, c)); a
vast majority of phrases (over 92%) have on average a negative divergence from the
additive prediction:

$$\frac{1}{|C|}\sum_{c \in C}\big[\mathrm{PMI}(AN, c) - (\mathrm{PMI}(A, c) + \mathrm{PMI}(N, c))\big] < 0$$

The tendency for phrases to have lower PMI than predicted by the additive idealization
is quite robust. It holds whether or not we restrict the data to items with positive PMI of
constituent words (PMI(A, c) > 0, PMI(N, c) > 0), if we convert all negative PMI values of
constituents to 0, and also if we extend the test set to include negative PMI values of
phrases (PMI(AN, c) < 0).
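Statistics of this kind are straightforward to compute once the phrase and constituent PMI matrices are row-aligned; the following sketch (our own variable names, toy data standing in for the real matrices) returns the share of dimensions below the additive prediction and the share of phrases whose average divergence is negative.

```python
import numpy as np

def divergence_stats(pmi_an, pmi_a, pmi_n, mask):
    """pmi_an, pmi_a, pmi_n: row-aligned (phrases x contexts) PMI matrices for the
    AN phrases, their adjectives, and their nouns; mask selects the cells under
    study (here, the cells where the phrase PMI is positive)."""
    deviation = pmi_an - (pmi_a + pmi_n)                  # Delta_c PMI(AN) per cell
    share_cells_below_additive = (deviation[mask] < 0).mean()
    per_phrase = np.array([row[m].mean()                  # mean deviation per phrase
                           for row, m in zip(deviation, mask) if m.any()])
    share_phrases_negative = (per_phrase < 0).mean()
    return share_cells_below_additive, share_phrases_negative

# toy stand-ins for the real matrices
rng = np.random.default_rng(2)
pmi_a, pmi_n = rng.normal(0.5, 1.0, (5, 8)), rng.normal(0.6, 1.0, (5, 8))
pmi_an = pmi_a + pmi_n + rng.normal(-0.4, 0.5, (5, 8))
print(divergence_stats(pmi_an, pmi_a, pmi_n, mask=pmi_an > 0))
```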
A possible reason for the mostly negative deviation from addition comes from the
information-theoretic nature of PMI. Recall that PMI(ab) measures how informative
phrase components a, b are about each other. The negative deviation from addition
∆cPMI(ab) means that context diminishes the mutual information of a and b. And
indeed it is only natural that the context itself is usually informative. Concretely, it
can be informative in multiple ways. In one typical scenario, the two words being
composed (and the phrase) share the context topic (e.g., logical and operator in the context
of calculus, connected by the topic of mathematical logic). In this case there is little
additional PMI gained by composing such words because they share a large amount
of co-occurring contexts. Take the idealized case in which the shared underlying topic
increases the probability of A, N, and AN by a constant factor k, so that PMI(A, c) = PMI(N, c) =
PMI(AN, c) = log k. Then the association (PMI) of AN decreases by log k in the presence
of topic-related words, ∆cPMI(AN) = PMI(AN, c) − (PMI(A, c) + PMI(N, c)) = − log k.
The opposite case of negative association between context and AN is not symmetric to
the positive association just discussed (if it were, it would have produced a positive
deviation from the additive model). Negative association is in general less pronounced
than positive association: In our sample, positive PMI values cover over half the co-
occurrence table; furthermore, positive PMIs are on average greater in absolute value
than negative ones. Importantly, two words in a phrase will often disambiguate each
other, making the phrase less probable in a given context than expected from the
probabilities of its parts: logical operator is very unlikely in the context of automobile even
though operator in the sense of a person operating a machine and logical in the non-
technical sense are perfectly plausible in the same context. Such disambiguation cases,
we believe, largely account for negative deviation from addition in the case of negative
components.
One can think of minimal adjustments to the additive model correcting for system-
atic PMI overestimation. Here, we experiment with a shifted additive model obtained
by subtracting a constant vector from the summed PMI vector. Specifically, we obtained
shifted vectors by computing, for each dimension, the average deviation from the
additive model in the training data.
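A minimal sketch of this estimation, under the assumption that training phrases and their constituents are stored as row-aligned PMI matrices (variable and function names are ours):

```python
import numpy as np

def fit_shift(pmi_phrase_train, pmi_a_train, pmi_b_train):
    # per-dimension average deviation from the additive prediction on training phrases
    return (pmi_phrase_train - (pmi_a_train + pmi_b_train)).mean(axis=0)

def compose_shifted_additive(pmi_a, pmi_b, shift):
    # additive composition corrected by the learned constant vector
    return pmi_a + pmi_b + shift

# toy usage: 100 training phrases, 12 context dimensions, true shift around -0.4
rng = np.random.default_rng(3)
A, B = rng.normal(size=(100, 12)), rng.normal(size=(100, 12))
AN = A + B - 0.4 + rng.normal(scale=0.1, size=(100, 12))
shift = fit_shift(AN, A, B)
print(shift.round(2))                    # each dimension is roughly -0.4
print(compose_shifted_additive(A[0], B[0], shift))
```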
3.3 Approximation to Empirical Phrase PMI by Composition Models
We have seen that addition would be a reasonable approximation to PMI vector com-
position if the influence of context on the association between parts of the phrase
5 We cannot claim this divergence on unattested phrase-context co-occurrences because those should give
rise to very small, probably negative, PMI values.
348
Paperno and Baroni
When the Whole Is Less Than the Sum of Its Parts
turned out to be negligible. Empirically, phrase–context PMI deviates systematically and
negatively from the sum of the word–context PMIs. Crucially, an adequate vector composition
method should capture this deviation from the additive ideal. The next step is to test
existing vector composition models on how well they achieve this goal.
To assess approximation quality, we compare the PMI(AN, c) values predicted by
each composition model to the ones directly derived from the corpus, using mean
squared error as the figure of merit. Besides the full test set (all in Table 2), we consider some
informative subsets. The pos subset includes the 40K (AN, c) pairs with the largest positive
error with respect to the additive prediction (above 1.264). The neg subset includes the
40K dimensions with the largest negative error with respect to additive (below −1.987).
Finally, the near-0 subset includes the 20K items with the smallest positive errors and
the 20K items with the smallest negative errors with respect to additive (between −0.026
and 0.023). Each of the three subsets constitutes about 2% of the all data set.
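The evaluation thus reduces to a mean squared error over (AN, c) cells, computed on the full set and on subsets selected by the error of the additive prediction; the sketch below uses the thresholds reported above, with simulated data in place of the real values and our own function names.

```python
import numpy as np

def mse(predicted, observed):
    return np.mean((predicted - observed) ** 2)

def subset_masks(observed, additive_pred):
    # error of the observed phrase PMI with respect to the additive prediction
    err = observed - additive_pred
    pos    = err > 1.264                        # largest positive deviations
    neg    = err < -1.987                       # largest negative deviations
    near_0 = (err > -0.026) & (err < 0.023)     # smallest deviations
    return pos, neg, near_0

# simulated stand-ins for observed PMI(AN, c) values and model predictions
rng = np.random.default_rng(4)
observed      = rng.normal(0.8, 1.5, size=20_000)
additive_pred = observed - rng.normal(-0.4, 1.0, size=20_000)
model_pred    = observed + rng.normal(0.0, 0.5, size=20_000)

pos, neg, near_0 = subset_masks(observed, additive_pred)
for name, mask in [("all", slice(None)), ("pos", pos), ("neg", neg), ("near-0", near_0)]:
    print(name, round(mse(model_pred[mask], observed[mask]), 3))
```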
By looking at Table 2, we observe first of all that addition’s tendency to overestimate
phrase PMI values puts it behind other models in the all and neg test sets, even behind
the multiplicative method, which, unlike others, has no theoretical motivation. The
relatively good result of the multiplicative model can be explained through the patterns
observed earlier: PMI(AN,c) is typically just above PMI(A,c) and PMI(N,c) for each
of the phrase components (median values 0.66, 0.5, and 0.56, respectively). Adding
PMI(A,c) and PMI(N,c) makes the prediction further above the observed PMI(AN,c)
than their product is below it (when applied to median values, we obtain deviations
of |0.66 − (0.5 × 0.56)| = 0.38 for multiplication and |0.66 − (0.5 + 0.56)| = 0.4 for addi-
tion). As one could expect, shifted addition is on average closer to actual PMI values
than plain addition. However, weighted addition provides better approximations to the
observed values. Shifted addition behaves too conservatively with respect to addition,
providing a good fit when observed PMI is close to additive (near-0 subset), but only
bringing about a small improvement in the all-important negative subset. Weighted ad-
dition, on the other hand, brings about large improvements in approximating precisely
the negative subset. Weighted addition is the best model overall, outperforming the
parameter-rich full additive and lexical function models (the former only by a small
margin). Confirming the effectiveness of the non-negative transform, PPMI-trained
models are more accurate than PMI-trained ones, although the latter provide the best fit
for the extreme negative subset, where component negative values are common.
As discussed before, the observed deviation from additive PMI is mostly negative,
due partly to the shared underlying topic effect and partly to the disambiguation effect
discussed in Section 3.2.
Table 2
Mean squared error of different models' predictions, trained on PMI (left) vs. PPMI vectors (right).

                      PMI                                 PPMI
model                 all    pos    neg    near-0         all    pos    neg    near-0
additive              0.75   3.11   5.86   (≈0.00)        0.71   1.96   5.88   (0.02)
multiplicative        0.61   2.50   3.38   0.55           0.59   5.92   3.40   0.62
weighted additive     0.39   2.52   0.41   0.35           0.32   3.01   0.62   0.27
full additive         0.56   3.02   0.68   0.54           0.34   1.93   0.63   0.29
lexical function      0.73   2.92   0.74   0.68           0.45   2.01   0.68   0.37
shifted additive      0.66   4.66   3.01   0.39           0.48   2.16   3.52   0.18
In both cases, whenever the PMI of the constituents (PMI(a, c) and/or
PMI(b, c)) is larger, the deviation from additive (PMI(ab, c) − (PMI(a, c) + PMI(b, c))) is
likely to become smaller. Weighted addition captures this, setting the negative cor-
rection of the additive model to be a linear function of the PMI values of the phrase
components. The full additive model, which also showed competitive results overall,
might perform better with more training data or with lower vector dimensionality
(in the current set-up, there were just about three training examples for each parameter
to set).
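One way to see why weighted addition can capture such a linear correction is that its two scalars can be fit by least squares against the observed phrase vectors, which is in the spirit of DISSECT's objective of approximating corpus-extracted phrase vectors; the sketch below uses synthetic data and our own function name.

```python
import numpy as np

def fit_weighted_additive(pmi_a, pmi_b, pmi_phrase):
    """Least-squares alpha, beta such that alpha*a + beta*b best approximates the
    observed phrase PMI values over all training (phrase, context) cells."""
    X = np.column_stack([pmi_a.ravel(), pmi_b.ravel()])
    (alpha, beta), *_ = np.linalg.lstsq(X, pmi_phrase.ravel(), rcond=None)
    return alpha, beta

# synthetic data: phrase PMI is a damped sum of the constituents' PMIs plus noise
rng = np.random.default_rng(5)
A, N = rng.normal(0.5, 1.0, (200, 30)), rng.normal(0.6, 1.0, (200, 30))
AN = 0.7 * A + 0.8 * N + rng.normal(scale=0.1, size=(200, 30))
print(fit_weighted_additive(A, N, AN))   # approximately (0.7, 0.8)
```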
4. Conclusions
We have shown, based on the mathematical definition of PMI, that addition is a sys-
tematic component of PMI vector composition. The remaining component is also an
interpretable value, measuring the impact of context on the phrase’s internal PMI. In
practice, this component is typically negative. Empirical observations about adjective-
noun phrases show that systematic deviations from addition are largely accounted for
by a negative shift ∆cPMI(ab), which might be proportional to the composed vectors’
dimensions (as partially captured by the weighted additive method). Further studies
should consider other constructions and types of context to confirm the generality of
our results.
Acknowledgments
We would like to thank the Computational
Linguistics editor and reviewers: Yoav
Goldberg, Omer Levy, Katya Tentori,
Germán Kruszewski, Nghia Pham, and
the other members of the Composes team
for useful feedback. Our work is funded
by ERC 2011 Starting Independent
Research Grant n. 283554 (COMPOSES).
References
Bullinaria, John and Joseph Levy. 2012.
Extracting semantic representations from
word co-occurrence statistics: Stop-lists,
stemming and SVD. Behavior Research
Methods, 44:890–907.
Church, Kenneth and Peter Hanks. 1990.
Word association norms, mutual
information, and lexicography.
Computational Linguistics,
16(1):22–29.
Dinu, Georgiana, Nghia The Pham, and
Marco Baroni. 2013. General estimation
and evaluation of compositional
distributional semantic models. In
Proceedings of ACL Workshop on
Continuous Vector Space Models and
their Compositionality, pages 50–58,
Sofia.
Mitchell, Jeff and Mirella Lapata. 2010.
Composition in distributional models of
semantics. Cognitive Science,
34(8):1388–1429.
Turney, Peter and Patrick Pantel. 2010. From
frequency to meaning: Vector space
models of semantics. Journal of Artificial
Intelligence Research, 37:141–188.