When the Whole Is Not Greater Than the Combination of Its Parts:
A “Decompositional” Look at Compositional Distributional Semantics∗

Fabio Massimo Zanzotto
University of Rome “Tor Vergata”

Lorenzo Ferrone
University of Rome “Tor Vergata”

Marco Baroni
University of Trento
Distributional semantics has been extended to phrases and sentences by means of composition
operations. We look at how these operations affect similarity measurements, showing that simi-
larity equations of an important class of composition methods can be decomposed into operations
performed on the subparts of the input phrases. This establishes a strong link between these
models and convolution kernels.
1. Introduction
Distributional semantics approximates word meanings with vectors tracking co-
occurrence in corpora (Turney and Pantel 2010). Recent work has extended this ap-
proach to phrases and sentences through vector composition (Clark 2015). Resulting
compositional distributional semantic models (CDSMs) estimate degrees of seman-
tic similarity (or, more generally, relatedness) between two phrases: A good CDSM
might tell us that green bird is closer to parrot than to pigeon, useful for tasks such as
paraphrasing.
We take a mathematical look[1] at how the composition operations postulated by
CDSMs affect similarity measurements involving the vectors they produce for phrases
or sentences. We show that, for an important class of composition methods, encom-
passing at least those based on linear transformations, the similarity equations can
be decomposed into operations performed on the subparts of the input phrases,
∗ Department of Enterprise Engineering, University of Rome “Tor Vergata,” Viale del Politecnico, 1, 00133 Rome, Italy. E-mail: fabio.massimo.zanzotto@uniroma2.it.
[1] Ganesalingam and Herbelot (2013) also present a mathematical investigation of CDSMs. However, except for the tensor product (a composition method we do not consider here as it is not empirically effective), they do not look at how composition strategies affect similarity comparisons.

Original submission received: 10 December 2013; revision received: 26 May 2014; accepted for publication: 1 August 2014.
doi:10.1162/COLI_a_00215
© 2015 Association for Computational Linguistics
Table 1
Compositional Distributional Semantic Models: $\vec{a}$, $\vec{b}$, and $\vec{c}$ are distributional vectors representing the words a, b, and c, respectively; matrices $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ are constant across a phrase type, corresponding to syntactic slots; the matrix $\mathbf{A}$ and the third-order tensor $\mathbb{B}$ represent the predicate words a in the first phrase and b in the second phrase, respectively.

| Model | 2-word phrase | 3-word phrase | Reference |
| Additive | $\vec{a} + \vec{b}$ | $\vec{a} + \vec{b} + \vec{c}$ | Mitchell and Lapata (2008) |
| Multiplicative | $\vec{a} \odot \vec{b}$ | $\vec{a} \odot \vec{b} \odot \vec{c}$ | Mitchell and Lapata (2008) |
| Full Additive | $\mathbf{X}\vec{a} + \mathbf{Y}\vec{b}$ | $\mathbf{X}\vec{a} + \mathbf{Y}\vec{b} + \mathbf{Z}\vec{c}$ | Guevara (2010), Zanzotto et al. (2010) |
| Lexical Function | $\mathbf{A}\vec{b}$ | $\vec{a}\,\mathbb{B}\,\vec{c}$ | Coecke, Sadrzadeh, and Clark (2010) |
and typically factorized into terms that reflect the linguistic structure of the input.
This establishes a strong link between CDSMs and convolution kernels (Haussler
1999), which act in the same way. We thus refer to our claim as the “Convolution
Conjecture.”
We focus on the models in Table 1. These CDSMs all apply linear methods, and we suspect that linearity is a sufficient (but not necessary) condition for the Convolution Conjecture to hold. We will first illustrate the conjecture for the linear methods, and then briefly consider two nonlinear approaches: the dual space model of Turney (2012), for which the conjecture holds, and a representative of the recent strand of work on neural-network models of composition, for which it does not.
2. Mathematical Preliminaries
Vectors are represented as small letters with an arrow, $\vec{a}$, with elements $a_i$; matrices as bold capital letters, $\mathbf{A}$, with elements $A_{ij}$; and third- or fourth-order tensors as capital letters in the form $\mathbb{A}$, with elements $A_{ijk}$ or $A_{ijkh}$. The symbol $\odot$ represents the element-wise product and $\otimes$ the tensor product. The dot product is $\langle\vec{a},\vec{b}\rangle$, and the Frobenius product, that is, the generalization of the dot product to matrices and higher-order tensors, is represented as $\langle\mathbf{A},\mathbf{B}\rangle_F$ and $\langle\mathbb{A},\mathbb{B}\rangle_F$. The Frobenius product acts on vectors, matrices, and third-order tensors as follows:

$$\langle\vec{a},\vec{b}\rangle_F = \sum_i a_i b_i = \langle\vec{a},\vec{b}\rangle \qquad \langle\mathbf{A},\mathbf{B}\rangle_F = \sum_{ij} A_{ij}B_{ij} \qquad \langle\mathbb{A},\mathbb{B}\rangle_F = \sum_{ijk} A_{ijk}B_{ijk} \quad (1)$$

A simple property that relates the dot product between two vectors and the Frobenius product between two general tensors is the following:

$$\langle\vec{a},\vec{b}\rangle = \langle\mathbf{I},\ \vec{a}\,\vec{b}^{\,T}\rangle_F \quad (2)$$

where $\mathbf{I}$ is the identity matrix. The dot product of $\mathbf{A}\vec{x}$ and $\mathbf{B}\vec{y}$ can be rewritten as:

$$\langle\mathbf{A}\vec{x},\mathbf{B}\vec{y}\rangle = \langle\mathbf{A}^T\mathbf{B},\ \vec{x}\,\vec{y}^{\,T}\rangle_F \quad (3)$$
Let $\mathbb{A}$ and $\mathbb{B}$ be two third-order tensors and $\vec{x}$, $\vec{y}$, $\vec{a}$, $\vec{c}$ four vectors. It can be shown that:

$$\langle\vec{x}\,\mathbb{A}\,\vec{y},\ \vec{a}\,\mathbb{B}\,\vec{c}\rangle = \Big\langle \sum_j (\mathbb{A}\otimes\mathbb{B})_j,\ \vec{x}\otimes\vec{y}\otimes\vec{a}\otimes\vec{c} \Big\rangle_F \quad (4)$$

where $\mathbb{C} = \sum_j(\mathbb{A}\otimes\mathbb{B})_j$ is a non-standard way to indicate the tensor contraction of the tensor product between two third-order tensors. In this particular tensor contraction, the elements $C_{iknm}$ of the resulting fourth-order tensor $\mathbb{C}$ are $C_{iknm} = \sum_j A_{ijk}B_{njm}$. The elements $D_{iknm}$ of the tensor $\mathbb{D} = \vec{x}\otimes\vec{y}\otimes\vec{a}\otimes\vec{c}$ are $D_{iknm} = x_i y_k a_n c_m$.
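The identities in Equations (2)–(4) can also be checked numerically. The following sketch is ours, not part of the original article: it uses NumPy with small random tensors, and all names and dimensions are arbitrary; $\vec{x}\,\mathbb{A}\,\vec{y}$ is read as the vector with components $\sum_{i,k} x_i A_{ijk} y_k$, consistent with the element-wise definitions above.

```python
# Numerical sanity check of Equations (2)-(4) on random data.
import numpy as np

rng = np.random.default_rng(0)
n = 5
a, b, x, y, c = (rng.standard_normal(n) for _ in range(5))
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
AA, BB = rng.standard_normal((n, n, n)), rng.standard_normal((n, n, n))  # third-order tensors

def frob(S, T):
    """Frobenius product of Equation (1): sum of element-wise products."""
    return np.sum(S * T)

# Equation (2): <a, b> = <I, a b^T>_F
assert np.isclose(a @ b, frob(np.eye(n), np.outer(a, b)))

# Equation (3): <A x, B y> = <A^T B, x y^T>_F
assert np.isclose((A @ x) @ (B @ y), frob(A.T @ B, np.outer(x, y)))

# Equation (4): <x AA y, a BB c> = <sum_j (AA ⊗ BB)_j, x ⊗ y ⊗ a ⊗ c>_F,
# reading x AA y as the vector with components sum_{i,k} x_i A_ijk y_k (index j free).
xAAy = np.einsum('i,ijk,k->j', x, AA, y)
aBBc = np.einsum('i,ijk,k->j', a, BB, c)
C = np.einsum('ijk,njm->iknm', AA, BB)       # C_iknm = sum_j A_ijk B_njm
D = np.einsum('i,k,n,m->iknm', x, y, a, c)   # D_iknm = x_i y_k a_n c_m
assert np.isclose(xAAy @ aBBc, frob(C, D))
```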
3. Formalizing the Convolution Conjecture
Structured Objects. In line with Haussler (1999), a structured object $x \in X$ is either a terminal object, which cannot be further decomposed, or a non-terminal object, which can be decomposed into $n$ subparts. We indicate with $\mathbf{x} = (x_1, \ldots, x_n)$ one such decomposition, where the subparts $x_i \in X$ are structured objects themselves. The set $X$ is the set of the structured objects and $T_X \subseteq X$ is the set of the terminal objects. A structured object $x$ can be anything according to the representational needs. Here, $x$ is a representation of a text fragment, and so it can be a sequence of words, a sequence of words along with their parts of speech, a tree structure, and so on. The set $R(x)$ is the set of decompositions of $x$ relevant to define a specific CDSM. Note that a given decomposition of a structured object $x$ need not contain all the subparts of the original object. For example, let us consider the phrase $x =$ tall boy. We can then define $R(x) = \{(\textit{tall}, \textit{boy}), (\textit{tall}), (\textit{boy})\}$. This set contains the three possible decompositions of the phrase: $(\underbrace{\textit{tall}}_{x_1}, \underbrace{\textit{boy}}_{x_2})$, $(\underbrace{\textit{tall}}_{x_1})$, and $(\underbrace{\textit{boy}}_{x_1})$.
Recursive formulation of CDSM. A CDSM can be viewed as a function $f$ that acts recursively on a structured object $x$. If $x$ is a non-terminal object,

$$f(x) = \bigoplus_{\mathbf{x}\in R(x)} \gamma\big(f(x_1), f(x_2), \ldots, f(x_n)\big) \quad (5)$$

where $R(x)$ is the set of relevant decompositions, $\bigoplus$ is a repeated operation on this set, and $\gamma$ is a function defined on the $f(x_i)$, where the $x_i$ are the subparts of a decomposition of $x$. If $x$ is a terminal object, $f(x)$ is directly mapped to a tensor. The function $f$ may operate differently on different kinds of structured objects, with tensor degree varying accordingly. The set $R(x)$ and the functions $f$, $\gamma$, and $\bigoplus$ depend on the specific CDSM, and the same CDSM might be susceptible to alternative analyses satisfying the form in Equation (5). As an example, under Additive, $x$ is a sequence of words and $f$ is

$$f(x) = \begin{cases} \sum_{\mathbf{y}\in R(x)} f(\mathbf{y}) & \text{if } x \notin T_X \\ \vec{x} & \text{if } x \in T_X \end{cases} \quad (6)$$
where $R((w_1, \ldots, w_n)) = \{(w_1), \ldots, (w_n)\}$. The repeated operation $\bigoplus$ corresponds to summing and $\gamma$ is the identity. For Multiplicative we have

$$f(x) = \begin{cases} \bigodot_{\mathbf{y}\in R(x)} f(\mathbf{y}) & \text{if } x \notin T_X \\ \vec{x} & \text{if } x \in T_X \end{cases} \quad (7)$$

where $R(x) = \{(w_1, \ldots, w_n)\}$ (a single trivial decomposition including all subparts). With a single decomposition, the repeated operation reduces to a single term, and here $\gamma$ is the element-wise product (it will be clear subsequently, when we apply the Convolution Conjecture to these models, why we are assuming different decomposition sets for Additive and Multiplicative).
Definition 1 (Convolution Conjecture)
For every CDSM $f$, along with its decomposition set $R(x)$, there exist functions $K$, $K_i$, and a function $g$ such that:

$$K(f(x), f(y)) = \sum_{\substack{\mathbf{x}\in R(x)\\ \mathbf{y}\in R(y)}} g\big(K_1(f(x_1), f(y_1)), K_2(f(x_2), f(y_2)), \ldots, K_n(f(x_n), f(y_n))\big) \quad (8)$$

The Convolution Conjecture postulates that the similarity $K(f(x), f(y))$ between the tensors $f(x)$ and $f(y)$ is computed by combining operations on the subparts, that is, the $K_i(f(x_i), f(y_i))$, using the function $g$. This is exactly what happens in convolution kernels (Haussler 1999). $K$ is usually the dot product, but this is not necessary: we will show that for the dual-space model of Turney (2012), $K$ turns out to be the fourth root of a Frobenius product.
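To make the shape of Equation (8) concrete, here is a small sketch of ours (not from the paper): the right-hand side written as plain Python, with the decomposition function R, the part-wise kernels K_i, and the combiner g left as model-specific placeholders that Section 4 instantiates.

```python
# Right-hand side of Equation (8): sum, over all pairs of decompositions of x and y,
# of g applied to the part-wise kernel values K_i(f(x_i), f(y_i)).
def convolution_form(x, y, f, R, g, Ks):
    total = 0.0
    for xs in R(x):                 # xs = (x_1, ..., x_n), one decomposition of x
        for ys in R(y):             # ys = (y_1, ..., y_n), one decomposition of y
            parts = [K(f(xi), f(yi)) for K, xi, yi in zip(Ks, xs, ys)]
            total += g(*parts)
    return total
```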
4. Comparing Composed Phrases
We now illustrate how the Convolution Conjecture (CC) applies to the considered
CDSMs, exemplifying with adjective–noun and subject–verb–object phrases. Without
loss of generality we use tall boy and red cat for adjective–noun phrases and goats eat
grass and cows drink water for subject–verb–object phrases.
Additive Model. K and Ki are dot products, g is the identity function, and f is as in
Equation (6). The structure of the input is a word sequence (i.e., x = (w1 w2)) and the
relevant decompositions consist of these single words, R(x) = {(w1), (w2)}. Then
$$\begin{aligned} K(f(\textit{tall boy}), f(\textit{red cat})) &= \langle\vec{tall} + \vec{boy},\ \vec{red} + \vec{cat}\rangle = \\ &= \langle\vec{tall},\vec{red}\rangle + \langle\vec{tall},\vec{cat}\rangle + \langle\vec{boy},\vec{red}\rangle + \langle\vec{boy},\vec{cat}\rangle = \\ &= \sum_{\substack{\mathbf{x}\in\{\textit{tall},\textit{boy}\}\\ \mathbf{y}\in\{\textit{red},\textit{cat}\}}} \langle f(\mathbf{x}), f(\mathbf{y})\rangle = \sum_{\substack{\mathbf{x}\in\{\textit{tall},\textit{boy}\}\\ \mathbf{y}\in\{\textit{red},\textit{cat}\}}} K(f(\mathbf{x}), f(\mathbf{y})) \end{aligned} \quad (9)$$
The CC form of Additive shows that the overall dot product can be decomposed
into dot products of the vectors of the single words. Composition does not add any
further information. These results can be easily extended to longer phrases and to
phrases of different length.
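A quick NumPy check of Equation (9), using random stand-in vectors for the four words (ours, purely illustrative):

```python
# Equation (9): the composed similarity equals the sum of all pairwise word similarities.
import numpy as np

rng = np.random.default_rng(1)
tall, boy, red, cat = (rng.standard_normal(10) for _ in range(4))

lhs = (tall + boy) @ (red + cat)                           # similarity of the composed phrases
rhs = sum(x @ y for x in (tall, boy) for y in (red, cat))  # decomposed CC form
assert np.isclose(lhs, rhs)
```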
Multiplicative Model. $K$ and $g$ are dot products, the $K_i$ are component-wise products, and $f$ is as in Equation (7). The structure of the input is $x = (w_1\ w_2)$, and we use the trivial single decomposition consisting of all subparts (thus the summation reduces to a single term):
$$\begin{aligned} K(f(\textit{tall boy}), f(\textit{red cat})) &= \langle\vec{tall}\odot\vec{boy},\ \vec{red}\odot\vec{cat}\rangle = \langle\vec{tall}\odot\vec{red}\odot\vec{boy}\odot\vec{cat},\ \vec{1}\rangle = \\ &= \langle\vec{tall}\odot\vec{red},\ \vec{boy}\odot\vec{cat}\rangle = g\big(K_1(\vec{tall},\vec{red}),\ K_2(\vec{boy},\vec{cat})\big) \end{aligned} \quad (10)$$

This is the dot product between an indistinct chain of element-wise products and a vector $\vec{1}$ of all ones, or the product of two separate element-wise products, one on the adjectives, $\vec{tall}\odot\vec{red}$, and one on the nouns, $\vec{boy}\odot\vec{cat}$. In this latter CC form, the final dot product is obtained in two steps: first separately operating on the adjectives and on the nouns; then taking the dot product of the resulting vectors. The comparison operations are thus reflecting the input syntactic structure. The results can be easily extended to longer phrases and to phrases of different lengths.
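The two CC readings of Equation (10) can be checked the same way (again with random stand-in vectors; this sketch is ours):

```python
# Equation (10): <tall ⊙ boy, red ⊙ cat> as a chain against the all-ones vector
# and as the dot product of an adjective part and a noun part.
import numpy as np

rng = np.random.default_rng(2)
tall, boy, red, cat = (rng.standard_normal(10) for _ in range(4))

lhs = (tall * boy) @ (red * cat)                # composed-phrase similarity
chain = (tall * red * boy * cat) @ np.ones(10)  # indistinct chain of element-wise products
cc = (tall * red) @ (boy * cat)                 # adjectives vs. nouns (CC form)
assert np.isclose(lhs, chain) and np.isclose(lhs, cc)
```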
Full Additive Model. The input consists of a sequence of (label, word) pairs $x = ((L_1\ w_1), \ldots, (L_n\ w_n))$ and the relevant decomposition set includes the single tuples, that is, $R(x) = \{(L_1\ w_1), \ldots, (L_n\ w_n)\}$. The CDSM $f$ is defined as

$$f(x) = \begin{cases} \sum_{(L\ w)\in R(x)} f(L)\,f(w) & \text{if } x \notin T_X \\ \mathbf{X} & \text{if } x \in T_X \text{ is a label } L \\ \vec{w} & \text{if } x \in T_X \text{ is a word } w \end{cases} \quad (11)$$

The repeated operation $\bigoplus$ here is summation, and $\gamma$ is the matrix-by-vector product. In the CC form, $K$ is the dot product, $g$ the Frobenius product, $K_1(f(x), f(y)) = f(x)^T f(y)$, and $K_2(f(x), f(y)) = f(x)\,f(y)^T$. We have then for adjective–noun composition (by using the property in Equation (3)):
$$\begin{aligned} K(f((A\ \textit{tall})\ (N\ \textit{boy})), f((A\ \textit{red})\ (N\ \textit{cat}))) &= \langle\mathbf{A}\vec{tall} + \mathbf{N}\vec{boy},\ \mathbf{A}\vec{red} + \mathbf{N}\vec{cat}\rangle = \\ &= \langle\mathbf{A}\vec{tall},\mathbf{A}\vec{red}\rangle + \langle\mathbf{A}\vec{tall},\mathbf{N}\vec{cat}\rangle + \langle\mathbf{N}\vec{boy},\mathbf{A}\vec{red}\rangle + \langle\mathbf{N}\vec{boy},\mathbf{N}\vec{cat}\rangle = \\ &= \langle\mathbf{A}^T\mathbf{A},\ \vec{tall}\,\vec{red}^{\,T}\rangle_F + \langle\mathbf{A}^T\mathbf{N},\ \vec{tall}\,\vec{cat}^{\,T}\rangle_F + \langle\mathbf{N}^T\mathbf{A},\ \vec{boy}\,\vec{red}^{\,T}\rangle_F + \langle\mathbf{N}^T\mathbf{N},\ \vec{boy}\,\vec{cat}^{\,T}\rangle_F = \\ &= \sum_{\substack{(l_x\ w_x)\in\{(A\ \textit{tall}),(N\ \textit{boy})\}\\ (l_y\ w_y)\in\{(A\ \textit{red}),(N\ \textit{cat})\}}} g\big(K_1(f(l_x), f(l_y)),\ K_2(f(w_x), f(w_y))\big) \end{aligned} \quad (12)$$
The CC form shows how Full Additive factorizes into a more structural and a
more lexical part: Each element of the sum is the Frobenius product between the
product of two matrices representing syntactic labels and the tensor product between
two vectors representing the corresponding words. For subject–verb–object phrases
((S w1) (V w2) (O w3)) we have
$$\begin{aligned} &K(f(((S\ \textit{goats})\ (V\ \textit{eat})\ (O\ \textit{grass}))), f(((S\ \textit{cows})\ (V\ \textit{drink})\ (O\ \textit{water})))) = \\ &= \langle\mathbf{S}\vec{goats} + \mathbf{V}\vec{eat} + \mathbf{O}\vec{grass},\ \mathbf{S}\vec{cows} + \mathbf{V}\vec{drink} + \mathbf{O}\vec{water}\rangle = \\ &= \langle\mathbf{S}^T\mathbf{S},\ \vec{goats}\,\vec{cows}^{\,T}\rangle_F + \langle\mathbf{S}^T\mathbf{V},\ \vec{goats}\,\vec{drink}^{\,T}\rangle_F + \langle\mathbf{S}^T\mathbf{O},\ \vec{goats}\,\vec{water}^{\,T}\rangle_F + \\ &\quad + \langle\mathbf{V}^T\mathbf{S},\ \vec{eat}\,\vec{cows}^{\,T}\rangle_F + \langle\mathbf{V}^T\mathbf{V},\ \vec{eat}\,\vec{drink}^{\,T}\rangle_F + \langle\mathbf{V}^T\mathbf{O},\ \vec{eat}\,\vec{water}^{\,T}\rangle_F + \\ &\quad + \langle\mathbf{O}^T\mathbf{S},\ \vec{grass}\,\vec{cows}^{\,T}\rangle_F + \langle\mathbf{O}^T\mathbf{V},\ \vec{grass}\,\vec{drink}^{\,T}\rangle_F + \langle\mathbf{O}^T\mathbf{O},\ \vec{grass}\,\vec{water}^{\,T}\rangle_F = \\ &= \sum_{\substack{(l_x\ w_x)\in\{(S\ \textit{goats}),(V\ \textit{eat}),(O\ \textit{grass})\}\\ (l_y\ w_y)\in\{(S\ \textit{cows}),(V\ \textit{drink}),(O\ \textit{water})\}}} g\big(K_1(f(l_x), f(l_y)),\ K_2(f(w_x), f(w_y))\big) \end{aligned} \quad (13)$$
Again, we observe the factoring into products of syntactic and lexical representations.
By looking at Full Additive in the CC form, we observe that when $\mathbf{X}^T\mathbf{Y} \approx \mathbf{I}$ for all matrix pairs, it degenerates to Additive. Interestingly, Full Additive can also approximate a semantic convolution kernel (Mehdad, Moschitti, and Zanzotto 2010), which combines dot products of elements in the same slot. In the adjective–noun case, we obtain this approximation by choosing two nearly orthonormal matrices $\mathbf{A}$ and $\mathbf{N}$ such that $\mathbf{A}\mathbf{A}^T = \mathbf{N}\mathbf{N}^T \approx \mathbf{I}$ and $\mathbf{A}\mathbf{N}^T \approx \mathbf{0}$ and applying Equation (2): $\langle\mathbf{A}\vec{tall} + \mathbf{N}\vec{boy},\ \mathbf{A}\vec{red} + \mathbf{N}\vec{cat}\rangle \approx \langle\vec{tall},\vec{red}\rangle + \langle\vec{boy},\vec{cat}\rangle$.
This approximation is valid also for three-word phrases. When the matrices $\mathbf{S}$, $\mathbf{V}$, and $\mathbf{O}$ are such that $\mathbf{X}\mathbf{X}^T \approx \mathbf{I}$, with $\mathbf{X}$ one of the three matrices, and $\mathbf{Y}\mathbf{X}^T \approx \mathbf{0}$, with $\mathbf{X}$ and $\mathbf{Y}$ two different matrices, Full Additive approximates a semantic convolution kernel comparing two sentences by summing the dot products of the words in the same role, that is,

$$\langle\mathbf{S}\vec{goats} + \mathbf{V}\vec{eat} + \mathbf{O}\vec{grass},\ \mathbf{S}\vec{cows} + \mathbf{V}\vec{drink} + \mathbf{O}\vec{water}\rangle \approx \langle\vec{goats},\vec{cows}\rangle + \langle\vec{eat},\vec{drink}\rangle + \langle\vec{grass},\vec{water}\rangle \quad (14)$$
Results can again be easily extended to longer and different-length phrases.
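The exact factorization of Equation (13) and the approximation of Equation (14) can both be checked numerically. In the sketch below (ours), the "nearly orthonormal" slot matrices are realized as disjoint column blocks of a larger orthogonal matrix, so that $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ and $\mathbf{X}^T\mathbf{Y} = \mathbf{0}$ for different slots and the approximation becomes exact; vector and matrix names follow the running example.

```python
# Full Additive: exact CC factorization (Equation (13)) and the same-slot
# approximation (Equation (14)) with mutually orthogonal slot matrices.
import numpy as np

rng = np.random.default_rng(3)
d = 8
goats, eat, grass, cows, drink, water = (rng.standard_normal(d) for _ in range(6))

def frob(A, B):
    return np.sum(A * B)

# Random slot matrices: Equation (13) holds exactly for any S, V, O.
S, V, O = (rng.standard_normal((d, d)) for _ in range(3))
lhs = (S @ goats + V @ eat + O @ grass) @ (S @ cows + V @ drink + O @ water)
rhs = sum(frob(Lx.T @ Ly, np.outer(wx, wy))
          for Lx, wx in ((S, goats), (V, eat), (O, grass))
          for Ly, wy in ((S, cows), (V, drink), (O, water)))
assert np.isclose(lhs, rhs)

# Slot matrices taken as disjoint column blocks of an orthogonal matrix:
# X^T X = I and X^T Y = 0 for different slots, so Equation (14) holds (here exactly).
Q, _ = np.linalg.qr(rng.standard_normal((3 * d, 3 * d)))
S, V, O = Q[:, :d], Q[:, d:2 * d], Q[:, 2 * d:]
lhs = (S @ goats + V @ eat + O @ grass) @ (S @ cows + V @ drink + O @ water)
same_slot = goats @ cows + eat @ drink + grass @ water
assert np.isclose(lhs, same_slot)
```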
Lexical Function Model. We distinguish composition with one- vs. two-argument predicates. We illustrate the first through adjective–noun composition, where the adjective acts as the predicate, and the second with transitive verb constructions. Although we use the relevant syntactic labels, the formulas generalize to any construction with the same argument count. For adjective–noun phrases, the input is a sequence of (label, word) pairs ($x = ((A, w_1), (N, w_2))$) and the relevant decomposition set again includes only the single trivial decomposition into all the subparts: $R(x) = \{((A, w_1), (N, w_2))\}$. The method itself is recursively defined as

$$f(x) = \begin{cases} f((A, w_1))\,f((N, w_2)) & \text{if } x \notin T_X,\ x = ((A, w_1), (N, w_2)) \\ \mathbf{W}_1 & \text{if } x \in T_X,\ x = (A, w_1) \\ \vec{w}_2 & \text{if } x \in T_X,\ x = (N, w_2) \end{cases} \quad (15)$$
Here, $K$ and $g$ are, respectively, the dot and Frobenius product, $K_1(f(x), f(y)) = f(x)^T f(y)$, and $K_2(f(x), f(y)) = f(x)\,f(y)^T$. Using Equation (3), we have then

$$K(f(\textit{tall boy}), f(\textit{red cat})) = \langle\mathbf{TALL}\,\vec{boy},\ \mathbf{RED}\,\vec{cat}\rangle = \langle\mathbf{TALL}^T\mathbf{RED},\ \vec{boy}\,\vec{cat}^{\,T}\rangle_F = g\big(K_1(f(\textit{tall}), f(\textit{red})),\ K_2(f(\textit{boy}), f(\textit{cat}))\big) \quad (16)$$
The role of predicate and argument words in the final dot product is clearly
separated, showing again the structure-sensitive nature of the decomposition of the
comparison operations. In the two-place predicate case, again, the input is a set of
(label, word) tuples, and the relevant decomposition set only includes the single trivial
decomposition into all subparts. The CDSM f is defined as
⎧
⎪⎨
f ((S w1)) ⊗ f ((V w2)) ⊗ f ((O w3))
(cid:2)w
⎪⎩
WWW
f (x) =
if x /∈ TX
if x ∈ TX
if x ∈ TX
= ((S w1) (V w2) (O w3))
= (l w) and l is S or O
= (V w)
(17)
$K$ is the dot product and $g(x, y, z) = \langle x,\ y \otimes z\rangle_F$; $K_1(f(x), f(y)) = \sum_j (f(x) \otimes f(y))_j$, that is, the tensor contraction along the second index of the tensor product between $f(x)$ and $f(y)$ (Grefenstette et al. (2013) first framed the Lexical Function in terms of tensor contraction); and $K_2(f(x), f(y)) = K_3(f(x), f(y)) = f(x) \otimes f(y)$ are tensor products. The dot product of $\vec{goats}\,\mathbb{EAT}\,\vec{grass}$ and $\vec{cows}\,\mathbb{DRINK}\,\vec{water}$ is (by using Equation (4))

$$\begin{aligned} &K(f(((S\ \textit{goats})\ (V\ \textit{eat})\ (O\ \textit{grass}))), f(((S\ \textit{cows})\ (V\ \textit{drink})\ (O\ \textit{water})))) = \\ &= \langle\vec{goats}\,\mathbb{EAT}\,\vec{grass},\ \vec{cows}\,\mathbb{DRINK}\,\vec{water}\rangle = \\ &= \Big\langle \sum_j(\mathbb{EAT}\otimes\mathbb{DRINK})_j,\ \vec{goats}\otimes\vec{grass}\otimes\vec{cows}\otimes\vec{water}\Big\rangle_F = \\ &= g\big(K_1(f((V\ \textit{eat})), f((V\ \textit{drink}))),\ K_2(f((S\ \textit{goats})), f((S\ \textit{cows}))) \otimes K_3(f((O\ \textit{grass})), f((O\ \textit{water})))\big) \end{aligned} \quad (18)$$
We rewrote the equation as a Frobenius product between two fourth-order tensors. The first, $\sum_j(\mathbb{EAT}\otimes\mathbb{DRINK})_j$, combines the two third-order tensors of the verbs, and the second combines the vectors representing the arguments of the verbs, that is, $\vec{goats}\otimes\vec{grass}\otimes\vec{cows}\otimes\vec{water}$. In this case as well we can separate the role of predicate and argument types in the comparison computation.
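A NumPy check of Equation (18) (ours; small random tensors stand in for the verb tensors and random vectors for the arguments):

```python
# Equation (18): <goats EAT grass, cows DRINK water> equals the Frobenius product of the
# contracted verb tensor sum_j(EAT ⊗ DRINK)_j with goats ⊗ grass ⊗ cows ⊗ water.
import numpy as np

rng = np.random.default_rng(4)
d = 6
goats, grass, cows, water = (rng.standard_normal(d) for _ in range(4))
EAT, DRINK = rng.standard_normal((d, d, d)), rng.standard_normal((d, d, d))

def apply_verb(subj, verb, obj):
    """x VERB y: contract the third-order verb tensor with its arguments, index j stays free."""
    return np.einsum('i,ijk,k->j', subj, verb, obj)

lhs = apply_verb(goats, EAT, grass) @ apply_verb(cows, DRINK, water)

K1 = np.einsum('ijk,njm->iknm', EAT, DRINK)                   # sum_j (EAT ⊗ DRINK)_j
args = np.einsum('i,k,n,m->iknm', goats, grass, cows, water)  # goats ⊗ grass ⊗ cows ⊗ water
rhs = np.sum(K1 * args)                                       # Frobenius product
assert np.isclose(lhs, rhs)
```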
Extension of the Lexical Function to structured objects of different lengths is treated by using an identity element $\epsilon$ for missing parts. As an example, we show here the comparison between tall boy and cat, where the identity element is the identity matrix $\mathbf{I}$:

$$K(f(\textit{tall boy}), f(\textit{cat})) = \langle\mathbf{TALL}\,\vec{boy},\ \vec{cat}\rangle = \langle\mathbf{TALL}\,\vec{boy},\ \mathbf{I}\,\vec{cat}\rangle = \langle\mathbf{TALL}^T\mathbf{I},\ \vec{boy}\,\vec{cat}^{\,T}\rangle_F = g\big(K_1(f(\textit{tall}), f(\epsilon)),\ K_2(f(\textit{boy}), f(\textit{cat}))\big) \quad (19)$$
Dual Space Model. We have until now applied the CC to linear CDSMs with the dot
product as the final comparison operator (what we called K). The CC also holds for
the effective Dual Space model of Turney (2012), which assumes that each word has
two distributional representations, wd in “domain” space and wf in “function” space.
The similarity of two phrases is directly computed as the geometric average of the
separate similarities between the first and second words in both spaces. Even though
there is no explicit composition step, it is still possible to put the model in CC form. Take $x = (x_1, x_2)$ and its trivial decomposition. Define, for a word $w$ with vector representations $\vec{w}_d$ and $\vec{w}_f$: $f(w) = \vec{w}_d\,\vec{w}_f^{\,T}$. Define also $K_1(f(x_1), f(y_1)) = \sqrt{\langle f(x_1), f(y_1)\rangle_F}$, $K_2(f(x_2), f(y_2)) = \sqrt{\langle f(x_2), f(y_2)\rangle_F}$, and $g(a, b)$ to be $\sqrt{ab}$. Then

$$\begin{aligned} g\big(K_1(f(x_1), f(y_1)),\ K_2(f(x_2), f(y_2))\big) &= \sqrt{\sqrt{\langle\vec{x}_{d1}\vec{x}_{f1}^{\,T},\ \vec{y}_{d1}\vec{y}_{f1}^{\,T}\rangle_F} \cdot \sqrt{\langle\vec{x}_{d2}\vec{x}_{f2}^{\,T},\ \vec{y}_{d2}\vec{y}_{f2}^{\,T}\rangle_F}} = \\ &= \sqrt[4]{\langle\vec{x}_{d1},\vec{y}_{d1}\rangle \cdot \langle\vec{x}_{f1},\vec{y}_{f1}\rangle \cdot \langle\vec{x}_{d2},\vec{y}_{d2}\rangle \cdot \langle\vec{x}_{f2},\vec{y}_{f2}\rangle} = \\ &= \mathrm{geo}\big(\mathrm{sim}(x_{d1}, y_{d1}), \mathrm{sim}(x_{d2}, y_{d2}), \mathrm{sim}(x_{f1}, y_{f1}), \mathrm{sim}(x_{f2}, y_{f2})\big) \end{aligned} \quad (20)$$
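A NumPy illustration of Equation (20) (ours): with $f(w) = \vec{w}_d\vec{w}_f^{\,T}$, the square-root kernels, and $g(a,b) = \sqrt{ab}$, the CC form reproduces the geometric mean of the four word-level similarities. Plain dot products on non-negative random vectors stand in for Turney's similarity scores, only so that the roots stay real.

```python
# Equation (20): CC form of the Dual Space model vs. the geometric mean of four similarities.
import numpy as np

rng = np.random.default_rng(5)
d = 10
# Non-negative stand-ins (like co-occurrence counts), so every similarity is positive.
xd1, xf1, xd2, xf2, yd1, yf1, yd2, yf2 = (rng.random(d) for _ in range(8))

f = lambda wd, wf: np.outer(wd, wf)           # f(w) = w_d w_f^T
frob = lambda A, B: np.sum(A * B)

K1 = np.sqrt(frob(f(xd1, xf1), f(yd1, yf1)))  # sqrt(<f(x1), f(y1)>_F)
K2 = np.sqrt(frob(f(xd2, xf2), f(yd2, yf2)))  # sqrt(<f(x2), f(y2)>_F)
cc_form = np.sqrt(K1 * K2)                    # g(K1, K2)

sims = [xd1 @ yd1, xd2 @ yd2, xf1 @ yf1, xf2 @ yf2]
geo_mean = np.prod(sims) ** 0.25              # geometric mean of the four similarities
assert np.isclose(cc_form, geo_mean)
```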
A Neural-network-like Model. Consider the phrase $(w_1, w_2, \ldots, w_n)$ and the model defined by $f(x) = \sigma(\vec{w}_1 + \vec{w}_2 + \ldots + \vec{w}_n)$, where $\sigma(\cdot)$ is a component-wise logistic function. Here we have a single trivial decomposition that includes all the subparts, and $\gamma(x_1, \ldots, x_n)$ is defined as $\sigma(x_1 + \ldots + x_n)$. To see that for this model the CC cannot hold, consider two two-word phrases $(a\ b)$ and $(c\ d)$:

$$K(f((a, b)), f((c, d))) = \langle f((a, b)), f((c, d))\rangle = \sum_i \big(\sigma(\vec{a} + \vec{b})\big)_i \cdot \big(\sigma(\vec{c} + \vec{d})\big)_i = \sum_i \big(1 + e^{-a_i - b_i} + e^{-c_i - d_i} + e^{-a_i - b_i - c_i - d_i}\big)^{-1} \quad (21)$$

We need to rewrite this as

$$g\big(K_1(\vec{a}, \vec{c}),\ K_2(\vec{b}, \vec{d})\big) \quad (22)$$
But there is no possible choice of g, K1, and K2 that allows Equation (21) to be written
as Equation (22). This example can be regarded as a simplified version of the neural-
network model of Socher et al. (2011). The fact that the CC does not apply to it suggests
that it will not apply to other models in this family.
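As a sanity check of the algebra in Equation (21), the small sketch below (ours, random vectors) confirms that the dot product of the two logistic outputs equals the expanded exponential form, in which the components of $\vec{a}$, $\vec{b}$, $\vec{c}$, and $\vec{d}$ are entangled inside each summand:

```python
# Equation (21): <sigma(a + b), sigma(c + d)> and its expanded exponential form agree.
import numpy as np

rng = np.random.default_rng(6)
a, b, c, d = (rng.standard_normal(10) for _ in range(4))

sigma = lambda v: 1.0 / (1.0 + np.exp(-v))    # component-wise logistic function
lhs = sigma(a + b) @ sigma(c + d)
rhs = np.sum(1.0 / (1.0 + np.exp(-a - b) + np.exp(-c - d) + np.exp(-a - b - c - d)))
assert np.isclose(lhs, rhs)
```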
5. Conclusion
The Convolution Conjecture offers a general way to rewrite the phrase similarity com-
putations of CDSMs by highlighting the role played by the subparts of a composed
representation. This perspective allows for a better understanding of the exact op-
erations that a composition model applies to its input. The Convolution Conjecture
also suggests a strong connection between CDSMs and semantic convolution kernels.
This link suggests that insights from the CDSM literature could be directly integrated
in the development of convolution kernels, with all the benefits offered by this well-
understood general machine-learning framework.
Acknowledgments
We thank the reviewers for helpful comments. Marco Baroni acknowledges ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES).
References
Clark, Stephen. 2015. Vector space models of lexical meaning. In Shalom Lappin and Chris Fox, editors, Handbook of Contemporary Semantics, 2nd ed. Blackwell, Malden, MA. In press.
Coecke, Bob, Mehrnoosh Sadrzadeh, and Stephen Clark. 2010. Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis, 36:345–384.
Ganesalingam, Mohan and Aurélie Herbelot. 2013. Composing distributions: Mathematical structures and their linguistic interpretation. Working paper, Computer Laboratory, University of Cambridge. Available at www.cl.cam.ac.uk/~ah433/.
Grefenstette, Edward, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multi-step regression learning for compositional distributional semantics. In Proceedings of IWCS, pages 131–142, Potsdam.
Guevara, Emiliano. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of GEMS, pages 33–37, Uppsala.
Haussler, David. 1999. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California at Santa Cruz.
Mehdad, Yashar, Alessandro Moschitti, and Fabio Massimo Zanzotto. 2010. Syntactic/semantic structures for textual entailment recognition. In Proceedings of NAACL, pages 1020–1028, Los Angeles, CA.
Mitchell, Jeff and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL, pages 236–244, Columbus, OH.
Socher, Richard, Eric Huang, Jeffrey Pennington, Andrew Ng, and Christopher Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of NIPS, pages 801–809, Granada.
Turney, Peter. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533–585.
Turney, Peter and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
Zanzotto, Fabio Massimo, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. 2010. Estimating linear models for compositional distributional semantics. In Proceedings of COLING, pages 1263–1271, Beijing.