Concrete Models and Empirical Evaluations
for the Categorical Compositional
Distributional Model of Meaning

Edward Grefenstette∗†
Google DeepMind

Mehrnoosh Sadrzadeh∗∗†
Queen Mary University of London

Modeling compositional meaning for sentences using empirical distributional methods has been
a challenge for computational linguists. The categorical model of Clark, Coecke, and Sadrzadeh
(2008) and Coecke, Sadrzadeh, and Clark (2010) provides a solution by unifying a categorial
grammar and a distributional model of meaning. It takes into account syntactic relations
during semantic vector composition operations. But the setting is abstract: it has not been
evaluated on empirical data or applied to any language tasks. We generate concrete models
for this setting by developing algorithms to construct tensors and linear maps and instantiate
the abstract parameters using empirical data. We then evaluate our concrete models in
several experiments, both existing and new, based on measuring how well models align with
human judgments in a paraphrase detection task. Our results show that the implementation of this
general abstract framework performs on par with, or outperforms, other leading models in these
experiments.1

1. Introduction

The distributional approach to the semantic modeling of natural language, inspired
by the notion—presented by Firth (1957) and Harris (1968)—that the meaning of a
word is tightly related to its context of use, has grown in popularity as a method
of semantic representation. It draws from the frequent use of vector-based document
models in information retrieval, modeling the meaning of words as vectors based on
the distribution of co-occurring terms within the context of a word.

Using various vector similarity metrics as a measure of semantic similarity, these
distributional semantic models (DSMs) are used for a variety of NLP tasks, from

∗ DeepMind Technologies Ltd, 5 New Street Square, London EC4A 3TW.
E-mail: etg@google.com.
∗∗ School of Electronic Engineering and Computer Science, Queen Mary University of London, Mile End
Road, London E1 4NS, United Kingdom. E-mail: mehrnoosh.sadrzadeh@qmul.ac.uk.
† The work described in this article was performed while the authors were at the University of Oxford.
1 Support from EPSRC grant EP/J002607/1 is acknowledged.

Submission received: 26 September 2012; revised submission received: 31 October 2013;
accepted for publication: 5 April 2014.

doi:10.1162/COLI_a_00209

© 2015 Association for Computational Linguistics

automated thesaurus building (Grefenstette 1994; Curran 2004) to automated essay
marking (Landauer and Dumais 1997). The broader connection to information retrieval
and its applications is also discussed by Manning, Raghavan, and Schütze (2011). The
success of DSMs in essentially word-based tasks such as thesaurus extraction and con-
struction (Grefenstette 1994; Curran 2004) invites an investigation into how DSMs can
be applied to NLP and information retrieval (IR) tasks revolving around larger units of
text, using semantic representations for phrases, sentences, or documents, constructed
from lemma vectors. However, the problem of compositionality in DSMs—of how to go
from word to sentence and beyond—has proved to be non-trivial.

A new framework, which we refer to as DisCoCat, initially presented in Clark,
Coecke, and Sadrzadeh (2008) and Coecke, Sadrzadeh, and Clark (2010), reconciles
distributional approaches to natural language semantics with the structured, logical
nature of formal semantic models. This framework is abstract; its theoretical predictions
have not been evaluated on real data, and its applications to empirical natural language
processing tasks have not been studied.

This article is the journal version of Grefenstette and Sadrzadeh (2011a, 2011b),
which fill this gap in the DisCoCat literature. In it, we develop a concrete model and
an unsupervised learning algorithm to instantiate the abstract vectors, linear maps, and
vector spaces of the theoretical framework; we develop a series of empirical natural
language processing experiments and data sets and implement our algorithm on large-
scale real data; we analyze the outputs of the algorithm in terms of linear algebraic
equations; and we evaluate the model on these experiments and compare the results
with other competing unsupervised models. Furthermore, we provide a linear algebraic
analysis of the algorithm of Grefenstette and Sadrzadeh (2011a) and present an in-depth
study of the better performance of the method of Grefenstette and Sadrzadeh (2011b).

We begin in Section 2 by presenting the background to the task of developing
compositional distributional models. We briefly introduce two approaches to semantic
modeling: formal semantic models and distributional semantic models. We discuss their
differences, and relative advantages and disadvantages. We then present and critique
various approaches to bridging the gap between these models, and their limitations.
In Section 3, we summarize the categorical compositional distributional framework of
Clark, Coecke, and Sadrzadeh (2008) and Coecke, Sadrzadeh, and Clark (2010) and pro-
vide the theoretical background necessary to understand it; we also sketch the road map
of the literature leading to the development of this setting and outline the contributions
of this paper to the field. In Section 4, we present the details of an implementation of this
framework, and introduce learning algorithms used to build semantic representations
in this implementation. In Section 5, we present a series of experiments designed to
evaluate this implementation against other unsupervised distributional compositional
models. Finally, in Section 6 we discuss these results, and posit future directions for this
research area.

2. Background

Compositional formal semantic models represent words as parts of logical expressions,
and compose them according to grammatical structure. They stem from classical ideas
in logic and philosophy of language, mainly Frege’s principle that the meaning of a
sentence is a function of the meaning of its parts (Frege 1892). These models relate to
well-known and robust logical formalisms, hence offering a scalable theory of meaning
that can be used to reason about language using logical tools of proof and inference.
Distributional models are a more recent approach to semantic modeling, representing

the meaning of words as vectors learned empirically from corpora. They have found
their way into real-world applications such as thesaurus extraction (Grefenstette 1994;
Curran 2004) or automated essay marking (Landauer and Dumais 1997), and have
connections to semantically motivated information retrieval (Manning, Raghavan, and
Schütze 2011). This two-sortedness of the defining properties of meaning—"logical form"
versus "contextual use"—has left the quest for "what is the foundational structure of
meaning?", a question initially the concern solely of linguists and philosophers of
language, even more of a challenge.

In this section, we present a short overview of the background to the work developed
in this article by briefly describing formal and distributional approaches to
natural language semantics, and providing a non-exhaustive list of some approaches
to compositional distributional semantics. For a more complete review of the topic, we
encourage the reader to consult Turney (2012) or Clark (2013).

2.1 Montague Semantics

Formal semantic models provide methods for translating sentences of natural language
into logical formulae, which can then be fed to computer-aided automation tools to
reason about them (Alshawi 1992).

To compute the meaning of a sentence consisting of n words, meanings of these
words must interact with one another. In formal semantics, this further interaction is
represented as a function derived from the grammatical structure of the sentence. Such
models consist of a pairing of syntactic analysis rules (in the form of a grammar) with
semantic interpretation rules, as exemplified by the simple model presented on the left
of Figure 1.

The semantic representations of words are lambda expressions over parts of logical
formulae, which can be combined with one another to form well-formed logical expres-
sions. The function |−| : L → M maps elements of the lexicon L to their interpretation
(e.g., predicates, relations, domain objects) in the logical model M used. Nouns are
typically just logical atoms, whereas adjectives and verbs and other relational words
are interpreted as predicates and relations. The parse of a sentence such as “cats like
milk,” represented here as a binarized parse tree, is used to produce its semantic in-
terpretation by substituting semantic representations for their grammatical constituents
and applying β-reduction where needed. Such a derivation is shown on the right of
Figure 1.

What makes this class of models attractive is that it reduces language meaning
to logical expressions, a subject well studied by philosophers of language, logicians,
and linguists. Its properties are well known, and it becomes simple to evaluate the
meaning of a sentence if given a logical model and domain, as well as verify whether
or not one sentence entails another according to the rules of logical consequence and
deduction.

Syntactic Analysis              Semantic Interpretation
S → NP VP                       |VP|(|NP|)
NP → cats, milk, etc.           |cats|, |milk|, . . .
VP → Vt NP                      |Vt|(|NP|)
Vt → like, hug, etc.            λyx.|like|(x, y), . . .

Figure 1
A simple model of formal semantics (left), with the derivation for "cats like milk" (right):
λyx.|like|(x, y) is applied to |milk| to give λx.|like|(x, |milk|), which is then applied to
|cats| to yield |like|(|cats|, |milk|).
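The right-hand derivation of Figure 1 can be replayed with ordinary Python lambdas; the following minimal sketch is our own illustration (symbolic atoms and tuples stand in for model-theoretic objects), not part of the original formalism:

```python
# A toy re-enactment of the derivation in Figure 1 using Python lambdas.
# Interpretations |cats|, |milk| are symbolic atoms; |like| is a curried relation.

cats, milk = "|cats|", "|milk|"                 # nouns: logical atoms
like = lambda y: lambda x: ("|like|", x, y)     # λyx.|like|(x, y)

vp = like(milk)        # Vt applied to its object NP:  λx.|like|(x, |milk|)
sentence = vp(cats)    # VP applied to the subject NP: |like|(|cats|, |milk|)

print(sentence)        # ('|like|', '|cats|', '|milk|')
```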

However, such logical analysis says nothing about the closeness in meaning or
topic of expressions beyond their truth-conditions and which models satisfy these truth
conditions. Hence, formal semantic approaches to modeling language meaning do not
perform well on language tasks where the notion of similarity is not strictly based on
truth conditions, such as document retrieval, topic classification, and so on. Further-
more, an underlying domain of objects and a valuation function must be provided,
as with any logic, leaving open the question of how we might learn the meaning of
language using such a model, rather than just use it.

2.2 Distributional Semantics

A popular way of representing the meaning of words in lexical semantics is as dis-
tributions in a high-dimensional vector space. This approach is based on the distri-
butional hypothesis of Harris (1968), who postulated that the meaning of a word was
dictated by the context of its use. The more famous dictum stating this hypothesis
is the statement of Firth (1957) that “You shall know a word by the company it
keeps.” This view of semantics has furthermore been associated (Grefenstette 2009;
Turney and Pantel 2010) with earlier work in philosophy of language by Wittgenstein
(presented in Wittgenstein 1953), who stated that language meaning was equivalent to
its real world use.

Practically speaking, the meaning of a word can be learned from a corpus by looking
at what other words occur with it within a certain context, and the resulting distribution
can be represented as a vector in a semantic vector space. This vectorial representation is
convenient because vectors are a familiar structure with a rich set of ways of computing
vector distance, allowing us to experiment with different word similarity metrics. The
geometric nature of this representation entails that we can not only compare individual
words' meanings with various levels of granularity (we might, for example, be able
to show that cats are closer to kittens than to dogs, but that all three are mutually closer
than cats and steam engines), but also apply methods frequently called upon in IR tasks,
such as those described by Manning, Raghavan, and Schütze (2011), to group concepts
by topic, sentiment, and so forth.

The distribution underlying word meaning is a vector in a vector space, the basis
vectors of which are dictated by the context. In simple models, the basis vectors will
be annotated with words from the lexicon. Traditionally, the vector spaces used in
such models are Hilbert spaces (i.e., vector spaces with orthogonal bases, such that
the inner product of any one basis vector with another [other than itself] is zero). The
semantic vector for any word can be represented as the weighted sum of the basis
vectors:

$$\overrightarrow{\text{some word}} = \sum_i c_i \overrightarrow{n_i}$$

where $\{\overrightarrow{n_i}\}_i$ is the basis of the vector space the meaning of the word lives in, and $c_i \in \mathbb{R}$ is the weight associated with basis vector $\overrightarrow{n_i}$.

The construction of the vector for a word is done by counting, for each lexicon
word $n_i$ associated with basis vector $\overrightarrow{n_i}$, how many times it occurs in the context of
each occurrence of the word for which we are constructing the vector. This count is then
typically adjusted according to a weighting scheme (e.g., TF-IDF). The "context" of a
word can be something as simple as the other words occurring in the same sentence as
the word or within k words of it, or something more complex, such as using dependency
relations (Padó and Lapata 2007) or other syntactic features.
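The counting procedure just described can be sketched as follows (our own illustration with a hypothetical toy corpus; window size, basis choice, and the subsequent weighting scheme are all parameters of a real implementation):

```python
from collections import Counter

def cooccurrence_vector(target, sentences, basis_words, k=2):
    """Count, for each basis word, how often it occurs within k words of `target`.

    A minimal sketch of distributional vector construction; a weighting scheme
    such as TF-IDF would normally be applied to the raw counts afterwards."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
                counts.update(w for w in window if w in basis_words)
    return [counts[w] for w in basis_words]   # ordered by basis_words

corpus = ["cats drink milk", "dogs drink water", "cats chase mice"]
basis = ["drink", "milk", "water", "chase", "mice"]
print(cooccurrence_vector("cats", corpus, basis))   # [1, 1, 0, 1, 1]
```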

Commonly, the similarity of two semantic vectors is computed by taking their
cosine measure, which is the normalized sum of the products of the basis weights of the vectors:

$$\cos(\vec{a}, \vec{b}) = \frac{\sum_i c^a_i c^b_i}{\sqrt{\sum_i (c^a_i)^2}\,\sqrt{\sum_i (c^b_i)^2}}$$

where $c^a_i$ and $c^b_i$ are the basis weights for $\vec{a}$ and $\vec{b}$, respectively. However, other
options may be a better fit for certain implementations, typically dependent on the
weighting scheme.
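A direct transcription of the cosine measure above (a standard formula; the numpy sketch is ours):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two semantic vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 2, 0], [2, 4, 0]))   # 1.0: parallel vectors
print(cosine([1, 0, 0], [0, 1, 0]))   # 0.0: orthogonal vectors
```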

Readers interested in learning more about these aspects of distributional lexical
semantics are invited to consult Curran (2004), which contains an extensive overview
of implementation options for distributional models of word meaning.

2.3 Compositionality and Vector Space Models

In the previous overview of distributional semantic models of lexical semantics, we
have seen that DSMs are a rich and tractable way of learning word meaning from a
corpus, and obtaining a measure of semantic similarity of words or groups of words.
However, it should be fairly obvious that the same method cannot be applied to sen-
tences, whereby the meaning of a sentence would be given by the distribution of other
sentences with which it occurs.

First and foremost, a sentence typically occurs only once in a corpus, and hence
substantial and informative distributions cannot be created in this manner. More im-
portantly, human ability to understand new sentences is a compositional mechanism:
We understand sentences we have never seen before because we can generate sentence
meaning from the words used, and how they are put into relation. To go from word
vectors to sentence vectors, we must provide a composition operation allowing us to con-
struct a sentence vector from a collection of word vectors. In this section, we will discuss
several approaches to solving this problem, their advantages, and their limitations.

2.3.1 Additive Models. The simplest composition operation that comes to mind is straight-
forward vector addition, such that:

$$\overrightarrow{ab} = \vec{a} + \vec{b}$$

Conceptually speaking, if we view word vectors as semantic information distributed
across a set of properties associated with basis vectors, using vector addition as a
semantic composition operation states that the information of a set of lemmas in a
sentence is simply the sum of the information of the individual lemmas. Though
crude, this approach is computationally cheap, and appears sufficient for certain NLP
tasks: Landauer and Dumais (1997) show it to be sufficient for automated essay marking
tasks, and Grefenstette (2009) shows it to perform better than a collection of other simple
similarity metrics for summarization, sentence paraphrase, and document paraphrase
detection tasks.
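A minimal sketch of additive composition with toy vectors of our own choosing; the commutativity problem raised next is already visible here:

```python
import numpy as np

john = np.array([1., 0., 2.])
drank = np.array([0., 3., 1.])
wine = np.array([2., 1., 0.])

def add_compose(*word_vectors):
    """Additive composition: the sentence vector is the sum of its word vectors."""
    return np.sum(word_vectors, axis=0)

s1 = add_compose(john, drank, wine)   # "John drank wine"
s2 = add_compose(wine, drank, john)   # "Wine drank John"
print(np.array_equal(s1, s2))         # True: the two sentences get the same vector
```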

However, there are two principal objections to additive models of composition: first,
vector addition is commutative, therefore $\overrightarrow{John\ drank\ wine} = \overrightarrow{John} + \overrightarrow{drank} + \overrightarrow{wine} = \overrightarrow{Wine\ drank\ John}$, and thus vector addition ignores syntactic structure completely; and
second, vector addition sums the information contained in the vectors, effectively jum-
bling the meaning of words together as sentence length grows.

The first objection is problematic, as the syntactic insensitivity of additive models
leads them to equate the representation of sentences with patently different meanings.
Mitchell and Lapata (2008) propose to add some degree of syntactic sensitivity—namely,
accounting for word order—by weighting word vectors according to their order of
appearance in a sentence as follows:

$$\overrightarrow{ab} = \alpha\vec{a} + \beta\vec{b}$$

where $\alpha, \beta \in \mathbb{R}$. As a consequence, $\overrightarrow{John\ drank\ wine} = \alpha \cdot \overrightarrow{John} + \beta \cdot \overrightarrow{drank} + \gamma \cdot \overrightarrow{wine}$ would
not have the same representation as $\overrightarrow{Wine\ drank\ John} = \alpha \cdot \overrightarrow{wine} + \beta \cdot \overrightarrow{drank} + \gamma \cdot \overrightarrow{John}$.
The question of how to obtain weights and whether they are only used to reflect
word order or can be extended to cover more subtle syntactic information is open,
but it is not immediately clear how such weights may be obtained empirically and
whether this mode of composition scales well with sentence length and increase in
syntactic complexity. Guevara (2010) suggests using machine-learning methods such
as partial least squares regression to determine the weights empirically, but states that
this approach enjoys little success beyond minor composition such as adjective-noun
or noun-verb composition, and that there is a dearth of metrics by which to evaluate
such machine learning–based systems, stunting their growth and development.
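As a concrete illustration of the order-weighting idea above (toy vectors and arbitrary weight values of our own), the two orderings of the example sentence now receive different representations:

```python
import numpy as np

john = np.array([1., 0., 2.])
drank = np.array([0., 3., 1.])
wine = np.array([2., 1., 0.])
alpha, beta, gamma = 0.5, 0.3, 0.2    # hypothetical position weights

def weighted_add(word_vectors, weights):
    """Weighted additive composition: scale each word vector by its position weight."""
    return sum(w * v for w, v in zip(weights, word_vectors))

s1 = weighted_add([john, drank, wine], [alpha, beta, gamma])   # "John drank wine"
s2 = weighted_add([wine, drank, john], [alpha, beta, gamma])   # "Wine drank John"
print(np.array_equal(s1, s2))   # False: word order now affects the representation
```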

The second objection states that vector addition leads to an increase in ambiguity as
we construct sentences, rather than the decrease in ambiguity we would expect from
giving words a context; and for this reason Mitchell and Lapata (2008) suggest replacing
additive models with multiplicative models as discussed in Section 2.3.2, or combining
them with multiplicative models to form mixture models as discussed in Section 2.3.3.

2.3.2 Multiplicative Models. The multiplicative model of Mitchell and Lapata (2008) is
an attempt to solve the ambiguity problem discussed in Section 2.3.1 and provide
implicit disambiguation during composition. The composition operation proposed is
the component-wise multiplication ($\odot$) of two vectors: vectors are expressed as the
weighted sum of their basis vectors, and the weight of each basis vector of the com-
posed vector is the product of the weights of the original vectors; for $\vec{a} = \sum_i c_i \overrightarrow{n_i}$ and
$\vec{b} = \sum_i c'_i \overrightarrow{n_i}$, we have

$$\overrightarrow{ab} = \vec{a} \odot \vec{b} = \sum_i c_i c'_i \overrightarrow{n_i}$$
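A sketch of component-wise composition with toy vectors of our own; the zero entries in the result preview the "filtering" behaviour discussed below:

```python
import numpy as np

a = np.array([2., 0., 3., 1.])
b = np.array([1., 4., 0., 2.])

def mult_compose(a, b):
    """Multiplicative composition: component-wise (Hadamard) product."""
    return a * b

ab = mult_compose(a, b)
print(ab)   # [2. 0. 0. 2.]: any zero weight in either vector zeroes that component
print(np.count_nonzero(ab) <= min(np.count_nonzero(a), np.count_nonzero(b)))   # True
```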

Such multiplicative models are shown by Mitchell and Lapata (2008) to perform better
at verb disambiguation tasks than additive models for noun-verb composition, against
a baseline set by the original verb vectors. The experiment they use to show this will
also serve to evaluate our own models, and form the basis for further experiments, as
discussed in Section 5.

This approach to compositionality still suffers from two conceptual problems:
first, component-wise multiplication is still commutative and hence word order is not
accounted for; second, rather than "diluting" information during large compositions
and creating ambiguity, it may remove too much through the "filtering" effect of
component-wise multiplication.

The first problem is more difficult to deal with for multiplicative models than for
additive models, because both scalar multiplication and component-wise multiplication
are commutative, and hence $\alpha\vec{a} \odot \beta\vec{b} = \alpha\beta\,(\vec{a} \odot \vec{b})$; thus word order cannot be
taken into account using scalar weights.

To illustrate how the second problem entails that multiplicative models do not scale
well with sentence length, we look at the structure of component-wise multiplication
again: $\vec{a} \odot \vec{b} = \sum_i c_i c'_i \overrightarrow{n_i}$. For any $i$, if $c_i = 0$ or $c'_i = 0$, then $c_i c'_i = 0$, and therefore,
for any composition, the number of non-zero basis weights of the produced vector is
less than or equal to the number of non-zero basis weights of the original vectors: in
each composition step information is filtered out (or preserved, but never increased).
Hence, as the number of vectors to be composed grows, the number of non-zero
basis weights of the product vector stays the same or—more realistically—decreases.
Therefore, for any composition of the form $\overrightarrow{a_1 \ldots a_i \ldots a_n} = \vec{a}_1 \odot \ldots \odot \vec{a}_n$, if
for any $i$, $\vec{a}_i$ is orthogonal to $\vec{a}_1 \odot \ldots \odot \vec{a}_{i-1}$, then $\overrightarrow{a_1 \ldots a_i \ldots a_n} = \vec{0}$. It follows that purely
multiplicative models alone are not apt as a single mode of composition beyond noun-
verb (or adjective-noun) composition operations.

One solution to this second problem not discussed by Mitchell and Lapata (2008)
would be to introduce some smoothing factor $s \in \mathbb{R}^+$ for point-wise multiplication
such that $\vec{a} \odot \vec{b} = \sum_i (c_i + s)(c'_i + s)\overrightarrow{n_i}$, ensuring that information is never completely
filtered out. Seeing how the problem of syntactic insensitivity still stands in the way of
full-blown compositionality for multiplicative models, we leave it to those interested in
salvaging purely multiplicative models to determine whether some suitable value of $s$
can be determined.

2.3.3 Mixture Models. The problems faced by multiplicative models presented in Sec-
tion 2.3.2 are acknowledged in passing by Mitchell and Lapata (2008), who propose
mixing additive and multiplicative models in the hope of leveraging the advantage of
each while doing away with their pitfalls. This is simply expressed as the weighted sum
of additive and multiplicative models:

$$\overrightarrow{ab} = \alpha\vec{a} + \beta\vec{b} + \gamma(\vec{a} \odot \vec{b})$$

where α, β, and γ are predetermined scalar weights.
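A minimal sketch of the mixture composition, with toy vectors and hypothetical weight values:

```python
import numpy as np

a = np.array([1., 2., 0.])
b = np.array([3., 0., 1.])
alpha, beta, gamma = 0.4, 0.4, 0.2    # hypothetical, task-tuned scalar weights

def mixture_compose(a, b, alpha, beta, gamma):
    """Mixture model: weighted sum of additive and multiplicative composition."""
    return alpha * a + beta * b + gamma * (a * b)

print(mixture_compose(a, b, alpha, beta, gamma))   # [2.2 0.8 0.4]
```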

The problems for these models are threefold. First, the question of how scalar
weights are to be obtained still needs to be determined. Mitchell and Lapata (2008) con-
cede that one advantage of purely multiplicative models over weighted additive or
mixture models is that the lack of scalar weights removes the need to optimize the scalar
weights for particular tasks (at the cost of not accounting for syntactic structure), and
avoids the methodological concerns accompanying this requirement.

Second, the question of how well this process scales from noun-verb composition
to more syntactically rich expressions must be addressed. Using scalar weights to
account for word order seems ad hoc and superficial, as there is more to syntactic
structure than the mere ordering of words. Therefore, an account of how to build
sentence vectors for sentences such as The dog bit the man and The man was bitten by
the dog in order to give both sentences the same (or a similar) representation would
need to give a richer role to scalar weights than mere token order. Perhaps specific

weights could be given to particular syntactic classes (such as nouns) to introduce a
more complex syntactic element into vector composition, but it is clear that this alone
is not a solution, as the weight for the nouns dog and man would be the same, allowing
for the same commutative degeneracy observed in non-weighted additive models, in
which $\overrightarrow{the\ dog\ bit\ the\ man} = \overrightarrow{the\ man\ bit\ the\ dog}$. Introducing a mixture of weighting
systems accounting for both word order and syntactic roles may be a solution, but it is
not only ad hoc but also arguably only partially reflects the syntactic structure of the
sentence.

The third problem is that Mitchell and Lapata (2008) show that in practice, although
mixture models perform better at verb disambiguation tasks than additive models and
weighted additive models, they perform equivalently to purely multiplicative models
with the added burden of requiring parametric optimization of the scalar weights.

Therefore, whereas mixture models aim to take the best of additive and multiplica-
tive models while avoiding their problems, they are only partly successful in achieving
the latter goal, and demonstrably do little better in achieving the former.

2.3.4 Tensor-Based Models. From Sections 2.3.1–2.3.3 we observe that the need for incor-
porating syntactic information into DSMs to achieve true compositionality is pressing,
if only to develop a non-commutative composition operation that can take into account
word order without the need for ad hoc weighting schemes, and hopefully richer syn-
tactic information as well.

An early proposal by Smolensky and colleagues (Smolensky 1990; Smolensky and
Legendre 2006) to use linear algebraic tensors as a composition operation solves the
problem of finding non-commutative vector composition operators. The composition of
two vectors is their tensor product, sometimes called the Kronecker product when ap-
plied to vectors rather than vector spaces. For $\vec{a} \in V = \sum_i c_i \overrightarrow{n_i}$ and $\vec{b} \in W = \sum_j c'_j \overrightarrow{n'_j}$,
we have:

$$\overrightarrow{ab} = \vec{a} \otimes \vec{b} = \sum_{ij} c_i c'_j \; \overrightarrow{n_i} \otimes \overrightarrow{n'_j}$$

The composition operation takes the original vectors and maps them to a vector in a
larger vector space V ⊗ W, which is the tensor space of the original vectors’ spaces.
Here the second instance of ⊗ is not a recursive application of the Kronecker product,
but rather the pairing of basis elements of V and W to form a basis element of V ⊗ W.
The shared notation and occasional conflation of Kronecker and tensor products may
seem confusing, but is fairly standard in multilinear algebra.
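The composition can be sketched with numpy's Kronecker product (our own toy illustration); note how the dimensionality of the result is the product of the input dimensionalities:

```python
import numpy as np

a = np.array([1., 2.])         # a vector in V, dimension 2
b = np.array([0., 3., 1.])     # a vector in W, dimension 3

ab = np.kron(a, b)             # a vector in V ⊗ W, dimension 2 * 3 = 6
print(ab)                      # [0. 3. 1. 0. 6. 2.]
print(ab.reshape(2, 3))        # the same data viewed as the outer product c_i * c'_j
```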

The advantage of this approach is twofold: First, vectors for different words need
not live in the same spaces but can be composed nonetheless. This allows us to repre-
sent vectors for different word classes (topics, syntactic roles, etc.) in different spaces
with different bases, which was not possible under additive or multiplicative models.
Second, because the product vector lives in a larger space, we obtain the intuitive notion
that the information of the whole is richer and more complex than the information of
the parts.

Dimensionality Problems. However, as observed and commented upon by Smolensky
himself, this increase in dimensionality brings two rather large problems for tensor-
based models. The first is computational: The size of the product vector space is the
product of the size of the original vector spaces. If we assume that all words live in the

same space N of dimensionality dim(N), then the dimensionality of an n-word sentence
vector is dim(N)^n. If we have as many basis vectors for our word semantic space as
there are lexemes in our vocabulary (e.g., approximately 170k in English2), then the size
of our sentence vectors quickly reaches magnitudes for which vector comparison (or
even storage) is computationally intractable.3 Even if, as most DSM implementations
do, we restrict the basis vectors of word semantic spaces to the k (e.g., k = 2,000) most
frequent words in a corpus, the sentence vector size still grows exponentially with
sentence length, and the implementation problems remain.

The second problem is mathematical: Sentences of different length live in different
spaces, and if we assign different vector spaces to different word types (e.g., syntactic
classes), then sentences of different syntactic structure live in different vector spaces,
and hence cannot be compared directly using inner product or cosine measures, leaving
us with no obvious mode of semantic comparison for sentence vectors. If any model
wishes to use tensor products in composition operations, it must find some way of
reducing the dimensionality of product vectors to some common vector space so that
they may be directly compared.

One notable method by which these dimensionality problems can be solved in
general is the holographic reduced representations proposed by Plate (1991). The
product vector of two vectors is projected into a space of smaller dimensionality by
circular convolution to produce a trace vector. The circular correlation of the trace vector
and one of the original vectors produces a noisy version of the other original vector.
The noisy vector can be used to recover the clean original vector by comparing it with a
predefined set of candidates (e.g., the set of word vectors if our original vectors are word
meanings). Traces can be summed to form new traces effectively containing several vec-
tor pairs from which original vectors can be recovered. Using this encoding/decoding
mechanism, the tensor product of sets of vectors can be encoded in a space of smaller
dimensionality, and then recovered for computation without ever having to fully repre-
sent or store the full tensor product, as discussed by Widdows (2008).
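The binding and unbinding operations of holographic reduced representations can be sketched with FFT-based circular convolution and correlation (a standard construction; this toy illustration is ours and omits Plate's full machinery):

```python
import numpy as np

def bind(a, b):
    """Circular convolution: compress a ⊗ b into a trace of the same dimensionality."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(trace, a):
    """Circular correlation with a recovers a noisy version of the other bound vector."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(trace)))

rng = np.random.default_rng(0)
dim = 512
# elements drawn i.i.d. with variance 1/dim, so mean Euclidean length is about 1
a = rng.normal(0, 1 / np.sqrt(dim), dim)
b = rng.normal(0, 1 / np.sqrt(dim), dim)

trace = bind(a, b)
b_noisy = unbind(trace, a)
# the noisy estimate should be closest to b among a set of candidate vectors
print(np.corrcoef(b, b_noisy)[0, 1] > 0.5)   # True (with high probability)
```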

There are problems with this approach that make it unsuitable for our purposes,
some of which are discussed in Mitchell and Lapata (2010). First, there is a limit to
the information that can be stored in traces, which is independent of the size of the
vectors stored, but is a logarithmic function of their number. As we wish to be able
to store information for sentences of variable word length without having to directly
represent the tensored sentence vector, setting an upper bound to the number of vectors
that can be composed in this manner limits the length of the sentences we can represent
compositionally using this method.

Second, and perhaps more importantly, there are restrictions on the nature of
the vectors that can be encoded in such a way: The vectors must be independently
distributed such that the mean Euclidean length of each vector is 1. Such conditions
are unlikely to be met in word semantic vectors obtained from a corpus; and as the
failure to do so affects the system’s ability to recover clean vectors, holographic re-
duced representations are not prima facie usable for compositional DSMs, although it
is important to note that Widdows (2008) considers possible application areas where
they may be of use, although once again these mostly involve noun-verb and adjective-
noun compositionality rather than full blown sentence vector construction. We retain

2 Source: http://www.oxforddictionaries.com/page/howmanywords.
3 At four bytes per integer, and one integer per basis vector weight, the vector for John loves Mary
would require roughly (170,000 · 4)³ ≈ 280 petabytes of storage, which is over ten times the data Google
processes on a daily basis according to Dean and Ghemawat (2008).

from Plate (1991) the importance of finding methods by which to project the tensored
sentence vectors into a common space for direct comparison, as will be discussed further
in Section 3.

Syntactic Expressivity. An additional problem of a more conceptual nature is that using
the tensor product as a composition operation simply preserves word order. As we
discussed in Section 2.3.3, this is not enough on its own to model sentence meaning. We
need to have some means by which to incorporate syntactic analysis into composition
operations.

Early work on including syntactic sensitivity into DSMs by Grefenstette (1992)
suggests using crude syntactic relations to determine the frame in which the distri-
butions for word vectors are collected from the corpus, thereby embedding syntac-
tic information into the word vectors. This idea was already present in the work of
Smolensky, who used sums of pairs of vector representations and their roles, obtained
by taking their tensor products, to obtain a vector representation for a compound. The
application of these ideas to DSMs was studied by Clark and Pulman (2007), who
suggest instantiating the roles to dependency relations and using the distributional
representations of words as the vectors. For example, in the sentence Simon loves red
wine, Simon is the subject of loves, wine is its object, and red is an adjective describing
wine. Hence, from the dependency tree with loves as root node, its subject and object as
children, and their adjectival descriptors (if any) as their children, we read the following
structure: $\overrightarrow{loves} \otimes \overrightarrow{subj} \otimes \overrightarrow{Simon} \otimes \overrightarrow{obj} \otimes \overrightarrow{wine} \otimes \overrightarrow{adj} \otimes \overrightarrow{red}$. Using the equality relation for
inner products of tensor products:

$$\langle \vec{a} \otimes \vec{b} \mid \vec{c} \otimes \vec{d} \rangle = \langle \vec{a} \mid \vec{c} \rangle \times \langle \vec{b} \mid \vec{d} \rangle$$

We can therefore express inner-products of sentence vectors efficiently without ever
having to actually represent the tensored sentence vector:

$$\langle \overrightarrow{Simon\ loves\ red\ wine} \mid \overrightarrow{Mary\ likes\ delicious\ ros\acute{e}} \rangle$$
$$= \langle \overrightarrow{loves} \otimes \overrightarrow{subj} \otimes \overrightarrow{Simon} \otimes \overrightarrow{obj} \otimes \overrightarrow{wine} \otimes \overrightarrow{adj} \otimes \overrightarrow{red} \mid \overrightarrow{likes} \otimes \overrightarrow{subj} \otimes \overrightarrow{Mary} \otimes \overrightarrow{obj} \otimes \overrightarrow{ros\acute{e}} \otimes \overrightarrow{adj} \otimes \overrightarrow{delicious} \rangle$$
$$= \langle \overrightarrow{loves} \mid \overrightarrow{likes} \rangle \times \langle \overrightarrow{subj} \otimes \overrightarrow{Simon} \mid \overrightarrow{subj} \otimes \overrightarrow{Mary} \rangle \times \langle \overrightarrow{obj} \otimes \overrightarrow{wine} \mid \overrightarrow{obj} \otimes \overrightarrow{ros\acute{e}} \rangle \times \langle \overrightarrow{adj} \otimes \overrightarrow{red} \mid \overrightarrow{adj} \otimes \overrightarrow{delicious} \rangle$$
$$= \langle \overrightarrow{loves} \mid \overrightarrow{likes} \rangle \times \langle \overrightarrow{subj} \mid \overrightarrow{subj} \rangle \times \langle \overrightarrow{Simon} \mid \overrightarrow{Mary} \rangle \times \langle \overrightarrow{obj} \mid \overrightarrow{obj} \rangle \times \langle \overrightarrow{wine} \mid \overrightarrow{ros\acute{e}} \rangle \times \langle \overrightarrow{adj} \mid \overrightarrow{adj} \rangle \times \langle \overrightarrow{red} \mid \overrightarrow{delicious} \rangle$$

This example shows that this formalism allows for sentence comparison of sentences
with identical dependency trees to be broken down to term-to-term comparison without
the need for the tensor products to ever be computed or stored, reducing computation
to inner product calculations.
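The factorization the example relies on, ⟨a ⊗ b | c ⊗ d⟩ = ⟨a | c⟩ × ⟨b | d⟩, can be checked numerically (a toy sketch of ours with random vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
a, c = rng.normal(size=4), rng.normal(size=4)
b, d = rng.normal(size=3), rng.normal(size=3)

lhs = np.kron(a, b) @ np.kron(c, d)   # inner product taken in the tensor space
rhs = (a @ c) * (b @ d)               # product of the pairwise inner products
print(np.isclose(lhs, rhs))           # True: the tensored vectors never need storing
```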

However, although matching terms with identical syntactic roles in the sentence
works well in the given example, this model suffers from the same problems as the
original tensor-based compositionality of Smolensky (1990) in that, by the authors'
own admission, sentences of different syntactic structure live in spaces of different
dimensionality and thus cannot be directly compared. Hence we cannot use this to
measure the similarity between even small variations in sentence structure, such as the
pair "Simon likes red wine" and "Simon likes wine."

2.3.5 SVS Models. The idea of including syntactic relations to other lemmas in word
representations discussed in Section 2.3.4 is applied differently in the structured vector

space model presented by Erk and Padó (2008). They propose to represent word mean-
ings not as simple vectors, but as triplets:

$$w = (v, R, R^{-1})$$

where v is the word vector, constructed as in any other DSM, and R and $R^{-1}$ are selectional
preferences, which take the form of R → D maps, where R is the set of dependency
relations and D is the set of word vectors. Selectional preferences are used to encode
the lemmas that w is typically the parent of in the dependency trees of the corpus, in the
case of R, and typically the child of, in the case of $R^{-1}$.

Composition takes the form of vector updates according to the following protocol.
Let $a = (v_a, R_a, R_a^{-1})$ and $b = (v_b, R_b, R_b^{-1})$ be two words being composed, and let r be the
dependency relation linking a to b. The vector update procedure is as follows:

$$a' = (v_a \odot R_b^{-1}(r),\; R_a - \{r\},\; R_a^{-1})$$
$$b' = (v_b \odot R_a(r),\; R_b,\; R_b^{-1} - \{r\})$$

where $a'$ and $b'$ are the updated word meanings, and $\odot$ is whichever vector composition
(addition, component-wise multiplication) we wish to use. The word vectors in the
triplets are effectively filtered by combination with the lemma which the word they
are being composed with expects to bear relation r to, and this relation between the
composed words a and b is considered to be used and hence removed from the domain
of the selectional preference functions used in composition.
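A schematic rendering of this update protocol (our own simplified data structures, with ⊙ instantiated as component-wise multiplication; not Erk and Padó's implementation):

```python
import numpy as np

class Word:
    """Triplet (v, R, R_inv): a vector plus selectional preferences, here dicts
    from dependency relations to vectors."""
    def __init__(self, v, R, R_inv):
        self.v, self.R, self.R_inv = v, dict(R), dict(R_inv)

def compose(a, b, r, op=np.multiply):
    """Compose head `a` with dependent `b` via relation `r`; returns updated copies."""
    if r not in a.R or r not in b.R_inv:
        raise ValueError("relation %r not expected: composition is ungrammatical" % r)
    a_new = Word(op(a.v, b.R_inv[r]),
                 {k: v for k, v in a.R.items() if k != r}, a.R_inv)
    b_new = Word(op(b.v, a.R[r]),
                 b.R, {k: v for k, v in b.R_inv.items() if k != r})
    return a_new, b_new

# toy example: a verb composed with its subject noun
catch = Word(np.array([1., 2.]), {"subj": np.array([0.5, 1.])}, {})
dog = Word(np.array([2., 1.]), {}, {"subj": np.array([1., 0.5])})
catch2, dog2 = compose(catch, dog, "subj")
print(catch2.v, dog2.v)   # [1. 1.] [1. 1.]
```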

This mechanism is therefore a more sophisticated version of the compositional
disambiguation mechanism discussed by Mitchell and Lapata (2008) in that the com-
bination of words filters the meaning of the original vectors that may be ambiguous
(e.g., if we have one vector for bank); but contrary to Mitchell and Lapata (2008), the
information of the original vectors is modified but essentially preserved, allowing for
further combination with other terms, rather than directly producing a joint vector for
the composed words. The added fact that R and $R^{-1}$ are partial functions associated
with specific lemmas forces grammaticality during composition, since if a holds a
dependency relation r to b which it never expects to hold (for example, a verb having
as its subject another verb, rather than the reverse), then $R_a$ and $R_b^{-1}$ are undefined for r
and the update fails. However, there are some problems with this approach if our goal
is true compositionality.

First, this model does not allow some of the "repeated compositionality" we need,
because of the update of R and $R^{-1}$. For example, we expect that an adjective composed
with a noun produces something like a noun, in order to be further composed with a verb
or even another adjective. However, here, because the relation adj would be removed
from $R_b^{-1}$ for some noun b composed with an adjective a, this new representation $b'$
would not have the properties of a noun in that it would no longer expect composition
with an adjective, rendering representations of simple expressions like "the new red
car" impossible. Of course, we could remove the update of the selectional preference
functions from the compositional mechanism, but then we would lose this attractive
feature of grammaticality enforcement through the partial functionality of R and $R^{-1}$.
Second, this model does little more than represent the implicit disambiguation that
is expected during composition, rather than actually providing a full-blown compositional
model. The inability of this system to provide a novel mechanism by which to obtain

a joint vector for two composed lemmas—thereby building towards sentence vectors—
entails that this system provides no means by which to obtain semantic representations
of larger syntactic structures that can be compared by inner product or cosine measure
as is done with any other DSM. Of course, this model could be combined with the
compositional models presented in Sections 2.3.1–2.3.3 to produce sentence vectors, but
whereas some syntactic sensitivity would have been obtained, the word ordering and
other problems of the aforementioned models would still remain, and little progress
would have been made towards true compositionality.

We retain from this attempt to introduce compositionality in DSMs that including
information obtained from syntactic dependency relations is important for proper dis-
ambiguation, and that having some mechanism by which the grammaticality of the
expression being composed is a precondition for its composition is a desirable feature
for any compositional mechanism.

2.4 Matrix-Based Compositionality

The final class of approaches to vector composition we wish to discuss are three
matrix-based models.

Generic Additive Model. The first is the Generic Additive Model of Zanzotto et al. (2010).
This is a generalization of the weighted additive model presented in Section 2.3.1.
In this model, rather than multiplying lexical vectors by fixed parameters α and β
before adding them to form the representation of their combination, they are instead
the arguments of matrix multiplication by square matrices A and B:

$$\overrightarrow{ab} = A\vec{a} + B\vec{b}$$

Here, A and B represent the added information provided by putting two words into
relation.

The numerical content of A and B is learned by linear regression over triplets
$(\vec{a}, \vec{b}, \vec{c})$ where $\vec{a}$ and $\vec{b}$ are lexical semantic vectors, and $\vec{c}$ is the expected output
of the combination of $\vec{a}$ and $\vec{b}$. This learning system thereby requires the provision of
labeled data for linear regression to be performed. Zanzotto et al. (2010) suggest several
sources for this labeled data, such as dictionary definitions and word etymologies.
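A sketch of the Generic Additive Model with matrices learned by least-squares regression; the training triplets below are synthetic stand-ins for the kind of labeled data Zanzotto et al. (2010) describe:

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_examples = 5, 200

# synthetic triplets (a, b) -> c generated from hidden matrices A_true, B_true
A_true, B_true = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
As, Bs = rng.normal(size=(n_examples, dim)), rng.normal(size=(n_examples, dim))
Cs = As @ A_true.T + Bs @ B_true.T

# learn A and B jointly by linear regression: c ≈ A·a + B·b
X = np.hstack([As, Bs])                        # each row is [a, b]
W, *_ = np.linalg.lstsq(X, Cs, rcond=None)     # W stacks A^T over B^T
A_hat, B_hat = W[:dim].T, W[dim:].T

def compose(a, b):
    """Generic Additive Model composition: A·a + B·b."""
    return A_hat @ a + B_hat @ b

print(np.allclose(A_hat, A_true), np.allclose(B_hat, B_true))   # True True
```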

This approach is richer than the weighted additive models because the matrices
act as linear maps on the vectors they take as “arguments,” and thus can encode more
subtle syntactic or semantic relations. However, this model treats all word combinations
as the same operation—for example, treating the combination of an adjective with its
argument and a verb with its subject as the same sort of composition. Because of the
diverse ways there are of training such supervised models, we leave it to those who
wish to further develop this specific line of research to perform such evaluations.

Adjective Matrices. The second approach is the matrix-composition model of Baroni and
Zamparelli (2010), which they develop only for the case of adjective-noun composition,
although their approach can seamlessly be used for any other predicate-argument com-
position. Contrary to most of the earlier approaches proposed, which aim to combine
two lexical vectors to form a lexical vector for their combination, Baroni and Zamparelli
suggest giving different semantic representations to different types, or more specifically
to adjectives and nouns.

In this model, nouns are lexical vectors, as with other models. However, embracing
a view of adjectives that is more in line with formal semantics than with distributional
semantics, they model adjectives as linear maps taking lexical vectors as input and pro-
ducing lexical vectors as output. Such linear maps can be encoded as square matrices,
and applied to their arguments by matrix multiplication. Concretely, let Madjective be the
matrix encoding the adjective's linear map, and $\overrightarrow{noun}$ be the lexical semantic vector for
a noun; their combination is simply

$$\overrightarrow{adjective\ noun} = M_{adjective} \times \overrightarrow{noun}$$

Similarly to the Generic Additive Model, the matrix for each adjective is learned
by linear regression over a set of pairs $(\overrightarrow{noun}, \vec{c})$ where the vectors $\overrightarrow{noun}$ are the lexical
semantic vectors for the arguments of the adjective in a corpus, and $\vec{c}$ is the semantic
vector corresponding to the expected output of the composition of the adjective with
that noun.

This may, at first blush, also appear to be a supervised training method for learn-
ing adjective matrices from “labeled data,” seeing how the expected output vectors
are needed. However, Baroni and Zamparelli (2010) work around this constraint by
automatically producing the labeled data from the corpus by treating the adjective-
noun compound as a single token, and learning its vector using the same distributional
learning methods they used to learn the vectors for nouns. This same approach can be
extended to other unary relations without change and, using the general framework
of the current article, an extension of it to binary predicates has been presented in
Grefenstette et al. (2013), using multistep regression. For a direct comparison of the
results of this approach with some of the results of the current article, we refer the
reader to Grefenstette et al. (2013).
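The per-adjective regression can be sketched in the same way: collect pairs of (noun vector, observed adjective-noun vector)—the latter learned by treating the compound as a single token—and solve for the matrix by least squares (synthetic vectors below stand in for corpus-derived ones):

```python
import numpy as np

rng = np.random.default_rng(3)
dim, n_pairs = 4, 100

M_red_true = rng.normal(size=(dim, dim))    # hidden linear map for, e.g., "red"
nouns = rng.normal(size=(n_pairs, dim))     # noun vectors (stand-ins for corpus vectors)
red_nouns = nouns @ M_red_true.T            # observed "red <noun>" vectors

# learn M_red by regression: red_noun ≈ M_red · noun
M_red, *_ = np.linalg.lstsq(nouns, red_nouns, rcond=None)
M_red = M_red.T

def apply_adjective(M, noun_vector):
    """Adjective-noun composition: the adjective's matrix applied to the noun vector."""
    return M @ noun_vector

print(np.allclose(M_red, M_red_true))   # True
```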

Recursive Matrix-Vector Model. The third approach is the recently developed Recursive
Matrix-Vector Model (MV-RNN) of Socher et al. (2012), which claims the two matrix-
based models described here as special cases. In MV-RNN, words are represented as a
pairing of a lexical semantic vector $\vec{a}$ with an operation matrix A. Within this model,
given the parse of a sentence in the form of a binarized tree, the semantic representation
$(\vec{c}, C)$ of each non-terminal node in the tree is produced by performing the following
two operations on its children $(\vec{a}, A)$ and $(\vec{b}, B)$.

First, the vector component $\vec{c}$ is produced by applying the operation matrix of one
child to the vector of the other, and vice versa, and then projecting both of the products
back into the same vector space as the child vectors using a projection matrix W, which
must also be learned:
must also be learned:

$$\vec{c} = W \times \begin{bmatrix} B \times \vec{a} \\ A \times \vec{b} \end{bmatrix}$$

Second, the matrix C is calculated by projecting the pairing of matrices A and B back

into the same space, using a projection matrix WM, which must also be learned:

$$C = W_M \times \begin{bmatrix} A \\ B \end{bmatrix}$$

The pairing $(\vec{c}, C)$ obtained through these operations forms the semantic representation
of the phrase falling under the scope of the segment of the parse tree below that node.
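One MV-RNN composition step, following the two equations above, can be sketched as follows (random toy matrices; in the actual model W and W_M are learned):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
W = rng.normal(size=(d, 2 * d))     # projects the stacked child vectors back to dim d
WM = rng.normal(size=(d, 2 * d))    # projects the stacked child matrices back to d x d

def mvrnn_compose(a, A, b, B):
    """Combine children (a, A) and (b, B) into a parent pair (c, C)."""
    c = W @ np.concatenate([B @ a, A @ b])   # vector part
    C = WM @ np.vstack([A, B])               # matrix part: (d, 2d) @ (2d, d) -> (d, d)
    return c, C

a, b = rng.normal(size=d), rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
c, C = mvrnn_compose(a, A, b, B)
print(c.shape, C.shape)   # (3,) (3, 3)
```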

This approach to compositionality yields good results in the experiments described
in Socher et al. (2012). It furthermore has appealing characteristics, such as treating
relational words differently through their operation matrices, and allowing for recursive
composition, as the output of each composition operation is of the same type of object
as its inputs. This approach is highly general and has excellent coverage of different
syntactic types, while leaving much room for parametric optimization.

The principal mathematical difference with the compositional framework presented
subsequently is that composition in MV-RNN is always a binary operation; for exam-
ple, to compose a transitive verb with its subject and object one would first need to
compose it with its object, and then compose the output of that operation with the
subject. The framework we discuss in this article allows for the construction of larger
representations for relations of larger arities, permitting the simultaneous composition
of a verb with its subject and object. Whether or not this theoretical difference leads to
significant differences in composition quality requires joint evaluation. Furthermore,
the description of MV-RNN models in Socher et al. (2012) specifies the need for a
source of learning error during training, which is easy to measure in the case of label
prediction experiments such as sentiment prediction, but non-trivial in the case of
paraphrase detection where no objective label exists. A direct comparison to MV-RNN
methods within the context of experiments similar to those presented in this article has
been produced by Blacoe and Lapata (2012), showing that simple operations perform
on par with the earlier complex deep learning architectures produced by Socher and
colleagues; we leave direct comparisons to future work. Early work has shown that the
addition of a hidden layer with non-linearities to these simple models will improve the
results.

2.5 Some Other Approaches to Distributional Semantics

Domains and Functions. In recent work, Turney (2012) suggests modeling word repre-
sentations not as a single semantic vector, but as a pair of vectors: one containing the
information of the word relative to its domain (the other words that are ontologically
similar), and another containing information relating to its function. The former vector
is learned by looking at what nouns a word co-occurs with, and the latter is learned
by looking at what verb-based patterns the word occurs in. Similarity between sets of
words is not determined by a single similarity function, but rather through a combi-
nation of comparisons of the domain components of words’ representations with the
function components of the words’ representations. Such combinations are designed
on a task-specific basis. Although Turney’s work does not directly deal with vector
composition of the sort we explore in this article, Turney shows that similarity measures
can be designed for tasks similar to those presented here. The particular limitation of
his approach, which Turney discusses, is that similarity measures must be specified for
each task, whereas most of the compositional models described herein produce repre-
sentations that can be compared in a task-independent manner (e.g., through cosine
similarity). However, this approach is innovative, and will merit further attention in
future work in this area.

Language as Algebra. A theoretical model of meaning as context has been proposed in
Clarke (2009, 2012). In that model, the meaning of any string of words is a vector built

from the occurrence of the string in a corpus. This is the most natural extension of
distributional models from words to strings of words: in that model, one builds vectors
for strings of words in exactly the same way as one does for words. The main problem,
however, is that of data sparsity for the occurrences of strings of words. Words do
appear repeatedly in a document, but strings of words, especially for longer strings,
rarely do so; for example, it hardly happens that exactly the same sentence appears more
than once in a document. To overcome this problem, the model is based on the hypo-
thetical concept of an infinite corpus, an assumption that prevents it from being applied
to real-world corpora and experimented with in natural language processing tasks. On
the positive side, the model provides a theoretical study of the abstract properties of
a general bilinear associative composition operator; in particular, it is shown that this
operator encompasses other composition operators, such as addition, multiplication,
and even tensor product.

3. DisCoCat

In Section 2, we discussed lexical DSMs and the problems faced by attempts to provide
a vector composition operation that would allow us to form distributional sentence rep-
resentations as a function of word meaning. In this section, we will present an existing
formalism aimed at solving this compositionality problem, as well as the mathematical
background required to understand it and further extensions, building on the features
and failures of previously discussed attempts at syntactically sensitive compositionality.
Clark, Coecke, and Sadrzadeh (2008) and Coecke, Sadrzadeh, and Clark (2010)
propose adapting a category theoretic model, inspired by the categorical compositional
vector space model of quantum protocols (Abramsky and Coecke 2004), to the task
of compositionality of semantic vectors. Syntactic analysis in the form of pregroup
grammars—a categorial grammar—is given categorical semantics in order to be repre-
sented as a compact closed category P (a concept explained subsequently), the objects of
which are syntactic types and the morphisms of which are the reductions forming the
basis of syntactic analysis. This syntactic category is then mapped onto the semantic
compact closed category FVect of finite dimensional vector spaces and linear maps.
The mapping is done in the product category FVect × P via the following procedure.
Each syntactic type is interpreted as a vector space in which semantic vectors for words
with that particular syntactic type live; the reductions between the syntactic types are
interpreted as linear maps between the interpreted vector spaces of the syntactic types.
The key feature of category theory exploited here is its ability to relate in a canonical
way different mathematical formalisms that have similar structures, even if the original
formalisms belong in different branches of mathematics. In this context, it has enabled
us to relate syntactic types and reductions to vector spaces and linear maps and obtain
a mechanism by which syntactic analysis guides semantic composition operations.

This pairing of syntactic analysis and semantic composition ensures both that
grammaticality restrictions are in place, as in the model of Erk and Padó (2008), and
that syntactically driven semantic composition in the form of inner products provides the im-
plicit disambiguation features of the compositional models of Erk and Padó (2008) and
Mitchell and Lapata (2008). The composition mechanism also involves the projection of
tensored vectors into a common semantic space without the need for full representation
of the tensored vectors in a manner similar to Plate (1991), without restriction to the
nature of the vector spaces it can be applied to. This avoids the problems faced by other
tensor-based composition mechanisms such as Smolensky (1990) and Clark and Pulman
(2007).


The word vectors can be specified model-theoretically and the sentence space can be
defined over Boolean values to obtain grammatically driven truth-theoretic semantics
in the style of Montague (1974), as proposed by Clark, Coecke, and Sadrzadeh (2008).
Some logical operators can be emulated in this setting, such as using swap matrices for
negation, as shown by Coecke, Sadrzadeh, and Clark (2010). Alternatively, corpus-based
variations on this formalism have been proposed by Grefenstette et al. (2011) to obtain
a non-truth theoretic semantic model of sentence meaning for which logical operations
have yet to be defined.

Before explaining how this formalism works in Section 3.3, we will introduce the
notion of pregroup grammars in Section 3.1 and the required basics of category theory
in Section 3.2.

3.1 Pregroup Grammars

Presented by Lambek (1999, 2008) as a successor to his syntactic calculus (Lambek 1958),
pregroup grammars are a class of categorial type grammars with pregroup algebras
as semantics. Pregroups are particularly interesting within the context of this work
because of their well-studied algebraic structure, which can trivially be mapped onto the
structure of the category of vector spaces, as will be discussed subsequently. Logically
speaking, a pregroup is a non-commutative form of Linear Logic (Girard 1987) in which
the tensor and its dual par coincide; this logic is sometimes referred to as Bi-Compact
Linear Logic (Lambek 1999). The formalism works alongside the general guidelines of
other categorial grammars, for example, those of the combinatory categorial grammar
(CCG) designed by Steedman (2001) and Steedman and Baldridge (2011). They consist
of atomic grammatical types that combine to form compound types. A series of CCG-
like application rules allow for type-reductions, forming the basis of syntactic analysis.
As our first step, we show how this syntactic analysis formalism works by presenting
an introduction to pregroup algebras.

Pregroups. A pregroup is an algebraic structure of the form (P, ≤, ·, 1, (−)l, (−)r). Its
elements are defined as follows:

- P is a set {a, b, c, . . .}.
- The relation ≤ is a partial ordering on P.
- The binary operation · is an associative, non-commutative monoid multiplication
  with the type − · − : P × P → P, such that if a, b ∈ P then a · b ∈ P. In other words,
  P is closed under this operation.
- 1 ∈ P is the unit, satisfying a · 1 = a = 1 · a for all a ∈ P.
- (−)l and (−)r are maps with types (−)l : P → P and (−)r : P → P such that for any
  a ∈ P, we have that al, ar ∈ P. The images of these maps are referred to as the left
  and the right adjoints. These are unique and satisfy the following conditions:

  - Reversal: if a ≤ b then bl ≤ al (and similarly for ar, br).
  - Ordering: a · ar ≤ 1 ≤ ar · a and al · a ≤ 1 ≤ a · al.
  - Cancellation: alr = a = arl.
  - Self-adjointness of identity: 1r = 1 = 1l.
  - Self-adjointness of multiplication: (a · b)r = br · ar.







As a notational simplification we write ab for a · b, and if abcd ≤ cd we write abcd →
cd and call this a reduction, omitting the identity wherever it might appear. Monoid
multiplication is associative, so parentheses may be added or removed for notational
clarity without changing the meaning of the expression as long as they are not directly
under the scope of an adjoint operator.

An example reduction in pregroup might be:

aarbclc → bclc → b

We note here that the reduction order is not always unique, as we could have reduced as
follows: aarbclc → aarb → b. As a further notational simplification, if there exists a chain
of reductions a → . . . → b we may simply write a → b (in virtue of the transitivity of
partial ordering relations). Hence in our given example, we can express both reduction
paths as aarbclc → b.

Pregroups and Syntactic Analysis. Pregroups are used for grammatical analysis by freely
generating the set P of a pregroup from the basic syntactic types n, s, . . . , where here
n stands for the type of both a noun and a noun phrase and s for that of a sentence.
The conflation of nouns and noun phrases suggested here is done to keep the work
discussed in this article as simple as possible, but we could of course model them as
different types in a more sophisticated version of this pregroup grammar. As in any
categorial grammar, words of the lexicon are assigned one or more possible types
(corresponding to different syntactic roles) in a predefined type dictionary, y el
grammaticality of a sentence is verified by demonstrating the existence of a reduction
from the combination of the types of words within the sentence to the sentence type s.

For example, having assigned to noun phrases the type n and to sentences the type s,
the transitive verbs will have the compound type nrsnl. We can read from the type of
a transitive verb that it is the type of a word which “expects” a noun phrase on its left
and a noun phrase on its right, in order to produce a sentence. A sample reduction of
John loves cake with John and cake being noun phrases of type n and loves being a verb of
type nrsnl is as follows:

n(nrsnl)n → s

We see that the transitive verb has combined with the subject and object to reduce to a
sentence. Because the combination of the types of the words in the string John loves cake
reduces to s, we say that this string of words is a grammatical sentence. As further
examples, we recall that intransitive verbs can be given the type nrs, such that John sleeps
would be analyzed in terms of the reduction n(nrs) → s. Adjectives can be given the
type nnl, such that red round rubber ball would be analyzed by (nnl)(nnl)(nnl)n → n. And
so on and so forth for other syntactic classes.
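The reduction mechanics just described are simple enough to machine-check. The following Python sketch (our own illustration, not part of the original paper) represents simple types as (base, k) pairs, where k counts adjoints (-1 for a left adjoint, 0 for a basic type, +1 for a right adjoint), and repeatedly cancels adjacent pairs of the form x · xr and xl · x; this greedy cancellation suffices for the small examples used here.

def reduces_to(word_types, target):
    # Concatenate the types of the words in the string, then repeatedly cancel
    # adjacent pairs (a, k), (a, k+1), which covers both a·a^r and a^l·a.
    seq = [t for word in word_types for t in word]
    changed = True
    while changed:
        changed = False
        for i in range(len(seq) - 1):
            (a, ka), (b, kb) = seq[i], seq[i + 1]
            if a == b and kb == ka + 1:
                del seq[i:i + 2]
                changed = True
                break
    return seq == [target]

n, s = ("n", 0), ("s", 0)
n_r, n_l = ("n", 1), ("n", -1)

# John loves cake: n (n^r s n^l) n -> s
print(reduces_to([[n], [n_r, s, n_l], [n]], s))        # True
# kittens sleep: n (n^r s) -> s
print(reduces_to([[n], [n_r, s]], s))                  # True
# red round rubber ball: (n n^l)(n n^l)(n n^l) n -> n
print(reduces_to([[n, n_l]] * 3 + [[n]], n))           # True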

Lambek (2008) presents the details of a slightly more complex pregroup grammar
with a richer set of types than presented here. This grammar is hand-constructed and
iteratively extended by expanding the type assignments as more sophisticated gram-
matical constructions are discussed. No general mechanism is proposed to cover all
such type assignments for larger fragments (e.g., as seen in empirical data). Pregroup
grammars have been proven to be learnable by Béchet, Foret, and Tellier (2007), who
also discuss the difficulty of this task and the nontractability of the procedure. Because
of these constraints and the lack of a workable pregroup parser, the pregroup grammars


we will use in our categorical formalism are derived from CCG types, as we explain in
the following.

Pregroup Grammars and Other Categorial Grammars. Pregroup grammars, in contrast with
other categorial grammars such as CCG, do not yet have a large set of tools for parsing
available. If a quick implementation of the formalism described later in this paper is
required, it would be useful to be able to leverage the mature state of parsing tools
available for other categorial grammars, such as the Clark and Curran (2007) statistical
CCG parser, as well as Hockenmaier's CCG lexicon and treebank (Hockenmaier 2003;
Hockenmaier and Steedman 2007). In other words, is there any way we can translate at
least some subset of CCG types into pregroup types?

There are some theoretical obstacles to consider first: Pregroup grammars and CCG
are not equivalent. Buszkowski (2001) shows pregroup grammars to be equivalent to
context-free grammars, whereas Joshi, Vijay-Shanker, and Weir (1989) show CCG to be
weakly equivalent to more expressive mildly context-sensitive grammars. However, if
our goal is to exploit the CCG used in Clark and Curran’s parser, or Hockenmaier’s
lexicon and treebank, we may be in luck: Fowler and Penn (2010) prove that some
CCGs, such as those used in the aforementioned tools, are strongly context-free and thus
expressively equivalent to pregroup grammars. In order to be able to apply the parsing
tools for CCGs to our setting, we use a translation mechanism from CCG types to pre-
group types based on the Lambek-calculus-to-pregroup-grammar translation originally
presented in Lambek (1999). In this mechanism, each atomic CCG type X is assigned a
unique pregroup type x; for any X/Y in CCG we have xyl in the pregroup grammar; and
for any X\Y in CCG we have yrx in the pregroup grammar. Therefore, by assigning NP to n
and S to s we could, for example, translate the CCG transitive verb type (S\NP)/NP into
the pregroup type nrsnl, which corresponds to the pregroup type we used for transitive
verbs in Section 3.1. Wherever type replacement (e.g., N → NP) is allowed in CCG we
set an ordering relation in the pregroup grammar (e.g., n̄ ≤ n, where n̄ is the pregroup
type associated with N). Because forward and backward slash “operators” in CCG are
not associative whereas monoid multiplication in pregroups is, it is evident that some
information is lost during the translation process. But because the translation we need
is one-way, we may ignore this problem and use CCG parsing tools to obtain pregroup
parses. Another concern lies with CCG’s crossed composition and substitution rules.
The translations of these rules do not in general hold in a pregroup; this is not a surprise
as pregroups are a simplification of the Lambek Calculus and these rules did not hold
in the Lambek Calculus either, as shown in Moortgat (1997), for example. However, for
the phenomena modeled in this paper, the CCG rules without the backward cross rules
will suffice. In general, for the case of English, one can avoid the use of these rules by
overloading the lexicon and using additional categories. To deal with languages that
have cross dependencies, such as Dutch, various solutions have been suggested (e.g.,
see Genkin, Francez, and Kaminski 2010; Preller 2010).
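As an illustration of this translation (our own sketch, not the parser pipeline used in the experiments), the rules X/Y to x·yl and X\Y to yr·x can be implemented directly, reusing the (base, k) encoding of simple pregroup types from the sketch above; the left and right adjoints of a compound type reverse its order and shift the adjoint counter.

# Assumed atomic-type assignment; N is mapped to a distinct type "n_bar" whose
# ordering n_bar <= n would be handled elsewhere.
ATOMIC = {"S": "s", "NP": "n", "N": "n_bar"}

def left_adjoint(ts):
    return [(b, k - 1) for (b, k) in reversed(ts)]

def right_adjoint(ts):
    return [(b, k + 1) for (b, k) in reversed(ts)]

def ccg_to_pregroup(ccg):
    """CCG type (atomic string, or (X, '/', Y) / (X, '\\', Y) tuple) ->
    list of simple pregroup types (base, adjoint_count)."""
    if isinstance(ccg, str):
        return [(ATOMIC[ccg], 0)]
    x, slash, y = ccg
    if slash == "/":                                              # X/Y -> x y^l
        return ccg_to_pregroup(x) + left_adjoint(ccg_to_pregroup(y))
    return right_adjoint(ccg_to_pregroup(y)) + ccg_to_pregroup(x)  # X\Y -> y^r x

# (S\NP)/NP  ->  n^r s n^l, the transitive verb type of Section 3.1
print(ccg_to_pregroup((("S", "\\", "NP"), "/", "NP")))
# [('n', 1), ('s', 0), ('n', -1)]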

3.2 Categories

Category theory is a branch of pure mathematics that allows for a general and uniform
formulation of various different mathematical formalisms in terms of their main struc-
tural properties using a few abstract concepts such as objects, arrows, and combinations
and compositions of these. This uniform conceptual language allows for derivation
of new properties of existing formalisms and for relating these to properties of other
formalisms, if they bear similar categorical representations. In this capacity, it has been


at the center of recent work in unifying two orthogonal models of meaning, a qualitative
categorial grammar model and a quantitative distributional model (Clark, Coecke, and
Sadrzadeh 2008; Coecke, Sadrzadeh, and Clark 2010). Moreover, the unifying categori-
cal structures at work here were inspired by the ones used in the foundations of physics
and the modeling of quantum information flow, as presented in Abramsky and Coecke
(2004), where they relate the logical structure of quantum protocols to their state-based
vector spaces data. The connection between the mathematics used for this branch of
physics and those potentially useful for linguistic modeling has also been noted by
several sources, such as Widdows (2005), Lambek (2010), and Van Rijsbergen (2004).

In this section, we will briefly examine the basics of category theory, monoidal
categories, and compact closed categories. The focus will be on defining enough ba-
sic concepts to proceed, rather than providing a full-blown tutorial on category theory
and the modeling of information flow, as several excellent sources already cover both
aspects (e.g., Mac Lane 1998; Walters 1991; Coecke and Paquette 2011). A categories-
in-a-nutshell crash course is also provided in Clark, Coecke, and Sadrzadeh (2008) and
Coecke, Sadrzadeh, and Clark (2010).

The Basics of Category Theory. A basic category C is defined in terms of the following
elements:

- A collection of objects ob(C).
- A collection of morphisms hom(C).
- A morphism composition operation ◦.

Each morphism f has a domain dom(f) ∈ ob(C) and a codomain codom(f) ∈ ob(C).
For dom(f) = A and codom(f) = B we abbreviate these definitions as f : A → B. Despite
the notational similarity to function definitions (and sets and functions being an ex-
ample of a category), it is important to state that nothing else is presupposed about
morphisms, and we should not treat them a priori as functions.

The following axioms hold in every category C:

- For any f : A → B and g : B → C there exists h : A → C such that h = g ◦ f .
- For any f : A → B, g : B → C, and h : C → D, ◦ satisfies (h ◦ g) ◦ f = h ◦ (g ◦ f ).
- For every A ∈ ob(C) there is an identity morphism idA : A → A such that
  for any f : A → B, f ◦ idA = f = idB ◦ f .

We can model various mathematical formalisms using these basic concepts, and
verify that these axioms hold for them. Examples include the category Set with sets as
objects and functions as morphisms, the category Rel with sets as objects and relations
as morphisms, the category Pos with posets as objects and order-preserving maps as
morphisms, and the category Group with groups as objects and group homomorphisms
as morphisms, to name a few.

The product C × D of two categories C and D is a category with pairs (A, B) as
objects, where A ∈ ob(C) and B ∈ ob(D). There exists a morphism (f, g) : (A, B) → (C, D)
in C × D if and only if there exist f : A → C ∈ hom(C) and g : B → D ∈ hom(D). Product
categories allow us to relate the objects and morphisms of one mathematical formalism
to those of another, in this case those of C to those of D.


Compact Closed Categories. A monoidal category C is a basic category to which we add a
monoidal tensor ⊗ such that:

- For all A, B ∈ ob(C) there is an object A ⊗ B ∈ ob(C).
- For all A, B, C ∈ ob(C), we have (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C).
- There exists some I ∈ ob(C) such that for any A ∈ ob(C), we have I ⊗ A = A = A ⊗ I.
- For f : A → C and g : B → D in hom(C) there is f ⊗ g : A ⊗ B → C ⊗ D in hom(C).
- For f1 : A → C, f2 : B → D, g1 : C → E, and g2 : D → F the following equality holds:

\[ (g_1 \otimes g_2) \circ (f_1 \otimes f_2) = (g_1 \circ f_1) \otimes (g_2 \circ f_2) \]

The product category of two monoidal categories has a monoidal tensor, defined point-
wise by (a, A) ⊗ (b, B) := (a ⊗ b, A ⊗ B).

A compact closed category C is a monoidal category with the following additional
axioms:

- Each object A ∈ ob(C) has left and right "adjoint" objects Al and Ar in ob(C).
- There exist four structural morphisms for each object A ∈ ob(C):

\[ \eta^l_A : I \to A \otimes A^l \qquad \eta^r_A : I \to A^r \otimes A \qquad \epsilon^l_A : A^l \otimes A \to I \qquad \epsilon^r_A : A \otimes A^r \to I \]

- These structural morphisms satisfy the following equalities:

\[ (1_A \otimes \epsilon^l_A) \circ (\eta^l_A \otimes 1_A) = 1_A \qquad (\epsilon^r_A \otimes 1_A) \circ (1_A \otimes \eta^r_A) = 1_A \]
\[ (1_{A^r} \otimes \epsilon^r_A) \circ (\eta^r_A \otimes 1_{A^r}) = 1_{A^r} \qquad (\epsilon^l_A \otimes 1_{A^l}) \circ (1_{A^l} \otimes \eta^l_A) = 1_{A^l} \]

Compact closed categories come equipped with complete graphical calculi, sur-
veyed in Selinger (2010). These calculi visualize and simplify the axiomatic reasoning
within the category to a great extent. In particular, Clark, Coecke, and Sadrzadeh
(2008) and Coecke, Sadrzadeh, and Clark (2010) show how they depict the pregroup
grammatical reductions and visualize the flow of information in composing meanings
of single words and forming meanings for sentences. Although useful at an abstract
level, these calculi do not play the same simplifying role when it comes to the concrete
and empirical computations; therefore we will not discuss them in this article.

A very relevant example of a compact closed category is a pregroup algebra P.
Elements of a pregroup are objects of the category, the partial ordering relation provides
the morphisms, 1 is I, and monoidal multiplication is the monoidal tensor.

Another very relevant example is the category FVect of finite dimensional Hilbert
spaces and linear maps over R—that is, vector spaces over R with orthogonal bases
of finite dimension, and an inner product operation ⟨− | −⟩ : A × A → R for every
vector space A. The objects of FVect are vector spaces, and the morphisms are linear


maps between vector spaces. The unit object is R and the monoidal tensor is the linear
algebraic tensor product. FVect is degenerate in its adjoints, in that for any vector space
A, we have Al = Ar = A*, where A* is the dual space of A. Moreover, by fixing a basis we
obtain that A* ≅ A. As such, we can effectively do away with adjoints in this category,
and "collapse" the εl, εr, ηl, and ηr maps into "adjoint-free" ε and η maps. In this category,
the ε maps are inner product operations, εA : A ⊗ A → R, and the η maps, η : R → A ⊗ A,
generate maximally entangled states, also referred to as Bell states.

3.3 A Categorical Passage from Grammar to Semantics

In Section 3.2 we showed how a pregroup algebra and vector spaces can be modeled
as compact closed categories and how product categories allow us to relate the objects
and morphisms of one category to those of another. In this section, we will present how
Clark, Coecke, and Sadrzadeh (2008) and Coecke, Sadrzadeh, and Clark (2010) suggest
building on this by using categories to relate semantic composition to syntactic analysis
in order to achieve syntax-sensitive composition in DSMs.

3.3.1 Syntax Guides Semantics. The product category FVect × P has as object pairs (A, a),
where A is a vector space and a is a pregroup grammatical type, and as morphism pairs
(f, ≤), where f is a linear map and ≤ a pregroup ordering relation. By the definition of
product categories, for any two vector space-type pairs (A, a) y (B, b), there exists
a morphism (A, a) → (B, b) only if there is both a linear map from A into B and a
partial ordering a → b. If we view these pairings as the association of syntactic types
with vector spaces containing semantic vectors for words of that type, this restriction
effectively states that a linear map from A to B is only “permitted” in the product
category if a reduces to b.

Both P and FVect being compact closed, it is simple to show that FVect × P is as well,
by considering the pairs of unit objects and structural morphisms from the separate
categories: I is now (R, 1), and the structural morphisms are (ε_A, ε^l_a), (ε_A, ε^r_a),
(η_A, η^l_a), and (η_A, η^r_a). We are particularly interested in the ε maps, which are
defined as follows (from the definition of product categories):

\[ (\epsilon_A, \epsilon^l_a) : (A \otimes A,\ a^l a) \to (R, 1) \qquad (\epsilon_A, \epsilon^r_a) : (A \otimes A,\ a a^r) \to (R, 1) \]

This states that whenever there is a reduction step in the grammatical analysis of a
sentence, there is a composition operation in the form of an inner product on the
semantic front. Hence, if nouns of type n live in some noun space N and transitive
verbs of type nrsnl live in some space N ⊗ S ⊗ N, then there must be some structural
morphism of the form:

\[ (\epsilon_N \otimes 1_S \otimes \epsilon_N,\ \epsilon^r_n\, 1_s\, \epsilon^l_n) : (N \otimes (N \otimes S \otimes N) \otimes N,\ n(n^r s n^l)n) \to (S, s) \]


We can read from this morphism the functions required to compose a sentence with a
noun, a transitive verb, and an object to obtain a vector living in some sentence space S,
namely, (ε_N ⊗ 1_S ⊗ ε_N).

The form of a syntactic type is therefore what dictates the structure of the semantic
space associated with it. The structural morphisms of the product category guarantee
that for every syntactic reduction there is a semantic composition morphism provided
by the product category: syntactic analysis guides semantic composition.


3.3.2 Example. To give an example, we give syntactic type n to nouns, and nrs to intran-
sitive verbs. The grammatical reduction for kittens sleep, namely n(nrs) → s, corresponds
to the morphism εr_n ⊗ 1_s in P. The syntactic types dictate that the noun kittens lives
in some vector space N, and the intransitive verb sleep in N ⊗ S. The reduction mor-
phism gives us the composition morphism (ε_N ⊗ 1_S), which we can apply to kittens ⊗ sleep.

Because we can express any vector as the weighted sum of its basis vectors, let
us expand $\overrightarrow{kittens} = \sum_i c^{kittens}_i \vec{n}_i$ and $\overrightarrow{sleep} = \sum_{jk} c^{sleep}_{jk}\, \vec{n}_j \otimes \vec{s}_k$; then we can express the
composition as follows:

\[
\begin{aligned}
\overrightarrow{kittens\ sleep} &= (\epsilon_N \otimes 1_S)\big(\overrightarrow{kittens} \otimes \overrightarrow{sleep}\big) \\
&= (\epsilon_N \otimes 1_S)\Big(\sum_i c^{kittens}_i \vec{n}_i \otimes \sum_{jk} c^{sleep}_{jk}\, \vec{n}_j \otimes \vec{s}_k\Big) \\
&= (\epsilon_N \otimes 1_S)\sum_{ijk} c^{kittens}_i c^{sleep}_{jk}\, \vec{n}_i \otimes \vec{n}_j \otimes \vec{s}_k \\
&= \sum_{ijk} c^{kittens}_i c^{sleep}_{jk}\, \langle \vec{n}_i \mid \vec{n}_j \rangle\, \vec{s}_k \\
&= \sum_{ik} c^{kittens}_i c^{sleep}_{ik}\, \vec{s}_k
\end{aligned}
\]

In these equations, we have expressed the vectors in their explicit form, we have
consolidated the sums by virtue of the distributivity of the linear algebraic tensor product
over addition, we have applied the tensored linear maps to the vector components (as the
weights are scalars), and finally, we have simplified the indices since $\langle \vec{n}_i \mid \vec{n}_j \rangle = 1$ if
$\vec{n}_i = \vec{n}_j$ and 0 otherwise. As a result of these steps, we have obtained a vector that lives in
the sentence space S.

Transitive sentences can be dealt with in a similar fashion:

\[
\begin{aligned}
\overrightarrow{kittens\ chase\ mice} &= (\epsilon_N \otimes 1_S \otimes \epsilon_N)\big(\overrightarrow{kittens} \otimes \overrightarrow{chase} \otimes \overrightarrow{mice}\big) \\
&= (\epsilon_N \otimes 1_S \otimes \epsilon_N)\Big(\sum_i c^{kittens}_i \vec{n}_i \otimes \sum_{jkl} c^{chase}_{jkl}\, \vec{n}_j \otimes \vec{s}_k \otimes \vec{n}_l \otimes \sum_m c^{mice}_m \vec{n}_m\Big) \\
&= (\epsilon_N \otimes 1_S \otimes \epsilon_N)\sum_{ijklm} c^{kittens}_i c^{chase}_{jkl} c^{mice}_m\, \vec{n}_i \otimes \vec{n}_j \otimes \vec{s}_k \otimes \vec{n}_l \otimes \vec{n}_m \\
&= \sum_{ijklm} c^{kittens}_i c^{chase}_{jkl} c^{mice}_m\, \langle \vec{n}_i \mid \vec{n}_j \rangle\, \vec{s}_k\, \langle \vec{n}_l \mid \vec{n}_m \rangle \\
&= \sum_{ikm} c^{kittens}_i c^{chase}_{ikm} c^{mice}_m\, \vec{s}_k
\end{aligned}
\]

In both cases, it is important to note that the tensor product passed as argument
to the composition morphism, namely, kittens ⊗ sleep in the intransitive case and
kittens ⊗ chase ⊗ mice in the transitive case, never needs to be computed. We can
treat the tensor products here as commas separating function arguments, thereby
avoiding the dimensionality problems presented by earlier tensor-based approaches to
compositionality.
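A small numerical sketch (our own illustration; the dimensionalities and random vectors are assumed toy data) makes this point concrete: with an intransitive verb stored as an element of N ⊗ S and a transitive verb as an element of N ⊗ S ⊗ N, the ε maps become index contractions, and the tensor product of the arguments is never materialized.

import numpy as np

dim_N, dim_S = 4, 3
rng = np.random.default_rng(0)

kittens = rng.random(dim_N)                    # noun vector in N
mice    = rng.random(dim_N)
sleep   = rng.random((dim_N, dim_S))           # element of N (x) S
chase   = rng.random((dim_N, dim_S, dim_N))    # element of N (x) S (x) N

# (eps_N (x) 1_S)(kittens (x) sleep) = sum_ik c^kittens_i c^sleep_ik s_k
kittens_sleep = np.einsum("i,ik->k", kittens, sleep)

# (eps_N (x) 1_S (x) eps_N)(kittens (x) chase (x) mice)
#   = sum_ikm c^kittens_i c^chase_ikm c^mice_m s_k
kittens_chase_mice = np.einsum("i,ikm,m->k", kittens, chase, mice)

print(kittens_sleep.shape, kittens_chase_mice.shape)   # (3,) (3,)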

3.4 This Article and the DisCoCat Literature

As elaborated on in Section 2.3.4, the first general setting for pairing meaning vec-
tors with syntactic types was proposed in Clark and Pulman (2007). The setting of a
DisCoCat generalized this by making the meaning derivation process rely on a syntactic
type system, hence overcoming its central problem whereby the vector representations
of strings of words with different grammatical structure lived in different spaces. A
preliminary version of a DisCoCat was developed in Clark, Coecke, and Sadrzadeh
(2008), and a full version was elaborated on in Coecke, Sadrzadeh, and Clark (2010), where,
based on the developments of Preller and Sadrzadeh (2010), it was also exemplified
how the vector space model may be instantiated in a truth-theoretic setting where
meanings of words were sets of their denotations and meanings of sentences were their
truth values. A nontechnical description of this theoretical setting was presented in
Clark (2013), where a plausibility truth-theoretic model for sentence spaces was worked
out and exemplified. The work of Grefenstette et al. (2011) focused on a tangential
branch and developed a toy example where neither words nor sentence spaces were
Boolean. The applicability of the theoretical setting to a real empirical natural language
processing task and data from a large scale corpus was demonstrated in Grefenstette
and Sadrzadeh (2011a, 2011b). There, we presented a general algorithm to build vector
representations for words with simple and complex types and for the sentences containing
them; we then applied the algorithm to a disambiguation task performed on the British
National Corpus (BNC). We also investigated the vector representation of transitive
verbs and showed how a number of single operations may optimize the performance.
We discuss these developments in detail in the following sections.

4. Concrete Semantic Spaces

In Section 3.3 we presented a categorical formalism that relates syntactic analysis
steps to semantic composition operations. The structure of our syntactic representation
dictates the structure of our semantic spaces, but in exchange, we are provided with
composition functions by the syntactic analysis, rather than having to stipulate them
ad hoc. Whereas the approaches to compositional DSMs presented in Section 2 either
failed to take syntax into account during composition, or did so at the cost of not being
able to compare sentences of different structure in a common space, this categorical
approach projects all sentences into a common sentence space where they can be directly
comparado. Sin embargo, this alone does not give us a compositional DSM.

As we have seen in the previous examples, the structure of semantic spaces varies
with syntactic types. We therefore cannot construct vectors for different syntactic types
in the same way, as they live in spaces of different structure and dimensionality. Further-
more, nothing has yet been said about the structure of the sentence space S into which
expressions reducing to type s are projected. If we wish to have a compositional DSM


that leverages all the benefits of lexical DSMs and ports them to sentence-level distribu-
tional representations, we must specify a new sort of vector construction procedure.

In the original formulation of this formalism by Clark, Coecke, and Sadrzadeh
(2008) and Coecke, Sadrzadeh, and Clark (2010), examples of how such a compositional
DSM could be used for logical evaluation are presented, where S is defined as a Boolean
space with True and False as basis vectors. Sin embargo, the word vectors used are hand-
written and specified model-theoretically, as the authors leave it for future research to
determine how such vectors might be obtained from a corpus. In this section, we will
discuss a new way of constructing vectors for compositional DSMs, and of defining
sentence space S, in order to reconcile this powerful categorical formalism with the
applicability and flexibility of standard distributional models.

4.1 Defining Sentence Space

Assume the following sentences are all true:

1. The dogs chased the cats.

2. The dogs annoyed the cats.

3. The puppies followed the kittens.

4. The train left the station.

5. The president followed his agenda.

If asked which sentences have similar meaning, we would most likely point to the
pair (1) and (3), and perhaps to a lesser degree (1) and (2), and (2) and (3). Sentences (4)
and (5) obviously speak of a different state of the world than the other sentences.

If we compare these by truth value, we obviously have no means of making such
distinctions. If we compare these by lexical similarity, (1) and (2) seem to be a closer
match than (1) and (3). If we are classifying these sentences by some higher order
relation such as "topic," (5) might end up closer to (3) than (1). What, then, might cause
us to pair (1) and (3)?

Intuitively, this similarity seems to be because the subjects and objects brought into
relation by similar verbs are themselves similar. Abstracting away from tokens to some
notion of property, we might say that both (1) y (3) express the fact that something
furry and feline and furtive is being pursued by something aggressive (or playful)
and canine. Playing along with the idea that lexical distributional semantics present
concepts (word meanings) as "messy bundles of properties," it seems only natural to
have the way these properties are acted upon, qualified, and related as the basis for
sentence-level distributional representations. In this respect, we here suggest that the
sentence space S, instead of qualifying the truth value of a sentence, should express
how the properties of the semantic objects within are qualified or brought into relation
by verbs, adjectives, and other predicates and relations.

More specifically, we examine two suggestions for defining the sentence space,
namely, SI ≅ N for sentences with intransitive verbs and ST ≅ N ⊗ N for sentences
with transitive verbs. These definitions mean that the sentence space's dimensions are
commensurate with either those of N, or those of N ⊗ N. These are by no means the
only options, but as we will discuss here, they offer practical benefits.

In the case of SI, the basis elements are labeled with unique basis elements of N;
hence, $\vec{s}_1 = \vec{n}_1$, $\vec{s}_2 = \vec{n}_2$, and so on. In the case of ST, the basis elements are labeled with


unique ordered pairs of elements from N, for example, $\vec{s}_1 = \overrightarrow{(n_1, n_1)}$, $\vec{s}_2 = \overrightarrow{(n_2, n_1)}$, $\vec{s}_3 = \overrightarrow{(n_1, n_2)}$,
and so on, following the same matrix flattening structure used in the proof of
the equal cardinality of N and Q. Because of the isomorphism between ST and N ⊗ N, we
will use the notations $\overrightarrow{(n_i, n_j)}$ and $\vec{n}_i \otimes \vec{n}_j$ interchangeably, as both constitute appropriate
ways of representing the basis elements of such a space. To propagate this distinction
on the syntactic level, we define types sI and sT for intransitive and transitive sentences,
respectively.

In creating this distinction, we lost one of the most appealing features of the frame-
work of Coecke, Sadrzadeh, and Clark (2010), namely, the result that all sentence vectors
live in the same sentence space. A mathematical solution to this two-space problem was
suggested in Grefenstette et al. (2011), and a variant of the models presented in this
article permitting the non-problematic projection into a single sentence space (S ≅ N)
has been presented by Grefenstette et al. (2013). Keeping this separation allows us to
deal with both the transitive and intransitive cases in a simpler manner, and because
the experiments in this article only compare intransitive sentences with intransitive
sentences, and transitive sentences with transitive sentences, we will not address the
issue of unification here.

4.2 Noun-Oriented Types

While Lambek’s pregroup types presented in Lambek (2008) include a rich array of
basic types and hand-designed compound types in order to capture specific grammatical
properties, for the sake of simplicity we will use a simpler set of grammatical types
for experimental purposes, similar to some common types found in the CCG-bank
(Steedman 2001).

We assign a basic pregroup type n for all nouns, with an associated vector space N
for their semantic representations. Furthermore, we will treat noun phrases as nouns,
assigning to them the same pregroup type and semantic space.

CCG treats intransitive verbs as functions NP\S that consume a noun phrase and
return a sentence, and transitive verbs as functions (NP\S)/NP that consume a noun
phrase and return an intransitive verb function, which in turn consumes a noun phrase
and returns a sentence. Using our distinction between intransitive and transitive sentences,
we give intransitive verbs the type nrsI, associated with the semantic space N ⊗ SI, and
transitive verbs the type nrsTnl, associated with the semantic space N ⊗ ST ⊗ N.

Adjectives, in CCG, are treated as functions NP/NP, consuming a noun phrase and
returning a noun phrase; hence we give them the type nnl and the associated semantic
space N ⊗ N.

With the provision of a learning procedure for vectors in these semantic spaces, we
can use these types to construct sentence vector representations for simple intransitive
verb–based and transitive verb–based sentences, with and without adjectives applied
to subjects and objects.

4.3 Learning Procedures

To begin, we construct the semantic space N for all nouns in our lexicon (typically lim-
ited by the words available in the corpus used). Any distributional semantic model can
be used for this stage, such as those presented in Curran (2004), or the lexical semantic
models used by Mitchell and Lapata (2008). It seems reasonable to assume that higher
quality lexical semantic vectors—as measured by metrics such as the WordSim353 test


of Finkelstein et al. (2001)—will produce better relational vectors from the procedure
designed subsequently. We will not test the hypothesis here, but note that it is an
underlying assumption in most of the current literature on the subject (Erk and Padó
2008; Mitchell and Lapata 2008; Baroni and Zamparelli 2010).
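As a minimal sketch of this first stage (one possible lexical DSM among many; the window size and weighting are our own assumptions, not the exact configuration used in the experiments), noun vectors can be built by counting context words within a fixed window around each occurrence of a target noun, and later reweighted with a scheme such as TF/IDF:

def build_noun_space(corpus, targets, context_words, window=5):
    """corpus: list of token lists; returns {noun: co-occurrence count vector}."""
    index = {w: i for i, w in enumerate(context_words)}
    vectors = {t: [0.0] * len(context_words) for t in targets}
    for sentence in corpus:
        for pos, token in enumerate(sentence):
            if token not in vectors:
                continue
            lo, hi = max(0, pos - window), pos + window + 1
            for ctx in sentence[lo:pos] + sentence[pos + 1:hi]:
                if ctx in index:
                    vectors[token][index[ctx]] += 1.0
    return vectors

# Toy usage with assumed data:
corpus = [["the", "angry", "dog", "chased", "the", "cat"],
          ["the", "dog", "slept"]]
print(build_noun_space(corpus, ["dog", "cat"], ["angry", "chased", "slept"]))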

Building upon the foundation of the thus constructed noun vectors, we construct
semantic representations for relational words. In pregroup grammars (or other com-
binatory grammars such as CCG), we can view such words as functions, taking as
arguments those types present as adjoints in the compound pregroup type, and return-
ing a syntactic object whose type is that of the corresponding reduction. For example,
an adjective nnl takes a noun or noun phrase n and returns a noun phrase n from the
reduction (nnl)n → n. It can also compose with another adjective to return an adjective
(nnl)(nnl) → nnl. We wish for our semantic representations to be viewed in the same
way, such that the composition of an adjective with a noun, (1_N ⊗ ε_N)((N ⊗ N) ⊗ N), can
be viewed as the application of a function f : N → N to its argument of type N.

To learn the representation of such functions, we assume that their meaning can
be characterized by the properties that their arguments hold in the corpus, rather than
just by their context as is the case in lexical distributional semantic models. To give
an example, rather than learning what the adjective angry means by observing that
it co-occurs with words such as fighting, aggressive, or mean, we learn its meaning by
observing that it typically takes as argument words that have the property of being
(i.e., co-occur with words such as) fighting, aggressive, and mean. Whereas in the lexical
semantic case, such associations might only rarely occur in the corpus, in this indirect
method we learn what properties the adjective relates to even if they do not co-occur
with it directly.

In turn, through composition with its argument, we expect the function for such
an adjective to strengthen the properties that characterize it in the object it takes as
argument. If, indeed, angry is characterized by arguments that have high basis weights
for basis elements corresponding to the concepts (or context words) fighting, aggressive,
and mean, and relatively low counts for semantically different concepts such as passive,
peaceful, and loves, then when we apply angry to dog the vector for the compound
angry dog should contain some of the information found in the vector for dog. But this
vector should also have higher values for the basis weights of fighting, aggressive, and
mean, and correspondingly lower values for the basis weights of passive, peaceful, and
loves.

To turn this idea into a concrete algorithm for constructing the semantic repre-
sentation for relations of any arity, as first presented in Grefenstette et al. (2011), we
examine how we would deal with this for binary relations such as transitive verbs. If
a transitive verb of semantic type N ⊗ ST ⊗ N is viewed as a function f : N × N → ST
that expresses the extent to which the properties of subject and object are brought into
relation by the verb, we learn the meaning of the verb by looking at what properties are
brought into relation by its arguments in a corpus. Recall that the vector for a verb v,
$\vec{v} \in N \otimes S_T \otimes N$, can be expressed as the weighted sum of its basis elements:

\[ \vec{v} = \sum_{ijk} c^v_{ijk}\, \vec{n}_i \otimes \vec{s}_j \otimes \vec{n}_k \]

We take the set of vectors for the subject and object of v in the corpus to be
$\mathrm{arg}_v = \{(\overrightarrow{SUBJ_l}, \overrightarrow{OBJ_l})\}_l$. We wish to calculate the basis weightings $\{c^v_{ijk}\}_{ijk}$ for v. Exploiting our
earlier definition of the basis $\{s_j\}_j$ of ST, which states that for any value of i and k there

is some value of j such that $s_j = (n_i, n_k)$, we define $\Delta_{ijk} = 1$ if indeed $s_j = (n_i, n_k)$ and 0
otherwise. Using all of this, we define the calculation of each basis weight $c^v_{ijk}$ as:

\[ c^v_{ijk} = \sum_l \Delta_{ijk}\, c^{SUBJ_l}_i c^{OBJ_l}_k \]

This allows for a full formulation of $\vec{v}$ as follows:

\[ \vec{v} = \sum_l \sum_{ijk} \Delta_{ijk}\, c^{SUBJ_l}_i c^{OBJ_l}_k\, \vec{n}_i \otimes \vec{s}_j \otimes \vec{n}_k \]

The nested sums here may seem computationally inefficient, seeing how this
would involve computing size(argv) × dim(N)^2 × dim(S) = size(argv) × dim(N)^4 prod-
ucts. However, using the decomposition of basis elements of S into pairs of basis
elements of N (effectively basis elements of N ⊗ N), we can remove the Δijk term and
ignore all values of j where sj ≠ (ni, nk), because the basis weight for this combination
of indices would be 0. Hence we simplify:

\[ \vec{v} = \sum_l \sum_{ik} c^{SUBJ_l}_i c^{OBJ_l}_k\, \vec{n}_i \otimes \overrightarrow{(n_i, n_k)} \otimes \vec{n}_k \]

This representation is still bloated: We perform fewer calculations, but still obtain a
vector in which all the basis weights where sj ≠ (ni, nk) are 0, hence where only dim(N)^2
of the dim(N)^4 values are non-zero. In short, the vector weights for $\vec{v}$ are, under this
learning algorithm, entirely characterized by the values of a dim(N) by dim(N) matrix,
the entries of which are products $c^{SUBJ_l}_i c^{OBJ_l}_k$ where i and k have become row and column
indices.

Using this and our definition of ST as a space isomorphic to N ⊗ N, we can formu-
late a compact expression of $\vec{v}$ as follows. Let the Kronecker product of two vectors
$\vec{u}, \vec{w} \in N$, written $\vec{u} \otimes \vec{w} \in N \otimes N$, be as follows:

\[ \vec{u} \otimes \vec{w} = \sum_{ij} c^u_i c^w_j\, \vec{n}_i \otimes \vec{n}_j \]

Equipped with this definition, we can formulate the compact form of $\vec{v}$:

\[ \vec{v} = \sum_l \sum_{ik} c^{SUBJ_l}_i c^{OBJ_l}_k\, \vec{n}_i \otimes \vec{n}_k = \sum_l \overrightarrow{SUBJ_l} \otimes \overrightarrow{OBJ_l} \]

In short, we are only required to iterate through the corpus once, taking for each
instance of a transitive verb v the Kronecker product of its subject and object, and
summing these across all instances of v. It is simple to see that no information was
discarded relative to the previous definition of $\vec{v}$: The dimensionality reduction by a
factor of dim(N)^2 simply discards all basis elements for which the basis weight was 0 by
default.
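The compact learning rule derived above is straightforward to implement. The sketch below (our own illustration; the subject and object vectors are assumed to come from the noun space N built earlier) accumulates the Kronecker products of the subject and object vectors of each corpus occurrence of the verb; each entry of the resulting matrix is the quantity sum_l c^{SUBJ_l}_i c^{OBJ_l}_k from the derivation.

import numpy as np

def learn_transitive_verb(subject_object_pairs, dim_N):
    """verb = sum_l SUBJ_l (x) OBJ_l, stored as a dim_N x dim_N matrix."""
    verb = np.zeros((dim_N, dim_N))
    for subj_vec, obj_vec in subject_object_pairs:
        verb += np.outer(subj_vec, obj_vec)
    return verb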


This raises a small problem though: This compact representation can no longer be
used in the compositional mechanism presented in Section 3.3, as the dimensions of
$\vec{v}$ no longer match those that it is required to have according to its syntactic type.
However, a solution can be devised if we return to the sample calculation shown in
Section 3.3.2 of the composition of a transitive verb with its arguments. The composition
reduces as follows:

\[ (\epsilon_N \otimes 1_S \otimes \epsilon_N)\big(\overrightarrow{SUBJ} \otimes \vec{v} \otimes \overrightarrow{OBJ}\big) = \sum_{ikm} c^{SUBJ}_i c^v_{ikm} c^{OBJ}_m\, \vec{s}_k \]

where the verb v is represented in its non-compact form. If we introduce the compact-
ness given to us by our isomorphism ST ≅ N ⊗ N, we can express this as

\[ \overrightarrow{SUBJ\ v\ OBJ} = \sum_{im} c^{SUBJ}_i c^v_{im} c^{OBJ}_m\, \vec{n}_i \otimes \vec{n}_m \]

where v is represented in its compact form. Furthermore, by introducing the component-
wise multiplication operation ⊙:

\[ \vec{u} \odot \vec{v} = \sum_i c^u_i c^v_i\, \vec{n}_i \]

we can show the general form of transitive verb composition with the reduced verb
representation to be as follows:

\[ \overrightarrow{SUBJ\ v\ OBJ} = \sum_{im} c^{SUBJ}_i c^v_{im} c^{OBJ}_m\, \vec{n}_i \otimes \vec{n}_m = \Big(\sum_{im} c^v_{im}\, \vec{n}_i \otimes \vec{n}_m\Big) \odot \Big(\sum_{im} c^{SUBJ}_i c^{OBJ}_m\, \vec{n}_i \otimes \vec{n}_m\Big) = \vec{v} \odot \big(\overrightarrow{SUBJ} \otimes \overrightarrow{OBJ}\big) \]

To summarize what we have done with transitive verbs:

1. We have treated them as functions, taking two nouns and returning a
sentence in a space ST ≅ N ⊗ N.

2. We have built them by counting what properties of subject and object
noun vectors are related by the verb in transitive sentences in a corpus.

3. We have assumed that output properties were a function of input properties
by the use of the Δijk function—for example, the weights associated with
ni from the subject argument and with nk from the object argument only
affect the "output" weight for the sentence basis element $\vec{s}_j = \vec{n}_i \otimes \vec{n}_k$.

4. We have shown that this leads to a compact representation
of the verb's semantic form, and an optimized learning procedure.

5. We have shown that the composition operations
of our formalism can be adapted to this compact representation.
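Point 5 above corresponds to a one-line operation on the compact representation: composition is the component-wise product of the verb matrix with the Kronecker product of subject and object. The following sketch (our own illustration, under the same assumptions as the learning sketch above) implements it; the resulting matrix can be flattened row-wise to give the sentence vector in ST, mirroring the basis ordering of Section 4.1.

import numpy as np

def compose_transitive(subj_vec, verb_matrix, obj_vec):
    """SUBJ verb OBJ = verb (.) (SUBJ (x) OBJ), an element of S_T ~ N (x) N."""
    return verb_matrix * np.outer(subj_vec, obj_vec)

# Usage: compose_transitive(subj, verb, obj).flatten() gives the S_T vector.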


The compact representation and amended composition operation hinge on the
choice of ST ≅ N ⊗ N as the output type for N ⊗ N as the input type (the pair of arguments
the verb takes), justifying our choice of a transitive sentence space. In the intransitive
case, the same phenomenon can be observed, since such a verb takes as argument a
vector in N and produces a vector in SI ≅ N. Furthermore, our choice to make all other
types dependent on one base type—namely, n (with associated semantic space N)—
yields the same property for every relational word we wish to learn: The output type is
the same as the concatenated (on the syntactic level) or tensored (on the semantic level)
input types. It is this symmetry between input and output types that guarantees that
any m-ary relation, expressed in the original formulation as an element of the tensor space
$\underbrace{N \otimes \cdots \otimes N}_{2m}$, has a compact representation in $\underbrace{N \otimes \cdots \otimes N}_{m}$, where the ith basis weight
of the reduced representation stands for the degree to which the ith element of the input
vector affects the ith element of the output vector.

This allows the specification of the generalized learning algorithm for reduced rep-
resentations, first presented in Grefenstette and Sadrzadeh (2011a), which is as follows.
Each relational word P with grammatical type π and m adjoint types α1, α2, · · · , αm is en-
coded as an (r × · · · × r) multi-dimensional array with m degrees of freedom (i.e., a rank
m tensor). Because our vector space N has a fixed basis, each such array is represented
in vector form as follows:

\[ \vec{P} = \sum_{ij\cdots\zeta} c_{ij\cdots\zeta}\, \underbrace{(\vec{n}_i \otimes \vec{n}_j \otimes \cdots \otimes \vec{n}_\zeta)}_{m} \]

This vector lives in the tensor space $\underbrace{N \otimes N \otimes \cdots \otimes N}_{m}$. Each $c_{ij\cdots\zeta}$ is computed according
to the procedure described in Figure 2. Linear algebraically, this procedure corresponds
to computing the following:

\[ \vec{P} = \sum_k \big(\vec{w}_1 \otimes \vec{w}_2 \otimes \cdots \otimes \vec{w}_m\big)_k \]

1) Consider a sequence of words containing a relational word P and its arguments w1,
w2, · · · , wm, occurring in the same order as described in P's grammatical type π.
Refer to these sequences as P-relations. Suppose there are k of them.
2) Retrieve the vector $\vec{w}_l$ of each argument wl.
3) Suppose w1 has weight $c^1_i$ on basis vector $\vec{n}_i$, w2 has weight $c^2_j$ on basis vector $\vec{n}_j$, · · · ,
and wm has weight $c^m_\zeta$ on basis vector $\vec{n}_\zeta$. Multiply these weights:

\[ c^1_i \times c^2_j \times \cdots \times c^m_\zeta \]

4) Repeat the above steps for all the k P-relations, and sum the corresponding weights:

\[ c_{ij\cdots\zeta} = \sum_k \big(c^1_i \times c^2_j \times \cdots \times c^m_\zeta\big)_k \]

Figure 2
Procedure for learning weights for matrices of words P with relational types π of m arguments.
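A compact implementation of the Figure 2 procedure (our own sketch; argument vectors are assumed to come from the noun space N) sums, over all P-relations found in the corpus, the outer product of the m argument vectors, yielding a rank-m array.

import numpy as np
from functools import reduce

def learn_relation(p_relations):
    """p_relations: list of tuples (w1_vec, ..., wm_vec), one per P-relation.
    Returns the rank-m array sum_k (w1 (x) w2 (x) ... (x) wm)_k."""
    return sum(reduce(np.multiply.outer, arg_vecs) for arg_vecs in p_relations)

# e.g., for an adjective (one argument): learn_relation([(dog_vec,), (cat_vec,)])
# e.g., for a transitive verb: learn_relation([(subj_vec, obj_vec), ...])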


The general formulation of composing a relational word P with its arguments
arg1, . . . , argm is now expressed as

\[ \vec{P} \odot \big(\overrightarrow{arg_1} \otimes \cdots \otimes \overrightarrow{arg_m}\big) \]

For example, the computation of furry cats nag angry dogs would correspond to the
following operations:

\[ \overrightarrow{furry\ cats\ nag\ angry\ dogs} = \overrightarrow{nag} \odot \Big(\big(\overrightarrow{furry} \odot \overrightarrow{cat}\big) \otimes \big(\overrightarrow{angry} \odot \overrightarrow{dog}\big)\Big) \]

This generalized learning algorithm effectively extends the coverage of our ap-
proach to any sentence for which we have a pregroup parse (e.g., as might be obtained
by our translation mechanism from CCG). For example, determiners would have a
type nnl, allowing us to model them as matrices in N ⊗ N; adverbs would be of type
(nrs)r(nrs), and hence be elements of S ⊗ N ⊗ N ⊗ S. We could learn them using the
procedure defined earlier, although for words like determiners and conjunctions, it is
unclear whether we would want to learn such logical words or design them by hand,
as was done in Coecke, Sadrzadeh, and Clark (2010) for negation. However, the pro-
cedure given here allows us to generalize the work presented in this article to sentences
of any length or structure based on the pregroup types of the words they contain.
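The generalized composition above, and the furry cats nag angry dogs example in particular, can be sketched as follows (our own illustration with assumed toy vectors; compact adjectives are vectors in N and the compact transitive verb is a dim(N) × dim(N) matrix).

import numpy as np
from functools import reduce

def compose(P, *args):
    """P (.) (arg1 (x) ... (x) argm), with P a rank-m array over N."""
    return P * reduce(np.multiply.outer, args)

dim_N = 4
rng = np.random.default_rng(1)
cat, dog = rng.random(dim_N), rng.random(dim_N)
furry, angry = rng.random(dim_N), rng.random(dim_N)      # compact adjectives in N
nag = rng.random((dim_N, dim_N))                         # compact transitive verb

sentence = compose(nag, compose(furry, cat), compose(angry, dog))
print(sentence.shape)    # (4, 4): an element of S_T ~ N (x) N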

4.4 Example

To conclude this section with an example taken from Grefenstette and Sadrzadeh
(2011a), we demonstrate how the meaning of the word show might be learned from a
corpus and then composed.

Suppose there are two instances of the verb show in the corpus:

s1 = table show result
s2 = map show location

The vector of show is

\[ \overrightarrow{show} = \overrightarrow{table} \otimes \overrightarrow{result} + \overrightarrow{map} \otimes \overrightarrow{location} \]

Consider a vector space N with four basis vectors far, room, scientific, and elect. The
TF/IDF-weighted values for vectors of selected nouns (built from the BNC) are as
shown in Table 1. Part of the compact matrix representation of show is presented in Table 2.

Table 1
Sample weights for selected noun vectors.

i   n_i          table   map    result   location   master   dog
1   far           6.6    5.6     7        5.9        4.3     5.5
2   room         27      7.4     1.0      7.3        9.1     7.6
3   scientific    0      5.4    13        6.1        4.1     0
4   elect         0      0       4.2      0          6.2     0


Table 2
Sample semantic matrix for show.

              far       room     scientific   elect
far            79.24     47.41    119.96       27.72
room          232.66     80.75    396.14      113.2
scientific     32.94     31.86     32.94        0
elect           0         0         0           0

As a sample computation, the weight c11 for the basis vector (n1, n1), that is, (far, far), is computed
by multiplying the weights of table and result on far (6.6 × 7), multiplying the weights of map
and location on far (5.6 × 5.9), and then adding these (46.2 + 33.04), obtaining the
total weight 79.24. Similarly, the weight c21 for the basis vector (n2, n1), that is, (room, far), is
computed by multiplying the weight of room for table by the weight of far for result
(27 × 7), then multiplying the weight of room for map by the weight of far for location
(7.4 × 5.9), and then adding these (189 + 43.66) to obtain the total weight of 232.66.

We now wish to compute the vector for the sentence [the] master shows [his] dog,
omitting the determiner and possessive for simplicity, as we have left open the question
as to whether or not we would want to learn them using the generalized procedure
from Section 4.3 or specify them by design due to their logical structure. The calculation
according to the vectors in Table 1 and the matrix in Table 2 will be:


\[
\overrightarrow{master\ show\ dog} = \overrightarrow{show} \odot \big(\overrightarrow{master} \otimes \overrightarrow{dog}\big)
\]
\[
= \begin{pmatrix} 79.24 & 47.41 & 119.96 & 27.72 \\ 232.66 & 80.75 & 396.14 & 113.2 \\ 32.94 & 31.86 & 32.94 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \odot \begin{pmatrix} 23.65 & 32.68 & 0 & 0 \\ 50.05 & 69.16 & 0 & 0 \\ 22.55 & 31.16 & 0 & 0 \\ 34.1 & 47.12 & 0 & 0 \end{pmatrix}
\]
\[
= \begin{pmatrix} 1{,}874.03 & 1{,}549.36 & 0 & 0 \\ 11{,}644.63 & 5{,}584.67 & 0 & 0 \\ 742.80 & 992.76 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}
\]


Flattening the final matrix representation row-wise gives us the result we seek,
namely, the sentence vector in ST for [the] master showed [his] dog:

\[ \overrightarrow{master\ show\ dog} = [1{,}874,\ 1{,}549,\ 0,\ 0,\ 11{,}645,\ 5{,}585,\ 0,\ 0,\ 743,\ 993,\ 0,\ 0,\ 0,\ 0,\ 0,\ 0] \]
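The worked example can be checked numerically with the learning and composition sketches above (our own illustration; small deviations from Tables 1 and 2 are possible because the published tables are rounded).

import numpy as np

# Noun vectors from Table 1, over the basis (far, room, scientific, elect)
table    = np.array([6.6, 27.0, 0.0, 0.0])
result   = np.array([7.0, 1.0, 13.0, 4.2])
map_     = np.array([5.6, 7.4, 5.4, 0.0])
location = np.array([5.9, 7.3, 6.1, 0.0])
master   = np.array([4.3, 9.1, 4.1, 6.2])
dog      = np.array([5.5, 7.6, 0.0, 0.0])

show = np.outer(table, result) + np.outer(map_, location)   # sum_l SUBJ_l (x) OBJ_l
master_show_dog = show * np.outer(master, dog)              # show (.) (master (x) dog)

print(show.round(2))                       # cf. Table 2
print(master_show_dog.flatten().round())   # cf. the flattened sentence vector above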

5. Experiments

Evaluating compositional models of semantics is no easy task: First, we are trying
to evaluate how well the compositional process works; second, we are also trying
to determine how useful the final representation—the output of the composition—is,
relative to our needs.


The scope of this second problem covers most phrase and sentence-level semantic
models, from bag-of-words approaches in information retrieval to logic-based formal
semantic models, via language models used for machine translation. It is heavily task
dependent, in that a representation that is suitable for machine translation may not be
appropriate for textual inference tasks, and one that is appropriate for IR may not be
ideal for paraphrase detection. Therefore this aspect of semantic model evaluation ide-
ally should take the form of application-oriented testing. For example, to test semantic
representations designed for machine translation purposes, we should use a machine
translation evaluation task.

The DisCoCat framework (Clark, Coecke, and Sadrzadeh 2008; Coecke, Sadrzadeh,
and Clark 2010), described in Section 3, allows for the composition of any words of
any syntactic type. The general learning algorithm presented in Section 4.3 can technically
be applied to learn and model relations of any semantic type. However, many
open questions remain, such as how to deal with logical words, determiners, and
quantification, and how to reconcile the different semantic types used for transitive
and intransitive sentences. We will leave these questions for future work,
briefly discussed in Section 6. In the meantime, we are concretely left with a way of
satisfactorily modeling only simple sentences without having to answer these bigger
questions.

With this in mind, in this section we present a series of experiments centered around
evaluating how well various models of semantic vector composition perform (along
with the one described in Section 4) in a phrase similarity comparison task. This task
aims to test the quality of a compositional process by determining how well it forms a
clear joint meaning for potentially ambiguous words. The intuition here is that tokens,
on their own, can have several meanings, and that it is through the compositional
process—through giving them context—that we understand their specific meaning. For
example, bank itself could (among other meanings) mean a river bank or a financial
bank; yet in the context of a sentence such as The bank refunded the deposit, it is likely we
are talking about the financial institution.

In this section, we present three data sets designed to evaluate how well word-sense
disambiguation occurs as a byproduct of composition. We begin by describing the first
data set, based around noun-intransitive verb phrases, in Section 5.1. In Section 5.2,
we present a data set based around short transitive-verb phrases (a transitive verb
with subject and object). In Section 5.3, we discuss a new data set, based around short
transitive-verb phrases where the subject and object are qualified by adjectives. We leave
discussion of these results for Section 6.

5.1 First Experiment

This first experiment, originally presented in Mitchell and Lapata (2008), evaluates the degree to which an ambiguous intransitive verb (e.g., draws) is disambiguated by combination with its subject.

Data set Description. The data set4 comprises 120 pairs of intransitive sentences, each of
the form NOUN VERB. These sentence pairs are generated according to the following

4 Available at http://homepages.inf.ed.ac.uk/s0453356/results.


procedure, which will be the basis for the construction of the other data sets discussed subsequently:

1. A number of ambiguous intransitive verbs (15, in this case) are selected from frequently occurring verbs in the corpus.

2. For each verb V, two mutually exclusive synonyms V1 and V2 of the verb are produced, and each is paired with the original verb separately (for a total of 30 verb pairs). These are generated by taking maximally distant pairs of synonyms of the verb on WordNet, but any method could be used here.

3. For each pair of verb pairs (V, V1) and (V, V2), two frequently occurring nouns N1 and N2 are picked from the corpus, one for each synonym of V. For example, if V is glow and the synonyms V1 and V2 are beam and burn, we might choose face as N1, because a face glowing and a face beaming mean roughly the same thing; and fire as N2, because a fire glowing and a fire burning mean roughly the same thing.

4. By combining the nouns with the verb pairs, we form two high similarity triplets (V, V1, N1) and (V, V2, N2), and two low similarity triplets (V, V1, N2) and (V, V2, N1).

The last two steps can be repeated to form more than four triplets per pair of verb pairs. In Mitchell and Lapata (2008), eight triplets are generated for each pair of verb pairs, obtaining a total of 120 triplets from the 15 original verbs. Each triplet, along with its HIGH or LOW classification (based on the choice of noun for the verb pair), is an entry in the data set, and can be read as a pair of sentences: (V, Vi, N) translates into the intransitive sentences N V and N Vi.
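As a minimal illustration of how an entry is read as a sentence pair (the field names below are hypothetical, not those of the distributed file):

```python
# One data set entry: the (V, Vi, N) triplet plus its HIGH/LOW tag.
entry = {"verb": "glow", "synonym": "beam", "noun": "face", "tag": "HIGH"}

# The triplet is read as a pair of intransitive sentences: "N V" and "N Vi".
sentence_1 = f"{entry['noun']} {entry['verb']}"     # "face glow"
sentence_2 = f"{entry['noun']} {entry['synonym']}"  # "face beam"
print(sentence_1, "|", sentence_2)
```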

Finally, the data set is presented, without the HIGH/LOW ratings, to human annotators. These annotators are asked to rate the similarity of meaning of the pairs of sentences in each entry on a scale of 1 (low similarity) to 7 (high similarity). The final form of the data set is a set of lines each containing:

• A (V, Vi, N) triplet.

• A HIGH or LOW label for that triplet.

• An annotator identifier and the annotator's score for that triplet.

Sample sentences from this data set are shown in Table 3.

Table 3
Example entries from the intransitive data set without annotator score, first experiment.

Sentence 1          Sentence 2

butler bow          butler submit
head bow            head stoop
company bow         company submit
government bow      government stoop


Evaluation Methodology. This data set is used to compare various compositional distributional semantic models, according to the following procedure:

1. For each entry in the data set, the representation of the two sentences N V and N Vi formed from the entry triple (V, Vi, N), which we will name S1 and S2, is constructed by the model.

2. The similarity of these sentences according to the model's semantic distance measure constitutes the model score for the entry.

3. The rank correlation of entry model scores against entry annotator scores is calculated using Spearman's rank correlation coefficient ρ.

The Spearman ρ scores are values between −1 (perfect negative correlation) and 1 (perfect correlation). The higher the ρ score, the better the compositional model can be said to produce sentence representations that match human understanding of sentence meaning, when it comes to comparing the meaning of sentences. As such, we will rank the models evaluated using the task by decreasing order of ρ score.

One of the principal appealing features of Spearman's ρ is that the coefficient is rank-based: It does not require models' semantic similarity metrics to be normalized for a comparison to be made. One consequence is that a model providing excellent rank correlation with human scores, but producing model scores on a small scale (e.g., values between 0.5 and 0.6), will obtain a higher ρ score than a model producing model scores on a larger scale (e.g., between 0 and 1) but with less perfect rank correlation. If we wished to then use the former model in a task requiring some greater degree of numerical separation (let us say 0 for non-similar sentences and 1 for completely similar sentences), we could simply renormalize the model scores to fit the scale. By eschewing score normalization as an evaluation factor, we minimize the risk of erroneously ranking one model over another.
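A minimal sketch of this scoring step, assuming per-entry model scores and annotator scores are already collected in two aligned lists (the numbers below are placeholders):

```python
from scipy.stats import spearmanr

# One model score and one annotator score per data set entry (toy values).
model_scores     = [0.91, 0.34, 0.55, 0.78, 0.12]
annotator_scores = [6, 2, 3, 7, 1]

# Spearman's rho is rank-based, so the model scores need no normalization
# before being compared with the human judgments.
rho, p_value = spearmanr(model_scores, annotator_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```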

Finally, in addition to computing the rank alignment coefficient between model scores and annotator scores, Mitchell and Lapata (2008) calculate the mean model scores for entries labeled HIGH, and for entries labeled LOW. This information is reported in their paper as additional means for model comparison. However, for the same reason we considered Spearman's ρ to be a fair means of model comparison—namely, in that it required no model score normalization procedure and thus was less likely to introduce error by adding such a degree of freedom—we consider the HIGH/LOW means to be inadequate grounds for comparison, precisely because they require normalized model scores for comparison to be meaningful. As such, we will not include these mean scores in the presentation of this, or any further experiments in this article.

Models Compared. In this experiment, we compare the best and worst performing models of Mitchell and Lapata (2008) to our own. We begin by building a vector space W for all words in the corpus, using standard distributional semantic model construction procedures (n-word window) with the parameters of Mitchell and Lapata (2008). These parameters are as follows: The basis elements of this vector space are the 2,000 most frequently occurring words in the BNC, excluding common stop words. For our evaluation, the corpus was lemmatized using Carroll's Morpha (Minnen, Carroll, and Pearce 2001), which was applied as a byproduct of our parsing of the BNC with C&C Tools (Curran, Clark, and Bos 2007).


The context of each occurrence of a word in the corpus was defined to be five words
on either side. After the vector for each word w was produced, the basis weights $c^w_i$ associated with context word $b_i$ were normalized by the following ratio of probabilities weighting scheme:

$$c^w_i = \frac{P(b_i \mid w)}{P(w)}$$

Let $\overrightarrow{\textit{verb}}$ and $\overrightarrow{\textit{noun}}$ be the lexical semantic vectors for the verb and the noun, from W. It should be evident that there is no significant difference in using W for nouns in lieu of building a separate vector space N strictly for noun vectors.
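A sketch of this construction, assuming the corpus is already available as tokenized, lemmatized sentences and the basis words have been chosen; the counting and probability estimates are deliberately simplified for illustration:

```python
from collections import Counter, defaultdict

def build_vectors(sentences, basis_words, window=5):
    """Build distributional vectors: co-occurrence counts with the basis
    words in a +/- `window` context, weighted by P(b_i | w) / P(w)."""
    basis = list(basis_words)
    basis_set = set(basis)
    cooc = defaultdict(Counter)   # target word -> context-word counts
    word_freq = Counter()         # corpus word frequencies
    total = 0

    for tokens in sentences:
        word_freq.update(tokens)
        total += len(tokens)
        for i, w in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            for b in context:
                if b in basis_set:
                    cooc[w][b] += 1

    vectors = {}
    for w, counts in cooc.items():
        n_w = sum(counts.values())     # contexts counted for w
        p_w = word_freq[w] / total     # P(w)
        vectors[w] = [(counts[b] / n_w) / p_w for b in basis]
    return vectors

# Toy usage with made-up sentences and basis words.
sents = [["dog", "bark", "loud"], ["cat", "purr", "loud"], ["dog", "chase", "cat"]]
print(build_vectors(sents, basis_words=["loud", "cat", "dog"])["dog"])
```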

As a first baseline, Verb Baseline, we ignored the information provided by the noun
in constructing the sentence representation, effectively comparing the semantic content
of the verbs:

$$\text{Verb Baseline}: \quad \overrightarrow{\textit{noun verb}} = \overrightarrow{\textit{verb}}$$

The models from Mitchell and Lapata (2008) we evaluate here are those which were strictly unsupervised (i.e., no free parameters for composition). These are the additive model Add, wherein

$$\text{Add}: \quad \overrightarrow{\textit{noun verb}} = \overrightarrow{\textit{noun}} + \overrightarrow{\textit{verb}}$$

and the multiplicative model Multiply, wherein

$$\text{Multiply}: \quad \overrightarrow{\textit{noun verb}} = \overrightarrow{\textit{noun}} \odot \overrightarrow{\textit{verb}}$$
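A small sketch of these two composition operations over toy lexical vectors (assuming the vectors from W are available as numpy arrays):

```python
import numpy as np

noun = np.array([0.2, 0.0, 1.3, 0.7])   # toy lexical vector for the noun
verb = np.array([0.5, 0.4, 0.0, 1.1])   # toy lexical vector for the verb

add_sentence      = noun + verb         # Add: vector addition
multiply_sentence = noun * verb         # Multiply: component-wise product
print(add_sentence, multiply_sentence)
```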

Other models with parameters that must be optimized against a held-out section of the data set are presented in Mitchell and Lapata (2008). We omitted them here principally because they do not perform as well as Multiply, but also because the need to optimize free parameters for this and other data sets makes comparison with completely unsupervised models more difficult and less fair.

We also evaluate the Categorical model from Section 4 here, wherein

$$\text{Categorical}: \quad \overrightarrow{\textit{noun verb}} = \overrightarrow{\textit{verb}}_{cat} \odot \overrightarrow{\textit{noun}}$$

where $\overrightarrow{\textit{verb}}_{cat}$ is the compact representation of the relation the verb stands for, computed according to the procedure described in Section 4.3. It should be noted that for the case of intransitive verbs, this composition operation is mathematically equivalent to that of the Multiply model (as component-wise multiplication $\odot$ is a commutative operation), the difference being the learning procedure for the verb vector.
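As a sketch of that difference in learning procedure, assume (as a simplification of the Section 4.3 algorithm) that the compact verb representation is obtained by summing the lexical vectors of the subjects the intransitive verb takes in the corpus; composition then proceeds exactly as in Multiply:

```python
import numpy as np

# Toy lexical vectors of nouns observed as subjects of the verb in the corpus.
observed_subjects = [
    np.array([0.2, 0.0, 1.3, 0.7]),
    np.array([0.1, 0.9, 0.4, 0.0]),
]

# Simplified compact representation of the intransitive verb.
verb_cat = np.sum(observed_subjects, axis=0)

# Categorical composition for the sentence "noun verb".
noun = np.array([0.3, 0.5, 0.0, 1.0])
sentence = verb_cat * noun   # component-wise product, as in Multiply
print(sentence)
```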

For all such vector-based models, the similarity of the two sentences s1 and s2 is

taken to be the cosine similarity of their vectors, defined in Section 2.2:

$$\mathrm{similarity}(s_1, s_2) = \mathrm{cosine}(\overrightarrow{s_1}, \overrightarrow{s_2})$$

In addition to these vector-based methods, we define an additional baseline and
an upper-bound. The additional baseline, Bigram Baseline, is a bigram-based language


model trained on the BNC with SRILM (Stolcke 2002), using the standard language
model settings for computing log-probabilities of bigrams. To determine the semantic similarity of sentences $s_1 = \textit{noun verb}_1$ and $s_2 = \textit{noun verb}_2$, we assumed sentences have mutually conditionally independent properties, and computed the joint probability:

$$\mathrm{similarity}(s_1, s_2) = \log P(s_1 \wedge s_2) = \log\big(P(s_1)\,P(s_2)\big) = \log P(s_1) + \log P(s_2)$$

The reasoning behind this baseline is as follows. The sentence formed by combining the first verb with its arguments is, by the design of this data set, a semantically coherent sentence (e.g., the head bowed and the government bowed both make sense). We therefore expect language models to treat this sort of bigram as having a higher probability than bigrams that are not semantically coherent sentences (and therefore unlikely to be observed in the corpus). A similar (relatively) high probability is to be expected when the sentence formed by taking the second verb in each entry and combining it with the noun yields a sentence similar to the first in meaning (e.g., as would be the case with the head stooped), whereas the probability of a semantically incoherent sentence (e.g., the government stooped) is expected to be low relative to that of the first sentence. By taking the sum of log probabilities of sentences, we compute the log of the product of the probabilities, which we expect to be low when the probability of the second sentence is low, and high when it is high, with the probability of the first sentence acting as a normalizing factor. To summarize, while this bigram-based measure only tracks a tangential aspect of semantic similarity, it is one that can play an artificially important role in experiments with a predetermined structure such as the one described in this section. For this reason, we use this bigram baseline for this experiment and all that follow.
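A toy sketch of the baseline's scoring logic, using a simple count-based bigram model with add-one smoothing rather than the SRILM setup described above:

```python
import math
from collections import Counter

# Toy corpus of tokenized sentences (a stand-in for the BNC).
corpus = [["the", "head", "bowed"], ["the", "head", "stooped"],
          ["the", "government", "bowed"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
vocab_size = len(unigrams)

def log_prob(sentence):
    """Add-one-smoothed bigram log-probability of a tokenized sentence."""
    tokens = ["<s>"] + sentence
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

def similarity(s1, s2):
    # Joint log-probability under the independence assumption above.
    return log_prob(s1) + log_prob(s2)

print(similarity(["the", "head", "bowed"], ["the", "head", "stooped"]))
```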

The upper bound of the data set, UpperBound, was taken to be the inter-annotator agreement: the average of how well each annotator's scores align with the other annotators' scores, using Spearman's ρ.
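A sketch of this upper-bound computation, assuming each annotator's scores over the same ordered list of entries are available (the values below are made up):

```python
from itertools import combinations
from scipy.stats import spearmanr

# Per-annotator scores over the same ordered entries (toy values).
annotators = {
    "a1": [7, 2, 5, 1, 6],
    "a2": [6, 1, 4, 2, 7],
    "a3": [5, 3, 6, 1, 7],
}

# Average the pairwise Spearman correlations between annotators.
rhos = [spearmanr(x, y)[0] for x, y in combinations(annotators.values(), 2)]
upper_bound = sum(rhos) / len(rhos)
print(f"inter-annotator agreement (upper bound): {upper_bound:.2f}")
```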

Results. The results5 of the first experiment are shown in Table 4. As expected from the fact that Multiply and Categorical differ only in how the verb vector is learned, the results of these two models are virtually identical, outperforming both baselines and the Additive model by a significant margin. However, the distance from these models to the upper bound is even greater, demonstrating that there is still a lot of progress to be made.

5.2 Second Experiment

The first experiment was followed by a second similar experiment,
in Mitchell
and Lapata (2010), covering different sorts of composition operations for binary
combinations of syntactic types (adjective-noun, noun-noun, verb-object). Such further
experiments are interesting, but rather than continue down this binary road, we now
turn to the development of our second experiment, involving sentences with larger
syntactic structures, to examine how well various compositional models cope with
more complex syntactic and semantic relations.

5 The results were state of the art when the experiments were run in 2011. The binary composition data sets used here are popular, and there is now a host of new state-of-the-art systems available in the literature. We invite the reader to check references to Mitchell and Lapata (2008) to find the current state of the art.


Table 4
Model correlation coefficients with human judgments, first experiment. pag < 0.05 for each ρ. Model Verb Baseline Bigram Baseline Add Multiply Categorical UpperBound ρ 0.08 0.02 0.04 0.17 0.17 0.40 This second experiment, which we initially presented in Grefenstette and Sadrzadeh (2011a), is an extension of the first in the case of sentences centered around transitive verbs, composed with a subject and an object. The results of the first experi- ment did not demonstrate any difference between the multiplicative model, which takes into account no syntactic information or word ordering, and our syntactically motivated categorical compositional model. By running the same experiment over a new data set, where the relations expressed by the verb have a higher arity than in the first, we hope to demonstrate that added structure leads to better results for our syntax-sensitive model. Data Set Description. The construction procedure for this data set6 is almost exactly as for the first data set, with the following differences: (cid:2) (cid:2) (cid:2) (cid:2) (cid:2) Verbs are transitive instead of intransitive. We arbitrarily took 10 verbs from the most frequent verbs in the BNC, and for each verb, took two maximally distant synonyms (again, using WordNet) to obtain 10 pairs of pairs. For each pair of verb pairs, we selected a set of subject and object nouns to use as context, as opposed to just a subject noun. Each subject-object pair was manually chosen so that one of the verb pairs would have high similarity in the context of that subject and object, and the other would have low similarity. We used these choices to annotate the entries with HIGH and LOW tags. Each combination of a verb pair with a subject-object pair constitutes an entry of our data set, of which there are 200. As a form of quality control, we inserted “gold standard” sentences in the form of identical sentence pairs and rejected annotators who did not score these gold standard 6 The data set, reannotated by Turkers in 2013, is available at http://www.cs.ox.ac.uk/activities/compdistmeaning/GS2013data.txt. 107 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 1 Table 5 Example entries from the transitive data set without annotator score, second experiment. Sentence 1 Sentence 2 HIGH-LOW Tag man draw sword report draw attention man draw sword report draw attention man attract sword report attract attention man depict sword report depict attention LOW HIGH HIGH LOW sentences with a high score of 6 or 7.7 Lemmatized sentences from sample entries of this data set and our HIGH-LOW tags for them are shown in Table 5. The data set was passed to a group of 50 annotators on Amazon Mechanical Turk, as for the previous data set. The annotators were shown, for each entry, a pair of sentences created by adding the in front of the subject and object nouns and putting the verbs in the past tense,8 and were instructed to score each pair of sentences on the same scale of 1 (not similar in meaning) to 7 (similar in meaning), based on how similar in meaning they believed the sentence pair was. Evaluation Methodology. The methodology for this experiment is exactly that of the previous experiment. 
Models compositionally construct sentence representations, and compare them using a distance metric (all vector-based models once again used cosine similarity). The rank correlation of models scores with annotator scores is calculated using Spearman’s ρ, which is in turn used to rank models. Models Compared. The models compared in this experiment are those of the first experi- ment, with the addition of an extra trigram-based baseline (trained with SRILM, using the addition of log-probability of the sentence as a similarity metric), and a variation on our categorical model, presented subsequently. With W as the distributional semantic space for all words in the corpus, trained using the same parameters as in the first exper- −−−→ object ∈ W as the vectors for subject, verb, and object of a sentence, iment, and respectively, and with verbcat as the compact representation of a transitive verb learned using the algorithm presented in this paper, we have the following compositional methods: −−→ subj, −−→ verb, −−−−−−−−−−−−→ subject verb object = −−−−→ subject + −−→ verb + −−−→ object Add : Multiply : Categorical : Kronecker : −−−−−−−−−−−−→ subject verb object = −−−−−−−−−−−−→ subject verb object = −−−−−−−−−−−−→ subject verb object = −−−−→ subject (cid:6) −−−→ verbcat (cid:13)−−→ verb ⊗ −−→ verb (cid:6) −−−→ object (cid:13)−−−−→ subject ⊗ (cid:14) (cid:6) −−→ verb (cid:13)−−−−→ subject ⊗ (cid:14) −−−→ object (cid:6) (cid:14) −−−→ object 7 During the 2013 reannotation of this data set, we rejected no Turker contributions, as the answers to the gold standard sentence pairs were aligned with our expectations. We attribute this to our adding the requirement that Turkers be based in the US or the UK and have English as the first language. 8 For example, the entry draw table eye depict would yield the sentences The table drew the eye and The table depicted the eye. 108 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Grefenstette and Sadrzadeh Categorical Compositional Distributional Model of Meaning All of these models have been explained earlier, with the exception of Kronecker. We first presented this new addition in Grefenstette and Sadrzadeh (2011b), where we observed that the compact representation of a verb in the DisCoCat framework, under the assumptions presented in Section 4, can be viewed as dim(N) × dim(N) matrices in N ⊗ N. We considered alternatives to the algorithm presented earlier for the construction of such matrices, and were surprised by the results of the Kronecker method, wherein we replaced the matrix learned by our algorithm with the Kronecker product of the lexical semantic vectors for the verb. Further analysis performed since the publication of that paper can help to understand why this method might work. 
Using −→ b , the following property, for any vectors −→ d in a vector space −→ c , −→ a , −→ a ⊗ ( −→ b ) (cid:6) ( −→ c ⊗ −→ d ) = ( −→ a (cid:6) −→ c ) ⊗ ( −→ b (cid:6) −→ d ) we can see that the Kronecker model’s composition operation can be expressed as Kronecker : −−−−−−−−−−−−→ subject verb object = (cid:13)−−→ verb (cid:6) −−−−→ subject (cid:14) (cid:13)−−→ verb (cid:6) −−−→ object (cid:14) ⊗ Bearing in mind that the cosine measure we are using as a similarity metric is equivalent to the inner product of two vectors normalized by the product of their length −→ a , −→ b ) = cosine( −→ (cid:9)−→ b (cid:10) a | −→ b (cid:16) a (cid:16) × (cid:16) (cid:16)−→ and the following property of the inner product of kronecker products (cid:9)−→ a ⊗ −→ b |−→ c ⊗ −→ d (cid:10) = (cid:9)−→ a |−→ c (cid:10) × (cid:9) −→ b | −→ d (cid:10) we finally observe that comparing two sentences under Kronecker corresponds to the following computation: cosine( = α = α ⊗ −−−−−−−−−−−−→ subject verb1 object, −−→ verb1 −−−−→ subject −−−−→ subject (cid:27)(cid:13)−−→ verb1 (cid:27)(cid:13)−−→ verb1 (cid:27)(cid:13)−−→ verb1 −−−−−−−−−−−−→ subject verb2 object) (cid:14) −−−→ object −−−→ object −−−−→ subject (cid:13)−−−−→ subject ⊗ (cid:13)−−→ ⊗ verb1 (cid:13)−−→ verb2 (cid:6) (cid:14) (cid:6) (cid:6) (cid:6) (cid:6) (cid:14) | = α | (cid:14) (cid:14) ⊗ (cid:13)−−→ verb2 (cid:13)−−→ (cid:6) verb2 (cid:14)(cid:28) (cid:27)(cid:13)−−→ verb1 | (cid:14) −−→ verb2 −−−−→ subject −−−→ object (cid:6) (cid:14) (cid:14) (cid:13)−−−−→ subject ⊗ (cid:13)−−→ ⊗ verb2 (cid:13)−−→ verb2 (cid:6) (cid:6) | −−−→ object −−−→ object −−−→ object (cid:6) (cid:14)(cid:28) (cid:14)(cid:28) (cid:14)(cid:28) where α is the normalization factor α = −−−−−−−−−−−−→ subject verb1 object(cid:16) × (cid:16) (cid:16) −−−−−−−−−−−−→ subject verb2 object(cid:16) 1 We note here that the Kronecker is effectively a parallel application of the Multiply model, combining the subject and verb, and object and verb separately. Within the context of this task, the comparison of two sentences boils down to how well each verb combines with the subject multiplied by how well it combines with the object, with the 109 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 1 joint product forming the overall model score. In short, it more or less constitutes the introduction of some mild syntactic sensitivity into the multiplicative model of Mitchell and Lapata (2008), although it remains to be seen how well this scales with syntactic complexity (e.g., extended to ditransitive verbs, or other relations of higher arity). The UpperBound here is, again, the inter-annotator agreement, calculated by com- puting the pairwise alignment of each annotator with every other, and averaging the resulting rank correlation coefficients. Results. The results of the second experiment are shown in Table 6. The baseline scores fall in the 0.14–0.16 range, with best results being obtained for the Bigram Baseline and Trigram Baseline, the difference between both models not being statistically significant. The additive model Add performs on par with the first experiment. The multiplicative model Multiply and our Categorical model perform on par with the version used for the intransitive experiment, but obtain a score comparable to the baselines. 
The best performing model here is our newly introduced Kronecker model, which leads the pack by a steady margin, with a score of 0.26. The inter-annotator agreement UpperBound is much higher in this experiment than in the previous experiment, indicating even more room for improvement. We therefore learn here that the Categorical model continues to operate on par with the multiplicative model, and that the Kronecker model provides the highest results, while being as simple in its construction as the multiplicative model, requiring only the learning of lexical vectors for the verb. 5.3 Third Experiment The third and final experiment we present is a modified version of the second data set presented earlier, where the nouns in each entry are under the scope of adjectives ap- plied to them.The intuition behind the data sets presented in Section 5.1 and Section 5.2 was that ambiguous verbs are disambiguated through composition with nouns. These nouns themselves may also be ambiguous, and a good compositional model will be capable of separating the noise produced by other meanings through its compositional mechanism to produce unambiguous phrase representations. The intuition behind this data set is similar, in that adjectives provide both additional information for disam- biguation of the nouns they apply to, but also additional semantic noise. Therefore, a Table 6 Model correlation coefficients with human judgments, second experiment. p < 0.05 for each ρ. Model Verb Baseline Bigram Baseline Trigram Baseline Add Multiply Categorical Kronecker UpperBound ρ 0.13 0.16 0.15 0.10 0.16 0.16 0.26 0.62 110 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Grefenstette and Sadrzadeh Categorical Compositional Distributional Model of Meaning Table 7 Example entries from the adjective-transitive data set without annotator score, third experiment. Sentence 1 Sentence 2 statistical table show good result statistical table show good result statistical table express good result statistical table depict good result good model will also be able to separate the useful information of the adjective from its semantic noise when composing it with its argument, in addition to doing this when composing the noun phrases with the verb. Data Set Description. The construction procedure for this data set was to take the data set from Section 5.2, and, for each entry, add a pair of adjectives from those most frequently occurring in the corpus. The first adjective from the pair is applied to the first noun (subject) of the entry when forming the sentences, and the second adjective is applied to the second noun (object). For each entry, we chose adjectives which best preserved the meaning of the phrase constructed by combining the first verb with its subject and object. This new data set9 was then annotated again by a group of 50 annotators using Amazon’s Mechanical Turk service. The annotators were shown, for each entry, a pair of sentences created by adding the in front of the subject and object noun phrases and putting the verbs in the past tense,10 and asked to give each sentence pair a meaning similarity score between 1 and 7, as for the previous data sets. We applied the same quality control mechanism as in the second experiment. Some 94 users returned anno- tations, of which we kept 50 according to our gold standard tests. 
We are unaware of whether or not it was applied in the production of the first data set, but believe that this can only lead to the production of higher quality annotations. Sample sentences from this data set are shown in Table 7. Evaluation Methodology. The evaluation methodology in this experiment is identical to that of the previous experiments. Models Compared. In this experiment, in lieu of simply comparing compositional models “across the board” (e.g., using the multiplicative model for both adjective-noun compo- sition and verb-argument composition), we experimented with different combinations of models. This evaluation procedure was chosen because we believe that adjective-noun composition need not necessarily be the same kind of compositional process as subject- verb-object composition, and also because different models may latch onto different semantic features during the compositional process, and it would be interesting to see what model mixtures work well together. Naturally, this is not a viable approach to selecting composition operations in general, as we will not have the luxury of try- ing every combination of composition operations for every combination of syntactic types, but it is worthwhile performing these tests to at least verify the hypothesis that 9 Available at http://www.cs.ox.ac.uk/activities/compdistmeaning/GS2012data.txt. 10 For example, the entry statistical table show express good result would yield the sentences The statistical table showed the good result and The statistical table expressed the good result. 111 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 1 operation-specific composition operations (or parameters) is a good thing. Notably, this idea has been explored very recently, within the context of deep learning networks, by Chen et al. (2013). Each mixed model has two components: a verb-argument composition model and a adjective-noun composition model. For verb-argument composition, we used the three best models from the previous experiment, namely, Multiply, Categorical, and Kronecker. For adjective composition we used three different methods of adjective- −−→ noun being the vectors for an adjective and a noun noun composition. With in our distributional lexical semantic space W (built using the same procedure as the −−→ adjcat being the compact representation in the Categorical previous experiments) and model, built according to the algorithm from Section 4.3, we have the following models: −−−−−→ adjective and AdjMult : Categorical : −−−−−−−−−→ adjective noun = −−−−−−−−−→ adjective noun = −−−−−→ adjective (cid:6) −−→ noun −−−−−−→ adjectivecat (cid:6) −−→ noun The third model, AdjNoun, is a holistic (non-compositional) model, wherein the adjective-noun compound was treated as a single token, as its semantic vector −−−−−−−−−−−−→ ∈ W was learned from the corpus using the same learning procedure (adjective noun)lex applied to construct other vectors in W. Hence, the model defines adjective-noun “composition” as: AdjNoun : −−−−−−−−−→ adjective noun = −−−−−−−−−−−−→ (adjective noun)lex In addition to these models, we also evaluated three baselines: Verb Baseline, Bigram Baseline, and Trigram Baseline. As in previous experiments, the verb baseline uses the verb vector as sentence vector, ignoring the information provided by other words. 
The bigram and trigram baselines are calculated from the same language model as used in the second experiment. In both cases, the log-probability of each sentence is calculated using SRLIM, and the sum of log-probabilities of two sentences is used as a similarity measure. Finally, for comparison, we also considered the full additive model: Additive : −−−−−→ sentence = −−−−−→ adjective1 + −−−→ noun1 + −−→ verb + −−−−−→ adjective2 + −−−→ noun2 Results. The results for the third experiment are shown in Table 8. The best performing adjective-noun combination operations for each verb-argument combination operation are shown in bold. Going through the combined models, we notice that in most cases the results stay the same whether the adjective-noun combination method is AdjMult or CategoricalAdj. This is because, as was shown in the first experiment, composition of a unary-relation such as an adjective or intransitive verb with its sole argument, under the categorical model with reduced representations, is mathematically equivalent to the multiplicative model. The sole difference is the way the adjective or intransitive verb vector is constructed. We note, however, that with Categorical as a verb-argument composition method, the CategoricalAdj outperforms AdjMult by a non-negligible margin (0.19 vs. 0.14), indicating that the difference in learning procedure can lead to different results depending on what other models it is combined with. 112 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Grefenstette and Sadrzadeh Categorical Compositional Distributional Model of Meaning Table 8 Model correlation coefficients with human judgments, third experiment. p < 0.05 for each ρ. Model Verb Baseline Bigram Baseline Trigram Baseline Additive Multiplicative AdjMult AdjNoun CategoricalAdj Categorical AdjMult AdjNoun CategoricalAdj Kronecker AdjMult AdjNoun CategoricalAdj Upperbound ρ 0.20 0.14 0.16 0.10 0.20 0.05 0.20 0.14 0.16 0.19 0.26 0.17 0.27 0.48 Overall, the best results are obtained for AdjMult+Kronecker (ρ = 0.26) and CategoricalAdj+Kronecker (ρ = 0.27). Combinations of the adjective composition methods with other composition methods at best matches the best-performing baseline, Verb Baseline. In all cases, the holistic model AdjNoun provides the worst results. 6. Discussion In this article, we presented various approaches to compositionality in distributional semantics. We discussed what mathematical operations could underlie vector combina- tion, and several different ways of including syntactic information into the combinatory process. We reviewed an existing compositional framework that leverages the ability to communicate information across mathematical structures provided by category theory in order to define a general way of linking syntactic structure to syntax-sensitive com- position operations. We presented concrete ways to apply this framework to linguistic tasks and evaluate it, and developed algorithms to construct semantic representations for words and relations within this framework. In this section, we first briefly comment upon the combined results of all three experiments, and then conclude by discussing what aspects of compositionality require further attention, and how experiment design should adapt towards this goal. Results Commentary. 
We evaluated this framework against other unsupervised compo- sitional distributional models, using non-compositional models and n-gram language models as baselines, within the context of three experiments. These experiments show that the concrete categorical model developed here, and the Kronecker-based variant 113 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 1 presented alongside it, outperform all other models in each experiment save the first, where they perform on par with what was the leading model at the time this experiment was performed (namely, 2011). As the experiments involved progressively more syntac- tically complicated sentences, the increased reliability of our categorical approaches rel- ative to competing models as sentence complexity rises seems to indicate that both the categorical and Kronecker model successfully leverage the added information provided by additional terms and syntactic structures. The third experiment also served to show that using different combinations of composition operations, depending on the syntactic type of the terms being combined, can yield better results, and that some models combine better than others. Notably, the adjective-noun combination models AdjMult and CategoricalAdj, despite their mathematical similarity, produce noticeably different results when combined with the categorical verb-argument composition operation, while they perform equally with most other verb-argument composition operations. We can conclude that different mod- els combine different semantic aspects more prominently than others, and that through combination we can find better performance by assuming that different kinds of compo- sition play on different semantic properties. For example, predicates such as intersective adjectives add information to their argument (a red ball is a ball that is also red). This raises the question of how to design models of composition that systematically select which operations will match the semantic aspects of the words being combined based on their syntactic type. This is an open question, which we believe warrants further investigation. Overall, we observe two advantages of these models over those presented in the early part of this article, and those evaluated in the later part. First, we have shown that a linguistically motivated and mathematically sound framework can be imple- mented and learned. Second, we showed that simple methods such as the Kronecker model perform well, especially when combined with the multiplicative model (effec- tively a unary instance of the Kronecker model) for sentences with adjectives. Con- sidering the “simple is best” position recently argued for experimentally by Blacoe and Lapata (2012), this is an interesting candidate for dealing with binary relation composition. Future Work. These experiments showcased the ability of various compositional models to produce sentence representations that were less ambiguous than the words that formed them. They also more generally demonstrated that concrete models could be built from the general categorical framework and perform adequately in simple paraphrase detection tasks. However, various aspects of compositionality were not evaluated here. First, syntax sensitivity is not as important in these experiments as it might be in “real” language. 
For instance, whereas models that treat adjectives, nouns, and verbs differently tended to perform better than those that did not, the actual capacity of a model to use the syntactic structure was not tested. For instance, models that ignore word order and syntax altogether were not particularly penalized, as might be done if some sentences were simply the reverse of another sentence they are paired with (where syntax insensitive models would erroneously give such sentence pairs a high similarity score). One way of testing word order while expanding the data set was suggested by Turney (2012), who for every entry in a phrase similarity test such as Mitchell and Lapata’s, discussed herein, creates a new entry where one of the sentences has the order of its words reversed, and is assigned an artificial annotator score of 1 (the minimum). This is an interesting approach, but we would prefer to see such reversals designed into the data set and seen by annotators, for instance, 114 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Grefenstette and Sadrzadeh Categorical Compositional Distributional Model of Meaning in the recent data set presented in Pham et al. (2013). This data set has entries such as guitar played man and man played guitar, where the subjects and objects of verbs are indeed reversed but the two sentences do not necessarily express opposite mean- ings; we will be looking into expanding such data sets to ones where both sentences make sense and have opposite meanings, such as in the pair man bites dog and dog bites man. Furthermore, sentences all have the exact same sentence structure, and therefore words are aligned. Models that might indirectly benefit from this alignment, such as Kronecker, may have been given an advantage due to this, as opposed to models such as Categorical, which are designed to compare sentences of different length or syntactic structure, when resolving the difference between intransitive and transitive sentence spaces as has been done in Grefenstette et al. (2011). Future experiments should aim to do away with this automated alignment, and include sentence comparisons that would penalize models which do not leverage syntactic information. Furthermore, each of these experiments dealt with the comparison of a certain type of sentence (transitive, intransitive). This was convenient for our concrete cate- gorical model, as we defined the sentence space differently based on the valency of the head verb. However, sentences with different sorts of verbs should be able to be directly compared. Not only do several models, both non-syntax sensitive (additive, multiplicative) and syntax-sensitive (Baroni and Zamparelli 2010; Socher et al. 2012), not face this problem, as the product of composition is either naturally in the same space as the vectors being composed or is projected back into it, but the categorical framework the concrete categorical models were derived from does not commit us to different sentence spaces either. The topic of how to solve this problem for the concrete models developed here is beyond the scope of this article, but various options exist, such as the projection of ST into SI, the embedding of SI into ST, the use of a combined sentence space S = SI ⊕ ST, and so on. 
It is clear that future experiments should not give this degree of convenience to the models being evaluated (and their designers), by comparing sentences of different types, lengths, and structure so as to more fairly evaluate the capacity of models to produce meaningful semantic representations in a single space, which can be directly compared regardless of syntactic structure and verb-type. Finally, we mentioned at the beginning of Section 5 that evaluations need to be application-oriented. The class of experiments presented in this article specifically see how well disambiguation occurs as a byproduct of composition. We presented concrete models that would offer ways of combining relations underlying verbs and adjectives, and tested how well these relations disambiguated their arguments and were in turn disambiguated by their arguments. We did not address other aspects of language, such as quantification, logical operations, relative clauses, intensional clauses, embedded sentences, and many other linguistic aspects of spoken and written language that have complex syntactic structure and complicated semantic roles. Addressing these sorts of problems requires determining how negation, conjunction, and other logical relations should be modeled within compositional distributional formalisms, a problem we share with other similar approaches. The development of newer models within the categorical framework we based our work on, and the further development of other approaches mentioned here, must be driven by new experimental goals different from those offered by the task and experiments discussed in this article. Thinking not only about how to address these aspects of language within compositional models, but how to evaluate them, should be a priority for all those interested in further developing this field. 115 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 1 Clark, S., B. Coecke, and M. Sadrzadeh. 2008. A compositional distributional model of meaning. In Proceedings of the Second Quantum Interaction Symposium (QI-2008), Oxford. Clark, S. and J. R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33:493–552. Clark, S. and S. Pulman. 2007. Combining symbolic and distributional models of meaning. In AAAI Spring Symposium on Quantum Interaction, Stanford, USA. Clarke, Daoud. 2009. Context-theoretic semantics for natural language: An overview. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 112–119, Edinburgh. Clarke, Daoud. 2012. A context-theoretic framework for compositionality in distributional semantics. Computational Linguistics, 38(1):41–71. Coecke, B. and ´E. O. Paquette. 2011. Categories for the practising physicist. In B. Coecke, editor, New Structures for Physics. Lecture Notes in Physics, volume 813, pages 173–286, Springer, Berlin Heidelberg. Coecke, B., M. Sadrzadeh, and S. Clark. 2010. Mathematical Foundations for a Compositional Distributional Model of Meaning. Linguistic Analysis, 36:345–384. Curran, James, Stephen Clark, and Johan Bos. 2007. Linguistically motivated large-scale NLP with C&C and boxer. 
In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 33–36, Prague. Curran, J. R. 2004. From distributional to semantic similarity. Ph.D. thesis, School of Informatics, University of Edinburgh. Dean, Jeffrey and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113. Erk, Katrin and Sebastian Pad ´o. 2008. A structured vector space model for word meaning in context. Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP ’08, pages 897–906, Edinburgh. Finkelstein, L., E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. 2001. Placing search in Acknowledgments We thank the EPSRC for funding the research leading to this article via EPSRC grant EP/J002607/1. Furthermore, we would like to thank John Harding and the equational theorem prover “Prover 9”11 for checking the identities occurred as a result of translating CCG’s rules to the language of a pregroup grammar. We would also like to thank Stephen Clark, Stephen Pulman, Bob Coecke, Karl Moritz Hermann, Richard Socher, and Dimitri Kartsaklis for valuable discussions and comments. References Abramsky, S. and B. Coecke. 2004. A categorical semantics of quantum protocols. In Proceedings of the 19th Annual IEEE Symposium on Logic in Computer Science, pages 415–425, Turku, Finland. Alshawi, H., editor. 1992. The Core Language Engine. MIT Press. Baroni, M. and R. Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1,183–1,193, Boston, MA. B´echet, D., A. Foret, and I. Tellier. 2007. Learnability of Pregroup Grammars. Studia Logica, 87(2–3), pages 225–252. Blacoe, W. and M. Lapata. 2012. A comparison of vector-based representations for semantic composition. Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing, Jeju Island. Buszkowski, W. 2001. Lambek grammars based on pregroups. In P. de Groote, G. Morrill, and C. Retor´e, editors, Logical Aspects of Computational Linguistics. Lecture Notes in Computer Science, volume 2099. Springer, Berlin Heidelberg, pages 95–109. Chen, Danqi, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2013. Learning new facts from knowledge bases with neural tensor networks and semantic word vectors. arXiv preprint arXiv:1301.3618. Clark, S. 2013. Type-driven syntax and semantics for composing meaning vectors. In Quantum Physics and Linguistics: A Compositional, Diagrammatic Discourse. Oxford University Press. 11 http://www.cs.unm.edu/∼mccune/mace4/. 116 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Grefenstette and Sadrzadeh Categorical Compositional Distributional Model of Meaning context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414, Hong Kong. Firth, J. R. 1957. A synopsis of linguistic theory 1930-1955. Studies in linguistic analysis. Fowler, T. A. D. and G. Penn. 2010. Accurate context-free parsing with combinatory categorial grammar. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 335–344, Uppsala. 
Frege, G. 1892. Uber sinn und bedeutung. Zeitschrift f ¨ur Philosophie und philosophische Kritik, 100(1):25–50. Genkin, Daniel, Nissim Francez, and Michael Kaminski. 2010. Mildly context-sensitive languages via buffer augmented pregroup grammars. In Z. Manna and D. A. Peled, editors, Time for Verification, pages 144–166, Springer-Verlag, Berlin, Heidelberg. Girard, Jean-Yves. 1987. Linear logic. Theoretical Computer Science, 50:1–102. Grefenstette, E. 2009. Analysing document similarity measures. Master’s thesis, University of Oxford. Grefenstette, E., G. Dinu, Y. Zhang, M. Sadrzadeh, and M. Baroni. 2013. Multi- step regression learning for compositional distributional semantics. In Proceedings of the Tenth International Conference on Computational Semantics, Potsdam. Grefenstette, E. and M. Sadrzadeh. 2011a. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1,394–1,404, Edinburgh. Grefenstette, E. and M. Sadrzadeh. 2011b. Experimenting with transitive verbs in a DisCoCat. In Proceedings of the 2011 EMNLP Workshop on Geometric Models of Natural Language Semantics, pages 62–66, Edinburgh. Grefenstette, E., M. Sadrzadeh, S. Clark, B. Coecke, and S. Pulman. 2011. Concrete sentence spaces for compositional distributional models of meaning. In Proceedings of the Ninth International Conference on Computational Semantics, pages 125–134, Oxford. Grefenstette, G. 1992. Use of syntactic context to produce term association lists for text retrieval. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 89–97, Copenhagen. Grefenstette, G. 1994. Explorations in automatic thesaurus discovery. Guevara, E. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on Geometrical Models of Natural Language Semantics, pages 33–37, Uppsala. Harris, Z. S. 1968. Mathematical structures of language. Wiley. Hockenmaier, Julia. 2003. Data and Models for Statistical Parsing with Combinatory Categorial Grammar. Ph.D. thesis, School of Informatics, University of Edinburgh. Hockenmaier, Julia and Mark Steedman. 2007. CCGBank: A corpus of CCG derivations and dependency structures extracted from the Penn treebank. Computational Linguistics, 33(3):355–396. Joshi, A. K., K. Vijay-Shanker, and D. J. Weir. 1989. The convergence of mildly context-sensitive grammar formalisms. Working paper, University of Pennsylvania, School of Engineering and Applied Science, Dept. of Computer and Information Science. Lambek, J. 1958. The mathematics of sentence structure. The American Mathematical Monthly, 65(3):154–170. Lambek, J. 1999. Type grammar revisited. Logical Aspects of Computational Linguistics, pages 1–27. Lambek, J. 2008. From word to sentence. A computational algebraic approach to grammar. Milan, Polimetrica. Lambek, J., 2010. Compact Monoidal Categories from Linguistics to Physics, pages 451–469. Landauer, T. K. and S. T. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review. Mac Lane, S. 1998. Categories for the Working Mathematician. Springer Verlag. Manning, C. D., P. Raghavan, and H. Sch ¨utze. 2011. Introduction to information retrieval. Cambridge University Press, New York, NY. Minnen, G., J. Carroll, and D. Pearce. 2001. 
Applied morphological processing of English. Natural Language Engineering, 7(03):207–223. Mitchell, J. and M. Lapata. 2008. Vector- based models of semantic composition. In Proceedings of ACL, volume 8, pages 236–244, Columbus, OH. 117 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 1 Mitchell, J. and M. Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Science 34(8):1388–1429. Montague, R. 1974. English as a Formal Language. In R. H. Thomason, editor, Formal Semantics: The Essential Readings. Moortgat, M. 1997. Categorial type logics. In H. van Ditmarsch and L. S. Moss, editors, Handbook of Logic and Language. Elsevier, pages 93–177. Pad ´o, Sebastian and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199. Pham, N., R. Bernardi, Y.-Z. Zhang, and M. Baroni. 2013. Sentence paraphrase detection: When determiners and word order make the difference. In Proceedings of the Towards a Formal Distributional Semantics Workshop at IWCS 2013, pages 21–29, Potsdam. Plate, T. A. 1991. Holographic reduced representations: Convolution algebra for compositional distributed representations. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, pages 30–35, Hyderabad. Preller, A. 2010. Polynomial pregroup grammars parse context sensitive languages. Linguistic Analysis, 36:483–516. Preller, A. and M. Sadrzadeh. 2010. Bell states and negative sentences in the distributed model of meaning. In P. Selinger, B. Coecke, P. Panangaden, editors, Electronic Notes in Theoretical Computer Science, Proceedings of the 6th QPL Workshop on Quantum Physics and Logic. University of Oxford. Selinger, P. 2010. A survey of graphical languages for monoidal categories. New Structures for Physics, pages 275–337. Smolensky, P. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216. Smolensky, P. and G. Legendre. 2006. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar Volume I: Cognitive Architecture. MIT Press. Socher, R., B. Huval, C. D. Manning, and A. Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing, pages 1,201–1,211, Jeju Island. Steedman, M. 2001. The Syntactic Process. The MIT Press. Steedman, M. and J. Baldridge. 2011. Combinatory Categorial Grammar. Wiley-Blackwell. Stolcke, A. 2002. SRILM—An extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing. Turney, P. D. and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188. Turney, Peter D. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533–585. Van Rijsbergen, C. J. 2004. The Geometry of Information Retrieval. Cambridge University Press. Walters, R. F. 1991. Categories and Computer Science. Cambridge University Press. Widdows, D. 2005. Geometry and Meaning. University of Chicago Press. Widdows, D. 2008. Semantic vector products: Some initial investigations. 
In Proceedings of the Second Quantum Interaction Symposium (QI-2008). College Publications. CITESEER, Oxford. Wittgenstein, L. 1953. Philosophical Investigations. Blackwell. Zanzotto, F. M., I. Korkontzelos, F. Fallucchi, and S. Manandhar. 2010. Estimating linear models for compositional distributional semantics. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1,263–1,271, Beijing. 118 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 1 7 1 1 8 0 5 6 4 5 / c o l i _ a _ 0 0 2 0 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3Concrete Models and Empirical Evaluations image