Text Representations for Patent Classification
Eva D'hondt∗
Radboud University Nijmegen
Suzan Verberne∗∗
Radboud University Nijmegen
Cornelis Koster†
Radboud University Nijmegen
Lou Boves‡
Radboud University Nijmegen
With the increasing rate of patent application filings, automated patent classification is of rising economic importance. This article investigates how patent classification can be improved by using different representations of the patent documents. Using the Linguistic Classification System (LCS), we compare the impact of adding statistical phrases (in the form of bigrams) and linguistic phrases (in two different dependency formats) to the standard bag-of-words text representation on a subset of 532,264 English abstracts from the CLEF-IP 2010 corpus. In contrast to previous findings on classification with phrases in the Reuters-21578 data set, for patent classification the addition of phrases results in significant improvements over the unigram baseline. The best results were achieved by combining all four representations, and the second best by combining unigrams and lemmatized bigrams. This article includes extensive analyses of the class models (a.k.a. class profiles) created by the classifiers in the LCS framework, to examine which types of phrases are most informative for patent classification. It appears that bigrams contribute most to improvements in classification accuracy. Similar experiments were performed on subsets of French and German abstracts to investigate the generalizability of these findings.
1. Introduction
Around the world, the patent filing rates in the national patent offices have been in-
creasing year after year, creating an enormous volume of texts, which patent examiners
∗ Center for Language Studies, PO Box 9103, 6500 HD Nijmegen, The Netherlands. E-mail: e.dhondt@let.ru.nl.
∗∗ Center for Language Studies / Institute for Computing and Information Sciences, PO Box 9103, 6500 HD Nijmegen, The Netherlands. E-mail: s.verberne@let.ru.nl.
† Institute for Computing and Information Sciences, PO Box 9010, 6500 HD Nijmegen, The Netherlands. E-mail: kees@cs.ru.nl.
‡ Center for Language Studies, PO Box 9103, 6500 HD Nijmegen, The Netherlands. E-mail: l.boves@let.ru.nl.
Submission received: 19 March 2012; revised submission received: 8 August 2012; accepted for publication: 19 September 2012.
doi:10.1162/COLI_a_00149
© 2013 Association for Computational Linguistics
are struggling to manage (Benzineb and Guyot 2011). To speed up the examination process, a patent application needs to be directed to patent examiners specialized in the subfield(s) of that particular patent as quickly as possible (Smith 2002). This preclassification is done automatically in most patent offices, but substantial additional manual labor is still necessary. Moreover, since 2010, the International Patent Classification1 (IPC) is revised every year to keep track of recent developments in the various subdomains. Such a revision is followed by a reclassification of portions of the existing patent corpus, which is currently done mainly by hand by the national patent offices (Held, Schellner, and Ota 2011). Both preclassification and reclassification could be improved, and a higher consistency of the classifications of the documents in the patent corpus could be obtained, if more reliable and precise automatic text classification algorithms were available (Benzineb and Guyot 2011).
Most approaches to text classification use the bag-of-words (BOW) text representation, which represents each document by the words that occur in it, irrespective of their ordering in the original document. In the last decades much research has gone into expanding this representation with additional information, such as statistical phrases2 (n-grams) or some forms of syntactic or semantic knowledge. Even though (statistical) phrases are more representative units for classes than single words (Caropreso, Matwin, and Sebastiani 2001), they are so sparsely distributed that they have limited impact during the classification process. Therefore, it is not surprising that the best scoring multi-class, multi-label3 classification results for the well-known Reuters-21578 data set have been obtained using a BOW representation (Bekkerman and Allan 2003). But the limited contribution of phrases in addition to the BOW representation does not seem to hold for all classification tasks: Özgür and Güngör (2010) found significant differences in the impact of linguistic phrases between short newswire texts (Reuters-21578), scientific abstracts (NSF), and informal posts in usenet groups (MiniNg): Especially the classification of scientific abstracts could be improved by using phrases as index terms. In a follow-up study, Özgür and Güngör (2012) found that for the three different data sets, different types of linguistic phrases have most impact. The authors conclude that more formal text types benefit from more complex syntactic dependencies.
In this article, we investigate if similar improvements can be found for patent classification and, more specifically, which types of phrases are most effective for this particular task. We investigate the value of phrases for classification by comparing the improvements that can be gained from extending the BOW representation with (1) statistical phrases (in the form of bigrams); (2) linguistic phrases originating from the Stanford parser (see Section 3.2.2); (3) aboutness-based4 linguistic phrases from the AEGIR parser (see Section 3.2.3); and (4) a combination of all of these. In addition, we will investigate the importance of different syntactic relations for the classification task,
1 The IPC is a complex hierarchical classification system comprising sections, classes, subclasses, and groups. For example, the "A42B 1/12" class label, which groups designs for bathing caps, falls under section A "Human necessities," class 42 "Headwear," subclass B "Head coverings," group 1 "Hats; caps; hoods." The latest edition of the IPC contains eight sections, 129 classes, 639 subclasses, 7,352 groups, and 61,847 subgroups. The IPC covers inventions in all technological fields in which inventions can be patented.
2 By a phrase we mean an index unit consisting of two or more words, generated through either syntactic
or statistical methods.
3 Multi-class classification is the problem of classifying instances into more than two classes. Multi-label
signifies that documents in this test set are associated with more than one class, and must be assigned a
set of labels during classification.
4 The notion of aboutness refers to the conceptual content expressed by a dependency triple. For a more detailed description, see Section 3.2.3.
and the extent to which the words in the phrases overlap with the unigrams. We also investigate which syntactic relations capture most information in the opinion of human annotators. Finally, we perform experiments to investigate if our findings are language-dependent. We will then draw some conclusions on what information is most valuable for improving automatic patent classification.
2. Background
2.1 Text Representations in Classification
Lewis (1992) was the first to investigate the use of phrases as index terms for text classification. He found that phrases generally suffer from data sparseness and may actually cause classification performance to deteriorate. These findings were confirmed by Apté, Damerau, and Weiss (1994). With the advent of increasing computational power and bigger data sets, however, the topic has been revisited in the last two decades (Bekkerman and Allan 2003).
In this section we will give an overview of the major findings in previous re-
search on the use of statistical and syntactic phrases for text classification. Except when
mentioned explicitly, all the classification experiments reported here were conducted
using the Reuters-21578 data set, a well-known benchmark of 21,578 short newswire
texts for multi-class classification into 118 categories (a document has an average of
1.24 class labels).
2.1.1 Combining Unigrams with Statistical Phrases. For an excellent overview of the work on using phrases done up to 2002, see Bekkerman and Allan (2003) and Tan, Wang, and Lee (2002).
Because they contain more specific information, one might think that phrases are more powerful features for text classification. There are two ways of using phrases as index terms: either on their own or in combination with unigrams. All experimental results, however, show that using only phrases as index terms leads to a decrease in classification accuracy compared with the BOW baseline (Bekkerman and Allan 2003). Both Mladenic and Grobelnik (1998) and Fürnkranz (1998) showed that classifiers trained on combinations of unigrams and n-grams composed of at most three words performed better than classifiers that only use unigrams; no improvement was obtained when using larger n-grams. Because trigrams are sparser than bigrams, most of the subsequent research has focused on optimizing the combination of unigrams and bigrams using different feature selection techniques.
2.1.2 Feature Selection. Obviously, unigrams and bigrams overlap: Bigrams are pairs of unigrams. Caropreso, Matwin, and Sebastiani (2001) evaluated the relative importance of unigrams and bigrams in a classifier-independent study: Instead of determining the impact of features on the classification scores, they scored all unigrams and bigrams using conventional feature evaluation functions to find the features that are most representative for the document classes. For the Reuters-21578 data set, they found that many bigram features scored higher than unigram features. These (theoretical) findings were not confirmed in subsequent classification experiments, however. When the bigram/unigram ratio for a fixed number of features is changed to favor bigrams, classification performance tends to go down. It appears that the information in the bigrams does not render the unigrams redundant.
Braga, Monard, and Matsubara (2009) used a Multinomial Naive Bayes classifier to investigate classification performance with unigrams and bigrams by comparing multiview classification (the results of two independent classifiers trained with unigram and bigram features are merged) with monoview classification (unigrams and bigrams are combined in a single feature set).5 They found that there is little difference between the output of the mono- and multiview classifiers. In the multiview classifiers, the unigram and bigram classifiers make similar decisions in assigning labels, although the latter generally yielded lower confidence values. As a consequence, in the merge the unigram and bigram classifiers affirm each other's decisions, which does not result in an overall improvement in classification accuracy. The authors suggest combining unigrams only with those bigrams for which it holds that the whole provides more information than the sum of the parts.
Tan, Wang, and Lee (2002) proposed selecting highly representative and meaningful bigrams based on the Mutual Information scores of the words in a bigram compared with the unigram class model. They selected only the top 2% of the bigrams as index terms, and found a significant improvement over their unigram baseline, which was low compared to state-of-the-art results. Bekkerman and Allan (2003) failed to improve over their unigram baseline when using similar selection criteria based on the distributional clustering of unigram models. Crawford, Koprinska, and Patrick (2004) were not able to improve e-mail classification when using the selection criteria proposed by Tan, Wang, and Lee.
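A rough Python sketch of this kind of bigram selection is given below; it uses a plain corpus-level pointwise mutual information score rather than the exact class-conditional criterion of Tan, Wang, and Lee (2002), so it only illustrates the general mechanism of keeping the top 2% of bigrams.

import math
from collections import Counter

sentences = [
    ["the", "fuel", "cell", "powers", "the", "vehicle"],
    ["a", "fuel", "cell", "stack", "is", "described"],
]

uni = Counter(w for s in sentences for w in s)
bi = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
n_uni = sum(uni.values())
n_bi = sum(bi.values())

def pmi(pair):
    """Pointwise mutual information of the two words in a bigram."""
    w1, w2 = pair
    return math.log2((bi[pair] / n_bi) / ((uni[w1] / n_uni) * (uni[w2] / n_uni)))

ranked = sorted(bi, key=pmi, reverse=True)
top = ranked[: max(1, int(0.02 * len(ranked)))]   # keep only the top 2% as index terms
print(top)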
2.1.3 Combining Unigrams with Syntactic Phrases. Lewis (1992) and Apté, Damerau, and Weiss (1994) were the first to investigate the impact of syntactic phrases6 as features for text classification. Dumais et al. (1998) and Scott and Matwin (1999) did not observe a significant improvement in classification on the Reuters-21578 collection when noun phrases obtained with a shallow parser were used instead of unigrams. Moschitti and Basili (2004) found that neither words augmented with word sense information, nor syntactic phrases (acquired through shallow parsing) in combination with unigrams improved over the BOW baseline. Syntactic phrases appear to be even sparser than bigrams. Therefore, it is not surprising that most papers concluded that classifiers using only syntactic phrases perform worse than the baseline, except when the BOW baseline is low for that particular classification task (Mitra et al. 1997; Fürnkranz 1999).
Deep syntactic parsing is a computationally expensive process, but thanks to the increase in computational power it is now possible to use phrases acquired through deep syntactic parsing in classification tasks. Nastase, Sayyad, and Caropreso (2007) used Minipar to generate dependency triples that are combined with lemmatized and unlemmatized unigrams to classify the 10 most frequent classes in the Reuters-21578 data set. Their criterion for selecting triples as index terms is document frequency ≥ 2. The small improvement over the lemmatized unigram baseline was not statistically significant. Özgür and Güngör (2010, 2012) achieve small but significant improvements when combining unigrams with a subset of the dependency types from the Stanford parser on three different data sets, including the Reuters-21578 set. They find that separate pruning levels (based on the term frequency–inverse document frequency [TF-IDF] score of the index units) for the unigrams and syntactic phrases influence
5 The difference between multiview and monoview classification corresponds to what is called late and
early fusion in the pattern recognition literature.
6 The concept "syntactic phrase" can be given several different interpretations, such as noun phrases, verb phrases, predicate structures, dependency triples, and so forth.
classification accuracy. Which dependency relations prove most relevant for a classification task depends greatly on the language use in the different data sets: The informal MiniNG data set (usenet posts) benefits a little from "simple" dependencies such as part, denoting a phrasal verb, for example write down, while classification in the more formal Reuters-21578 (newswire) and NSF (scientific abstracts) data sets is more improved by using dependencies on phrase and clause level (adjectival modifier, compound noun, prepositional attachment; and subject and object, respectively). The highest-ranking features for the NSF data set are compound noun (nn), adjectival modifier (amod), subject (subj), and object (obj), respectively. In addition, they observe that splitting up more generic relator types (such as prep) into different, more specific, subtypes increases the classification accuracy.
2.2 Patent Classification
It is not possible to draw far-reaching conclusions from previous research on patent classification, because there is no tradition of using a "standard" data set and a standard split of patent corpora into a training and test set. In addition, there are differences between the various experiments in task definitions (mono-label versus multi-label classification); the granularity of the classification (depth in the IPC hierarchy); and the choices of (sub)sets of data. Fall and Benzineb (2002) give an overview of the work done in patent classification research up to 2002 and of the commercial patent classification systems available; see Benzineb and Guyot (2011) for a general introduction to patent classification.
Larkey (1999) was the first to present a fully automated patent classification system, but she did not report her overall accuracy results. Larkey (1998) used a combination of weighted words and noun phrases as index terms to classify a subset of the USPTO database, but found no improvement over a BOW baseline. The weights were calculated as follows: Frequency of a word or phrase in a particular section times the manually assigned weight (importance) given to that section. The weights for each word or phrase were then summed across sections. Term selection was based on a threshold for these weights.
Krier and Zaccà (2002) organized a comparative study of various academic and commercial systems for patent classification on a common data set. In this informal benchmark Koster, Seutter, and Beney (2001) achieved the best results, using the Balanced Winnow algorithm with a word-only text representation. Classification is performed for 44 or 549 categories (which correspond to different levels of depth in the then used version of the IPC), with around 78% and 68% precision at 100% recall, respectively.
Fall et al. (2003) introduced the EPO-alpha data set, attempting to create a common benchmark for patent classification. Using only words as index terms, they tested different classification algorithms and found that SVM outperforms Naive Bayes, k-NN, SNoW, and decision-based classifiers. They achieved P@3-scores7 of 73% and 59% on 114 classes and 451 subclasses, respectively. They also found that when using only the first 300 words from the abstract, claims, and description sections, classification accuracy is increased compared with using the complete sections. The same data set was later used by Koster and Seutter (2003), who experimented with a combined
7 Precision at rank 3 (P@3) is the percentage of correct labels among the first three labels assigned by the classifier to a given document.
representation of words and phrases consisting of head-modifier pairs.8 They found that head-modifier pairs could not improve on the BOW baseline: The phrases were too sparse to have much impact on the classification process.
Starting in 2009, the IRF9 has organized CLEF-IP patent classification tracks in an attempt to bridge the gap between academic research and the patent industry. For this purpose the IRF has put a lot of effort into providing very large patent data sets,10 which have enabled academic researchers to train their algorithms on real-life data. In the CLEF-IP 2010 classification track the best results were achieved by Guyot, Benzineb, and Falquet (2010). Using the Balanced Winnow algorithm, they achieved a P@1-score of 83%, while classifying on subclass level. They used a combination of words and statistical phrases (collocations of variable length extracted from the corpus) as index terms and used all available documents (in English, French, and German) in the corpus as training data. In the same competition, Derieux et al. (2010) came second (in terms of P@1). They also used a mixed document representation of both single words and longer phrases, which had been extracted from the corpus by counting word co-occurrences. Verberne, Vogel, and D'hondt (2010) and Beney (2010) experimented with a combined representation of words and syntactic phrases derived from an English and a French syntactic parser, respectively. They both found that adding syntactic phrases to words improves classification accuracy slightly. Beney (2010) remarks that this improvement may be language-dependent. As a follow-up, Koster et al. (2011) investigated the added value of syntactic phrases. They found that attributive phrases, that is, combinations of adjectives or nouns with nouns, were by far the most important syntactic phrases for patent classification. On a subset of the CLEF-IP 2010 corpus11 they also found a small, but significant, improvement when adding dependency triples to words.
3. Experimental Set-up
In this article, we investigate the relative contributions of different types of terms to the performance of patent classification. We use four different types of terms, namely, lemmatized unigrams, lemmatized bigrams (see Section 3.2.1), lemmatized dependency triples obtained with the Stanford parser (see Section 3.2.2), and lemmatized dependency triples obtained with the AEGIR parser (see Section 3.2.3). We will leave term (feature) selection to the preprocessing module of the Linguistic Classification System (LCS), which we used for all experiments (see Section 3.3). We will, however, analyze the relation between unigrams and phrases in the class profiles in some detail (see Sections 4.2 and 4.3).
3.1 Data Selection
We conducted classification experiments on a collection of patent documents obtained
from the CLEF-IP 2010 corpus,12 which is a subset of the larger MAREC patent collection. The corpus contains 2.6 million patent documents, which roughly correspond
8 Head-modifier pairs were derived from the syntactic analysis output of the EP4IR syntactic parser.
9 Information Retrieval Facility, see www.irf.com.
10 The CLEF-IP 2009, CLEF-IP 2010, and CLEF-IP 2011 data sets can be obtained through the IRF. The more
recent data sets subsume the older sets.
11 The same data set as will be used in this article. For a more detailed description, see Section 3.1.
12 This test collection is available through the IRF (http://www.ir-facility.org/collection).
to 1.3 million individual patents, published between 1985 and 2001.13 The documents in the collection are encoded in a customized XML format and may include text in English, French, and German. In addition to the standard sections of a patent document (title, abstract, claims, and description section), the documents also include meta-information on inventor, date of application, assignee, and so forth. Because our focus lies on text representation, we did not include any of the meta-data in our document representations.
The most informative sections of a patent document are generally considered to be the title, the abstract, and the beginning of the description (Benzineb and Guyot 2011). Verberne and D'hondt (2011) showed that using both the description and the abstract gives a small, but significant, improvement in classification results on the CLEF-IP 2011 corpus, compared with classification on abstracts only. The effort involved in parsing the descriptions is considerable, however: Because of the long sentences and the dense word use, a parser will have much more difficulty in processing text from the description section than from the abstracts. The titles of the patent documents also pose a parsing problem: These are generally short noun phrases that contain ambiguous PP-attachments that are impossible to disambiguate without any domain knowledge. This leads to incorrect syntactic analyses and, as a consequence, noisy dependency triple features. Because we are interested in comparing classification results for different text representations, and not in comparing results for different sections, we opted to use only the abstract sections of the patent documents in the current article.
From the corpus, we extracted all files that contain both an abstract in English and at least one IPC class14 at the document level; this means that we did not include the documents where the IPC class is in a separate file from the English abstract. In total, there were 121 different classes in our data set. Most documents have been assigned one to three different IPC classes (on class level). On average, a patent abstract in our data set has 2.12 class labels. Previous cross-validation experiments on the same document set showed very little variation (standard deviation < 0.3%) between the classification accuracies in different training-test splits (Verberne, Vogel, and D'hondt 2010). We therefore decided to use only one training and test set split.15
The final data set contained 532,264 abstracts, divided into two sets: (1) a training
set (425,811 documents) and (2) a test set (106,453 documents). The distribution of the
data over the classes is in accordance with the Pareto Principle: 20% of the classes cover
80% of the data, and 80% of the classes comprise only 20% of the data.
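The split itself is conceptually simple. A minimal Python sketch of an 80/20 split that approximately preserves the label distribution is given below; it is simplified to stratify on each document's first label only, unlike the multi-label-aware perl script distributed with the LCS, and the documents are invented.

import random
from collections import defaultdict

random.seed(42)
# Toy collection: 1,000 documents, each with a list of class labels.
docs = [("doc%d" % i, ["H01"] if i % 5 else ["B60"]) for i in range(1000)]

by_label = defaultdict(list)
for doc_id, labels in docs:
    by_label[labels[0]].append((doc_id, labels))

train, test = [], []
for label, members in by_label.items():
    random.shuffle(members)
    cut = int(0.8 * len(members))
    train.extend(members[:cut])
    test.extend(members[cut:])

print(len(train), len(test))   # roughly 800 / 200, with the label proportions preserved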
3.2 Data Preprocessing
Preprocessing included cleaning up character conversion errors like Expression (1) and removing claim and image references (Expression (2)) and list references
13 Note the difference between a patent and a patent document: A patent is not a physical document itself,
but a name for a group of patent documents that have the same patent ID number.
14 For our classification experiments we use the codes on the class level in the IPC8 classification.
15 The data split was performed using a perl script that randomly shuffles the documents and puts them
into a train set and test set, while ensuring that the class distribution of the examples in the train set
approximates that of the whole corpus. It can be downloaded as part of the LCS distribution.
(Expression (3)) from the original texts. This was done automatically, using the follow-
ing regular expressions (based on Parapatics and Dittenbach 2009):
s/&gt;/>/g    (1)
s/(\([ ]*[0-9][0-9a-z,.; ]*\))//g    (2)
s/(\([ ]*[A-Za-z]\))//g    (3)
We then used a perl script to divide the running text into sentences, by splitting on
end-of-sentence punctuation such as question marks and full stops. In order to mini-
mize incorrect splitting, the perl script was supplied with a list of common English
abbreviations and a list containing abbreviations and acronyms that occur frequently in
technical texts, derived from the Specialist lexicon.16
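A minimal Python sketch of this clean-up and sentence-splitting step is given below; the regular expressions mirror Expressions (1)-(3), and the abbreviation set is a small stand-in for the Specialist-derived lists mentioned above.

import re

ABBREVIATIONS = {"e.g.", "i.e.", "fig.", "approx."}   # stand-in for the real abbreviation lists

def clean(text):
    text = re.sub(r"&gt;", ">", text)                        # Expression (1): character conversion errors
    text = re.sub(r"\([ ]*[0-9][0-9a-z,.; ]*\)", "", text)   # Expression (2): claim/image references
    text = re.sub(r"\([ ]*[A-Za-z]\)", "", text)             # Expression (3): list references
    return text

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences(clean("A hose (1) is attached (a) to the wall. It waters e.g. plants.")))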
3.2.1 Unigrams and Bigrams. The sentences in the abstract documents were converted to single words by splitting on whitespace and removing punctuation. The words were then lemmatized using the AEGIR lexicon. Bigrams were created through a similar procedure. We did not create bigrams that spanned sentence boundaries. This resulted in approximately 60 million unigram and bigram tokens for the present corpus.
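The construction of these terms can be sketched as follows; the lemma table is a toy stand-in for the AEGIR lexicon.

LEMMA = {"modules": "module", "consists": "consist"}   # toy lemma lexicon

def lemmatize(word):
    return LEMMA.get(word.lower(), word.lower())

def terms_for(sentences):
    unigrams, bigrams = [], []
    for sentence in sentences:
        words = [lemmatize(w.strip(".,;:!?()")) for w in sentence.split()]
        words = [w for w in words if w]
        unigrams.extend(words)
        bigrams.extend("%s %s" % (a, b) for a, b in zip(words, words[1:]))   # within one sentence only
    return unigrams, bigrams

print(terms_for(["The system consists of four separate modules."]))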
3.2.2 Stanford. The Stanford parser is a broad-coverage natural language parser that is trained on newswire text, for which it achieves state-of-the-art performance. The parser has not been optimized/retrained for the patent domain.17 In spite of the technical difficulties (Parapatics and Dittenbach 2009) and loss of linguistic accuracy for patent texts reported in Mille and Wanner (2008), most patent processing systems that use linguistic phrases use the Stanford parser because its dependency scheme has a number of properties that are valuable for Text Mining purposes (de Marneffe and Manning 2008). The Stanford parser's collapsed typed dependency model has a set of 55 different syntactic relators to capture semantically contentful relations between words. For example, the sentence The system will consist of four separate modules is analyzed into the following set of dependency triples in the Stanford representation:
det(system-2, The-1)
nsubj(consist-4, system-2)
aux(consist-4, will-3)
root(ROOT-0, consist-4)
num(modules-8, four-6)
amod(modules-8, separate-7)
prep_of(consist-4, modules-8)
The Stanford parser was compiled with a maximum memory heap of 1.2 GB.
Sentences longer than 100 words were automatically skipped. Combined with failed
parses this led to a 1.2% loss of parser output on the complete data set. Parsing the
16 The lexicon can be downloaded at http://lexsrv3.nlm.nih.gov/Specialist/Summary/lexicon.html.
17 For retraining a parser, a substantial amount of annotated data (in the form of syntactically annotated
dependency trees) is needed. Creating such annotations is a very expensive task and beyond the scope of
this article.
Table 1
Impact of lemmatization on the different text types in the training set (80% of the corpus).
           # tokens     # types (raw)   # types (lemmatized)   gain   token/type (lem.)
unigram    48,898,738   160,424         142,396                1.12   343.39
bigram     48,473,756   3,836,212       3,119,422              1.23   15.54
stanford   35,772,003   8,750,839       7,430,397              1.18   4.81
AEGIR      31,004,525   –               5,096,918              –      6.08
entire set of abstracts took 1.5 weeks on a computer cluster consisting of 60 2.4GHz
cores with 4 GB RAM per core. The resulting dependency triples were stripped of the
word indexes and then lemmatized using the AEGIR lexicon.
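This post-processing step can be sketched in Python as follows, assuming the parser output is available as plain strings in the format shown above; the lemma table is again a stand-in for the AEGIR lexicon.

import re

LEMMA = {"modules": "module"}          # toy lemma lexicon
TRIPLE = re.compile(r"(\w+)\(([^-]+)-\d+, ([^-]+)-\d+\)")

def normalize(triple):
    match = TRIPLE.match(triple)
    if not match:
        return None                    # unparsable output is skipped (cf. the 1.2% loss reported above)
    relation, head, dependent = match.groups()
    head = LEMMA.get(head.lower(), head.lower())
    dependent = LEMMA.get(dependent.lower(), dependent.lower())
    return "%s(%s, %s)" % (relation, head, dependent)

print(normalize("prep_of(consist-4, modules-8)"))   # -> prep_of(consist, module)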
3.2.3 AEGIR. AEGIR18 is a dependency parser that was designed specifically for robust parsing of technical texts. It combines a hand-crafted grammar with an extensive word-form lexicon. The parser lexicon was compiled from different technical terminologies, such as the SPECIALIST lexicon and the UMLS.19 The AEGIR parser aims to capture the aboutness of sentences. Rather than outputting extensive linguistic detail on the syntactic structure of the input sentence as the Stanford parser does, AEGIR returns only the bare syntactic–semantic structure of the sentence. During the parsing process, it effectively performs normalization at various levels, such as typography (for example, upper and lower case, spacing), spelling (for example, British and American English, hyphenation), morphology (lemmatization of word forms), and syntax (standardization of the word order and transforming passive structures into active ones).
The AEGIR parser uses only eight syntactic relators and returns fewer unique triples than the Stanford parser. The parser is currently still under development; for this article we used version AEGIR v.1.7.5. The parser was constrained to a time limit of a maximum of three seconds per sentence. This caused a loss of 0.7% of parser output on the complete data set. Parsing the entire set of abstracts took slightly less than a week on the computer cluster described above. The AEGIR parser has several output formats, among which its own dependency format. The example sentence used to illustrate the output of the Stanford parser is analyzed as follows:
[system,SUBJ,consist]
[consist,PREPof,module]
[module,ATTR,separate]
[module,QUANT,four]
3.2.4 Lemmatization. Table 1 shows the impact of lemmatization (using the AEGIR lexicon) on the distribution of terms for the different text representations. Lemmatization
18 AEGIR stands for Accurate English Grammar for Information Retrieval. Using the AGFL compiler (found at
http://www.agfl.cs.ru.nl/) this grammar can be compiled into an operational parser. The grammar is
not freely distributed.
19 The Unified Medical Language System contains a widely-used terminology of the biomedical domain
and can be downloaded at http://www.nlm.nih.gov/research/umls/.
and stemming are standard approaches to decreasing the sparsity of features; stemming is more aggressive than lemmatization. Özgür and Güngör (2009) showed that—when using only words as index terms—stemming (with the Porter Stemmer) appears to improve performance; stemming dependency triples did not improve performance, however.
We opted to use a less aggressive form of generalization: lemmatizing the word forms. We found that the bigrams gain20 most by lemmatizing the word forms, resulting in a higher token/type ratio. From Table 1 it can be seen that there are fewer triple tokens than bigram tokens: Whereas all the (high-frequency) function words are kept in the bigram representations, both dependency formats discard some function words in their parser output. For example, the AEGIR parser does not create triples for auxiliary verbs, and in both dependency formats, the prepositions become part of the relator. As a consequence, the parsers output fewer but more variable tokens, which results in lower token/type ratios and a lower impact of lemmatization.
3.3 Classification Experiments
The classification experiments were carried out within the framework of the LCS (Koster, Seutter, and Beney 2003). The LCS has been developed for the purpose of comparing different text representations. Currently, three classifier algorithms are available: Naive Bayes, Balanced Winnow (Dagan, Karov, and Roth 1997), and SVM-light (Joachims 1999). Verberne, Vogel, and D'hondt (2010) found that Balanced Winnow and SVM-light yield comparable classification accuracy scores for patent texts on a similar data set, but that Balanced Winnow is much faster than SVM-light for classification problems with a large number of classes. The Naive Bayes classifier yielded a lower accuracy. We therefore only used the Balanced Winnow algorithm for our classification experiments, which were run with the following LCS configuration, based on tuning experiments on the same data by Koster et al. (2011):
• Global term selection (GTS): Document frequency minimum is 2, term frequency minimum is 3. Although initial term selection is necessary when dealing with such a large corpus, we deliberately aimed at keeping as many of the sparse phrasal terms as possible.
• Local term selection (LTS): Simple Chi Square (Galavotti, Sebastiani, and Simi 2000). We used the LCS option to automatically select the most representative terms for every class, with a hard maximum of 10,000 terms per class.21
• After LTS the selected terms of all classes are aggregated into one combined term vocabulary, which is used as the starting point for training the individual classes (see Table 3).
20 By "gain" we mean the decrease in number of types for the lemmatized forms compared to the non-lemmatized forms, which will result in higher corresponding token/type ratios.
21 Increasing the cut-off to 100,000 terms resulted in a small increase in accuracy (F1 values) for the combined representations, mostly for the larger classes. Because the patent domain has a large lexical variety, a large number of low-frequency terms in the tail of the term distribution can have a large impact on the accuracy scores. Because we are more interested in the relative gains between different text representations and the corresponding top terms in the class profiles than in achieving maximum classification scores, we opted to use only 10,000 terms for efficiency reasons.
Table 2
Impact of global term selection (GTS) criteria on the different text types in the training set (80%
of the corpus).
           total # of terms   # of terms selected in GTS   % of terms removed in GTS
unigram    142,396            58,423 (see footnote 22)     58.97
bigram     3,119,422          1,115,170                    64.25
stanford   7,430,397          1,618,478                    78.22
AEGIR      5,096,918          1,312,715                    74.24
• Term strength calculation: LTC algorithm (Salton and Buckley 1988), which is an extension of the TF–IDF measure.
• Training method: Ensemble learning based on one-versus-rest binary classifiers.
• Winnow configuration: We performed tuning experiments for the Winnow parameters on a development set of around 100,000 documents. We arrived at using the same settings as Koster et al. (2011), namely, α = 1.02, β = 0.98, θ+ = 2.0, θ− = 0.5, with a maximum of 10 training iterations.
• For each document the LCS returns a ranked list of all possible labels and the attendant confidence scores. If the score assigned is higher than a predetermined threshold, the document is assigned that category. The Winnow algorithm has a default (natural) threshold equal to one. We configured the LCS to return a minimum of one label (with the highest score, even if it is lower than the threshold) and a maximum of four labels for each document.
• The classification quality was determined by calculating the Precision, Recall, and F1 measures per document/class combination (see, e.g., Koster, Seutter, and Beney 2003), on the document level (micro-averaged scores).
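The last two points can be illustrated with a small Python sketch: label assignment with the natural threshold of one, at least one and at most four labels per document, and micro-averaged Precision, Recall, and F1 over document/class pairs. The confidence scores and gold labels below are invented.

def assign_labels(scores, threshold=1.0, max_labels=4):
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    chosen = [label for label, score in ranked[:max_labels] if score >= threshold]
    return chosen or [ranked[0][0]]            # always return at least the highest-scoring label

def micro_prf(gold, predicted):
    tp = sum(len(set(g) & set(p)) for g, p in zip(gold, predicted))
    precision = tp / sum(len(p) for p in predicted)
    recall = tp / sum(len(g) for g in gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = [["H01", "H04"], ["B60"]]
predicted = [assign_labels({"H01": 2.3, "H04": 0.7, "B60": 0.2}),
             assign_labels({"B60": 1.4, "H01": 1.1})]
print(predicted, micro_prf(gold, predicted))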
Table 2 shows the impact of our global term selection criteria for the different text representations. This first feature reduction step is category-independent: The features are discarded on the basis of the term and document frequencies over the corpus, disregarding their distributions for the specific categories. We can see that the token/type ratio of Table 1 is mirrored in this table: The sparsest syntactic phrases lose most terms. Although the Stanford parser output is the sparsest text representation, it has the largest pool of terms to select from at the end of the GTS process.
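The two selection stages can be sketched in Python as follows. The simple chi-square score follows Galavotti, Sebastiani, and Simi (2000); the toy documents and the relaxed thresholds in the usage example are invented, so the sketch only illustrates the mechanism, not the exact LCS implementation.

from collections import Counter

MAX_TERMS = 10_000

def global_term_selection(doc_terms, min_df=2, min_tf=3):
    """GTS: keep terms with document frequency >= min_df and term frequency >= min_tf."""
    df = Counter(t for terms in doc_terms for t in set(terms))
    tf = Counter(t for terms in doc_terms for t in terms)
    return {t for t in tf if df[t] >= min_df and tf[t] >= min_tf}

def simple_chi_square(term, cls, docs):
    """s-chi2(t, c) = P(t, c) * P(not t, not c) - P(t, not c) * P(not t, c)."""
    n = len(docs)
    a = sum(1 for terms, labels in docs if term in terms and cls in labels) / n
    b = sum(1 for terms, labels in docs if term in terms and cls not in labels) / n
    c = sum(1 for terms, labels in docs if term not in terms and cls in labels) / n
    d = sum(1 for terms, labels in docs if term not in terms and cls not in labels) / n
    return a * d - b * c

def local_term_selection(cls, docs, vocabulary):
    """LTS: rank the vocabulary per class and keep at most MAX_TERMS terms."""
    ranked = sorted(vocabulary, key=lambda t: simple_chi_square(t, cls, docs), reverse=True)
    return ranked[:MAX_TERMS]

docs = [({"fuel", "cell"}, {"H01"}), ({"drive", "shaft"}, {"B60"}), ({"fuel", "stack"}, {"H01"})]
vocabulary = global_term_selection([terms for terms, _ in docs], min_df=1, min_tf=1)
print(local_term_selection("H01", docs, vocabulary))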
The impact of the second feature reduction phase is shown in Table 3. During local
term selection, the LCS finds the most representative terms for each class by selecting
the terms whose distributions in the sets of positive and negative training examples
for that class are maximally different from the general term distribution. We can see
that in the combined runs only around 50% of the selectable unigrams (after GTS) are
22 For the BOW baseline, the GTS criteria resulted in a term set that was too small to serve as a starting point for the local term selection process for the individual classes. In such cases, the LCS has a back-off mechanism that automatically (re)selects terms that were initially discarded during GTS. In other words, the baseline classifier used terms that do not comply with the criteria in the GTS as described in the text. In the combination runs, enough terms remained after GTS and no unigrams or phrases that did not match the GTS criteria were selected.
Table 3
Impact of local term selection (LTS) criteria in the training set (80% of the corpus).
                               term type   # of terms after GTS   # of terms after LTS
baseline                       uni         58,423                 69,476
unigrams + bigrams             uni         58,423                 23,753
                               bi          1,115,170              300,826
unigrams + Stanford triples    uni         58,423                 26,630
                               stanford    1,618,478              424,204
unigrams + AEGIR triples       uni         58,423                 29,348
                               AEGIR       1,312,715              409,851
Table 4
Classification results on CLEF-IP 2010 English abstracts, with ranges for a 95% confidence interval. Bold figures indicate the best results obtained with the five classifiers. (P: Precision; R: Recall; F1: F1-score.)
                               P                R                F1
weighted random guessing       6.09% ± 0.14     6.04% ± 0.14     6.06% ± 0.14
unigrams                       76.27% ± 0.26    66.13% ± 0.28    70.84% ± 0.27
unigrams + bigrams             79.00% ± 0.24    70.19% ± 0.27    74.34% ± 0.26
unigrams + Stanford triples    78.35% ± 0.25    69.57% ± 0.28    73.70% ± 0.26
unigrams + AEGIR triples       78.51% ± 0.25    69.18% ± 0.28    73.55% ± 0.26
all representations            79.51% ± 0.24    71.11% ± 0.27    75.08% ± 0.26
selected as features during LTS. This means that the phrases replace at least a part of the
information contained in the possible unigrams.
4. Results and Discussion
4.1 Classification Accuracy
Table 4 shows the micro-averages of Precision, Recall, and F1 for five classification experiments with different document representations. To give an idea of the complexity of the task we have included a random guessing baseline in the first row.23 We found that extending a unigram representation with statistical and/or linguistic phrases gives a significant improvement in classification accuracy over the unigram baseline. The best-performing classifier is the one that combines all four text representations. When adding only one type of phrase to unigrams, the unigrams + bigrams combination is significantly better than the combinations with syntactic phrases. Combining all four representations boosts recall, but has less impact on precision.
23 The script used to calculate the baseline can be downloaded at http://lands.let.ru.nl/~dhondt/.
We used a weighted randomization that takes the category label distributions and label frequency
distributions into account.
Table 5
Penetration of the bigrams and triples in the B60 class profiles (in % of terms at given rank).
                        rnk10   rnk20   rnk50   rnk100   rnk1000
bigrams                 3.0     4.0     48.0    45.0     70.5
stanford                0.0     1.0     24.0    26.0     48.0
AEGIR                   0.0     0.5     20.0    25.0     44.9
all representations
  bigrams               2.0     2.0     34.0    36.0     43.2
  stanford              0.0     0.0     4.0     6.0      13.0
  AEGIR                 0.0     0.0     2.0     4.0      18.0
The results are similar to Özgür and Güngör's (2012) findings for scientific abstracts: Adding phrases to unigrams can significantly improve classification. The text in the patent corpus is vastly different from the newswire text in the Reuters corpus. Like scientific abstracts, patents are full of jargon and terminology, often expressed in multi-word units, which might favor phrasal representations. Moreover, the innovative concepts in a patent are sometimes described in generalized terms combined with some specifier (to ensure larger legal scope). For example, a hose might be referred to as a watering device. The term hose can be captured with a unigram representation, but the multi-word expression cannot. The difference with the results on the Reuters-21578 data set (discussed in Section 2.1.1), however, may not be completely due to genre differences: Bekkerman and Allan (2003) remark that the unigram baseline for the Reuters-21578 task is difficult to improve upon, because in that data set a few keywords are enough to distinguish between the categories.
4.2 Unigram versus Phrases
In this section we investigate whether adding phrases suppresses, complements, or changes unigram selection. To examine the impact of phrases in the classification process, we analyzed the class profiles24 of two large classes (H04 – Electric Communication Technique; and H01 – Basic Electric Elements) that show significant improvements in both Precision and Recall25 for the bigram classifier compared with the unigram baseline. We look at (1) the overlap of the single words in the class profiles of the unigram and combined representations; and (2) the overlap of the single words and the words that make up the phrases (hereafter referred to as parts) within the class profile of one text representation.
4.2.1 Overlap of Unigrams. The class profiles in the baseline unigram classifier contained far fewer terms (< 20%) than the profiles in the classifiers that combine unigrams and phrases. This could be expected from the data in Tables 2 and 3.
Unigrams are the highest ranked26 features in the combined representation class profiles (see Table 5). Furthermore, words that are important terms for unigram classification also rank high in the combined class profiles: On average, there is an 80%
24 A class profile is the model built by the LCS classifier for a class during training. It consists of a ranked list of terms that contribute most to distinguishing members of a class from all other classes.
25 H04: P: + 3.09%; R: + 1.83%; H01: P: + 3.61%; R: + 5.14%.
26 The rank of a term is based on the decreasing order of mass assigned to that term in the class profile.
(See Section 4.2.2.)
overlap of the top 1,000 most important words in unigram and combined representation
class profiles. This decreases to 75% when looking at the 5,000 most important words.
This shows that the classifier tends to select mostly the same words as important terms
for the different text representation combinations. The relative ranking of the words is
very similar in the class profiles of all the text representations. Thus, adding phrases
to unigrams does not result in replacing the most important unigrams for a particular
class and the improvements in classification accuracy must derive from the additional
information in the selected phrases.
4.2.2 Overlap of Single Words and Parts of Bigrams. Like Caropreso, Matwin, and Sebastiani
(2001), we investigated to what extent the parts of the high-ranked phrases overlap with
words in the unigrams + bigrams class profile. We first looked at the lexical overlap of
the words and the parts of the bigrams in the H01 unigrams + bigrams class profile.
Interestingly, we found a relatively low overlap between the words and the parts of
the phrases: For the 20 most important bigrams, only 11 of the 32 unique parts of the
bigrams overlap with the 100 most important single word terms; in the complete class
profile only 56% of the 10,387 parts of the bigrams overlap with the 9,064 words in
the class profile. This means that a large part of the bigrams contains complementary
information not present in the unigrams in the class profile.
To gain a deeper insight into the relationship between the bigrams and their parts, we also looked at the mass of the different terms in the class profiles. The mass of a term for a certain class is the product of its TF–IDF score and its Winnow weight for that class; "mass" provides an estimate of how much a term contributes to the score of documents for a particular class. We can divide the terms into three main categories:

(a) mass(partA) ≥ mass(partB) ≥ mass(bigram);
(b) mass(partA) ≥ mass(bigram) > mass(partB);
(c) mass(bigram) > mass(partA) ≥ mass(partB).

We note that 50% of the top 1,000 highest ranked bigrams fall within category (b) and typically consist of one part with high mass accompanied by a part with a low mass, which can be a function word (for example, a in a transmitter) or a general term (for example, device in optical device). The highest ranked bigrams can be found in category (a), where two highly informative words are combined to form very specific concepts, for example, fuel cell. These are specifications of a more general concept that is typical for that class in the corpus. The bigrams in this category are similar to those investigated by Caropreso, Matwin, and Sebastiani (2001) and Tan, Wang, and Lee (2002). Though highly ranked, they only make up a small subset (22%) of the important bigram features.
The bigrams in category (c) (27%) are typically made up from low-ranked single words, such as mobile station. Interestingly, most bigram parts in this subset do not occur as word terms in the unigram and bigram class profiles, but occur in the negative class profiles (a selection of terms that are considered to describe anything but that particular class). The complementary information of bigram phrases (compared to unigrams) is contained in this set of bigrams.
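The mass bookkeeping used in this analysis can be sketched as follows; all numbers are invented for illustration, and the thresholds simply compare the mass of a bigram with the masses of its two parts.

def mass(tfidf, winnow_weight):
    """Mass of a term for a class: the product of its TF-IDF score and its Winnow weight."""
    return tfidf * winnow_weight

def category(mass_bigram, mass_part_a, mass_part_b):
    hi, lo = max(mass_part_a, mass_part_b), min(mass_part_a, mass_part_b)
    if mass_bigram <= lo:
        return "a"          # both parts outweigh the bigram (e.g., fuel cell)
    if mass_bigram <= hi:
        return "b"          # one informative part, one weak part (e.g., a transmitter)
    return "c"              # the bigram outweighs both of its parts (e.g., mobile station)

print(category(mass(0.2, 1.1), mass(0.9, 1.4), mass(0.8, 1.3)))   # -> a
print(category(mass(0.5, 1.2), mass(0.9, 1.4), mass(0.1, 0.9)))   # -> b
print(category(mass(0.9, 1.5), mass(0.3, 0.8), mass(0.2, 0.7)))   # -> c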
4.3 Statistical versus Linguistic Phrases
Results in Section 4.1 indicate that bigrams are the most important additional features, but the experiment combining all four representations showed that dependency triples do
complement bigrams. In this section we examine what information is captured by the
different phrases and how this accounts for the differences in classification accuracy.
4.3.1 Class Profile Analysis. We first examined the differences between the statistical
phrases and the two types of linguistic phrases to discover what information contained
in the bigrams leads to better classification results. We performed our analysis on the
different class profiles of B60 (“Vehicles in general”), a medium-sized class, which most
clearly shows the advantage of the bigram classifier compared to the classifiers with
linguistic phrases.27
All four class profiles with phrases contain roughly the same set of unigrams (between 78% and 91% overlap) that occur quite high in the corresponding unigram class profile. The AEGIR class profile contains 10% more unigrams than the other combined representation class profiles; these are mainly words that appear in the negative class profile of the corresponding unigram classifier. As in class H01, the relative position of the words remains the same. The absolute position of the words in the list, however, does change: Caropreso, Matwin, and Sebastiani (2001) introduced a measure for the effectiveness of phrases as terms, called the penetration, that is, the percentage of phrases in the top k terms when classifying with both words and phrases.
Comparing the penetration levels at the various ranks for the different classifiers, we can see that the classification results correspond with the tendency of a classifier to select phrases in the top k terms. Interestingly, we see a large disparity in the phrasal features that are selected by the combination classifier. The preference for bigrams is mirrored by the penetration levels of the unigrams + bigrams classifier, which has selected more bigrams at higher ranks in the class profile than the classifiers with the linguistic phrases. This is in line with the findings of Caropreso, Matwin, and Sebastiani (2001) that penetration levels are a reasonable way to compute the contribution of n-grams to the quality of a feature set. On average, the linguistic phrases have much smaller weight in the class profiles than the bigrams and, as a consequence, are likely to have a smaller impact during the classification process. For the combination run, however, it seems that a long tail of small-impact features does improve classification accuracy.
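A small Python sketch of the penetration measure is given below; the ranked class profile is invented and phrasal terms are recognized by a crude surface test.

def penetration(ranked_terms, k):
    """Percentage of phrasal terms among the top-k terms of a class profile."""
    top = ranked_terms[:k]
    phrases = [t for t in top if " " in t or "(" in t]
    return 100.0 * len(phrases) / len(top)

profile = ["vehicle", "fuel cell", "brake", "ATTR(module, separate)", "wheel", "drive shaft"]
for k in (2, 4, 6):
    print(k, penetration(profile, k))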
Linguistic analysis of the top 100 phrases in the profiles of class B60 shows that all classifiers select similar types of phrases. We manually annotated the bigrams with the correct syntactic dependencies (in the Stanford collapsed typed dependency format) and compared these with the syntactic relations expressed in the linguistic phrases. The results are summarized in Table 6.
It appears that noun phrases and compounds such as circuit board and electric device are by far the most important terms in the class profiles. Interestingly, phrases that contain a determiner relation (e.g., the device) are deemed equally important in all four different class profiles. It is unlikely that this is a semantic effect, that is, that the determiner relation provides additional semantic information to the nouns in the phrases; rather, it seems an artefact of the abundance of noun phrases that occur in patent texts. We also looked into the lexical overlap between the parts of the different types of phrases. We found that the selected phrases encode almost exactly the same information in all three representations: There is an 80% overlap between the parts of
27 Precision is 77.34% for unigrams+bigrams, 75.67% for unigrams+Stanford, 73.47% for unigrams+AEGIR, and 77.38% for unigrams+bigrams+Stanford+AEGIR. The Recall scores are essentially equal for all four, that is, 68.81%, 68.38%, 69.7%, and 70.18%, respectively.
Table 6
Distribution of the top 100 statistical and syntactic phrases in the B60 class profiles.
grammatical relation      bigrams   stanford   AEGIR                  combination
noun–noun compounds       41        48         62 (see footnote 28)   44 (see footnote 28)
adjectival modifier       11        8          –                      –
determiner                34        28         27                     41
subject                   6         4          6                      9
prepositions              2         4          1                      2
other                     7         8          4                      4
the top 100 most important phrases. This decreases only to 75% when looking at the
10,000 most important phrases.
Given that the class profiles select the same set of words and contain phrases with a high lexical overlap, how, then, do we explain the marked differences in classification accuracy between the three different representations? These must stem from the different combinations of the words in the phrasal features. To examine in detail how the features created through the different text representations differ, we conducted a feature quality assessment experiment against a manually created reference set.
4.3.2 Human Quality Assessment Experiment. To gain more insight into the syntactic and semantic relations that are considered most informative by humans, we conducted
an experiment in which we asked human annotators to select the five to ten most
informative phrases29 for 15 sentences taken at random from documents in the three
largest classes in the corpus. We then compiled a reference set consisting of 70 phrases
(4.6 phrases per sentence) which were considered as “informative” by at least three
out of four annotators. De estos, 57 phrases were noun–noun compounds and 11 eran
combinations of an adjectival modifier with a noun. None of the annotators selected
phrases containing determiners.
We created bigrams from the input and extracted head–modifier pairs30 from the
parser output for the sentences in the test set. We then compared the overlap of the
generated phrases with the reference phrases. We found that bigrams overlap with
53 del 70 reference phrases; Stanford triples overlap with 62 phrases and AEGIR
triples overlap with 57 phrases. Although three data points are not enough to compute
a formal measure, it is interesting to note the correspondence with the number of terms
kept for the three text representations after Local Term Selection (see Table 3). The fact that the text representation with the smallest number of terms after LTS and with the smallest overlap with "contentful" phrases in a text as indicated by human annotators
still yields the best classification performance suggests that not all “contentful” phrases
are important or useful for the task of classifying that text. This finding is reminiscent of
the fact that the “optimal” summary of a text is dependent on the goal with which the
summary was produced (Nenkova and McKeown 2011).
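Since reference phrases, bigrams, and head–modifier pairs can all be reduced to unordered word pairs (footnotes 29 and 30), the coverage comparison above boils down to set intersection. A minimal sketch with a single example sentence and hand-written, simplified dependency triples:

# Reduce candidate phrases to unordered word pairs and measure how many
# reference pairs they cover. The triples are simplified examples in the form
# (head, relation, modifier), not actual parser output.

def bigram_pairs(tokens):
    """Unordered pairs of adjacent words."""
    return {frozenset(p) for p in zip(tokens, tokens[1:])}

def triple_pairs(triples):
    """Head-modifier pairs: dependency triples stripped of their relation label."""
    return {frozenset((head, modifier)) for head, _rel, modifier in triples}

sentence = "the rotation of the second axis".split()
parses = [("rotation", "prep_of", "axis"), ("axis", "amod", "second"), ("rotation", "det", "the")]
reference = {frozenset(("rotation", "axis")), frozenset(("second", "axis"))}

for name, candidates in [("bigrams", bigram_pairs(sentence)), ("triples", triple_pairs(parses))]:
    print(name, "cover", len(reference & candidates), "of", len(reference), "reference pairs")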
Only 15% of the phrases extracted by the human annotators contain word combi-
nations that have long-distance dependencies in the original sentences. This suggests
that the most meaningful phrases are expressed in local dependencies, that is, adjacent
words. As a consequence, syntactic analysis aimed at discovering meaning expressed by
long-distance dependencies can only make a small contribution. A further analysis of
the phrases showed that the smaller coverage of the bigrams is due to the fact that some
of the relevant noun–noun combinations are missed because function words, typically
determiners or prepositions, occur between the nouns. For example, the annotators
constructed the reference phrase rotation axis for the noun phrase the rotation of the
second axis. This reference phrase cannot be captured by the bigram representation.
When intervening function words are removed from the sentences, the coverage of
the resulting bigrams on the reference set rises31 to 59 phrases (more than AEGIR, and
almost as many as Stanford). Despite the fact that generating more phrases does not
necessarily lead to better classification performance, we intend to use bigrams stripped
of function words as additional terms for patent classification in future experiments.

28 As mentioned in Section 3.2.3, the AEGIR parser uses a more condensed dependency output format.
Stanford's nn and amod relations are collapsed into the attributive (ATTR) relation.
29 “Phrase” was defined as a combination of two words that both occur in the sentence, irrespective of
the order in which they occur in the sentence.
30 Head–modifier pairs are syntactic triples that are stripped of their grammatical relations.
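A minimal sketch of the variant proposed above: strip function words before forming bigrams, so that a pair such as rotation axis is recovered when only function words intervene. The stop-word list is a small illustrative sample, not the one that would be used in practice, and the example phrase is a simplified variant of the one discussed above.

# Bigrams formed after removing function words, so that noun-noun combinations
# separated only by determiners or prepositions are recovered. Illustrative
# stop-word sample, not the actual list used in the experiments.
FUNCTION_WORDS = {"the", "a", "an", "of", "for", "in", "on", "with", "and"}

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def content_bigrams(tokens):
    content = [t for t in tokens if t.lower() not in FUNCTION_WORDS]
    return bigrams(content)

tokens = "the rotation of the axis".split()
print(bigrams(tokens))          # [('the', 'rotation'), ('rotation', 'of'), ('of', 'the'), ('the', 'axis')]
print(content_bigrams(tokens))  # [('rotation', 'axis')]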
The analysis also revealed an indication of why syntactic phrases may lead to inferior
classification results: Both syntactic parsers consistently fail to find the correct structural
analysis of long and complex noun phrases such as an implantable, inflatable dual
chamber shape retention tissue expander, which are frequent in patent texts. Phrases like
this contain many compounds in an otherwise complex syntactic structure, namely
[an [implantable, inflatable [[dual chamber] [shape retention] [tissue expander]]]].
For a parser it is impossible to parse this correctly without knowing which word se-
quences are actually compounds. That knowledge might be gleaned from the frequency
with which sequences of nouns and adjectives occur in a given domain. For the time
being, the Stanford parser (and the AEGIR parser, to a lesser extent) will parse any
noun phrase by attaching the individual words to the right-most head noun, resulting
in the following analysis:
[an [implantable, [inflatable [dual [chamber [shape [retention [tissue expander]]]]]]]].
This effectively destroys many of the noun–noun compounds, which are the most
important features for patent classification (ver tabla 6). Bigrams are less prone to this
type of “error.”
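The effect of the right-headed attachment can be made concrete with a small sketch that extracts (modifier, head) pairs from the two bracketings of the compound-bearing part of the phrase, under the simplifying assumption that every constituent is headed by its right-most child; this pair-extraction scheme is our illustration, not the parsers' actual procedure.

# Extract (modifier, head) pairs from a nested bracketing, assuming each
# constituent is headed by its right-most child. Illustrative simplification only.
def head(node):
    return node if isinstance(node, str) else head(node[-1])

def pairs(node):
    if isinstance(node, str):
        return []
    result = [(head(child), head(node)) for child in node[:-1]]
    for child in node:
        result.extend(pairs(child))
    return result

compound_analysis = (("dual", "chamber"), ("shape", "retention"), ("tissue", "expander"))
right_headed = ("dual", ("chamber", ("shape", ("retention", ("tissue", "expander")))))

print(sorted(pairs(compound_analysis)))
# keeps the compounds: ('dual', 'chamber'), ('shape', 'retention'), ('tissue', 'expander'), ...
print(sorted(pairs(right_headed)))
# attaches every modifier to 'expander', losing the compounds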
These findings are confirmed when looking at the overlap of the word combinations:
Although there is high lexical overlap between the phrases of the different
representations (80% overlap of the parts of phrases in Section 4.3.1), the overlap of
the word combinations that make up the phrases is much lower: only 33% of the top
1,000 phrases are common between all three representations.
4.4 Stanford versus AEGIR Triples
The performance with the unigrams + Stanford triples is not significantly different from
the combination with AEGIR triples. Because the AEGIR triples are slightly less sparse
(see Table 1), we expected that these would have an advantage over Stanford triples.
Most of the normalization processes that make the AEGIR triples less sparse concern
syntactic variation on the clause level, however. But as was shown in Section 4.3,
the most important terms for classification in the patent domain are found in the
noun phrase, where Stanford and AEGIR perform similar syntactic analyses. Although
Stanford’s dependency scheme is more detailed (see Table 6), the noun-phrase internal
dependencies in the Stanford parser map practically one-to-one onto AEGIR’s set of
relators, resulting in very similar dependency triple features for classification. Consequently,
there is no normalization gain in using the AEGIR dependency format to describe the
internal structure of the noun phrases.

31 This result is language-dependent: English has a fairly rigid phrase-internal word order, but for a more
synthetic language with a more variable word order, like Russian, bigram coverage might suffer from
the variation in the surface form.

Table 7
Classification results on CLEF-IP 2010 French and German abstracts, with ranges for
95% confidence intervals.

                                 P                R                F1
French   unigrams                70.65% ± 0.68    61.40% ± 0.73    65.70% ± 0.70
French   unigrams + bigrams      72.31% ± 0.67    62.58% ± 0.72    67.09% ± 0.69
German   unigrams                76.44% ± 0.34    65.82% ± 0.38    70.73% ± 0.37
German   unigrams + bigrams      76.39% ± 0.34    65.41% ± 0.38    70.47% ± 0.37
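As a quick sanity check, the F1 values in Table 7 correspond to the harmonic mean of the reported precision and recall; for example, for the French unigram run:

# F1 as the harmonic mean of precision and recall, checked against Table 7
# (French, unigrams): P = 70.65, R = 61.40 -> F1 = 65.70.
def f1(p, r):
    return 2 * p * r / (p + r)

print(round(f1(70.65, 61.40), 2))  # 65.7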
4.5 Comparison with French and German Patent Classification
We found that phrases contribute to improving classification on English patent ab-
stracts. The improvement might be language-dependent, sin embargo, because compounds
are treated differently in different languages. A compounding language like German
might benefit less from using phrases than English. To estimate the generalizability of
our findings, we conducted additional experiments in which we compared the impact
of adding bigrams to unigrams for both French and German.
Using the same methods described in Sections 3.1 and 3.2, we extracted and processed
all French and German abstracts from the CLEF-IP 2010 corpus, resulting in two
new data sets that contained 86,464 and 294,482 documents, respectively (Table 7). Both
data sets contained the same set of 121 labels and had label distributions similar to the
English data set. The sentencing script was updated with the most common French and
German abbreviations to minimize incorrect sentence splitting. The resulting sentences
were then tagged using the French and German versions of the TreeTagger.32 From the
tagged output, we extracted the lemmas and used these to construct unigrams and
bigrams for both languages. We ran the experiments with the LCS using the settings
reported in Section 3.3.
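A minimal sketch of this preprocessing pipeline: protect common abbreviations during sentence splitting, lemmatize, and build unigram and bigram features. The abbreviation sample, the naive splitter, and the lemmatize() stand-in (which in the actual experiments would be a call to the TreeTagger) are all simplifications for illustration.

import re

# Abbreviations that must not trigger a sentence split; illustrative sample only.
ABBREVIATIONS = {"fig.", "bzw.", "z.b.", "env.", "etc."}

def split_sentences(text):
    """Naive splitter: break on sentence-final punctuation unless the token is a
    known abbreviation."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "!", "?")) and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def lemmatize(sentence):
    """Stand-in for the TreeTagger call; here we merely lowercase the tokens and
    strip trailing punctuation."""
    return [re.sub(r"\W+$", "", tok.lower()) for tok in sentence.split()]

def features(sentence):
    lemmas = lemmatize(sentence)
    bigrams = ["%s_%s" % pair for pair in zip(lemmas, lemmas[1:])]
    return lemmas + bigrams

for s in split_sentences("Die Vorrichtung umfasst z.B. einen Rotor. Der Rotor dreht sich."):
    print(features(s))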
The results show a much smaller but still significant improvement for using bigrams
when classifying French patent abstracts and even a deterioration for German. Due to
the difference in size between the English and French data sets, it is difficult to draw hard
conclusions on which language benefits most from adding bigrams. It is clear, sin embargo,
that our findings are not generalizable to German (and probably other compounding
idiomas).
5. Conclusion
In this article we have examined the usefulness of statistical and linguistic phrases
for patent classification. Similar to Özgür and Güngör's (2010) results for scientific
abstracts, we found that adding phrases to unigrams significantly improves classification
results for English. Of the three types of phrases examined in this article, bigrams have
the most impact, both in the experiment that combined all four text representations,
and in combination with unigrams only.

32 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
The abundance of compounds in the terminology-rich language of the patent do-
main results in a relatively high importance for the phrases. The top phrases across
the different representations were mostly noun–noun compounds (for example watering
device), followed by phrases containing a determiner relation (for example the module)
and adjective modifier phrases (for example separate module).
The information in the phrases and unigrams overlaps to a large extent: Most of the
phrases consist of words that are important unigram features in the combined profile
and that also appear in the corresponding unigram class profile. When examining the
H01 class profiles, however, we found that 27% of the selected phrases contain words
that were not selected in the unigram profile (see Section 4.2.2).
When comparing the impact of features created from the output of the aboutness-
based AEGIR parser with those from the Stanford parser, we found the latter resulted in
slightly (but not significantly) better classification results. AEGIR’s normalization fea-
tures are not advantageous (compared with Stanford) in creating noun-phrase internal
triples, which are the most informative features for patent classification.
The parsers were not specifically trained for the patent domain and both experi-
enced problems with long, complex noun phrases consisting of sequences of words
that can function as adjective/adverb or noun and that are not interrupted by function
words that clarify the syntactic structure. The right-headed bias of both syntactic parsers
caused problems in analyzing those constructions, yielding erroneous and variable
data. As a consequence, parsers may miss potentially relevant noun–noun compounds
and noun phrases with adjectival modifiers. Because of the highly idiosyncratic nature
of the terminology used in the patent domain, it is not evident whether this prob-
lem can be solved by giving a parser access to information about the frequency with
which specific noun–noun, adjective–noun, and adjective/adverb–adjective pairs occur
in technical texts. Bigrams, on the other hand, are less variable (as seen in Table 1) and
therefore yield better classification results. This is all the more important because the
dependency relations marked as important for understanding a sentence by the human
annotators consist mainly of pairs of adjacent words.
We also performed additional experiments to examine the generalizability of our
findings for French and German: As could be expected, compounding languages like
German which express complex concepts in “one word” do not gain from using
bigrams.
In line with Bekkerman and Allan (2003) we can conclude that with the large quan-
tities of text available today, the role of phrases as features in text classification must
be reconsidered. For the automated classification of English patents at least, adding
phrases and more specifically bigrams significantly improves classification accuracy.
References
Apté, Chidanand, Fred Damerau, and Sholom Weiss. 1994. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3):233–251.
Bekkerman, Ron and John Allan. 2003. Using bigrams in text categorization. Technical Report IR-408, Center of Intelligent Information Retrieval, University of Massachusetts, Amherst.
Beney, Jean. 2010. LCI-INSA linguistic experiment for CLEF-IP classification track. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010), Padua.
Benzineb, Karim and Jacques Guyot. 2011. Automated patent classification. In Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors, Current Challenges in Patent Information Retrieval, volume 29. Springer, New York, pages 239–261.
Braga, Igor, Maria Monard, and Edson Matsubara. 2009. Combining unigrams and bigrams in semi-supervised text classification. In Proceedings of Progress in Artificial Intelligence, 14th Portuguese Conference on Artificial Intelligence (EPIA 2009), pages 489–500, Aveiro.
Caropreso, Maria Fernanda, Stan Matwin, and Fabrizio Sebastiani. 2001. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In A. G. Chin, editor, Text Databases & Document Management. IGI Publishing, Hershey, PA, pages 78–102.
Crawford, Elisabeth, Irena Koprinska, and Jon Patrick. 2004. Phrases and feature selection in e-mail classification. In Proceedings of the 9th Australasian Document Computing Symposium (ADCS), pages 59–62, Melbourne.
Dagan, Ido, Yael Karov, and Dan Roth. 1997. Mistake-driven learning in text categorization. In Proceedings of the 2nd Conference on Empirical Methods in NLP, pages 55–63, Providence, Rhode Island.
de Marneffe, Marie-Catherine and Christopher Manning. 2008. The Stanford Typed Dependencies representation. In Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8, Manchester.
Derieux, Franck, Mihaela Bobeica, Delphine Pois, and Jean-Pierre Raysz. 2010. Combining semantics and statistics for patent classification. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010), Padua.
Dumais, Susan, John Platt, David Heckerman, and Mehran Sahami. 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM '98), pages 148–155, Bethesda.
Fall, Caspar J. and Karim Benzineb. 2002. Literature survey: Issues to be considered in the automatic classification of patents. Technical report, World Intellectual Property Organization, Geneva.
Fall, Caspar J., Atilla Törcsvári, Karim Benzineb, and Gabor Karetka. 2003. Automated categorization in the international patent classification. ACM SIGIR Forum, 37(1):10–25.
Fürnkranz, Johannes. 1998. A study using n-gram features for text categorization. Technical Report OEFAI-TR-98-30, Austrian Research Institute for Artificial Intelligence, Vienna.
Fürnkranz, Johannes. 1999. Exploiting structural information for text classification on the WWW. In Proceedings of Advances in Intelligent Data Analysis (IDA-99), pages 487–497, Amsterdam.
Galavotti, Luigi, Fabrizio Sebastiani, and Maria Simi. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of Research and Advanced Technology for Digital Libraries, 4th European Conference, pages 59–68, Lisbon.
Guyot, Jacques, Karim Benzineb, and Gilles Falquet. 2010. Myclass: A mature tool for patent classification. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010), Padua.
Held, Pierre, Irene Schellner, and Ryuichi Ota. 2011. Understanding the world's major patent classification schemes. Paper presented at the PIUG 2011 Annual Conference Workshop, Vienna, 13 April.
Joachims, Thorsten. 1999. Making large-scale support vector machine learning practical. In Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods. MIT Press, Cambridge, MA, pages 169–184.
Koster, Cornelis, Jean Beney, Suzan Verberne, and Merijn Vogel. 2011. Phrase-based document categorization. In Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors, Current Challenges in Patent Information Retrieval, volume 29. Springer, New York, pages 263–286.
Koster, Cornelis, Marc Seutter, and Jean Beney. 2001. Classifying patent applications with winnow. In Proceedings of Benelearn 2001, pages 19–26, Antwerpen.
Koster, Cornelis, Marc Seutter, and Jean Beney. 2003. Multi-classification of patent applications with winnow. In Manfred Broy and Alexandre V. Zamulin, editors, Perspectives of Systems Informatics: 5th International Andrei Ershov Memorial Conference, volume 2890 of Lecture Notes in Computer Science. Springer, New York, pages 546–555.
Koster, Cornelis and Mark Seutter. 2003. Taming wild phrases. In Proceedings of the 25th European Conference on IR Research (ECIR'03), pages 161–176, Pisa.
Krier, Marc and Francesco Zaccà. 2002. Automatic categorization applications at the European patent office. World Patent Information, 24(3):187–196.
Larkey, Leah. 1998. Some issues in the automatic classification of U.S. patents. In Working Notes of the Workshop on Learning for Text Categorization, 15th National Conference on AI, pages 87–90, Madison, Wisconsin.
Larkey, Leah S. 1999. A patent search and classification system. In Proceedings of the Fourth ACM Conference on Digital Libraries (DL'99), pages 179–187, Berkeley.
Lewis, David D. 1992. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '92), pages 37–50, Copenhagen.
Mille, Simon and Leo Wanner. 2008. Making text resources accessible to the reader: The case of patent claims. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC'08), Marrakech.
Mitra, Mandar, Chris Buckley, Amit Singhal, and Claire Cardie. 1997. An analysis of statistical and syntactic phrases. In Proceedings of RIAO'97 Computer-Assisted Information Searching on Internet, pages 200–214, Montreal.
Mladenic, Dunja and Marko Grobelnik. 1998. Word sequences as features in text-learning. In Proceedings of the 17th Electrotechnical and Computer Science Conference (ERK98), pages 145–148, Ljubljana.
Moschitti, Alessandro and Roberto Basili. 2004. Complex linguistic features for text classification: A comprehensive study. In Sharon McDonald and John Tait, editors, Advances in Information Retrieval, volume 2997 of Lecture Notes in Computer Science. Springer, New York, pages 181–196.
Nastase, Vivi, Jelber Sayyad, and Maria Fernanda Caropreso. 2007. Using dependency relations for text classification. Technical Report TR-2007-12, University of Ottawa.
Nenkova, Ani and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233.
Özgür, Levent and Tunga Güngör. 2009. Analysis of stemming alternatives and dependency pattern support in text classification. In Proceedings of the Tenth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2009), pages 195–206, Mexico City.
Özgür, Levent and Tunga Güngör. 2010. Text classification with the support of pruned dependency patterns. Pattern Recognition Letters, 31(12):1598–1607.
Özgür, Levent and Tunga Güngör. 2012. Optimization of dependency and pruning usage in text classification. Pattern Analysis and Applications, 15(1):45–58.
Parapatics, Peter and Michael Dittenbach. 2009. Patent claim decomposition for improved information extraction. In Proceedings of the 2nd International Workshop on Patent Information Retrieval (PAIR'09), pages 33–36, Hong Kong.
Salton, Gerard and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523.
Scott, Sam and Stan Matwin. 1999. Feature engineering for text classification. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML '99), pages 379–388, Bled.
Smith, Harold. 2002. Automation of patent classification. World Patent Information, 24(4):269–271.
Tan, Chade-Meng, Yuan-Fang Wang, and Chan-Do Lee. 2002. The use of bigrams to enhance text categorization. Information Processing and Management, 38(4):529–546.
Verberne, Suzan and Eva D’hondt. 2011. Patent classification experiments with the Linguistic Classification System LCS in CLEF-IP 2011. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2011), Amsterdam.
Verberne, Suzan, Merijn Vogel, and Eva D’hondt. 2010. Patent classification experiments with the Linguistic Classification System LCS. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010), Padua.