A Survey of Arabic Named Entity
Recognition and Classiﬁcation

Khaled Shaalan∗
School of Informatics, University of Edinburgh, UK
The British University in Dubai, UAE

As more and more Arabic textual information becomes available through the Web in homes
and businesses, via Internet and Intranet services, there is an urgent need for technologies and
tools to process the relevant information. Named Entity Recognition (NER) is an Information
Extraction task that has become an integral part of many other Natural Language Processing
(NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun
to receive attention in recent years. The characteristics and peculiarities of Arabic, a member
of the Semitic languages family, make dealing with NER a challenge. The performance of
an Arabic NER component affects the overall performance of the NLP system in a positive
manner. This article attempts to describe and detail the recent increase in interest and progress
made in Arabic NER research. The importance of the NER task is demonstrated, the main
characteristics of the Arabic language are highlighted, and the aspects of standardization in
annotating named entities are illustrated. Moreover, the different Arabic linguistic resources
are presented and the approaches used in the Arabic NER field are explained. The features of
common tools used in Arabic NER are described, and standard evaluation metrics are illustrated.
In addition, a review of the state of the art of Arabic NER research is discussed. Finally,
we present our conclusions. Throughout the presentation, illustrative examples are used for
clariﬁcation.

1. Introduction

In the 1990s, in particular at the Message Understanding Conferences, Named Entity
Recognition (NER) was ﬁrst introduced as an information extraction task and deemed
important by the research community. In NER, the expression “named entity” (NE)
covers not only proper names but also includes temporal expressions and some nu-
merical expressions such as monetary amounts and other types of units. Proper names
include three classic specializations (referred to as types or classes in the literature):
persons, locations, and organizations. For example, in the sentence Ahmed Khaled, CEO
of Arabisoft Company in Egypt, Ahmed Khaled, Arabisoft Company, and Egypt would be
identiﬁed as references to a person, an organization, and a location, respectively. A type
can in turn be divided into subtypes (Sekine, Sudo, and Nobata 2002), possibly forming
an entity type hierarchy (Pappu 2009). For example, locations might be divided into

∗ The British University in Dubai (BUiD), P.O. Box 345015, Dubai, UAE.

E-mail: khaled.shaalan@buid.ac.ae.

Submission received: 12 September 2012; revised submission received: 12 March 2013; accepted for
publication: 17 July 2013.

doi:10.1162/COLI_a_00178

Downloaded from http://direct.mit.edu/coli/article-pdf/40/2/469/1803591/coli_a_00178.pdf by guest on 07 September 2023

Computational Linguistics

Volume 40, Number 2

multiple ﬁne-grained locations, such as city, state, and country. For speciﬁc needs other
types might be introduced, such as e-mail address, phone number, book ISBN, ﬁlename,
and so on.

A good portion of NER research is devoted to the study of English, due to its signif-
icance as a dominant language that is used internationally for communications, science,
information technology, business, seafaring, aviation, entertainment, and diplomacy.
This has limited the diversity of text genre and domain factors from other languages
that are usually considered when developing NER for these ﬁelds. For instance, as
most scientiﬁc studies are conducted in English in almost all Arabic-speaking countries,
there is no urgency to investigate Arabic NER for areas such as bioinformatics, drug, or
chemical named entities.

NER can be deﬁned as the task that attempts to locate, extract, and automatically
classify named entities into predeﬁned classes or types in open-domain and unstruc-
tured texts, such as newspaper articles (Nadeau and Sekine 2007). One obvious reason
for the importance of named entities is their pervasiveness, which is evidenced by the
high frequency, including occurrence and co-occurrence, of named entities in corpora
(cf. Saravanan et al. 2012). Arabic is a language of rich morphology and syntax. Its
characteristics and peculiarities make dealing with it a challenge (Farghaly and Shaalan
2009). The last decade has shown a growing interest in addressing challenges that
underlie the development of a productive and robust Arabic NER system (Al-Jumaily
et al. 2012; Oudah and Shaalan 2012).

This article investigates the progress in Arabic NER research. The survey by Nadeau
and Sekine (2007) presents background on much of the work on NER for a variety of
languages and myriad machine learning (ML) techniques. To the best of our knowl-
edge, Arabic NER and classiﬁcation have not yet been surveyed extensively, which has
motivated us to conduct this survey.

The survey is structured as follows. Section 2 provides background information
relevant for working with Arabic NER. Section 3 presents some aspects of the Arabic
language that will allow the reader to appreciate the difﬁculties associated with Arabic
NER. Section 4 brieﬂy introduces the standard tag sets commonly used to annotate
named entities. Section 5 describes the Arabic NER language-speciﬁc resources that are
involved in the NER task. Section 6 gives a brief description of approaches used in
Arabic NER. Section 7 discusses feature selection, which is a critical factor for achieving
better performance for NER systems. Section 8 presents various tools that have been
used in building Arabic NER systems and Section 9 illustrates evaluation techniques
for NER systems. Section 10 presents the state-of-the-art in Arabic NER research. Finally,
the concluding remarks are presented in Section 11.

2. Background

2.1 Entity Tracking

The task of identifying named entities must be distinguished from entity tracking,
which involves identifying mentions, relations, and the co-references that may exist
between them. In this regard, a NE may contain only one mention such as a person
name (e.g., Mohammed Morsi), but when a pronoun is used to refer to the same person,
it is considered another mention of that entity. Moreover, a nominal (e.g., president) can
also be used as a mention to refer to the same NE (cf. Zitouni et al. 2005). It should be
noted that the richness of Arabic morphology allows two mentions to appear in one
word (e.g., رئيسنا, our president, president-our), where a pronominal (نا, our) can appear
as a suffix pronoun to a nominal (e.g., رئيس, president). A co-reference exists when a
group of mentions refers to the same entity. For example, in the sentence The [Egyptian
President], [Mohammad Morsi], as the [chair of the 15th Non-Aligned Movement summit]
declared opening of the 16th summit, there are three mentions that refer to the same person.
Mentions also include aliases such as Abu Ammar, which refers to the same entity as
Yasser Arafat.

An entity relation may be established between two or more NEs, such as a
person, an organization, a location, or a speciﬁc time. Relationships between NEs can be
binary, such as person-afﬁliation or organization-location, or may involve more entities;
for example, [a person] is in [a place] at [a speciﬁc time]. The entity relation is usually
expressed in a predicate form and is used to establish relations such as whether two
persons were working at the same organization at the same time (Ben Hamadou, Odile,
and Héla 2010a).

In summary, it is important to direct attention to the choice of the recognition
unit (i.e., real world NE, mention, co-reference, or relation), because mention detec-
tion, co-reference resolution, and relation extraction are considered more difﬁcult than
the traditional NER task due to the complexity incurred by extracting non-named
mentions, grouping mentions into entities, and deriving semantic relations among
entities.

2.2 The Broader Role of NER

NER research has implications throughout NLP, because many downstream tasks
depend on it. Examples of applications for which NER is useful are given in this
section.

Information Retrieval. This is the task of identifying and retrieving relevant
documents from a set of data according to an input query. A study by Guo et al.
(2009) has indicated that about 71% of the queries in search engines contain NEs.
Information Retrieval can beneﬁt from NER in two phases (Benajiba, Diab, and Rosso
2009a): ﬁrstly, recognizing the NEs within the query; and secondly, recognizing the NEs
within the searched documents, and then extracting the relevant documents taking into
account their classified NEs and how they are related to the query. For example, the
word الجزيرة (Aljazeera) can be recognized as an organization name or a noun
corresponding to the word island; the correct classification will facilitate extracting
relevant documents.

Question Answering. This is very similar to Information Retrieval but with more
sophisticated results. A Question Answering system takes questions as input and gives
in return concise and precise answers (Ezzeldin and Shaheen 2012). The NER task can
be utilized in the phase of analyzing the question so as to recognize the NEs within the
question that will help later in identifying the relevant documents and constructing the
answer from relevant passages (Mollá, van Zaanen, and Smith 2006; Badawy, Shaheen,
and Hamadene 2011; Lahsen, Bouzoubaa, and Rosso 2012). For instance, the NE
الشرق الأوسط (Middle East) may be classified as an organization name (e.g., a newspaper)
or as a location name according to the context. Hence, the correct classification for the
NE will help to target the relevant group of documents that answer the input query.
Moreover, Question Answering systems could benefit substantially from NER, because
the answers to many factoid questions involve NEs (Trigui et al. 2012); for example,
answers to who (من هو/ما هي) questions usually involve persons or organizations, where
(أين) questions involve locations, and when (متى) questions involve temporal expressions
(Brini et al. 2009).

Machine Translation. This is the task of automatically translating a text from
one natural language into another. NEs need special attention in order to decide
which parts of an NE should be meaning-translated and which parts should be
phoneme-transliterated (Al-Onaizan and Knight 2002b; Hassan and Sorensen 2005).
Usually this depends on the type of the NE (Chen, Yang, and Lin 2003). For example,
personal names tend to be transliterated.1 For a location name, the name part and the
category part (e.g., mountains) are usually transliterated and translated, respectively.
Organization names are completely different in that most of the constituents are
translated (e.g., United Nations). The quality of the NER system plays a signiﬁcant
role in determining the overall quality of the machine translation system, and
hence, NE translation is critical for most multilingual application systems (Babych
and Hartley 2003; Ben Hamadou, Odile, and Héla 2010b; Steinberger 2012). In
addition, NE translation is very important for other applications such as cross-lingual
information retrieval for extracting newly introduced NEs from the Web and news
documents and regularly updating the list of NE translation pairs (Hassan, Fahmy, and
Hassan 2007).

Text Clustering. Search results clustering may exploit NER by ranking the re-
sulting clusters based on the ratio of entities each cluster contains (Benajiba, Diab,
and Rosso 2009a). This enhances the process of analyzing the nature of each cluster
and also improves the clustering approach in terms of selected features. For example,
time expressions along with location NEs can be utilized as factors that will give an
indication of when and where the events mentioned in a cluster of documents have
occurred.

Navigation Systems. These systems, which facilitate navigation using digital maps,
now play signiﬁcant roles in our lives. They provide directions, information about
nearby places possibly linked with other on-line resources, and trafﬁc conditions. In
these systems, points of interest (also known as waypoints) are NEs that are stored in
a database with their geographic coordinates (Kim, Kim, and Cho 2012). They refer to
areas of interest that are typically of signiﬁcance to, among others, tourists, visitors,
and rescuers, allowing the location of places such as parking areas, shops, hospitals,
restaurants, universities, schools, landmarks, and so on.

3. Linguistic Issues and Challenges

Arabic is a highly inﬂected language, with a rich morphology and complex syn-
tax (Al-Sughaiyer and Al-Kharashi 2004; Ryding 2005). Current Arabic NLP research
efforts cannot cope with the massive growth of Arabic data on the Internet and
the heightened need for accurate and robust processing tools (Abdul-Mageed, Diab,
and Korayem 2011). NER is considered one of the building blocks of Arabic NLP
tools and applications. Though signiﬁcant progress has been achieved in Arabic
NER research in the last decade, the task remains challenging due to the following

1 Transliteration is the task of replacing words in the source language with their approximate phonetic or
spelling equivalents in the target language. It unambiguously represents the graphemes, rather than the
phonemes, of the NE. Transliteration between languages that use similar alphabets and sound systems is
very simple. However, transliterating NEs between Arabic and English is a non-trivial task, mainly due
to the differences in their sound and writing systems (Al-Onaizan and Knight 2002a).

features of the Arabic language; opportunities for improved performance are still
available.

3.1 Arabic Script

The Arabic language relies on the Arabic script, which is also used in writing other
languages such as Persian, Urdu, Kurdish, and Pashto (Habash 2010). Some researchers
have developed Arabic computational tools and resources based on Romanized2
or transliterated3 Arabic text rather than genuine Arabic script (e.g., Buckwalter’s
Arabic Morphological Analyzer [Buckwalter 2002], the CJK lexical resources [Halpern
2009], and Arabic NER systems [Bidhend, Minaei-Bidgoli, and Jouzi 2012; Zayed
and El-Beltagy 2012]), either because these formats are more familiar to non-native
Arabic speakers or because of limitations in the Arabic script encoding imposed by
the development environment. This approach should disappear over time with the
rapidly growing quantity of Arabic script-based Web content and new technologies
that support multiple encodings.

3.2 Language in Use

With regard to language usage, Arabic can be classiﬁed into three types (Elgibali
2005): Classical Arabic (CA), Modern Standard Arabic (MSA), and Colloquial Arabic
Dialects (Abdel Monem et al. 2008; Habash 2010; Korayem, Crandall, and Abdul-
Mageed 2012). As far as Arabic NE is concerned, it is important to know the difference
between these various uses of the language. CA is the formal version that has been
used continuously for over 1,500 years as the language of Islam, used by Muslims
in their daily prayers. Most Arabic religious texts are written in CA. In this context,
person name recognition is of particular interest in order to identify and verify
the correctness of citations (Zaraket and Makhlouta 2012), that is, sequences of hadith
narrators referencing each other who provide narrations related to the Prophet
Mohammed based on known truthful and untruthful relaters. Verification matters
because the authenticity of a hadith needs to be established before its
narration is used in jurisprudence, and this depends on the credibility of the narrators.
Furthermore, many historical Arabic manuscripts are handwritten in CA (or Arabic
calligraphy); when they are digitized and converted to text, Arabic NER will become
important.

MSA is the language of today’s Arabic newspapers, magazines, periodicals, letters,
modern writers, and education. MSA is one of the six ofﬁcial languages of the United
Nations used in meetings and ofﬁcial UN documents. Most Arabic NLP, including NER
research projects, is focused on MSA. The main difference between MSA and CA lies
in the vocabulary, including NEs, and the orthography of conventional written Arabic
(Farber et al. 2008): MSA does not require the inclusion of short vowels. Moreover, MSA
reﬂects the needs of contemporary expression, whereas CA reﬂects the needs of older
styles. For example, the Arabic NEs in rare documents and old manuscripts that refer

2 Transliteration from Arabic to languages using the Latin alphabet is called Romanization.
3 In a multilingual context, transliteration of NEs would differ depending on the target language
(Pouliquen et al. 2005). For example, the Arabic name مصطفى could be transliterated into English
as Mustafa or Moustapha, while a likely French transliteration would be Moustafa or Moustapha.

to places, jobs, or organizations are different from the corresponding NEs in modern
documents.

Colloquial Arabic is the spoken Arabic used by Arabs in their informal daily com-
munication; it is not taught in schools due to its irregularity. Unlike the widespread
use of MSA across all Arab countries, colloquial Arabic is a regional variant that
differs not only among Arab countries, but also across regions in the same country.
Written Colloquial Arabic is presently used mainly in social media communication.
For comparison, a person name in either CA or MSA could be expressed in Arabic
dialect by more than one form; for example, عبد القادر (Abd Al-Kader) versus عبد الجادر
(Abd Al-Gader) or عبد الآدر (Abd Al-Aader). Salloum and Habash (2012) presented a
universal machine translation pre-processing approach that has the ability to produce
MSA paraphrases of dialectal input. In this way, available MSA tools can also be used
to process Colloquial Arabic text, as most of the Arabic NER systems are developed to
support MSA.

3.3 Lack of Capitalization

Unlike languages like English that use the Latin script, where most NEs begin with
a capital letter, capitalization is not a distinguishing orthographic feature of Arabic
script for recognizing NEs such as proper names, acronyms, and abbreviations (Farber
et al. 2008). The ambiguity caused by the absence of this feature is further increased
by the fact that most Arabic proper nouns (NEs) are indistinguishable from forms
that are common nouns and adjectives (non-NEs). Thus, an approach relying only
on looking up entries in proper noun dictionaries would not be an appropriate way
to tackle this problem, as ambiguous tokens/words that fall in this category are
more likely to be used as non-proper nouns in text (Algahtani 2011). For example,
the Arabic proper name أشرف (Ashraf) can be used in a sentence as a given name,
an inflected verb (he-supervised), and a superlative (the-most-honorable) (Mesfar 2007).
An NE is usually found in a context, namely, with trigger and cue words to the left
and/or right of the NE. Therefore, it is common to resolve this type of ambiguity by
analyzing the context surrounding the NE. However, this might require deeper analysis
of the NE’s context. As an example, consider the nominal sentence مسقط رأسه بجدة,
whose literal meaning might be the falling of his head in grandfather/Jeddah. The
correct analysis of the trigger constituent مسقط رأسه as a multiword expression
denoting place of birth leads to the recognition of the following noun as a
location name.
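
The trigger-based reading of the place-of-birth example can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `TRIGGERS` table, the `tag_after_trigger` function, and the `LOCATION` label are hypothetical names, and the single-entry lexicon is taken from the example in the text rather than from any published system.

```python
# Hypothetical trigger lexicon: a multiword cue mapped to the NE type it announces.
TRIGGERS = {("مسقط", "رأسه"): "LOCATION"}  # "place of birth"


def tag_after_trigger(tokens):
    """Return (candidate, ne_type) for the token that follows a known trigger, else None."""
    for (first, second), ne_type in TRIGGERS.items():
        for i in range(len(tokens) - 2):
            if tokens[i] == first and tokens[i + 1] == second:
                candidate = tokens[i + 2]
                # Naively drop an attached preposition ب (by/with/in) if present.
                if candidate.startswith("ب") and len(candidate) > 2:
                    candidate = candidate[1:]
                return candidate, ne_type
    return None


print(tag_after_trigger("مسقط رأسه بجدة".split()))
```

Stripping the leading ب unconditionally is a simplification: a name that genuinely begins with that letter would be damaged, so a realistic system would consult a morphological analyzer or a gazetteer before removing it.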

3.4 Agglutination

The agglutinative nature of Arabic results in many different patterns that create many
lexical variations. Each word may consist of one or more preﬁxes, a stem or root,
and one or more sufﬁxes in different combinations, resulting in a very systematic but
complicated morphology. Clitics, which in other languages such as English would be
treated as separate words, agglutinate to words. Arabic has a set of clitics that are

attached to an NE, including conjunctions such as و (Waw, and) and ف (if … then)
and prepositions such as ل (Laam, for/to), ك (k, as), and ب (baa, by/with), or a
combination of both, as in ول (Waw-Laam, and-for). NER relies on the words forming the NE
and the context in which it appears. Both the words and the contexts may appear in
different inﬂected forms. In order to address data sparseness issues without requiring

474

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

:
/
/

d
i
r
e
c
t
.

i
t
.

e
d
u
/
c
o

l
i
/

a
r
t
i
c
e
–
p
d

f
/

4
0
2
4
6
9
1
8
0
3
5
9
1
/
c
o

l
i

_
a
_
0
0
1
7
8
p
d

b
y
g
u
e
s
t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Shaalan

A Survey of Arabic Named Entity Recognition and Classiﬁcation

massive training corpora, these bound morphemes should undergo morphological pre-
processing. One solution is to omit all the afﬁxes and keep only the root morpheme
(Grefenstette, Semmar, and Elkateb-Gara 2005; Farber et al. 2008; Alkharashi 2009).
For example, the analysis of the word وبمصر (and by Egypt, and-by-Egypt) yields مصر
(Egypt) as a location name. Another solution is to perform text segmentation and
insert a delimiter between constituent morphemes, thus preventing loss of contextual
information (Benajiba and Rosso 2007). This information is more convenient for NLP
tasks that need to process these morphemes. As an example that shows an occurrence
of both prefix and suffix morphemes, consider the trigger word وعاصمتها (and
its capital, and-capital-its), which is segmented into three parts (a conjunction, a
nominal mention, and a pronominal mention) separated by a space character: و عاصمة ها
(and capital its).
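
The segmentation strategy can be illustrated with a minimal Python sketch. It is not the method of any cited system: the `PROCLITICS` list, the `segment_proclitics` function, and the toy stem lexicon are assumptions made for this example, and suffix pronouns are deliberately left out.

```python
# Hypothetical proclitic inventory taken from the examples in the text;
# a real system would rely on a full morphological analyzer.
PROCLITICS = ["و", "ف", "ب", "ل", "ك"]


def segment_proclitics(token, lexicon, max_clitics=2):
    """Split leading clitics off `token` when the remainder is a known stem."""
    prefixes = []
    rest = token
    while True:
        if rest in lexicon:
            return prefixes + [rest]
        if len(prefixes) == max_clitics or rest[:1] not in PROCLITICS or len(rest) < 3:
            # No analysis found: keep the token whole rather than guess.
            return [token]
        prefixes.append(rest[0])
        rest = rest[1:]


lexicon = {"مصر"}  # toy stem lexicon containing only "Egypt"
print(" ".join(segment_proclitics("وبمصر", lexicon)))
```

Checking the remainder against a lexicon before committing to a split is one way to avoid stripping a letter that is part of the stem; tokens with no known analysis are returned unchanged.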

3.5 Optional Short Vowels

Arabic text contains diacritics representing most vowels that affect the phonetic repre-
sentation and give different meaning to the same lexical form.4 Nowadays, the modern
version of Arabic is written without diacritics, creating a one-to-many, unvocalized-to-
vocalized, ambiguity (Alkharashi 2009), which gives mutually incompatible morpho-
logical analyses for the same surface form. As such, most Arabic texts that appear
in the media (whether in printed documents or digitized format) are undiacritized.
This is comprehensible for native Arabic speakers, but not for a computational system.
The simplification made by ignoring such diacritics has led to structural and
lexical types of ambiguity because different diacritics represent different meanings.
These ambiguities can only be resolved by contextual information and an adequate
knowledge of the language (Benajiba, Diab, and Rosso 2009a). For instance, قطر may
refer to the country name Qatar (a location NE) if transliterated as qatar, the literal
meaning of country (a trigger word for location NEs), or radius (a trigger word for
measure NEs) if transliterated as qutr, or the literal meaning of distill if transliterated
as qat~ar. Unfortunately, contextual disambiguation might not work if the contextual
information is itself ambiguous due to non-vocalization (Mesfar 2007). To consider another
example, the likely vocalizations of the unvoweled form مؤسسة might lead to trigger words that
denote two different NE types (e.g., مُؤَسَّسَة [a foundation/corporation], internal evidence
of a constituent of an organization name; and مُؤَسِّسَة [a founder], a trigger word for
personal names).
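
The one-to-many ambiguity can be made concrete with a short Python sketch that removes diacritics from the two vocalized forms in the example above. The function name `strip_diacritics` and the choice of the Unicode range U+064B to U+0652 are our assumptions for illustration; the strings are written as escapes so that each mark is explicit.

```python
import re

# Arabic short-vowel and related marks, fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Return `text` with the short-vowel marks removed."""
    return DIACRITICS.sub("", text)


# mu'assasa (foundation): meem, damma, waw-hamza, fatha, seen, shadda, fatha, seen, fatha, ta-marbuta
foundation = "\u0645\u064F\u0624\u064E\u0633\u0651\u064E\u0633\u064E\u0629"
# mu'assisa (founder): same letters, but kasra instead of fatha after the shadda
founder = "\u0645\u064F\u0624\u064E\u0633\u0651\u0650\u0633\u064E\u0629"

print(foundation != founder)
print(strip_diacritics(foundation) == strip_diacritics(founder))
```

Both vocalized forms collapse to the same five-letter surface string once the marks are removed, which is the situation an NER system faces with undiacritized text.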

3.6 Inherent Ambiguity in Named Entities

Arabic, like other languages, faces the problem of ambiguity between two or more
NEs. For example, consider the following text: احمد اباد رحب بالفائزين (Ahmed Abad
welcomed the winners). In this example, احمد اباد (Ahmed Abad) is both a person name
and a location name, thereby giving rise to a conflict situation, where the same NE
is tagged as two different NE types. Heuristic techniques for resolving ambiguities
by cross-recognizing NE types have been suggested. One heuristic technique, proposed by
Shaalan and Raza (2009), uses heuristic rules for preferring one NE type over the other.

4 A diacritic in Arabic is a small mark placed either above or under a letter to indicate what short vowel

will follow that letter. Long vowels are usually indicated by one of three designated letters.

Another technique, proposed by Benajiba, Diab, and Rosso (2008b), favors the NE type
for which the classiﬁer achieves the highest precision.

3.7 Lack of Uniformity in Writing Styles

Arabic has a high level of transcriptional ambiguity: An NE can be transliterated in a
multitude of ways (Shaalan and Raza 2007). This multiplicity arises from both differ-
ences among Arabic writers and ambiguous transcription schemes (Halpern 2009). The
lack of standardization is signiﬁcant and leads to many variants of the same word that
are spelled differently but still correspond to the same word with the same meaning,
creating a many-to-one, variants-to-well-formed, ambiguity. For example, transcribing
(also known as “Arabizing”) an NE such as the city of Washington into an Arabic NE
produces variants such as واشنطن, واشنجطن, واشنغطن, and وشنطن. One reason for this is
that Arabic has more speech sounds than Western European languages, which can
ambiguously or erroneously lead to an NE having more variants. One solution is to
retain all versions of the name variants with a possibility of linking them together.
Another solution is to normalize each occurrence of the variant to a canonical form
(Pouliquen et al. 2005); this requires a mechanism (such as string distance calculation)
for name variant matching between a name variant and its normalized representation
(Refaat and Madkour 2009; Steinberger 2012).
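
As a sketch of the string-distance mechanism mentioned above, the following Python code maps a name variant to its nearest canonical form using Levenshtein distance. The choice of Levenshtein distance, the function names, and the threshold `max_distance=1` are our assumptions; the cited works may use different measures.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize_variant(name, canonical_forms, max_distance=1):
    """Return the closest canonical form, or None if nothing is close enough.

    `canonical_forms` is assumed to be a non-empty list.
    """
    best = min(canonical_forms, key=lambda form: levenshtein(name, form))
    return best if levenshtein(name, best) <= max_distance else None


canonical = ["واشنطن"]  # hypothetical canonical list with one entry (Washington)
for variant in ["وشنطن", "واشنجطن", "واشنغطن"]:
    print(variant, "->", normalize_variant(variant, canonical))
```

A fixed character-level threshold is crude: short names that differ by one letter may be distinct entities, so the threshold would normally be tuned or combined with other evidence.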

3.8 Systematic Spelling Mistakes

Typographic errors are frequently made by Arabic writers with regard to certain characters
(Shaalan et al. 2012). This is due to either a character similarity or inherent disagreement
about the characters, which often leads to orthographical confusion (El Kholy and
Habash 2010; Habash 2010; Al-Jumaily et al. 2012). The former category includes the
character Ta-Marbuta (ة), literally ‘tied Ta’, which is a special morphological marker
typically marking a feminine ending; this is carelessly written interchangeably with
Ha (ه). Ta-Marbuta is a hybrid character merging the form of the characters Ha (ه)
and Ta (ت). The latter category includes the Hamza-Alif letter variants that are often
reductively normalized by brute force replacement with a bare Alif. Some computational
linguists avoid writing the Hamza (especially with stem-initial Alifs), viewing
this as a Hamza restoration problem that is part of the Arabic diacritization problem.
As an example that combines both types of errors, consider الجامعة الإسلامية بجدة (The
Islamic University in Jeddah), which might be written with both typographical variants
as الجامعه الاسلاميه بجده. An edit-distance technique can be used to resolve the spelling
variant problem. It should be noted that not all systematic spelling mistakes can be
handled in this way. For example, consider the difference between بالجامعة (and by/with
the university) and بلاجامعة (without a university). It is difficult to determine whether or
not this mistake is due to the transposition of the two characters ا (Alif) and ل (Lam),
where the prefix ال means the, whereas the prefix لا means no. The latter variation also
shows another orthographic problem: Arabic “run-on” words, or free concatenation
of words, when the word immediately preceding ends with a non-connector letter,
such as ا (Alif), د (Dal), ذ (Dhal), ر (Ra), ز (za), و (waw), and so forth. For example,
the following phrase shows a fully concatenated person NE and its surrounding context:
الدكتورمحمدوزيرالخارجية (Dr-Mohammed-the-Minister-of-Foreign-Affairs). This is
comprehensible by most readers but not by a computational system that needs to work on
segmented words.

3.9 Lack of Resources

Large collections of tagged documents (corpora) as well as gazetteers (predeﬁned lists
of typed NEs) are excellent sources that we can rely upon when implementing and
testing the performance of an Arabic NER system. For these linguistic resources to be
useful, they should include unbiased distribution and representative numbers of NEs
that do not suffer from sparseness. Unfortunately, the available Arabic resources for
NER research often have limited capacity and/or coverage (Abouenour, Bouzoubaa,
and Rosso 2010). Moreover, it is expensive to create or license these important Arabic
NER resources (Huang et al. 2004; Bies, DiPersio, and Maamouri 2012). For these
reasons, researchers often rely on their own corpora, which require human annotation
and veriﬁcation. Few of these corpora have been made freely and publicly available
for research purposes (Benajiba, Rosso, and Benedí Ruiz 2007; Benajiba and Rosso 2007;
Mohit et al. 2012), whereas others are available but under license agreements (Strassel,
Mitchell, and Huang 2003; Mostefa et al. 2009).

4. Named Entity Tag Set

Tagging, also known as labeling, is the task of assigning a contextually appropriate tag
(label) to every NE in the text. The sequence of words that is annotated with the same tag
is considered a single multiword NE. The tag set used to tag NEs may differ according to
user requirements. For example, Nezda et al. (2006) used an extended set of 18 different
NE classes. Mohit et al. (2012) adopted a very flexible scheme that allows
annotators more freedom in deﬁning entity types. In this research, entity types were not
predetermined and category matches between annotators were determined by post hoc
analysis.

In the literature, there are three standard general-purpose tag sets that have been
used to annotate Arabic linguistic resources in the ﬁeld of NER research. These tag sets
may be used as a basis for annotating linguistic resources and system outputs.

The 6th Message Understanding Conference (MUC-6):⁵ This conference can be
considered as the initiator of the NER task. NEs are classified into three main tag
elements: ENAMEX (i.e., person name, location, and organization), NUMEX (i.e., money
and percentage [numerical] expressions), and TIMEX (i.e., time and date expressions).
Each tag element is categorized via the TYPE attribute. Most researchers adopt this tag
set. For example, a NER system producing MUC-style output might tag the sentence
اشترى خالد 300 سهم من شركة أبل في 2012 (Khaled bought 300 shares of Apple Corp. in 2012)
as illustrated in Table 1.
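
As a minimal sketch of how MUC-style inline markup might be consumed downstream, the following Python snippet extracts (element, type, text) triples from a tagged string with a regular expression. The function name `extract_muc_entities` and the English sample sentence are illustrative assumptions, and quoting of the TYPE value is treated as optional.

```python
import re

# Matches <ENAMEX TYPE=...>text</ENAMEX>, and likewise for NUMEX and TIMEX.
MUC_TAG = re.compile(
    r'<(ENAMEX|NUMEX|TIMEX)\s+TYPE="?([A-Z]+)"?\s*>(.*?)</\1>',
    re.DOTALL,
)

def extract_muc_entities(tagged_text):
    """Return a list of (element, type, surface text) triples."""
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in MUC_TAG.finditer(tagged_text)]

# English rendering of a MUC-tagged sentence (illustrative only).
sample = ('<ENAMEX TYPE=PERSON>Khaled</ENAMEX> bought '
          '<NUMEX TYPE=CARDINAL>300</NUMEX> shares of '
          '<ENAMEX TYPE=ORGANIZATION>Apple Corp.</ENAMEX> in '
          '<TIMEX TYPE=DATE>2012</TIMEX>')
entities = extract_muc_entities(sample)
```

Because the pattern uses a back-reference to the opening element name, a span opened as ENAMEX is only closed by a matching ENAMEX end tag.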

The Conference on Computational Natural Language Learning (CoNLL): As
an outcome of CoNLL2002⁶ and CoNLL2003, four categories of NEs were defined:
person name, location, organization, and miscellaneous. CoNLL follows the IOB
format to tag chunks of text representing NEs in a data set (Benajiba, Rosso,
and Benedí Ruiz 2007). The CoNLL annotations are formulated as a word-based
classification problem, where each word in the text is assigned a tag, indicating
whether it is the beginning (B) of a specific NE, inside (I) a specific NE, or outside (O)
any NE. IOB notation is used when NEs are not nested and therefore do not overlap.
For example, a NER system producing CoNLL-style output might tag the sentence
فرانكفورت، أعلن اتحاد صناعة السيارات في ألمانيا (Frankfurt, Auto Industry Association in
Germany said) as illustrated in Table 2.

5 http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html.
6 http://ifarm.nl/signll/conll/.

Table 1
Example of MUC tagging.

اشترى <ENAMEX TYPE=PERSON>خالد</ENAMEX> <NUMEX TYPE=CARDINAL>300</NUMEX> سهم من <ENAMEX TYPE=ORGANIZATION>شركة أبل</ENAMEX> في <TIMEX TYPE=DATE>2012</TIMEX>

Table 2
Example of CoNLL tagging.

Arabic       English Trans.   Tag
فرانكفورت    Frankfurt        B-LOC
أعلن         said             O
اتحاد        Association      B-ORG
صناعة        Industry         I-ORG
السيارات     Auto             I-ORG
في           in               O
ألمانيا      Germany          B-LOC

Table 3
Example of ACE tagging.

زار الملك <PER>حسين</PER> <GPE>لبنان</GPE> في العام الماضي

BILOU (Ratinov and Roth 2009) was also suggested as an efﬁcient alternative to the
BIO format. It is used to identify the beginning, the inside, and the last tokens of multi-
token chunks as well as unit-length chunks. Experimental results indicate that BILOU
representation of text chunks signiﬁcantly outperforms the BIO format.
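
To make the two chunk encodings concrete, here is a small Python sketch, offered as an illustration rather than as any cited system's implementation. It first encodes entity spans as IOB tags and then converts a well-formed IOB (BIO) sequence to BILOU. The function names and the word-by-word English glosses used as tokens are assumptions made for the example.

```python
def spans_to_iob(tokens, spans):
    """Encode non-overlapping (start, end, type) spans, end exclusive, as IOB tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

def bio_to_bilou(tags):
    """Convert a well-formed BIO tag sequence to BILOU."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append("O")
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + etype      # the chunk goes on after this token
        if prefix == "B":
            bilou.append(("B-" if continues else "U-") + etype)
        else:                                # prefix == "I"
            bilou.append(("I-" if continues else "L-") + etype)
    return bilou

tokens = ["Frankfurt", "said", "Association", "Industry", "Auto", "in", "Germany"]
spans = [(0, 1, "LOC"), (2, 5, "ORG"), (6, 7, "LOC")]
iob_tags = spans_to_iob(tokens, spans)
bilou_tags = bio_to_bilou(iob_tags)
```

In the BILOU output, single-token entities receive a U- (unit) tag and the final token of a multi-token entity receives an L- (last) tag, which is the extra boundary information that BILOU adds over BIO.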

The Automatic Content Extraction (ACE) program: Arabic resources for Information
Extraction have been developed as part of the ACE program. According
to the ACE 2003 tag elements,⁷ four categories are defined: person name, facility,
organization, and geographical and political entities (GPE). Later, in ACE 2004
and 2005, two categories were added to this tag set: vehicles and weapons.
For example, a NER system producing ACE-style output might tag the sentence
زار الملك حسين لبنان في العام الماضي (King Hussein visited Lebanon last year) (Habash 2010)
as illustrated in Table 3.

7 The ACE tag sets for English, Arabic, and Chinese are available at
http://projects.ldc.upenn.edu/ace/data/.

5. Arabic Linguistic Resources

The lack of digital linguistic resources creates a formidable obstacle when it comes to
Arabic NLP in general and Arabic NER in particular. Investing in these resources is
justiﬁed because it would lead to many beneﬁts such as reusability, broad coverage, and
frequency and distributional information, as well as a way of evaluating and comparing
systems. Corpora and lexical resources are two main types of linguistic resources that
are commonly used in NER.

5.1 Corpora

The corpus needed for NER is a sufﬁciently large annotated corpus where every
NE has a type assigned to it. An important characteristic of a reliable corpus is
that it should be well balanced in terms of the NE type distribution. A corpus can
be genre independent/specific; domain independent/specific; and consist of texts
in one natural language (a monolingual corpus), two natural languages (a bilin-
gual, parallel, or comparable corpus), or more natural languages (a multilingual or
crosslingual corpus). In Hassan, Fahmy, and Hassan (2007), a general framework
is proposed for extracting NE translation pairs from both comparable and paral-
lel corpora. Parallel corpora that are aligned on the sentence level have been used
to tag one corpus based on the tagged information in the other corpus such that
they can complement and improve each other (Benajiba et al. 2010; Burkett et al.
2010; Ma 2010). For example, Samy, Moreno, and Guirao’s (2005) approach creates
an NE aligned bilingual corpus that relies on the basic assumption that, given a
pair of sentences where each one is the translation of the other, and given that in
one sentence one or more NEs were detected, then the corresponding aligned sentence
should contain the same NEs, either translated or transliterated. As described,
the approach is effective because it pairs Arabic, whose script has no letter case,
with Spanish, which does mark the difference between names and non-names
orthographically.

Experimental results of NLP research are more easily compared with each other
when they rely on publicly available data sets or corpora. The frequent use of these
corpora in the research community makes them standard data sets or corpora, serving as
stable benchmark data for measuring ongoing progress and ranking systems according
to their annotation capability. Some NER corpora are available to members of
organizations under paid license agreements, for example, ACE⁸ (Strassel, Mitchell, and Huang
2003). Because they are not free, it is difficult for small research groups to access them.
However, contributors from the Arabic NLP research community are striving to develop
freely available Arabic NER corpora to alleviate this problem and help other researchers
to exploit these resources, for example, ANERcorp⁹ (Benajiba, Rosso, and Benedí Ruiz
2007; Benajiba and Rosso 2007). Nonetheless, these efforts are still limited and focused
around a small set of domains (Mohit et al. 2012). Therefore, in most cases where
researchers want to conduct further investigation by studying the impact of different
parameters and new features of NER, they have found it indispensable to build their own
corpora. In the literature, common and recent examples of Arabic corpora that have
been used for Arabic NLP in general, and for Arabic NER and classification topics in
particular, are:

ACE 2003 corpus: This includes Broadcast News (BN) and Newswire (NW)
genres. The total size is 55.29 KB and the number of NEs is 5,505.

ACE 2004 corpus: This includes BN and NW from Arabic Tree Bank (ATB)
genres. The total size is 154.12 KB and the number of NEs is 11,520.

ACE 2005 corpus: This includes BN, NW, and Weblogs (WL) genres. The
total size is 104.65 KB and the number of NEs is 10,218.

ANERcorp: This includes NW genre. The corpus size is 174.76 KB and the
number of NEs is 12,989.

8 ACE corpora are available under license agreement from LDC (http://www.ldc.upenn.edu).
A significant number of the data sets developed by LDC are Arabic language resources, making
LDC the leading source for such materials (Bies, DiPersio, and Maamouri 2012).
9 Available for free at http://www1.ccls.columbia.edu/~ybenajiba/downloads.html.

5.2 Lexical Resources

Another primary linguistic resource is the gazetteer, which is a collection of predeﬁned
lists of typed entities; a gazetteer is also known as a dictionary or whitelist (Shaalan and
Raza 2008). Gazetteers include names that have been identiﬁed beforehand and have
been classiﬁed into NE types. When the acquisition of a gazetteer is fully automated,
the number of NEs increases with the growth of the input linguistic resource or text
used to create it. The contents of a gazetteer should be consistent and belong to only
one type of NE. For example, a location gazetteer consists of names of continents,
countries, cities, states, political regions, towns, and villages, and so on (Shaalan and
Raza 2009). A gazetteer might include full or partial NEs; for example, a person NE
could have separate gazetteers for ﬁrst names (possibly distinguishing male names and
female names), middle names, surnames, full forms, and even nicknames (Shaalan and
Raza 2007; Higgins, McGrath, and Moretto 2010). A gazetteer entry provides internal
evidence to fully or partially match a candidate NE in the input. Whenever a prede-
ﬁned NE that appears in the relevant gazetteer is detected in the input text, the NER
system should recognize it directly as an NE of this type. Very large gazetteers are
publicly available from the CJK Dictionary Institute¹⁰ under license agreement in the
form of Arabic person, organization, company, and location name databases. However,
researchers who ﬁnd these resources difﬁcult to acquire build their own gazetteers from
different resources such as the Web and from organizations (Benajiba and Rosso 2008;
Shaalan and Raza 2009).
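
The direct-match behavior of a gazetteer described above can be sketched as a greedy longest-match lookup over token sequences. This is a minimal illustration under stated assumptions: the function name, the `max_len` parameter, and the toy entries are hypothetical and are not drawn from any published gazetteer.

```python
def gazetteer_lookup(tokens, gazetteers, max_len=4):
    """Greedy longest-match lookup of token sequences in typed gazetteers.

    gazetteers maps an NE type to a set of entries, each entry being a
    tuple of tokens. Returns (start, end, type) spans with end exclusive.
    """
    spans, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so full names win over partial ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            for etype, entries in gazetteers.items():
                if candidate in entries:
                    match = (i, i + n, etype)
                    break
            if match:
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

# Toy gazetteers (hypothetical entries).
gazetteers = {
    "LOC": {("لبنان",), ("ألمانيا",)},
    "ORG": {("اتحاد", "صناعة", "السيارات")},
}
tokens = ["أعلن", "اتحاد", "صناعة", "السيارات", "في", "ألمانيا"]
found = gazetteer_lookup(tokens, gazetteers)
```

Keeping one set of entries per NE type mirrors the requirement that each gazetteer be consistent and belong to a single type.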

Some systems use a blacklist (Shaalan and Raza 2009), which supplies negative
evidence: a filtering mechanism uses it to reject incorrect matches. To see how
this works, consider the following example: وزير الخارجية العراقي الأمين العام (The Iraqi
Foreign Minister the Secretary-General). The contextual information وزير الخارجية العراقي
(The Iraqi Foreign Minister) indicates that the following words are a person name.
However, in this example, the following words, الأمين العام (the Secretary-General), do not
constitute a valid person name; rather, they form an appositive, which should be filtered
out from the results.
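
The filtering step can be illustrated with a short Python sketch. It proposes the tokens that follow a trigger phrase as a person name unless they appear in a blacklist. The function name, the fixed `name_len` of two tokens, and the second (accepted) example name are assumptions made purely for illustration.

```python
def person_after_trigger(tokens, triggers, blacklist, name_len=2):
    """Propose the tokens after a trigger phrase as a person name,
    unless the blacklist provides negative evidence against them."""
    for trigger in triggers:
        n = len(trigger)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == trigger:
                candidate = tuple(tokens[i + n:i + n + name_len])
                if candidate and candidate not in blacklist:
                    return candidate
    return None

triggers = [("وزير", "الخارجية", "العراقي")]   # "The Iraqi Foreign Minister"
blacklist = {("الأمين", "العام")}               # "the Secretary-General"

rejected = person_after_trigger(
    ["وزير", "الخارجية", "العراقي", "الأمين", "العام"], triggers, blacklist)
accepted = person_after_trigger(
    ["وزير", "الخارجية", "العراقي", "محمد", "علي"], triggers, blacklist)
```

Here the appositive is filtered out (`rejected` is `None`), while the hypothetical name after the same trigger is kept.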

Lexical triggers are also considered one of the important linguistic resources
(Shaalan and Raza 2007). There are two kinds of lexical triggers that provide either
internal or contextual evidence. The internal evidence lies within the NE itself, for
example, شركة (company) is internal evidence of an organization NE. Contextual
evidence is provided by the clues around the entities. They might be deduced from
analysis of the most frequent left- and right-hand-side contexts. For example, the phrase
دكتور محمد مرسي الرئيس المصري المنتخب حديثا (Dr Mohammed Morsi the newly elected
Egyptian president) includes the preceding lexical trigger دكتور (Dr) and the following lexical
triggers رئيس (president) and مصري (Egyptian) for the person NE محمد مرسي (Mohammed
Morsi). Generally, lexical triggers provide clues that would indicate the presence or
absence of NEs.

10 See Arabic lexical resources at http://www.cjk.org/cjk/arabic/arabsam.htm.
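
Contextual evidence of this kind can be gathered with a simple window scan around a candidate span, as in the following sketch. The helper `strip_article` is a deliberately crude normalization (it only drops a leading definite article), and the trigger sets and window size are assumptions for the example rather than resources from any cited system.

```python
def strip_article(token):
    """Crude normalization (an assumption): drop a leading definite article."""
    return token[2:] if token.startswith("ال") and len(token) > 3 else token

def trigger_evidence(tokens, start, end, left_triggers, right_triggers, window=2):
    """Collect triggers found within `window` tokens on each side of the
    candidate span tokens[start:end] (end exclusive)."""
    left = [strip_article(t) for t in tokens[max(0, start - window):start]]
    right = [strip_article(t) for t in tokens[end:end + window]]
    return {
        "left": [t for t in left if t in left_triggers],
        "right": [t for t in right if t in right_triggers],
    }

tokens = ["دكتور", "محمد", "مرسي", "الرئيس", "المصري", "المنتخب", "حديثا"]
evidence = trigger_evidence(
    tokens, 1, 3,                      # candidate span: the two name tokens
    left_triggers={"دكتور"},            # "Dr"
    right_triggers={"رئيس", "مصري"},    # "president", "Egyptian"
)
```

A real system would rely on proper morphological segmentation instead of the prefix stripping used here.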

As far as the morphological properties are concerned, additional Arabic resources
are needed to furnish information to NER systems, including lemmas, dictionaries, afﬁx
compatibility tables, and English glosses. For example, the English gloss, which is de-
rived as a companion to some Arabic morphological analyzers, is used to check whether
it starts with a capital letter, a key clue for an English NER. Its presence functions as
a hint that suggests the presence of an Arabic NE. Benajiba, Rosso, and Benedí Ruiz
(2007), among others, have used POS tags to improve NE boundary detection. Morpho-
logical information can be obtained from deep Arabic morphological analysis (Farber
et al. 2008). However, leading and trailing character n-grams in surface word forms can
also be used to handle afﬁx attachment without the need for morphological analysis
(Abdul-Hamid and Darwish 2010).

6. NER Approaches

A number of Arabic NER systems have been developed using primarily two
approaches: the rule-based (linguistic-based) approach, notably the NERA system
(Shaalan and Raza 2009); and the ML-based approach, notably ANERsys 2.0 (Benajiba,
Rosso, and Benedí Ruiz 2007). Rule-based NER systems rely on handcrafted local
grammatical rules written by linguists. Grammar rules make use of gazetteers and
lexical triggers in the context in which the NEs appear. The main advantage of the
rule-based NER systems is that they are based on a core of solid linguistic knowledge
(Shaalan 2010). However, any maintenance or updates required for these systems are
labor-intensive and time-consuming; the problem is compounded if the linguists with
the required knowledge and background are not available. On the other hand, ML-based
NER systems utilize learning algorithms that require large tagged data sets for
training and testing (Hewavitharana and Vogel 2011). ML algorithms involve a selected
set of features extracted from data sets annotated with NEs in order to generate
statistical models for NE prediction. An advantage of the ML-based NER systems is that they
are adaptable and updatable with minimal time and effort as long as sufficiently large
data sets are available. Moreover, if we deal with an unrestricted domain, it is better to
choose the ML approach, as it would be expensive both in terms of cost and time to
acquire and/or derive rules and gazetteers. Recently, a hybrid Arabic NER approach
that combines ML and rule-based approaches has resulted in significant improvement
by exploiting the rule-based decisions of NEs as features used by the ML classifier
(Abdallah, Shaalan, and Shoaib 2012; Oudah and Shaalan 2012). For a comprehensive
survey of NER approaches more generally, see Nadeau and Sekine (2007).

Arabic morphology is relatively complex, so morphological information is
needed in these approaches for identifying NEs. For example, consider the phrase
أعلنت وزارة الداخلية المصرية (The Ministry of Egyptian Interior announced, announced
the-ministry the-interior the-Egyptian). In this case, the rule or pattern that allows
the recognizer to identify وزارة الداخلية المصرية (The Ministry of Egyptian Interior) as an
organization name stipulates that if a verb trigger is directly followed by a noun
(internal evidence of an NE constituent), which in turn is followed
by one or two specific adjectives, then the sequence of these two or three words should
be tagged as an organization entity. For more precise identification of NEs, sometimes
the adjective forms of nationality are also used in the recognition process (e.g., المصرية,
the-Egyptian.fem from Egypt). Known organization NEs that are kept in the organization
gazetteer can be used to improve the performance of the NER system. As such, the
system is able to recognize وزارة الخارجية المصرية (The Ministry of Egyptian Foreign Affairs)
in the short conjunction of organization NEs وزارتي الداخلية والخارجية المصرية (Egyptian
Ministries of Interior and Foreign Affairs, Ministries.dual the-interior and the-Foreign-
Affairs Egyptian) by using the gazetteer entry for وزارة الداخلية المصرية (The Ministry of
Egyptian Interior).
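
A rule of this shape can be sketched in a few lines of Python. The sketch below is an illustration only, not the grammar of any cited system: the three word lists are hypothetical stand-ins for a verb trigger list, an organization-noun list, and an adjective list, and the words that the text calls adjectives are simply looked up in a set.

```python
VERB_TRIGGERS = {"أعلنت"}                               # "announced"
ORG_NOUNS = {"وزارة"}                                   # "ministry": internal evidence
ORG_ADJECTIVES = {"الداخلية", "الخارجية", "المصرية"}    # interior, foreign, Egyptian

def match_org_after_verb(tokens):
    """Return (start, end) spans, end exclusive, for sequences matching:
    verb trigger + organization noun + one or two listed adjectives."""
    spans = []
    for i in range(len(tokens) - 2):
        if tokens[i] in VERB_TRIGGERS and tokens[i + 1] in ORG_NOUNS:
            end = i + 2
            # Consume at most two adjectives after the noun.
            while end < len(tokens) and end < i + 4 and tokens[end] in ORG_ADJECTIVES:
                end += 1
            if end > i + 2:            # at least one adjective was found
                spans.append((i + 1, end))
    return spans

tokens = ["أعلنت", "وزارة", "الداخلية", "المصرية"]
org_spans = match_org_after_verb(tokens)
```

The verb trigger itself is excluded from the returned span, so only the noun and its adjectives are tagged as the organization entity.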

7. Feature Space of Arabic Named Entity Recognition

Features in NER are properties or characteristic attributes of words designed for con-
sumption by a computational system. This process begins by transforming the set of
words (tokens) to be categorized into a set of feature vectors that belong to a feature
space, which is fed to the text classiﬁer as input. The feature vector representation is
an abstraction over the text, which usually characterizes each word by one or more
Boolean or binary values (such as whether a word is capitalized), numerical values
(word length), and nominal values (English gloss). The source of these values might
be their appearance as surface features, a pre-processing step, surrounding items, or
the characters that the word is composed of, or a combination of several features, or
external knowledge (Oudah and Shaalan 2013).

In this section, we present the features most often used for the recognition and
classification of Arabic NEs. We organize¹¹ them along the following different axes:
word-level features, list lookup features, contextual features, and language-speciﬁc
features. In the ML approach, the selection of the features to be taken into account by a
classiﬁer is a very critical issue and can signiﬁcantly affect the performance of a system.
Section 7.5 is dedicated to discussing the feature selection step.

7.1 Word-Level Features

Word-level features are related to the individual orthographic nature and structure of
each word. Table 4 lists subcategories of these features. They speciﬁcally describe special
markers and special characters, word length, corresponding English word case, and
afﬁx segments. Special markers are used to indicate an abbreviation (e.g., acronym or
contraction) that might include internal periods, a hyphen, an ampersand, and so on.
Word length is sometimes used to indicate the minimum length required in order for
the word to be considered as an NE type. This feature capitalizes on the fact that short
words are unlikely to be NEs.

Capitalization is a key feature of an English NER. Arabic is at a disadvantage in
this regard because the script does not orthographically mark proper names in this
way. However, many researchers (e.g., Benajiba, Diab, and Rosso 2008a; Mohit et al.
2012; Farber et al. 2008) have been able to derive the assumed capitalization from the
lexical correspondences between Arabic and English, based on the underlying bilingual
lexicon of BAMA (Buckwalter 2002) that MADA exploits (Habash and Rambow 2005).
The capitalization feature has been designed with this in mind. The insight is that if the
translation begins with a capital letter then it is most probably an NE.

11 In the literature, other ways used to classify features are linguistic-dependent versus
independent features and contextual versus internal features.

Table 4
Word-level features.

Feature          Description
Special markers  A binary feature indicating the presence of punctuation marks and special
                 characters in a word.
Word length      A binary feature indicating whether the length of the word is greater than a
                 predefined threshold.
Capitalization   A binary feature indicating the existence of capitalization information on
                 the gloss corresponding to the Arabic word.
Lexical          The surface features of a character n-gram up to a range of characters from
                 1 to n that indicate prefix and suffix attachment.

One of the major problems of the Arabic language is the large number of preﬁxes
and sufﬁxes that are attached to an inﬂected word. Lexical features are extracted via
pattern matching rather than linguistic processing. Hence, in the literature they are
considered language-independent features that capture the word preﬁx and sufﬁx char-
acter sequences of length up to n. The sequences are matched from the leftmost (preﬁx)
and rightmost (sufﬁx) positions of the words. In Benajiba, Diab, and Rosso (2008b) and
Abdul-Hamid and Darwish (2010), lexical features are represented by character n-grams
of leading and trailing characters in a word, which can frequently be used to identify
Arabic NEs without the need for linguistic analysis.
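
The four word-level feature groups can be sketched as a single feature-extraction function. This is a minimal illustration: the function name, the default threshold `min_len`, and the n-gram limit `n` are assumptions, and the English gloss is assumed to be supplied by an external morphological analyzer. Note that `string.punctuation` covers ASCII punctuation only.

```python
import string

def word_level_features(word, gloss=None, min_len=3, n=3):
    """Build word-level features for one token.

    `gloss` is an English gloss assumed to come from a morphological
    analyzer; `min_len` and `n` are illustrative defaults.
    """
    features = {
        "has_special_marker": any(ch in string.punctuation for ch in word),
        "is_long_enough": len(word) > min_len,
        "gloss_capitalized": bool(gloss) and gloss[0].isupper(),
    }
    # Leading and trailing character n-grams of length 1..n.
    for k in range(1, min(n, len(word)) + 1):
        features[f"prefix_{k}"] = word[:k]
        features[f"suffix_{k}"] = word[-k:]
    return features

feats = word_level_features("المصرية", gloss="Egyptian")
```

The prefix and suffix n-grams are taken directly from the surface string, which is why such lexical features need no linguistic analysis.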

7.2 List Lookup Features

These features are used to classify the identity of the target word with respect to its
membership in various lists, called word-identity features by Farber et al. (2008). In
Table 5, we present four important categories of lists used in the literature as binary
discriminative features indicating whether a word is a member of any of these lists.
Gazetteer list inclusion is a direct way to express a typical NE.

The Lexical Trigger list provides a way to identify entity cues or predictive
words, such as the relation between a person and a title (e.g.,
أستاذ اللسانيات الحاسوبية عماد زيتوني, Professor of Computational Linguistics Imed Zitouni),
whereas the Blacklist (e.g., أستاذ اللسانيات الحاسوبية رئيس المؤتمر, Professor of
Computational Linguistics chairman of the conference) counterindicates the presence of
an NE as a means of resolving the ambiguity of words in the ambiguous position.

Table 5
List lookup features.

Feature          Description
Gazetteer        A binary feature indicating the existence of the word in an individual gazetteer.
Lexical Trigger  A binary feature indicating the existence of the word in the individual lexical
                 trigger list.
Blacklist        A binary feature indicating the non-existence of the word in an individual
                 blacklist.
Nationality      A binary feature indicating the existence of the word in the nationality list.

Many authors have proposed a way to recognize nationality by identifying relevant
word forms that are frequently used in NEs and their context, e.g., الجامعة الأردنية
(The Jordanian University) and الملكة الأردنية رانيا (the Jordanian queen Rania), respectively.
Nationality word forms can be stemmed to a country name using a country gazetteer
and well-known affixes in the rule-based approach (Shaalan and Raza 2008), for example,
الجامعة الأردن[ية] (Jordan[ian] University); or they may be searched using a separate
closed list in the ML approach (Benajiba, Diab, and Rosso 2008b), for example, Jordanian
in this list might be expressed by the forms أردني, أردنية, الأردني, or الأردنية.
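
The rule-based option, stemming a nationality form back to a country name, can be sketched as suffix stripping followed by a gazetteer lookup. The sketch rests on simplifying assumptions: the two-entry country gazetteer and the suffix list are hypothetical, and the definite article is handled by simply adding or removing it.

```python
COUNTRIES = {"الأردن": "Jordan", "مصر": "Egypt"}   # toy country gazetteer
NISBA_SUFFIXES = ("ية", "ي")                        # feminine and masculine endings

def nationality_to_country(word):
    """Map a nationality adjective to a country by stripping a suffix and,
    where needed, adding or removing the definite article."""
    for suffix in NISBA_SUFFIXES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            without_article = stem[2:] if stem.startswith("ال") else stem
            for candidate in (stem, "ال" + stem, without_article):
                if candidate in COUNTRIES:
                    return COUNTRIES[candidate]
    return None

forms = ["أردني", "أردنية", "الأردني", "الأردنية"]
countries = [nationality_to_country(w) for w in forms]
```

All four surface forms resolve to the same gazetteer entry, which is the effect that the closed-list alternative achieves by enumerating the forms explicitly.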

7.3 Contextual Features

Contextual features are local features deﬁned over the targeted word and include the
type of words that occur with the NEs, namely, left and right neighbors of the can-
didate word which carry effective information for the identiﬁcation of NEs. Table 6
lists subcategories of these features. Usually, they are deﬁned in terms of a sliding
window of tokens/words. For example, if the size of the sliding window is 5, the
decision on the targeted word is made based on its features as well as the features of
its two immediate left and right neighbors (i.e., +/- 2 words; Abdallah, Shaalan, and
Shoaib 2012). Different window sizes have been used with contextual features. For
example, in Benajiba, Diab, and Rosso (2008b) the window size was +/- 1, whereas in
Benajiba et al. (2010) it was +/- 1 to 3. The sliding step over the text, which refers to the
interval between two adjacent sliding windows, should also be deﬁned: usually it is 1.
In the literature, contextual features speciﬁcally describe word n-gram and rule-based
features.

Word n-gram contextual features can be derived from the context of a document in
order to extract the relationships between previously identiﬁed NEs and an encountered
word within the input document (Benajiba, Diab, and Rosso 2008b). They are used to
investigate the space of the surrounding context for the NEs by taking into account
the features of a window of words surrounding a candidate word in the recognition
process.

Rule-based features are contextual features that are derived from rule-based
decisions. Abdallah, Shaalan, and Shoaib (2012) suggested that these features have a
critical impact on the performance of pure ML-based NER components in particular,
and of the proposed hybrid systems combining rule-based with ML-based components in
general. In their system, an n-word sliding window is used for each word in the corpus.
Table 7 provides sample instances of these features for a window of size 5.
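
A sliding window over rule-based decisions can be sketched as follows. The function name, the feature keys, and the padding label are assumptions, and the label sequence is a hypothetical rule-based output chosen for illustration, so the resulting rows are not claimed to reproduce the cells of Table 7.

```python
def window_features(labels, i, size=5, pad="OTHER"):
    """Return the labels of the neighbours of position i inside a window of
    `size` tokens centred on i (the target's own label is left out)."""
    half = size // 2
    feats = {}
    for offset in range(-half, half + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sentence are padded with a default label.
        feats[f"W{offset:+d}"] = labels[j] if 0 <= j < len(labels) else pad
    return feats

# Hypothetical rule-based decisions for "President Russian Vladimir Putin".
rule_labels = ["OTHER", "OTHER", "Person", "Person"]
rows = [window_features(rule_labels, i) for i in range(len(rule_labels))]
```

Each row can then be appended to the other features of the target word before it is passed to the ML classifier, which is how a hybrid system exposes rule-based decisions to the learner.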

Table 6
Contextual features.

Feature      Description
Word n-gram  The features of a sliding window comprising a word n-gram that includes the
             candidate word, along with preceding and succeeding words.
Rule-based   The features of a sliding window derived from rule-based NER decisions.

Table 7
Sample rule-based features for 5-word window.

Targeted Word

English Tran. Wi-2

Wi-1

Wi+1

Wi+2

الرئيس
الروسي
فلاديمير
بوتين

President

Russian

Vladimir

Putin

OTHER OTHER OTHER OTHER

Person

OTHER OTHER OTHER

Person

OTHER OTHER

Person

OTHER

Person

OTHER OTHER

7.4 Language-Speciﬁc Features

These features are related to certain aspects of the Arabic language. Table 8 lists sub-
categories of language-speciﬁc features. They speciﬁcally describe part-of-speech (POS),
morphological features, and base-phrase chunks (BPC).

Arabic words generally carry rich morphological information (Marton, Habash,
and Rambow 2010), some of which includes noun–adjective agreement and special
markings indicating nominals in compounds. The MADA toolkit has been found to
be very useful in generating a number of informative language-speciﬁc features for
each input word (Habash, Rambow, and Roth 2009). One of these features is the POS
morpho-syntactic tag, which plays a significant role in Arabic NLP. The words of an
Arabic NE are usually tagged as either noun (NN) or proper noun (NNP). In Benajiba and
Rosso (2007), very good results were obtained using the POS tagging feature, which
was exploited to improve NE boundary detection. The shared task of CoNLL now
includes a POS column in its corpora. Thus, the POS tag is a good distinguishing
feature for Arabic NEs; it has been studied separately in the literature to determine
its impact on NER. As an example, Farber et al. (2008) demonstrated a signiﬁcant
improvement in Arabic NER using a POS feature. In order to make use of the varying
importance of different morphological features, a careful choice of relevant features
and their associated value representations has to be taken into consideration when
studying Arabic NER. Benajiba, Diab, and Rosso (2008b) report on the impact of
morphological features that affect NEs, such as aspect, person, deﬁniteness, gender,
and number.

The structure of an Arabic sentence allows different arrangements of NEs: NEs
may appear anywhere in the sentence and at different distances from lexical trig-
gers. Elsebai, Meziane, and Belkredim (2009) and Elsebai and Meziane (2011) point
out that these arrangements might complicate the structure of the induced heuristic
rules of their rule-based NER system. This observation has led to using the BPC
feature as an indicator of embedded NEs (Benajiba and Rosso 2008). BPC features
are related to the type of words that occur with NEs and their syntactic relations
(Benajiba, Diab, and Rosso 2008b). They are usually identiﬁed by shallow syntactic
parsing. The Amira toolkit has been found to be very useful in generating BPC features
(Diab 2009).

7.5 Feature Selection

It is useful to think of the ML-based NER as consisting of four major steps: 1) feature
selection; 2) algorithm selection or the decision of which ML algorithm(s) to use for
training and classification; 3) training, the actual learning of distinguishing patterns
using the selected feature list; and 4) classification, applying these patterns to the input
text to detect and classify the NEs.

Table 8
Language-specific features.

Feature        Description
POS            The label identifying the part-of-speech category of a word.
Morphological  A set of morphological information (excluding POS).
BPC            Phrase-level labels identifying syntactic chunks such as noun phrases (NPs)
               and verb phrases (VPs) within a text.

The success of a learning algorithm is crucially dependent on the features it uses. A
supervised learning algorithm uses an annotated corpus. The training set derived from
an annotated corpus represents the NEs in terms of feature values.

Feature selection refers to the task of identifying a useful subset of features chosen
to represent elements of a larger set (i.e., the feature space). The selection of the subset to
be utilized by a classiﬁer is a very critical issue and when optimized it can enhance the
performance of a system dramatically (Nadeau and Sekine 2007). The main purpose of
this step is to try to ﬁnd a strong correlation between an NE and one or more combined
features in order to explore generalizations over the set of selected features. Iterative
experiments are conducted to gain a better understanding of different combinations
of the selected features and their impact on the NER task. In a typical learning envi-
ronment, reporting experiments with all the different combinations of features would
adversely affect the readability of the achieved results (Abdul-Hamid and Darwish
2010). Published work therefore tends to report only those experiments whose enabled
feature combinations yield significant (or the best) results on the evaluation data
sets.

Under each type of feature, there is a set of characteristics that need to be considered
and the methods used to extract them may differ in their degree of accuracy. If all feature
values and their combinations are selected the feature space becomes high-dimensional.
Not all features are equally important for the recognition task. Thus, even the set of
selected features needs to be evaluated in order to ﬁnd the optimal feature set for an
NER system. There are different ways to carry out feature selection.

The most widely used method is to select features manually by a process of en-
abling features one by one to determine their effects. Another method is to initially
decide on the feature set by testing features in isolation at the beginning, and in-
crementally combining them in different sets until a set containing all the features
is reached and tested. Benajiba, Diab, and Rosso (2008a) and Benajiba, Diab, and
Rosso (2008b) used an incremental approach: the features are first ranked in decreasing
order of their individual impact (using the F-measure obtained for each NE), and the
top n features are then selected, keeping only the set that yields the best results at
each iteration.
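
The incremental procedure described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not the implementation used in the cited
studies: the evaluate callback (which would train a classifier with the given features and
return an F-measure on held-out data), the feature names, and the toy scores are all
invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_selection(
    features: Sequence[str],
    evaluate: Callable[[Tuple[str, ...]], float],
) -> Tuple[List[str], float]:
    """Rank features by individual score, then grow the set from the top down.

    `evaluate` is a hypothetical callback: given a tuple of feature names it
    returns the F-measure obtained when training and testing with them.
    """
    # Step 1: score each feature in isolation and rank in decreasing order.
    ranked = sorted(features, key=lambda f: evaluate((f,)), reverse=True)

    # Step 2: try the top-1, top-2, ..., top-n sets and keep the best one.
    best_set: List[str] = []
    best_score = float("-inf")
    for n in range(1, len(ranked) + 1):
        candidate = ranked[:n]
        score = evaluate(tuple(candidate))
        if score > best_score:
            best_set, best_score = list(candidate), score
    return best_set, best_score


# Toy stand-in for a real train-and-test cycle (invented numbers).
_GAIN = {"POS": 0.20, "gazetteer": 0.15, "BPC": 0.05, "noise": -0.05}


def toy_evaluate(feature_set: Tuple[str, ...]) -> float:
    return 0.40 + sum(_GAIN[f] for f in feature_set)


selected, f_measure = incremental_selection(list(_GAIN), toy_evaluate)
print(selected, round(f_measure, 2))
```

In this toy setting the feature that lowers the score is left out of the selected set. A
real experiment would replace toy_evaluate with an actual training and evaluation run, which
is why the number of calls to it matters in practice.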

8. Tools for Developing Arabic NER Systems

A good number of tools are available for developing and evaluating Arabic NER systems,
allowing for easy replicability of experiments. The following is a non-exhaustive
list of NER tools that have been used in the Arabic NER literature. The tools can be
classified into three categories according to their functions: Integrated Development
Environment tools, ML tools, and Arabic NLP tools.

8.1 Integrated Development Environments

GATE12 (The General Architecture for Text Engineering): This is one of the most
popular freely available software tools dealing with NLP. GATE is a suite of Java
tools that provides an infrastructure for developing and deploying software com-
ponents that process human language (Maynard et al. 2000; Cunningham 2002;
Cunningham et al. 2011). The motivating factors behind the development of GATE
include reusability of components, task-based evaluation, comparative evaluation,
collaborative research, robustness, efﬁciency, and portability; the tools support nine
languages (English, French, German, Italian, Chinese, Arabic, Romanian, Hindi, and
Cebuano). GATE provides a set of essential tools for NLP system development,
including tokenizers, gazetteers, POS taggers, chunkers, and parsers. It facilitates the
development of rule-based NER systems by providing the user with the capability of
implementing grammatical rules as a ﬁnite state transducer using JAPE. It also has an
Arabic plug-in that contains a tokenizer, gazetteers, an OrthoMatcher component, and
a grammar, all of which are used within a simple Arabic rule-based NER application
built as a part of GATE. GATE can be used to extract basic entities, such as date,
name, location, organization, and so on. A number of scholars have used the GATE
environment in their research studies on Arabic NER, including Maynard et al. (2002),
Elsebai, Meziane, and Belkredim (2009), Elsebai and Meziane (2011), and Abdallah,
Shaalan, and Shoaib (2012).

NooJ:13 This is a freely available linguistic development environment for many
languages. NooJ allows the developer to construct, test, and maintain large coverage
lexical resources, as well as apply morpho-syntactic tools for Arabic processing. It
can recognize all Unicode encodings, which is a very important feature for process-
ing Arabic Script languages. NooJ can recognize rules written in ﬁnite-state form or
context-free grammar form, facilitating the development of rule-based NER systems.
NooJ provides a disambiguation technique based on grammars to resolve duplicate
annotations. Arabic is one of the languages that are supported by NooJ; there are free
Arabic resources for use within the NooJ environment on the NooJ ofﬁcial Web site.
Mesfar (2007) has used NooJ in his Arabic NER research.

LingPipe:14 This is a toolkit for text engineering and processing. The free version has
limited production capabilities, and one must upgrade in order to obtain full production
abilities. The toolkit is language-, domain-, and genre-independent. It supports
the development of different language processing tasks such as POS tagging,
spelling correction, NE recognition, and word sense disambiguation. The NER component
is based on hidden Markov models, and the learned model can be evaluated
using k-fold cross validation over annotated data sets. LingPipe recognizes corpora
annotated using the IOB scheme. The LingPipe NER system has been applied to
ANERcorp to demonstrate how to generate a statistical NER model for Arabic; the
details and results are presented on the toolkit’s official Web site. AbdelRahman et al.
(2010) used ANERcorp to compare their proposed Arabic NER system with LingPipe’s
built-in NER.

12 GATE is available at http://gate.ac.uk/.
13 NooJ is available at http://www.nooj4nlp.net.
14 LingPipe is available at http://alias-i.com/lingpipe/.

8.2 Machine Learning Tools

In the Arabic NER literature, the ML tools of choice are data-mining-based tools that
support one or more ML algorithms, such as Support Vector Machines (SVM), Con-
ditional Random Fields (CRF), Maximum Entropy (ME), hidden Markov models, and
Decision Trees. These tools are YASMET, CRF++, YamCha, and WEKA. They all share
the following features: a generic toolkit, language independence, absence of embedded
linguistic resources, a requirement to be trained on a tagged corpus, the performance of
sequence labeling classiﬁcation using discriminative features, and a suitability for the
pre-processing steps of NLP tasks.

YASMET:15 This free toolkit, which is written in C++, is applicable to ME models.
The toolkit can estimate the parameters and compute the weights of an ME model.
YASMET is designed to handle a large set of features efficiently. However, there are
not many details available about the features of this toolkit. In Benajiba, Rosso, and
Benedí Ruiz (2007), Benajiba and Rosso (2007), and Benajiba, Diab, and Rosso (2009a),
YASMET was used to implement the ME approach in Arabic NER.

CRF++:16 This is a free open source toolkit, written in C++, for learning CRF models
in order to segment and annotate sequences of data. The toolkit is efﬁcient in training
and testing and can produce n-best outputs. It can be utilized in developing many
NLP components for tasks such as text chunking and NER, and can handle large
feature sets. Benajiba and Rosso (2008), Benajiba, Diab, and Rosso (2008a, 2009a),
and Abdul-Hamid and Darwish (2010) have all utilized CRF++ to develop CRF-based
Arabic NER.

YamCha:17 A commonly used free open source toolkit written in C++ for learning
SVM models. This toolkit is generic, customizable, efﬁcient, and has an open source text
chunker. It has been utilized to develop NLP pre-processing tasks such as NER, POS tag-
ging, base-NP chunking, text chunking, and partial chunking. YamCha performs well
as a chunker and is capable of handling large sets of features. Moreover, it allows for
redefining feature parameters (window size) and parsing direction (forward/backward),
and applies algorithms to multi-class problems (pairwise/one vs. rest). Benajiba,
Diab, and Rosso (2008a), Benajiba, Diab, and Rosso (2008b), Benajiba, Diab, and Rosso
(2009a), and Benajiba, Diab, and Rosso (2009b) have used YamCha to train and test
SVM models for Arabic NER.
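
The window-size, parsing-direction, and multi-class options mentioned above can be
illustrated without the toolkit itself. The sketch below is plain Python and does not use
or reproduce YamCha's interface: the function names, the feature-key format, and the toy
transliterated tokens are assumptions made for the example. It builds the context features
for one token under forward parsing and counts how many binary classifiers each multi-class
strategy requires.

```python
from typing import Dict, List


def window_features(tokens: List[str], prev_tags: List[str], i: int,
                    window: int = 2) -> Dict[str, str]:
    """Collect context features for token i under forward parsing.

    Static features: the tokens at offsets -window..+window.
    Dynamic features: the tags already assigned to the preceding tokens.
    """
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the sentence get a padding symbol.
        feats[f"w[{offset}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    for offset in range(-window, 0):
        j = i + offset
        feats[f"t[{offset}]"] = prev_tags[j] if 0 <= j < len(prev_tags) else "<PAD>"
    return feats


def n_binary_classifiers(n_classes: int, strategy: str) -> int:
    """Number of binary classifiers needed for an n-class problem."""
    if strategy == "pairwise":
        return n_classes * (n_classes - 1) // 2
    if strategy == "one-vs-rest":
        return n_classes
    raise ValueError(f"unknown strategy: {strategy}")


tokens = ["qal", "khaled", "shaalan", "ams"]   # toy romanized tokens
tags_so_far = ["O", "B-PER"]                   # tags already assigned to tokens 0 and 1
features = window_features(tokens, tags_so_far, i=2, window=1)
print(features)
print(n_binary_classifiers(9, "pairwise"), n_binary_classifiers(9, "one-vs-rest"))
```

Backward parsing would mirror the dynamic features, using the tags of the following tokens
instead of the preceding ones.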

Weka:18 A collection of ML algorithms developed for data mining tasks. The
algorithms can either be applied directly to a data set or called from your own
Java code. The toolkit contains tools for data pre-processing, classification, regression,
clustering, association rules, and visualization. It has also been found useful
for developing new ML schemes (Witten, Frank, and Hall 2011). The Weka workbench
supports the use of k-fold cross validation with each classifier and the presentation
of results by means of standard Information Extraction measures. Most recently,
Abdallah, Shaalan, and Shoaib (2012) and Oudah and Shaalan (2012) have successfully
used Weka to develop an ML-based NER classifier as part of a hybrid Arabic
NER system.

15 http://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html.
16 http://crfpp.sourceforge.net/.
17 http://chasen.org/∼taku/software/yamcha/.
18 http://www.cs.waikato.ac.nz/ml/weka/.

8.3 Arabic NLP Tools

The complexity of Arabic morphology makes it a very challenging research topic. In this
section we present Arabic morpho-syntactic pre-processing tools that are widespread
and used extensively in the Arabic NER literature, including BAMA, MADA, and the
AMIRA toolkit.

BAMA (Buckwalter Arabic Morphological Analyzer).19 BAMA is one of the
most widely used Arabic NLP tools and is widely cited in the literature (Buckwalter
2002; Elsebai and Meziane 2011). It contains over 80,000 words, 38,600 lemmas, three
dictionaries (Preﬁx, Stem, Sufﬁx), and three compatibility tables (Preﬁx-Stem, Stem-
Sufﬁx, Preﬁx-Sufﬁx) (Habash 2010). Entries of the stem dictionary include English
glosses, which have been used to disambiguate NEs. BAMA output lends itself to
information extraction and retrieval processing as it takes an input Arabic word and
returns a stem rather than a root. The word is selected with or without short vowels.
Then it is segmented and compatibility-checked for the correct combination of its
segments, producing all possible analyses of the input word. BAMA transliteration
of the output makes it readable; this is more useful for readers who do not have
the ability to read the Arabic script but are familiar with Latin script. In addition,
the transliteration20 output can be converted directly to Unicode Arabic with a
minimal amount of automatic processing. BAMA has been made available through the
Linguistic Data Consortium. Some of the Arabic NER studies that rely on BAMA for
performing morphological analysis include Farber et al. (2008), Elsebai, Meziane, and
Belkredim (2009), and Al-Jumaily et al. (2012).
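
The segmentation and compatibility check described above can be sketched as follows. This
is a toy illustration, not BAMA's code or data: the three miniature dictionaries, the
category labels, and the contents of the compatibility tables are invented for the example,
and the words are written in Buckwalter-style transliteration.

```python
from typing import Dict, List, Set, Tuple

# Toy lexicons: surface form -> category label (all entries are invented).
PREFIXES: Dict[str, str] = {"": "P0", "w": "P-wa", "Al": "P-Al", "wAl": "P-waAl"}
STEMS: Dict[str, str] = {"ktAb": "N", "ktb": "V"}
SUFFIXES: Dict[str, str] = {"": "S0", "t": "S-verb"}

# Three compatibility tables listing the category pairs that may co-occur.
PREFIX_STEM: Set[Tuple[str, str]] = {
    ("P0", "N"), ("P0", "V"), ("P-wa", "N"), ("P-wa", "V"),
    ("P-Al", "N"), ("P-waAl", "N"),
}
STEM_SUFFIX: Set[Tuple[str, str]] = {("N", "S0"), ("V", "S0"), ("V", "S-verb")}
PREFIX_SUFFIX: Set[Tuple[str, str]] = {
    ("P0", "S0"), ("P0", "S-verb"), ("P-wa", "S0"), ("P-wa", "S-verb"),
    ("P-Al", "S0"), ("P-waAl", "S0"),
}


def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return every (prefix, stem, suffix) split licensed by the tables."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix not in PREFIXES or stem not in STEMS or suffix not in SUFFIXES:
                continue
            p, s, x = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
            # All three pairwise compatibility checks must succeed.
            if (p, s) in PREFIX_STEM and (s, x) in STEM_SUFFIX and (p, x) in PREFIX_SUFFIX:
                analyses.append((prefix, stem, suffix))
    return analyses


print(analyze("wAlktAb"))  # conjunction + definite article + noun stem
print(analyze("wktbt"))    # conjunction + verb stem + verbal suffix
print(analyze("Alktbt"))   # rejected: article prefix with a verb stem
```

With realistic dictionaries a single word usually receives several analyses, which is why a
disambiguation step is needed afterwards.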

MADA+TOKAN.21 MADA stands for Morphological Analysis and Disambiguation
for Arabic. The combined package is built on top of BAMA as a natural successor
that builds on prior successes and meets the growing requirements of many Arabic
NLP applications (Habash, Rambow, and Roth 2009). The package consists of two
components. Morphological analysis and disambiguation are handled in the MADA
component. Morphological analysis also supports the ability to tokenize and stem
deterministically. Because there are many different ways to tokenize Arabic (tokenization
is a convention adopted by researchers), the TOKAN component allows the user to
specify any tokenization scheme that can be generated from disambiguated analyses.
The MADA+TOKAN package provides one solution to all of the basic problems in
Arabic NLP, including tokenization (the segmentation of clitics from a word with attendant
spelling modifications), diacritization (insertion of disambiguating short-vowel
diacritics), morphological disambiguation (determining the full morphological information
for each word given its context), POS tagging (determining specific morphological
information for each word), stemming (reducing each word to its base form),
and lemmatization (determining the citation form lemma of the set of word lexemes
to which each word in the data belongs). MADA operates by examining a list of all
possible analyses for each word generated by BAMA, and then selecting the analysis
that best matches the immediate context by means of SVM models. This classifier
uses 19 distinct and weighted morphological features to provide complete diacritic,
lexemic, glossary, and morphological information (Habash 2010). However, because
MADA is built on top of BAMA, it inherits all of BAMA’s limitations. For example,
if no analysis is given by BAMA, no lemmatization or diacritization is undertaken. It
has been noted in the literature that because MADA was trained and tested on the Penn
Arabic Treebank (Maamouri et al. 2004), its coverage and quality relative to other text
types has not yet been evaluated (Attia et al. 2010; Mohit et al. 2012). The richness of
MADA’s extracted morphological features has been exploited by Arabic NER studies
such as those carried out by Farber et al. (2008), Benajiba and Rosso (2008), Benajiba,
Diab, and Rosso (2008a), Benajiba, Diab, and Rosso (2009a), Benajiba, Diab, and Rosso
(2009b), Oudah and Shaalan (2012), and Oudah and Shaalan (2013).

19 LDC Catalog No.: LDC2004L02, on http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2004L02.
20 See the BAMA mapping table at http://www.qamus.org/transliteration.htm.
21 MADA+TOKAN constitute a single package that is continuously updated. The system is freely available
for research purposes at http://www1.ccls.columbia.edu/MADA/MADA download.html.
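
The selection step described for MADA, choosing among the candidate analyses the one that
best agrees with weighted feature predictions, can be sketched as below. This is a
simplified illustration and not MADA's implementation: the feature names, values, weights,
and the two candidate analyses are invented, and the per-feature predictions that would
come from trained classifiers are simply given as a dictionary.

```python
from typing import Dict, List

Analysis = Dict[str, str]


def pick_analysis(analyses: List[Analysis],
                  predicted: Dict[str, str],
                  weights: Dict[str, float]) -> Analysis:
    """Return the analysis with the highest weighted agreement score."""
    def agreement(analysis: Analysis) -> float:
        # Add the weight of every feature whose predicted value matches.
        return sum(weights.get(name, 1.0)
                   for name, value in predicted.items()
                   if analysis.get(name) == value)
    return max(analyses, key=agreement)


# Two hypothetical analyses of the same undiacritized word.
candidates = [
    {"lemma": "katab", "pos": "verb", "number": "sg", "gender": "m"},
    {"lemma": "kitAb", "pos": "noun", "number": "pl", "gender": "m"},
]
# Hypothetical per-feature predictions for this word in its context.
predicted = {"pos": "noun", "number": "pl", "gender": "m"}
weights = {"pos": 2.0, "number": 1.0, "gender": 0.5}

best = pick_analysis(candidates, predicted, weights)
print(best["lemma"])
```

Because the candidates come from the analyzer, a word with no analyses leaves nothing to
rank, which matches the limitation noted above.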

AMIRA.22 A statistical Arabic processing toolkit that includes a clitic tokenizer, POS
tagger, and BPC or shallow syntactic parser (Diab 2009). It has been widely used for
different NLP applications due to its speed and high performance. BPC is one of the
distinctive characteristics of this toolkit. AMIRA has been used in the extensive studies
of Arabic NER by Benajiba, Diab, and Rosso (2008a), Benajiba, Diab, and Rosso (2008b),
Benajiba, Diab, and Rosso (2009a), and Benajiba, Diab, and Rosso (2009b).

9. Evaluation

The main objective of evaluation is to rank NER systems based on the ability to
annotate a text in the way that an Arabic linguist would. For any research undertaking,
it is necessary to evaluate the system’s results with respect to existing systems on
the assumption that the same reported results should be replicated under the same
experimental settings (Kumaran, Khapra, and Li 2010). Results are easily compared
when they utilize the same standard evaluation corpora, where every NE has a type
assigned to it.

CoNLL’s evaluation metrics are used in the Arabic NER literature. These are
aggressive metrics that do not assign partial credit: An exact match of the NE as
a whole and a correct classification must be identified in order to earn credit. This
scoring method is popular because it makes results simple to calculate and
analyze. NER systems are compared based on the standard micro-averaged
F-measure with the Precision being the ratio of the detected NEs that are correctly
classiﬁed by the system, and the Recall being the ratio of the relevant NEs that
are detected by the system (Yang 1999). Mesfar (2007) has redeﬁned the evaluation
measures to account for partially correct NE tagging that arises due to a lack of
information about unknown words within NEs. No other research has accepted this
additional parameter of the evaluation measures.
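
The exact-match scoring described above can be sketched as follows. This is an
illustration rather than the official CoNLL scoring script: representing each NE as a
(start, end, type) tuple and the example annotations are assumptions made here. An entity
earns credit only if its boundaries and its type both match the gold standard.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (first token index, last token index, NE type)


def exact_match_scores(gold: Set[Entity],
                       predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (Precision, Recall, F-measure) with no partial credit."""
    correct = len(gold & predicted)  # same span and same type
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


gold = {(0, 1, "PER"), (4, 4, "LOC"), (7, 9, "ORG")}
pred = {
    (0, 1, "PER"),    # exact match: counted
    (4, 4, "ORG"),    # right span, wrong type: no credit
    (7, 8, "ORG"),    # right type, wrong boundary: no credit
    (11, 11, "LOC"),  # spurious entity
}
precision, recall, f_measure = exact_match_scores(gold, pred)
print(precision, recall, f_measure)
```

Pooling the gold and predicted sets over all documents and NE types before calling the
function gives the micro-averaged figures.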

High Recall means that the system returned most of the relevant results, whereas
high Precision means that the system returned more relevant results than irrelevant ones.
Often, there is an inverse relationship between Precision and Recall, where it is possible
to increase one at the cost of lowering the other. Recently, Mohit et al. (2012) explored
the Recall–Precision tradeoff and proposed a Recall-oriented learning method that
improved Recall over Precision during semi-supervised discriminative learning of NEs
from Wikipedia.

22 A demo of the system is available at http://nlp.ldeo.columbia.edu/amira/.

K-fold cross validation is usually adopted with the scoring method in order to
avoid over-fitting. The data set is randomly divided into k folds of equal size. Each fold
is used in turn as the testing set while the remaining folds are used as the training set, and
the test results (i.e., F-measure, Precision, Recall) are then averaged over the k rounds. When
comparing evaluation results it is important to replicate the same split for training and
testing because different splits can have signiﬁcant effects on the Precision and Recall
values (Benajiba et al. 2010). Characteristics of splits include the size of training and test
data sets, ratio of NEs, number of NEs, and average length of NEs (Benajiba, Diab, and
Rosso 2008a). The advantage of the cross-validation method over other methods, such
as repeated random sub-sampling or the percentage split method (holdout), is that all
observations are used equally for both training and validation, and each observation is
used for validation exactly once. The disadvantage of this method is that the training
algorithm has to be rerun from scratch k times, which means it takes k times as much
computation to make an evaluation. Typically, 10-fold cross-validation is used, but in
general k remains a variable parameter.
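
The k-fold procedure described above can be sketched in a few lines. This is a minimal
illustration using only the Python standard library: the function names and the score_fold
callback (which would train on one index list and return a score on the other) are
assumptions made for the example. Fixing the random seed is one simple way to replicate the
same split across experiments.

```python
import random
from typing import Callable, Iterator, List, Tuple


def k_fold_splits(n_items: int, k: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # random but reproducible split
    folds = [indices[i::k] for i in range(k)]   # k folds of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_items: int, k: int,
                   score_fold: Callable[[List[int], List[int]], float],
                   seed: int = 0) -> float:
    """Average a per-round score over the k rounds."""
    scores = [score_fold(train, test) for train, test in k_fold_splits(n_items, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print(sorted(test), len(train))
```

Each item appears in exactly one test fold, and the training routine is rerun k times,
which is the computational cost noted above.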

10. NER Systems

The importance of Arabic NER systems has been well recognized by the community,
as evidenced by the noteworthy publications in this important area. In this section
we present different NER systems. They are classiﬁed according to the approach used.
Unfortunately for the research community, most of the efforts to develop reliable Arabic
NER systems have been undertaken for commercial purposes (Benajiba, Rosso, and
Benedí Ruiz 2007; Zaghouani 2012). Because information on the specifications and
performance of these systems is generally not available, it is difﬁcult to carry out a
fair comparison of the performance of these systems relative to the systems proposed
by the Arabic NER research community. Examples of commercial Arabic NER sys-
tems are: ANEE23 (Coltec), IdentiFinder24 (BBN), NetOwlExtractor25 (NetOwl), Siraj26
(Sakhr), Clear Tags27 (ClearForest), Enterprise Search28 (FAST ESP), and InXight-Smart-
Discovery-Entity-Extractor29 (InXight).

10.1 Rule-Based Systems

Rule-based NER systems depend mainly on hand-made linguistic rules (i.e., grammars)
defined by linguists. In the literature, the development of systems using the rule-based
approach was motivated mainly by the fact that the architecture of the available
NER development tools was optimized for building rule-based systems. The approach
compensates for the lack of Arabic NER linguistic resources, and is favored based on
the encouraging results obtained by various Arabic rule-based systems as shown in this
section. Experiments for reporting the performance of rule-based systems are described
at three levels: the NE type, the level of linguistic knowledge (morphology and syntax),
and the inclusion/exclusion of gazetteers. A corpus is often needed to evaluate an NER
system, but not necessarily for its development. This is the reason that many of these
experiments are based on a non-standard data set that has been acquired by developers
for evaluation purposes.

23 http://www.coltec.net/Portals/0/COLTEC PDFs/ANEE NEW.pdf.
24 http://www.bbn.com/technology/speech/identifinder.
25 http://www.sra.com/netowl/entity-extraction/.
26 Online demo version available at http://siraj.sakhr.com/.
27 http://www.clearforest.com/solutions.html.
28 http://www.microsoft.com/enterprisesearch.
29 http://www.inxightfedsys.com/products/sdks/tf/default.asp.

Maloney and Niv (1998) presented the TAGARAB system, an early attempt to
handle Arabic rule-based NER. The system identiﬁes the following NE types: person,
organization, location, number, and time. A morphological analyzer is used to decide
where a name ends and the non-name context begins. For evaluation, 14 texts from
the Al-Hayat CD-ROM were selected randomly and manually tagged. The overall
performance obtained for the various categories (time, person, location, and number)
was a Precision of 89.5%, a Recall of 80.8%, and an F-measure of 85%.

Abuleil (2004) developed a rule-based NER system that makes use of lexical triggers.
Some special verbs, such as أعلن (announce), are used to predict the positions of
names in the Arabic sentence. The research assumes that an NE appears close to lexical
triggers, no more than three words from the cue word, and that the NE has a maximum
length of seven words. Some names may be attached to different types of lexical triggers
and to more than one lexical trigger in the same phrase. For example, the phrase
الدكتور خالد شعلان رئيس قسم تكنولوجيا المعلومات (Dr. Khaled Shaalan the Chairman of IT
Department) has the lexical triggers الدكتور (Dr) and رئيس قسم (Chairman Department).
In Abuleil’s (2004) work, Arabic NER is part of a question-answering system. The
system starts by marking the phrases that could include names. Afterwards, it builds
up a graph that represents the words in these phrases and the relationships between
them. Finally, rules are applied to classify and generate the NEs before saving them in
a database. The system has been evaluated on 500 articles from the Al-Raya newspaper,
published in Qatar. It obtained a Precision of 90.4% on persons, 93% on locations, and
92.3% on organizations.
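
The first step of this approach, marking the phrases that could include names, can be
sketched from the two assumptions stated above. The code is one possible reading and not
Abuleil's implementation: it assumes that the NE starts at most three words after the cue
word and spans at most seven words, it only looks to the right of the trigger, and the
function name and romanized tokens are invented for the example.

```python
from typing import List, Set, Tuple


def candidate_phrases(tokens: List[str], triggers: Set[str],
                      max_distance: int = 3,
                      max_length: int = 7) -> List[Tuple[int, List[str]]]:
    """Return (trigger position, candidate phrase) pairs.

    If the NE starts at most `max_distance` words after the trigger and is at
    most `max_length` words long, it must lie inside this token window.
    """
    phrases = []
    for i, token in enumerate(tokens):
        if token not in triggers:
            continue
        start = i + 1
        end = min(len(tokens), i + max_distance + max_length)  # exclusive bound
        if start < end:
            phrases.append((i, tokens[start:end]))
    return phrases


# Toy romanized sentence: "announced Dr. Khaled Shaalan chairman (of) department ..."
sentence = ["aelan", "al-duktur", "khaled", "shaalan", "rais", "qism"]
marked = candidate_phrases(sentence, {"aelan"})
print(marked)
```

The later steps described in the source (building a graph over the words of each marked
phrase and applying classification rules) are not covered by this sketch.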

Samy, Moreno, and Guirao (2005) used comparable corpora in Spanish and Arabic
and an NE tagger. A mapping technique is used to transliterate words in the Arabic text
and return those matching with NEs in the Spanish text as NEs in Arabic. The Spanish
NE tags are used as indicators for tagging the corresponding NEs in the Arabic corpus.
Exceptions arise when the system tries to recognize NEs whose Arabic equivalents are
completely different, such as Grecia (Greece) اليونان, or do not have a precise
transliteration, such as Somalia الصومال. An experiment was conducted using 1,200
sentence pairs. In another
experiment, a stop word ﬁlter was additionally applied to exclude the stop words from
the potential transliterated candidates. The ﬁlter improved the overall Precision from
84% to 90%; the Recall was very high at 97.5%.

Mesfar (2007) used NooJ to develop a rule-based Arabic NER system. The system
identiﬁes the following NE types: person, location, organization, currency, and temporal
expressions. The Arabic NER is a pipeline process that goes through three sequential
modules: a tokenizer, a morphological analyzer, and Arabic NER. Morphological infor-
mation is used by the system to extract unclassiﬁed proper nouns and thereby enhance
the overall performance of the system. An evaluation corpus was built from Arabic
news articles extracted from the Le Monde Diplomatique newspaper. The reported results
based on individual NE types were as follows: Precision, Recall, and F-measure range
from 82%, 71%, and 76% for Place names to 97%, 95%, and 96% for Time and Numerical
expressions, respectively.

Another system adopting the rule-based approach for identifying person names is
PERA (Shaalan and Raza 2007). This research describes the structure of Arabic personal
names: ‘ism’, ‘kunya’, ‘nasab’, ‘laqab’, and ‘nisba’. The ‘ism’ is a proper name given
shortly after birth (i.e., the given name). Examples of such names are محمد (Muhammad,
Mohammed), موسى (Musa, Moses), إبراهيم (Ibrahim, Abraham).
The ‘kunya’ is an honorific name or surname that states the name of someone’s
father (أبو, Abu) or mother (أم, Umm). For example: أبو داود (Abu Da’ud, the father of
David), أم سليم (umm Salim, the mother of Salim). When using a person’s full name, the
‘kunya’ precedes the given name, for example, أبو يوسف حسن (Abu Yusuf Hassan, the
father of Joseph, Hassan), أم جعفر أمينة (Umm Ja’far Aminah, the mother of Ja’far, Aminah).
The ‘nasab’ indicates the person’s heritage by the word ابن Ibn (colloquially and in
MSA, بن Bin), which means son (بنت Bint for daughter): for example, ابن عمر (Ibn ‘Umar,
the son of Omar), بنت عباس (bint ‘Abbas, the daughter of Abbas). The ‘nasab’ follows the
‘ism’ in usage, for example, حسن ابن فراج (Hasan Ibn Faraj, Hasan the son of Faraj), and
Sumayya Bint Khubbat (Sumayya the daughter of Khubbat). Many historical
persons are more familiar to us by their ‘nasab’ than by their ‘ism’. Notable examples
are: the historian ابن خلدون (Ibn Khaldun), the traveler ابن بطوطة (Ibn Battuta), and the
philosopher ابن سينا (Ibn Sina, Avicenna).

A ‘laqab’ is a combination of words into a byname or epithet, usually religious or
relating to a trait, a descriptive, or some admirable quality the person had or would
like to have. Examples are: الرشيد (Al-Rashid, the Rightly guided), and الفاضل (Al-Fadl,
the Prominent). In practice, ‘laqabs’ follow the ‘ism’, for example, هارون الرشيد (Harun
Al-Rashid, Aaron the Rightly guided).
Finally, a ‘nisba’ is a name derived from a person’s: trade or profession, place of
residence or birth, or religious affiliation. Examples are: الحلاج (Al-Hallaj, the dresser of
cotton), المصري (Al Msri, The Egyptian), إسلامي (Islami, Islamic). Nisbas follow the ‘ism’
or, if the name contains a ‘nasab’ (of however many generations), generally follow the
‘nasab.’

In PERA, rules use regular expressions that include these naming constituents to
recognize person names, where “+” indicates one or more elements; “\s” represents
white space; “|” represents alternatives; and “?” represents an optional element. For
example, consider the following rule:

((honorific_trigger+\s((ال)?location_GAZ(ي|ية))+\s)?)+
first_name_GAZ(\s+last_name_GAZ)?\s+(number)?

This rule recognizes a person name such as الملك الأردني عبد الله الثاني (The Jordanian
king Abdullah II) that is composed of a first name followed by optional last name, which
in turn is followed by an optional ordinal number based on preceding person triggers.
The triggers are the honorific الملك (the king) and the ‘nisba’ الأردني (Jordanian). ‘Nisba’
is represented by the expression (ال)? location_GAZ (ي|ية) that indicates a nationality
(masculine or feminine) adjective as in [الأردن][ي][ية] (Jordan[ian]) and
[ال][مصر][ي][ة] ([The] Egypt[ian]). The system consists of three components: gazetteers,
grammar rules, and a filtering mechanism. Whitelists of person names are provided in the
gazetteers component in order to extract the exact matching of NEs regardless of the
grammar. Afterwards, the input text is presented to the grammar in order to identify
other person NEs. Finally, the filtering mechanism is applied to NEs in order to exclude
invalid person names. PERA was evaluated using the ACE and Treebank Arabic data sets
and obtained 85.5%, 89%, and 87.5% for Precision, Recall, and F-measure, respectively.

As a continuation of the research carried out by Shaalan and Raza (2007), a NERA
system was introduced by Shaalan and Raza (2008, 2009) that generalizes the findings
from PERA. NERA addresses major challenges posed by NER in the Arabic language
arising from the complexity of the morphological system, peculiarities in the Arabic
orthographic system, non-standardization of the written text, ambiguity, and lack of
resources. The system identifies the following NE types: person, location, organization,
date, time, ISBN, price, measurement, phone numbers, and filenames. NERA used the
FAST ESP30 framework, whose architecture is optimized for rule-based systems, as an
implementation platform. Like PERA, the NERA system has three components
(gazetteers, local grammars in the form of regular expressions, and a filtering mechanism).
Gazetteer entries include English transliterations, an important feature for cross-lingual
and multilingual applications. The evaluation is based on manually constructed
corpora from ACE, the Web, and organizations.
NERA obtained an F-measure of 87.7% for person, 85.9% for locations, and 83.15% for
organizations.

Traboulsi (2009) presented a rule-based approach for person NER that uses a local
grammar and dictionaries. The extraction process is based on reporting verbs that can
be used within grammars to indicate one or more NE types. In the following example,
two NEs (person and organization names) can be recognized using verb and job title
triggers.

Arabic: ... قال [أحمد الفهد الصباح] [رئيس] [اوبك] ان ...
Labels (in reading order): Person Name, TITLE trigger, ORG Name
English: ... said [Ahmad Al-Fahd Al-Sabah] [OPEC]’s [president] that ...

Notice that not all verbs that occur before person names can correctly identify NEs.
For example, in the sentence إتهم صدام بوش (Saddam accused Bush; literally, “accused
Saddam Bush”), using the verb as a trigger would result in the extraction of صدام بوش
(Saddam Bush) as a name although these are in fact two different names, corresponding
to the subject and object of the verb, respectively. An analytical study was conducted
by Traboulsi (2009) for his own corpus (arabiCorpus) that was collected from several
newspapers, books, the Quran, and some medieval medical and philosophical texts.
The study addressed frequency, collocation, and concordance analyses of the corpus.
No substantive evaluation results were reported.

Elsebai, Meziane, and Belkredim (2009) and Elsebai and Meziane (2011) have proposed
a rule-based person name recognition system. The system is implemented using
GATE. Heuristic rules make use of two kinds of lexical triggers in the Arabic text. An
introductory verb trigger, for example, قال (said), identifies the phrases that probably
include person names. An NE trigger, for example, طبيب (doctor), a job title, identifies a
person name within phrases.
The structure of the heuristic rule depends on the relative position of each kind of lexical trigger in the input text and its position relative to other words. BAMA (Buckwalter 2002) has been integrated to extract the morphological features of the target word that are used within rules to identify whether the target word is a proper noun. This has led to the elimination of the need for any predefined person name gazetteers. Name lists, specifically, place and organization names, and stop words, such as prepositions, which occur after lexical triggers, are used to counter-indicate the presence of a person name. For example, although أبو ظبي (Abu Dhabi) in the phrase أعلنت أبو ظبي عن الفائزين (Abu Dhabi announced the winners) is recognized as a proper noun, it is discarded because it belongs to the list of places and hence should not be recognized as a person name. Two experiments were conducted (Elsebai, Meziane, and Belkredim 2009; Elsebai and Meziane 2011). The first experiment used around 700 news articles extracted from an Arabic media Web site, and the second used 500 articles. The overall system performance in the first experiment was 93%, 86%, and 89%, for Precision, Recall, and F-measure, respectively; the overall performance in the second experiment was 88%, 90%, and 89%, for Precision, Recall, and F-measure, respectively.

30 FAST ESP is a product of the FAST Search & Transfer Company, which was acquired by Microsoft in 2008.
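
The trigger-based heuristics of Elsebai and Meziane described above can be illustrated with a minimal sketch. It is not the authors' GATE implementation: the word lists, the function names, and the `PROPER_NOUNS` set (a stand-in for the proper-noun decision that a morphological analyzer such as BAMA would supply) are all hypothetical, and tokens are romanized only for readability.

```python
from typing import List, Set, Tuple

# All lists below are tiny, hypothetical stand-ins (romanized for readability).
INTRO_VERBS: Set[str] = {"qal", "aelanat"}        # e.g. "said", "announced"
JOB_TITLES: Set[str] = {"tabib", "rais"}          # e.g. "doctor", "president"
PLACE_NAMES: Set[Tuple[str, ...]] = {("abu", "dhabi")}
STOP_WORDS: Set[str] = {"an", "fi", "min"}        # prepositions
# Stand-in for the proper-noun decision a morphological analyzer would supply.
PROPER_NOUNS: Set[str] = {"ahmad", "al-fahd", "al-sabah", "abu", "dhabi"}


def proper_noun_run(tokens: List[str], start: int) -> Tuple[str, ...]:
    """Return the maximal run of proper-noun tokens beginning at `start`."""
    end = start
    while end < len(tokens) and tokens[end] in PROPER_NOUNS:
        end += 1
    return tuple(tokens[start:end])


def find_person_names(tokens: List[str]) -> List[str]:
    """Propose person names that directly follow a lexical trigger."""
    names: List[str] = []
    i = 0
    while i < len(tokens):
        is_trigger = tokens[i] in INTRO_VERBS or tokens[i] in JOB_TITLES
        if not is_trigger:
            i += 1
            continue
        # A stop word right after the trigger counter-indicates a name.
        if i + 1 < len(tokens) and tokens[i + 1] in STOP_WORDS:
            i += 2
            continue
        candidate = proper_noun_run(tokens, i + 1)
        # Known place names counter-indicate a person reading.
        if candidate and candidate not in PLACE_NAMES:
            names.append(" ".join(candidate))
        i += 1 + len(candidate)
    return names


print(find_person_names("qal ahmad al-fahd al-sabah".split()))
print(find_person_names("aelanat abu dhabi an al-faizin".split()))
```

In this sketch the first sentence yields one candidate name, whereas the Abu Dhabi phrase yields none because the proper-noun run matches an entry in the place list. A real system would replace the hand-written sets with analyzer output and much larger name lists.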

Alkharashi (2009) described the formation of an Arabic person name from root and pattern using the traditional Arabic morphology and suggested relevant computational resources. The author introduced a set of database tables in order to assist Arabic NER: root-pattern, a frequency list of roots, and lexical trigger tables. A corpus was created from Saudi person names with specific person name tags: root of person NE, features indicating the possibility of affixation, and gender characteristics. The main objective was to recognize the constituents of the person NE, these being the simple form, the affix, and connectors. For example, the name of the Umayyad caliph الوليد بن عبد الملك (Al-Waleed bin Abd Al-Malik) has ملك (Malik) and وليد (Waleed) as simple names, عبد (Abd) and ال (Al) as name prefixes, and بن (Bin) as a name connector. The study has reported interesting observations about features of highly frequent patterns and their lengths. A simple test for assessing how well the pattern of a person name was recognized was conducted on 60,000 generated person name entries. It demonstrated that the correct pattern appears 94% of the time as one of the first three suggested patterns, 86% as one of the first two suggested patterns, and 69% of the time as the first suggested pattern.

Al-Shalabi et al. (2009) presented an Arabic NER algorithm for retrieving Arabic proper nouns using lexical triggers. The research takes into consideration regional patterns such as the name connector ولد (ould, son of) used in Mauritanian person names (e.g., مختار ولد داداه, Moktar Ould Daddah). The algorithm identifies the following NE types: people, major cities, locations, countries, organizations, political parties, and terrorist groups. However, the reported research only focuses on person NEs. The algorithm uses heuristic rules to preprocess the input to clean the data and remove affixes.
Then, internal evidence triggers, such as person name connectors, are used to recognize the NEs. The system was evaluated using 20 randomly selected documents from the Al-Raya newspaper published in Qatar, and the Alrai newspaper published in Jordan. An overall precision of 86.1% was observed.

Attia et al. (2010) proposed a method for acquiring a richer NE lexicon using Arabic WordNet (Elkateb et al. 2006) and Arabic Wikipedia.31 The proposed NE lexicon enhances the lexical entries in WordNet and produces a well-structured Arabic NE lexical resource. The main objective is to extract Arabic WordNet’s instantiable nouns and to identify the corresponding categories in the Arabic Wikipedia. These categories act as lexical triggers. A decision is made in order to identify which of the Wikipedia articles of these categories correspond to NEs. They are then extracted, connected to Arabic WordNet, and inserted in the NE repository. In a subsequent post-processing step, further NEs are acquired by exploiting inter-lingual links. Finally, the NEs acquired are diacritized. This lexical resource is useful for Arabic NER; the results are not only recognized (tagged) NEs but also identified synsets which are semantically related to them (synonyms, subtypes, supertypes, etc.). Likewise, Abouenour, Bouzoubaa, and Rosso (2010) suggested enriching the NEs in Arabic WordNet by using an ontology-based method. This is used in the query expansion stage of an Arabic Question Answering system (Lahsen, Bouzoubaa, and Rosso 2012), resulting in an improvement of the ranking of the returned passages.

31 WordNet (Fellbaum 2005) is a large lexical database that was originally implemented for English. Arabic WordNet is still a limited lexical resource. On the other hand, Wikipedia is a popular ubiquitous source of corpus data for information extraction because of its size, currency, rich semi-structured content, diverse topics, and closer resemblance to web text than newswire (Balasuriya et al. 2009; Mohit et al. 2012).

Shihadeh and Neumann (2012) proposed an Arabic NER system called ARNE, which recognizes person, location, and organization NEs based only on a gazetteer lookup approach; the system provides morphological information using a system called ElixirFM, developed by Smrz (2007). The system does not use any rules or context information for Arabic NER. Before recognizing the NEs, ARNE carries out three pre-processing steps that are not used by the gazetteer lookup approach: tokenization, Buckwalter transliteration, and POS tagging. ARNE uses the ANERgazet gazetteer that was developed by Benajiba, Rosso, and Benedí Ruiz (2007) and Benajiba and Rosso (2007). ARNE can recognize an NE that has a maximum length of four words. The experimental results showed low overall performance: 38%, 27%, and 30% for Precision, Recall, and F-measure, respectively. The authors suggest several reasons as to why the F-measure did not achieve higher values. These include the size and quality of the gazetteers, the richness and complexity of Arabic morphology, and the ambiguity problem inherent in Arabic NEs.

Al-Jumaily et al. (2012) proposed a rule-based NER system that can be used in Web applications. The system identifies the following NE types: person, location, and organization NEs. The system was developed using GATE and provides Arabic morphological analysis in a method similar to BAMA. It also integrates different gazetteers from GATE, DBPedia,32 and ANERgazet.33 The system was evaluated using ANERcorp. Two experiments were carried out to study the effect of Arabic prefixes and suffixes on the recognition results.
If an Arabic token (prefix-stem-suffix) is recognized, then a verification process is used to ensure the compatibility between the three possible combinations (prefix-stem, stem-suffix, and prefix-suffix). The verification process has improved the recognition results of NEs across all types, although these improvements were not symmetrical. The improvements in the Precision of person, location, and organization are 7.32%, 5.55%, and 5.14%, respectively. Suggestions for improvements include: 1) adding new patterns to the system’s dictionary, 2) accounting for all transliteration variants of Latin names, 3) adopting semi-automatic methods to tag unrecognized words, and 4) performing contextual analysis to resolve ambiguity arising from words that may belong to different entity types (e.g., whether باريس (Paris) is a location or person).

32 See http://dbpedia.org/About. Entries of the DBPedia are translated from English to Arabic using Google Translate.
33 See http://users.dsic.upv.es/grupos/nle/?file=kop4.php for a set of resources including ANERgazet and ANERcorp.

Zaghouani et al. (2010) presented an adaptation of a multilingual system, the Europe Media Monitor (EMM) Information Retrieval and Extraction application NewsExplorer34 (Steinberger, Pouliquen, and Van der Goot 2009), to consider Arabic. This system at present includes 19 languages and is able to analyze large volumes of news text. The EMM-NewsExplorer architecture is optimized for rule-based systems. The adaptation resulted in a rule-based Arabic NER system (RENAR; Zaghouani 2012), which uses a handwritten set of language-independent rules (Steinberger, Pouliquen, and Ignat 2008) in combination with specific resources for Arabic. Rules are described using the following notations: “\w+” for an unknown word, “\b” for an obligatory word boundary (white space, possibly with punctuation), “+” for one or more elements, and “*” for zero or more elements. For example, consider the rule:

Organization BEG+ \b known Name \b name Infix \b Known Name* \b Organization END*

This rule recognizes complex company names such as شركة محمد أبو المجد وإخوانه (company of Mohamed Abu Al-Majd and Brothers), which include person (known) names محمد أبو المجد (Mohamed Abu Al-Majd) and the preceding and following organization internal evidence triggers شركة (company) and إخوانه (Brothers), respectively. The Arabic NER component is able to recognize the following NE types: person, organization, location, date, and number, as well as quotations (direct reported speech) by and about people. The system was first evaluated using a corpus built from on-line news sources from the Tunisian newspaper Assabah and the Lebanese newspaper Alanwar. The system’s overall performance was calculated in terms of Precision, Recall, and F-measure, delivering results of 87.17%, 65.74%, and 74.95%, respectively. Then, the system was evaluated only for person, organization, and location using ANERcorp. The system’s overall performance in terms of Precision, Recall, and F-measure was 73.39%, 62.13%, and 67.13%, respectively.

34 See http://press.jrc.it/overview.html.

10.2 Machine Learning Systems

In the field of NER, ML algorithms have been widely used in order to determine NE tagging decisions from annotated texts that are used to generate statistical models for NE prediction.
Experiments reporting ML system performance are evaluated in three dimensions: the NE type, the single/combined ML classifier (learning technique), and the inclusion/exclusion of certain features from the whole feature space. Most often these experiments use a very well defined framework and their reliance on standard corpora allows for an objective comparison of the performance of a proposed system relative to existing systems.

Much research work on ML-based Arabic NER was done by Benajiba (Benajiba, Rosso, and Benedí Ruiz 2007; Benajiba and Rosso 2007, 2008; Benajiba, Diab, and Rosso 2008a, 2008b, 2009a, 2009b; Benajiba et al. 2010), who explored different ML techniques with various combinations of features. Benajiba, Rosso, and Benedí Ruiz (2007) have developed an Arabic ME-based NER system called ANERsys 1.0. The authors have built their own linguistic resources, ANERcorp and ANERgazet.35 Lexical, contextual, and gazetteer features are used by this system. ANERsys identifies the following NE types: person, location, organization, and miscellaneous. All the experiments are carried out within the framework of the shared task of the CoNLL 2002 conference. The overall system’s performance in terms of Precision, Recall, and F-measure was 63.21%, 49.04%, and 55.23%, respectively. The ANERsys 1.0 system had difficulties with detecting NEs that were composed of more than one token/word. An extension of this work is ANERsys 2.0 (Benajiba and Rosso 2007), which uses a two-step mechanism for NER: 1) detecting the start and the end points of each NE, then 2) classifying the detected NEs. A POS tagging feature was exploited to improve NE boundary detection. The overall system’s performance in terms of Precision, Recall, and F-measure was 70.24%, 62.08%, and 65.91%, respectively. The performance of the classification module was very good with F-measure 83.22%, although the identification phase was poor with F-measure 72.03%.

35 For ANERcorp and ANERgazet, see http://www1.ccls.columbia.edu/~ybenajiba/.

Benajiba and Rosso (2008) have applied CRF instead of ME in an attempt to improve performance. The same four types of NEs used in ANERsys 2.0 were also used in the CRF-based system. Neither Benajiba, Rosso, and Benedí Ruiz (2007) nor Benajiba and Rosso (2007) included Arabic-specific features; all the features used were language-independent. Language-independent and Arabic-specific features were used in the CRF model, including POS tags, BPC, gazetteers, and nationality. The CRF-based system achieved best results when all the features were combined. The overall system’s performance in terms of Precision, Recall, and F-measure was 86.90%, 72.77%, and 79.21%, respectively. The improvement was not only dependent on the use of the CRF model but also on the additional language-specific features, including POS and BPC.

Benajiba, Diab, and Rosso (2008a) examined the lexical, contextual, morphological, gazetteer, and shallow syntactic features of ACE data sets using the SVM classifier. The system’s performance was evaluated using 5-fold cross validation. The impact of the different features is measured independently and in joint combination across different standard data sets and genres. The best system’s overall performance in terms of F-measure was 82.71% for ACE 2003, 76.43% for ACE 2004, and 81.47% for ACE 2005.

Benajiba, Diab, and Rosso (2008b) investigated the sensitivity of different NE types to various types of features rather than adopting a single set of features for all NE types simultaneously.
The set of features examined were the lexical, contextual, morphological, gazetteer, and shallow syntactic features, forming 16 specific features in total. A multiple classifier approach was developed using SVM and CRF models, where each classifier tags an NE type separately. They used a voting scheme to rank the features according to the best performance of the two models for each NE type. When the classifiers tag a word with different NE types, the conflict is resolved by selecting the output of the classifier with the highest Precision (i.e., the classifier that returned more relevant results than irrelevant ones overrides the others). An incremental feature selection method was used to select an optimized feature set and to better understand the resulting errors. A global NER system could be developed from the union of all the optimized sets of features for each NE type. ACE data sets are used in the evaluation process. The best system’s overall performance in terms of F-measure was 83.5% for ACE 2003, 76.7% for ACE 2004, and 81.31% for ACE 2005. On the basis of the analysis of the best recognition results obtained by individual and combined features experiments, it cannot be concluded whether CRF is better than SVM or vice versa. Each NE type is sensitive to different features and each feature plays a role in recognizing the NE to varying degrees.

Further studies conducted in Benajiba, Diab, and Rosso (2009a, 2009b) have confirmed the importance of considering both language-independent and Arabic-specific features in the NER system. In particular, Benajiba, Diab, and Rosso (2009a) studied the impact of SVM, ME, and CRF models using the same approach and features described in Benajiba, Diab, and Rosso (2008b). The best system’s overall performance in terms of F-measure was 83.34% for ACE 2003, 77.61% for ACE 2004, and 82.02% for ACE 2005. Interesting conclusions and recommendations have been suggested by this study.
Both SVMs and CRFs achieved very similar performance, outperforming the ME model. An important observation concerns the number of available features as the main factor for the choice of using SVMs versus CRFs: SVMs seem to achieve good results with fewer features. Another significant observation concerns the better performance achieved by carrying out pre-processing of the Arabic text by a clitic segmenter, which is more suitable given the morphological richness of Arabic.

Later, in Benajiba et al. (2010), the Arabic NER system described in Benajiba, Diab, and Rosso (2008b) is used as a baseline NER system to automatically tag an Arabic–English parallel corpus in order to provide sufficient training data for studying the impact of deep syntactic features, also referred to as syntagmatic features. The feature space is enhanced by syntagmatic features that are bootstrapped by prediction from this corpus. These features are derived from Arabic sentence parses that include an NE. The relatively low performance of the available Arabic parser leads to noisy features as well. The inclusion of the extra features has achieved high performance for the ACE (2003–2005) data sets. The best system’s overall performance in terms of F-measure was 84.32% for ACE 2003, 78.12% for ACE 2004, and 81.73% for ACE 2005. Moreover, the authors reported an F-measure improvement of up to 1.64 percentage points compared to the performance when the syntagmatic features were excluded.

Abdul-Hamid and Darwish (2010) developed a CRF-based Arabic NER system that explores using a set of simplified features for recognizing the three classic NE types: person, location, and organization.
The proposed set of features includes: boundary character n-grams (leading and trailing character n-gram features), word n-gram probability-based features that attempt to capture the distribution of NEs in text, word sequence features, and word length. Remarkably, the system did not use any external lexical resources. Moreover, the character n-gram models attempt to capture surface clues that would indicate the presence or absence of an NE. For example, character bigram, trigram, and 4-gram models can be used to capture the prefix attachment of a noun for a candidate NE such as the determiner الـ (Al), a coordinating conjunction and a determiner والـ (w+Al), and a coordinating conjunction, a preposition, and a determiner وبالـ (w+b+Al), respectively. On the other hand, these features can also be used to conclude that a word may not be an NE if the word is a verb that starts with any of the verb present tense character set (i.e., أ (A), نـ (n), يـ (y), or تـ (t)). Despite the fact that lexical features have solved the problem of dealing with a large number of prefixes and suffixes, they do not resolve the compatibility problem between prefixes, suffixes, and stems. The compatibility checking is needed in order to verify whether a correct combination is met (cf. Buckwalter 2002). The system was evaluated using ANERcorp and the ACE 2005 data set. The overall system’s performance using ANERcorp for Precision, Recall, and F-measure was 89%, 74%, and 81%, respectively. These results show that the system outperforms the CRF-based NER system of Benajiba and Rosso (2008).

Farber et al. (2008) proposed integrating a morphological-based tagger with an Arabic NER system. The integration is aimed at enhancing Arabic NER. The rich morphological information produced by MADA provides important features for the classifier.
The system adopts the structured perceptron approach proposed by Collins (2002) as a baseline for Arabic NER, using morphological features produced by MADA. The system was developed to extract person, organization, and GPEs. The empirical results from a 5-fold cross validation experiment show that the disambiguated morphological features in conjunction with a capitalization feature improve the performance of the Arabic NER system. They reported 71.5% F-measure on the ACE 2005 data set.

An integrated approach was investigated in AbdelRahman et al. (2010) by combining bootstrapping, semi-supervised pattern recognition, and CRF. The feature set is extracted by the Research and Development International36 toolkit, which includes ArabTagger and an Arabic lexical semantic analyzer. The features used include word-level, POS tag, BPC, gazetteers, semantic field tag, and morphological features. The semantic field tag is a generic cluster that refers to a set of related lexical triggers. For example, the “Corporation” cluster includes the following internal evidence that can be used to identify an organization name: مجموعة (group), مؤسسة (foundation), هيئة (authority), and شركة (company). The system identifies the following NEs: person, location, organization, job, device, car, cell phone, currency, date, and time. A 6-fold cross validation experiment using the ANERcorp data set showed that the system yielded F-measures of 74.06%, 89.09%, 75.01%, 69.47%, 77.52%, 80.95%, 80.63%, 98.52%, 76.99%, and 96.05% for the person, location, organization, job, device, car, cell phone, currency, date, and time NEs, respectively.
The results also showed that the system outperforms the NER component of LingPipe when both are applied to the ANERcorp data set.

Mohit et al. (2012) proposed a learning (Recall-oriented) model for Arabic NER from diverse text domains like Wikipedia within the AQMAR (American and Qatari Modeling of Arabic) project. They used a flexible annotation scheme that allows for the introduction of new NE tags. As Arabic Wikipedia is not tagged for NEs, they adopted semi-supervised learning (self-training) for building their own corpus. The learning method does not utilize any gazetteer. Once the evaluation corpus is built, a supervised learning method can be used to develop and evaluate an NER classifier. The feature space consists of 15 contextual and lexical features capturing local context and shallow morphology. Morphological features are extracted from MADA output. The training model is built using the structured perceptron described in Collins (2002). This framework allows them to manipulate two key elements of the model: the features and the loss function used in training. This function measures the recognition error for each token/word, which is the difference between the correct and predicted label. It penalizes Recall errors (i.e., it aims to reduce false negatives, which arise when a token/word that is part of an actual NE is erroneously predicted as a non-entity); low Recall is the chief difficulty for the news-text–trained model in the new domain. The system was tested on 24 Wikipedia articles37 for possible combinations of the supervised learning phase with self-training on unlabeled Wikipedia data. The experimental results showed improvements on F-measure by the proposed Recall-oriented model in both stages of learning. When Recall-oriented bias is used in the supervised phase, the recall gains are substantial: nearly 8% over the baseline. Integrating this bias within self-training produces a more modest improvement of about 4% relative to the baseline.
In both cases, the improvements to recall more than compensate for the degradation in Precision.

36 http://www.rdi-eg.com/.
37 A small corpus of Arabic Wikipedia articles was developed via a flexible entity annotation scheme spanning four topical domains (history, technology, science, and sports); this is publicly available at http://www.ark.cs.cmu.edu/AQMAR.

Zayed and El-Beltagy (2012) proposed a person NER system that automatically generates dictionaries of male and female first names as well as family names by a pre-processing step. It relies on ASVMTools (Diab, Hacioglu, and Jurafsky 2004) for POS tagging to identify proper nouns. Thereafter, the dictionaries are expanded using Web sites listing Arabic given names. The system takes into consideration the common prefixes of person names. For example, a name may take a prefix such as ال (AL, the), أبو (Abu, father of), بن (Bin, son of), or عبد (Abd, servant of), or a combination of prefixes such as أبو عبد (Abu Abd, father of servant of). It also takes into consideration the common embedded words in compound names. For example, the person names نور الدين (Nour Al-dain) or شمس الدين (Shams Al-dain) have الدين (Al-dain) as an embedded word. The ambiguity of having a person name as a non-NE in the text is resolved by heuristic disambiguation rules. The system is evaluated on two data sets: MSA data sets collected from news Web sites and colloquial Arabic data sets collected from the Google Moderator page.
The overall system’s performance using an MSA test set collected from news Web sites for Precision, Recall, and F-measure was 93.52%, 87.89%, and 90.62%, respectively. In comparison, the overall system’s performance obtained using a colloquial Arabic test set collected from the Google Moderator page for Precision, Recall, and F-measure was 88.7%, 85.56%, and 87.1%, respectively.

Koulali, Meziane, and Abdelouafi (2012) developed an Arabic NER using a combined pattern extractor (a set of regular expressions) and SVM classifier that learns patterns from POS tagged text. The system covers the NE types used in the CoNLL conference, and uses a set of language-dependent and language-independent features. Arabic features include: a determiner ال (AL) feature that appears as the first letters of organization names (e.g., اليونسكو, UNESCO) and last names (e.g., عبد الرحمن الأبنودي, Abd Al-Rahman Al-Abnudi), a character-based feature that denotes common prefixes of nouns, a POS feature, and a “verb around” feature that denotes the presence of an NE if it is preceded or followed by a certain verb. The system was trained on 90% of the ANERcorp data and tested on the remainder. The system was tested with different feature combinations and the best result for an overall average F-measure was 83.20%.

Bidhend, Minaei-Bidgoli, and Jouzi (2012) presented a CRF-based NER system, called Noor, that extracts person names from religious texts. Corpora of ancient religious text called NoorCorp were developed, consisting of three genres: historic, Prophet Mohammed’s Hadith, and jurisprudence books. Noor-Gazet, a gazetteer of religious person names, was also developed. Person names were tokenized by a pre-processing step; for example, the tokenization of the full name حسن بن علي بن عبد الله بن المغيرة (Hassan bin Ali bin Abd-Allah bin Al-Moghayrah) produces six tokens as follows: حسن بن علي عبد الله المغيرة (Hassan bin Ali Abd-Allah Al-Moghayrah). Another pre-processing tool, AMIRA, was used for POS tagging. The tagging is enriched by indicating the presence of the person NE entry, if any, in Noor-Gazet. Details of the experimental setting are not provided. The F-measure for the overall system’s performance using new historic, Hadith, and jurisprudence corpora was 99.93%, 93.86%, and 75.86%, respectively.

10.3 Hybrid Systems

The hybrid approach integrates the rule-based approach with the ML-based approach in order to optimize overall performance (Petasis et al. 2001). Recently, Abdallah, Shaalan, and Shoaib (2012) proposed a hybrid NER system for Arabic. The rule-based component is a re-implementation of the NERA system (Shaalan and Raza 2008) using GATE. The ML-based component uses Decision Trees. The feature space includes the NE tags predicted by the rule-based component and other language-independent and Arabic-specific features. The system identifies the following types of NEs: person, location, and organization. The F-measure performance using ANERcorp was 92.8%, 87.39%, and 86.12% for the person, location, and organization NEs, respectively.

Continuing the research of Abdallah, Shaalan, and Shoaib (2012), the hybrid Arabic NER system is extended in the following key directions (Oudah and Shaalan 2012, 2013): 1) increasing the NEs to 11 types by adding time, measurement, phone number, filename, date, price, percent, and ISBN; 2) investigating two more ML models: SVMs and Logistic Regression; and 3) increasing the features to a larger set by adding morphological features and an English-gloss capitalization feature. The experimental results showed that the hybrid Arabic NER approach outperforms the rule-based and the ML-based components when they are processed individually. The performance obtained using ANERcorp for F-measure was 94.4% for person, 90.1% for location, and 88.2% for organization NEs.

11. Conclusion

NER is one of the most fundamental and important tasks for developing NLP systems. Accurate identification of NEs from the text plays an important role for a range of NLP systems such as machine translation and information retrieval. The literature demonstrates that explicitly devoting one step of processing to NE identification helps such systems achieve better performance levels. There are an increasing number of Arabic textual information resources available on electronic media, such as Web pages, blogs, e-mails, and text messages, which makes automated NER for the Arabic text relevant. In this survey we have presented various challenges to processing Arabic NEs, including highly ambiguous Arabic words, the absence of rigorous standards of written text, and the current state-of-the-art in Arabic NLP resources and tools.

Advances in human language technology require an ever increasing amount of data and annotation. The number of current state-of-the-art Arabic linguistic resources is still insufficient compared with Arabic’s actual importance as a language. Many existing Arabic NER resources are annotated manually or are only available at significant expense. We have described some research that adopted semi-automatic (bootstrapping) methods in order to enrich Arabic NER resources from diverse text types such as Web sources and (multilingual) corpora developed within evaluation projects.
In the Arabic NER field, NEs falling under proper names representing person, location, and organization names are commonly applied to newswire domains, reflecting the importance of these limited NEs in this domain.

We have described three main approaches that have been used to develop Arabic NER systems: linguistic rule-based, ML-based, and hybrid approaches. Rule-based systems follow a classical approach and ML-based systems follow a modern and rapidly growing approach. The main reasons for choosing the rule-based approach are the lack and limitations of Arabic linguistic resources, optimized platform architectures for rule-based systems, and the high performance of such systems. In addition, ML-based approaches have proven their usefulness as they take advantage of ML algorithms by building models that learn patterns associated with individual entity types from annotated data. The success of both the rule-based and ML-based approaches motivates the investigation of a hybrid Arabic NER approach, yielding significant improvements by exploiting the rule-based decisions on NEs as features used by the ML classifier.

Features are a critical aspect and are the key component for enhancing the performance of NER systems. We reviewed many attempts to select features that investigate the sensitivity of each entity when applied to different sets of features. We showed how researchers applied different techniques that benefit differently from the enabled features and obtain different results for varying NE types. Some suggest that NER for Arabic use not only language-independent features but also Arabic-specific features. Researchers sometimes exploit language-independent features based on promising variables, such as lexical and orthographic features, to overcome the problems related to the Arabic language and orthography. Lexical features avoid complex morphology by extracting a word’s prefix and suffix sequences from the character n-grams of its leading and trailing letters. Orthographic features attempt to overcome the lack of capitalization for NEs in Arabic by relying on the corresponding English capitalization of NEs. Alternatively, other researchers suggest including a rich set of language-specific features extracted by Arabic morpho-syntactic tools in order to deeply analyze the inherent complex structure of NEs within their context. Regardless of the features selected, various studies have reported that significant system performance is achieved when a combination that includes all features is enabled.

We have discussed many existing tools that have been used to build many different Arabic NER systems. IDEs are convenient for rapid development of NER systems. GATE is more diversified and comprehensive for developing rule-based Arabic NER systems because it has built-in gazetteers and rules and offers the ability to create new ones. On the other hand, the availability of diverse generic ML tools is sufficient for developing a wide range of Arabic NER classifiers. The main problem with these generic tools is that they are language-independent with limited support for Arabic. Fortunately, the availability of Arabic morpho-syntactic pre-processing tools, such as BAMA and its successor MADA for morphological processing and AMIRA for BPC, has lessened the need for extensive development efforts.

Almost all of the tools adopted for developing Arabic NER provide for system evaluation by calculating the value of Precision, Recall, and F-measure. Sometimes it is too expensive to acquire linguistic evaluation resources to compare a proposed system’s performance to existing systems.
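As an illustration of the evaluation measures just mentioned, the following minimal sketch computes Precision, Recall, and F-measure from sets of gold and predicted entity spans. It assumes exact-match scoring (a prediction counts only if both boundaries and type agree with the gold annotation); the span representation, the function name, and the toy data are hypothetical and are not taken from any of the surveyed tools, which may apply other matching schemes.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start token, end token, entity type).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match Precision, Recall, and balanced F-measure over entity spans."""
    true_positives = len(gold & predicted)  # spans identical in boundaries and type
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (invented for illustration): the second span has the right
# boundaries but the wrong type, so it counts as an error under exact match.
gold = {(0, 1, "PER"), (4, 4, "LOC"), (7, 8, "ORG")}
predicted = {(0, 1, "PER"), (4, 4, "ORG"), (7, 8, "ORG")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Scoring per entity type (person, location, organization) would follow the same pattern after filtering both sets by type, which is how per-type figures such as those reported for ANERcorp are usually presented.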
Fortunately, the increasing contributions from the Arabic NLP research community have been sufficient to provide a practical solution and satisfy the critical need for free corpora and gazetteers (e.g., ANERsys, which can be used to compare Arabic NER under different experimental settings).

We have reviewed the state-of-the-art in Arabic NER systems in some detail. It should be noted that the list of references provided here may not be comprehensive. Our aim was to provide a review of the essential aspects of Arabic NER and discuss major publications that have made use of those ideas. We hope that this survey provides a way to access the main branches of the literature dealing with Arabic NER research and guides researchers in interesting and fruitful research directions.

Since the presence of an NE in the context of one language points to a correspondence in other natural languages, studies of NEs in one language could provide mutual and valuable insight for developing resources and technologies that can handle NEs in many languages. This survey describes the progress made by Arabic NER research. This study might be easily extrapolated to most NLP tasks in general and to many of the morphologically complex/rich languages in particular.

References

Abdallah, Sherief, Khaled Shaalan, and Muhammad Shoaib. 2012. Integrating rule-based system with classification for Arabic named entity recognition. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 7181 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, pages 311–322.
Abdel Monem, Azza, Khaled Shaalan, Ahmed Rafea, and Hoda Baraka. 2008. Generating Arabic text in multilingual speech-to-speech machine translation framework. Machine Translation, 22(4):205–258.
AbdelRahman, Samir, Mohamed Elarnaoty, Marwa Magdy, and Aly Fahmy. 2010. Integrated machine learning techniques for Arabic named entity recognition. International Journal of Computer Science Issues (IJCSI), 7(4):27–368.
Abdul-Hamid, Ahmed and Kareem Darwish. 2010. Simplified feature set for Arabic named entity recognition. In Proceedings of the 2010 Named Entities Workshop (NEWS 2010), pages 110–115, Stroudsburg, PA.
Abdul-Mageed, Muhammad, Mona Diab, and Mohammed Korayem. 2011. Subjectivity and sentiment analysis of modern standard Arabic. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT 2011): short papers - Volume 2, pages 587–591, Stroudsburg, PA.
Abouenour, Lahsen, Karim Bouzoubaa, and Paolo Rosso. 2010. Using the yago ontology as a resource for the enrichment of named entities in Arabic wordnet. In Proceedings of The Seventh International Conference on Language Resources and Evaluation (LREC 2010) Workshop on Language Resources and Human Language Technology for Semitic Languages, pages 27–31, Valletta.
Abuleil, Saleem. 2004. Extracting names from Arabic text for question-answering systems. In Proceedings of the 7th International Conference on Coupling Approaches, Coupling Media, and Coupling Languages for Information Retrieval (RIAO 2004), pages 638–647, Vaucluse.
Al-Jumaily, Harith, Paloma Martínez, Martínez-Fernández José, and Erik Goot. 2012. A real time named entity recognition system for Arabic text mining. Language Resources and Evaluation Journal, 46(4):543–563.
Al-Onaizan, Yaser and Kevin Knight. 2002a. Machine transliteration of names in Arabic text. In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages (SEMITIC 2002), pages 1–13, Stroudsburg, PA.
Al-Onaizan, Yaser and Kevin Knight. 2002b. Translating named entities using monolingual and bilingual resources. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL 2002), pages 400–408, Stroudsburg, PA.
Al-Shalabi, Riyad, Ghassan Kanaan, Bashar Al-Sarayreh, Khalid Khanfar, Ali AIGhonmein, Hamed Talhouni, and Salem Al-Azazmeh. 2009. Proper noun extracting algorithm for the Arabic language. In International Conference on IT to Celebrate S. Charmonman’s 72nd Birthday, pages 28.1–28.9, Bangkok.
Al-Sughaiyer, Imad and Ibrahim Al-Kharashi. 2004. Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55(3):189–213.
Algahtani, Shabib. 2011. Arabic Named Entity Recognition: A Corpus-Based Study. Ph.D. thesis, The University of Manchester, UK.
Alkharashi, Ibrahim. 2009. Person named entity generation and recognition for Arabic language. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, pages 205–208, Cairo.
Attia, Mohammed, Antonio Toral, Lamia Tounsi, Monica Monachini, and Josef van Genabith. 2010. An automatically built named entity lexicon for Arabic. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pages 3,614–3,621, Valletta.
Babych, Bogdan and Anthony Hartley. 2003. Improving machine translation quality with automatic named entity recognition. In Proceedings of the 7th International EAMT workshop on MT and other Language Technology Tools, Improving MT through other Language Technology Tools: Resources and Tools for Building MT, EAMT 2003, pages 1–8, Stroudsburg, PA.
Badawy, Osama, Mohamed Shaheen, and Abdelbaki Hamadene. 2011. ARQA: An intelligent Arabic question answering system. In Proceedings of Arabic Language Technology International Conference (ALTIC 2011), pages 1–8, Alexandria.
Balasuriya, Dominic, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. 2009. Named entity recognition in wikipedia. In Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources, People’s Web 2009, pages 10–18, Stroudsburg, PA.
Ben Hamadou, Abdelmajid, Piton Odile, and Fehri Héla. 2010a. Multilingual extraction of functional relations between Arabic named entities using NooJ platform. In HAL Archives, pages 1–10, Available at http://hal.archives-ouvertes.fr/hal-00547940.
Ben Hamadou, Abdelmajid, Piton Odile, and Fehri Héla. 2010b. Recognition and Arabic-French translation of named entities: Case of the sport places. In arXiv, pages 1–10, Available at https://www.researchgate.net/publication/45898820_Recognition_and_translation_Arabic-French_of_Named_Entities_case_of_the_Sport_places.
Benajiba, Yassine, Mona Diab, and Paolo Rosso. 2008a. Arabic named entity recognition: An SVM-based approach. In Proceedings of Arab International Conference on Information Technology (ACIT 2008), pages 16–18, Hammamet.
Benajiba, Yassine, Mona Diab, and Paolo Rosso. 2008b. Arabic named entity recognition using optimized feature sets. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 284–293, Stroudsburg, PA.
Benajiba, Yassine, Mona Diab, and Paolo Rosso. 2009a. Arabic named entity recognition: A feature-driven study. IEEE Transactions on Audio, Speech, and Language Processing, 17(5):926–934.
Benajiba, Yassine, Mona Diab, and Paolo Rosso. 2009b. Using language independent and language specific features to enhance Arabic named entity recognition. The International Arab Journal of Information Technology (IAJIT), 6(5):463–471.
Benajiba, Yassine and Paolo Rosso. 2007. ANERsys 2.0: Conquering the NER task for the Arabic language by combining the maximum entropy with POS-tag information. In Proceedings of Workshop on Natural Language-Independent Engineering, 3rd Indian International Conference on Artificial Intelligence (IICAI-2007), pages 1,814–1,823, Mumbay.
Benajiba, Yassine and Paolo Rosso. 2008. Arabic named entity recognition using conditional random fields. In Proceedings of the Workshop on HLT & NLP within the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pages 143–153, Marrakech.
Benajiba, Yassine, Paolo Rosso, and José Miguel Benedí Ruiz. 2007. ANERsys: An Arabic named entity recognition system based on maximum entropy. In Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2007), pages 143–153, Berlin.
Benajiba, Yassine, Imed Zitouni, Mona Diab, and Paolo Rosso. 2010. Arabic named entity recognition: Using features extracted from noisy data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort 2010, pages 281–285, Stroudsburg, PA.
Bidhend, Majidi, Behrouz Minaei-Bidgoli, and Hosein Jouzi. 2012. Extracting person names from ancient Islamic Arabic texts. In Proceedings of Language Resources and Evaluation for Religious Texts (LRE-Rel) Workshop Programme, Eight International Conference on Language Resources and Evaluation (LREC 2012), pages 1–6, Istanbul.
Bies, Ann, Denise DiPersio, and Mohamed Maamouri. 2012. Linguistic resources for Arabic machine translation: The Linguistic Data Consortium (LDC) catalog. In Abdelhadi Soudi, Ali Farghaly, Günter Neumann, and Rabih Zbib, editors, Challenges for Arabic Machine Translation, volume 322 of Natural Language Processing 9. John Benjamins Publishing Company, Amsterdam, pages 15–22.
Brini, Wissal, Mariem Ellouze, Omar Trigui, Slim Mesfar, Lamia Hadrich, and Paolo Rosso. 2009. Factoid and definitional Arabic question answering system. In Proceedings of the NOOJ-2009, pages 1–11, Tozeur.
Buckwalter, Tim. 2002. Buckwalter Arabic morphological analyzer version 1.0. Technical Report LDC2002L49, Linguistic Data Consortium (LDC), Philadelphia, PA.
Burkett, David, Slav Petrov, John Blitzer, and Dan Klein. 2010. Learning better monolingual models with unannotated bilingual text. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL 2010, pages 46–54, Stroudsburg, PA.
Chen, Hsin-Hsi, Changhua Yang, and Ying Lin. 2003. Learning formulation and transformation rules for multilingual named entities. In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition, pages 1–8, Boston, MA.
Collins, Michael. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP 2002, pages 1–8, Stroudsburg, PA.
Cunningham, Hamish. 2002. Gate, a general architecture for text engineering. Computers and the Humanities, 36(2):223–254.
Cunningham, Hamish, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6). GATE publications.
Diab, Mona. 2009. Second generation tools (AMIRA 2.0): Fast and robust tokenization, POS tagging, and Base Phrase Chunking. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, pages 285–288, Cairo.
Diab, Mona, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of Human Language Technology-North American Association for Computational Linguistics, HLT-NAACL-Short 2004, pages 149–152, Stroudsburg, PA.
El Kholy, Ahmed and Nizar Habash. 2010. Techniques for Arabic morphological detokenization and orthographic denormalization. In Proceedings of the Workshop on Language Resources and Human Language Technology for Semitic Languages in the Language Resources and Evaluation Conference (LREC), pages 45–51, Valletta.
Elgibali, Alaa. 2005. Investigating Arabic: Current Parameters in Analysis and Learning. Studies in Semitic Languages and Linguistics Series. Brill Academic Publishers, Boston, MA.
Elkateb, Sabri, William Black, Piek Vossen, David Farwell, Adam Pease, and Christiane Fellbaum. 2006. Arabic WordNet and the challenges of Arabic. In Proceedings of Arabic NLP/MT Conference, pages 15–24, London.
Elsebai, Ali and Farid Meziane. 2011. Extracting persons names from Arabic newspapers. In Proceedings of the International Conference on Innovations in Information Technology, pages 87–89, Dubai.
Elsebai, Ali, Farid Meziane, and Fatma Belkredim. 2009. A rule based persons names Arabic extraction system. In Proceedings of the 11th International Business Information Management Association Conference (IBIMA 2009), Special Track on Arabic Information Processing, pages 53–59, Cairo.
Ezzeldin, Ahmed and Mohamed Shaheen. 2012. A survey of Arabic question answering: Challenges, tasks, approaches, tools, and future trends. In Proceedings of The 13th International Arab Conference on Information Technology (ACIT 2012), pages 1–8, Zarqa.
Farber, Benjamin, Dayne Freitag, Nizar Habash, and Owen Rambow. 2008. Improving NER in Arabic using a morphological tagger. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pages 2,509–2,514, Marrakech.
Farghaly, Ali and Khaled Shaalan. 2009. Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP), 8(4):1–22.
Fellbaum, Christiane. 2005. Wordnet and wordnets. In Keith Brown, editor, Encyclopedia of Language and Linguistics. Oxford, Elsevier, pages 665–670.
Grefenstette, Gregory, Nasredine Semmar, and Faïza Elkateb-Gara. 2005. Modifying a natural language processing system for European languages to treat Arabic in information processing and information retrieval applications. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pages 31–38, Ann Arbor, MI.
Guo, Jiafeng, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pages 267–274, New York City.
Habash, Nizar. 2010. Introduction to Arabic Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers.
Habash, Nizar and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), ACL 2005, pages 573–580, Stroudsburg, PA.
Habash, Nizar, Owen Rambow, and Ryan Roth. 2009. MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, pages 102–109, Cairo.
Halpern, Jack. 2009. Lexicon-driven approach to the recognition of Arabic named entities. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, pages 193–198, Cairo.
Hassan, Ahmed, Haytham Fahmy, and Hany Hassan. 2007. Improving named entity translation by exploiting comparable and parallel corpora. In Proceedings of the 2007 Conference on Recent Advances in Natural Language Processing (RANLP 2007), pages 1–6, Borovets.
Hassan, Hany and Jeffrey Sorensen. 2005. An integrated approach for Arabic-English named entity translation. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pages 87–93, Ann Arbor, MI.
Hewavitharana, Sanjika and Stephan Vogel. 2011. Extracting parallel phrases from comparable data. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora, 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 61–68, Portland, OR.
Higgins, Chiara, Elizabeth McGrath, and Lailla Moretto. 2010. Mturk crowdsourcing: A viable method for rapid discovery of Arabic nicknames? In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pages 89–92, Stroudsburg, PA.
Huang, Shudong, Stephanie Strassel, Alexis Mitchell, and Zhiyi Song. 2004. Shared resources for multilingual information extraction and challenges in named entity annotation. In Proceedings of the IJCNLP-04 Workshop on Named Entity Recognition for NLP Applications, pages 112–119, Hainan Island.
Kim, SeonYeong, Sung-Hwan Kim, and Hwan-Gue Cho. 2012. Developing a system for searching a shop name on a mobile device using voice recognition and GPS information. In Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication (ICUIMC 2012), pages 1–8, New York City.
Korayem, Mohammed, David Crandall, and Muhammad Abdul-Mageed. 2012. Subjectivity and sentiment analysis of Arabic: A survey. In Aboul-Ella Hassanien, Abdel-Badeeh M. Salem, Rabie Ramadan, and Tai-hoon Kim, editors, Advanced Machine Learning Technologies and Applications, volume 322 of Communications in Computer and Information Science. Springer, Berlin Heidelberg, pages 128–139.
Koulali, Rim, Meziane, and Abdelouafi. 2012. A contribution to Arabic named entity recognition. In Proceedings of the 10th International Conference on ICT and Knowledge Engineering, pages 46–52, Morocco.
Kumaran, A., Mitesh M. Khapra, and Haizhou Li. 2010. Report of NEWS 2010 transliteration mining shared task. In Proceedings of the 2010 Named Entities Workshop, NEWS 2010, pages 21–28, Stroudsburg, PA.
Lahsen, Abouenour, Karim Bouzoubaa, and Paolo Rosso. 2012. IDRAAQ: New Arabic question answering system based on query expansion and passage retrieval. In Online Working Notes/Labs/Workshop of the CLEF 2012, Question Answering for Machine Reading Evaluation (QA4MRE) main task, Rome, Italy.
Ma, Xiaoyi. 2010. Toward a name entity aligned bilingual corpus. In Proceedings of LREC 2010 Workshop on Methods for the Automatic Acquisition of Language Resources and their Evaluation Methods, pages 211–216, Valletta.
Maamouri, Mohamed, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic treebank: Building a large-scale annotated Arabic corpus. In Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools, pages 102–109, Cairo.
Maloney, John and Michael Niv. 1998. TAGARAB: A fast, accurate Arabic name recognizer using high-precision morphological analysis. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, Semitic 1998, pages 8–15, Stroudsburg, PA.
Marton, Yuval, Nizar Habash, and Owen Rambow. 2010. Improving Arabic dependency parsing with lexical and inflectional morphological features. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, SPMRL ’10, pages 13–21, Stroudsburg, PA.
Maynard, Diana, Hamish Cunningham, Kalina Bontcheva, Roberta Catizone, George Demetriou, Gaizauskas Robert, Oana Hamza, Mark Hepple, Patrick Herring, Brian Mitchell, Michael Oakes, Wim Peters, Andrea Setzer, Mark Stevenson, Valentin Tablan, Christian Ursu, and Yorick Wilks. 2000. A survey of uses of gate. Technical Report CS-00-06, Department of Computer Science, University of Sheffield.
Maynard, Diana, Hamish Cunningham, Kalina Bontcheva, and Marin Dimitrov. 2002. Adapting a robust multi-genre NE system for automatic content extraction. In Donia Scott, editor, Artificial Intelligence: Methodology, Systems, and Applications, 10th International Conference, Varna, Bulgaria, volume 2443 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, pages 264–273.
Mesfar, Slim. 2007. Named entity recognition for Arabic using syntactic grammars. In Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems (NLDB’07), pages 305–316, Berlin.
Mohit, Behrang, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah Smith. 2012. Recall-oriented learning of named entities in Arabic wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL) 2012, pages 162–173, Stroudsburg, PA.
Mollá, Diego, Menno van Zaanen, and Daniel Smith. 2006. Named entity recognition for question answering. In Zakerman Covendon, Lowrence and Ingrid, editors, Proceedings of the 2006 Australasian Language Technology Workshop (ALTW 2006), pages 51–58, Sydney.
Mostefa, Djamel, Stéphane Chaudiron, Laïb Mariama, Khalid Choukri, and Gaël de Chalendar. 2009. A multilingual named entities corpus for Arabic, English and French. In Khalid Choukri and Bente Maegaard, editors, Proceedings of the Second International Conference on Arabic Language Resources and Tools, pages 213–216, Cairo.
Nadeau, David and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.
Nezda, Luke, Andrew Hickl, John Lehmann, and Sarmad Fayyaz. 2006. What in the world is a shahab? Wide coverage named entity recognition for Arabic. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-06), pages 41–46, Genoa.
Oudah, Mai and Khaled Shaalan. 2012. A pipeline Arabic named entity recognition using a hybrid approach. In Proceedings of the International Conference on Computational Linguistics, pages 2,159–2,176, Mumbai.
Oudah, Mai and Khaled Shaalan. 2013. Person name recognition using the hybrid approach. In Elisabeth Métais, Farid Meziane, Mohamad Saraee, Vijayan Sugumaran, and Sunil Vadera, editors, Natural Language Processing and Information Systems, volume 7934 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, pages 237–248.
Pappu, Aasish. 2009. Using wikipedia for hierarchical finer categorization of named entities. In Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, pages 779–786, Hong Kong.
Petasis, Georgios, Frantz Vichot, Francis Wolinski, Georgios Paliouras, Vangelis Karkaletsis, and Constantine D. Spyropoulos. 2001. Using machine learning to maintain rule-based named-entity recognition and classification systems. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 426–433, Stroudsburg, PA.
Pouliquen, Bruno, Ralf Steinberger, Camelia Ignat, Irina Temnikova, Anna Widiger, Wajdi Zaghouani, and Jan Zizka. 2005. Multilingual person name recognition and transliteration. In arXiv, pages 1–10. Available at http://www.researchgate.net/publication/1959893_Multilingual_person_name_recognition_and_transliteration/file/d912f50ecfc2949c6d.pdf.
Ratinov, Lev and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, CO.
Refaat, Khaled and Amgad Madkour. 2009. An optimized method for Arabic cross document named entity normalization. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, pages 209–212, Cairo.
Ryding, Karin. 2005. A Reference Grammar of Modern Standard Arabic. Cambridge University Press, New York.
Salloum, Wael and Nizar Habash. 2012. Elissa: A dialectal to standard Arabic machine translation system. In Proceedings of the International Conference on Computational Linguistics: Demonstration Papers, pages 385–392, Mumbai.
Samy, Doaa, Antonio Moreno, and José Guirao. 2005. A proposal for an Arabic named entity tagger leveraging a parallel corpus. In Proceedings of the 2005 Conference on Recent Advances in Natural Language Processing (RANLP 2005), pages 459–465, Borovets.
Saravanan, K., Monojit Choudhury, Raghavendra Udupa, and A. Kumaran. 2012. An empirical study of the occurrence and co-occurrence of named entities in natural language corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 3,118–3,125, Istanbul.
Sekine, Satoshi, Kiyoshi Sudo, and Chikashi Nobata. 2002. Extended named entity hierarchy. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1,818–1,824, Las Palmas.
Shaalan, Khaled. 2010. Rule-based approach in Arabic natural language processing. The International Journal on Information and Communication Technologies (IJICT), 3(3):11–19.
Shaalan, Khaled, Mohammed Attia, Pavel Pecina, Younes Samih, and Josef van Genabith. 2012. Arabic word generation and modelling for spell checking. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC 2012), pages 719–725, Istanbul.
Shaalan, Khaled and Hafsa Raza. 2007. Person name entity recognition for Arabic. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, Semitic 2007, pages 17–24, Stroudsburg, PA.
Shaalan, Khaled and Hafsa Raza. 2008. Arabic named entity recognition from diverse text types. In Bengt Nordström and Aarne Ranta, editors, Advances in Natural Language Processing, volume 5221 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, pages 440–451.
Shaalan, Khaled and Hafsa Raza. 2009. NERA: Named entity recognition for Arabic. Journal of the American Society for Information Science and Technology, 60(8):1,652–1,663.
Shihadeh, Carolin and Günter Neumann. 2012. ARNE: A tool for named entity recognition from Arabic text. In Fourth Workshop on Computational Approaches to Arabic Script-based Languages (CAASL4), located at the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA), pages 24–31, San Diego, CA.
Smrz, Otakar. 2007. Functional Arabic Morphology Formal System and Implementation. Ph.D. thesis, Charles University in Prague, Czech Republic.
Steinberger, Ralf. 2012. A survey of methods to ease the development of highly multilingual text mining applications. Language Resources and Evaluation Journal, 46(2):155–176.
Steinberger, Ralf, Bruno Pouliquen, and Camelia Ignat. 2008. Using language-independent rules to achieve high multilinguality in text mining. In Francois Fogelman-Soulie, Domenico Perrotta, Jakub Piskorski, and Ralf Steinberger, editors, Mining Massive Data Sets for Security: Advances in Data Mining, Search, Social Networks and Text Mining, and their Applications to Security, volume 19 of Information and Communication Security. IOS Press, Amsterdam, Netherlands, pages 217–240.
Steinberger, Ralf, Bruno Pouliquen, and Erik Van der Goot. 2009. An introduction to the Europe media monitor family of applications. In Information Access in a Multilingual World-Proceedings of the SIGIR 2009 Workshop (SIGIR-CLIR 2009), pages 1–8, Boston, MA.
Strassel, Stephanie, Alexis Mitchell, and Shudong Huang. 2003. Multilingual resources for entity extraction. In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-Language Named Entity Recognition - Volume 15, MultiNER ’03, pages 49–56, Stroudsburg, PA.
Traboulsi, Hayssam. 2009. Arabic named entity extraction: A local grammar-based approach. In Proceedings of the International Multi-conference on Computer Science and Information Technology (IMCSIT 2009), pages 139–143, Mragowo.
Trigui, Omar, Lamia Belguith, Paolo Rosso, Hichem Ben Amor, and Bilel Gafsaoui. 2012. Arabic QA4MRE at CLEF 2012: Arabic question answering for machine reading evaluation. In Online Working Notes/Labs/Workshop of the CLEF 2012, Question Answering for Machine Reading Evaluation (QA4MRE main task), Rome, Italy. CLEF.
Witten, Ian, Eibe Frank, and Mark Hall. 2011. Data Mining: Practical Machine Learning Tools and Techniques. The Morgan Kaufmann Series in Data Management Systems. Elsevier Science.
Yang, Yiming. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2):69–90.
Zaghouani, Wajdi. 2012. RENAR: A rule-based Arabic named entity recognition system. ACM Transactions on Asian Language Information Processing (TALIP), 11(1):2:1–2:13.
Zaghouani, Wajdi, Bruno Pouliquen, Mohamed Ebrahim, and Ralf Steinberger. 2010. Adapting a resource-light highly multilingual named entity recognition system to Arabic. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pages 563–567, Valletta.
Zaraket, Fadi and Jad Makhlouta. 2012. Arabic cross-document NLP for the hadith and biography literature. In Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2012), pages 256–261, Marco Island, FL.
Zayed, Omnia and Samhaa El-Beltagy. 2012. Person name extraction from modern standard Arabic or colloquial text. In Proceedings of the 8th International Conference on Informatics and Systems Conference (INFOS2012), NLP track, pages 44–48, Cairo.
Zitouni, Imed, Jeff Sorensen, Xiaoqiang Luo, and Radu Florian. 2005. The impact of morphological stemming on Arabic mention detection and coreference resolution. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Semitic 2005, pages 63–70, Stroudsburg, PA.