Transactions of the Association for Computational Linguistics, 2 (2014) 27–40. Action Editor: Kristina Toutanova.

Submitted 1/2013; Revised 7/2013; Published 2/2014. © 2014 Association for Computational Linguistics.

Automatic Detection and Language Identification of Multilingual Documents

Marco Lui♥♣, Jey Han Lau♠ and Timothy Baldwin♥♣
♥ Department of Computing and Information Systems, The University of Melbourne
♣ NICTA Victoria Research Laboratory
♠ Department of Philosophy, King's College London
mhlui@unimelb.edu.au, jeyhan.lau@gmail.com, tb@ldwin.net

Abstract

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.

1 Introduction

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. Language identification techniques commonly assume that every document is written in one of a closed set of known languages for which there is training data, and the task is thus formulated as selecting the most likely language from the set of training languages. In this work, we remove this monolingual assumption, and address the problem of language identification in documents that may contain text from more than one language from the candidate set. We propose a method that concurrently detects that a document is multilingual, and estimates the proportion of the document that is written in each language.

Detecting multilingual documents has a variety of applications. Most natural language processing techniques presuppose monolingual input data, so inclusion of data in foreign languages introduces noise, and can degrade the performance of NLP systems (Alex et al., 2007; Cook and Lui, 2012). Automatic detection of multilingual documents can be used as a pre-filtering step to improve the quality of input data. Detecting multilingual documents is also important for acquiring linguistic data from the web (Scannell, 2007; Abney and Bird, 2010), and has applications in mining bilingual texts for statistical machine translation from online resources (Resnik, 1999; Nie et al., 1999; Ling et al., 2013). There has been particular interest in extracting text resources for low-density languages from multilingual web pages containing both the low-density language and another language such as English (Yamaguchi and Tanaka-Ishii, 2012; King and Abney, 2013). King and Abney (2013, p1118) specifically mention the need for an automatic method "to examine a multilingual document, and with high accuracy, list the languages that are present in the document".

We introduce a method that is able to detect multilingual documents, and simultaneously identify each language present as well as estimate the proportion of the document written in that language. We achieve this with a probabilistic mixture model, using a document representation developed for monolingual language identification (Lui and Baldwin, 2011). The model posits that each document is generated as samples from an unknown mixture of languages from the training set. We introduce a Gibbs sampler to map samples to languages for any given set of languages, and use this to select the set of languages that maximizes the posterior probability of the document.

Our method is able to learn a language identifier for multilingual documents from monolingual training data. This is an important property as there are no standard corpora of multilingual documents available, whereas corpora of monolingual documents are readily available for a reasonably large number of languages (Lui and Baldwin, 2011). We demonstrate the effectiveness of our method empirically, firstly by evaluating it on synthetic datasets drawn from Wikipedia data, and then by applying it to real-world data, showing that we are able to identify multilingual documents in targeted web crawls of minority languages (King and Abney, 2013).

Our main contributions are: (1) we present a method for identifying multilingual documents, the languages contained therein and the relative proportion of the document in each language; (2) we show that our method outperforms state-of-the-art methods for language identification in multilingual documents; (3) we show that our method is able to estimate the proportion of the document in each language to a high degree of accuracy; and (4) we show that our method is able to identify multilingual documents in real-world data.

2 Background

Most language identification research focuses on language identification for monolingual documents (Hughes et al., 2006). In monolingual LangID, the task is to assign each document D a unique language Li ∈ L. Some work has reported near-perfect accuracy for language identification of large documents in a small number of languages (Cavnar and Trenkle, 1994; McNamee, 2005). However, in order to attain such accuracy, a large number of simplifying assumptions have to be made (Hughes et al., 2006; Baldwin and Lui, 2010a). In this work, we tackle the assumption that each document is monolingual, i.e. it contains text from a single language.

In language identification, documents are modeled as a stream of characters (Cavnar and Trenkle, 1994; Kikui, 1996), often approximated by the corresponding stream of bytes (Kruengkrai et al., 2005; Baldwin and Lui, 2010a) for robustness over variable character encodings. In this work, we follow Baldwin and Lui (2010a) in training a single model for languages that naturally use multiple encodings (e.g. UTF8, Big5 and GB encodings for Chinese), as issues of encoding are not the focus of this research.

The document representation used for language identification generally involves estimating the relative distributions of particular byte sequences, selected such that their distributions differ between languages. In some cases the relevant sequences may be externally specified, such as function words and common suffixes (Giguet, 1995) or grammatical word classes (Dueire Lins and Gonçalves, 2004), though they are more frequently learned from labeled data (Cavnar and Trenkle, 1994; Grefenstette, 1995; Prager, 1999a; Lui and Baldwin, 2011).

Learning algorithms applied to language identification fall into two general categories: Bayesian classifiers and nearest-prototype (Rocchio-style) classifiers. Bayesian approaches include Markov processes (Dunning, 1994), naive Bayes methods (Grefenstette, 1995; Lui and Baldwin, 2011; Tiedemann and Ljubešić, 2012), and compressive models (Teahan, 2000). The nearest-prototype methods vary primarily in the distance measure used, including measures based on rank order statistics (Cavnar and Trenkle, 1994), information theory (Baldwin and Lui, 2010a), string kernels (Kruengkrai et al., 2005) and vector space models (Prager, 1999a; McNamee, 2005).

Language identification has been applied in domains such as USENET messages (Cavnar and Trenkle, 1994), web pages (Kikui, 1996; Martins and Silva, 2005; Liu and Liang, 2008), web search queries (Ceylan and Kim, 2009; Bosca and Dini, 2010), mining the web for bilingual text (Resnik, 1999; Nie et al., 1999), building minority language corpora (Ghani et al., 2004; Scannell, 2007; Bergsma et al., 2012) as well as a large-scale database of Interlinear Glossed Text (Xia et al., 2010), and the construction of a large-scale multilingual web crawl (Callan and Hoy, 2009).

2.1 Multilingual Documents

Language identification over documents that contain text from more than one language has been identified as an open research question (Hughes et al., 2006). Common examples of multilingual documents are web pages that contain excerpts from another language, and documents from multilingual organizations such as the European Union.

Table 1: Examples of per-language byte sequences selected by information gain.

                English    French     Italian    German     Dutch    Japanese
    character   "the "     "pour"     " di "     " auf"     "voo"    は
    byte        74686520   706F7572   20646920   20617566   766F6F   E381AF

The Australasian Language Technology Workshop 2010 hosted a shared task where participants were required to predict the language(s) present in a held-out test set containing monolingual and bilingual documents (Baldwin and Lui, 2010b). The dataset was prepared using data from Wikipedia, and bilingual documents were produced using a segment from a page in one language, and a segment from the same page in another language. We use the dataset from this shared task for our initial experiments.

To the authors' knowledge, the only other work to directly tackle identification of multiple languages and their relative proportions in a single document is the LINGUINI system (Prager, 1999a). The system is based on a vector space model, and cosine similarity between a feature vector for the test document and a feature vector for each language Li, computed as the sum of feature vectors for all the documents for language Li in the training data. The elements in the feature vectors are frequency counts over byte n-grams (2 ≤ n ≤ 5) and words. Language identification for multilingual documents is performed through the use of virtual mixed languages. Prager (1999a) shows how to construct vectors representative of particular combinations of languages independent of the relative proportions, and proposes a method for choosing combinations of languages to consider for any given document.

Language identification in multilingual documents could also be performed by application of supervised language segmentation algorithms. Given a system that can segment a document into labeled monolingual segments, we can then extract the languages present as well as the relative proportion of text in each language. Several methods for supervised language segmentation have been proposed. Teahan (2000) proposed a system based on text compression that identifies multilingual documents by first segmenting the text into monolingual blocks. Rehurek and Kolkus (2009) perform language segmentation by computing a relevance score between terms and languages, smoothing across adjoining terms and finally identifying points of transition between high and low relevance, which are interpreted as boundaries between languages. Yamaguchi and Tanaka-Ishii (2012) use a minimum description length approach, embedding a compressive model to compute the description length of text segments in each language. They present a linear-time dynamic programming solution to optimize the location of segment boundaries and language labels.

3 Methodology

Language identification for multilingual documents is a multi-label classification task, in which a document can be mapped onto any number of labels from a closed set. In the remainder of this paper, we denote the set of all languages by L. We denote a document D which contains languages Lx and Ly as D → {Lx, Ly}, where Lx, Ly ∈ L. We denote a document that does not contain a language Lx by D → {¬Lx}, though we generally omit all the languages not contained in the document for brevity. We denote classifier output using ⇒; e.g. D ⇒ {La, Lb} indicates that document D has been predicted to contain text in languages La and Lb.

3.1 Document Representation and Feature Selection

We represent each document D as a frequency distribution over byte n-gram sequences such as those in Table 1. Each document is converted into a vector where each entry counts the number of times a particular byte n-gram is present in the document. This is analogous to a bag-of-words model, where the vocabulary of "words" is a set of byte sequences that has been selected to distinguish between languages.

The exact set of features is selected from the training data using Information Gain (IG), an information-theoretic metric developed as a splitting criterion for decision trees (Quinlan, 1993). IG-based feature selection combined with a naive Bayes classifier has been shown to be particularly effective for language identification (Lui and Baldwin, 2011).
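To make this representation concrete, the sketch below converts documents to byte n-gram counts and ranks candidate features by information gain over the language labels. This is our own minimal rendering, not the authors' implementation: the n-gram range, the presence/absence split used for the IG computation, the cutoff of 400, and all function names are assumptions.

```python
import math
from collections import Counter

def byte_ngrams(text, n_min=1, n_max=4):
    """All byte n-grams of the UTF-8 encoding of a document."""
    data = text.encode("utf-8")
    return [data[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(data) - n + 1)]

def entropy(counts):
    """Entropy (bits) of a distribution given as a Counter of counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)

def information_gain(docs):
    """docs: list of (language, text) pairs. Scores each byte n-gram by
    how much knowing its presence/absence reduces label entropy."""
    labels = Counter(lang for lang, _ in docs)
    h_prior = entropy(labels)
    present = {}  # n-gram -> Counter of languages of docs containing it
    for lang, text in docs:
        for g in set(byte_ngrams(text)):
            present.setdefault(g, Counter())[lang] += 1
    n_docs = len(docs)
    scores = {}
    for g, with_g in present.items():
        without_g = labels - with_g          # docs lacking this n-gram
        p = sum(with_g.values()) / n_docs
        scores[g] = (h_prior
                     - p * entropy(with_g)
                     - (1 - p) * entropy(without_g))
    return scores

docs = [("en", "the cat and the dog"),
        ("fr", "pour le chat et le chien")]
scores = information_gain(docs)
features = sorted(scores, key=scores.get, reverse=True)[:400]
```

On this toy corpus, an n-gram occurring in both languages (e.g. b"e") carries zero gain, while one confined to a single language (e.g. b"the ") splits the labels perfectly and scores highest.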

3.2 Generative Mixture Models

Generative mixture models are popular for text modeling tasks where a mixture of influences governs the content of a document, such as in multi-label document classification (McCallum, 1999; Ramage et al., 2009), and topic modeling (Blei et al., 2003). Such models normally assume full exchangeability between tokens (i.e. the bag-of-words assumption), and label each token with a single discrete label.

Multi-label text classification, topic modeling and our model for language identification in multilingual documents share the same fundamental representation of the latent structure of a document. Each label is modeled with a probability distribution over tokens, and each document is modeled as a probabilistic mixture of labels. As presented in Griffiths and Steyvers (2004), the probability of the i-th token (w_i) given a set of T labels z_1 · · · z_T is modeled as:

    P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j) \, P(z_i = j)    (1)

The set of tokens w is the document itself, which in all cases is observed. In the case of topic modeling, the tokens are words and the labels are topics, and z is latent. Whereas topic modeling is generally unsupervised, multi-label text classification is a supervised text modeling task, where the labels are a set of pre-defined categories (such as RUBBER, IRON-STEEL, TRADE, etc. in the popular Reuters-21578 dataset (Lewis, 1997)), and the tokens are individual words in documents. z is still latent, but constrained in the training data (i.e. documents are labeled but the individual words are not). Some approaches to labeling unseen documents require that z for the training data be inferred, and methods for doing this include an application of the Expectation-Maximization (EM) algorithm (McCallum, 1999) and Labeled LDA (Ramage et al., 2009).

The model that we propose for language identification in multilingual documents is similar to multi-label text classification. In the framework of Equation 1, each per-token label z_i is a language and the vocabulary of tokens is not given by words but rather by specific byte sequences (Section 3.1). The key difference with multi-label text classification is that we use monolingual (i.e. mono-label) training data. Hence, z is effectively observed for the training data (since all tokens must share the same label). To infer z for unlabeled documents, we utilize a Gibbs sampler, closely related to that proposed by Griffiths and Steyvers (2004) for LDA. The sampling probability for a label z_i for token w in a document d is:

    P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \phi^{(w)}_j \cdot \theta^{(d)}_j    (2)

where

    \phi^{(w)}_j = P(w_i \mid z_i = j, \mathbf{z}_{-i}, \mathbf{w}_{-i})
    \theta^{(d)}_j = P(z_i = j \mid \mathbf{z}_{-i})

In the LDA model, \theta^{(d)}_j is assumed to have a Dirichlet distribution with hyperparameter α, and the word distribution for each topic \phi^{(w)}_j is also assumed to have a Dirichlet distribution with hyperparameter β. Griffiths (2002) describes a generative model for LDA where both \phi^{(w)}_j and \theta^{(d)}_j are inferred from the output of a Gibbs sampler. In our method, we estimate \phi^{(w)}_j using maximum likelihood estimation (MLE) from the training data. Estimating \phi^{(w)}_j through MLE is equivalent to a multinomial Naive Bayes model (McCallum and Nigam, 1998):

    \hat{\phi}^{(w)}_j = \frac{n^{(w)}_j + \beta}{n^{(\cdot)}_j + W\beta}    (3)

where n^{(w)}_j is the number of times word w occurs with label j, n^{(\cdot)}_j is the total number of words that occur with label j, and W is the size of the vocabulary. By setting β to 1, we obtain standard Laplacian smoothing. Hence, only \hat{\theta}^{(d)}_j is updated at each step in the Gibbs sampler:

    \hat{\theta}^{(d)}_j = \frac{n^{(d)}_{-i,j} + \alpha}{n^{(d)}_{-i} + T\alpha}    (4)

where n^{(d)}_{-i,j} is the number of tokens in document d that are currently mapped to language j, and n^{(d)}_{-i} is the total number of tokens in document d. In both cases, the current assignment of z_i is excluded from the count. T is the number of languages (i.e. the size of the label set). For simplicity, we set α to 0. We note that in the LDA model, α and β influence the sparsity of the solution, and so it may be possible to tune these parameters for our model as well. We leave this as an avenue for further research.
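The updates above can be sketched as a small Gibbs sampler: φ̂ is fixed from monolingual counts as in Equation 3, and only the document's per-language label counts (Equation 4) change from step to step. This is our own illustrative rendering, not the authors' code. Two liberties are taken, both flagged in comments: the toy example uses words rather than byte n-grams as tokens purely for readability, and it passes a small α > 0 so that a language whose count transiently hits zero remains reachable (the paper sets α = 0).

```python
import random
from collections import Counter

def train_phi(docs, beta=1.0):
    """Per-language token distributions by MLE with add-beta smoothing
    (Equation 3). docs: list of (language, token list) pairs."""
    vocab = {w for _, toks in docs for w in toks}
    counts = {}
    for lang, toks in docs:
        counts.setdefault(lang, Counter()).update(toks)
    W = len(vocab)
    phi = {}
    for j, c in counts.items():
        total = sum(c.values())
        phi[j] = {w: (c[w] + beta) / (total + W * beta) for w in vocab}
    return phi, sorted(counts)

def gibbs_mixture(tokens, phi, langs, iters=200, alpha=0.0, seed=0):
    """Sample a language label per token (Equation 2). phi is fixed;
    theta is represented implicitly by the current label counts
    (Equation 4; its denominator is constant over j, so it cancels
    when the sampling weights are normalized). Returns estimated
    language proportions tabulated after burn-in."""
    rng = random.Random(seed)
    z = [rng.choice(langs) for _ in tokens]
    n = Counter(z)                             # current assignments per language
    tally = Counter()
    for it in range(iters):
        for i, w in enumerate(tokens):
            n[z[i]] -= 1                       # exclude the current assignment
            weights = [phi[j].get(w, 0.0) * (n[j] + alpha) for j in langs]
            if sum(weights) == 0.0:            # token impossible everywhere
                weights = [1.0] * len(langs)
            z[i] = rng.choices(langs, weights=weights, k=1)[0]
            n[z[i]] += 1
            if it >= iters // 2:               # tabulate after burn-in
                tally[z[i]] += 1
    total = sum(tally.values())
    return {j: tally[j] / total for j in langs}

phi, langs = train_phi([
    ("en", "the cat sat on the mat".split()),
    ("fr", "le chat et le chien".split()),
])
mixed = "the cat sat le chat le chien the mat".split()
props = gibbs_mixture(mixed, phi, langs, alpha=0.1)  # alpha > 0: our tweak
```

`props` is the normalized tabulation of sampled labels, i.e. the estimated proportion of the document in each language.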

3.3 Language Identification in Multilingual Documents

The model described in Section 3.2 can be used to compute the most likely distribution to have generated an unlabeled document over a given set of languages for which we have monolingual training data, by letting the set of terms w be the byte n-gram sequences we selected using per-language information gain (Section 3.1), and allowing the labels z to range over the set of all languages L. Using training data, we compute \hat{\phi}^{(w)}_j (Equation 3), and then we infer P(L_j|D) for each L_j ∈ L for the unlabeled document, by running the Gibbs sampler until the samples for z_i converge, and then tabulating z_i over the whole of d and normalizing by |d|. Naively, we could identify the languages present in the document by D ⇒ {L_x if ∃(z_i = L_x|D)}, but closely-related languages tend to have similar frequency distributions over byte n-gram features, and hence it is likely that some tokens will be incorrectly mapped to a language that is similar to the "correct" language.

We address this issue by finding the subset of languages λ from the training set L that maximizes P(λ|D) (a similar approach is taken in McCallum (1999)). Through an application of Bayes' theorem, P(λ|D) ∝ P(D|λ) · P(λ), noting that P(D) is a normalizing constant and can be dropped. We assume that P(λ) is constant (i.e. any subset of languages is equally likely, a reasonable assumption in the absence of other evidence), and hence maximize P(D|λ). For any given D = w_1 · · · w_N and λ, we infer P(D|λ) from the output of the Gibbs sampler:

    P(D \mid \lambda) = \prod_{i=1}^{N} P(w_i \mid \lambda)    (5)
                      = \prod_{i=1}^{N} \sum_{j \in \lambda} P(w_i \mid z_i = j) \, P(z_i = j)    (6)

where both P(w_i | z_i = j) and P(z_i = j) are estimated by their maximum likelihood estimates.

In practice, exhaustive evaluation of the power set of L is prohibitively expensive, and so we greedily approximate the optimal λ using Algorithm 1. In essence, we initially rank all the candidate languages by computing the most likely distribution over the full set of candidate languages. Then, for each of the top-N languages in turn, we consider whether

Algorithm 1 DetectLang(L, D)
    L_N ← top-N z ∈ L by P(z|D)
    λ ← {L_u}
    for each L_t ∈ L_N do
        λ′ ← λ ∪ {L_t}
        if P(D|λ′) > P(D|λ) + t
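Equations 5 and 6, and the greedy subset search they feed, might be sketched as follows. This is a hypothetical rendering under our own names, not the paper's implementation: the renormalization of θ over λ, the seeding of λ with the single top-ranked language, and the acceptance margin t are all assumptions on our part.

```python
import math

def log_p_doc(tokens, lam, phi, theta):
    """log P(D | λ) per Equations 5-6: product over tokens of the
    λ-restricted mixture probability. theta is renormalized over λ
    (an assumption on our part)."""
    z = sum(theta[j] for j in lam)
    if z == 0.0:
        return float("-inf")
    logp = 0.0
    for w in tokens:
        p = sum(phi[j].get(w, 0.0) * theta[j] / z for j in lam)
        if p == 0.0:
            return float("-inf")  # token impossible under every language in λ
        logp += math.log(p)
    return logp

def detect_langs(tokens, phi, theta, top_n=5, t=0.0):
    """Greedy approximation of argmax_λ P(D | λ): rank languages by
    their estimated share theta, seed λ with the top one, and admit
    each further candidate only if it lifts the log-likelihood by
    more than the margin t (our reading of the acceptance test)."""
    ranked = sorted(theta, key=theta.get, reverse=True)[:top_n]
    lam = [ranked[0]]
    best = log_p_doc(tokens, lam, phi, theta)
    for cand in ranked[1:]:
        trial = log_p_doc(tokens, lam + [cand], phi, theta)
        if trial > best + t:
            lam.append(cand)
            best = trial
    return lam

# Toy example: hand-built P(w|z) tables and a document mixing
# English and French tokens; "de" is a distractor language.
phi = {"en": {"the": 0.5, "cat": 0.5},
       "fr": {"le": 0.5, "chat": 0.5},
       "de": {"der": 0.5, "hund": 0.5}}
theta = {"en": 0.5, "fr": 0.4, "de": 0.1}
result = detect_langs(["the", "cat", "le", "chat"], phi, theta)
```

In the toy run, adding "fr" rescues the French tokens (which have zero probability under "en" alone), while adding "de" only dilutes the mixture and lowers the likelihood, so it is rejected.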