Transactions of the Association for Computational Linguistics, 2 (2014) 27–40. Action Editor: Kristina Toutanova.
Submitted 1/2013; Revised 7/2013; Published 2/2014. © 2014 Association for Computational Linguistics.
Automatic Detection and Language Identification of Multilingual Documents

Marco Lui♥♣, Jey Han Lau♠ and Timothy Baldwin♥♣
♥ Department of Computing and Information Systems, The University of Melbourne
♣ NICTA Victoria Research Laboratory
♠ Department of Philosophy, King's College London
mhlui@unimelb.edu.au, jeyhan.lau@gmail.com, tb@ldwin.net

Abstract

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.

1 Introduction

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. Language identification techniques commonly assume that every document is written in one of a closed set of known languages for which there is training data, and is thus formulated as the task of selecting the most likely language from the set of training languages. In this work, we remove this monolingual assumption, and address the problem of language identification in documents that may contain text from more than one language from the candidate set. We propose a method that concurrently detects that a document is multilingual, and estimates the proportion of the document that is written in each language.

Detecting multilingual documents has a variety of applications. Most natural language processing techniques presuppose monolingual input data, so inclusion of data in foreign languages introduces noise, and can degrade the performance of NLP systems (Alex et al., 2007; Cook and Lui, 2012). Automatic detection of multilingual documents can be used as a pre-filtering step to improve the quality of input data. Detecting multilingual documents is also important for acquiring linguistic data from the web (Scannell, 2007; Abney and Bird, 2010), and has applications in mining bilingual texts for statistical machine translation from online resources (Resnik, 1999; Nie et al., 1999; Ling et al., 2013). There has been particular interest in extracting text resources for low-density languages from multilingual web pages containing both the low-density language and another language such as English (Yamaguchi and Tanaka-Ishii, 2012; King and Abney, 2013). King and Abney (2013, p1118) specifically mention the need for an automatic method "to examine a multilingual document, and with high accuracy, list the languages that are present in the document".

We introduce a method that is able to detect multilingual documents, and simultaneously identify each language present as well as estimate the proportion of the document written in that language. We achieve this with a probabilistic mixture model, using a document representation developed for monolingual language identification (Lui and Baldwin, 2011). The model posits that each document is generated as samples from an unknown mixture of languages from the training set. We introduce a Gibbs sampler to map samples to languages for any given set of languages, and use this to select the set of languages that maximizes the posterior probability of the document.
Our method is able to learn a language identifier for multilingual documents from monolingual training data. This is an important property, as there are no standard corpora of multilingual documents available, whereas corpora of monolingual documents are readily available for a reasonably large number of languages (Lui and Baldwin, 2011). We demonstrate the effectiveness of our method empirically, firstly by evaluating it on synthetic datasets drawn from Wikipedia data, and then by applying it to real-world data, showing that we are able to identify multilingual documents in targeted web crawls of minority languages (King and Abney, 2013).

Our main contributions are: (1) we present a method for identifying multilingual documents, the languages contained therein and the relative proportion of the document in each language; (2) we show that our method outperforms state-of-the-art methods for language identification in multilingual documents; (3) we show that our method is able to estimate the proportion of the document in each language to a high degree of accuracy; and (4) we show that our method is able to identify multilingual documents in real-world data.

2 Background

Most language identification research focuses on language identification for monolingual documents (Hughes et al., 2006). In monolingual LangID, the task is to assign each document a unique language Li ∈ L. Some work has reported near-perfect accuracy for language identification of large documents in a small number of languages (Cavnar and Trenkle, 1994; McNamee, 2005). However, in order to attain such accuracy, a large number of simplifying assumptions have to be made (Hughes et al., 2006; Baldwin and Lui, 2010a). In this work, we tackle the assumption that each document is monolingual, i.e. it contains text from a single language.

In language identification, documents are modeled as a stream of characters (Cavnar and Trenkle, 1994; Kikui, 1996), often approximated by the corresponding stream of bytes (Kruengkrai et al., 2005; Baldwin and Lui, 2010a) for robustness over variable character encodings. In this work, we follow Baldwin and Lui (2010a) in training a single model for languages that naturally use multiple encodings (e.g. UTF8, Big5 and GB encodings for Chinese), as issues of encoding are not the focus of this research.

The document representation used for language identification generally involves estimating the relative distributions of particular byte sequences, selected such that their distributions differ between languages. In some cases the relevant sequences may be externally specified, such as function words and common suffixes (Giguet, 1995) or grammatical word classes (Dueire Lins and Gonçalves, 2004), though they are more frequently learned from labeled data (Cavnar and Trenkle, 1994; Grefenstette, 1995; Prager, 1999a; Lui and Baldwin, 2011).

Learning algorithms applied to language identification fall into two general categories: Bayesian classifiers and nearest-prototype (Rocchio-style) classifiers. Bayesian approaches include Markov processes (Dunning, 1994), naive Bayes methods (Grefenstette, 1995; Lui and Baldwin, 2011; Tiedemann and Ljubešić, 2012), and compressive models (Teahan, 2000). The nearest-prototype methods vary primarily in the distance measure used, including measures based on rank order statistics (Cavnar and Trenkle, 1994), information theory (Baldwin and Lui, 2010a), string kernels (Kruengkrai et al., 2005) and vector space models (Prager, 1999a; McNamee, 2005).

Language identification has been applied in domains such as USENET messages (Cavnar and Trenkle, 1994), web pages (Kikui, 1996; Martins and Silva, 2005; Liu and Liang, 2008), web search queries (Ceylan and Kim, 2009; Bosca and Dini, 2010), mining the web for bilingual text (Resnik, 1999; Nie et al., 1999), building minority language corpora (Ghani et al., 2004; Scannell, 2007; Bergsma et al., 2012) as well as a large-scale database of Interlinear Glossed Text (Xia et al., 2010), and the construction of a large-scale multilingual web crawl (Callan and Hoy, 2009).

2.1 Multilingual Documents

Language identification over documents that contain text from more than one language has been identified as an open research question (Hughes et al., 2006). Common examples of multilingual documents are web pages that contain excerpts from another language, and documents from multilingual organizations such as the European Union.
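The byte-stream approximation described above can be illustrated directly: the same character stream yields entirely different byte streams under different encodings, which is why a single model per language must be trained over all the encodings that language naturally uses. A minimal sketch (the helper name is ours, for illustration only):

```python
def encode_all(text, encodings=("utf-8", "big5", "gb2312")):
    """Encode one character stream under several encodings, yielding the
    distinct byte streams a byte-based LangID model actually observes."""
    return {enc: text.encode(enc) for enc in encodings}

# the same two Chinese characters produce three different byte streams
streams = encode_all("中文")
```

A character-level model would see a single distribution here; a byte-level model sees three, so the training data for Chinese must cover all three encodings.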
Language    character   byte
English     the         74 68 65
French      pour        70 6F 75 72
Italian     di          64 69
German      auf         61 75 66
Dutch       voor        76 6F 6F 72
Japanese    は          E3 81 AF

Table 1: Examples of per-language byte sequences selected by information gain.

The Australasian Language Technology Workshop 2010 hosted a shared task where participants were required to predict the language(s) present in a held-out test set containing monolingual and bilingual documents (Baldwin and Lui, 2010b). The dataset was prepared using data from Wikipedia, and bilingual documents were produced using a segment from a page in one language, and a segment from the same page in another language. We use the dataset from this shared task for our initial experiments.

To the authors' knowledge, the only other work to directly tackle identification of multiple languages and their relative proportions in a single document is the LINGUINI system (Prager, 1999a). The system is based on a vector space model, and cosine similarity between a feature vector for the test document and a feature vector for each language Li, computed as the sum of feature vectors for all the documents for language Li in the training data. The elements in the feature vectors are frequency counts over byte n-grams (2 ≤ n ≤ 5) and words. Language identification for multilingual documents is performed through the use of virtual mixed languages. Prager (1999a) shows how to construct vectors representative of particular combinations of languages independent of the relative proportions, and proposes a method for choosing combinations of languages to consider for any given document.

Language identification in multilingual documents could also be performed by application of supervised language segmentation algorithms. Given a system that can segment a document into labeled monolingual segments, we can then extract the languages present as well as the relative proportion of text in each language. Several methods for supervised language segmentation have been proposed. Teahan (2000) proposed a system based on text compression that identifies multilingual documents by first segmenting the text into monolingual blocks. Rehurek and Kolkus (2009) perform language segmentation by computing a relevance score between terms and languages, smoothing across adjoining terms and finally identifying points of transition between high and low relevance, which are interpreted as boundaries between languages. Yamaguchi and Tanaka-Ishii (2012) use a minimum description length approach, embedding a compressive model to compute the description length of text segments in each language. They present a linear-time dynamic programming solution to optimize the location of segment boundaries and language labels.

3 Methodology

Language identification for multilingual documents is a multi-label classification task, in which a document can be mapped onto any number of labels from a closed set. In the remainder of this paper, we denote the set of all languages by L. We denote a document d which contains languages Lx and Ly as d → {Lx, Ly}, where Lx, Ly ∈ L. We denote a document that does not contain a language Lx by d ↛ {Lx}, though we generally omit all the languages not contained in the document for brevity. We denote classifier output using ⇒; e.g. d ⇒ {La, Lb} indicates that document d has been predicted to contain text in languages La and Lb.

3.1 Document Representation and Feature Selection

We represent each document d as a frequency distribution over byte n-gram sequences such as those in Table 1. Each document is converted into a vector where each entry counts the number of times a particular byte n-gram is present in the document. This is analogous to a bag-of-words model, where the vocabulary of "words" is a set of byte sequences that has been selected to distinguish between languages.

The exact set of features is selected from the training data using Information Gain (IG), an information-theoretic metric developed as a splitting criterion for decision trees (Quinlan, 1993). IG-based feature selection combined with a naive Bayes classifier has been shown to be particularly effective for language identification (Lui and Baldwin, 2011).
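The representation and the IG-based selection can be sketched as follows. This is an illustrative simplification (the function names, the n-gram length range, and the binary present/absent split are our own), not the authors' implementation:

```python
import math
from collections import Counter

def byte_ngrams(text, n_min=2, n_max=4):
    """Bag of byte n-grams for one document (length range assumed)."""
    data = text.encode("utf-8")
    bag = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(data) - n + 1):
            bag[data[i:i + n]] += 1
    return bag

def entropy(counts):
    """Shannon entropy (bits) of a label-count distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def info_gain(docs, feature):
    """IG of splitting labeled documents on presence of one byte n-gram.
    docs is a list of (language, bag) pairs."""
    prior = Counter(lang for lang, _ in docs)
    split = {True: Counter(), False: Counter()}
    for lang, bag in docs:
        split[feature in bag][lang] += 1
    ig = entropy(prior)
    for side in split.values():
        if side:
            ig -= sum(side.values()) / len(docs) * entropy(side)
    return ig

# demo: four tiny labeled documents; b"the" splits the two languages perfectly
docs = [("en", byte_ngrams("the cat")), ("en", byte_ngrams("the dog")),
        ("fr", byte_ngrams("pour le")), ("fr", byte_ngrams("pour la"))]
```

Scoring every candidate n-gram this way and retaining the top-scoring sequences per language yields the feature set; each document is then the vector of counts of the retained sequences.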
3.2 Generative Mixture Models

Generative mixture models are popular for text modeling tasks where a mixture of influences governs the content of a document, such as in multi-label document classification (McCallum, 1999; Ramage et al., 2009), and topic modeling (Blei et al., 2003). Such models normally assume full exchangeability between tokens (i.e. the bag-of-words assumption), and label each token with a single discrete label.

Multi-label text classification, topic modeling and our model for language identification in multilingual documents share the same fundamental representation of the latent structure of a document. Each label is modeled with a probability distribution over tokens, and each document is modeled as a probabilistic mixture of labels. As presented in Griffiths and Steyvers (2004), the probability of the ith token (w_i) given a set of T labels z_1 · · · z_T is modeled as:

    P(w_i) = \sum_{j=1}^{T} P(w_i | z_i = j) P(z_i = j)    (1)

The set of tokens w is the document itself, which in all cases is observed. In the case of topic modeling, the tokens are words and the labels are topics, and z is latent. Whereas topic modeling is generally unsupervised, multi-label text classification is a supervised text modeling task, where the labels are a set of pre-defined categories (such as RUBBER, IRON-STEEL, TRADE, etc. in the popular Reuters-21578 dataset (Lewis, 1997)), and the tokens are individual words in documents. z is still latent, but constrained in the training data (i.e. documents are labeled but the individual words are not). Some approaches to labeling unseen documents require that z for the training data be inferred, and methods for doing this include an application of the Expectation-Maximization (EM) algorithm (McCallum, 1999) and Labeled LDA (Ramage et al., 2009).

The model that we propose for language identification in multilingual documents is similar to multi-label text classification. In the framework of Equation 1, each per-token label z_i is a language and the vocabulary of tokens is not given by words but rather by specific byte sequences (Section 3.1). The key difference with multi-label text classification is that we use monolingual (i.e. mono-label) training data. Hence, z is effectively observed for the training data (since all tokens must share the same label). To infer z for unlabeled documents, we utilize a Gibbs sampler, closely related to that proposed by Griffiths and Steyvers (2004) for LDA. The sampling probability for a label z_i for token w in a document d is:

    P(z_i = j | z_{-i}, w) ∝ \phi_j^{(w_i)} · \theta_j^{(d)}    (2)

where

    \phi_j^{(w)} = P(w_i | z_i = j, z_{-i}, w_{-i})
    \theta_j^{(d)} = P(z_i = j | z_{-i})

In the LDA model, \theta_j^{(d)} is assumed to have a Dirichlet distribution with hyperparameter α, and the word distribution for each topic \phi_j^{(w)} is also assumed to have a Dirichlet distribution with hyperparameter β. Griffiths (2002) describes a generative model for LDA where both \phi_j^{(w)} and \theta_j^{(d)} are inferred from the output of a Gibbs sampler. In our method, we estimate \phi_j^{(w)} using maximum likelihood estimation (MLE) from the training data. Estimating \phi_j^{(w)} through MLE is equivalent to a multinomial naive Bayes model (McCallum and Nigam, 1998):

    \hat{\phi}_j^{(w)} = (n_j^{(w)} + β) / (n_j^{(·)} + Wβ)    (3)

where n_j^{(w)} is the number of times word w occurs with label j, and n_j^{(·)} is the total number of words that occur with label j. By setting β to 1, we obtain standard Laplacian smoothing. Hence, only \hat{\theta}_j^{(d)} is updated at each step in the Gibbs sampler:

    \hat{\theta}_j^{(d)} = (n_{-i,j}^{(d)} + α) / (n_{-i}^{(d)} + Tα)    (4)

where n_{-i,j}^{(d)} is the number of tokens in document d that are currently mapped to language j, and n_{-i}^{(d)} is the total number of tokens in document d. In both cases, the current assignment of z_i is excluded from the count. T is the number of languages (i.e. the size of the label set). For simplicity, we set α to 0. We note that in the LDA model, α and β influence the sparsity of the solution, and so it may be possible to tune these parameters for our model as well. We leave this as an avenue for further research.
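A minimal sketch of this sampler (our own illustrative code, not the authors' implementation): the per-language token distributions \hat{\phi} of Equation 3 are precomputed from monolingual data and passed in, while \hat{\theta} of Equation 4 is recomputed from the current counts at every step. An α parameter is kept so the sketch can be smoothed; the paper's setting α = 0 is the default.

```python
import random
from collections import Counter

def gibbs_language_mixture(tokens, phi, languages, iters=100, alpha=0.0, seed=0):
    """Sample per-token language labels z_i (Equation 2) and tabulate them
    into a per-language distribution for the document.

    phi maps each language j to its token distribution (Equation 3),
    estimated beforehand from monolingual training data."""
    rng = random.Random(seed)
    T = len(languages)
    z = [rng.choice(languages) for _ in tokens]   # random initial assignment
    counts = Counter(z)
    for _ in range(iters):
        for i, w in enumerate(tokens):
            counts[z[i]] -= 1                     # exclude current assignment
            n_rest = len(tokens) - 1
            weights = [
                phi[j].get(w, 1e-9)               # phi-hat, floored for unseen tokens
                * (counts[j] + alpha) / (n_rest + T * alpha)  # theta-hat
                for j in languages
            ]
            r = rng.random() * sum(weights)
            for j, wt in zip(languages, weights):
                r -= wt
                if r <= 0:
                    z[i] = j
                    break
            counts[z[i]] += 1
    # tabulate z over the whole document and normalize by |d|
    return {j: counts[j] / len(tokens) for j in languages}

# demo: a toy document whose tokens are 75% "en" and 25% "fr"
phi = {"en": {"a": 0.9, "b": 0.1}, "fr": {"x": 0.9, "y": 0.1}}
props = gibbs_language_mixture(["a"] * 6 + ["x"] * 2, phi, ["en", "fr"], alpha=0.5)
```

Note that with α = 0, a language whose count reaches zero can never be re-sampled, which is the sparsity effect discussed above; the demo uses a small α to avoid that absorbing state in so short a document.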
3.3 Language Identification in Multilingual Documents

The model described in Section 3.2 can be used to compute the most likely distribution to have generated an unlabeled document over a given set of languages for which we have monolingual training data, by letting the set of terms w be the byte n-gram sequences we selected using per-language information gain (Section 3.1), and allowing the labels z to range over the set of all languages L. Using training data, we compute \hat{\phi}_j^{(w)} (Equation 3), and then we infer P(L_j | d) for each L_j ∈ L for the unlabeled document, by running the Gibbs sampler until the samples for z_i converge and then tabulating z_i over the whole d and normalizing by |d|. Naively, we could identify the languages present in the document by d ⇒ {L_x if ∃(z_i = L_x | d)}, but closely-related languages tend to have similar frequency distributions over byte n-gram features, and hence it is likely that some tokens will be incorrectly mapped to a language that is similar to the "correct" language.

We address this issue by finding the subset of languages λ from the training set L that maximizes P(λ|d) (a similar approach is taken in McCallum (1999)). Through an application of Bayes' theorem, P(λ|d) ∝ P(d|λ) · P(λ), noting that P(d) is a normalizing constant and can be dropped. We assume that P(λ) is constant (i.e. any subset of languages is equally likely, a reasonable assumption in the absence of other evidence), and hence maximize P(d|λ). For any given d = w_1 · · · w_N and λ, we infer P(d|λ) from the output of the Gibbs sampler:

    P(d|λ) = \prod_{i=1}^{N} P(w_i | λ)    (5)
           = \prod_{i=1}^{N} \sum_{j ∈ λ} P(w_i | z_i = j) P(z_i = j)    (6)

where both P(w_i | z_i = j) and P(z_i = j) are estimated by their maximum likelihood estimates.

In practice, exhaustive evaluation of the power set of L is prohibitively expensive, and so we greedily approximate the optimal λ using Algorithm 1. In essence, we initially rank all the candidate languages by computing the most likely distribution over the full set of candidate languages. Then, for each of the top-N languages in turn, we consider whether adding it to λ increases the probability of the document by more than a threshold t.

Algorithm 1 DetectLang(L, d)
    L_N ← top-N z ∈ L by P(z|d)
    λ ← {L_u}
    for each L_t ∈ L_N do
        λ′ ← λ ∪ L_t
        if P(d|λ′) > P(d|λ) + t then
            λ ← λ′
    return λ
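The greedy search of Algorithm 1 can be sketched as below. This is our own simplified rendering: P(w_i | z_i = j) comes from the precomputed per-language distributions, but P(z_i = j) is approximated as uniform over the candidate set rather than taken from the sampler output, and the inclusion threshold t is applied in log space; all names are illustrative.

```python
import math

def greedy_language_set(tokens, phi, languages, n_top=5, t=2.0):
    """Greedy approximation of the optimal language set (cf. Algorithm 1).

    phi[j] approximates P(w | z = j); P(z = j) is simplified to be uniform
    over the current candidate set lambda, and t thresholds log P(d|lambda)."""
    def log_p(lam):
        # log P(d|lambda) = sum_i log sum_{j in lam} P(w_i|z_i=j) P(z_i=j)
        return sum(
            math.log(sum(phi[j].get(w, 1e-9) / len(lam) for j in lam))
            for w in tokens
        )
    # rank candidates by single-language fit, then grow lambda greedily
    ranked = sorted(languages, key=lambda j: -log_p([j]))
    lam = [ranked[0]]
    for cand in ranked[1:n_top]:
        if log_p(lam + [cand]) > log_p(lam) + t:
            lam.append(cand)
    return lam

# demo: a document mixing two toy languages, plus a distractor language "de"
phi = {"en": {"a": 0.8, "b": 0.2}, "fr": {"x": 0.8, "y": 0.2},
       "de": {"b": 0.8, "a": 0.2}}
detected = greedy_language_set(["a"] * 3 + ["x"] * 3, phi, ["en", "fr", "de"])
```

The distractor is rejected because adding it dilutes the uniform P(z = j) share of the languages that actually explain the tokens, so the document probability drops rather than rises by more than t.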
The value of k indirectly controls the number of features selected. Values of k are not comparable across datasets as m is not normalized for the size of the training data, so in this work we do not report the values of k and instead directly select the top-N features, weighted by mn.

In LINGUINI, each language is modeled as a single pseudo-document, obtained by concatenating all the training data for the given language. A document is then classified according to the vector with which it has the smallest angle; this is implemented by finding the language vector with the highest cosine with the document vector.

Prager (1999a) also proposes an extension to the approach to allow identification of bilingual documents, and suggests how this may be generalized to any number of languages in a document. The gist of the method is simple: for any given pair of languages, the projection of a document vector onto the hyperplane containing the language vectors of the two languages gives the mixture proportions of the two languages that minimizes the angle with the document vector. Prager (1999a) terms this projection a virtual mixed language (VML), and shows how to find the angle between the document vector and the VML. If this angle is less than that between the document vector and any individual language vector, the document is labeled as bilingual in the two languages from which the mixed vector was derived. The practical difficulty presented by this approach is that exhaustively evaluating all possible combinations of languages is prohibitively expensive. Prager (1999a) addresses this by arguing that in multilingual documents, "the individual component languages will be close to d (the document vector) – probably closer than most or all other languages". Hence, language mixtures are only considered for combinations of the top m languages.

Prager (1999a) shows how to obtain the mixture coefficients for bilingual VMLs, arguing that the process generalizes. Prager (1999b) includes the coefficients for 3-language VMLs, which are much more complex than the 2-language variants. Using a computer algebra system, we verified the analytic forms of the coefficients in the 3-language VML. We also attempted to obtain an analytic form for the coefficients in a 4-language VML, but these were too complex for the computer algebra system to compute. Thus, our evaluation of the VML approach proposed by Prager (1999a) is limited to 3-language VMLs. Neither Prager (1999a) nor Prager (1999b) include an empirical evaluation over multilingual documents, so to the best of our knowledge this paper is the first empirical evaluation of the method on multilingual documents. As no reference implementation of this method is available, we have produced our own implementation, which we have made freely available.[1]

[1] https://github.com/saffsd/linguini.py

The other benchmark we consider in this paper is the method for text segmentation by language proposed by Yamaguchi and Tanaka-Ishii (2012) (hereafter referred to as SEGLANG). The actual task addressed by Yamaguchi and Tanaka-Ishii (2012) is to divide a document into monolingual segments. This is formulated as the task of segmenting a document d = x_1, · · · , x_|d| (where x_i denotes the ith character of d and |d| is the length of the document) by finding a list of boundaries B = [B_1, · · · , B_|B|], where each B_i indicates the location of a language boundary as an offset from the start of the document, resulting in a list of segments X = [X_0, · · · , X_|B|]. For each segment X_i, the system predicts L_i, the language associated with the segment, producing a list of labellings L = [L_0, · · · , L_|B|], with the constraint that adjacent elements in L must differ. Yamaguchi and Tanaka-Ishii (2012) solve the problem of determining X and L for an unlabeled text using a method based on minimum description length. They present a dynamic programming solution to this problem, and analyze a number of parameters that affect the overall accuracy of the system. Given this method to determine X and L, it is then trivial to label an unlabeled document according to d ⇒ {L_x if ∃L_x ∈ L}, and the length of each segment in X can then be used to determine the proportions of the document that are in each language. In this work, we use a reference implementation of SEGLANG kindly provided to us by the authors.

Using the text segmentation approach of SEGLANG to detect multilingual documents differs from LINGUINI and our method primarily in that LINGUINI and our method fragment the document into small sequences of bytes, and discard information about the relative order of the fragments. This is in contrast to SEGLANG, where this information is utilized in the sequential prediction of labels for consecutive segments of text, and is thus able to make better use of the locality of text (since there are likely to be monolingual blocks of text in any given multilingual document). The disadvantage of this is that the underlying model becomes more complex and hence more computationally expensive, as we observe in Section 5.

System      P_M   R_M   F_M   P_µ   R_µ   F_µ
Benchmark   .497  .467  .464  .833  .826  .829
Winner      .718  .703  .699  .932  .931  .932
SEGLANG     .801  .810  .784  .866  .946  .905
LINGUINI    .616  .535  .513  .713  .688  .700
Our method  .753  .771  .748  .945  .922  .933

Table 2: Results on the ALTW2010 dataset. "Benchmark" is the benchmark system proposed by the shared task organizers. "Winner" is the highest-Fµ system submitted to the shared task.

3.5 Evaluation

We seek to evaluate the ability of each method: (1) to correctly identify the language(s) present in each test document; and (2) for multilingual documents, to estimate the relative proportion of the document written in each language. In the first instance, this is a classification problem, and the standard notions of precision (P), recall (R) and F-score (F) apply. Consistent with previous work in language identification, we report both the document-level micro-average, as well as the language-level macro-average. For consistency with Baldwin and Lui (2010a), the macro-averaged F-score we report is the average of the per-class F-scores, rather than the harmonic mean of the macro-averaged precision and recall; as such, it is possible for the F-score to not fall between the precision and recall values. As is common practice, we compute the F-score for β = 1, giving equal importance to precision and recall.[2] We tested the difference in performance for statistical significance using an approximate randomization procedure (Yeh, 2000) with 10000 iterations. Within each table of results
(Tables 2, 3 and 4), all differences between systems are statistically significant at a p < 0.05 level.

[2] Intuitively, it may seem that the maximal precision and recall should be achieved when precision and recall are balanced. However, because of the multi-label nature of the task and variable number of labels assigned to a given document by our models, it is theoretically possible and indeed common in our results for the maximal macro-averaged F-score to be achieved when macro-averaged precision and recall are not balanced.

To evaluate the predictions of the relative proportions of a document d written in each detected language L_i, we compare the proportion predicted by our model to the gold-standard proportion, measured as a byte ratio as follows:

    gs(L_i | d) = (length of L_i part of d in bytes) / (length of d in bytes)    (7)

We report the correlation between predicted and actual proportions in terms of Pearson's r coefficient. We also report the mean absolute error (MAE) over all document–language pairs.

4 Experiments on ALTW2010

Our first experiment utilizes the ALTW2010 shared task dataset (Baldwin and Lui, 2010b), a synthetic dataset of 10000 bilingual documents[3] generated from Wikipedia data, introduced in the ALTW2010 shared task.[4] The dataset is organized into training, development and test partitions. Following standard machine learning practice, we train each system using the training partition, and tune parameters using the development partition. We then report macro- and micro-averaged precision, recall and F-score on the test partition, using the tuned parameters.

[3] With a small number of monolingual documents, formed by randomly selecting the two languages for a given document independently, leaving the possibility of the same two languages being selected.
[4] http://comp.mq.edu.au/programming/task_description/

The results on the ALTW2010 shared task dataset are summarized in Table 2. Each of the three systems we compare was re-trained using the training data provided for the shared task, with a slight difference: in the shared task, participants were provided with multilingual training documents, but the systems targeted in this research require monolingual training data. We thus split the training documents into monolingual segments using the metadata provided with the dataset. The metadata was only published after completion of the task and was not available to task participants. For comparison, we have included the benchmark results published by the shared task organizers, as well as the score attained by the winning entry (Tran et al., 2010).

We tune the parameters for each system using the development partition of the dataset, and report results on the test partition. For LINGUINI, there is a single parameter k to be tuned: the number of features per language. We tested values between 10000 and 50000, and selected 46000 features as the optimal value. For our method, there are two parameters to be tuned: (1) the number of features selected for each language, and (2) the threshold t for including a language. We tested features-per-language counts between 30 and 150, and found that adding features beyond 70 per language had minimal effect. We tested values of the threshold t from 0.01 to 0.15, and found the best value was 0.14. For SEGLANG, we introduce a threshold t on the minimum proportion of a document (measured in bytes) that must be labeled by a language before that language is included in the output set. This was done because our initial experiments indicate that SEGLANG tends to over-produce labels. Using the development data, we found the best value of t was 0.10.

We find that of the three systems tested, two outperform the winning entry to the shared task. This is more evident in the macro-averaged results than in the micro-averaged results. In micro-averaged terms, our method is the best performer, whereas on the macro-average, SEGLANG has the highest F-score. This suggests that our method does well on higher-density languages (relative to the ALTW2010 dataset), and poorly on lower-density languages. This also accounts for the higher micro-averaged precision but lower micro-averaged recall for our method as compared to SEGLANG. The improved macro-average F-score of SEGLANG comes at a much higher computational cost, which increases dramatically as the number of languages is increased. In our testing on a 16-core workstation, SEGLANG took almost 24 hours to process the ALTW2010 shared task test data, compared to 2 minutes for our method and 40 seconds for LINGUINI. As such, SEGLANG is poorly suited to detecting multilingual documents where a large number of candidate languages is considered.

The ALTW2010 dataset is an excellent starting point for this research, but it predominantly contains bilingual documents, making it difficult to assess the ability of systems to distinguish multilingual documents from monolingual ones. Furthermore, we are unable to use it to assess the ability of systems to detect more than 2 languages in a document. To address these shortcomings, we construct a new dataset in a similar vein. The dataset and experiments performed on it are described in the next section.

5 Experiments on WIKIPEDIAMULTI

To fully test the capabilities of our model, we generated WIKIPEDIAMULTI, a dataset that contains a mixture of monolingual and multilingual documents. To allow for replicability of our results and to facilitate research in language identification, we have made the dataset publicly available.[5] WIKIPEDIAMULTI is generated using excerpts from the mediawiki sources of Wikipedia pages downloaded from the Wikimedia Foundation.[6] The dumps we used are from July–August 2010.

[5] http://www.csse.unimelb.edu.au/~tim/
[6] http://dumps.wikimedia.org

To generate WIKIPEDIAMULTI, we first normalized the raw mediawiki documents. Mediawiki documents typically contain one paragraph per line, interspersed with structural elements. We filtered each document to remove all structural elements, and only kept documents that exceeded 2500 bytes after normalization. This yielded a collection of around 500,000 documents in 156 languages. From this initial document set (hereafter referred to as WIKICONTENT), we only retained languages that had more than 1000 documents (44 languages), and generated documents for WIKIPEDIAMULTI as follows:

1. randomly select the number of languages K (1 ≤ K ≤ 5)
2. randomly select a set of K languages S = {L_i ∈ L for i = 1 · · · K} without replacement
3. randomly select a document for each L_i ∈ S from WIKICONTENT without replacement
4. take the top 1/K lines of the document
5. join the K sections into a single document.

As a result of the procedure, the relative proportion of each language in a multilingual document tends not to be uniform, as it is conditioned on the length of the original document from which it was sourced, independent of the other K − 1 documents for the other languages that it was combined with. Overall, the average document length is 5500 bytes (standard deviation = 3800 bytes). Due to rounding up in taking the top 1/K lines (step 4), documents with higher K tend to be longer (6200 bytes for K = 5 vs 5100 bytes for K = 1).

The WIKIPEDIAMULTI dataset contains training, development and test partitions. The training partition consists of 5000 monolingual (i.e. K = 1) documents. The development partition consists of 5000 documents, 1000 documents for each value of K where 1 ≤ K ≤ 5. The test partition contains 200 documents for each K, for a total of 1000 documents. There is no overlap between any of the partitions.

System      P_M   R_M   F_M   P_µ   R_µ   F_µ
SEGLANG     .809  .975  .875  .771  .975  .861
LINGUINI    .853  .772  .802  .838  .774  .805
Our method  .962  .954  .957  .963  .955  .959

Table 3: Results on the WIKIPEDIAMULTI dataset.

5.1 Results over WIKIPEDIAMULTI

We trained each system using the monolingual training partition, and tuned parameters using the development partition. For LINGUINI, we tested feature counts between 10000 and 50000, and found that the effect was relatively small. We thus use 10000 features as the optimum value. For SEGLANG, we tested values for threshold t between 0.01 and 0.20, and found that the maximal macro-averaged F-score is attained when t = 0.06. Finally, for our method we tested features-per-language counts between 30 and 130 and found the best performance with 120 features per language, although the actual effect of varying this value is rather small. We tested values of the threshold t for adding an extra language to λ from 0.01 to 0.15, and found that the best results were attained when t = 0.02.

The results of evaluating each system on the test partition are summarized in Table 3. In this evaluation, our method clearly outperforms both SEGLANG and LINGUINI. The results on WIKIPEDIAMULTI and ALTW2010 are difficult to compare directly due to the different compositions of the two datasets. ALTW2010 is predominantly bilingual, whereas WIKIPEDIAMULTI contains documents with text in 1–5 languages. Furthermore, the average document in ALTW2010 is half the length of that in WIKIPEDIAMULTI. Overall, we observe that SEGLANG has a tendency to over-label (despite the introduction of the t parameter to reduce this effect), evidenced by high recall but lower precision. LINGUINI is inherently limited in that it is only able to detect up to 3 languages per document, causing recall to suffer on WIKIPEDIAMULTI. However, it also tends to always output 3 languages, regardless of the actual number of languages in the document, hurting precision. Furthermore, even on ALTW2010 it has lower recall than the other two systems.

6 Estimating Language Proportions

In addition to detecting multiple languages within a document, our method also estimates the relative proportions of the document that are written in each language. This information may be useful for detecting documents that are candidate bitexts for training machine translation systems, since we may expect languages in the document to be present in equal proportions. It also allows us to identify the predominant language of a document.

A core element of our model of a document is a distribution over a set of labels. Since each label corresponds to a language, as a first approximation, we take the probability mass associated with each label as a direct estimate of the proportion of the document written in that language. We examine the results for predicting the language proportions in the test partition of WIKIPEDIAMULTI. Mapping label distributions directly to language proportions produces excellent results, with a Pearson's r value of 0.863 and an MAE of 0.108.

Although labels have a one-to-one correspondence with languages, the label distribution does not actually correspond directly to the language proportion, because the distribution estimates the proportion of byte n-gram sequences associated with a label and not the proportion of bytes directly. The same number of bytes in different languages can produce different numbers of n-gram sequences, because after feature selection not all n-gram sequences are retained in the feature set. Hereafter, we refer to each n-gram sequence as a token, and the average number of tokens produced per byte of text as the token emission rate.

Figure 1: Example of calculating the n-gram emission rate for a text string. For the 18-byte string "the cat in the hat", the selected features yield 12 tokens, giving an emission rate of 18/12 = 1.5 bytes per token.

We estimate the per-language token emission rate (Figure 1) using the training partition of WIKIPEDIAMULTI. To improve our estimate of the language proportions, we correct our label distribution using estimates of the per-language token emission rate R_{L_i} in bytes per token for L_i ∈ L. Assume that a document d of length |d| is estimated to contain K languages in proportions P_i for i = 1 · · · K. The corrected estimate for the proportion of L_i is:

    Prop(L_i) = (P_i × R_{L_i}) / \sum_{j=1}^{K} (P_j × R_{L_j})    (8)

Note that the |d| term is common to the numerator and denominator and has thus been eliminated. This correction improves our estimates of language proportions. After correction, the Pearson's r rises to 0.981, and the MAE is reduced to 0.024. The improvement is most noticeable for language–document pairs where the proportion of the document in the given language is about 0.5 (Figure 2).

7 Real-world Multilingual Documents

So far, we have demonstrated the effectiveness of our proposed approach using synthetic data. The results have been excellent, and in this section we validate the approach by applying it to a real-world task that has recently been discussed in the literature. Yamaguchi and Tanaka-Ishii (2012) and King and Abney (2013) both observe that in trying to gather linguistic data for "non-major" languages from the web, one challenge faced is that documents retrieved often contain sections in another language. SEGLANG (the solution of Yamaguchi and Tanaka-Ishii (2012)) concurrently detects multilingual documents and segments them by language, but the approach is computationally expensive and has a tendency to over-label (Section 5). On the other hand, the solution of King and Abney (2013) is incomplete, and they specifically mention the need for an automatic
icmethod“toexamineamultilingualdocu-ment,andwithhighaccuracy,listthelanguagesthatarepresentinthedocument”.Inthissection,weshowthatourmethodisabletofillthisneed.WeSystemPRFBaseline0.7191.000.837SEGLANG0.7790.9910.872LINGUINI0.7290.9810.837Ourmethod0.9070.9160.912Table4:DetectionaccuracyforEnglish-languageinclusioninwebdocumentsfromtargetedwebcrawlsforlow-densitylanguages.makeuseofmanually-annotateddatakindlypro-videdtousbyBenKing,whichconsistsof149doc-umentscontaining42languagesretrievedfromthewebusingasetoftargetedqueriesforlow-densitylanguages.NotethatthedatasetdescribedinKingandAbney(2013)wasbasedonmanualconfirma-tionofthepresenceofEnglishinadditiontothelow-densitylanguageofprimaryinterest;ourdatasetcontainsthesebilingualdocumentsaswellasmono-lingualdocumentsinthelow-densitylanguageofin-terest.Ourpurposeinthissectionistoinvestigatetheabilityofautomaticsystemstoselectthissubsetofbilingualdocuments.Specifically,givenacol-lectionofdocumentsretrievedforatargetlanguage,thetaskistoidentifythedocumentsthatcontaintextinEnglishinadditiontothetargetlanguage.Thus,were-traineachsystemforeachtargetlanguage,us-ingonlytrainingdataforEnglishandthetargetlan-guage.WereservethedataprovidedbyBenKingforevaluation,andtrainourmethodsusingdatasep-aratelyobtainedfromtheUniversalDeclarationofHumanRights(UDHR).WhereUDHRtranslationsforaparticularlanguagewerenotavailable,weuseddatafromWikipediaorfromabibletranslation.Ap-proximately20–80kBofdatawereusedforeachlanguage.Aswedonothavesuitabledevelopmentdata,wemadeuseofthebestparametersforeachsystemfromtheexperimentsonWIKIPEDIAMULTI.Wefindthatall3systemsareabletodetectthateachdocumentcontainsthetargetlanguagewith100%accuracy.However,systemsvaryintheirabil-itytodetectifadocumentalsocontainsEnglishinadditiontothetargetlanguage.Thedetectionaccu-racyforEnglish-languageinclusionissummarizedinTable4.7Forcomparison,weincludeaheuristicbaselinebasedonlabelingalldocumentsascontain-7NotethatTable2andTable3bothreportmacroandmicro-averagedresultsacrossanumbero
flanguages.IncontrastTa-ble4onlyreportsresultsforEnglish,andthevaluesarenotdirectlycomparabletoourearlierevaluation. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 1 6 3 1 5 6 6 8 5 5 / / t l a c _ a _ 0 0 1 6 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 37 0.20.40.60.81.0Actual Proportion0.20.40.60.81.0Predicted ProportionPearson's r: 0.863MAE: 0.108(a)withoutemissionratecorrection0.20.40.60.81.0Actual Proportion0.20.40.60.81.0Predicted ProportionPearson's r: 0.981MAE: 0.0241(b)withemissionratecorrectionFigure2:Scatterplotofthepredictedvs.actuallanguageproportionsinadocumentforthetestpartitionofWIKIPEDIAMULTI(predictionsarefromourmethod;eachpointcorrespondstoadocument-languagepair).ingEnglish.Wefindthat,liketheheuristicbase-line,SEGLANGandLINGUINIbothtendtoover-labeldocuments,producingfalsepositivelabelsofEnglish,resultinginincreasedrecallattheexpenseofprecision.Ourmethodproduceslessfalsepos-itives(butslightlymorefalsenegatives).Overall,ourmethodattainsthebestFfordetectingEn-glishinclusions.ManualerroranalysissuggeststhatthefalsenegativesforourmethodgenerallyoccurwherearelativelysmallproportionofthedocumentiswritteninEnglish.8FutureWorkDocumentsegmentationbylanguagecouldbeac-complishedbyacombinationofourmethodandthemethodofKingandAbney(2013),whichcouldbecomparedtothemethodofYamaguchiandTanaka-Ishii(2012)inthecontextofconstructingcorporaforlow-densitylanguagesusingtheweb.Anotherareawehaveidentifiedinthispaperisthetuningoftheparametersαandβinourmodel(currentlyα=0andβ=1),whichmayhavesomeeffectonthesparsityofthemodel.Furtherworkisrequiredindealingwithcross-domaineffects,toallowfor“off-the-shelf”languageidentificationinmultilingualdocuments.Previousworkhasshownthatitispossibletogenerateadocu-mentrepresentationthatisrobusttovariationacrossdomains(LuiandBaldwin,2011),andweintendtoinvestigateiftheseresultsarealsoapplicabletolan-guageidentificationinmultilingual
documents. Another open question is the extension of the generative mixture models to "unknown" language identification (i.e. eliminating the closed-world assumption (Hughes et al., 2006)), which may be possible through the use of non-parametric mixture models such as Hierarchical Dirichlet Processes (Teh et al., 2006).

9 Conclusion

We have presented a system for language identification in multilingual documents using a generative mixture model inspired by supervised topic modeling algorithms, combined with a document representation based on previous research in language identification for monolingual documents. We showed that the system outperforms alternative approaches from the literature on synthetic data, as well as on real-world data from related research on linguistic corpus creation for low-density languages using the web as a resource. We also showed that our system is able to accurately estimate the proportion of the document written in each of the languages identified. We have made a full reference implementation of our system freely available,[8] as well as the synthetic dataset prepared for this paper (Section 5), in order to facilitate the adoption of this technology and further research in this area.

[8] https://github.com/saffsd/polyglot
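As a minimal sketch of the proportion-correction step from Section 6: this is not the authors' implementation, and the feature scheme (all byte bigrams, with no feature selection) and function names are hypothetical stand-ins for the paper's selected byte n-gram features. It only illustrates how a label distribution P can be rescaled by per-language emission rates R (Equation 8) and renormalized.

```python
# Hypothetical sketch of the emission-rate correction (Equation 8).
# Feature scheme and names are illustrative, not the paper's actual setup.

def count_tokens(text: bytes, features: set[bytes]) -> int:
    """Count occurrences of retained n-gram features ("tokens") in text."""
    n = 2  # byte bigrams only, for simplicity
    return sum(1 for i in range(len(text) - n + 1)
               if text[i:i + n] in features)

def emission_rate(train_text: bytes, features: set[bytes]) -> float:
    """Bytes per token for a language, estimated from training text."""
    return len(train_text) / count_tokens(train_text, features)

def corrected_proportions(P: list[float], R: list[float]) -> list[float]:
    """Equation 8: weight label mass P_i by the per-language emission
    rate R_i (bytes/token), then renormalize so proportions sum to 1."""
    weighted = [p * r for p, r in zip(P, R)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Toy usage: two languages with equal label mass but different emission
# rates; the language that needs more bytes per token (higher R) is
# credited with a larger share of the document's bytes.
P = [0.5, 0.5]
R = [2.0, 1.0]  # hypothetical bytes-per-token estimates
print(corrected_proportions(P, R))  # roughly [0.67, 0.33]
```

The correction leaves a uniform label distribution unchanged only when all emission rates are equal, which is the intuition behind why uncorrected label mass over- or under-states the byte proportion for some languages.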
Acknowledgments

We thank Hiroshi Yamaguchi for making a reference implementation of SEGLANG available to us, and Ben King for providing us with a collection of real-world multilingual web documents. This work was substantially improved as a result of the insightful feedback received from the reviewers.

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

Steven Abney and Steven Bird. 2010. The human language project: building a universal corpus of the world's languages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 88–97. Association for Computational Linguistics.

Beatrice Alex, Amit Dubey, and Frank Keller. 2007. Using foreign inclusion detection to improve parsing performance. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2007 (EMNLP-CoNLL 2007), pages 151–160, Prague, Czech Republic.

Timothy Baldwin and Marco Lui. 2010a. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 229–237, Los Angeles, USA.

Timothy Baldwin and Marco Lui. 2010b. Multilingual language identification: ALTW 2010 shared task dataset. In Proceedings of the Australasian Language Technology Workshop 2010 (ALTW 2010), pages 5–7, Melbourne, Australia.

Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media (LSM 2012), pages 65–74, Montréal, Canada.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Alessio Bosca and Luca Dini. 2010. Language identification strategies for cross language information retrieval. In Working Notes of the Cross Language Evaluation Forum (CLEF).

Jamie Callan and Mark Hoy. 2009.
ClueWeb09 Dataset. Available at http://boston.lti.cs.cmu.edu/Data/clueweb09/.

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, pages 161–175, Las Vegas, USA.

Hakan Ceylan and Yookyung Kim. 2009. Language identification of search engine queries. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1066–1074, Singapore.

Paul Cook and Marco Lui. 2012. langid.py for better language modelling. In Proceedings of the Australasian Language Technology Association Workshop 2012, pages 107–112, Dunedin, New Zealand.

Rafael Dueire Lins and Paulo Gonçalves. 2004. Automatic language identification of written texts. In Proceedings of the 2004 ACM Symposium on Applied Computing (SAC 2004), pages 1128–1133, Nicosia, Cyprus.

Ted Dunning. 1994. Statistical identification of language. Technical Report MCCS 940-273, Computing Research Laboratory, New Mexico State University.

Rayid Ghani, Rosie Jones, and Dunja Mladenic. 2004. Building minority language corpora by learning to generate web search queries. Knowledge and Information Systems, 7(1):56–83.

Emmanuel Giguet. 1995. Categorisation according to language: A step toward combining linguistic knowledge and statistical learning. In Proceedings of the 4th International Workshop on Parsing Technologies (IWPT-1995), Prague, Czech Republic.

Gregory Grefenstette. 1995. Comparing two language identification schemes. In Proceedings of Analisi Statistica dei Dati Testuali (JADT), pages 263–268, Rome, Italy.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235.

Thomas Griffiths. 2002. Gibbs sampling in the generative model of latent Dirichlet allocation. Technical Report, Stanford University.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy.
Genitiro Kikui. 1996. Identifying the coding system and language of on-line documents on the internet. In Proceedings of the 16th International Conference on Computational Linguistics (COLING '96), pages 652–657, Kyoto, Japan.

Ben King and Steven Abney. 2013. Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1110–1119, Atlanta, Georgia.

Canasai Kruengkrai, Prapass Srichaivattana, Virach Sornlertlamvanich, and Hitoshi Isahara. 2005. Language identification based on string kernels. In Proceedings of the 5th International Symposium on Communications and Information Technologies (ISCIT-2005), pages 896–899, Beijing, China.

David D. Lewis. 1997. The Reuters-21578 dataset. Available at http://www.daviddlewis.com/resources/testcollections/reuters21578/.

Wang Ling, Guang Xiang, Chris Dyer, Alan Black, and Isabel Trancoso. 2013. Microblogs as parallel corpora. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 176–186, Sofia, Bulgaria. Association for Computational Linguistics.

Jicheng Liu and Chunyan Liang. 2008. Text categorization of multilingual web pages in specific domain. In Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD '08), pages 938–944, Osaka, Japan.

Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 553–561, Chiang Mai, Thailand.

Bruno Martins and Mário J. Silva. 2005. Language identification in web pages. In Proceedings of the 2005 ACM Symposium on Applied Computing, pages 764–768, Santa Fe, USA.

Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, USA. Available as Technical Report WS-98-05, AAAI Press.

Andrew Kachites McCallum. 1999. Multi-label text classification with a mixture model trained by EM. In Proceedings of the AAAI '99 Workshop on Text Learning.

Paul McNamee. 2005. Language identification: a solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges, 20(3):94–101.

Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. 1999. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In Proceedings of the 22nd International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pages 74–81, Berkeley, USA.

John M. Prager. 1999a. Linguini: language identification for multilingual documents. In Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32), Maui, Hawaii.

John M. Prager. 1999b. Linguini: Language identification for multilingual documents. Journal of Management Information Systems, 16(3):71–101.

John Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, USA.

Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), pages 248–256, Singapore.

Radim Rehurek and Milan Kolkus. 2009. Language identification on the web: Extending the dictionary method. In Proceedings of Computational Linguistics and Intelligent Text Processing, 10th International Conference (CICLing 2009), pages 357–368, Mexico City, Mexico.

Philip Resnik. 1999. Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 527–534, College Park, USA.

Kevin P. Scannell. 2007. The Crúbadán Project: Corpus building for under-resourced languages. In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, pages 5–15, Louvain-la-Neuve, Belgium.

W. J. Teahan. 2000. Text classification and segmentation using minimum cross-entropy. In Proceedings of the 6th International Conference "Recherche d'Information Assistée par Ordinateur" (RIAO '00), pages 943–961, Paris, France.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581.

Jörg Tiedemann and Nikola Ljubešić. 2012. Efficient discrimination between closely related languages. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2619–2634, Mumbai, India.

Giang Binh Tran, Dat Ba Nguyen, and Bin Thanh Kieu. 2010. N-gram based approach for multilingual language identification. Poster. Available at http://comp.mq.edu.au/programming/task_description/VILangTek.pdf.

Fei Xia, Carrie Lewis, and William D. Lewis. 2010. Language ID for a thousand languages. In LSA Annual Meeting Extended Abstracts, Baltimore, USA.

Hiroshi Yamaguchi and Kumiko Tanaka-Ishii. 2012. Text segmentation by language using minimum description length. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 969–978, Jeju Island, Korea.

Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pages 947–953, Saarbrücken, Germany.