Transactions of the Association for Computational Linguistics, 2 (2014) 27–40. Action Editor: Kristina Toutanova.
Submitted 1/2013; Revised 7/2013; Published 2/2014. © 2014 Association for Computational Linguistics.
Automatic Detection and Language Identification of Multilingual Documents

Marco Lui♥♣, Jey Han Lau♠ and Timothy Baldwin♥♣
♥ Department of Computing and Information Systems, The University of Melbourne
♣ NICTA Victoria Research Laboratory
♠ Department of Philosophy, King's College London
mhlui@unimelb.edu.au, jeyhan.lau@gmail.com, tb@ldwin.net

Abstract

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.

1 Introduction

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. Language identification techniques commonly assume that every document is written in one of a closed set of known languages for which there is training data, and is thus formulated as the task of selecting the most likely language from the set of training languages. In this work, we remove this monolingual assumption, and address the problem of language identification in documents that may contain text from more than one language from the candidate set. We propose a method that concurrently detects that a document is multilingual, and estimates the proportion of the document that is written in each language.

Detecting multilingual documents has a variety of applications. Most natural language processing techniques presuppose monolingual input data, so inclusion of data in foreign languages introduces noise, and can degrade the performance of NLP systems (Alex et al., 2007; Cook and Lui, 2012). Automatic detection of multilingual documents can be used as a pre-filtering step to improve the quality of input data. Detecting multilingual documents is also important for acquiring linguistic data from the web (Scannell, 2007; Abney and Bird, 2010), and has applications in mining bilingual texts for statistical machine translation from online resources (Resnik, 1999; Nie et al., 1999; Ling et al., 2013). There has been particular interest in extracting text resources for low-density languages from multilingual web pages containing both the low-density language and another language such as English (Yamaguchi and Tanaka-Ishii, 2012; King and Abney, 2013). King and Abney (2013, p1118) specifically mention the need for an automatic method "to examine a multilingual document, and with high accuracy, list the languages that are present in the document".

We introduce a method that is able to detect multilingual documents, and simultaneously identify each language present as well as estimate the proportion of the document written in that language. We achieve this with a probabilistic mixture model, using a document representation developed for monolingual language identification (Lui and Baldwin, 2011). The model posits that each document is generated as samples from an unknown mixture of languages from the training set. We introduce a Gibbs sampler to map samples to languages for any given set of languages, and use this to select the set of languages that maximizes the posterior probability of the document.
Our method is able to learn a language identifier for multilingual documents from monolingual training data. This is an important property, as there are no standard corpora of multilingual documents available, whereas corpora of monolingual documents are readily available for a reasonably large number of languages (Lui and Baldwin, 2011). We demonstrate the effectiveness of our method empirically, firstly by evaluating it on synthetic datasets drawn from Wikipedia data, and then by applying it to real-world data, showing that we are able to identify multilingual documents in targeted web crawls of minority languages (King and Abney, 2013).

Our main contributions are: (1) we present a method for identifying multilingual documents, the languages contained therein and the relative proportion of the document in each language; (2) we show that our method outperforms state-of-the-art methods for language identification in multilingual documents; (3) we show that our method is able to estimate the proportion of the document in each language to a high degree of accuracy; and (4) we show that our method is able to identify multilingual documents in real-world data.

2 Background

Most language identification research focuses on language identification for monolingual documents (Hughes et al., 2006). In monolingual LangID, the task is to assign each document a unique language Li ∈ L. Some work has reported near-perfect accuracy for language identification of large documents in a small number of languages (Cavnar and Trenkle, 1994; McNamee, 2005). However, in order to attain such accuracy, a large number of simplifying assumptions have to be made (Hughes et al., 2006; Baldwin and Lui, 2010a). In this work, we tackle the assumption that each document is monolingual, i.e. it contains text from a single language.

In language identification, documents are modeled as a stream of characters (Cavnar and Trenkle, 1994; Kikui, 1996), often approximated by the corresponding stream of bytes (Kruengkrai et al., 2005; Baldwin and Lui, 2010a) for robustness over variable character encodings. In this work, we follow Baldwin and Lui (2010a) in training a single model for languages that naturally use multiple encodings (e.g. UTF8, Big5 and GB encodings for Chinese), as issues of encoding are not the focus of this research.

The document representation used for language identification generally involves estimating the relative distributions of particular byte sequences, selected such that their distributions differ between languages. In some cases the relevant sequences may be externally specified, such as function words and common suffixes (Giguet, 1995) or grammatical word classes (Dueire Lins and Gonçalves, 2004), though they are more frequently learned from labeled data (Cavnar and Trenkle, 1994; Grefenstette, 1995; Prager, 1999a; Lui and Baldwin, 2011).

Learning algorithms applied to language identification fall into two general categories: Bayesian classifiers and nearest-prototype (Rocchio-style) classifiers. Bayesian approaches include Markov processes (Dunning, 1994), naive Bayes methods (Grefenstette, 1995; Lui and Baldwin, 2011; Tiedemann and Ljubešić, 2012), and compressive models (Teahan, 2000). The nearest-prototype methods vary primarily in the distance measure used, including measures based on rank order statistics (Cavnar and Trenkle, 1994), information theory (Baldwin and Lui, 2010a), string kernels (Kruengkrai et al., 2005) and vector space models (Prager, 1999a; McNamee, 2005).

Language identification has been applied in domains such as USENET messages (Cavnar and Trenkle, 1994), web pages (Kikui, 1996; Martins and Silva, 2005; Liu and Liang, 2008), web search queries (Ceylan and Kim, 2009; Bosca and Dini, 2010), mining the web for bilingual text (Resnik, 1999; Nie et al., 1999), building minority language corpora (Ghani et al., 2004; Scannell, 2007; Bergsma et al., 2012) as well as a large-scale database of Interlinear Glossed Text (Xia et al., 2010), and the construction of a large-scale multilingual web crawl (Callan and Hoy, 2009).

2.1 Multilingual Documents

Language identification over documents that contain text from more than one language has been identified as an open research question (Hughes et al., 2006). Common examples of multilingual documents are web pages that contain excerpts from another language, and documents from multilingual organizations such as the European Union.
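The byte-stream approximation described above can be illustrated directly: the same character stream yields entirely different byte streams under different encodings, which is why a single model per language must be trained over all the encodings that language naturally uses. A minimal sketch (the helper name is ours, for illustration only):

```python
def encode_all(text, encodings=("utf-8", "big5", "gb2312")):
    """Encode one character stream under several encodings, yielding the
    distinct byte streams a byte-based LangID model actually observes."""
    return {enc: text.encode(enc) for enc in encodings}

# the same two Chinese characters produce three different byte streams
streams = encode_all("中文")
```

A character-level model would see a single distribution here; a byte-level model sees three, so the training data for Chinese must cover all three encodings.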
Language    character   byte
English     the         74 68 65
French      pour        70 6F 75 72
Italian     di          64 69
German      auf         61 75 66
Dutch       voor        76 6F 6F 72
Japanese    は          E3 81 AF

Table 1: Examples of per-language byte sequences selected by information gain.

The Australasian Language Technology Workshop 2010 hosted a shared task where participants were required to predict the language(s) present in a held-out test set containing monolingual and bilingual documents (Baldwin and Lui, 2010b). The dataset was prepared using data from Wikipedia, and bilingual documents were produced using a segment from a page in one language, and a segment from the same page in another language. We use the dataset from this shared task for our initial experiments.

To the authors' knowledge, the only other work to directly tackle identification of multiple languages and their relative proportions in a single document is the LINGUINI system (Prager, 1999a). The system is based on a vector space model, and cosine similarity between a feature vector for the test document and a feature vector for each language Li, computed as the sum of feature vectors for all the documents for language Li in the training data. The elements in the feature vectors are frequency counts over byte n-grams (2 ≤ n ≤ 5) and words. Language identification for multilingual documents is performed through the use of virtual mixed languages. Prager (1999a) shows how to construct vectors representative of particular combinations of languages independent of the relative proportions, and proposes a method for choosing combinations of languages to consider for any given document.

Language identification in multilingual documents could also be performed by application of supervised language segmentation algorithms. Given a system that can segment a document into labeled monolingual segments, we can then extract the languages present as well as the relative proportion of text in each language. Several methods for supervised language segmentation have been proposed. Teahan (2000) proposed a system based on text compression that identifies multilingual documents by first segmenting the text into monolingual blocks. Rehurek and Kolkus (2009) perform language segmentation by computing a relevance score between terms and languages, smoothing across adjoining terms and finally identifying points of transition between high and low relevance, which are interpreted as boundaries between languages. Yamaguchi and Tanaka-Ishii (2012) use a minimum description length approach, embedding a compressive model to compute the description length of text segments in each language. They present a linear-time dynamic programming solution to optimize the location of segment boundaries and language labels.

3 Methodology

Language identification for multilingual documents is a multi-label classification task, in which a document can be mapped onto any number of labels from a closed set. In the remainder of this paper, we denote the set of all languages by L. We denote a document d which contains languages Lx and Ly as d → {Lx, Ly}, where Lx, Ly ∈ L. We denote a document that does not contain a language Lx by d ↛ {Lx}, though we generally omit all the languages not contained in the document for brevity. We denote classifier output using ⇒; e.g. d ⇒ {La, Lb} indicates that document d has been predicted to contain text in languages La and Lb.

3.1 Document Representation and Feature Selection

We represent each document d as a frequency distribution over byte n-gram sequences such as those in Table 1. Each document is converted into a vector where each entry counts the number of times a particular byte n-gram is present in the document. This is analogous to a bag-of-words model, where the vocabulary of "words" is a set of byte sequences that has been selected to distinguish between languages.

The exact set of features is selected from the training data using Information Gain (IG), an information-theoretic metric developed as a splitting criterion for decision trees (Quinlan, 1993). IG-based feature selection combined with a naive Bayes classifier has been shown to be particularly effective for language identification (Lui and Baldwin, 2011).
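The representation and the IG-based selection can be sketched as follows. This is an illustrative simplification (the function names, the n-gram length range, and the binary present/absent split are our own), not the authors' implementation:

```python
import math
from collections import Counter

def byte_ngrams(text, n_min=2, n_max=4):
    """Bag of byte n-grams for one document (length range assumed)."""
    data = text.encode("utf-8")
    bag = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(data) - n + 1):
            bag[data[i:i + n]] += 1
    return bag

def entropy(counts):
    """Shannon entropy (bits) of a label-count distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def info_gain(docs, feature):
    """IG of splitting labeled documents on presence of one byte n-gram.
    docs is a list of (language, bag) pairs."""
    prior = Counter(lang for lang, _ in docs)
    split = {True: Counter(), False: Counter()}
    for lang, bag in docs:
        split[feature in bag][lang] += 1
    ig = entropy(prior)
    for side in split.values():
        if side:
            ig -= sum(side.values()) / len(docs) * entropy(side)
    return ig

# demo: four tiny labeled documents; b"the" splits the two languages perfectly
docs = [("en", byte_ngrams("the cat")), ("en", byte_ngrams("the dog")),
        ("fr", byte_ngrams("pour le")), ("fr", byte_ngrams("pour la"))]
```

Scoring every candidate n-gram this way and retaining the top-scoring sequences per language yields the feature set; each document is then the vector of counts of the retained sequences.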
3.2 Generative Mixture Models

Generative mixture models are popular for text modeling tasks where a mixture of influences governs the content of a document, such as in multi-label document classification (McCallum, 1999; Ramage et al., 2009), and topic modeling (Blei et al., 2003). Such models normally assume full exchangeability between tokens (i.e. the bag-of-words assumption), and label each token with a single discrete label.

Multi-label text classification, topic modeling and our model for language identification in multilingual documents share the same fundamental representation of the latent structure of a document. Each label is modeled with a probability distribution over tokens, and each document is modeled as a probabilistic mixture of labels. As presented in Griffiths and Steyvers (2004), the probability of the ith token (w_i) given a set of T labels z_1 · · · z_T is modeled as:

    P(w_i) = \sum_{j=1}^{T} P(w_i | z_i = j) P(z_i = j)    (1)

The set of tokens w is the document itself, which in all cases is observed. In the case of topic modeling, the tokens are words and the labels are topics, and z is latent. Whereas topic modeling is generally unsupervised, multi-label text classification is a supervised text modeling task, where the labels are a set of pre-defined categories (such as RUBBER, IRON-STEEL, TRADE, etc. in the popular Reuters-21578 dataset (Lewis, 1997)), and the tokens are individual words in documents. z is still latent, but constrained in the training data (i.e. documents are labeled but the individual words are not). Some approaches to labeling unseen documents require that z for the training data be inferred, and methods for doing this include an application of the Expectation-Maximization (EM) algorithm (McCallum, 1999) and Labeled LDA (Ramage et al., 2009).

The model that we propose for language identification in multilingual documents is similar to multi-label text classification. In the framework of Equation 1, each per-token label z_i is a language and the vocabulary of tokens is not given by words but rather by specific byte sequences (Section 3.1). The key difference with multi-label text classification is that we use monolingual (i.e. mono-label) training data. Hence, z is effectively observed for the training data (since all tokens must share the same label). To infer z for unlabeled documents, we utilize a Gibbs sampler, closely related to that proposed by Griffiths and Steyvers (2004) for LDA. The sampling probability for a label z_i for token w in a document d is:

    P(z_i = j | z_{-i}, w) ∝ \phi_j^{(w_i)} · \theta_j^{(d)}    (2)

where

    \phi_j^{(w)} = P(w_i | z_i = j, z_{-i}, w_{-i})
    \theta_j^{(d)} = P(z_i = j | z_{-i})

In the LDA model, \theta_j^{(d)} is assumed to have a Dirichlet distribution with hyperparameter α, and the word distribution for each topic \phi_j^{(w)} is also assumed to have a Dirichlet distribution with hyperparameter β. Griffiths (2002) describes a generative model for LDA where both \phi_j^{(w)} and \theta_j^{(d)} are inferred from the output of a Gibbs sampler. In our method, we estimate \phi_j^{(w)} using maximum likelihood estimation (MLE) from the training data. Estimating \phi_j^{(w)} through MLE is equivalent to a multinomial naive Bayes model (McCallum and Nigam, 1998):

    \hat{\phi}_j^{(w)} = (n_j^{(w)} + β) / (n_j^{(·)} + Wβ)    (3)

where n_j^{(w)} is the number of times word w occurs with label j, and n_j^{(·)} is the total number of words that occur with label j. By setting β to 1, we obtain standard Laplacian smoothing. Hence, only \hat{\theta}_j^{(d)} is updated at each step in the Gibbs sampler:

    \hat{\theta}_j^{(d)} = (n_{-i,j}^{(d)} + α) / (n_{-i}^{(d)} + Tα)    (4)

where n_{-i,j}^{(d)} is the number of tokens in document d that are currently mapped to language j, and n_{-i}^{(d)} is the total number of tokens in document d. In both cases, the current assignment of z_i is excluded from the count. T is the number of languages (i.e. the size of the label set). For simplicity, we set α to 0. We note that in the LDA model, α and β influence the sparsity of the solution, and so it may be possible to tune these parameters for our model as well. We leave this as an avenue for further research.
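A minimal sketch of this sampler (our own illustrative code, not the authors' implementation): the per-language token distributions \hat{\phi} of Equation 3 are precomputed from monolingual data and passed in, while \hat{\theta} of Equation 4 is recomputed from the current counts at every step. An α parameter is kept so the sketch can be smoothed; the paper's setting α = 0 is the default.

```python
import random
from collections import Counter

def gibbs_language_mixture(tokens, phi, languages, iters=100, alpha=0.0, seed=0):
    """Sample per-token language labels z_i (Equation 2) and tabulate them
    into a per-language distribution for the document.

    phi maps each language j to its token distribution (Equation 3),
    estimated beforehand from monolingual training data."""
    rng = random.Random(seed)
    T = len(languages)
    z = [rng.choice(languages) for _ in tokens]   # random initial assignment
    counts = Counter(z)
    for _ in range(iters):
        for i, w in enumerate(tokens):
            counts[z[i]] -= 1                     # exclude current assignment
            n_rest = len(tokens) - 1
            weights = [
                phi[j].get(w, 1e-9)               # phi-hat, floored for unseen tokens
                * (counts[j] + alpha) / (n_rest + T * alpha)  # theta-hat
                for j in languages
            ]
            r = rng.random() * sum(weights)
            for j, wt in zip(languages, weights):
                r -= wt
                if r <= 0:
                    z[i] = j
                    break
            counts[z[i]] += 1
    # tabulate z over the whole document and normalize by |d|
    return {j: counts[j] / len(tokens) for j in languages}

# demo: a toy document whose tokens are 75% "en" and 25% "fr"
phi = {"en": {"a": 0.9, "b": 0.1}, "fr": {"x": 0.9, "y": 0.1}}
props = gibbs_language_mixture(["a"] * 6 + ["x"] * 2, phi, ["en", "fr"], alpha=0.5)
```

Note that with α = 0, a language whose count reaches zero can never be re-sampled, which is the sparsity effect discussed above; the demo uses a small α to avoid that absorbing state in so short a document.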
3.3 Language Identification in Multilingual Documents

The model described in Section 3.2 can be used to compute the most likely distribution to have generated an unlabeled document over a given set of languages for which we have monolingual training data, by letting the set of terms w be the byte n-gram sequences we selected using per-language information gain (Section 3.1), and allowing the labels z to range over the set of all languages L. Using training data, we compute \hat{\phi}_j^{(w)} (Equation 3), and then we infer P(L_j | d) for each L_j ∈ L for the unlabeled document, by running the Gibbs sampler until the samples for z_i converge and then tabulating z_i over the whole d and normalizing by |d|. Naively, we could identify the languages present in the document by d ⇒ {L_x if ∃(z_i = L_x | d)}, but closely-related languages tend to have similar frequency distributions over byte n-gram features, and hence it is likely that some tokens will be incorrectly mapped to a language that is similar to the "correct" language.

We address this issue by finding the subset of languages λ from the training set L that maximizes P(λ|d) (a similar approach is taken in McCallum (1999)). Through an application of Bayes' theorem, P(λ|d) ∝ P(d|λ) · P(λ), noting that P(d) is a normalizing constant and can be dropped. We assume that P(λ) is constant (i.e. any subset of languages is equally likely, a reasonable assumption in the absence of other evidence), and hence maximize P(d|λ). For any given d = w_1 · · · w_N and λ, we infer P(d|λ) from the output of the Gibbs sampler:

    P(d|λ) = \prod_{i=1}^{N} P(w_i | λ)    (5)
           = \prod_{i=1}^{N} \sum_{j ∈ λ} P(w_i | z_i = j) P(z_i = j)    (6)

where both P(w_i | z_i = j) and P(z_i = j) are estimated by their maximum likelihood estimates.

In practice, exhaustive evaluation of the power set of L is prohibitively expensive, and so we greedily approximate the optimal λ using Algorithm 1. In essence, we initially rank all the candidate languages by computing the most likely distribution over the full set of candidate languages. Then, for each of the top-N languages in turn, we consider whether adding it to λ increases the probability of the document by more than a threshold t.

Algorithm 1 DetectLang(L, d)
    L_N ← top-N z ∈ L by P(z|d)
    λ ← {L_u}
    for each L_t ∈ L_N do
        λ′ ← λ ∪ L_t
        if P(d|λ′) > P(d|λ) + t then
            λ ← λ′
    return λ
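The greedy search of Algorithm 1 can be sketched as below. This is our own simplified rendering: P(w_i | z_i = j) comes from the precomputed per-language distributions, but P(z_i = j) is approximated as uniform over the candidate set rather than taken from the sampler output, and the inclusion threshold t is applied in log space; all names are illustrative.

```python
import math

def greedy_language_set(tokens, phi, languages, n_top=5, t=2.0):
    """Greedy approximation of the optimal language set (cf. Algorithm 1).

    phi[j] approximates P(w | z = j); P(z = j) is simplified to be uniform
    over the current candidate set lambda, and t thresholds log P(d|lambda)."""
    def log_p(lam):
        # log P(d|lambda) = sum_i log sum_{j in lam} P(w_i|z_i=j) P(z_i=j)
        return sum(
            math.log(sum(phi[j].get(w, 1e-9) / len(lam) for j in lam))
            for w in tokens
        )
    # rank candidates by single-language fit, then grow lambda greedily
    ranked = sorted(languages, key=lambda j: -log_p([j]))
    lam = [ranked[0]]
    for cand in ranked[1:n_top]:
        if log_p(lam + [cand]) > log_p(lam) + t:
            lam.append(cand)
    return lam

# demo: a document mixing two toy languages, plus a distractor language "de"
phi = {"en": {"a": 0.8, "b": 0.2}, "fr": {"x": 0.8, "y": 0.2},
       "de": {"b": 0.8, "a": 0.2}}
detected = greedy_language_set(["a"] * 3 + ["x"] * 3, phi, ["en", "fr", "de"])
```

The distractor is rejected because adding it dilutes the uniform P(z = j) share of the languages that actually explain the tokens, so the document probability drops rather than rises by more than t.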
The value of k indirectly controls the number of features selected. Values of k are not comparable across datasets as m is not normalized for the size of the training data, so in this work we do not report the values of k and instead directly select the top-N features, weighted by mn.

In LINGUINI, each language is modeled as a single pseudo-document, obtained by concatenating all the training data for the given language. A document is then classified according to the vector with which it has the smallest angle; this is implemented by finding the language vector with the highest cosine with the document vector.

Prager (1999a) also proposes an extension to the approach to allow identification of bilingual documents, and suggests how this may be generalized to any number of languages in a document. The gist of the method is simple: for any given pair of languages, the projection of a document vector onto the hyperplane containing the language vectors of the two languages gives the mixture proportions of the two languages that minimizes the angle with the document vector. Prager (1999a) terms this projection a virtual mixed language (VML), and shows how to find the angle between the document vector and the VML. If this angle is less than that between the document vector and any individual language vector, the document is labeled as bilingual in the two languages from which the mixed vector was derived. The practical difficulty presented by this approach is that exhaustively evaluating all possible combinations of languages is prohibitively expensive. Prager (1999a) addresses this by arguing that in multilingual documents, "the individual component languages will be close to d (the document vector) – probably closer than most or all other languages". Hence, language mixtures are only considered for combinations of the top m languages.

Prager (1999a) shows how to obtain the mixture coefficients for bilingual VMLs, arguing that the process generalizes. Prager (1999b) includes the coefficients for 3-language VMLs, which are much more complex than the 2-language variants. Using a computer algebra system, we verified the analytic forms of the coefficients in the 3-language VML. We also attempted to obtain an analytic form for the coefficients in a 4-language VML, but these were too complex for the computer algebra system to compute. Thus, our evaluation of the VML approach proposed by Prager (1999a) is limited to 3-language VMLs. Neither Prager (1999a) nor Prager (1999b) include an empirical evaluation over multilingual documents, so to the best of our knowledge this paper is the first empirical evaluation of the method on multilingual documents. As no reference implementation of this method is available, we have produced our own implementation, which we have made freely available.[1]

[1] https://github.com/saffsd/linguini.py

The other benchmark we consider in this paper is the method for text segmentation by language proposed by Yamaguchi and Tanaka-Ishii (2012) (hereafter referred to as SEGLANG). The actual task addressed by Yamaguchi and Tanaka-Ishii (2012) is to divide a document into monolingual segments. This is formulated as the task of segmenting a document d = x_1, · · · , x_|d| (where x_i denotes the ith character of d and |d| is the length of the document) by finding a list of boundaries B = [B_1, · · · , B_|B|], where each B_i indicates the location of a language boundary as an offset from the start of the document, resulting in a list of segments X = [X_0, · · · , X_|B|]. For each segment X_i, the system predicts L_i, the language associated with the segment, producing a list of labellings L = [L_0, · · · , L_|B|], with the constraint that adjacent elements in L must differ. Yamaguchi and Tanaka-Ishii (2012) solve the problem of determining X and L for an unlabeled text using a method based on minimum description length. They present a dynamic programming solution to this problem, and analyze a number of parameters that affect the overall accuracy of the system. Given this method to determine X and L, it is then trivial to label an unlabeled document according to d ⇒ {L_x if ∃L_x ∈ L}, and the length of each segment in X can then be used to determine the proportions of the document that are in each language. In this work, we use a reference implementation of SEGLANG kindly provided to us by the authors.

Using the text segmentation approach of SEGLANG to detect multilingual documents differs from LINGUINI and our method primarily in that LINGUINI and our method fragment the document into small sequences of bytes, and discard information about the relative order of the fragments. This is in contrast to SEGLANG, where this information is utilized in the sequential prediction of labels for consecutive segments of text, and is thus able to make better use of the locality of text (since there are likely to be monolingual blocks of text in any given multilingual document). The disadvantage of this is that the underlying model becomes more complex and hence more computationally expensive, as we observe in Section 5.

System      P_M   R_M   F_M   P_µ   R_µ   F_µ
Benchmark   .497  .467  .464  .833  .826  .829
Winner      .718  .703  .699  .932  .931  .932
SEGLANG     .801  .810  .784  .866  .946  .905
LINGUINI    .616  .535  .513  .713  .688  .700
Our method  .753  .771  .748  .945  .922  .933

Table 2: Results on the ALTW2010 dataset. "Benchmark" is the benchmark system proposed by the shared task organizers. "Winner" is the highest-Fµ system submitted to the shared task.

3.5 Evaluation

We seek to evaluate the ability of each method: (1) to correctly identify the language(s) present in each test document; and (2) for multilingual documents, to estimate the relative proportion of the document written in each language. In the first instance, this is a classification problem, and the standard notions of precision (P), recall (R) and F-score (F) apply. Consistent with previous work in language identification, we report both the document-level micro-average, as well as the language-level macro-average. For consistency with Baldwin and Lui (2010a), the macro-averaged F-score we report is the average of the per-class F-scores, rather than the harmonic mean of the macro-averaged precision and recall; as such, it is possible for the F-score to not fall between the precision and recall values. As is common practice, we compute the F-score for β = 1, giving equal importance to precision and recall.[2] We tested the difference in performance for statistical significance using an approximate randomization procedure (Yeh, 2000) with 10000 iterations. Within each table of results
(Tables 2, 3 and 4), all differences between systems are statistically significant at a p < 0.05 level.

[2] Intuitively, it may seem that the maximal precision and recall should be achieved when precision and recall are balanced. However, because of the multi-label nature of the task and variable number of labels assigned to a given document by our models, it is theoretically possible and indeed common in our results for the maximal macro-averaged F-score to be achieved when macro-averaged precision and recall are not balanced.

To evaluate the predictions of the relative proportions of a document d written in each detected language L_i, we compare the proportion predicted by our model to the gold-standard proportion, measured as a byte ratio as follows:

    gs(L_i | d) = (length of L_i part of d in bytes) / (length of d in bytes)    (7)

We report the correlation between predicted and actual proportions in terms of Pearson's r coefficient. We also report the mean absolute error (MAE) over all document–language pairs.

4 Experiments on ALTW2010

Our first experiment utilizes the ALTW2010 shared task dataset (Baldwin and Lui, 2010b), a synthetic dataset of 10000 bilingual documents[3] generated from Wikipedia data, introduced in the ALTW2010 shared task.[4] The dataset is organized into training, development and test partitions. Following standard machine learning practice, we train each system using the training partition, and tune parameters using the development partition. We then report macro- and micro-averaged precision, recall and F-score on the test partition, using the tuned parameters.

[3] With a small number of monolingual documents, formed by randomly selecting the two languages for a given document independently, leaving the possibility of the same two languages being selected.
[4] http://comp.mq.edu.au/programming/task_description/

The results on the ALTW2010 shared task dataset are summarized in Table 2. Each of the three systems we compare was re-trained using the training data provided for the shared task, with a slight difference: in the shared task, participants were provided with multilingual training documents, but the systems targeted in this research require monolingual training data. We thus split the training documents into monolingual segments using the metadata provided with the dataset. The metadata was only published after completion of the task and was not available to task participants. For comparison, we have included the benchmark results published by the shared task organizers, as well as the score attained by the winning entry (Tran et al., 2010).

We tune the parameters for each system using the development partition of the dataset, and report results on the test partition. For LINGUINI, there is a single parameter k to be tuned: the number of features per language. We tested values between 10000 and 50000, and selected 46000 features as the optimal value. For our method, there are two parameters to be tuned: (1) the number of features selected for each language, and (2) the threshold t for including a language. We tested features-per-language counts between 30 and 150, and found that adding features beyond 70 per language had minimal effect. We tested values of the threshold t from 0.01 to 0.15, and found the best value was 0.14. For SEGLANG, we introduce a threshold t on the minimum proportion of a document (measured in bytes) that must be labeled by a language before that language is included in the output set. This was done because our initial experiments indicate that SEGLANG tends to over-produce labels. Using the development data, we found the best value of t was 0.10.

We find that of the three systems tested, two outperform the winning entry to the shared task. This is more evident in the macro-averaged results than in the micro-averaged results. In micro-averaged terms, our method is the best performer, whereas on the macro-average, SEGLANG has the highest F-score. This suggests that our method does well on higher-density languages (relative to the ALTW2010 dataset), and poorly on lower-density languages. This also accounts for the higher micro-averaged precision but lower micro-averaged recall for our method as compared to SEGLANG. The improved macro-average F-score of SEGLANG comes at a much higher computational cost, which increases dramatically as the number of languages is increased. In our testing on a 16-core workstation, SEGLANG took almost 24 hours to process the ALTW2010 shared task test data, compared to 2 minutes for our method and 40 seconds for LINGUINI. As such, SEGLANG is poorly suited to detecting multilingual documents where a large number of candidate languages is considered.

The ALTW2010 dataset is an excellent starting point for this research, but it predominantly contains bilingual documents, making it difficult to assess the ability of systems to distinguish multilingual documents from monolingual ones. Furthermore, we are unable to use it to assess the ability of systems to detect more than 2 languages in a document. To address these shortcomings, we construct a new dataset in a similar vein. The dataset and experiments performed on it are described in the next section.

5 Experiments on WIKIPEDIAMULTI

To fully test the capabilities of our model, we generated WIKIPEDIAMULTI, a dataset that contains a mixture of monolingual and multilingual documents. To allow for replicability of our results and to facilitate research in language identification, we have made the dataset publicly available.[5] WIKIPEDIAMULTI is generated using excerpts from the mediawiki sources of Wikipedia pages downloaded from the Wikimedia Foundation.[6] The dumps we used are from July–August 2010.

[5] http://www.csse.unimelb.edu.au/~tim/
[6] http://dumps.wikimedia.org

To generate WIKIPEDIAMULTI, we first normalized the raw mediawiki documents. Mediawiki documents typically contain one paragraph per line, interspersed with structural elements. We filtered each document to remove all structural elements, and only kept documents that exceeded 2500 bytes after normalization. This yielded a collection of around 500,000 documents in 156 languages. From this initial document set (hereafter referred to as WIKICONTENT), we only retained languages that had more than 1000 documents (44 languages), and generated documents for WIKIPEDIAMULTI as follows:

1. randomly select the number of languages K (1 ≤ K ≤ 5)
2. randomly select a set of K languages S = {L_i ∈ L for i = 1 · · · K} without replacement
3. randomly select a document for each L_i ∈ S from WIKICONTENT without replacement
4. take the top 1/K lines of the document
5. join the K sections into a single document.

As a result of the procedure, the relative proportion of each language in a multilingual document tends not to be uniform, as it is conditioned on the length of the original document from which it was sourced, independent of the other K − 1 documents for the other languages that it was combined with. Overall, the average document length is 5500 bytes (standard deviation = 3800 bytes). Due to rounding up in taking the top 1/K lines (step 4), documents with higher K tend to be longer (6200 bytes for K = 5 vs 5100 bytes for K = 1).

The WIKIPEDIAMULTI dataset contains training, development and test partitions. The training partition consists of 5000 monolingual (i.e. K = 1) documents. The development partition consists of 5000 documents, 1000 documents for each value of K where 1 ≤ K ≤ 5. The test partition contains 200 documents for each K, for a total of 1000 documents. There is no overlap between any of the partitions.

System      P_M   R_M   F_M   P_µ   R_µ   F_µ
SEGLANG     .809  .975  .875  .771  .975  .861
LINGUINI    .853  .772  .802  .838  .774  .805
Our method  .962  .954  .957  .963  .955  .959

Table 3: Results on the WIKIPEDIAMULTI dataset.

5.1 Results over WIKIPEDIAMULTI

We trained each system using the monolingual training partition, and tuned parameters using the development partition. For LINGUINI, we tested feature counts between 10000 and 50000, and found that the effect was relatively small. We thus use 10000 features as the optimum value. For SEGLANG, we tested values for threshold t between 0.01 and 0.20, and found that the maximal macro-averaged F-score is attained when t = 0.06. Finally, for our method we tested features-per-language counts between 30 and 130 and found the best performance with 120 features per language, although the actual effect of varying this value is rather small. We tested values of the threshold t for adding an extra language to λ from 0.01 to 0.15, and found that the best results were attained when t = 0.02.

The results of evaluating each system on the test partition are summarized in Table 3. In this evaluation, our method clearly outperforms both SEGLANG and LINGUINI. The results on WIKIPEDIAMULTI and ALTW2010 are difficult to compare directly due to the different compositions of the two datasets. ALTW2010 is predominantly bilingual, whereas WIKIPEDIAMULTI contains documents with text in 1–5 languages. Furthermore, the average document in ALTW2010 is half the length of that in WIKIPEDIAMULTI. Overall, we observe that SEGLANG has a tendency to over-label (despite the introduction of the t parameter to reduce this effect), evidenced by high recall but lower precision. LINGUINI is inherently limited in that it is only able to detect up to 3 languages per document, causing recall to suffer on WIKIPEDIAMULTI. However, it also tends to always output 3 languages, regardless of the actual number of languages in the document, hurting precision. Furthermore, even on ALTW2010 it has lower recall than the other two systems.

6 Estimating Language Proportions

In addition to detecting multiple languages within a document, our method also estimates the relative proportions of the document that are written in each language. This information may be useful for detecting documents that are candidate bitexts for training machine translation systems, since we may expect languages in the document to be present in equal proportions. It also allows us to identify the predominant language of a document.

A core element of our model of a document is a distribution over a set of labels. Since each label corresponds to a language, as a first approximation, we take the probability mass associated with each label as a direct estimate of the proportion of the document written in that language. We examine the results for predicting the language proportions in the test partition of WIKIPEDIAMULTI. Mapping label distributions directly to language proportions produces excellent results, with a Pearson's r value of 0.863 and an MAE of 0.108.

Although labels have a one-to-one correspondence with languages, the label distribution does not actually correspond directly to the language proportion, because the distribution estimates the proportion of byte n-gram sequences associated with a label and not the proportion of bytes directly. The same number of bytes in different languages can produce different numbers of n-gram sequences, because after feature selection not all n-gram sequences are retained in the feature set. Hereafter, we refer to each n-gram sequence as a token, and the average number of tokens produced per byte of text as the token emission rate.

Figure 1: Example of calculating the n-gram emission rate for a text string. For the 18-byte string "the cat in the hat", the selected features yield 12 tokens, giving an emission rate of 18/12 = 1.5 bytes per token.

We estimate the per-language token emission rate (Figure 1) using the training partition of WIKIPEDIAMULTI. To improve our estimate of the language proportions, we correct our label distribution using estimates of the per-language token emission rate R_{L_i} in bytes per token for L_i ∈ L. Assume that a document d of length |d| is estimated to contain K languages in proportions P_i for i = 1 · · · K. The corrected estimate for the proportion of L_i is:

    Prop(L_i) = (P_i × R_{L_i}) / \sum_{j=1}^{K} (P_j × R_{L_j})    (8)

Note that the |d| term is common to the numerator and denominator and has thus been eliminated. This correction improves our estimates of language proportions. After correction, the Pearson's r rises to 0.981, and the MAE is reduced to 0.024. The improvement is most noticeable for language–document pairs where the proportion of the document in the given language is about 0.5 (Figure 2).

7 Real-world Multilingual Documents

So far, we have demonstrated the effectiveness of our proposed approach using synthetic data. The results have been excellent, and in this section we validate the approach by applying it to a real-world task that has recently been discussed in the literature. Yamaguchi and Tanaka-Ishii (2012) and King and Abney (2013) both observe that in trying to gather linguistic data for "non-major" languages from the web, one challenge faced is that documents retrieved often contain sections in another language. SEGLANG (the solution of Yamaguchi and Tanaka-Ishii (2012)) concurrently detects multilingual documents and segments them by language, but the approach is computationally expensive and has a tendency to over-label (Section 5). On the other hand, the solution of King and Abney (2013) is incomplete, and they specifically mention the need for an automatic
icmethod“toexamineamultilingualdocu-ment,andwithhighaccuracy,listthelanguagesthatarepresentinthedocument”.Inthissection,weshowthatourmethodisabletofillthisneed.WeSystemPRFBaseline0.7191.000.837SEGLANG0.7790.9910.872LINGUINI0.7290.9810.837Ourmethod0.9070.9160.912Table4:DetectionaccuracyforEnglish-languageinclusioninwebdocumentsfromtargetedwebcrawlsforlow-densitylanguages.makeuseofmanually-annotateddatakindlypro-videdtousbyBenKing,whichconsistsof149doc-umentscontaining42languagesretrievedfromthewebusingasetoftargetedqueriesforlow-densitylanguages.NotethatthedatasetdescribedinKingandAbney(2013)wasbasedonmanualconfirma-tionofthepresenceofEnglishinadditiontothelow-densitylanguageofprimaryinterest;ourdatasetcontainsthesebilingualdocumentsaswellasmono-lingualdocumentsinthelow-densitylanguageofin-terest.Ourpurposeinthissectionistoinvestigatetheabilityofautomaticsystemstoselectthissubsetofbilingualdocuments.Specifically,givenacol-lectionofdocumentsretrievedforatargetlanguage,thetaskistoidentifythedocumentsthatcontaintextinEnglishinadditiontothetargetlanguage.Thus,were-traineachsystemforeachtargetlanguage,us-ingonlytrainingdataforEnglishandthetargetlan-guage.WereservethedataprovidedbyBenKingforevaluation,andtrainourmethodsusingdatasep-aratelyobtainedfromtheUniversalDeclarationofHumanRights(UDHR).WhereUDHRtranslationsforaparticularlanguagewerenotavailable,weuseddatafromWikipediaorfromabibletranslation.Ap-proximately20–80kBofdatawereusedforeachlanguage.Aswedonothavesuitabledevelopmentdata,wemadeuseofthebestparametersforeachsystemfromtheexperimentsonWIKIPEDIAMULTI.Wefindthatall3systemsareabletodetectthateachdocumentcontainsthetargetlanguagewith100%accuracy.However,systemsvaryintheirabil-itytodetectifadocumentalsocontainsEnglishinadditiontothetargetlanguage.Thedetectionaccu-racyforEnglish-languageinclusionissummarizedinTable4.7Forcomparison,weincludeaheuristicbaselinebasedonlabelingalldocumentsascontain-7NotethatTable2andTable3bothreportmacroandmicro-averagedresultsacrossanumbero
flanguages.IncontrastTa-ble4onlyreportsresultsforEnglish,andthevaluesarenotdirectlycomparabletoourearlierevaluation. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 1 6 3 1 5 6 6 8 5 5 / / t l a c _ a _ 0 0 1 6 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 37 0.20.40.60.81.0Actual Proportion0.20.40.60.81.0Predicted ProportionPearson's r: 0.863MAE: 0.108(a)withoutemissionratecorrection0.20.40.60.81.0Actual Proportion0.20.40.60.81.0Predicted ProportionPearson's r: 0.981MAE: 0.0241(b)withemissionratecorrectionFigure2:Scatterplotofthepredictedvs.actuallanguageproportionsinadocumentforthetestpartitionofWIKIPEDIAMULTI(predictionsarefromourmethod;eachpointcorrespondstoadocument-languagepair).ingEnglish.Wefindthat,liketheheuristicbase-line,SEGLANGandLINGUINIbothtendtoover-labeldocuments,producingfalsepositivelabelsofEnglish,resultinginincreasedrecallattheexpenseofprecision.Ourmethodproduceslessfalsepos-itives(butslightlymorefalsenegatives).Overall,ourmethodattainsthebestFfordetectingEn-glishinclusions.ManualerroranalysissuggeststhatthefalsenegativesforourmethodgenerallyoccurwherearelativelysmallproportionofthedocumentiswritteninEnglish.8FutureWorkDocumentsegmentationbylanguagecouldbeac-complishedbyacombinationofourmethodandthemethodofKingandAbney(2013),whichcouldbecomparedtothemethodofYamaguchiandTanaka-Ishii(2012)inthecontextofconstructingcorporaforlow-densitylanguagesusingtheweb.Anotherareawehaveidentifiedinthispaperisthetuningoftheparametersαandβinourmodel(currentlyα=0andβ=1),whichmayhavesomeeffectonthesparsityofthemodel.Furtherworkisrequiredindealingwithcross-domaineffects,toallowfor“off-the-shelf”languageidentificationinmultilingualdocuments.Previousworkhasshownthatitispossibletogenerateadocu-mentrepresentationthatisrobusttovariationacrossdomains(LuiandBaldwin,2011),andweintendtoinvestigateiftheseresultsarealsoapplicabletolan-guageidentificationinmultilingual
documents. Another open question is the extension of the generative mixture models to "unknown" language identification (i.e. eliminating the closed-world assumption (Hughes et al., 2006)), which may be possible through the use of non-parametric mixture models such as Hierarchical Dirichlet Processes (Teh et al., 2006).

9 Conclusion

We have presented a system for language identification in multilingual documents using a generative mixture model inspired by supervised topic modeling algorithms, combined with a document representation based on previous research in language identification for monolingual documents. We showed that the system outperforms alternative approaches from the literature on synthetic data, as well as on real-world data from related research on linguistic corpus creation for low-density languages using the web as a resource. We also showed that our system is able to accurately estimate the proportion of the document written in each of the languages identified. We have made a full reference implementation of our system freely available,[8] as well as the synthetic dataset prepared for this paper (Section 5), in order to facilitate the adoption of this technology and further research in this area.

[8] https://github.com/saffsd/polyglot
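As a minimal sketch of the proportion-correction step from Section 6: this is not the authors' implementation, and the feature scheme (all byte bigrams, with no feature selection) and function names are hypothetical stand-ins for the paper's selected byte n-gram features. It only illustrates how a label distribution P can be rescaled by per-language emission rates R (Equation 8) and renormalized.

```python
# Hypothetical sketch of the emission-rate correction (Equation 8).
# Feature scheme and names are illustrative, not the paper's actual setup.

def count_tokens(text: bytes, features: set[bytes]) -> int:
    """Count occurrences of retained n-gram features ("tokens") in text."""
    n = 2  # byte bigrams only, for simplicity
    return sum(1 for i in range(len(text) - n + 1)
               if text[i:i + n] in features)

def emission_rate(train_text: bytes, features: set[bytes]) -> float:
    """Bytes per token for a language, estimated from training text."""
    return len(train_text) / count_tokens(train_text, features)

def corrected_proportions(P: list[float], R: list[float]) -> list[float]:
    """Equation 8: weight label mass P_i by the per-language emission
    rate R_i (bytes/token), then renormalize so proportions sum to 1."""
    weighted = [p * r for p, r in zip(P, R)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Toy usage: two languages with equal label mass but different emission
# rates; the language that needs more bytes per token (higher R) is
# credited with a larger share of the document's bytes.
P = [0.5, 0.5]
R = [2.0, 1.0]  # hypothetical bytes-per-token estimates
print(corrected_proportions(P, R))  # roughly [0.67, 0.33]
```

The correction leaves a uniform label distribution unchanged only when all emission rates are equal, which is the intuition behind why uncorrected label mass over- or under-states the byte proportion for some languages.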
Acknowledgments

We thank Hiroshi Yamaguchi for making a reference implementation of SEGLANG available to us, and Ben King for providing us with a collection of real-world multilingual web documents. This work was substantially improved as a result of the insightful feedback received from the reviewers.

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

Steven Abney and Steven Bird. 2010. The human language project: building a universal corpus of the world's languages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 88–97. Association for Computational Linguistics.

Beatrice Alex, Amit Dubey, and Frank Keller. 2007. Using foreign inclusion detection to improve parsing performance. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2007 (EMNLP-CoNLL 2007), pages 151–160, Prague, Czech Republic.

Timothy Baldwin and Marco Lui. 2010a. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 229–237, Los Angeles, USA.

Timothy Baldwin and Marco Lui. 2010b. Multilingual language identification: ALTW 2010 shared task dataset. In Proceedings of the Australasian Language Technology Workshop 2010 (ALTW 2010), pages 5–7, Melbourne, Australia.

Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media (LSM 2012), pages 65–74, Montréal, Canada.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Alessio Bosca and Luca Dini. 2010. Language identification strategies for cross language information retrieval. In Working Notes of the Cross Language Evaluation Forum (CLEF).

Jamie Callan and Mark Hoy. 2009.
ClueWeb09 Dataset. Available at http://boston.lti.cs.cmu.edu/Data/clueweb09/.

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, pages 161–175, Las Vegas, USA.

Hakan Ceylan and Yookyung Kim. 2009. Language identification of search engine queries. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1066–1074, Singapore.

Paul Cook and Marco Lui. 2012. langid.py for better language modelling. In Proceedings of the Australasian Language Technology Association Workshop 2012, pages 107–112, Dunedin, New Zealand.

Rafael Dueire Lins and Paulo Gonçalves. 2004. Automatic language identification of written texts. In Proceedings of the 2004 ACM Symposium on Applied Computing (SAC 2004), pages 1128–1133, Nicosia, Cyprus.

Ted Dunning. 1994. Statistical identification of language. Technical Report MCCS 940-273, Computing Research Laboratory, New Mexico State University.

Rayid Ghani, Rosie Jones, and Dunja Mladenic. 2004. Building minority language corpora by learning to generate web search queries. Knowledge and Information Systems, 7(1):56–83.

Emmanuel Giguet. 1995. Categorisation according to language: A step toward combining linguistic knowledge and statistical learning. In Proceedings of the 4th International Workshop on Parsing Technologies (IWPT-1995), Prague, Czech Republic.

Gregory Grefenstette. 1995. Comparing two language identification schemes. In Proceedings of Analisi Statistica dei Dati Testuali (JADT), pages 263–268, Rome, Italy.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235.

Thomas Griffiths. 2002. Gibbs sampling in the generative model of latent Dirichlet allocation. Technical Report, Stanford University.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy.
Genitiro Kikui. 1996. Identifying the coding system and language of on-line documents on the internet. In Proceedings of the 16th International Conference on Computational Linguistics (COLING '96), pages 652–657, Kyoto, Japan.

Ben King and Steven Abney. 2013. Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1110–1119, Atlanta, Georgia.

Canasai Kruengkrai, Prapass Srichaivattana, Virach Sornlertlamvanich, and Hitoshi Isahara. 2005. Language identification based on string kernels. In Proceedings of the 5th International Symposium on Communications and Information Technologies (ISCIT-2005), pages 896–899, Beijing, China.

David D. Lewis. 1997. The Reuters-21578 dataset. Available at http://www.daviddlewis.com/resources/testcollections/reuters21578/.

Wang Ling, Guang Xiang, Chris Dyer, Alan Black, and Isabel Trancoso. 2013. Microblogs as parallel corpora. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 176–186, Sofia, Bulgaria. Association for Computational Linguistics.

Jicheng Liu and Chunyan Liang. 2008. Text categorization of multilingual web pages in specific domain. In Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD '08), pages 938–944, Osaka, Japan.

Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 553–561, Chiang Mai, Thailand.

Bruno Martins and Mário J. Silva. 2005. Language identification in web pages. In Proceedings of the 2005 ACM Symposium on Applied Computing, pages 764–768, Santa Fe, USA.

Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, USA. Available as Technical Report WS-98-05, AAAI Press.

Andrew Kachites McCallum. 1999. Multi-label text classification with a mixture model trained by EM. In Proceedings of the AAAI '99 Workshop on Text Learning.

Paul McNamee. 2005. Language identification: a solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges, 20(3):94–101.

Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. 1999. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In Proceedings of the 22nd International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pages 74–81, Berkeley, USA.

John M. Prager. 1999a. Linguini: language identification for multilingual documents. In Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32), Maui, Hawaii.

John M. Prager. 1999b. Linguini: Language identification for multilingual documents. Journal of Management Information Systems, 16(3):71–101.

John Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, USA.

Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), pages 248–256, Singapore.

Radim Rehurek and Milan Kolkus. 2009. Language identification on the web: Extending the dictionary method. In Proceedings of Computational Linguistics and Intelligent Text Processing, 10th International Conference (CICLing 2009), pages 357–368, Mexico City, Mexico.

Philip Resnik. 1999. Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 527–534, College Park, USA.

Kevin P. Scannell. 2007. The Crúbadán Project: Corpus building for under-resourced languages. In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, pages 5–15, Louvain-la-Neuve, Belgium.

W. J. Teahan. 2000. Text classification and segmentation using minimum cross-entropy. In Proceedings of the 6th International Conference "Recherche d'Information Assistée par Ordinateur" (RIAO '00), pages 943–961, Paris, France.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581.

Jörg Tiedemann and Nikola Ljubešić. 2012. Efficient discrimination between closely related languages. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2619–2634, Mumbai, India.

Giang Binh Tran, Dat Ba Nguyen, and Bin Thanh Kieu. 2010. N-gram based approach for multilingual language identification. Poster. Available at http://comp.mq.edu.au/programming/task_description/VILangTek.pdf.

Fei Xia, Carrie Lewis, and William D. Lewis. 2010. Language ID for a thousand languages. In LSA Annual Meeting Extended Abstracts, Baltimore, USA.

Hiroshi Yamaguchi and Kumiko Tanaka-Ishii. 2012. Text segmentation by language using minimum description length. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 969–978, Jeju Island, Korea.

Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pages 947–953, Saarbrücken, Germany.