Transactions of the Association for Computational Linguistics, 1 (2013) 1–12. Action Editor: Sharon Goldwater.

Submitted 11/2012; Revised 1/2013; Published 3/2013. © 2013 Association for Computational Linguistics.

Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging

Oscar Täckström⋄†∗  Dipanjan Das‡  Slav Petrov‡  Ryan McDonald‡  Joakim Nivre†∗
⋄ Swedish Institute of Computer Science
† Department of Linguistics and Philology, Uppsala University
‡ Google Research, New York
oscar@sics.se  {dipanjand|slav|ryanmcd}@google.com  joakim.nivre@lingfil.uu.se
∗ Work primarily carried out while at Google Research.

Abstract

We consider the construction of part-of-speech taggers for resource-poor languages. Recently, manually constructed tag dictionaries from Wiktionary and dictionaries projected via bitext have been used as type constraints to overcome the scarcity of annotated data in this setting. In this paper, we show that additional token constraints can be projected from a resource-rich source language to a resource-poor target language via word-aligned bitext. We present several models to this end; in particular a partially observed conditional random field model, where coupled token and type constraints provide a partial signal for training. Averaged across eight previously studied Indo-European languages, our model achieves a 25% relative error reduction over the prior state of the art. We further present successful results on seven additional languages from different families, empirically demonstrating the applicability of coupled token and type constraints across a diverse set of languages.

1 Introduction

Supervised part-of-speech (POS) taggers are available for more than twenty languages and achieve accuracies of around 95% on in-domain data (Petrov et al., 2012). Thanks to their efficiency and robustness, supervised taggers are routinely employed in many natural language processing applications, such as syntactic and semantic parsing, named-entity recognition and machine translation. Unfortunately, the resources required to train supervised taggers are expensive to create and unlikely to exist for the majority of written languages. The necessity of building NLP tools for these resource-poor languages has been part of the motivation for research on unsupervised learning of POS taggers (Christodoulopoulos et al., 2010).

In this paper, we instead take a weakly supervised approach towards this problem. Recently, learning POS taggers with type-level tag dictionary constraints has gained popularity. Tag dictionaries, noisily projected via word-aligned bitext, have bridged the gap between purely unsupervised and fully supervised taggers, resulting in an average accuracy of over 83% on a benchmark of eight Indo-European languages (Das and Petrov, 2011). Li et al. (2012) further improved upon this result by employing Wiktionary¹ as a tag dictionary source, resulting in the hitherto best published result of almost 85% on the same setup.

¹ http://www.wiktionary.org/.

Although the aforementioned weakly supervised approaches have resulted in significant improvements over fully unsupervised approaches, they have not exploited the benefits of token-level cross-lingual projection methods, which are possible with word-aligned bitext between a target language of interest and a resource-rich source language, such as English. This is the setting we consider in this paper (§2).

While prior work has successfully considered both token- and type-level projection across word-aligned bitext for estimating the model parameters of generative tagging models (Yarowsky and Ngai, 2001; Xi and Hwa, 2005, inter alia), a key observation underlying the present work is that token- and type-level information offer different and complementary signals. On the one hand, high-confidence token-level projections offer precise constraints on a tag in a particular context. On the other hand, manually created type-level dictionaries can have broad coverage and do not suffer from word-alignment errors; they can therefore be used to filter systematic as well as random noise in token-level projections.

In order to reap these potential benefits, we propose a partially observed conditional random field (CRF) model (Lafferty et al., 2001) that couples token and type constraints in order to guide learning (§3). In essence, the model is given the freedom to push probability mass towards hypotheses consistent with both types of information. This approach is flexible: we can use either noisy projected or manually constructed dictionaries to generate type constraints; furthermore, we can incorporate arbitrary features over the input. In addition to standard (contextual) lexical features and transition features, we observe that adding features from a monolingual word clustering (Uszkoreit and Brants, 2008) can significantly improve accuracy. While most of these features can also be used in a generative feature-based hidden Markov model (HMM) (Berg-Kirkpatrick et al., 2010), we achieve the best accuracy with a globally normalized discriminative CRF model.

To evaluate our approach, we present extensive results on standard publicly available datasets for 15 languages: the eight Indo-European languages previously studied in this context by Das and Petrov (2011) and Li et al. (2012), and seven additional languages from different families, for which no comparable study exists. In §4 we compare various features, constraints and model types. Our best model uses type constraints derived from Wiktionary, together with token constraints derived from high-confidence word alignments. When averaged across the eight languages studied by Das and Petrov (2011) and Li et al. (2012), we achieve an accuracy of 88.8%. This is a 25% relative error reduction over the previous state of the art. Averaged across all 15 languages, our model obtains an accuracy of 84.5% compared to 78.5% obtained by a strong generative baseline. Finally, we provide an in-depth analysis of the relative contributions of the two types of constraints in §5.

2 Coupling Token and Type Constraints

Type-level information has been amply used in weakly supervised POS induction, either via purely manually crafted tag dictionaries (Smith and Eisner, 2005; Ravi and Knight, 2009; Garrette and Baldridge, 2012), noisily projected tag dictionaries (Das and Petrov, 2011) or through crowdsourced lexica, such as Wiktionary (Li et al., 2012). At the other end of the spectrum, there have been efforts that project token-level information across word-aligned bitext (Yarowsky and Ngai, 2001; Xi and Hwa, 2005). However, systems that combine both sources of information in a single model have yet to be fully explored. The following three subsections outline our overall approach for coupling these two types of information to build robust POS taggers that do not require any direct supervision in the target language.

2.1 Token Constraints

For the majority of resource-poor languages, there is at least some bitext with a resource-rich source language; for simplicity, we choose English as our source language in all experiments. It is then natural to consider using a supervised part-of-speech tagger to predict part-of-speech tags for the English side of the bitext. These predicted tags can subsequently be projected to the target side via automatic word alignments. This approach was pioneered by Yarowsky and Ngai (2001), who used the resulting partial target annotation to estimate the parameters of an HMM. However, due to the automatic nature of the word alignments and the POS tags, there will be significant noise in the projected tags. To conquer this noise, they used very aggressive smoothing techniques when training the HMM. Fossum and Abney (2005) used similar token-level projections, but instead combined projections from multiple source languages to filter out random projection noise as well as the systematic noise arising from different source language annotations and syntactic divergences.

2.2 Type Constraints

It is well known that given a tag dictionary, even if it is incomplete, it is possible to learn accurate POS taggers (Smith and Eisner, 2005; Goldberg et al., 2008; Ravi and Knight, 2009; Naseem et al., 2009). While widely differing in the specific model structure and learning objective, all of these approaches achieve excellent results. Unfortunately, they rely on tag dictionaries extracted directly from the underlying treebank data. Such dictionaries provide in-depth coverage of the test domain and also list all

inflected forms, both of which are difficult to obtain and unrealistic to expect for resource-poor languages.

In contrast, Das and Petrov (2011) automatically create type-level tag dictionaries by aggregating over projected token-level information extracted from bitext. To handle the noise in these automatic dictionaries, they use label propagation on a similarity graph to smooth (and also expand) the label distributions. While their approach produces good results and is applicable to resource-poor languages, it requires a complex multi-stage training procedure including the construction of a large distributional similarity graph.

Recently, Li et al. (2012) presented a simple and viable alternative: crowdsourced dictionaries from Wiktionary. While noisy and sparse in nature, Wiktionary dictionaries are available for 170 languages.² Furthermore, their quality and coverage are growing continuously (Li et al., 2012). By incorporating type constraints from Wiktionary into the feature-based HMM of Berg-Kirkpatrick et al. (2010), Li et al. were able to obtain the best published results in this setting, surpassing the results of Das and Petrov (2011) on eight Indo-European languages.

² http://meta.wikimedia.org/wiki/Wiktionary, October 2012.

[Figure 1: lattice diagram; graphic not recoverable from the extracted text]
Figure 1: Lattice representation of the inference search space $\mathcal{Y}(\mathbf{x})$ for an authentic sentence in Swedish ("The farming products must be pure and must not contain any additives"), after pruning with Wiktionary type constraints. The correct parts of speech are listed underneath each word. Bold nodes show projected token constraints $\tilde{\mathbf{y}}$. Underlined text indicates incorrect tags. The coupled constraints lattice $\widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}})$ consists of the bold nodes together with nodes for words that are lacking token constraints; in this case, the coupled constraints lattice thus defines exactly one valid path.

2.3 Coupled Constraints

Rather than relying exclusively on either token or type constraints, we propose to complement the one with the other during training. For each sentence in our training set, a partially constrained lattice of tag sequences is constructed as follows:

1. For each token whose type is not in the tag dictionary, we allow the entire tagset.
2. For each token whose type is in the tag dictionary, we prune all tags not licensed by the dictionary and mark the token as dictionary-pruned.
3. For each token that has a tag projected via a high-confidence bidirectional word alignment: if the projected tag is still present in the lattice, then we prune every tag but the projected tag for that token; if the projected tag is not present in the lattice, which can only happen for dictionary-pruned tokens, then we ignore the projected tag.

Figure 1 provides a running example. The lattice shows tags permitted after constraining the words to tags licensed by the dictionary (up until Step 2 from above). There is only a single token "Jordbruksprodukterna" ("the farming products") not in the dictionary; in this case the lattice permits the full set of tags. With token-level projections (Step 3; nodes with bold border in Figure 1), the lattice can be further pruned. In most cases, the projected tag is both correct and in the dictionary-pruned lattice. We thus successfully disambiguate such tokens and shrink the search space substantially.

There are two cases we highlight in order to show where our model can break. First, for the token "Jordbruksprodukterna", the erroneously projected tag ADJ will eliminate all other tags from the lattice, including the correct tag NOUN. Second, the token "några" ("any") has a single dictionary entry PRON and is missing the correct tag DET. In the case where

DET is the projected tag, we will not add it to the lattice and simply ignore it. This is because we hypothesize that the tag dictionary can be trusted more than the tags projected via noisy word alignments. As we will see in §4, taking the union of tags performs worse, which supports this hypothesis.

For generative models, such as HMMs (§3.1), we need to define only one lattice. For our best generative model this is the coupled token- and type-constrained lattice.³ At prediction time, in both the discriminative and the generative cases, we find the most likely label sequence using Viterbi decoding.

³ Other training methods exist as well, for example, contrastive estimation (Smith and Eisner, 2005).

For discriminative models, such as CRFs (§3.2), we need to define two lattices: one that the model moves probability mass towards and another one defining the overall search space (or partition function). In traditional supervised learning without a dictionary, the former is a trivial lattice containing the gold standard tag sequence and the latter is the set of all possible tag sequences spanning the tokens. With our best model, we will move mass towards the coupled token- and type-constrained lattice, such that the model can freely distribute mass across all paths consistent with these constraints. The lattice defining the partition function will be the full set of possible tag sequences when no dictionary is used; when a dictionary is used it will consist of all dictionary-pruned tag sequences (sans Step 3 above; the full set of possibilities shown in Figure 1 for our running example).

Figures 2 and 3 provide statistics regarding the supervision coverage and remaining ambiguity. Figure 2 shows that more than two thirds of all tokens in our training data are in Wiktionary. However, there is considerable variation between languages: Spanish has the highest coverage with over 90%, while Turkish, an agglutinative language with a vast number of word forms, has less than 50% coverage. Figure 3 shows that there is substantial uncertainty left after pruning with Wiktionary, since tokens are rarely fully disambiguated: 1.3 tags per token are allowed on average for types in Wiktionary.

Figure 2 further shows that high-confidence alignments are available for about half of the tokens for most languages (Japanese is a notable exception with less than 30% of the tokens covered). Intersecting the Wiktionary tags and the projected tags (Steps 2 and 3 above) filters out some of the potentially erroneous tags, but preserves the majority of the projected tags; the remaining, presumably more accurate projected tags cover almost half of all tokens, greatly reducing the search space that the learner needs to explore.

[Figure 2: bar chart. Panel title: Token coverage. Legend: Wiktionary, Projected, Projected+Filtered. y-axis: Percent of tokens covered (0 to 100). x-axis: avg, bg, cs, da, de, el, es, fr, it, ja, nl, pt, sl, sv, tr, zh.]
Figure 2: Wiktionary and projection dictionary coverage. Shown is the percentage of tokens in the target side of the bitext that are covered by Wiktionary, that have a projected tag, and that have a projected tag after intersecting the two.

[Figure 3: bar chart. y-axis: Number of tags per token (0.0 to 1.5). x-axis: avg, bg, cs, da, de, el, es, fr, it, ja, nl, pt, sl, sv, tr, zh.]
Figure 3: Average number of licensed tags per token on the target side of the bitext, for types in Wiktionary.

3 Models with Coupled Constraints

We now formally present how we couple token and type constraints and how we use these coupled constraints to train probabilistic tagging models. Let $\mathbf{x} = (x_1 x_2 \ldots x_{|\mathbf{x}|}) \in \mathcal{X}$ denote a sentence, where each token $x_i \in \mathcal{V}$ is an instance of a word type from the vocabulary $\mathcal{V}$ and let $\mathbf{y} = (y_1 y_2 \ldots y_{|\mathbf{x}|}) \in \mathcal{Y}$ denote a tag sequence, where $y_i \in \mathcal{T}$ is the tag assigned to token $x_i$ and $\mathcal{T}$ denotes the set of all possible part-of-speech tags. We denote the lattice of all admissible tag sequences for the sentence $\mathbf{x}$ by $\mathcal{Y}(\mathbf{x})$. This is the

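inference search space in which the tagger operates. As we shall see, it is crucial to constrain the size of this lattice in order to simplify learning when only incomplete supervision is available.

A tag dictionary maps a word type $x_j \in \mathcal{V}$ to a set of admissible tags $\mathcal{T}(x_j) \subseteq \mathcal{T}$. For word types not in the dictionary we allow the full set of tags $\mathcal{T}$ (while possible, in this paper we do not attempt to distinguish closed-class versus open-class words). When provided with a tag dictionary, the lattice of admissible tag sequences for a sentence $\mathbf{x}$ is $\mathcal{Y}(\mathbf{x}) = \mathcal{T}(x_1) \times \mathcal{T}(x_2) \times \ldots \times \mathcal{T}(x_{|\mathbf{x}|})$. When no tag dictionary is available, we simply have the full lattice $\mathcal{Y}(\mathbf{x}) = \mathcal{T}^{|\mathbf{x}|}$.

Let $\tilde{\mathbf{y}} = (\tilde{y}_1 \tilde{y}_2 \ldots \tilde{y}_{|\mathbf{x}|})$ be the projected tags for the sentence $\mathbf{x}$. Note that $\{\tilde{y}_i\} = \emptyset$ for tokens without a projected tag. Next, we define a piecewise operator $\frown$ that couples $\tilde{\mathbf{y}}$ and $\mathcal{Y}(\mathbf{x})$ with respect to every sentence index, which results in a token- and type-constrained lattice. The operator behaves as follows, consistent with the high-level description in §2.3:

\[
\widehat{\mathcal{T}}(x_i, \tilde{y}_i) = \tilde{y}_i \frown \mathcal{T}(x_i) =
\begin{cases}
\{\tilde{y}_i\} & \text{if } \tilde{y}_i \in \mathcal{T}(x_i) \\
\mathcal{T}(x_i) & \text{otherwise.}
\end{cases}
\]

We denote the token- and type-constrained lattice as $\widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}}) = \widehat{\mathcal{T}}(x_1, \tilde{y}_1) \times \widehat{\mathcal{T}}(x_2, \tilde{y}_2) \times \ldots \times \widehat{\mathcal{T}}(x_{|\mathbf{x}|}, \tilde{y}_{|\mathbf{x}|})$. Note that when token-level projections are not used, the dictionary-pruned lattice and the lattice with coupled constraints are identical, that is $\widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}}) = \mathcal{Y}(\mathbf{x})$.

3.1 HMMs with Coupled Constraints

A first-order hidden Markov model (HMM) specifies the joint distribution of a sentence $\mathbf{x} \in \mathcal{X}$ and a tag sequence $\mathbf{y} \in \mathcal{Y}(\mathbf{x})$ as:

\[
p_\beta(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{|\mathbf{x}|} \underbrace{p_\beta(x_i \mid y_i)}_{\text{emission}} \, \underbrace{p_\beta(y_i \mid y_{i-1})}_{\text{transition}}.
\]

We follow the recent trend of using a log-linear parametrization of the emission and the transition distributions, instead of a multinomial parametrization (Chen, 2003). This allows model parameters $\beta$ to be shared across categorical events, which has been shown to give superior performance (Berg-Kirkpatrick et al., 2010). The categorical emission and transition events are represented by feature vectors $\phi(x_i, y_i)$ and $\phi(y_i, y_{i-1})$. Each element of the parameter vector $\beta$ corresponds to a particular feature; the component log-linear distributions are:

\[
p_\beta(x_i \mid y_i) = \frac{\exp\left(\beta^\top \phi(x_i, y_i)\right)}{\sum_{x'_i \in \mathcal{V}} \exp\left(\beta^\top \phi(x'_i, y_i)\right)},
\quad \text{and} \quad
p_\beta(y_i \mid y_{i-1}) = \frac{\exp\left(\beta^\top \phi(y_i, y_{i-1})\right)}{\sum_{y'_i \in \mathcal{T}} \exp\left(\beta^\top \phi(y'_i, y_{i-1})\right)}.
\]

In maximum-likelihood estimation of the parameters, we seek to maximize the likelihood of the observed parts of the data. For this we need the joint marginal distribution $p_\beta(\mathbf{x}, \widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}}))$ of a sentence $\mathbf{x}$ and its coupled constraints lattice $\widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}})$, which is obtained by marginalizing over all consistent outputs:

\[
p_\beta(\mathbf{x}, \widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}})) = \sum_{\mathbf{y} \in \widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}})} p_\beta(\mathbf{x}, \mathbf{y}).
\]

If there are no projections and no tag dictionary, then $\widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}}) = \mathcal{T}^{|\mathbf{x}|}$, and thus $p_\beta(\mathbf{x}, \widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}})) = p_\beta(\mathbf{x})$, which reduces to fully unsupervised learning. The $\ell_2$-regularized marginal joint log-likelihood of the constrained training data $\mathcal{D} = \{(\mathbf{x}^{(i)}, \tilde{\mathbf{y}}^{(i)})\}_{i=1}^{n}$ is:

\[
\mathcal{L}(\beta; \mathcal{D}) = \sum_{i=1}^{n} \log p_\beta(\mathbf{x}^{(i)}, \widehat{\mathcal{Y}}(\mathbf{x}^{(i)}, \tilde{\mathbf{y}}^{(i)})) - \gamma \lVert \beta \rVert_2^2. \tag{1}
\]

We follow Berg-Kirkpatrick et al. (2010) and take a direct gradient approach for optimizing Eq. 1 with L-BFGS (Liu and Nocedal, 1989). We set $\gamma = 1$ and run 100 iterations of L-BFGS. One could also employ the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) to optimize this objective, although the relative merits of EM versus direct gradient training for these models are still a topic of debate (Berg-Kirkpatrick et al., 2010; Li et al., 2012).⁴ Note that since the marginal likelihood is non-concave, we are only guaranteed to find a local maximum of Eq. 1.

After estimating the model parameters $\beta$, the tag sequence $\mathbf{y}^* \in \mathcal{Y}(\mathbf{x})$ for a sentence $\mathbf{x} \in \mathcal{X}$ is predicted by choosing the one with maximal joint probability:

\[
\mathbf{y}^* \leftarrow \arg\max_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} p_\beta(\mathbf{x}, \mathbf{y}).
\]

⁴ We trained the HMM with EM as well, but achieved better results with direct gradient training and hence omit those results.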
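The lattice construction of §2.3, formalized by the piecewise coupling operator in §3, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `coupled_lattice` and `count_paths`, the representation of a lattice as a list of tag sets, and the dictionary entry for the middle token are assumptions made purely for illustration. Only the treatment of "Jordbruksprodukterna" and "några" follows the running example in the text.

```python
from typing import Dict, FrozenSet, List, Optional, Set

# The 12 universal POS tags of Petrov et al. (2012), standing in for the full tagset T.
TAGSET: FrozenSet[str] = frozenset(
    {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
     "ADP", "NUM", "CONJ", "PRT", ".", "X"}
)


def coupled_lattice(tokens: List[str],
                    projected: List[Optional[str]],
                    dictionary: Dict[str, Set[str]],
                    tagset: FrozenSet[str] = TAGSET) -> List[Set[str]]:
    """Return, for each token, the tag set allowed by the coupled constraints.

    `projected[i]` is the projected tag for token i, or None if the token
    has no high-confidence projection (an assumed encoding of the empty set).
    """
    lattice = []
    for word, proj in zip(tokens, projected):
        # Steps 1 and 2: full tagset for unknown types, dictionary tags otherwise.
        allowed = set(dictionary.get(word, tagset))
        # Step 3: keep only the projected tag if it survived dictionary pruning;
        # a projected tag that the dictionary does not license is ignored.
        if proj is not None and proj in allowed:
            allowed = {proj}
        lattice.append(allowed)
    return lattice


def count_paths(lattice: List[Set[str]]) -> int:
    """Number of tag sequences in the lattice (product of per-token set sizes)."""
    total = 1
    for tags in lattice:
        total *= len(tags)
    return total


# Toy data: the entry for "vara" is invented for illustration only.
toy_dictionary = {"vara": {"VERB", "NOUN"}, "några": {"PRON"}}
toy_tokens = ["Jordbruksprodukterna", "vara", "några"]
toy_projected = ["ADJ", None, "DET"]

toy_lattice = coupled_lattice(toy_tokens, toy_projected, toy_dictionary)
print(toy_lattice)
print(count_paths(toy_lattice))
```

In this toy input the first token is not in the dictionary, so its projected tag ADJ prunes every other tag (including the correct NOUN), while the projected DET for "några" is discarded because the dictionary lists only PRON. These are the two failure cases highlighted in §2.3. Passing `None` for every projection yields the dictionary-pruned lattice that defines the search space.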

3.2 CRFs with Coupled Constraints

Whereas an HMM models the joint probability of the input $\mathbf{x} \in \mathcal{X}$ and output $\mathbf{y} \in \mathcal{Y}(\mathbf{x})$, using locally normalized component distributions, a conditional random field (CRF) instead models the probability of the output conditioned on the input as a globally normalized log-linear distribution (Lafferty et al., 2001):

\[
p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\left(\theta^\top \Phi(\mathbf{x}, \mathbf{y})\right)}{\sum_{\mathbf{y}' \in \mathcal{Y}(\mathbf{x})} \exp\left(\theta^\top \Phi(\mathbf{x}, \mathbf{y}')\right)},
\]

where $\theta$ is a parameter vector. As for the HMM, $\mathcal{Y}(\mathbf{x})$ is not necessarily the full space of possible tag sequences; specifically, for us, it is the dictionary-pruned lattice without the token constraints.

With a first-order Markov assumption, the feature function factors as:

\[
\Phi(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{|\mathbf{x}|} \phi(\mathbf{x}, y_i, y_{i-1}).
\]

This model is more powerful than the HMM in that it can use richer feature definitions, such as joint input/transition features and features over a wider input context. We model a marginal conditional probability, given by the total probability of all tag sequences consistent with the lattice $\widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}})$:

\[
p_\theta(\widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}}) \mid \mathbf{x}) = \sum_{\mathbf{y} \in \widehat{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{y}})} p_\theta(\mathbf{y} \mid \mathbf{x}).
\]

The parameters of this constrained CRF are estimated by maximizing the $\ell_2$-regularized marginal conditional log-likelihood of the constrained data (Riezler et al., 2002):

\[
\mathcal{L}(\theta; \mathcal{D}) = \sum_{i=1}^{n} \log p_\theta(\widehat{\mathcal{Y}}(\mathbf{x}^{(i)}, \tilde{\mathbf{y}}^{(i)}) \mid \mathbf{x}^{(i)}) - \gamma \lVert \theta \rVert_2^2. \tag{2}
\]

As with Eq. 1, we maximize Eq. 2 with 100 iterations of L-BFGS and set $\gamma = 1$. In contrast to the HMM, after estimating the model parameters $\theta$, the tag sequence $\mathbf{y}^* \in \mathcal{Y}(\mathbf{x})$ for a sentence $\mathbf{x} \in \mathcal{X}$ is chosen as the sequence with the maximal conditional probability:

\[
\mathbf{y}^* \leftarrow \arg\max_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} p_\theta(\mathbf{y} \mid \mathbf{x}).
\]

4 Empirical Study

We now present a detailed empirical study of the models proposed in the previous sections. In addition to comparing with the state of the art in Das and Petrov (2011) and Li et al. (2012), we present models with several combinations of token and type constraints, and additional features incorporating word clusters. Both generative and discriminative models are explored.

4.1 Experimental Setup

Before delving into the experimental details, we present our setup and datasets.

Languages. We evaluate on eight target languages used in previous work (Das and Petrov, 2011; Li et al., 2012) and on seven additional languages (see Table 1). While the former eight languages all belong to the Indo-European family, we broaden the coverage to language families more distant from the source language (for example, Chinese, Japanese and Turkish). We use the treebanks from the CoNLL shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007) for evaluation.⁵ The two-letter abbreviations from the ISO 639-1 standard are used when referring to these languages in tables and figures.

⁵ For French we use the treebank of Abeillé et al. (2003).

Tagset. In all cases, we map the language-specific POS tags to universal POS tags using the mapping of Petrov et al. (2012).⁶ Since we use indirect supervision via projected tags or Wiktionary, the model states induced by all models correspond directly to POS tags, enabling us to compute tagging accuracy without a greedy 1-to-1 or many-to-1 mapping.

⁶ We use version 1.03 of the mappings available at http://code.google.com/p/universal-pos-tags/.

Bitext. For all experiments, we use English as the source language. Depending on availability, there are between 1M and 5M parallel sentences for each language. The majority of the parallel data is gathered automatically from the web using the method of Uszkoreit et al. (2010). We further include data from Europarl (Koehn, 2005) and from the UN parallel corpus (UN, 2006), for languages covered by these corpora. The English side of the bitext is POS tagged with a standard supervised CRF tagger, trained on the Penn Treebank (Marcus et al., 1993), with tags mapped to universal tags. The parallel sentences are word aligned with the aligner of DeNero and Macherey (2011). Intersected high-confidence alignments (confidence > 0.95) are extracted and aggregated into projected type-level dictionaries. For purely practical reasons, the training data with token-level projections is created by randomly sampling target-side sentences with a total of 500K tokens.

Wiktionary. We use a snapshot of the Wiktionary word definitions, and follow the heuristics of Li et al. (2012) for creating the Wiktionary dictionary by mapping the Wiktionary tags to universal POS tags.⁷

⁷ The definitions were downloaded on August 31, 2012 from http://toolserver.org/~enwikt/definitions/. This snapshot is more recent than that used by Li et al.

Features. For all models, we use only an identity feature for tag-pair transitions. We use five features that couple the current tag and the observed word (analogous to the emission in an HMM): word identity, suffixes of up to length 3, and three indicator features that fire when the word starts with a capital letter, contains a hyphen or contains a digit. These are the same features as those used by Das and Petrov (2011). Finally, for some models we add a word cluster feature that couples the current tag and the word cluster identity of the word. These (monolingual) word clusters are induced with the exchange algorithm (Uszkoreit and Brants, 2008). We set the number of clusters to 256 across all languages, as this has previously been shown to produce robust results for similar tasks (Turian et al., 2010; Täckström et al., 2012). The clusters for each language are learned on a large monolingual newswire corpus.

4.2 Models with Type Constraints

To examine the sole effect of type constraints, we experiment with the HMM, drawing constraints from three different dictionaries. Table 1 compares the performance of our models with the best results of Das and Petrov (2011, D&P) and Li et al. (2012, LG&T). As in previous work, training is done exclusively on the training portion of each treebank, stripped of any manual linguistic annotation.

        Prior work    HMM with type constraints
Lang.   D&P    LG&T   Y^HMM_proj.  Y^HMM_wik.  Y^HMM_union  Y^HMM_union+C
bg      –      –      84.2         68.1        87.2         87.9
cs      –      –      75.4         70.2        75.4         79.2
da      83.2   83.3   87.7         82.0        78.4         89.5
de      82.8   85.8   86.6         85.1        80.0         88.3
el      82.5   79.2   83.3         83.8        86.0         83.2
es      84.2   86.4   83.9         83.7        88.3         87.3
fr      –      –      88.4         75.7        75.6         86.6
it      86.8   86.5   89.0         85.4        89.9         90.6
ja      –      –      45.2         76.9        74.4         73.7
nl      79.5   86.3   81.7         79.1        83.8         82.7
pt      87.9   84.5   86.7         79.0        83.8         90.4
sl      –      –      78.7         64.8        82.8         83.4
sv      80.5   86.1   80.6         85.9        85.9         86.7
tr      –      –      66.2         44.1        65.1         65.7
zh      –      –      59.2         73.9        63.2         73.0
avg(8)  83.4   84.8   84.9         83.0        84.5         87.3
avg     –      –      78.5         75.9        80.0         83.2

Table 1: Tagging accuracies for type-constrained HMM models. D&P is the "With LP" model in Table 2 of Das and Petrov (2011), while LG&T is the "SHMM-ME" model in Table 2 of Li et al. (2012). Y^HMM_proj., Y^HMM_wik. and Y^HMM_union are HMMs trained solely with type constraints derived from the projected dictionary, Wiktionary and the union of these dictionaries, respectively. Y^HMM_union+C is equivalent to Y^HMM_union with additional cluster features. All models are trained on the treebank of each language, stripped of gold labels. Results are averaged over the 8 languages from Das and Petrov (2011), denoted avg(8), as well as over the full set of 15 languages, denoted avg.

We first use all of our parallel data to generate projected tag dictionaries: the English POS tags are projected across word alignments and aggregated to tag distributions for each word type. As in Das and Petrov (2011), the distributions are then filtered with a threshold of 0.2 to remove noisy tags and to create an unweighted tag dictionary. We call this model Y^HMM_proj.; its average accuracy of 84.9% on the eight languages is higher than the 83.4% of D&P and on par with LG&T (84.8%).⁸ Our next model (Y^HMM_wik.) simply draws type constraints from Wiktionary. It slightly underperforms LG&T (83.0%), presumably because they used a second-order HMM. As a simple extension to these two models, we take the union of the projected dictionary and Wiktionary to constrain an HMM, which we name Y^HMM_union. This model performs a little worse on the eight Indo-European languages (84.5%), but gives an improvement over the projected dictionary when evaluated across all 15 languages (80.0% vs. 78.5%).

⁸ Our model corresponds to the weaker, "No LP" projection of Das and Petrov (2011). We found that label propagation was only beneficial when small amounts of bitext were available.
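The construction of the projected dictionary and of the union dictionary described in §4.2 can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names `projected_dictionary` and `union_dictionary`, the input format (a flat list of word and projected-tag pairs), and the choice to keep tags whose relative frequency is greater than or equal to the threshold are all assumptions; the text only states that the distributions are filtered with a threshold of 0.2.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def projected_dictionary(projections: Iterable[Tuple[str, str]],
                         threshold: float = 0.2) -> Dict[str, Set[str]]:
    """Aggregate token-level projections into an unweighted tag dictionary.

    `projections` yields (word, projected_tag) pairs from the aligned bitext.
    Tags whose relative frequency for a word type falls below `threshold`
    are treated as noise and dropped (>= comparison is an assumption).
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, tag in projections:
        counts[word][tag] += 1

    dictionary: Dict[str, Set[str]] = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        kept = {tag for tag, c in tag_counts.items() if c / total >= threshold}
        # A word type is only listed if at least one tag survives the filter.
        if kept:
            dictionary[word] = kept
    return dictionary


def union_dictionary(a: Dict[str, Set[str]],
                     b: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Union of two tag dictionaries: a word keeps every tag listed in either."""
    merged = {word: set(tags) for word, tags in a.items()}
    for word, tags in b.items():
        merged.setdefault(word, set()).update(tags)
    return merged


# Toy projections (invented counts, for illustration only).
toy_projections = (
    [("bank", "NOUN")] * 8 + [("bank", "VERB")] + [("bank", "ADJ")]
    + [("run", "VERB")] * 3 + [("run", "NOUN")] * 2
)
proj_dict = projected_dictionary(toy_projections)
print(proj_dict)
print(union_dictionary(proj_dict, {"bank": {"VERB"}, "the": {"DET"}}))
```

In the toy input, the VERB and ADJ readings of "bank" each account for 10% of its projections and are filtered out, whereas both readings of "run" exceed the threshold and are kept. The union step then re-admits VERB for "bank" because the second (hypothetical) dictionary licenses it.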

TokenconstraintsHMMwithcoupledconstraintsCRFwithcoupledconstraintsLang.YHMMunion+C+L˜yHMM+C+L˜yCRF+C+LbYHMMproj.+C+LbYHMMwik.+C+LbYHMMunion+C+LbYCRFproj.+C+LbYCRFwik.+C+LbYCRFunion+C+Lbg87.777.984.184.583.986.786.087.885.4cs78.365.474.974.881.176.974.780.3**75.0da87.380.985.187.285.688.185.588.2*86.0de87.781.483.385.089.386.784.490.5**85.5el85.981.177.880.187.083.979.689.5**79.7es89.1**84.185.583.785.988.085.787.186.0fr88.4**83.584.785.986.487.484.987.285.6it89.685.288.588.787.689.888.389.389.4ja72.847.654.243.276.170.544.981.0**68.0nl83.178.482.482.384.283.283.185.9**83.2pt89.184.787.086.688.788.087.991.0**88.3sl82.469.878.278.581.880.179.782.380.0sv86.180.184.282.387.986.984.488.9**85.5tr62.458.164.564.661.864.865.064.1**65.2zh72.652.739.556.074.173.359.774.4**73.4avg(8)87.282.084.284.587.086.884.988.885.4avg82.874.176.977.682.882.378.284.581.1Table2:Taggingaccuraciesformodelswithtokenconstraintsandcoupledtokenandtypeconstraints.Allmodelsuseclusterfeatures(…+C)andaretrainedonlargetrainingsetseachcontaining500ktokenswith(partial)token-levelprojections(…+L).Thebesttype-constrainedmodel,trainedonthelargerdatasets,YHMMunion+C+L,isincludedforcomparison.TheremainingcolumnscorrespondtoHMMandCRFmodelstrainedonlywithtokenconstraints(˜y…)andwithcoupledtokenandtypeconstraints(bY…).Thelatteraretrainedusingtheprojecteddictionary(·proj.),Wiktionary(·wik.)andtheunionofthesedictionaries(·union),respectively.Thesearchspacesofthemodelstrainedwithcoupledconstraints(bY…)areeachprunedwiththerespectivetagdictionaryusedtoderivethecoupledconstraints.TheobserveddifferencebetweenbYCRFwik.+C+LandYHMMunion+C+Lisstatisticallysignificantatp<0.01(**)andp<0.015(*)accordingtoapairedbootstraptest(EfronandTibshirani,1993).Significancewasnotassessedforavgoravg(8).Wenextaddmonolingualclusterfeaturestothemodelwiththeuniondictionary.Thismodel,YHMMunion+C,significantlyoutperformsallothertype-constrainedmodels,demonstratingtheutilityofword-clusterfeatures.9Forfurtherexploration,wetrainthesamemodelontheda
tasetscontaining500Ktokenssampledfromthetargetsideoftheparalleldata(YHMMunion+C+L);thisisdonetoexploretheeffectsoflargedataduringtraining.Wefindthattrainingonthesedatasetsresultinanaverageaccuracyof87.2%whichiscomparabletothe87.3%reportedforYHMMunion+CinTable1.ThisshowsthatthedifferentsourcedomainandamountoftrainingdatadoesnotinfluencetheperformanceoftheHMMsignificantly.Finally,wetrainCRFmodelswherewetreattypeconstraintsasapartiallyobservedlatticeandusethefullunprunedlatticeforcomputingthepartitionfunc-9Thesearemonolingualclusters.Bilingualclustersasintro-ducedinT¨ackstr¨ometal.(2012)mightbringadditionalbenefits.tion(§3.2).Duetospaceconsiderations,theresultsoftheseexperimentsarenotshownintable1.Weob-servesimilartrendsintheseresults,butonaverage,accuraciesaremuchlowercomparedtothetype-constrainedHMMmodels;theCRFmodelwiththeuniondictionaryalongwithclusterfeaturesachievesanaverageaccuracyof79.3%whentrainedonsamedata.Thisresultisnotunsurprising.First,theCRF’ssearchspaceisfullyunconstrained.Second,thedic-tionaryonlyprovidesaweaksetofobservationcon-straints,whichdonotprovidesufficientinformationtosuccessfullytrainadiscriminativemodel.How-ever,aswewillobservenext,couplingthedictionaryconstraintswithtoken-levelinformationsolvesthisproblem.4.3ModelswithTokenandTypeConstraintsWenowproceedtoaddtoken-levelinformation,focusinginparticularoncoupledtokenandtype l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 0 5 1 5 6 6 6 2 7 / / t l a c _ a _ 0 0 2 0 5 p d . 
constraints. Since it is not possible to generate projected token constraints for our monolingual treebanks, we train all models in this subsection on the 500K-token datasets sampled from the bitext. As a baseline, we first train HMM and CRF models that use only projected token constraints ($\tilde{y}^{\text{HMM}}$+C+L and $\tilde{y}^{\text{CRF}}$+C+L). As shown in Table 2, these models underperform the best type-level model ($Y^{\text{HMM}}_{\text{union}}$+C+L),¹⁰ which confirms that projected token constraints are not reliable on their own. This is in line with similar projection models previously examined by Das and Petrov (2011).

We then study models with coupled token and type constraints. These models use the same three dictionaries as used in §4.2, but additionally couple the derived type constraints with projected token constraints; see the caption of Table 2 for a list of these models. Note that since we only allow projected tags that are licensed by the dictionary (Step 3 of the transfer, §2.3), the actual token constraints used in these models vary with the different dictionaries.

From Table 2, we see that coupled constraints are superior to token constraints, both with the HMM and with the CRF. However, for the HMM, coupled constraints do not provide any benefit over type constraints alone, in particular when the projected dictionary or the union dictionary is used to derive the coupled constraints ($\widehat{Y}^{\text{HMM}}_{\text{proj.}}$+C+L and $\widehat{Y}^{\text{HMM}}_{\text{union}}$+C+L). We hypothesize that this is because these dictionaries (in particular the former) have the same bias as the token-level tag projections, so that the dictionary is unable to correct the systematic errors in the projections (see §2.1). Since the token constraints are stronger than the type constraints in the coupled models, this bias may have a substantial impact. With the Wiktionary dictionary, the difference between the type-constrained and the coupled-constrained HMM is negligible: $Y^{\text{HMM}}_{\text{union}}$+C+L and $\widehat{Y}^{\text{HMM}}_{\text{wik.}}$+C+L both average at an accuracy of 82.8%.

The CRF model, on the other hand, is able to take advantage of the complementary information in the coupled constraints, provided that the dictionary is able to filter out the systematic token-level errors. With a dictionary derived from Wiktionary and projected token-level constraints, $\widehat{Y}^{\text{CRF}}_{\text{wik.}}$+C+L performs better than all the remaining models, with an average accuracy of 88.8% across the eight Indo-European languages available to D&P and LG&T. Averaged over all 15 languages, its accuracy is 84.5%.

¹⁰ To make the comparison fair vis-a-vis potential divergences in training domains, we compare to the best type-constrained model trained on the same 500K-token training sets.

[Figure 4 plot: tagging accuracy (y-axis) against number of token-level projections (x-axis), with one facet per number of tags listed in Wiktionary.]

Figure 4: Relative influence of token and type constraints on tagging accuracy in the $\widehat{Y}^{\text{CRF}}_{\text{wik.}}$+C+L model. Word types are categorized according to a) their number of Wiktionary tags (0, 1, 2 or 3+ tags, with 0 representing no Wiktionary entry; top-axis) and b) the number of times they are token-constrained in the training set (divided into buckets of 0, 1-9, 10-99 and 100+ occurrences; x-axis). The boxes summarize the accuracy distributions across languages for each word type category as defined by a) and b). The horizontal line in each box marks the median accuracy, the top and bottom of the box mark the third and first quartiles, respectively, while the whiskers mark the minimum and maximum values of the accuracy distribution.

5 Further Analysis

In this section we provide a detailed analysis of the impact of token versus type constraints and we study the pruning and filtering mistakes resulting from incomplete Wiktionary entries. This analysis is based on the training portion of each treebank.

5.1 Influence of Token and Type Constraints

The empirical success of the model trained with coupled token and type constraints confirms that these constraints indeed provide complementary signals. Figure 4 provides a more detailed view of the relative benefits of each type of constraint. We observe several interesting trends.

First, word types that occur with more token constraints during training are generally tagged more accurately, regardless of whether these types occur in Wiktionary. The most common scenario is for a word type to have exactly one tag in Wiktionary and to occur with this projected tag over 100 times in the training set (facet 1, rightmost box). These common word types are typically tagged very accurately across all languages.

[Figure 5 plot: pruning accuracy (y-axis) against number of corrected Wiktionary entries (x-axis).]

Figure 5: Average pruning accuracy (line) across languages (dots) as a function of the number of hypothetically corrected Wiktionary entries for the k most frequent word types. For example, position 100 on the x-axis corresponds to manually correcting the entries for the 100 most frequent types, while position 0 corresponds to experimental conditions.

Second, the word types that are ambiguous according to Wiktionary (facets 2 and 3) are predominantly frequent ones. The accuracy is typically lower for these words compared to the unambiguous words. However, as the number of projected token constraints is increased from zero to 100+ observations, the ambiguous words are effectively disambiguated by the token constraints. This shows the advantage of intersecting token and type constraints.

Finally, projection generally helps for words that are not in Wiktionary, although the accuracy for these words never reaches the accuracy of the words with only one tag in Wiktionary. Interestingly, words that occur with a projected tag constraint fewer than 100 times are tagged more accurately for types not in the dictionary compared to ambiguous word types with the same number of projected constraints. A possible explanation for this is that the ambiguous words are inherently more difficult to predict and that most of the words that are not in Wiktionary are less common words that tend to also be less ambiguous.
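The categorization behind Figure 4 can be illustrated with a minimal sketch. It is not the authors' code: the function names, the input format (a list of word, gold tag, predicted tag triples for one language) and the choice of token-level accuracy within each category are assumptions made for illustration. Only the bucket boundaries (0, 1, 2, 3+ Wiktionary tags; 0, 1-9, 10-99, 100+ token constraints) come from the figure caption.

```python
from collections import defaultdict


def tag_bucket(n_tags):
    """Facet label: 0, 1, 2 or 3+ Wiktionary tags (0 = no entry)."""
    return str(n_tags) if n_tags < 3 else "3+"


def count_bucket(n):
    """Bucket for how often a word type is token-constrained in training."""
    if n == 0:
        return "0"
    if n < 10:
        return "1-9"
    if n < 100:
        return "10-99"
    return "100+"


def accuracy_by_category(tokens, wiktionary, projection_counts):
    """Accuracy per (tag bucket, count bucket) category for one language.

    tokens: iterable of (word, gold_tag, predicted_tag) triples (hypothetical format).
    wiktionary: dict mapping a word type to its set of licensed tags.
    projection_counts: dict mapping a word type to the number of times it
        received a projected token constraint in the training set.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for word, gold, pred in tokens:
        key = (
            tag_bucket(len(wiktionary.get(word, ()))),
            count_bucket(projection_counts.get(word, 0)),
        )
        total[key] += 1
        correct[key] += int(gold == pred)
    return {key: correct[key] / total[key] for key in total}
```

Running this once per language and collecting the values for each category would give the per-category accuracy distributions that the boxes in such a plot summarize.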
[Figure 6 plot: proportion of pruning errors (0-100) per language, broken down by correct POS tag: PRON, NOUN, DET, ADP, PRT, ADV, NUM, CONJ, ADJ, VERB, X and ".".]

Figure 6: Prevalence of pruning mistakes per POS tag, when pruning the inference search space with Wiktionary.

5.2 Wiktionary Pruning Mistakes

The error analysis by Li et al. (2012) showed that the tags licensed by Wiktionary are often valid. When using Wiktionary to prune the search space of our constrained models and to filter token-level projections, it is also important that correct tags are not mistakenly pruned because they are missing from Wiktionary. While the accuracy of filtering is more difficult to study, due to the lack of a gold standard tagging of the bitext, Figure 5 (position 0 on the x-axis) shows that search space pruning errors are not a major issue for most languages; on average the pruning accuracy is almost 95%. However, for some languages such as Chinese and Czech the correct tag is pruned from the search space for nearly 10% of all tokens. When using Wiktionary as a pruner, the upper bound on accuracy for these languages is therefore only around 90%. However, Figure 5 also shows that with some manual effort we might be able to remedy many of these errors. For example, by adding missing valid tags to the 250 most common word types in the worst language, the minimum pruning accuracy would rise above 95% from below 90%. If the same were to be done for all of the studied languages, the mean pruning accuracy would reach over 97%.

Figure 6 breaks down pruning errors resulting from incorrect or incomplete Wiktionary entries across the correct POS tags. From this we observe that, for many languages, the pruning errors are highly skewed towards specific tags. For example, for Czech over 80% of the pruning errors are caused by mistakenly pruned pronouns.
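As a rough illustration of the quantities discussed in this subsection, the sketch below computes a pruning accuracy and simulates the hypothetical correction of the k most frequent word types described for Figure 5. The function names and data format are hypothetical, and the sketch assumes that word types without a dictionary entry are left unconstrained, so that their gold tag is never pruned.

```python
from collections import Counter


def pruning_accuracy(tokens, dictionary):
    """Fraction of tokens whose gold tag survives dictionary pruning.

    tokens: list of (word, gold_tag) pairs from a gold-standard treebank.
    dictionary: dict mapping a word type to its set of licensed tags.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    kept = 0
    for word, gold in tokens:
        allowed = dictionary.get(word)
        # Assumption: a word without an entry keeps the full tag set.
        if allowed is None or gold in allowed:
            kept += 1
    return kept / len(tokens)


def correct_top_k(tokens, dictionary, k):
    """Copy of the dictionary in which the entries of the k most frequent
    word types are extended with every gold tag observed for them."""
    freq = Counter(word for word, _ in tokens)
    top = {word for word, _ in freq.most_common(k)}
    fixed = {word: set(tags) for word, tags in dictionary.items()}
    for word, gold in tokens:
        if word in top and word in fixed:
            fixed[word].add(gold)
    return fixed
```

Because a pruned gold tag can never be predicted, the value returned by `pruning_accuracy` is also an upper bound on the tagging accuracy of any model restricted to that search space. Evaluating `pruning_accuracy(tokens, correct_top_k(tokens, dictionary, k))` for increasing k would trace a curve of the kind shown in Figure 5, with k = 0 corresponding to the uncorrected dictionary.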
6 Conclusions

We considered the problem of constructing multilingual POS taggers for resource-poor languages. To this end, we explored a number of different models that combine token constraints with type constraints from different sources. The best results were obtained with a partially observed CRF model that effectively integrates these complementary constraints. In an extensive empirical study, we showed that this approach substantially improves on the state of the art in this context. Our best model significantly outperformed the second-best model on 10 out of 15 evaluated languages, when trained on identical datasets, with an insignificant difference on 3 languages. Compared to the prior state of the art (Li et al., 2012), we observed a relative reduction in error by 25%, averaged over the eight languages common to our studies.

Acknowledgments

We thank Alexander Rush for help with the hypergraph framework that was used to implement our models and Klaus Macherey for help with the bitext extraction. This work benefited from many discussions with Yoav Goldberg, Keith Hall, Kuzman Ganchev and Hao Zhang. We also thank the editor and the three anonymous reviewers for their valuable feedback. The first author is grateful for the financial support from the Swedish National Graduate School of Language Technology (GSLT).

References

Anne Abeillé, Lionel Clément, and François Toussenel. 2003. Building a Treebank for French. In A. Abeillé, editor, Treebanks: Building and Using Parsed Corpora, chapter 10. Kluwer.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of NAACL-HLT.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL.

Stanley F Chen. 2003. Conditional and joint models for grapheme-to-phoneme conversion. In Proceedings of Eurospeech.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceedings of EMNLP.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL-HLT.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39.

John DeNero and Klaus Macherey. 2011. Model-based aligner combination using dual decomposition. In Proceedings of ACL-HLT.

Brad Efron and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. Chapman & Hall, New York, New York, USA.

Victoria Fossum and Steven Abney. 2005. Automatically inducing a part-of-speech tagger by projecting from multiple source languages across aligned corpora. In Proceedings of IJCNLP.

Dan Garrette and Jason Baldridge. 2012. Type-supervised hidden markov models for part-of-speech tagging with incomplete tag dictionaries. In Proceedings of EMNLP-CoNLL.

Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM can find pretty good HMM POS-taggers (when given a good start). In Proceedings of ACL-HLT.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.

Shen Li, João Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of EMNLP-CoNLL.

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn treebank. Computational Linguistics, 19(2).

Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. Multilingual part-of-speech tagging: Two unsupervised approaches. JAIR, 36.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of EMNLP-CoNLL.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of LREC.

Sujith Ravi and Kevin Knight. 2009. Minimized models for unsupervised part-of-speech tagging. In Proceedings of ACL-IJCNLP.

Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell, III, and Mark Johnson. 2002. Parsing the wall street journal using a lexical-functional grammar and discriminative estimation techniques. In Proceedings of ACL.

Noah Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of ACL.

Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of NAACL-HLT.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL.

UN. 2006. ODS UN parallel corpus.

Jakob Uszkoreit and Thorsten Brants. 2008. Distributed word clustering for large scale class-based language modeling in machine translation. In Proceedings of ACL-HLT.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of COLING.

Chenhai Xi and Rebecca Hwa. 2005. A backoff model for bootstrapping resources for non-English languages. In Proceedings of HLT-EMNLP.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of NAACL.