Transactions of the Association for Computational Linguistics, 1 (2013) 1–12. Action Editor: Sharon Goldwater.
Submitted 11/2012; Revised 1/2013; Published 3/2013. © 2013 Association for Computational Linguistics.
Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging

Oscar Täckström◊†* Dipanjan Das‡ Slav Petrov‡ Ryan McDonald‡ Joakim Nivre†*
◊ Swedish Institute of Computer Science
† Department of Linguistics and Philology, Uppsala University
‡ Google Research, New York
oscar@sics.se {dipanjand|slav|ryanmcd}@google.com joakim.nivre@lingfil.uu.se

Abstract

We consider the construction of part-of-speech taggers for resource-poor languages. Recently, manually constructed tag dictionaries from Wiktionary and dictionaries projected via bitext have been used as type constraints to overcome the scarcity of annotated data in this setting. In this paper, we show that additional token constraints can be projected from a resource-rich source language to a resource-poor target language via word-aligned bitext. We present several models to this end; in particular a partially observed conditional random field model, where coupled token and type constraints provide a partial signal for training. Averaged across eight previously studied Indo-European languages, our model achieves a 25% relative error reduction over the prior state of the art. We further present successful results on seven additional languages from different families, empirically demonstrating the applicability of coupled token and type constraints across a diverse set of languages.

1 Introduction

Supervised part-of-speech (POS) taggers are available for more than twenty languages and achieve accuracies of around 95% on in-domain data (Petrov et al., 2012). Thanks to their efficiency and robustness, supervised taggers are routinely employed in many natural language processing applications, such as syntactic and semantic parsing, named-entity recognition and machine translation. Unfortunately, the resources required to train supervised taggers are expensive to create and unlikely to exist for the majority of written languages. The necessity of building NLP tools for these resource-poor languages has been part of the motivation for research on unsupervised learning of POS taggers (Christodoulopoulos et al., 2010). In this paper, we instead take a weakly supervised approach towards this problem.

Recently, learning POS taggers with type-level tag dictionary constraints has gained popularity. Tag dictionaries, noisily projected via word-aligned bitext, have bridged the gap between purely unsupervised and fully supervised taggers, resulting in an average accuracy of over 83% on a benchmark of eight Indo-European languages (Das and Petrov, 2011). Li et al. (2012) further improved upon this result by employing Wiktionary¹ as a tag dictionary source, resulting in the hitherto best published result of almost 85% on the same setup.

Although the aforementioned weakly supervised approaches have resulted in significant improvements over fully unsupervised approaches, they have not exploited the benefits of token-level cross-lingual projection methods, which are possible with word-aligned bitext between a target language of interest and a resource-rich source language, such as English. This is the setting we consider in this paper (§2). While prior work has successfully considered both token- and type-level projection across word-aligned bitext for estimating the model parameters of generative tagging models (Yarowsky and Ngai, 2001; Xi and Hwa, 2005, inter alia), a key observation underlying the present work is that token- and type-level information offer different and complementary signals. On the one hand, high-confidence token-level projections offer precise constraints on a tag in a particular context. On the other hand, manually cre-

* Work primarily carried out while at Google Research.
¹ http://www.wiktionary.org/.
ated type-level dictionaries can have broad coverage and do not suffer from word-alignment errors; they can therefore be used to filter systematic as well as random noise in token-level projections.

In order to reap these potential benefits, we propose a partially observed conditional random field (CRF) model (Lafferty et al., 2001) that couples token and type constraints in order to guide learning (§3). In essence, the model is given the freedom to push probability mass towards hypotheses consistent with both types of information. This approach is flexible: we can use either noisy projected or manually constructed dictionaries to generate type constraints; furthermore, we can incorporate arbitrary features over the input. In addition to standard (contextual) lexical features and transition features, we observe that adding features from a monolingual word clustering (Uszkoreit and Brants, 2008) can significantly improve accuracy. While most of these features can also be used in a generative feature-based hidden Markov model (HMM) (Berg-Kirkpatrick et al., 2010), we achieve the best accuracy with a globally normalized discriminative CRF model.

To evaluate our approach, we present extensive results on standard publicly available datasets for 15 languages: the eight Indo-European languages previously studied in this context by Das and Petrov (2011) and Li et al. (2012), and seven additional languages from different families, for which no comparable study exists. In §4 we compare various features, constraints and model types. Our best model uses type constraints derived from Wiktionary, together with token constraints derived from high-confidence word alignments. When averaged across the eight languages studied by Das and Petrov (2011) and Li et al. (2012), we achieve an accuracy of 88.8%. This is a 25% relative error reduction over the previous state of the art. Averaged across all 15 languages, our model obtains an accuracy of 84.5% compared to 78.5% obtained by a strong generative baseline. Finally, we provide an in-depth analysis of the relative contributions of the two types of constraints in §5.

2 Coupling Token and Type Constraints

Type-level information has been amply used in weakly supervised POS induction, either via purely manually crafted tag dictionaries (Smith and Eisner, 2005; Ravi and Knight,
[Figure 1: Lattice representation of the inference search space Y(x) for an authentic sentence in Swedish ("The farming products must be pure and must not contain any additives"), after pruning with Wiktionary type constraints. The correct parts of speech are listed underneath each word. Bold nodes show projected token constraints ỹ. Underlined text indicates incorrect tags. The coupled constraints lattice Ŷ(x, ỹ) consists of the bold nodes together with nodes for words that are lacking token constraints; in this case, the coupled constraints lattice thus defines exactly one valid path.]

inflected forms – both of which are difficult to obtain and unrealistic to expect for resource-poor languages. In contrast, Das and Petrov (2011) automatically create type-level tag dictionaries by aggregating over projected token-level information extracted from bitext. To handle the noise in these automatic dictionaries, they use label propagation on a similarity graph to smooth (and also expand) the label distributions. While their approach produces good results and is applicable to resource-poor languages, it requires a complex multi-stage training procedure including the construction of a large distributional similarity graph.

Recently, Li et al. (2012) presented a simple and viable alternative: crowdsourced dictionaries from Wiktionary. While noisy and sparse in nature, Wiktionary dictionaries are available for 170 languages.² Furthermore, their quality and coverage is growing continuously (Li et al., 2012). By incorporating type constraints from Wiktionary into the feature-based HMM of Berg-Kirkpatrick et al. (2010), Li et al. were able to obtain the best published results in this setting, surpassing the results of Das and Petrov (2011) on eight Indo-European languages.

² http://meta.wikimedia.org/wiki/Wiktionary — October 2012.

2.3 Coupled Constraints

Rather than relying exclusively on either token or type constraints, we propose to complement the one with the other during training. For each sentence in our training set, a partially constrained lattice of tag sequences is constructed as follows:

1. For each token whose type is not in the tag dictionary, we allow the entire tag set.
2. For each token whose type is in the tag dictionary, we prune all tags not licensed by the dictionary and mark the token as dictionary-pruned.
3. For each token that has a tag projected via a high-confidence bidirectional word alignment: if the projected tag is still present in the lattice, then we prune every tag but the projected tag for that token; if the projected tag is not present in the lattice, which can only happen for dictionary-pruned tokens, then we ignore the projected tag.

Figure 1 provides a running example. The lattice shows tags permitted after constraining the words to tags licensed by the dictionary (up until Step 2 from above). There is only a single token "Jordbruksprodukterna" ("the farming products") not in the dictionary; in this case the lattice permits the full set of tags. With token-level projections (Step 3; nodes with bold border in Figure 1), the lattice can be further pruned. In most cases, the projected tag is both correct and is in the dictionary-pruned lattice. We thus successfully disambiguate such tokens and shrink the search space substantially.

There are two cases we highlight in order to show where our model can break. First, for the token "Jordbruksprodukterna", the erroneously projected tag ADJ will eliminate all other tags from the lattice, including the correct tag NOUN. Second, the token "några" ("any") has a single dictionary entry PRON and is missing the correct tag DET. In the case where
DET is the projected tag, we will not add it to the lattice and simply ignore it. This is because we hypothesize that the tag dictionary can be trusted more than the tags projected via noisy word alignments. As we will see in §4, taking the union of tags performs worse, which supports this hypothesis.

For generative models, such as HMMs (§3.1), we need to define only one lattice. For our best generative model this is the coupled token- and type-constrained lattice.³ At prediction time, in both the discriminative and the generative cases, we find the most likely label sequence using Viterbi decoding.

For discriminative models, such as CRFs (§3.2), we need to define two lattices: one that the model moves probability mass towards, and another one defining the overall search space (or partition function). In traditional supervised learning without a dictionary, the former is a trivial lattice containing the gold standard tag sequence and the latter is the set of all possible tag sequences spanning the tokens. With our best model, we will move mass towards the coupled token- and type-constrained lattice, such that the model can freely distribute mass across all paths consistent with these constraints. The lattice defining the partition function will be the full set of possible tag sequences when no dictionary is used; when a dictionary is used it will consist of all dictionary-pruned tag sequences (sans Step 3 above; the full set of possibilities shown in Figure 1 for our running example).

Figures 2 and 3 provide statistics regarding the supervision coverage and remaining ambiguity. Figure 2 shows that more than two thirds of all tokens in our training data are in Wiktionary. However, there is considerable variation between languages: Spanish has the highest coverage with over 90%, while Turkish, an agglutinative language with a vast number of word forms, has less than 50% coverage. Figure 3 shows that there is substantial uncertainty left after pruning with Wiktionary, since tokens are rarely fully disambiguated: 1.3 tags per token are allowed on average for types in Wiktionary.

Figure 2 further shows that high-confidence alignments are available for about half of the tokens for most languages (Japanese is a notable exception with less than 30% of the tokens covered). Intersecting the Wiktionary tags and the projected tags (Steps 2 and 3 above) filters out some of the potentially erroneous tags, but preserves the majority of the projected tags; the remaining, presumably more accurate projected tags cover almost half of all tokens, greatly reducing the search space that the learner needs to explore.

³ Other training methods exist as well, for example, contrastive estimation (Smith and Eisner, 2005).

[Figure 2: Wiktionary and projection dictionary coverage. Shown is the percentage of tokens in the target side of the bitext that are covered by Wiktionary, that have a projected tag, and that have a projected tag after intersecting the two.]

[Figure 3: Average number of licensed tags per token on the target side of the bitext, for types in Wiktionary.]

3 Models with Coupled Constraints

We now formally present how we couple token and type constraints and how we use these coupled constraints to train probabilistic tagging models. Let x = (x_1 x_2 … x_|x|) ∈ X denote a sentence, where each token x_i ∈ V is an instance of a word type from the vocabulary V, and let y = (y_1 y_2 … y_|x|) ∈ Y denote a tag sequence, where y_i ∈ T is the tag assigned to token x_i and T denotes the set of all possible part-of-speech tags. We denote the lattice of all admissible tag sequences for the sentence x by Y(x). This is the
inference search space in which the tagger operates. As we shall see, it is crucial to constrain the size of this lattice in order to simplify learning when only incomplete supervision is available.

A tag dictionary maps a word type x_j ∈ V to a set of admissible tags T(x_j) ⊆ T. For word types not in the dictionary we allow the full set of tags T (while possible, in this paper we do not attempt to distinguish closed-class versus open-class words). When provided with a tag dictionary, the lattice of admissible tag sequences for a sentence x is Y(x) = T(x_1) × T(x_2) × … × T(x_|x|). When no tag dictionary is available, we simply have the full lattice Y(x) = T^|x|.

Let ỹ = (ỹ_1 ỹ_2 … ỹ_|x|) be the projected tags for the sentence x. Note that {ỹ_i} = ∅ for tokens without a projected tag. Next, we define a piecewise operator ∨ that couples ỹ and Y(x) with respect to every sentence index, which results in a token- and type-constrained lattice. The operator behaves as follows, coherent with the high-level description in §2.3:

    T̂(x_i, ỹ_i) = ỹ_i ∨ T(x_i) = { {ỹ_i}    if ỹ_i ∈ T(x_i)
                                  { T(x_i)   otherwise.

We denote the token- and type-constrained lattice as Ŷ(x, ỹ) = T̂(x_1, ỹ_1) × T̂(x_2, ỹ_2) × … × T̂(x_|x|, ỹ_|x|). Note that when token-level projections are not used, the dictionary-pruned lattice and the lattice with coupled constraints are identical, that is Ŷ(x, ỹ) = Y(x).

3.1 HMMs with Coupled Constraints

A first-order hidden Markov model (HMM) specifies the joint distribution of a sentence x ∈ X and a tag sequence y ∈ Y(x) as:

    p_β(x, y) = ∏_{i=1}^{|x|} p_β(x_i | y_i) · p_β(y_i | y_{i−1}),

where the first factor is the emission distribution and the second the transition distribution. We follow the recent trend of using a log-linear parametrization of the emission and the transition distributions, instead of a multinomial parametrization (Chen, 2003). This allows model parameters β to be shared across categorical events, which has been shown to give superior performance (Berg-Kirkpatrick et al., 2010). The categorical emission and transition events are represented by feature vectors φ(x_i, y_i) and φ(y_i, y_{i−1}). Each element of the parameter vector β corresponds to a particular feature; the component log-linear distributions are:

    p_β(x_i | y_i) = exp(β^⊤ φ(x_i, y_i)) / Σ_{x'_i ∈ V} exp(β^⊤ φ(x'_i, y_i)),  and
    p_β(y_i | y_{i−1}) = exp(β^⊤ φ(y_i, y_{i−1})) / Σ_{y'_i ∈ T} exp(β^⊤ φ(y'_i, y_{i−1})).

In maximum-likelihood estimation of the parameters, we seek to maximize the likelihood of the observed parts of the data. For this we need the joint marginal distribution p_β(x, Ŷ(x, ỹ)) of a sentence x and its coupled constraints lattice Ŷ(x, ỹ), which is obtained by marginalizing over all consistent outputs:

    p_β(x, Ŷ(x, ỹ)) = Σ_{y ∈ Ŷ(x, ỹ)} p_β(x, y).

If there are no projections and no tag dictionary, then Ŷ(x, ỹ) = T^|x|, and thus p_β(x, Ŷ(x, ỹ)) = p_β(x), which reduces to fully unsupervised learning. The ℓ2-regularized marginal joint log-likelihood of the constrained training data D = {(x^(i), ỹ^(i))}_{i=1}^n is:

    L(β; D) = Σ_{i=1}^n log p_β(x^(i), Ŷ(x^(i), ỹ^(i))) − γ‖β‖²₂ .    (1)

We follow Berg-Kirkpatrick et al. (2010) and take a direct gradient approach for optimizing Eq. 1 with L-BFGS (Liu and Nocedal, 1989). We set γ = 1 and run 100 iterations of L-BFGS. One could also employ the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) to optimize this objective, although the relative merits of EM versus direct gradient training for these models is still a topic of debate (Berg-Kirkpatrick et al., 2010; Li et al., 2012).⁴ Note that since the marginal likelihood is non-concave, we are only guaranteed to find a local maximum of Eq. 1.

After estimating the model parameters β, the tag sequence y* ∈ Y(x) for a sentence x ∈ X is predicted by choosing the one with maximal joint probability: y* ← argmax_{y ∈ Y(x)} p_β(x, y).

⁴ We trained the HMM with EM as well, but achieved better results with direct gradient training and hence omit those results.
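To make the coupled-constraints machinery concrete, the sketch below builds the per-token lattice of §2.3 (the T̂(x_i, ỹ_i) operator) and computes the log of the constrained marginal Σ_{y ∈ Ŷ(x, ỹ)} p(x, y) with a forward pass restricted to the allowed tags. This is an illustrative sketch under simplifying assumptions, not the authors' implementation; the toy tag set and the scoring functions passed in by the caller are hypothetical.

```python
import math

FULL_TAGSET = ["NOUN", "VERB", "ADJ", "DET", "PRON"]  # toy subset of the universal tags

def _logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def coupled_lattice(tokens, tag_dict, projected):
    """Steps 1-3 of Sec. 2.3. tag_dict maps word types to licensed tags (types may
    be absent); projected maps token indices to projected tags (partial)."""
    lattice = []
    for i, tok in enumerate(tokens):
        allowed = set(tag_dict.get(tok, FULL_TAGSET))   # Steps 1-2: full set or dictionary-pruned
        proj = projected.get(i)
        if proj is not None and proj in allowed:        # Step 3: keep the projection only if
            allowed = {proj}                            # the dictionary licenses it
        lattice.append(allowed)
    return lattice

def log_marginal(tokens, lattice, log_emit, log_trans):
    """log of the sum over all tag paths through the lattice of
    prod_i p(x_i | y_i) p(y_i | y_{i-1}); a uniform initial state is assumed."""
    alpha = {t: log_emit(tokens[0], t) for t in lattice[0]}
    for i in range(1, len(tokens)):
        alpha = {t: log_emit(tokens[i], t) +
                    _logsumexp([alpha[s] + log_trans(s, t) for s in alpha])
                 for t in lattice[i]}
    return _logsumexp(list(alpha.values()))
```

With a dictionary entry pinning the first token to NOUN and a licensed projection of VERB on the second, the lattice collapses to a single path, and the marginal reduces to the probability of that one tag sequence.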
3.2 CRFs with Coupled Constraints

Whereas an HMM models the joint probability of the input x ∈ X and output y ∈ Y(x), using locally normalized component distributions, a conditional random field (CRF) instead models the probability of the output conditioned on the input as a globally normalized log-linear distribution (Lafferty et al., 2001):

    p_θ(y | x) = exp(θ^⊤ Φ(x, y)) / Σ_{y' ∈ Y(x)} exp(θ^⊤ Φ(x, y')),

where θ is a parameter vector. As for the HMM, Y(x) is not necessarily the full space of possible tag sequences; specifically, for us, it is the dictionary-pruned lattice without the token constraints. With a first-order Markov assumption, the feature function factors as:

    Φ(x, y) = Σ_{i=1}^{|x|} φ(x, y_i, y_{i−1}).

This model is more powerful than the HMM in that it can use richer feature definitions, such as joint input/transition features and features over a wider input context. We model a marginal conditional probability, given by the total probability of all tag sequences consistent with the lattice Ŷ(x, ỹ):

    p_θ(Ŷ(x, ỹ) | x) = Σ_{y ∈ Ŷ(x, ỹ)} p_θ(y | x).

The parameters of this constrained CRF are estimated by maximizing the ℓ2-regularized marginal conditional log-likelihood of the constrained data (Riezler et al., 2002):

    L(θ; D) = Σ_{i=1}^n log p_θ(Ŷ(x^(i), ỹ^(i)) | x^(i)) − γ‖θ‖²₂ .    (2)

As with Eq. 1, we maximize Eq. 2 with 100 iterations of L-BFGS and set γ = 1. In contrast to the HMM, after estimating the model parameters θ, the tag sequence y* ∈ Y(x) for a sentence x ∈ X is chosen as the sequence with the maximal conditional probability: y* ← argmax_{y ∈ Y(x)} p_θ(y | x).

4 Empirical Study

We now present a detailed empirical study of the models proposed in the previous sections. In addition to comparing with the state of the art in Das and Petrov (2011) and Li et al. (2012), we present models with several combinations of token and type constraints, and additional features incorporating word clusters. Both generative and discriminative models are explored.

4.1 Experimental Setup

Before delving into the experimental details, we present our setup and datasets.

Languages. We evaluate on eight target languages used in previous work (Das and Petrov, 2011; Li et al., 2012) and on seven additional languages (see Table 1). While the former eight languages all belong to the Indo-European family, we broaden the coverage to language families more distant from the source language (for example, Chinese, Japanese and Turkish). We use the treebanks from the CoNLL shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007) for evaluation.⁵ The two-letter abbreviations from the ISO 639-1 standard are used when referring to these languages in tables and figures.

Tagset. In all cases, we map the language-specific POS tags to universal POS tags using the mapping of Petrov et al. (2012).⁶ Since we use indirect supervision via projected tags or Wiktionary, the model states induced by all models correspond directly to POS tags, enabling us to compute tagging accuracy without a greedy 1-to-1 or many-to-1 mapping.

Bitext. For all experiments, we use English as the source language. Depending on availability, there are between 1M and 5M parallel sentences for each language. The majority of the parallel data is gathered automatically from the web using the method of Uszkoreit et al. (2010). We further include data from Europarl (Koehn, 2005) and from the UN parallel corpus (UN, 2006), for languages covered by these corpora. The English side of the bitext is POS tagged with a standard supervised CRF tagger, trained on the Penn Treebank (Marcus et al., 1993), with tags mapped to universal tags. The parallel sen-

⁵ For French we use the treebank of Abeillé et al. (2003).
⁶ We use version 1.03 of the mappings available at http://code.google.com/p/universal-pos-tags/.
tences are word aligned with the aligner of DeNero and Macherey (2011). Intersected high-confidence alignments (confidence > 0.95) are extracted and aggregated into projected type-level dictionaries. For purely practical reasons, the training data with token-level projections is created by randomly sampling target-side sentences with a total of 500K tokens.

Wiktionary. We use a snapshot of the Wiktionary word definitions, and follow the heuristics of Li et al. (2012) for creating the Wiktionary dictionary by mapping the Wiktionary tags to universal POS tags.⁷

Features. For all models, we use only an identity feature for tag-pair transitions. We use five features that couple the current tag and the observed word (analogous to the emission in an HMM): word identity, suffixes of up to length 3, and three indicator features that fire when the word starts with a capital letter, contains a hyphen or contains a digit. These are the same features as those used by Das and Petrov (2011). Finally, for some models we add a word cluster feature that couples the current tag and the word cluster identity of the word. These (monolingual) word clusters are induced with the exchange algorithm (Uszkoreit and Brants, 2008). We set the number of clusters to 256 across all languages, as this has previously been shown to produce robust results for similar tasks (Turian et al., 2010; Täckström et al., 2012). The clusters for each language are learned on a large monolingual newswire corpus.

4.2 Models with Type Constraints

To examine the sole effect of type constraints, we experiment with the HMM, drawing constraints from three different dictionaries. Table 1 compares the performance of our models with the best results of Das and Petrov (2011, D&P) and Li et al. (2012, LG&T). As in previous work, training is done exclusively on the training portion of each treebank, stripped of any manual linguistic annotation.

We first use all of our parallel data to generate projected tag dictionaries: the English POS tags are projected across word alignments and aggregated to tag distributions for each word type. As in Das and Petrov (2011), the distributions are then filtered with a threshold of 0.2 to remove noisy tags and to create an unweighted tag dictionary. We call this model YHMM_proj.; its average accuracy of 84.9% on the eight languages is higher than the 83.4% of D&P and on par with LG&T (84.8%).⁸

Our next model (YHMM_wik.) simply draws type constraints from Wiktionary. It slightly underperforms LG&T (83.0%), presumably because they used a second-order HMM. As a simple extension to these two models, we take the union of the projected dictionary and Wiktionary to constrain an HMM, which we name YHMM_union. This model performs a little worse on the eight Indo-European languages (84.5%), but gives an improvement over the projected dictionary when evaluated across all 15 languages (80.0% vs. 78.5%).

Table 1: Tagging accuracies for type-constrained HMM models. D&P is the "With LP" model in Table 2 of Das and Petrov (2011), while LG&T is the "SHMM-ME" model in Table 2 of Li et al. (2012). YHMM_proj., YHMM_wik. and YHMM_union are HMMs trained solely with type constraints derived from the projected dictionary, Wiktionary and the union of these dictionaries, respectively. YHMM_union+C is equivalent to YHMM_union with additional cluster features. All models are trained on the treebank of each language, stripped of gold labels. Results are averaged over the 8 languages from Das and Petrov (2011), denoted avg(8), as well as over the full set of 15 languages, denoted avg.

Lang.   D&P    LG&T   YHMM_proj.  YHMM_wik.  YHMM_union  YHMM_union+C
bg      –      –      84.2        68.1       87.2        87.9
cs      –      –      75.4        70.2       75.4        79.2
da      83.2   83.3   87.7        82.0       78.4        89.5
de      82.8   85.8   86.6        85.1       80.0        88.3
el      82.5   79.2   83.3        83.8       86.0        83.2
es      84.2   86.4   83.9        83.7       88.3        87.3
fr      –      –      88.4        75.7       75.6        86.6
it      86.8   86.5   89.0        85.4       89.9        90.6
ja      –      –      45.2        76.9       74.4        73.7
nl      79.5   86.3   81.7        79.1       83.8        82.7
pt      87.9   84.5   86.7        79.0       83.8        90.4
sl      –      –      78.7        64.8       82.8        83.4
sv      80.5   86.1   80.6        85.9       85.9        86.7
tr      –      –      66.2        44.1       65.1        65.7
zh      –      –      59.2        73.9       63.2        73.0
avg(8)  83.4   84.8   84.9        83.0       84.5        87.3
avg     –      –      78.5        75.9       80.0        83.2

⁷ The definitions were downloaded on August 31, 2012 from http://toolserver.org/~enwikt/definitions/. This snapshot is more recent than that used by Li et al.
⁸ Our model corresponds to the weaker, "No LP" projection of Das and Petrov (2011). We found that label propagation was only beneficial when small amounts of bitext were available.
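The projected-dictionary construction described above (aggregate projected tags into a per-type tag distribution, then filter at a 0.2 threshold to obtain an unweighted dictionary) can be sketched as follows. This is a hedged sketch: the exact normalization and boundary handling used by Das and Petrov (2011) are not specified here, and the example words and counts are hypothetical.

```python
from collections import Counter, defaultdict

def build_projected_dictionary(aligned_pairs, threshold=0.2):
    """Aggregate projected (target_word, tag) observations from high-confidence
    alignments into per-type relative frequencies, keeping only tags whose
    relative frequency reaches the threshold. Returns word type -> set of tags."""
    counts = defaultdict(Counter)
    for word, tag in aligned_pairs:
        counts[word][tag] += 1
    dictionary = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        # unweighted tag set after thresholding the tag distribution
        dictionary[word] = {t for t, c in tag_counts.items() if c / total >= threshold}
    return dictionary
```

For instance, a word type observed 8 times with NOUN and once each with VERB and ADJ keeps only NOUN, since the two minority tags fall below the 0.2 relative-frequency threshold.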
Table 2: Tagging accuracies for models with token constraints and coupled token and type constraints. All models use cluster features (…+C) and are trained on large training sets each containing 500K tokens with (partial) token-level projections (…+L). The best type-constrained model, trained on the larger datasets, YHMM_union+C+L, is included for comparison. The remaining columns correspond to HMM and CRF models trained only with token constraints (ỹ…) and with coupled token and type constraints (Ŷ…). The latter are trained using the projected dictionary (·proj.), Wiktionary (·wik.) and the union of these dictionaries (·union), respectively; all Ŷ… columns also use +C+L. The search spaces of the models trained with coupled constraints (Ŷ…) are each pruned with the respective tag dictionary used to derive the coupled constraints. The observed difference between ŶCRF_wik.+C+L and YHMM_union+C+L is statistically significant at p < 0.01 (**) and p < 0.015 (*) according to a paired bootstrap test (Efron and Tibshirani, 1993). Significance was not assessed for avg or avg(8).

Lang.   YHMM_union+C+L  ỹHMM+C+L  ỹCRF+C+L  ŶHMM_proj.  ŶHMM_wik.  ŶHMM_union  ŶCRF_proj.  ŶCRF_wik.  ŶCRF_union
bg      87.7            77.9      84.1      84.5        83.9       86.7        86.0        87.8       85.4
cs      78.3            65.4      74.9      74.8        81.1       76.9        74.7        80.3**     75.0
da      87.3            80.9      85.1      87.2        85.6       88.1        85.5        88.2*      86.0
de      87.7            81.4      83.3      85.0        89.3       86.7        84.4        90.5**     85.5
el      85.9            81.1      77.8      80.1        87.0       83.9        79.6        89.5**     79.7
es      89.1**          84.1      85.5      83.7        85.9       88.0        85.7        87.1       86.0
fr      88.4**          83.5      84.7      85.9        86.4       87.4        84.9        87.2       85.6
it      89.6            85.2      88.5      88.7        87.6       89.8        88.3        89.3       89.4
ja      72.8            47.6      54.2      43.2        76.1       70.5        44.9        81.0**     68.0
nl      83.1            78.4      82.4      82.3        84.2       83.2        83.1        85.9**     83.2
pt      89.1            84.7      87.0      86.6        88.7       88.0        87.9        91.0**     88.3
sl      82.4            69.8      78.2      78.5        81.8       80.1        79.7        82.3       80.0
sv      86.1            80.1      84.2      82.3        87.9       86.9        84.4        88.9**     85.5
tr      62.4            58.1      64.5      64.6        61.8       64.8        65.0        64.1**     65.2
zh      72.6            52.7      39.5      56.0        74.1       73.3        59.7        74.4**     73.4
avg(8)  87.2            82.0      84.2      84.5        87.0       86.8        84.9        88.8       85.4
avg     82.8            74.1      76.9      77.6        82.8       82.3        78.2        84.5       81.1

We next add monolingual cluster features to the model with the union dictionary. This model, YHMM_union+C, significantly outperforms all other type-constrained models, demonstrating the utility of word-cluster features.⁹ For further exploration, we train the same model on the datasets containing 500K tokens sampled from the target side of the parallel data (YHMM_union+C+L); this is done to explore the effects of large data during training. We find that training on these datasets results in an average accuracy of 87.2%, which is comparable to the 87.3% reported for YHMM_union+C in Table 1. This shows that the different source domain and amount of training data does not influence the performance of the HMM significantly.

Finally, we train CRF models where we treat type constraints as a partially observed lattice and use the full unpruned lattice for computing the partition function (§3.2). Due to space considerations, the results of these experiments are not shown in Table 1. We observe similar trends in these results, but on average, accuracies are much lower compared to the type-constrained HMM models; the CRF model with the union dictionary along with cluster features achieves an average accuracy of 79.3% when trained on the same data. This result is not surprising. First, the CRF's search space is fully unconstrained. Second, the dictionary only provides a weak set of observation constraints, which do not provide sufficient information to successfully train a discriminative model. However, as we will observe next, coupling the dictionary constraints with token-level information solves this problem.

⁹ These are monolingual clusters. Bilingual clusters as introduced in Täckström et al. (2012) might bring additional benefits.

4.3 Models with Token and Type Constraints

We now proceed to add token-level information, focusing in particular on coupled token and type
constraints. Since it is not possible to generate projected token constraints for our monolingual treebanks, we train all models in this subsection on the 500K-token datasets sampled from the bitext. As a baseline, we first train HMM and CRF models that use only projected token constraints (ỹHMM+C+L and ỹCRF+C+L). As shown in Table 2, these models underperform the best type-level model (YHMM_union+C+L),¹⁰ which confirms that projected token constraints are not reliable on their own. This is in line with similar projection models previously examined by Das and Petrov (2011).

We then study models with coupled token and type constraints. These models use the same three dictionaries as used in §4.2, but additionally couple the derived type constraints with projected token constraints; see the caption of Table 2 for a list of these models. Note that since we only allow projected tags that are licensed by the dictionary (Step 3 of the transfer, §2.3), the actual token constraints used in these models vary with the different dictionaries.

From Table 2, we see that coupled constraints are superior to token constraints, when used both with the HMM and the CRF. However, for the HMM, coupled constraints do not provide any benefit over type constraints alone, in particular when the projected dictionary or the union dictionary is used to derive the coupled constraints (ŶHMM_proj.+C+L and ŶHMM_union+C+L). We hypothesize that this is because these dictionaries (in particular the former) have the same bias as the token-level tag projections, so that the dictionary is unable to correct the systematic errors in the projections (see §2.1). Since the token constraints are stronger than the type constraints in the coupled models, this bias may have a substantial impact. With the Wiktionary dictionary, the difference between the type-constrained and the coupled-constrained HMM is negligible: YHMM_union+C+L and ŶHMM_wik.+C+L both average at an accuracy of 82.8%.

The CRF model, on the other hand, is able to take advantage of the complementary information in the coupled constraints, provided that the dictionary is able to filter out the systematic token-level errors. With a dictionary derived from Wiktionary and projected token-level constraints, ŶCRF_wik.+C+L performs better than all the remaining models, with an average accuracy of 88.8% across the eight Indo-European languages available to D&P and LG&T. Averaged over all 15 languages, its accuracy is 84.5%.

[Figure 4: Relative influence of token and type constraints on tagging accuracy in the ŶCRF_wik.+C+L model. Word types are categorized according to a) their number of Wiktionary tags (0, 1, 2 or 3+ tags, with 0 representing no Wiktionary entry; top axis) and b) the number of times they are token-constrained in the training set (divided into buckets of 0, 1-9, 10-99 and 100+ occurrences; x-axis). The boxes summarize the accuracy distributions across languages for each word type category as defined by a) and b). The horizontal line in each box marks the median accuracy, the top and bottom mark the first and third quantile, respectively, while the whiskers mark the minimum and maximum values of the accuracy distribution.]

5 Further Analysis

In this section we provide a detailed analysis of the impact of token versus type constraints and we study the pruning and filtering mistakes resulting from incomplete Wiktionary entries in detail. This analysis is based on the training portion of each treebank.

5.1 Influence of Token and Type Constraints

The empirical success of the model trained with coupled token and type constraints confirms that these constraints indeed provide complementary signals. Figure 4 provides a more detailed view of the relative benefits of each type of constraint. We observe several interesting trends.

First, word types that occur with more token constraints during training are generally tagged more accurately, regardless of whether these types occur

¹⁰ To make the comparison fair vis-à-vis potential divergences in training domains, we compare to the best type-constrained model trained on the same 500K-token training sets.
in Wiktionary. The most common scenario is for a word type to have exactly one tag in Wiktionary and to occur with this projected tag over 100 times in the training set (facet 1, rightmost box). These common word types are typically tagged very accurately across all languages.

Second, the word types that are ambiguous according to Wiktionary (facets 2 and 3) are predominantly frequent ones. The accuracy is typically lower for these words compared to the unambiguous words. However, as the number of projected token constraints is increased from zero to 100+ observations, the ambiguous words are effectively disambiguated by the token constraints. This shows the advantage of intersecting token and type constraints.

Finally, projection generally helps for words that are not in Wiktionary, although the accuracy for these words never reaches the accuracy of the words with only one tag in Wiktionary. Interestingly, words that occur with a projected tag constraint fewer than 100 times are tagged more accurately for types not in the dictionary compared to ambiguous word types with the same number of projected constraints. A possible explanation for this is that the ambiguous words are inherently more difficult to predict and that most of the words that are not in Wiktionary are less common words that tend to also be less ambiguous.

Figure 5: Average pruning accuracy (line) across languages (dots) as a function of the number of hypothetically corrected Wiktionary entries for the k most frequent word types. For example, position 100 on the x-axis corresponds to manually correcting the entries for the 100 most frequent types, while position 0 corresponds to experimental conditions.

Figure 6: Prevalence of pruning mistakes per POS tag, when pruning the inference search space with Wiktionary.

5.2 Wiktionary Pruning Mistakes

The error analysis by Li et al. (2012) showed that the tags licensed by Wiktionary are often valid. When using Wiktionary to prune the search space of our constrained models and to filter token-level projections, it is also important that correct tags are not mistakenly pruned because they are missing from Wiktionary. While the accuracy of filtering is more difficult to study, due to the lack of a gold standard tagging of the bitext, Figure 5 (position 0 on the x-axis) shows that search space pruning errors are not a major issue for most languages; on average the pruning accuracy is almost 95%. However, for some languages such as Chinese and Czech the correct tag is pruned from the search space for nearly 10% of all tokens. When using Wiktionary as a pruner, the upper bound on accuracy for these languages is therefore only around 90%. However, Figure 5 also shows that with some manual effort we might be able to remedy many of these errors. For example, by adding missing valid tags to the 250 most common word types in the worst language, the minimum pruning accuracy would rise above 95% from below 90%. If the same were done for all of the studied languages, the mean pruning accuracy would reach over 97%.

Figure 6 breaks down pruning errors resulting from incorrect or incomplete Wiktionary entries across the correct POS tags. From this we observe that, for many languages, the pruning errors are highly skewed towards specific tags. For example, for Czech over 80% of the pruning errors are caused by mistakenly pruned pronouns.
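The pruning scheme and the pruning-accuracy measure described above can be sketched in a few lines. This is a minimal illustration with a hypothetical toy dictionary over the universal tag set; the names `wiktionary`, `allowed_tags`, and `pruning_accuracy` are illustrative, not the paper's actual implementation or data:

```python
# Sketch of Wiktionary-style search-space pruning (toy data). A word
# type with a dictionary entry is restricted to its listed tags; types
# without an entry keep the full tag set.
FULL_TAGSET = {"NOUN", "VERB", "PRON", "DET", "ADJ", "ADP", "ADV",
               "NUM", "CONJ", "PRT", "X", "."}

# Hypothetical type-level dictionary: word type -> licensed tags.
wiktionary = {
    "the": {"DET"},
    "run": {"NOUN", "VERB"},
    "her": {"PRON"},  # entry missing the DET reading -> pruning error
}

def allowed_tags(word):
    """Tags kept in the search space for one token of `word`."""
    return wiktionary.get(word, FULL_TAGSET)

def pruning_accuracy(tokens):
    """Fraction of (word, gold_tag) tokens whose gold tag survives pruning."""
    kept = sum(1 for word, gold in tokens if gold in allowed_tags(word))
    return kept / len(tokens)

# Toy evaluation: "her" used as a determiner is mistakenly pruned,
# since the incomplete entry licenses only PRON.
sentence = [("the", "DET"), ("run", "NOUN"), ("her", "DET"), ("ends", "VERB")]
print(pruning_accuracy(sentence))  # → 0.75
```

The incomplete entry for "her" mirrors the kind of mistake tallied in Figure 6: the gold tag is removed from the search space, so no tagger constrained by this dictionary can recover it, which is exactly why pruning accuracy upper-bounds tagging accuracy.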
6 Conclusions

We considered the problem of constructing multilingual POS taggers for resource-poor languages. To this end, we explored a number of different models that combine token constraints with type constraints from different sources. The best results were obtained with a partially observed CRF model that effectively integrates these complementary constraints. In an extensive empirical study, we showed that this approach substantially improves on the state of the art in this context. Our best model significantly outperformed the second-best model on 10 out of 15 evaluated languages, when trained on identical data sets, with an insignificant difference on 3 languages. Compared to the prior state of the art (Li et al., 2012), we observed a relative reduction in error by 25%, averaged over the eight languages common to our studies.

Acknowledgments

We thank Alexander Rush for help with the hypergraph framework that was used to implement our models and Klaus Macherey for help with the bitext extraction. This work benefited from many discussions with Yoav Goldberg, Keith Hall, Kuzman Ganchev and Hao Zhang. We also thank the editor and the three anonymous reviewers for their valuable feedback. The first author is grateful for the financial support from the Swedish National Graduate School of Language Technology (GSLT).

References

Anne Abeillé, Lionel Clément, and François Toussenel. 2003. Building a treebank for French. In A. Abeillé, editor, Treebanks: Building and Using Parsed Corpora, chapter 10. Kluwer.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of NAACL-HLT.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL.

Stanley F. Chen. 2003. Conditional and joint models for grapheme-to-phoneme conversion. In Proceedings of Eurospeech.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceedings of EMNLP.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL-HLT.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39.

John DeNero and Klaus Macherey. 2011. Model-based aligner combination using dual decomposition. In Proceedings of ACL-HLT.

Brad Efron and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. Chapman & Hall, New York, NY, USA.

Victoria Fossum and Steven Abney. 2005. Automatically inducing a part-of-speech tagger by projecting from multiple source languages across aligned corpora. In Proceedings of IJCNLP.

Dan Garrette and Jason Baldridge. 2012. Type-supervised hidden Markov models for part-of-speech tagging with incomplete tag dictionaries. In Proceedings of EMNLP-CoNLL.

Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM can find pretty good HMM POS-taggers (when given a good start). In Proceedings of ACL-HLT.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.

Shen Li, João Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of EMNLP-CoNLL.

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn treebank. Computational Linguistics, 19(2).

Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. Multilingual part-of-speech tagging: Two unsupervised approaches. JAIR, 36.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of EMNLP-CoNLL.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of LREC.

Sujith Ravi and Kevin Knight. 2009. Minimized models for unsupervised part-of-speech tagging. In Proceedings of ACL-IJCNLP.
Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell, III, and Mark Johnson. 2002. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proceedings of ACL.

Noah Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of ACL.

Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of NAACL-HLT.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL.

UN. 2006. ODS UN parallel corpus.

Jakob Uszkoreit and Thorsten Brants. 2008. Distributed word clustering for large scale class-based language modeling in machine translation. In Proceedings of ACL-HLT.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of COLING.

Chenhai Xi and Rebecca Hwa. 2005. A backoff model for bootstrapping resources for non-English languages. In Proceedings of HLT-EMNLP.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of NAACL.