Transactions of the Association for Computational Linguistics, vol. 4, pp. 245–257, 2016. Action Editor: Hinrich Schütze.
Submission batch: 1/2016; Revision batch: 3/2016; Published 6/2016.
©2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models

Karl Stratos, Michael Collins* and Daniel Hsu
Department of Computer Science, Columbia University
{stratos,mcollins,djhsu}@cs.columbia.edu

*Currently on leave at Google Inc. New York.

Abstract

We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call anchor HMMs, assume that each tag is associated with at least one word that can have no other tag, which is a relatively benign condition for POS tagging (e.g., "the" is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong baselines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg-Kirkpatrick et al. (2010). Furthermore, it produces an interpretable model in which hidden states are automatically lexicalized by words.

1 Introduction

Part-of-speech (POS) tagging without supervision is a quintessential problem in unsupervised learning for natural language processing (NLP). A major application of this task is reducing annotation cost: for example, it can be used to produce rough syntactic annotations for a new language that has no labeled data, which can be subsequently refined by human annotators.

Hidden Markov models (HMMs) are a natural choice of model and have been a workhorse for this problem. Early works estimated vanilla HMMs with standard unsupervised learning methods such as the expectation-maximization (EM) algorithm, but it quickly became clear that they performed very poorly in inducing POS tags (Merialdo, 1994). Later works improved upon vanilla HMMs by incorporating specific structures that are well-suited for the task, such as a sparse prior (Johnson, 2007) or a hard-clustering assumption (Brown et al., 1992).

In this work, we tackle unsupervised POS tagging with HMMs whose structure is deliberately suitable for POS tagging. These HMMs impose an assumption that each hidden state is associated with an observation state ("anchor word") that can appear under no other state. For this reason, we denote this class of restricted HMMs by anchor HMMs. Such an assumption is relatively benign for POS tagging; it is reasonable to assume that each POS tag has at least one word that occurs only under that tag. For example, in English, "the" is an anchor word for the determiner tag; "laughed" is an anchor word for the verb tag.

We build on the non-negative matrix factorization (NMF) framework of Arora et al. (2013) to derive a consistent estimator for anchor HMMs. We make several new contributions in the process. First, to our knowledge, there is no previous work directly building on this framework to address unsupervised sequence labeling. Second, we generalize the NMF-based learning algorithm to obtain extensions that are important for empirical performance (Table 1). Third, we perform extensive experiments on unsupervised POS tagging and report competitive results against strong baselines such as the clustering method of Brown et al. (1992) and the log-linear
model of Berg-Kirkpatrick et al. (2010).

One characteristic of the approach is the immediate interpretability of inferred hidden states. Because each hidden state is associated with an observation, we can examine the set of such anchor observations to qualitatively evaluate the learned model. In our experiments on POS tagging, we find that anchor observations correspond to possible POS tags across different languages (Table 3). This property can be useful when we wish to develop a tagger for a new language that has no labeled data; we can label only the anchor words to achieve a complete labeling of the data.

This paper is structured as follows. In Section 2, we establish the notation we use throughout. In Section 3, we define the model family of anchor HMMs. In Section 4, we derive a matrix decomposition algorithm for estimating the parameters of an anchor HMM. In Section 5, we present our experiments on unsupervised POS tagging. In Section 6, we discuss related works.

2 Notation

We use $[n]$ to denote the set of integers $\{1, \ldots, n\}$. We use $\mathbb{E}[X]$ to denote the expected value of a random variable $X$. We define $\Delta^{m-1} := \{v \in \mathbb{R}^m : v_i \geq 0 \ \forall i, \ \sum_i v_i = 1\}$, i.e., the $(m-1)$-dimensional probability simplex. Given a vector $v \in \mathbb{R}^m$, we use $\operatorname{diag}(v) \in \mathbb{R}^{m \times m}$ to denote the diagonal matrix with $v_1 \ldots v_m$ on the main diagonal. Given a matrix $M \in \mathbb{R}^{n \times m}$, we write $M_i \in \mathbb{R}^m$ to denote the $i$-th row of $M$ (as a column vector).

3 The Anchor Hidden Markov Model

Definition 3.1. An anchor HMM (A-HMM) is a 6-tuple $(n, m, \pi, t, o, \mathcal{A})$ for positive integers $n, m$ and functions $\pi, t, o, \mathcal{A}$ where:

• $[n]$ is a set of observation states.
• $[m]$ is a set of hidden states.
• $\pi(h)$ is the probability of generating $h \in [m]$ in the first position of a sequence.
• $t(h'|h)$ is the probability of generating $h' \in [m]$ given $h \in [m]$.
• $o(x|h)$ is the probability of generating $x \in [n]$ given $h \in [m]$.
• $\mathcal{A}(h) := \{x \in [n] : o(x|h) > 0 \wedge o(x|h') = 0 \ \forall h' \neq h\}$ is non-empty for each $h \in [m]$.

In other words, an A-HMM is an HMM in which each hidden state $h$ is associated with at least one "anchor" observation state that can be generated by, and only by, $h$. Note that the anchor condition implies $n \geq m$.

An equivalent definition of an A-HMM is given by organizing the parameters in matrix form. Under this definition, an A-HMM has parameters $(\pi, T, O)$ where $\pi \in \mathbb{R}^m$ is a vector and $T \in \mathbb{R}^{m \times m}$, $O \in \mathbb{R}^{n \times m}$ are matrices whose entries are set to:

• $\pi_h = \pi(h)$ for $h \in [m]$
• $T_{h',h} = t(h'|h)$ for $h, h' \in [m]$
• $O_{x,h} = o(x|h)$ for $h \in [m]$, $x \in [n]$

The anchor condition implies that $\operatorname{rank}(O) = m$. To see this, consider the rows $O_{a_1} \ldots O_{a_m}$ where $a_h \in \mathcal{A}(h)$. Since each $O_{a_h}$ has a single non-zero entry at the $h$-th index, the columns of $O$ are linearly independent. We assume $\operatorname{rank}(T) = m$.

An important special case of A-HMM introduced by Brown et al. (1992) is defined below. Under this A-HMM, every observation state is an anchor of some hidden state.

Definition 3.2. A Brown model is an A-HMM in which $\mathcal{A}(1) \ldots \mathcal{A}(m)$ partition $[n]$.
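To make Definition 3.1 concrete, the following is a minimal numpy sketch (ours, with hypothetical toy parameter values, not an example from the paper) that encodes a small A-HMM in the matrix form above and checks both the anchor condition and the implied rank property:

```python
import numpy as np

# Toy A-HMM with n = 4 observation states and m = 2 hidden states.
pi = np.array([0.6, 0.4])        # pi[h]    = probability of h in the first position
T  = np.array([[0.7, 0.3],       # T[h', h] = t(h'|h); columns sum to 1
               [0.3, 0.7]])
O  = np.array([[0.5, 0.0],       # O[x, h]  = o(x|h); x = 0 is an anchor for h = 0
               [0.0, 0.6],       #            x = 1 is an anchor for h = 1
               [0.3, 0.2],       #            x = 2, 3 are ambiguous
               [0.2, 0.2]])

def anchor_sets(O):
    """A(h) = {x : o(x|h) > 0 and o(x|h') = 0 for all h' != h} for each h."""
    n, m = O.shape
    return [[x for x in range(n)
             if O[x, h] > 0 and np.all(np.delete(O[x], h) == 0)]
            for h in range(m)]

A = anchor_sets(O)                             # [[0], [1]]
assert all(len(A_h) > 0 for A_h in A)          # anchor condition: every A(h) is non-empty
assert np.linalg.matrix_rank(O) == O.shape[1]  # implied: rank(O) = m
```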
4 Parameter Estimation for A-HMMs

We now derive an algorithm for learning A-HMMs. The algorithm reduces the learning problem to an instance of NMF from which the model parameters can be computed in closed form.

4.1 NMF

We give a brief review of the NMF method of Arora et al. (2013). (Exact) NMF is the following problem: given an $n \times d$ matrix $A = BC$ where $B \in \mathbb{R}^{n \times m}$ and $C \in \mathbb{R}^{m \times d}$ have non-negativity constraints, recover $B$ and $C$. This problem is NP-hard in general (Vavasis, 2009), but Arora et al. (2013) provide an exact and efficient method when $A$ has the following special structure:

Condition 4.1. A matrix $A \in \mathbb{R}^{n \times d}$ satisfies this condition if $A = BC$ for $B \in \mathbb{R}^{n \times m}$ and $C \in \mathbb{R}^{m \times d}$ where:

1. For each $x \in [n]$, $B_x \in \Delta^{m-1}$, i.e., each row of $B$ defines a probability distribution over $[m]$.
2. For each $h \in [m]$, there is some $a_h \in [n]$ such that $B_{a_h,h} = 1$ and $B_{a_h,h'} = 0$ for all $h' \neq h$.
3. $\operatorname{rank}(C) = m$.

Since $\operatorname{rank}(B) = \operatorname{rank}(C) = m$ (by properties 2 and 3), the matrix $A$ must have rank $m$. Note that by property 1, each row of $A$ is given by a convex combination of the rows of $C$: for $x \in [n]$,
$$A_x = \sum_{h=1}^{m} B_{x,h} \times C_h$$
Furthermore, by property 2 each $h \in [m]$ has an associated row $a_h \in [n]$ such that $A_{a_h} = C_h$. These properties can be exploited to recover $B$ and $C$.

A concrete algorithm for factorizing a matrix satisfying Condition 4.1 is given in Figure 1 (Arora et al., 2013). It first identifies $a_1 \ldots a_m$ (up to some permutation) by greedily locating the row of $A$ furthest away from the subspace spanned by the vertices selected so far. Then it recovers each $B_x$ as the convex coefficients required to combine $A_{a_1} \ldots A_{a_m}$ to yield $A_x$. The latter computation (1) can be achieved with any constrained optimization method; we use the Frank-Wolfe algorithm (Frank and Wolfe, 1956). See Arora et al. (2013) for a proof of the correctness of this algorithm.

Anchor-NMF
Input: $A \in \mathbb{R}^{n \times d}$ satisfying Condition 4.1 with $A = BC$ for some $B \in \mathbb{R}^{n \times m}$ and $C \in \mathbb{R}^{m \times d}$, value $m$.
• For $h = 1 \ldots m$, find a vertex $a_h$ as
$$U \leftarrow \text{Gram-Schmidt}(\{A_{a_l}\}_{l=1}^{h-1})$$
$$a_h \leftarrow \arg\max_{x \in [n]} \left\lVert A_x - U U^\top A_x \right\rVert_2$$
where $\text{Gram-Schmidt}(\{A_{a_l}\}_{l=1}^{h-1})$ is the Gram-Schmidt process that orthonormalizes $\{A_{a_l}\}_{l=1}^{h-1}$.
• For $x = 1 \ldots n$, recover the $x$-th row of $B$ as
$$B_x \leftarrow \arg\min_{u \in \Delta^{m-1}} \left\lVert A_x - \sum_{h=1}^{m} u_h A_{a_h} \right\rVert_2 \quad (1)$$
• Set $C = [A_{a_1} \ldots A_{a_m}]^\top$.
Output: $B$ and $C$ such that $B^\top_h = B^\top_{\rho(h)}$ and $C_h = C_{\rho(h)}$ for some permutation $\rho : [m] \to [m]$.

Figure 1: Non-negative matrix factorization algorithm of Arora et al. (2013).
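The following is a compact numpy sketch of the Anchor-NMF procedure in Figure 1 (our reading of it, not the authors' released code): greedy vertex selection by distance from the current span, followed by the simplex-constrained least-squares step (1), here solved with a plain Frank-Wolfe loop as the text suggests. The function name and the fixed iteration count are our choices.

```python
import numpy as np

def anchor_nmf(A, m, n_fw_iters=200):
    """Sketch of Anchor-NMF (Figure 1) for an n x d matrix A under Condition 4.1."""
    n, d = A.shape
    anchors, basis = [], []
    for _ in range(m):
        # Residual of every row after projecting onto the span of chosen vertices.
        if basis:
            U = np.stack(basis, axis=1)              # d x |anchors|, orthonormal columns
            R = A - (A @ U) @ U.T
        else:
            R = A
        a = int(np.argmax(np.linalg.norm(R, axis=1)))
        anchors.append(a)
        basis.append(R[a] / np.linalg.norm(R[a]))    # Gram-Schmidt step
    C = A[anchors]                                   # m x d, rows A_{a_1} ... A_{a_m}
    # Recover each B_x as convex coefficients over the anchor rows, equation (1),
    # via Frank-Wolfe on the probability simplex.
    G, B = C @ C.T, np.zeros((n, m))
    for x in range(n):
        u, b = np.full(m, 1.0 / m), C @ A[x]
        for t in range(n_fw_iters):
            grad = G @ u - b                         # gradient of (1/2)||A_x - C^T u||^2
            s = np.zeros(m)
            s[int(np.argmin(grad))] = 1.0            # best simplex vertex
            u += (2.0 / (t + 2)) * (s - u)           # standard Frank-Wolfe step size
        B[x] = u
    return B, C
```

On a matrix satisfying Condition 4.1 exactly, the selected rows span the same subspace as the true vertices, and each recovered $B_x$ approaches the unique convex coefficients as the number of Frank-Wolfe iterations grows.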
4.2 Random Variables

To derive our algorithm, we make use of certain random variables under the A-HMM. Let $(X_1, \ldots, X_N) \in [n]^N$ be a random sequence of $N \geq 2$ observations drawn from the model, along with the corresponding hidden state sequence $(H_1, \ldots, H_N) \in [m]^N$; independently, pick a position $I \in [N-1]$ uniformly at random. Let $Y_I \in \mathbb{R}^d$ be a $d$-dimensional vector which is conditionally independent of $X_I$ given $H_I$, i.e., $P(Y_I | H_I, X_I) = P(Y_I | H_I)$. We will show how to define such a variable in Section 4.4.1: $Y_I$ will be a function of $(X_1, \ldots, X_N)$ serving as a "context" representation of $X_I$ revealing the hidden state $H_I$.

Define unigram probabilities $u^\infty, u^1 \in \mathbb{R}^n$ and bigram probabilities $B \in \mathbb{R}^{n \times n}$ where
$$u^\infty_x := P(X_I = x) \quad \forall x \in [n]$$
$$u^1_x := P(X_I = x \mid I = 1) \quad \forall x \in [n]$$
$$B_{x,x'} := P(X_I = x, X_{I+1} = x') \quad \forall x, x' \in [n]$$
Additionally, define $\bar{\pi} \in \mathbb{R}^m$ where
$$\bar{\pi}_h = P(H_I = h) \quad \forall h \in [m] \quad (2)$$
We assume $\bar{\pi}_h > 0$ for all $h \in [m]$.

4.3 Derivation of a Learning Algorithm

The following proposition provides a way to use the NMF algorithm in Figure 1 to recover the emission parameters $O$ up to scaling.

Proposition 4.1. Let $X_I \in [n]$ and $Y_I \in \mathbb{R}^d$ be respectively an observation and a context vector drawn from the random process described in Section 4.2. Define a matrix $\Omega \in \mathbb{R}^{n \times d}$ with rows
$$\Omega_x = \mathbb{E}[Y_I \mid X_I = x] \quad \forall x \in [n] \quad (3)$$
If $\operatorname{rank}(\Omega) = m$, then $\Omega$ satisfies Condition 4.1: $\Omega = \tilde{O} \Theta$ where $\tilde{O}_{x,h} = P(H_I = h \mid X_I = x)$ and $\Theta_h = \mathbb{E}[Y_I \mid H_I = h]$.

Proof.
$$\mathbb{E}[Y_I \mid X_I = x] = \sum_{h=1}^{m} P(H_I = h \mid X_I = x) \times \mathbb{E}[Y_I \mid H_I = h, X_I = x] = \sum_{h=1}^{m} P(H_I = h \mid X_I = x) \times \mathbb{E}[Y_I \mid H_I = h]$$
The last equality follows by the conditional independence of $Y_I$. This shows $\Omega = \tilde{O}\Theta$. By the anchor assumption of the A-HMM, each $h \in [m]$ has at least one $x \in \mathcal{A}(h)$ such that $P(H_I = h \mid X_I = x) = 1$, thus $\Omega$ satisfies Condition 4.1.

A useful interpretation of $\Omega$ in Proposition 4.1 is that its rows $\Omega_1 \ldots \Omega_n$ are $d$-dimensional vector representations of observation states forming a convex hull in $\mathbb{R}^d$. This convex hull has $m$ vertices $\Omega_{a_1} \ldots \Omega_{a_m}$ corresponding to anchors $a_h \in \mathcal{A}(h)$ which can be convexly combined to realize all $\Omega_1 \ldots \Omega_n$.

Given $\tilde{O}$, we can recover the A-HMM parameters as follows. First, we recover the stationary state distribution $\bar{\pi}$ as:
$$\bar{\pi}_h = \sum_{x \in [n]} P(H_I = h \mid X_I = x) \times P(X_I = x) = \sum_{x \in [n]} \tilde{O}_{x,h} \times u^\infty_x$$
The emission parameters $O$ are given by Bayes' theorem:
$$O_{x,h} = \frac{P(H_I = h \mid X_I = x) \times P(X_I = x)}{\sum_{x \in [n]} P(H_I = h \mid X_I = x) \times P(X_I = x)} = \frac{\tilde{O}_{x,h} \times u^\infty_x}{\bar{\pi}_h}$$
Using the fact that the emission probabilities are position-independent, we see that the initial state distribution $\pi$ satisfies $u^1 = O\pi$:
$$u^1_x = P(X_I = x \mid I = 1) = \sum_{h \in [m]} P(X_I = x \mid H_I = h, I = 1) \times P(H_I = h \mid I = 1) = \sum_{h \in [m]} O_{x,h} \times \pi_h$$
Thus $\pi$ can be recovered as $\pi = O^+ u^1$ where $O^+$ is the Moore-Penrose pseudoinverse of $O$. Finally, it can be algebraically verified that $B = O \operatorname{diag}(\bar{\pi}) T^\top O^\top$ (Hsu et al., 2012). Since all the involved matrices have rank $m$, we can directly solve for $T$ as
$$T = (\operatorname{diag}(\bar{\pi})^{-1} O^+ B (O^\top)^+)^\top$$
Figure 2 shows the complete algorithm. As input, it receives a matrix $\Omega$ satisfying Proposition 4.1, the number of hidden states, and the probabilities of observed unigrams and bigrams. It first decomposes $\Omega$ using the NMF algorithm in Figure 1. Then it computes the A-HMM parameters whose solution is given analytically.

Learn-Anchor-HMM
Input: $\Omega$ in Proposition 4.1, number of hidden states $m$, bigram probabilities $B$, unigram probabilities $u^\infty, u^1$.
• Compute $(\tilde{O}, \Theta) \leftarrow \text{Anchor-NMF}(\Omega, m)$.
• Recover the parameters:
$$\bar{\pi} \leftarrow \tilde{O}^\top u^\infty \quad (4)$$
$$O \leftarrow \operatorname{diag}(u^\infty)\, \tilde{O}\, \operatorname{diag}(\bar{\pi})^{-1} \quad (5)$$
$$\pi = O^+ u^1 \quad (6)$$
$$T \leftarrow (\operatorname{diag}(\bar{\pi})^{-1} O^+ B (O^\top)^+)^\top \quad (7)$$
Output: A-HMM parameters $(\pi, T, O)$

Figure 2: NMF-based learning algorithm for A-HMMs. The algorithm Anchor-NMF is given in Figure 1.

The following theorem guarantees the consistency of the algorithm.

Theorem 4.1. Let $(\pi, T, O)$ be an A-HMM such that $\operatorname{rank}(T) = m$ and $\bar{\pi}$ defined in (2) has strictly positive entries $\bar{\pi}_h > 0$. Given $\Omega$ satisfying Proposition 4.1 and $B, u^\infty, u^1$ under this model, the algorithm Learn-Anchor-HMM in Figure 2 outputs $(\pi, T, O)$ up to a permutation on hidden states.

Proof. By Proposition 4.1, $\Omega$ satisfies Condition 4.1 with $\Omega = \tilde{O}\Theta$, thus $\tilde{O}$ can be recovered up to a permutation on columns with the algorithm Anchor-NMF. The consistency of the recovered parameters follows from the correctness of (4)–(7) under the rank conditions.
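Equations (4)–(7) are simple enough to transcribe directly. Below is a numpy sketch of Learn-Anchor-HMM (Figure 2), under our reading of the figure, that reuses the `anchor_nmf` sketch above; it implements the unconstrained closed-form recovery, not the constrained variant of Section 4.3.1:

```python
import numpy as np

def learn_anchor_hmm(Omega, m, B, u_inf, u_1):
    """Sketch of Learn-Anchor-HMM (Figure 2). Omega: n x d matrix of Proposition 4.1;
    B: n x n bigram probabilities; u_inf, u_1: unigram probability vectors."""
    O_tilde, _ = anchor_nmf(Omega, m)            # O_tilde[x, h] = P(H_I = h | X_I = x)
    pi_bar = O_tilde.T @ u_inf                   # (4) stationary state distribution
    O = (u_inf[:, None] * O_tilde) / pi_bar      # (5) O[x, h] = O_tilde[x, h] u_inf[x] / pi_bar[h]
    O_pinv = np.linalg.pinv(O)                   # Moore-Penrose pseudoinverse
    pi = O_pinv @ u_1                            # (6) initial state distribution
    T = (np.diag(1.0 / pi_bar) @ O_pinv @ B @ np.linalg.pinv(O.T)).T  # (7) transitions
    return pi, T, O
```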
4.3.1 Constrained Optimization for $\pi$ and $T$

Note that (6) and (7) require computing the pseudoinverse of the estimated $O$, which can be expensive and vulnerable to sampling errors in practice. To make our parameter estimation more robust, we can explicitly impose probability constraints. We recover $\pi$ by solving:
$$\pi = \arg\min_{\pi' \in \Delta^{m-1}} \left\lVert u^1 - O\pi' \right\rVert_2 \quad (8)$$
which can again be done with algorithms such as Frank-Wolfe. We recover $T$ by maximizing the log likelihood of observation bigrams
$$\sum_{x,x'} B_{x,x'} \log \sum_{h,h' \in [m]} \bar{\pi}_h O_{x,h} T_{h',h} O_{x',h'} \quad (9)$$
subject to the constraint $(T^\top)_h \in \Delta^{m-1}$. Since (9) is concave in $T$ with the other parameters $O$ and $\bar{\pi}$ fixed, we can use EM to find the global optimum.

4.4 Construction of the Convex Hull $\Omega$

In this section, we provide several ways to construct a convex hull $\Omega$ satisfying Proposition 4.1.

4.4.1 Choice of the Context $Y_I$

In order to satisfy Proposition 4.1, we need to define the context variable $Y_I \in \mathbb{R}^d$ with two properties:

• $P(Y_I \mid H_I, X_I) = P(Y_I \mid H_I)$
• The matrix $\Omega$ with rows $\Omega_x = \mathbb{E}[Y_I \mid X_I = x]$ $\forall x \in [n]$ has rank $m$.

A simple construction (Arora et al., 2013) is given by defining $Y_I \in \mathbb{R}^n$ to be an indicator vector for the next observation:
$$[Y_I]_{x'} = \begin{cases} 1 & \text{if } X_{I+1} = x' \\ 0 & \text{otherwise} \end{cases} \quad (10)$$
The first condition is satisfied since $X_{I+1}$ does not depend on $X_I$ given $H_I$. For the second condition, observe that $\Omega_{x,x'} = P(X_{I+1} = x' \mid X_I = x)$, or in matrix form
$$\Omega = \operatorname{diag}(u^\infty)^{-1} B \quad (11)$$
Under the rank conditions in Theorem 4.1, (11) has rank $m$.

More generally, we can let $Y_I$ be an observation (encoded as an indicator vector as in (10)) randomly drawn from a window of $L \in \mathbb{N}$ nearby observations. We can either only use the identity of the chosen observation (in which case $Y_I \in \mathbb{R}^n$) or additionally indicate the relative position in the window (in which case $Y_I \in \mathbb{R}^{nL}$). It is straightforward to verify that the above two conditions are satisfied under these definitions. Clearly, (11) is a special case with $L = 1$.

4.4.2 Reducing the Dimension of $\Omega_x$

With the definition of $\Omega$ in the previous section, the dimension of $\Omega_x$ is $d = O(n)$, which can be difficult to work with when $n \gg m$. Proposition 4.1 allows us to reduce the dimension as long as the final matrix retains the form in (3) and has rank $m$. In particular, we can multiply $\Omega$ by any rank-$m$ projection matrix $\Pi \in \mathbb{R}^{d \times m}$ on the right side: if $\Omega$ satisfies the properties in Proposition 4.1, then so does $\Omega\Pi$ with $m$-dimensional rows
$$(\Omega\Pi)_x = \mathbb{E}[\Pi^\top Y_I \mid X_I = x]$$
Since $\operatorname{rank}(\Omega) = m$, a natural choice of $\Pi$ is the projection onto the best-fit $m$-dimensional subspace of the row space of $\Omega$.

We mention that previous works on the NMF-learning framework have employed various projection methods, but they do not examine the relative merits of their choices. For instance, Arora et al. (2013) simply use random projection, which is convenient for theoretical analysis. Cohen and Collins (2014) use a projection based on canonical correlation analysis (CCA) without further exploration. In contrast, we give a full comparison of valid construction methods and find that the choice of $\Omega$ is crucial in practice.
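As a concrete illustration of the construction (11) followed by the best-fit projection of Section 4.4.2, here is a small numpy sketch (ours; it assumes every word id occurs at least once, so that $u^\infty$ has no zero entries):

```python
import numpy as np

def build_omega_bestfit(corpus, n, m):
    """Estimate B and u_inf from data, form Omega = diag(u_inf)^{-1} B as in (11),
    then project onto the best-fit m-dimensional subspace of its row space.
    corpus: list of sequences of integer word ids in [0, n)."""
    B = np.zeros((n, n))
    for seq in corpus:
        for x, x_next in zip(seq, seq[1:]):      # count observation bigrams
            B[x, x_next] += 1.0
    B /= B.sum()                                 # B[x, x'] ~ P(X_I = x, X_{I+1} = x')
    u_inf = B.sum(axis=1)                        # u_inf[x] ~ P(X_I = x)
    Omega_full = B / u_inf[:, None]              # (11): rows are P(X_{I+1} = . | X_I = x)
    # Best-fit projection: top-m right singular vectors of Omega_full.
    _, _, Vt = np.linalg.svd(Omega_full, full_matrices=False)
    return Omega_full @ Vt[:m].T, B, u_inf       # Omega is n x m
```

The returned $B$ and $u^\infty$ estimates are two of the inputs that Learn-Anchor-HMM expects alongside $\Omega$.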
4.4.3 Construction of $\Omega$ for the Brown Model

We can formulate an alternative way to construct a valid $\Omega$ when the model is further restricted to be a Brown model. Since every observation is an anchor, $O_x \in \mathbb{R}^m$ has a single nonzero entry for every $x$. Thus the rows defined by $\Omega_x = O_x / \lVert O_x \rVert$ (an indicator vector for the unique hidden state of $x$) form a trivial convex hull in which every point is a vertex. This corresponds to choosing an oracle context $Y_I \in \mathbb{R}^m$ where
$$[Y_I]_h = \begin{cases} 1 & \text{if } H_I = h \\ 0 & \text{otherwise} \end{cases}$$
It is possible to recover the Brown model parameters $O$ up to element-wise scaling and rotation of rows using the algorithm of Stratos et al. (2015). More specifically, let $f(O) \in \mathbb{R}^{n \times m}$ denote the output of their algorithm. Then they show that for some vector $s \in \mathbb{R}^m$ with strictly positive entries and an orthogonal matrix $Q \in \mathbb{R}^{m \times m}$:
$$f(O) = O^{\langle 1/4 \rangle} \operatorname{diag}(s) Q^\top$$
where $O^{\langle 1/4 \rangle}$ is an element-wise exponentiation of $O$ by $1/4$. Since the rows of $f(O)$ are simply some scaling and rotation of the rows of $O$, using $\Omega_x = f(O)_x / \lVert f(O)_x \rVert$ yields a valid $\Omega$.

While we need to impose an additional assumption (the Brown model restriction) in order to justify this choice of $\Omega$, we find in our experiments that it performs better than other alternatives. We speculate that this is because a Brown model is rather appropriate for the POS tagging task; many words are indeed unambiguous with respect to POS tags (Table 5). Also, the general effectiveness of $f(O)$ for representational purposes has been demonstrated in previous works (Stratos et al., 2014; Stratos et al., 2015). By restricting the A-HMM to be a Brown model, we can piggyback on the proven effectiveness of $f(O)$.

Figure 3 shows an algorithm for constructing $\Omega$ with these different construction methods. For simplicity, we only show the bigram construction (context size $L = 1$), but an extension for larger context ($L > 1$) is straightforward as discussed earlier. The construction methods random (random projection), best-fit (projection to the best-fit subspace), and cca (CCA projection) all compute (11) and differ only in how the dimension is reduced. The construction method brown computes the transformed Brown parameters $f(O)$ as the left singular vectors of a scaled covariance matrix and then normalizes its rows. We direct the reader to Stratos et al. (2015) for a derivation of this calculation.

Input: bigram probabilities $B$, unigram probabilities $u^\infty$, number of hidden states $m$, construction method $\tau$
Scaled matrices ($\sqrt{\cdot}$ is element-wise):
$$\mathcal{B} := \operatorname{diag}(u^\infty)^{-1/2}\, B\, \operatorname{diag}(u^\infty)^{-1/2}$$
$$\tilde{\mathcal{B}} := \operatorname{diag}\left(\sqrt{u^\infty}\right)^{-1/2} \sqrt{B}\, \operatorname{diag}\left(\sqrt{u^\infty}\right)^{-1/2}$$
Singular vectors: $U(M)$ ($V(M)$) is an $n \times m$ matrix of the left (right) singular vectors of $M$ corresponding to the largest $m$ singular values.
• If $\tau \neq \text{brown}$: set $\Omega \leftarrow \operatorname{diag}(u^\infty)^{-1} B \Pi$ where the projection matrix $\Pi \in \mathbb{R}^{n \times m}$ is given by
$$\Pi_{i,j} \sim \mathcal{N}(0, 1/m) \quad \text{if } \tau = \text{random}$$
$$\Pi = V(\operatorname{diag}(u^\infty)^{-1} B) \quad \text{if } \tau = \text{best-fit}$$
$$\Pi = \operatorname{diag}(u^\infty)^{-1/2} V(\mathcal{B}) \quad \text{if } \tau = \text{cca}$$
• If $\tau = \text{brown}$: compute the transformed emission matrix as $f(O) = U(\tilde{\mathcal{B}})$ and set $\Omega \leftarrow \operatorname{diag}(v)^{-1} f(O)$ where $v_x := \lVert f(O)_x \rVert_2$ is the length of the $x$-th row of $f(O)$.
Output: $\Omega \in \mathbb{R}^{n \times m}$ in Proposition 4.1

Figure 3: Algorithm for constructing a valid $\Omega$ with different construction methods. For simplicity, we only show the bigram construction (context size $L = 1$), but an extension for larger context ($L > 1$) is straightforward.
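For the $\tau = \text{brown}$ branch of Figure 3, a direct numpy transcription looks as follows (a sketch under our reading of the figure; the element-wise square roots and the row normalization follow the definitions of $\tilde{\mathcal{B}}$ and $v$ above):

```python
import numpy as np

def build_omega_brown(B, u_inf, m):
    """Sketch of the tau = brown construction in Figure 3."""
    r = u_inf ** 0.25                  # diag(sqrt(u_inf))^{-1/2} has entries u_inf^{-1/4}
    B_tilde = np.sqrt(B) / (r[:, None] * r[None, :])
    U, _, _ = np.linalg.svd(B_tilde, full_matrices=False)
    f_O = U[:, :m]                     # transformed emission matrix f(O) = U(B-tilde)
    v = np.linalg.norm(f_O, axis=1)    # v[x] = length of the x-th row of f(O)
    return f_O / v[:, None]            # Omega = diag(v)^{-1} f(O)
```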
4.4.4 $\Omega$ with Feature Augmentation

The $x$-th row of $\Omega$ is a $d$-dimensional vector representation of $x$ lying in a convex set with $m$ vertices. This suggests a natural way to incorporate domain-specific features: we can add additional dimensions that provide information about hidden states from the surface form of $x$.

For instance, consider the POS tagging task. In the simple construction (11), the representation of word $x$ is defined in terms of neighboring words $x'$:
$$[\Omega_x]_{x'} = \mathbb{E}\left[\mathbf{1}\left(X_{I+1} = x'\right) \mid X_I = x\right]$$
where $\mathbf{1}(\cdot) \in \{0,1\}$ is the indicator function. We can augment this vector with $s$ additional dimensions indicating the spelling features of $x$. For instance, the $(n+1)$-th dimension may be defined as:
$$[\Omega_x]_{n+1} = \mathbb{E}\left[\mathbf{1}\left(x \text{ ends in ``ing''}\right) \mid X_I = x\right]$$
This value will be generally large for verbs and small for non-verbs, nudging verbs closer together and away from non-verbs. The modified $(n+s)$-dimensional representation is followed by the usual dimension reduction.

Note that the spelling features are a deterministic function of a word, and we are implicitly assuming that they are independent of the word given its tag. While this is of course not true in practice, we find that these features can significantly boost the tagging performance.
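A sketch of this augmentation (ours; the feature set and the helper name are illustrative, and the 0.1 weight mirrors the value reported later in Section 5.3): append indicator columns computed from each word's surface form, rescaled so that their overall $\ell_2$ norm is a small fraction of that of the original dimensions, and then apply the usual dimension reduction.

```python
import numpy as np

def augment_omega(Omega_full, words, weight=0.1):
    """Append s spelling-feature dimensions to the n x d matrix Omega_full.
    words: the n word strings, in the same order as the rows of Omega_full."""
    feats = np.array([[float(w[:1].isupper()),              # capitalized
                       float("-" in w),                     # contains a hyphen
                       float(any(c.isdigit() for c in w)),  # contains a digit
                       float(w.endswith("ing"))]            # example suffix indicator
                      for w in words])
    # Scale the extra dimensions to `weight` times the norm of the original ones.
    scale = weight * np.linalg.norm(Omega_full) / max(np.linalg.norm(feats), 1e-12)
    return np.hstack([Omega_full, scale * feats])           # rows gain s dimensions
```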
5 Experiments

We evaluate our A-HMM learning algorithm on the task of unsupervised POS tagging. The goal of this task is to induce the correct sequence of POS tags (hidden states) given a sequence of words (observation states). The anchor condition corresponds to assuming that each POS tag has at least one word that occurs only under that tag.

5.1 Background on Unsupervised POS Tagging

Unsupervised POS tagging has long been an active area of research (Smith and Eisner, 2005a; Johnson, 2007; Toutanova and Johnson, 2007; Haghighi and Klein, 2006; Berg-Kirkpatrick et al., 2010), but results on this task are complicated by varying assumptions and unclear evaluation metrics (Christodoulopoulos et al., 2010). Rather than addressing multiple alternatives for evaluating unsupervised POS tagging, we focus on a simple and widely used metric: many-to-one accuracy (i.e., we map each hidden state to the most frequently coinciding POS tag in the labeled data and compute the resulting accuracy).

5.1.1 Better Model vs. Better Learning

Vanilla HMMs are notorious for their mediocre performance on this task, and it is well known that they perform poorly largely because of model misspecification, not because of suboptimal parameter estimation (e.g., because EM gets stuck in local optima). More generally, a large body of work points to the inappropriateness of simple generative models for unsupervised induction of linguistic structure (Merialdo, 1994; Smith and Eisner, 2005b; Liang and Klein, 2008).

Consequently, many works focus on using more expressive models such as log-linear models (Smith and Eisner, 2005a; Berg-Kirkpatrick et al., 2010) and Markov random fields (MRFs) (Haghighi and Klein, 2006). These models are shown to deliver good performance even though learning is approximate. Thus one may question the value of having a consistent estimator for A-HMMs and Brown models in this work: if the model is wrong, what is the point of learning it accurately?

However, there is also ample evidence that HMMs are competitive for unsupervised POS induction when they incorporate domain-specific structures. Johnson (2007) is able to outperform the sophisticated MRF model of Haghighi and Klein (2006) on one-to-one accuracy by using a sparse prior in HMM estimation. The clustering method of Brown et al. (1992), which is based on optimizing the likelihood under the Brown model (a special case of HMM), remains a baseline difficult to outperform (Christodoulopoulos et al., 2010).

We add to this evidence by demonstrating the effectiveness of A-HMMs on this task. We also check the anchor assumption on data and show that the A-HMM model structure is in fact appropriate for the problem (Table 5).

5.2 Experimental Setting

We use the universal treebank dataset (version 2.0), which contains sentences annotated with 12 POS tag types for 10 languages (McDonald et al., 2013). All models are trained with 12 hidden states. We use the English portion to experiment with different hyperparameter configurations. At test time, we fix a configuration (based on the English portion) and apply it across all languages.

The list of compared methods is given below:

BW The Baum-Welch algorithm, an EM algorithm for HMMs (Baum and Petrie, 1966).

CLUSTER A parameter estimation scheme for HMMs based on Brown clustering (Brown et al., 1992). We run the Brown clustering algorithm¹ to obtain 12 word clusters $C_1 \ldots C_{12}$. Then we set the emission parameters $o(x|h)$, transition parameters $t(h'|h)$, and prior $\pi(h)$ to be the maximum-likelihood estimates under the fixed clusters.

¹We use the implementation of Liang (2005).

ANCHOR Our algorithm Learn-Anchor-HMM in Figure 2, but with the constrained optimization (8) and (9) for estimating $\pi$ and $T$.²

²https://github.com/karlstratos/anchor

ANCHOR-FEATURES Same as ANCHOR but employs the feature augmentation scheme described in Section 4.4.4.

LOG-LINEAR The unsupervised log-linear model described in Berg-Kirkpatrick et al. (2010). Instead of emission parameters $o(x|h)$, the model maintains a miniature log-linear model with a weight vector $w$ and a feature function $\phi$. The probability of a word $x$ given tag $h$ is computed as
$$p(x|h) = \frac{\exp(w^\top \phi(x,h))}{\sum_{x' \in [n]} \exp(w^\top \phi(x',h))}$$
The model can be trained by maximizing the likelihood of observed sequences. We use L-BFGS to directly optimize this objective.³ This approach obtains the current state-of-the-art accuracy on the fine-grained (45-tag) English WSJ dataset.

³We use the implementation of Berg-Kirkpatrick et al. (2010) (personal communication).

We use maximum marginal decoding for HMM predictions: i.e., at each position, we predict the most likely tag given the entire sentence.
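Maximum marginal decoding is the standard posterior decoding computed with the forward-backward recursions; the sketch below (ours, with per-position normalization added for numerical stability) picks $\arg\max_h P(H_i = h \mid x_1 \ldots x_N)$ at each position:

```python
import numpy as np

def max_marginal_decode(x, pi, T, O):
    """Posterior (maximum marginal) decoding for an HMM in the notation of
    Section 3: T[h', h] = t(h'|h) and O[x, h] = o(x|h). x: list of word ids."""
    N, m = len(x), len(pi)
    alpha, beta = np.zeros((N, m)), np.ones((N, m))
    alpha[0] = pi * O[x[0]]
    alpha[0] /= alpha[0].sum()                   # normalize each step for stability
    for i in range(1, N):
        alpha[i] = O[x[i]] * (T @ alpha[i - 1])  # sum over the previous state
        alpha[i] /= alpha[i].sum()
    for i in range(N - 2, -1, -1):
        beta[i] = T.T @ (O[x[i + 1]] * beta[i + 1])
        beta[i] /= beta[i].sum()
    posterior = alpha * beta                     # proportional to P(H_i = h | x_1..x_N)
    return posterior.argmax(axis=1)              # most likely tag at each position
```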
5.3 Practical Issues with the Anchor Algorithm

In our experiments, we find that Anchor-NMF (Figure 1) tends to propose extremely rare words as anchors. A simple fix is to search for anchors only among relatively frequent words. We find that any reasonable frequency threshold works well; we use the 300 most frequent words. Note that this is not a problem if these 300 words include anchor words corresponding to all the 12 tags.

We must define the context for constructing $\Omega$. We use the previous and next words (i.e., context size $L = 2$) marked with relative positions. Thus $\Omega$ has $2n$ columns before dimension reduction. Table 1 shows the performance on the English portion with different construction methods for $\Omega$. The Brown construction ($\tau = \text{brown}$ in Figure 3) clearly performs the best: essentially, the anchor algorithm is used to extract the HMM parameters from the CCA-based word embeddings of Stratos et al. (2015).

Choice of $\Omega$   Accuracy
Random              48.2
Best-Fit            53.4
CCA                 57.0
Brown               66.1

Table 1: Many-to-one accuracy on the English data with different choices of the convex hull $\Omega$ (Figure 3). These results do not use spelling features.

We also explore the feature augmentation discussed in Section 4.4.4. For comparison, we employ the same word features used by Berg-Kirkpatrick et al. (2010):

• Indicators for whether a word is capitalized, contains a hyphen, or contains a digit
• Suffixes of length 1, 2, and 3

We weigh the $\ell_2$ norm of these extra dimensions in relation to the original dimensions: we find a small weight (e.g., 0.1 of the norm of the original dimensions) works well. We also find that these features can sometimes significantly improve the performance. For instance, the accuracy on the English portion can be improved from 66.1% to 71.4% with feature augmentation.

Another natural experiment is to refine the HMM parameters obtained from the anchor algorithm (or Brown clusters) with a few iterations of the Baum-Welch algorithm. In our experiments, however, it did not significantly improve the tagging performance, so we omit this result.

5.4 Tagging Accuracy

Table 2 shows the many-to-one accuracy on all languages in the dataset. For the Baum-Welch algorithm and the unsupervised log-linear models, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations.

Model            de          en          es          fr          id          it          ja          ko          pt-br       sv
BW               45.5 (4.8)  59.8 (3.4)  60.6 (2.2)  60.1 (3.6)  49.6 (3.1)  51.5 (2.6)  59.5 (2.1)  51.7 (0.6)  59.5 (3.7)  42.4 (3.0)
CLUSTER          60.0        62.9        67.4        66.4        59.3        66.1        60.3        47.5        67.4        61.9
ANCHOR           61.1        66.1        69.0        68.2        63.7        60.4        65.3        53.8        64.9        51.1
ANCHOR-FEATURES  63.4        71.4        74.3        71.9        67.3        60.2        69.4        61.8        65.8        61.0
LOG-LINEAR       67.5 (1.8)  62.4 (3.5)  67.1 (3.1)  62.1 (4.5)  61.3 (3.9)  52.9 (2.9)  78.2 (2.9)  60.5 (3.6)  63.2 (2.2)  56.7 (2.5)

Table 2: Many-to-one accuracy on each language using 12 universal tags. The first four models are HMMs estimated with the Baum-Welch algorithm (BW), the clustering algorithm of Brown et al. (1992), and the anchor algorithm without (ANCHOR) and with (ANCHOR-FEATURES) feature augmentation. LOG-LINEAR is the model of Berg-Kirkpatrick et al. (2010) trained with the direct-gradient method using L-BFGS. For BW and LOG-LINEAR, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations.

Both ANCHOR and ANCHOR-FEATURES compete favorably. On 5 out of 10 languages, ANCHOR-FEATURES achieves the highest accuracy, often closely followed by ANCHOR. The Brown clustering estimation is also competitive and has the highest accuracy on 3 languages. Not surprisingly, vanilla HMMs trained with BW perform the worst (see Section 5.1.1 for a discussion).

LOG-LINEAR is a robust baseline and performs the best on the remaining 2 languages. It performs especially strongly on the Japanese and Korean datasets, in which poorly segmented strings such as "1950年11月5日には" (on November 5, 1950) and "40.3%로" (by 40.3%) abound. In these datasets, it is crucial to make effective use of morphological features.

5.5 Qualitative Analysis

5.5.1 A-HMM Parameters

An A-HMM can be easily interpreted since each hidden state is marked with an anchor observation. Table 3 shows the 12 anchors found in each language. Note that these anchor words generally have a wide coverage of possible POS tags.

We also experimented with using true anchor words (obtained from labeled data), but they did not improve performance over automatically induced anchors. Since anchor discovery is inherently tied to parameter estimation, it is better to obtain anchors in a data-driven manner. In particular, certain POS tags (e.g., X) appear quite infrequently, and the model is worse off by being forced to allocate a hidden state for such a tag.

Table 4 shows words with the highest emission probabilities $o(x|h)$ under each anchor. We observe that an anchor is representative of a certain group of words. For instance, the state "loss" represents noun-like words, "1" represents numbers, "on" represents preposition-like words, "one" represents determiner-like words, and "closed" represents verb-like words. The conditional distribution is peaked for anchors that represent function tags (e.g., determiners, punctuation) and flat for anchors that represent content tags (e.g., nouns). Occasionally, an anchor assigns high probabilities to words that do not seem to belong to the corresponding POS tag. But this is to be expected since $o(x|h) \propto P(X_I = x)$ is generally larger for frequent words.
de         en      es         fr         id         it     ja            ko      pt-br    sv
empfehlen  loss    y          avait      bulan      radar  お世話に      완전    E        och
wie        1       hizo       commune    tetapi     però   ないと        중에    de       bör
;          on      -          Le         wilayah    sulle  ことにより    경우    partida  grund
Sein       one     especie    de         –                 されている。  줄      fazer    mellan
Berlin     closed  Además     président  Bagaimana  Stati  ものを        같아요  meses    i
und        are     el         qui        ,          Lo     ,             많은    os       sociala
,          take    países     (          sama       legge  した          ,       :        .
-          ,       la         à          .          al     それは        볼      diretor  bli
der        vice    España     États      dan        far-   、            자신의  2010     den
im         to      en         Unis       Utara      di     幸福の        받고    ,        ,
des        York    de         Cette      pada       la     ことが        맛있는  uma      tid
Region     Japan   municipio  quelques   yang       art.   通常の        위한    O        Detta

Table 3: Anchor words found in each language (model ANCHOR-FEATURES).

loss    year (.02)  market (.01)  share (.01)  company (.01)  stock (.01)  quarter (.01)  shares (.01)  price (.01)
1       1 (.03)  10 (.02)  30 (.02)  15 (.02)  8 (.02)  2 (.01)  20 (.01)  50 (.01)
on      of (.14)  in (.12)  . (.08)  for (.06)  on (.04)  by (.04)  from (.04)  and (.03)
one     the (.23)  a (.12)  " (.03)  an (.03)  $ (.02)  is (.02)  that (.02)  this (.02)
closed  said (.05)  's (.02)  is (.02)  says (.02)  was (.01)  has (.01)  had (.01)  expected (.01)
are     and (.08)  is (.08)  are (.05)  was (.04)  's (.04)  " (.04)  has (.03)  of (.03)
take    be (.04)  % (.02)  have (.02)  million (.02)  But (.02)  do (.01)  The (.01)  make (.01)
,       , (.53)  . (.25)  and (.05)  " (.04)  % (.01)  million (.01)  – (.01)  that (.01)
vice    's (.03)  The (.02)  " (.02)  New (.01)  and (.01)  new (.01)  first (.01)  chief (.01)
to      to (.39)  . (.11)  a (.06)  will (.04)  $ (.03)  n't (.03)  would (.02)  % (.02)
York    the (.15)  a (.05)  The (.04)  of (.04)  's (.04)  million (.01)  % (.01)  is (.01)
Japan   Mr. (.03)  it (.02)  " (.02)  $ (.02)  he (.02)  that (.02)  which (.01)  company (.01)

Table 4: Most likely words under each anchor word (English model ANCHOR-FEATURES). Emission probabilities $o(x|h)$ are given in parentheses.

5.5.2 Model Assumptions on Data

Table 5 checks the assumptions in A-HMMs and Brown models on the universal treebank dataset. The anchor assumption is indeed satisfied with 12 universal tags: in every language, each tag has at least one word uniquely associated with the tag. The Brown assumption (each word has exactly one possible tag) is of course not satisfied, since some words are genuinely ambiguous with respect to their POS tags. However, the percentage of unambiguous words is very high (well over 90%). This analysis supports that the model assumptions made by A-HMMs and Brown models are appropriate for POS tagging.

                 de     en     es     fr     id     it     ja     ko     pt-br  sv
% anchored tags  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0
% unambig words  96.6   91.5   94.0   94.2   94.8   94.8   99.5   98.4   94.8   97.4

Tag     .  VERB  PRON  ADP   NOUN  ADV  CONJ  DET    NUM      ADJ  X     PRT
Anchor  ,  said  it    from  Mr.   n't  or    which  billion  new  bono  na

Table 5: Verifying model assumptions on the universal treebank. The anchor assumption is satisfied in every language. The Brown assumption (each word has exactly one possible tag) is violated but not by a large margin. The lower table shows the most frequent anchor word under each tag on the English portion.

Table 6 reports the log likelihood (normalized by the number of words) on the English portion for different estimation methods for HMMs. BW and CLUSTER obtain higher likelihood than the anchor algorithm, but this is expected given that both EM and Brown clustering directly optimize likelihood. In contrast, the anchor algorithm is based on the method of moments and does not (at least directly) optimize likelihood. Note that high likelihood does not imply high accuracy under HMMs.

Model            Normalized LL  Acc
BW               -6.45          59.8
CLUSTER          -6.71          62.9
ANCHOR           -7.06          66.1
ANCHOR-FEATURES  -7.05          71.4

Table 6: Log likelihood normalized by the number of words on English (along with accuracy). For BW, we report the mean of 10 random restarts run for 1,000 iterations.

6 Related Work

6.1 Latent-Variable Models

There has recently been great progress in estimation of models with latent variables. Despite the NP-hardness in general cases (Terwijn, 2002; Arora et al., 2012), many algorithms with strong theoretical guarantees have emerged under natural assumptions. For example, for HMMs with full-rank conditions, Hsu et al. (2012) derive a consistent estimator of the marginal distribution of observed sequences. Anandkumar et al. (2014) propose an exact tensor decomposition method for learning a wide class of latent-variable models with similar non-degeneracy conditions. Arora et al. (2013) derive a provably correct learning algorithm for topic models with a certain parameter structure.

The anchor-based framework was originally formulated for learning topic models (Arora et al., 2013). It has been subsequently adopted to learn other models such as latent-variable probabilistic context-free grammars (Cohen and Collins, 2014). In our work, we have extended this framework to address unsupervised sequence labeling.

Zhou et al. (2014) also extend Arora et al. (2013)'s framework to learn various models including HMMs, but they address a more general problem. Consequently, their algorithm draws from Anandkumar et al. (2012) and is substantially different from ours.

6.2 Unsupervised POS Tagging

Unsupervised POS tagging is a classic problem in unsupervised learning that has been tackled with various approaches.
Johnson (2007) observes that EM performs poorly in this task because it induces flat distributions; this is not the case with our algorithm, as seen in the peaky distributions in Table 4. Haghighi and Klein (2006) assume a set of prototypical words for each tag and report high accuracy. In contrast, our algorithm automatically finds such prototypes in a subroutine.

Berg-Kirkpatrick et al. (2010) achieve the state-of-the-art result in unsupervised fine-grained POS tagging (mid-70%). As described in Section 5.2, their model is an HMM in which probabilities are given by log-linear models. Table 7 provides a point of reference comparing our work with Berg-Kirkpatrick et al. (2010) in their setting: models are trained and tested on the entire 45-tag WSJ dataset. Their model outperforms our approach in this setting: with fine-grained tags, spelling features become more important, for instance to distinguish "played" (VBD) from "play" (VB). However, we have shown that our approach is competitive when universal tags are used (Table 2).

Models           Accuracy
BW               62.6 (1.1)
CLUSTER          65.6
ANCHOR           67.2
ANCHOR-FEATURES  67.7
LOG-LINEAR       74.9 (1.5)

Table 7: Many-to-one accuracy on the English data with 45 original tags. We use the same setting as in Table 2. For BW and LOG-LINEAR, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations.

Many past works on POS induction predate the introduction of the universal tagset by Petrov et al. (2012) and thus report results with fine-grained tags. More recent works adopt the universal tagset, but they leverage additional resources. For instance, Das and Petrov (2011) and Täckström et al. (2013) use parallel data to project POS tags from a supervised source language. Li et al. (2012) use tag dictionaries built from Wiktionary. Thus their results are not directly comparable to ours.⁴

⁴Das and Petrov (2011) conduct unsupervised experiments using the model of Berg-Kirkpatrick et al. (2010), but their dataset and evaluation method differ from ours.

7 Conclusion

We have presented an exact estimation method for learning anchor HMMs from unlabeled data. There are several directions for future work. An important direction is to extend the method to a richer family of models such as log-linear models or neural networks. Another direction is to further generalize the method to handle a wider class of HMMs by relaxing the anchor condition (Condition 4.1). This will require a significant extension of the NMF algorithm in Figure 1.
Acknowledgments

We thank Taylor Berg-Kirkpatrick for providing the implementation of Berg-Kirkpatrick et al. (2010). We also thank the anonymous reviewers for their constructive comments.

References

Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. 2012. A method of moments for mixture models and hidden Markov models. In Twenty-Fifth Annual Conference on Learning Theory.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832.

Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning topic models – going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1–10. IEEE.

Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 280–288.

Leonard E. Baum and Ted Petrie. 1966. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 582–590. Association for Computational Linguistics.

Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 575–584. Association for Computational Linguistics.

Shay B. Cohen and Michael Collins. 2014. A provably correct learning algorithm for latent-variable PCFGs. In Proceedings of ACL.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 600–609. Association for Computational Linguistics.

Marguerite Frank and Philip Wolfe. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110.

Aria Haghighi and Dan Klein. 2006. Prototype-driven learning for sequence models. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 320–327. Association for Computational Linguistics.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. 2012. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In EMNLP-CoNLL, pages 296–305.

Shen Li, Joao V. Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1389–1398. Association for Computational Linguistics.

Percy Liang and Dan Klein. 2008. Analyzing the errors of unsupervised learning. In ACL, pages 879–887.

Percy Liang. 2005. Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.

Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria B. Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In ACL, pages 92–97.

Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12).

Noah A. Smith and Jason Eisner. 2005a. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 354–362. Association for Computational Linguistics.

Noah A. Smith and Jason Eisner. 2005b. Guiding unsupervised grammar induction using contrastive estimation. In Proc. of IJCAI Workshop on Grammatical Inference Applications, pages 73–82.
Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel Hsu. 2014. A spectral algorithm for learning class-based n-gram models of natural language. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence.

Karl Stratos, Michael Collins, and Daniel Hsu. 2015. Model-based word embeddings from decompositions of count matrices. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1282–1291, Beijing, China, July. Association for Computational Linguistics.

Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 1:1–12.

Sebastiaan A. Terwijn. 2002. On the learnability of hidden Markov models. In Grammatical Inference: Algorithms and Applications, pages 261–268. Springer.

Kristina Toutanova and Mark Johnson. 2007. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In Advances in Neural Information Processing Systems, pages 1521–1528.

Stephen A. Vavasis. 2009. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377.

Tianyi Zhou, Jeff A. Bilmes, and Carlos Guestrin. 2014. Divide-and-conquer learning by anchoring a conical hull. In Advances in Neural Information Processing Systems, pages 1242–1250.