Transactions of the Association for Computational Linguistics, vol. 3, pp. 433–447, 2015. Action Editor: Sharon Goldwater.
Submission batch: 10/2014; Revision batch: 3/2015; Published 8/2015.
© 2015 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Modeling Word Forms Using Latent Underlying Morphs and Phonology

Ryan Cotterell and Nanyun Peng and Jason Eisner
Department of Computer Science, Johns Hopkins University
{ryan.cotterell,npeng1,eisner}@jhu.edu

Abstract

The observed pronunciations or spellings of words are often explained as arising from the "underlying forms" of their morphemes. These forms are latent strings that linguists try to reconstruct by hand. We propose to reconstruct them automatically at scale, enabling generalization to new words. Given some surface word types of a concatenative language along with the abstract morpheme sequences that they express, we show how to recover consistent underlying forms for these morphemes, together with the (stochastic) phonology that maps each concatenation of underlying forms to a surface form. Our technique involves loopy belief propagation in a natural directed graphical model whose variables are unknown strings and whose conditional distributions are encoded as finite-state machines with trainable weights. We define training and evaluation paradigms for the task of surface word prediction, and report results on subsets of 7 languages.

1 Introduction

How is plurality expressed in English? Comparing cats ([kæts]), dogs ([dɔgz]), and quizzes ([kwɪzɪz]), the plural morpheme evidently has at least three pronunciations ([s], [z], [ɪz]) and at least two spellings (-s and -es). Also, considering singular quiz, perhaps the "short exam" morpheme has multiple spellings (quizz-, quiz-).

Fortunately, languages are systematic. The realization of a morpheme may vary by context but is largely predictable from context, in a way that generalizes across morphemes. In fact, generative linguists traditionally posit that each morpheme of a language has a single representation shared across all contexts (Jakobson, 1948; Kenstowicz and Kisseberth, 1979, chapter 6). However, this string is a latent variable that is never observed. Variation appears when the phonology of the language maps these underlying representations (URs)—in context—to surface representations (SRs) that may be easier to pronounce. The phonology is usually described by a grammar that may consist of either rewrite rules (Chomsky and Halle, 1968) or ranked constraints (Prince and Smolensky, 2004). We will review this framework in section 2.

The upshot is that the observed words in a language are supposed to be explainable in terms of a smaller underlying lexicon of morphemes, plus a phonology. Our goal in this paper is to recover the lexicon and phonology (enabling generalization to new words). This is difficult even when we are told which morphemes are expressed by each word, because the unknown underlying forms of the morphemes must cooperate properly with one another and with the unknown phonological rules to produce the observed results. Because of these interactions, we must reconstruct everything jointly. We regard this as a problem of inference in a directed graphical model, as sketched in Figure 1.

This is a natural problem for computational linguistics. Phonology students are trained to puzzle out solutions for small datasets by hand. Children apparently solve it at the scale of an entire language. Phonologists would like to have grammars for many languages, not just to study each language but also to understand universal principles and differences among related languages. Automatic procedures would recover such grammars. They would also allow comprehensive evaluation and comparison of different phonological theories (i.e., what inductive biases are useful?), and would suggest models of human language learning.

Solving this problem is also practically important for NLP. What we recover is a model that can generate and help analyze novel word forms,1 which abound in morphologically complex languages. Our approach is designed to model surface pronunciations (as needed for text-to-speech and ASR). It might also be applied in practice to model surface spellings (as needed for MT on text).

1 An analyzer would require a prior over possible analyses. Our present model defines just the corresponding likelihoods, i.e., the probability of the observed word given each analysis.
[Figure 1 diagram: four layers, 1) Morpheme URs, combined by Concatenation into 2) Word URs, mapped by Phonology (PFST) to 3) Word SRs, and by Phonetics to 4) Word Observations, illustrated with the words resignation, resigns, damns, and damnation.]

Figure 1: Our model as a Bayesian network, in which surface forms arise from applying phonology to a concatenation of underlying forms. Shaded nodes show the observed surface forms for four words: resignation, resigns, damns, and damnation. The graphical model encodes their morphological relationships using latent forms. Each morpheme UR at layer 1 is generated by the lexicon model Mφ (a probabilistic finite-state automaton). These are concatenated into various word URs at layer 2. Each SR at layer 3 is generated using the phonology model Sθ (a probabilistic finite-state transducer). Layer 4 derives observable phonetic forms from layer 3. This deletes unpronounced symbols such as syllable boundaries, and translates the phonemes into an observed phonetic, articulatory, or acoustic representation. However, our present paper simply merges layers 3 and 4: our layer 3 does not currently make use of any unpronounced symbols (e.g., syllable boundaries) and we observe it directly.

Good morphological analysis has been used to improve NLP tasks such as machine translation, parsing, and NER (Fraser et al., 2012; Hohensee and Bender, 2012; Yeniterzi, 2011).

Using loopy belief propagation, this paper attacks larger-scale learning problems than prior work on this task (section 8). We also develop a new evaluation paradigm that examines how well an inferred grammar predicts held-out SRs. Unlike previous algorithms, we do not pre-restrict the possible URs for each morpheme to a small or structured finite set, but use weighted finite-state machines to reason about the infinite space of all strings. Our graphical model captures the standard assumption that each morpheme has a single UR, unlike some probabilistic learners. However, we do not try to learn traditional ordered rules or constraint rankings like previous methods. We just search directly for a probabilistic finite-state transducer that captures likely UR-to-SR mappings.

2 Formal Framework

We urge the reader to begin by examining Figure 1, which summarizes our modeling approach through an example. The upcoming sections then give a formal treatment with details and discussion. Section 2 describes the random variables in Figure 1's Bayesian network, while section 3 describes its conditional probability distributions. Sections 4–5 give inference and learning methods.

A morpheme is a lexical entry that pairs form with content (Saussure, 1916). Its form is a morph—a string of phonemes. Its content is a bundle of syntactic and/or semantic properties.2 Note that in this paper, we are nonstandardly using "morph" to denote an underlying form. We assume that all underlying and surface representations can be encoded as strings, over respective alphabets Σu and Σs. This would be possible even for autosegmental representations (Kornai, 1995).

2 This paper does not deal with the content. However, note that a single morpheme might specify a conjunction or disjunction of multiple properties, leading to morphological phenomena such as fusion, suppletion, or syncretism.

A language's phonological system thus consists of the following components. We denote each important set by a calligraphic letter. We use the corresponding uppercase letter to denote a function to that set, the corresponding lowercase letter as a variable that ranges over the set's elements, and a distinguished typeface for specific elements.

• A is a set of abstract morphemes such as QUIZ and PLURAL. These are atoms, not strings.

• M = Σ∗u is the space of possible morphs: concrete UR strings such as /kwɪz/ or /z/.

• M : A → M is the lexicon that maps each morpheme a to an underlying morph m = M(a). We will find M(a) for each a.

• U = (Σu ∪ {#})∗ is the space of underlying representations for words, such as /kwɪz#z/.

• U : M∗ → U combines morphs. A word is specified by a sequence of morphemes ~a = a1, a2, . . ., with concrete forms mi = M(ai). That word's underlying form is then u = U(m1, m2, . . .) ∈ U.
• S = Σ∗s is the space of surface representations for words, such as [kwɪzɪz].

• S : U → S is the phonology. It maps an underlying form u to its surface form s. We will find this function S along with M.

We assume in this paper that U simply concatenates the sequence of morphs, separating them by the morph boundary symbol #: u = U(m1, m2, . . .) = m1#m2#···. However, see section 4.3 for generalizations.

The overall system serves to map an (abstract) morpheme sequence ~a ∈ A∗ to a surface word s ∈ S. Crucially, S acts on the underlying form u of the entire word, not one morph at a time. Hence its effect on a morph may depend on context, as we saw for English pluralization. For example, S(/kwɪz#s/) = [kwɪzɪz]—or if we were to apply our model to orthography, S(/quiz#s/) = [quizzes]. S produces a single well-formed surface form, which is not arbitrarily segmented as [quiz-zes] or [quizz-es] or [quizze-s].

3 Probability Model

Our goal is to reconstruct the lexicon M and morphophonology S for a given language. We therefore define prior probability distributions over them. (We assume Σu, Σs, A, U are given.)

For each morpheme a ∈ A, we model the morph M(a) ∈ M as an IID sample from a probability distribution Mφ(m).3 This model describes what sort of underlying forms appear in the language's lexicon.

The phonology is probabilistic in a similar way. For a word with underlying form u ∈ U, we presume that the surface form S(u) is a sample from a conditional distribution Sθ(s | u). This single sample appears in the lexical entry of the word type and is reused for all tokens of that word.

The parameter vectors φ and θ are specific to the language being generated. Thus, under our generative story, a language is created as follows:

1. Sample φ and θ from priors (see section 3.4).
2. For each a ∈ A, sample M(a) ∼ Mφ.
3. Whenever a new abstract word ~a = a1, a2 ··· must be pronounced for the first time, construct u as described in section 2, and sample S(u) ∼ Sθ(· | u). Reuse this S(u) in future.

Note that we have not specified a probability distribution over abstract words ~a, since in this paper, these sequences will always be observed. Such a distribution might be influenced by the semantic and syntactic content of the morphemes. We would need it to recover the abstract words if they were unobserved, e.g., when analyzing novel word forms or attempting unsupervised training.

3 See section 3.3 for a generalization to Mφ(m | a).
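To make this generative story concrete, here is a minimal Python sketch of steps 2 and 3 (not the paper's code). The toy alphabet and morpheme inventory are invented for illustration, and sample_surface is only a placeholder for the stochastic edit process Sθ of section 3.2; sample_morph implements the simple 0-gram lexicon model that section 3.3 introduces.

```python
import random

SIGMA_U = list("abdegiklmnoprstuzæɛɪʃə")   # toy underlying alphabet (illustrative only)

def sample_morph(phi):
    """Step 2: draw an underlying morph M(a) ~ M_phi.
    Geometric length with mean 1/phi - 1, phonemes drawn uniformly (section 3.3)."""
    m = []
    while random.random() > phi:            # halt after each symbol with probability phi
        m.append(random.choice(SIGMA_U))
    return "".join(m)

def sample_surface(u, theta):
    """Step 3: draw a surface form S(u) ~ S_theta(. | u).
    Placeholder: the real model is the contextual edit process of section 3.2."""
    return u.replace("#", "")                # trivial stand-in: just erase morph boundaries

def generate_language(morphemes, words, phi, theta):
    """Sample a lexicon of morphs, then one SR per abstract word (reused for all tokens)."""
    lexicon = {a: sample_morph(phi) for a in morphemes}            # step 2
    surface = {}
    for word in words:                                             # step 3
        u = "#".join(lexicon[a] for a in word)                     # U = concatenation with '#'
        surface[word] = sample_surface(u, theta)
    return lexicon, surface

lex, srs = generate_language(
    morphemes={"resign", "damn", "ation", "z"},
    words=[("resign", "ation"), ("resign", "z"), ("damn", "z"), ("damn", "ation")],
    phi=0.2, theta=None)
print(lex, srs)
```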
3.1 Discussion: Why probability?

A language's lexicon M and morphophonology S are deterministic, in that each morpheme has a single underlying form and each word has a single surface form. The point of the language-specific distributions Mφ and Sθ is to aid recovery of these forms by capturing regularities in M and S.

In particular, Sθ constitutes a theory of the regular phonology of the language. Its high-probability sound changes are the "regular" ones, while irregularities and exceptions can be explained as occasional lower-probability choices. We prefer a theory Sθ that has high likelihood, i.e., it assigns high probability (≈ 1) to each observed form s given its underlying u. In linguistic terms, we prefer predictive theories that require few exceptions.

In the linguistic community, the primary motivation for probabilistic models of phonology (Pierrehumbert, 2003) has been to explain "soft" phenomena: synchronic variation (Sankoff, 1978; Boersma and Hayes, 2001) or graded acceptability judgments on novel surface forms (Hayes and Wilson, 2008). These applications are orthogonal to our motivation, as we do not observe any variation or gradience in our present experiments. Fundamentally, we use probabilities to measure irregularity—which simply means unpredictability and is a matter of degree. Our objective function will quantitatively favor explanations that show greater regularity (Eisner, 2002b).

A probabilistic treatment also allows relatively simple learning methods (e.g., Boersma and Hayes (2001)) since inference never has to backtrack from a contradiction. Our method searches a continuous space of phonologies Sθ, all of which are consistent with every mapping S. That is, we always have Sθ(s | u) > 0 for all u, s, so our current guess of Sθ is always capable of explaining the observed words, albeit perhaps with low probability. Our EM learner tunes Sθ (and Mφ) so as to raise the probability of the observed surface forms, marginalizing over the reconstructed lexicon M of underlying forms. We do warn that EM can get stuck at a local optimum; random restarts and simulated annealing are ways to escape such low-likelihood solutions, much as backtracking escapes zero-likelihood solutions.

[Figure 2 diagram: the edit process midway through transducing the input tape /wɛtər/ to the output tape [wɛɾə...], with labels Input Left Context, Next Input Character, Input Right Context, Output Left Context, Output Right Context, and Stochastic Choice of Edit in Context C.]

Figure 2: Illustration of a contextual edit process as it pronounces the English word wetter by transducing the underlying /wɛt#ər/ (after erasing #) to the surface [wɛɾər]. At the point shown, it is applying the "intervocalic alveolar flapping" rule, replacing /t/ in this context by applying SUBST(ɾ).

3.2 Mapping URs to SRs: The phonology Sθ

We currently model Sθ(s | u) as the probability that a left-to-right stochastic contextual edit process (Figure 2) would edit u into s. This probability is a sum over all edit sequences that produce s from u—that is, all s-to-u alignments.

Stochastic contextual edit processes were described by Cotterell et al. (2014). Such a process writes surface string s ∈ Σ∗s while reading the underlying string u ∈ Σ∗u. If the process has so far consumed some prefix of the input and produced some prefix of the output, it will next make a stochastic choice among 2|Σs| + 1 possible edits. Edits of the form SUBST(c) or INSERT(c) (for c ∈ Σs) append c to the output string. Edits of the form SUBST(c) or DELETE will (also) consume the next input phoneme; if no input phonemes remain, the only possible edits are INSERT(c) or HALT.

The stochastic choice of edit, given context, is governed by a conditional log-linear distribution with feature weight vector θ. The feature functions may look at a bounded amount of left and right input context, as well as left output context. Our feature functions are described in section 6.

Our normalized probabilities Sθ(s | u) can be computed by a weighted finite-state transducer, a crucial computational property that we will exploit in section 4.2. As Cotterell et al. (2014) explain, the price is that our model is left/right-asymmetric. The inability to condition directly on the right output context arises from local normalization, just like "label bias" in maximum entropy Markov models (McCallum et al., 2000). With certain fancier approaches to modeling Sθ, which we leave to future work, this effect could be mitigated while preserving the transducer property.

3.3 Generating URs: The lexicon model Mφ

In our present experiments, we use a very simple lexicon model Mφ, so that the burden falls on the phonology Sθ to account for any language-specific regularities in surface forms. This corresponds to the "Richness of the Base" principle advocated by some phonologists (Prince and Smolensky, 2004), and seems to yield good generalization for us.

Specifically, we say all URs of the same length have the same probability, and the length is geometrically distributed with mean (1/φ) − 1. This is a 0-gram model with a single parameter φ ∈ (0, 1], namely Mφ(m) = ((1 − φ)/|Σu|)^|m| · φ.
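As a quick sanity check on this formula, a short sketch of the model's log-probability (the alphabet size and the example morph are made up for illustration):

```python
import math

def log_M_phi(m, phi, alphabet_size):
    """log M_phi(m) = |m| * log((1 - phi) / |Sigma_u|) + log(phi):
    all URs of a given length are equiprobable, and length is geometric."""
    return len(m) * math.log((1.0 - phi) / alphabet_size) + math.log(phi)

# With phi = 0.1 the mean UR length is 1/phi - 1 = 9 phonemes.
print(log_M_phi("kwɪz", phi=0.1, alphabet_size=30))
```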
It would be straightforward to experiment with other divisions of labor between the lexicon model and phonology model. A 1-gram model for Mφ would also model which underlying phonemes are common in the lexicon. A 2-gram model would model the "underlying phonotactics" of morphs, though phonological processes would still be needed at morph boundaries. Such models are the probabilistic analogue of morpheme structure constraints. We could further generalize from Mφ(m) to Mφ(m | a), to allow the shape of the morph m to be influenced by a's content. For example, Mφ(m | a) for English might describe how nouns tend to have underlying stress on the first syllable; similarly, Mφ(m | a) for Arabic might capture the fact that underlying stems tend to consist of 3 consonants; and across languages, Mφ(m | a) would prefer affixes to be short.

Note that we will always learn a language's Mφ jointly with its actual lexicon M. Loosely speaking, the parameter vector φ is found from easily reconstructed URs in M; then Mφ serves as a prior that can help us reconstruct more difficult URs.

3.4 Prior Over the Parameters

For φ, which is a scalar under our 0-gram model, our prior is uniform over (0, 1]. We place a spherical Gaussian prior on the vector θ, with mean ~0 and a variance σ2 tuned by coarse grid search on dev data (see captions of Figures 3–4).

The Gaussian favors phonologies that are simple in the sense that they have few strongly weighted features. A grammar that refers once to the natural class of voiced consonants (section 6), which captures a generalization, is preferred to an equally descriptive grammar that refers separately to several specific voiced consonants. If it is hard to tell whether a change applies to round or back vowels (because these properties are strongly correlated in the training data), then the prior resists grammars that make an arbitrary choice. It prefers to "spread the blame" by giving half the weight to each feature. The change is still probable for round back vowels, and moderately probable for other vowels that are either round or back.

4 Inference

We are given a training set of surface word forms s that realize known abstract words ~a. We aim to reconstruct the underlying morphs m and words u, and predict new surface word forms s.

4.1 A Bayesian network

For fixed θ and φ, this task can be regarded as marginal inference in a Bayesian network (Pearl, 1988). Figure 1 displays part of a network that encodes the modeling assumptions of section 3. The nodes at layers 1, 2, and 3 of this network represent string-valued random variables in M, U, and S respectively. Each variable's distribution is conditioned on the values of its parents, if any.

Layer 1 represents the unknown M(a) for various a. Notice that each M(a) is softly constrained by the prior Mφ, and also by its need to help produce various observed surface words via Sθ. Each underlying word u at level 2 is a concatenation of its underlying morphs M(ai) at level 1. Thus, the topology at levels 1–2 is given by supervision. We would have to learn this topology if the word's morphemes ai were not known.

Our approach captures the unbounded generative capacity of language. In contrast to Dreyer and Eisner (2009) (see section 8), we have defined a directed graphical model. Hence new unobserved descendants can be added without changing the posterior distribution over the existing variables. So our finite network can be viewed as a subgraph of an infinite graph. That is, we make no closed-vocabulary assumption, but implicitly include (and predict the surface forms of) any unobserved words that could result from combining morphemes, even morphemes not in our dataset.

While the present paper focuses on word types, we could extend the model to consider tokens as well. In Figure 1, each phonological surface type at layer 3 could be observed to generate zero or more noisy phonetic tokens at layer 4, in contexts that call for the morphemes expressed by that type.
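The topology just described is easy to spell out for the four words of Figure 1. The following sketch (illustrative only, not the paper's implementation) simply enumerates the variables and factors; note that the four shared morph variables plus the four word-UR variables are exactly the 8 nodes of the cycle mentioned in section 4.2 below.

```python
# Words of Figure 1, each annotated with its abstract morpheme sequence.
WORDS = {
    "resignation": ("resign", "ation"),
    "resigns":     ("resign", "z"),
    "damns":       ("damn",   "z"),
    "damnation":   ("damn",   "ation"),
}

# Layer 1: one latent UR variable per morpheme type, shared across words.
morph_vars = {a for morphs in WORDS.values() for a in morphs}

factors = []
for a in morph_vars:
    factors.append(("M_phi", a))                               # prior factor on each morph UR
for word, morphs in WORDS.items():
    factors.append(("U_concat", morphs, f"u({word})"))         # layer 2: word UR = m1#m2#...
    factors.append(("S_theta", f"u({word})", f"s({word})"))    # layer 3: observed SR

print(sorted(morph_vars))    # ['ation', 'damn', 'resign', 'z']: 4 shared morph variables
print(len(factors))          # 4 priors + 4 concatenation factors + 4 phonology factors = 12
```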
4.2 Loopy belief propagation

The top two layers of Figure 1 include a long undirected cycle (involving all 8 nodes and all 8 edges shown). On such "loopy" graphical models, exact inference is in general uncomputable when the random variables are string-valued. However, Dreyer and Eisner (2009) showed how to substitute a popular approximate joint inference method, loopy belief propagation (Murphy et al., 1999).

Qualitatively, what does this do on Figure 1?4 Let u denote the leftmost layer-2 node. Midway through loopy BP, u is not yet sure of its value, but is receiving suggestions from its neighbors. The stem UR immediately above u would like u to start with something like /rizajgn#/.5 Meanwhile, the word SR immediately below u encourages u to be any UR that would have a high probability (under Sθ) of surfacing as [rɛzɪgneɪʃən]. So u tries to meet both requirements, guessing that its value might be something like /rizajgn#eɪʃən/ (the product of this string's scores under the two messages to u is relatively high). Now, for U to have produced something like /rizajgn#eɪʃən/ by stem-suffix concatenation, the suffix's UR must have been something like /eɪʃən/. u sends a message saying so to the third node in layer 1. This induces that node (the suffix UR) to inform the rightmost layer-2 node that it probably ends in /#eɪʃən/ as well—and so forth, iterating until convergence.

4 Loopy BP actually passes messages on a factor graph derived from Figure 1. However, in this informal paragraph we will speak as if it were passing messages on Figure 1 directly.

5 Because that stem UR thinks its own value is something like /rizajgn/—based on the messages that it is currently receiving from related forms such as /rizajgn#z/, and from Mφ.

Formally, the loopy BP algorithm iteratively updates messages and beliefs. Each is a function that scores possible strings (or string tuples). Dreyer and Eisner (2009)'s key insight is that these messages and beliefs can be represented using weighted finite-state machines (WFSMs), and furthermore, loopy BP can compute all of its updates using standard polytime finite-state constructions.

4.3 Discussion: The finite-state requirement

The above results hold when the "factors" that define the graphical model are themselves expressed as WFSMs. This is true in our model. The factors of section 4.1 correspond to the conditional distributions Mφ, U, and Sθ that respectively select values for nodes at layers 1, 2, and 3 given the values at their parents. As section 3 models these, for any φ and θ, we can represent Mφ as a 1-tape WFSM (acceptor), U as a multi-tape WFSM, and Sθ as a 2-tape WFSM (transducer).6 Any other WFSMs could be substituted.

6 Mφ has a single state, with halt probability φ and the remaining probability 1 − φ divided among self-loop arcs labeled with the phonemes in Σu. U must concatenate k morphs by copying all of tape 1, then tape 2, etc., to tape k + 1: this is easily done using k + 1 states, and arcs of probability 1. Sθ is constructed as in Cotterell et al. (2014).

We are on rather firm ground in restricting to finite-state (regular) models of Sθ. The apparent regularity of natural-language phonology was first observed by Johnson (1972), so computational phonology has generally preferred grammar formalisms that compile into (unweighted) finite-state machines, whether the formalism is based on rewrite rules (Kaplan and Kay, 1994) or constraints (Eisner, 2002a; Riggle, 2004). Similarly, U could be any finite-state relation,7 not just concatenation as in section 2. Thus our framework could handle templatic morphology (Hulden, 2009), infixation, or circumfixation.

7 In general, a U factor enforces u = U(m1, . . . , mk), so it is a degree-(k + 1) factor, represented by a (k + 1)-tape WFSM connecting these variables (Dreyer and Eisner, 2009). If one's finite-state library is limited to 2-tape WFSMs, then one can simulate any such U factor using (1) an auxiliary string variable π encoding the path through U, (2) a unary factor weighting π according to U, (3) a set of k + 1 binary factors relating π to each of u, m1, . . . , mk. It is even easier to handle the particular U used in this paper, which enforces u = m1# . . . #mk. Given this factor U's incoming messages µ·→U, each being a 1-tape WFSM, compute its loopy BP outgoing messages µU→u = µm1→U # ··· # µmk→U and (e.g.) µU→m2 = range(µu→U ◦ ((µm1→U# × ε) Σ∗u (#µm3→U# ··· #µmk→U × ε))).

Although only regular factors are allowed in our graphical model, a loopy graphical model with multiple such factors can actually capture non-regular phenomena, for example by using auxiliary variables (Dreyer and Eisner, 2009, §3.4). Approximate inference then proceeds by loopy BP on this model. In particular, reduplication is not regular if unbounded, but we can adopt morphological doubling theory (Inkelas and Zoll, 2005) and model it by having U concatenate two copies of the same morph. During inference of URs, this morph exchanges messages with two substrings of the underlying word. Similarly, we can manipulate the graphical model structure to encode cyclic phonology—i.e., concatenating a word SR with a derivational affix
UR and passing the result through Sθ once again. An alternative is to encode this hierarchical structure into the word UR u, by encoding level-1 and level-2 boundaries with different symbols. A single application of Sθ can treat these boundaries differently: for example, by implementing cyclic phonology as a composition of two transductions.

4.4 Loopy BP implementation details

Each loopy BP message to or from a random variable is a 1-tape WFSM (acceptor) that scores all possible values of that variable (given by the set M, U, or S: see section 2). We initialized each message to the uniform distribution.8 We then updated the messages serially, alternating between upward and downward sweeps through the Bayesian network. After 10 iterations we stopped and computed the final belief at each variable.

A complication is that a popular affix such as PLURAL (/z/ in layer 1) receives messages from hundreds of words that realize that affix. Loopy BP obtains that affix's belief and outgoing messages by intersecting these WFSMs—which can lead to astronomically large results and runtimes. We address this with a simple pruning approximation where at each variable m, we dynamically restrict to a finite support set of plausible values for m. We take this to be the union of the 20-best lists of all messages sent to m.9 We then prune those messages so that they give weight 0 to all strings outside m's support set. As a result, m's outgoing messages and belief are also confined to its support set. Note that the support set is not hand-specified, but determined automatically by taking the top hypotheses under the probability model.

Improved approaches with no pruning are possible. After submitting this paper, we developed a penalized expectation propagation method (Cotterell and Eisner, 2015). It dynamically approximates messages using log-linear functions (based on variable-order n-gram features) whose support is the entire space Σ∗. We also developed a dual decomposition method (Peng et al., 2015), which if it converges, exactly recovers the single most probable explanation of the data10 given φ and θ.

8 This is standard—although the uniform distribution over the space of strings is actually an improper distribution. It is expressed by a single-state WFSM whose arcs have weight 1. It can be shown that the beliefs are proper distributions after one iteration, though the upward messages may not be.

9 In general, we should update this support set dynamically as inference and learning improve the messages. But in our present experiments, that appears unnecessary, since the initial support set always appears to contain the "correct" UR.

10 That is, a lexicon of morphs together with contextual edit sequences that will produce the observed word SRs.
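The pruning approximation of section 4.4 is straightforward once each incoming message has been expanded from a WFSM into explicit string scores; the sketch below (our illustration, not the paper's code) works over plain dictionaries, with k = 20 in the paper's experiments.

```python
import heapq

def prune_to_support(incoming_messages, k=20):
    """incoming_messages: list of dicts mapping candidate strings to scores.
    Returns the support set (union of each message's k-best strings) and
    copies of the messages that give weight 0 to everything outside it."""
    support = set()
    for msg in incoming_messages:
        support.update(heapq.nlargest(k, msg, key=msg.get))
    pruned = [{s: msg.get(s, 0.0) for s in support} for msg in incoming_messages]
    return support, pruned

# Toy use: two messages arriving at the PLURAL-suffix variable from two words.
m1 = {"z": 0.7, "s": 0.2, "ɪz": 0.1}
m2 = {"z": 0.5, "ɪz": 0.4, "əz": 0.1}
support, pruned = prune_to_support([m1, m2], k=2)
print(sorted(support))    # union of the two 2-best lists, e.g. ['s', 'z', 'ɪz']
```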
5 Parameter Learning

We employ MAP-EM as the learning algorithm. The E-step is approximated by the loopy BP algorithm of section 4. The M-step takes the resulting beliefs, together with the prior of section 3.4, and uses them to reestimate the parameters θ and φ.

If we knew the true UR uk for each observed word type sk, we would just do supervised training of θ, using L-BFGS (Liu and Nocedal, 1989) to locally maximize θ's posterior log-probability (Σk log Sθ(sk | uk)) + log p_prior(θ). Cotterell et al. (2014) give the natural dynamic programming algorithm to compute each summand and its gradient w.r.t. θ. The gradient is the difference between observed and expected feature vectors of the contextual edits (section 3.2), averaged over edit contexts in proportion to how many times those contexts were likely encountered. The latent alignment makes the objective non-concave.

In our EM setting, uk is not known. So our M-step replaces log Sθ(sk | uk) with its expectation, Σuk bk(uk) log Sθ(sk | uk), where bk is the normalized belief about uk computed by the previous E-step. Since bk and Sθ are both represented by WFSMs (with 1 and 2 tapes respectively), it is possible to compute this quantity and its gradient exactly, using finite-state composition in a second-order expectation semiring (Li and Eisner, 2009). For speed, however, we currently prune bk back to the 5-best values of uk. This lets us use a simpler and faster approach: a weighted average over 5 runs of the Cotterell et al. (2014) algorithm.

Our asymptotic runtime benefits from the fact that our graphical model is directed (so our objective does not have to contrast with all other values of uk) and the fact that Sθ is locally normalized (so our objective does not have to contrast with all other values of sk for each uk). In practice we are far faster than Dreyer and Eisner (2009).

We initialized the parameter vector θ to ~0, except for setting the weight of the COPY feature (section 6) such that the probability of a COPY edit is 0.99 in every context other than end-of-string. This encourages URs to resemble their SRs.

To reestimate φ, the M-step does not need to use L-BFGS, for section 3.3's simple model of Mφ and uniform prior over φ ∈ (0, 1]. It simply sets φ = 1/(ℓ + 1) where ℓ is the average expected length of a UR according to the previous E-step. The expected length of each uk is extracted from the WFSM for the belief bk, using dynamic programming (Li and Eisner, 2009). We initialized φ to 0.1; experiments on development data suggested that the choice of initializer had little effect.

6 Features of the Phonology Model

Our stochastic edit process Sθ(s | u) assigns a probability to each possible u-to-s edit sequence. This edit sequence corresponds to a character-wise alignment of u to s. Our features for modeling the contextual probability of each edit are loosely inspired by constraints from Harmonic Grammar and Optimality Theory (Smolensky and Legendre, 2006). Such constraints similarly evaluate a u-to-s alignment (or "correspondence"). They are traditionally divided into markedness constraints that encourage a well-formed s, and faithfulness constraints that encourage phonemes of s to resemble their aligned phonemes in u.

Our EDIT faithfulness features evaluate an edit's (input, output) phoneme pair. Our BIGRAM markedness features evaluate an edit that emits a new phoneme of s. They evaluate the surface bigram it forms with the previous output phoneme.11 Table 1 shows example features. Notice that these features back off to various natural classes of phonemes (Clements and Hume, 1995).

  BIGRAM(strident, strident)   adjacent surface stridents
  BIGRAM(ε, uvular)            surface uvular
  EDIT([s], [z])               /s/ became [z]
  EDIT(coronal, labial)        coronal became labial
  EDIT(ε, phoneme)             phoneme was inserted
  EDIT(consonant, ε)           consonant was deleted

Table 1: Examples of markedness and faithfulness features that fire in our model. They have a natural interpretation as Optimality-Theoretic constraints. ε denotes the empty string. The natural classes were adapted from (Riggle, 2012).
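To illustrate how features like those in Table 1 could be computed for a single edit, here is a hedged sketch. The tiny natural-class inventory and helper names are invented for the example; the actual model adapts the classes of Riggle (2012) and, as described below, ends up with roughly 50,000 binary features in total.

```python
# Hypothetical natural-class inventory; the real model adapts the classes of Riggle (2012).
NATURAL_CLASSES = {
    "s": {"phoneme", "consonant", "coronal", "strident", "voiceless"},
    "z": {"phoneme", "consonant", "coronal", "strident", "voiced"},
    "ε": set(),    # the empty string belongs to no natural class
}

def classes_of(x):
    return NATURAL_CLASSES.get(x, {"phoneme"}) | {x}    # back off to the symbol itself

def edit_features(u_phone, s_phone, prev_output):
    """Binary features firing on one edit: EDIT (faithfulness) features on the
    (input, output) phoneme pair, BIGRAM (markedness) features on the new surface
    bigram, and COPY when the surface phoneme equals the underlying one."""
    feats = set()
    if u_phone == s_phone and u_phone != "ε":
        feats.add("COPY")
    for cu in classes_of(u_phone):                       # EDIT features, with class backoff
        for cs in classes_of(s_phone):
            feats.add(f"EDIT({cu},{cs})")
    if s_phone != "ε":                                   # BIGRAM features need an emitted phoneme
        for cp in classes_of(prev_output):
            for cs in classes_of(s_phone):
                feats.add(f"BIGRAM({cp},{cs})")
    return feats

# Underlying /s/ surfacing as [z] right after a [z]: among the features that fire are
# EDIT(s,z), EDIT(voiceless,voiced), and BIGRAM(strident,strident), as in Table 1.
print(sorted(edit_features("s", "z", prev_output="z")))
```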
These features of an edit need to examine at most (0, 1, 1) phonemes of (left input, right input, left output) context respectively (see Figure 2). So the PFST that implements Sθ should be able to use what Cotterell et al. (2014) call a (0, 1, 1) topology. However, we actually used a (0, 2, 1) topology, to allow features that also look at the "upcoming" input phoneme that immediately follows the edit's input (/ə/ in Figure 2).

11 At beginning-of-string, the previous "phoneme" is the special symbol BOS. For the HALT edit at end-of-string, which copies the symbol EOS, the new "phoneme" is EOS.
Specifically, for each natural class, we also included contextual versions of each EDIT or BIGRAM feature, which fired only if the "upcoming" input phoneme fell in that natural class. Contextual BIGRAM features are our approximation to surface trigram features that look at the edit's output phoneme together with the previous and next output phonemes. (A PFST cannot condition its edit probabilities on the next output phoneme because that has not been generated yet—see section 3.2—so we are using the upcoming input phoneme as a proxy.) Contextual EDIT features were cheap to add once we were using a (0, 2, 1) topology, and in fact they turned out to be helpful for capturing processes such as Catalan's deletion of the underlyingly final consonant.

Finally, we included a COPY feature that fires on any edit where surface and underlying phonemes are exactly equal. (This feature resembles Optimality Theory's IDENT-IO constraint, and ends up getting the strongest weight.) In total, our model has roughly 50,000 binary features.

Many improvements to this basic feature set would be possible in future. We cannot currently express implications such as "adjacent obstruents must also agree in voicing," "a vowel that surfaces must preserve its height," or "successive vowels must also agree in height." We also have not yet designed features that are sensitive to surface prosodic boundaries or underlying morph boundaries. (Prosodic structure and autosegmental tiers are absent from our current representations, and we currently simplify the stochastic edit process's feature set by having Sθ erase the morph boundaries # before applying that process.)

Our standard prior over θ (section 3.4) resists overfitting in a generic way, by favoring phonologies that are "simple to describe." Linguistic improvements are possible here as well. The prior should arguably discourage positive weights more than negative ones, since most of our features detect constraint violations that ordinarily reduce probability. It should also be adjusted to mitigate the current structural bias against deletion edits, which arises because the single deletion possible in a context must compete on equal footing with |Σs| insertions and |Σs| − 1 substitutions. More ambitiously, a linguistically plausible prior should prefer phonologies that are conservative (s ≈ u) and have low conditional entropies H(s | u), H(u | s) to facilitate communication.
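Putting sections 3.2 and 6 together: in any given context, the next edit is chosen from the 2|Σs| + 1 available actions by a softmax over θ · f(edit, context). The sketch below is only illustrative; in the actual system this distribution is compiled into the arcs of a PFST rather than computed on the fly, and the feature extractor and context representation here are placeholders.

```python
import math

def edit_distribution(theta, context, sigma_s, edit_features):
    """Locally normalized distribution over the 2|Sigma_s| + 1 edits available
    mid-string: SUBST(c) and INSERT(c) for each surface phoneme c, plus DELETE.
    (Once the input is exhausted, the choices would be INSERT(c) or HALT instead.)"""
    actions = ([("SUBST", c) for c in sigma_s]
               + [("INSERT", c) for c in sigma_s]
               + [("DELETE", None)])
    scores = {act: sum(theta.get(f, 0.0) for f in edit_features(act, context))
              for act in actions}                          # theta . f(edit, context)
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {act: math.exp(s - log_z) for act, s in scores.items()}

# Toy call: one weighted feature, a two-phoneme surface alphabet, and a feature
# extractor that fires COPY only when the edit copies the upcoming input phoneme.
toy_theta = {"COPY": 2.0}
dist = edit_distribution(
    toy_theta, context=("t", "ə"), sigma_s=["t", "ɾ"],
    edit_features=lambda act, ctx: {"COPY"} if act == ("SUBST", ctx[0]) else set())
print(max(dist, key=dist.get))    # ('SUBST', 't') is most probable: only it fires COPY
```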
7 Experimental Design

We objectively evaluate our learner on its ability to predict held-out surface forms. This blind testing differs from traditional practice by linguists, who evaluate a manual or automatic analysis (= URs + phonology) on whether it describes the full dataset in a "natural" way that captures "appropriate" generalizations. We avoid such theory-internal evaluation by simply quantifying whether the learner's analysis does generalize (Eisner, 2015).

To avoid tailoring to our training/test data, we developed our method, code, features, and hyperparameters using only two development languages, English and German. Thus, our learner was not engineered to do well on the other 5 languages below: the graphs below show its first attempt to learn those languages. We do also evaluate our learners on English and German, using separate training/test data.

We provide all our data (including citations, development data, training-test splits, and natural classes) at http://hubal.cs.jhu.edu/tacl2015/, along with brief sketches of the phonological phenomena in the datasets, the "gold" stem URs we assumed for evaluation, and our learner's predictions and error patterns.

7.1 Evaluation methodology

Given a probability distribution p over surface word types of a language, we sample a training set of N types without replacement. This simulates reading text until we have seen N distinct types. For each of these frequent words, we observe the SR s and the morpheme sequence ~a.

After training our model, we evaluate its beliefs b about the SRs s on a disjoint set of test words whose ~a are observed. To improve interpretability of the results, we limit the test words to those whose morphemes have all appeared at least once in the training set. (Any method would presumably get other words badly wrong, just as it would tend to get the training words right; we exclude both.)

To evaluate our belief b about the SR of a test word (~a, s∗), we use three measures for which "smaller is better." First, 0-1 loss asks whether s∗ ≠ argmax_s b(s). This could be compared with non-probabilistic predictors. Second, the surprisal −log2 b(s∗) is low if the model finds it plausible that s∗ realizes ~a. If so, this holds out promise for future work on analyzing or learning from unannotated tokens of s∗. Third, we evaluate the whole distribution b in terms of Σs b(s) L(s∗, s) where L is unweighted Levenshtein distance.

We take the average of each measure over test words, weighting those words according to p. This yields our three reported metrics: 1-best error rate, cross-entropy, and expected edit distance.
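For concreteness, the three per-word measures can be computed from an explicit belief distribution as follows. This is only a sketch: the belief b here is a plain dictionary over candidate strings, whereas the system's beliefs are WFSMs, and the small Levenshtein routine stands in for any standard implementation.

```python
import math

def edit_distance(a, b):
    """Unweighted Levenshtein distance (textbook dynamic program, one rolling row)."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (ca != cb))
    return row[len(b)]

def evaluate_word(belief, s_star):
    """belief: dict mapping candidate surface forms to probabilities (summing to 1).
    Returns (0-1 loss, surprisal in bits, expected edit distance) for one test word."""
    zero_one = 0.0 if max(belief, key=belief.get) == s_star else 1.0
    surprisal = -math.log2(belief.get(s_star, 1e-12))      # tiny floor avoids log(0)
    expected_L = sum(p * edit_distance(s_star, s) for s, p in belief.items())
    return zero_one, surprisal, expected_L

print(evaluate_word({"kwɪzɪz": 0.8, "kwɪzz": 0.15, "kwɪz": 0.05}, "kwɪzɪz"))
```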
[Figure 3: bar charts of cross-entropy (bits), expected edit distance, and 1-best error rate for Maori, Catalan, Tangale, and Indonesian, comparing Noisy Concatenation, Our Method, and Oracle.]

Figure 3: Results on the small phonological exercise datasets (≈ 100 word types). Smaller numbers are better. Preliminary tests suggested that the variance of the prior (section 3.4) did not strongly affect the results, so we took σ2 = 5 for all experiments.

Each metric is the expected value of some measure on a random test token. These metrics are actually random variables, since they depend on the randomly sampled training set and the resulting test distribution. We report the expectations of these random variables by running many training-test splits (see section 7.2).

7.2 Datasets

To test discovery of interesting patterns from limited data, we ran our learner on 5 "exercises" drawn from phonology textbooks (102 English nouns, 68 Maori verbs, 72 Catalan adjectives, 55 Tangale nouns, 44 Indonesian nouns), exhibiting a range of phenomena. In each case we took p to be the uniform distribution over the provided word types. We took N to be one less than the number of provided types. So to report our expected metrics, we ran all N + 1 experiments where we trained jointly on N forms and tested on the 1 remaining form. This is close to linguists' practice of fitting an analysis on the entire dataset, yet it is a fair test. There is no sampling error in these reported results, hence no need for error bars.

To test on larger, naturally occurring datasets, we ran our learner on subsets of the CELEX database (Baayen et al., 1995), which provides surface phonological forms and token counts for German, Dutch, and English words. For each language, we constructed a coherent subcorpus of 1000 nouns and verbs, focusing on inflections with common phonological phenomena. These turned out to involve mainly voicing: final obstruent devoicing (German 2nd-person present indicative verbs, German nominative singular nouns, Dutch infinitive verbs, Dutch singular nouns) and voicing assimilation (English past tense verbs, English plural nouns). We were restricted to relatively simple phenomena because our current representations are simple segmental strings that lack prosodic and autosegmental structure. In future we plan to consider stress, vowel harmony, and templatic morphology.

We constructed the distribution p in proportion to CELEX's token counts. In each language, we trained on N = 200, 400, 600, or 800 forms sampled from p. To estimate the expectation of each metric over all training sets of size N, we report the sample mean and bootstrap standard error over 10 random training sets of size N. Except in Indonesian, every word happens to consist of ≤ 2 morphemes (a stem plus a possibly empty suffix). In all cases, we take the phoneme inventories Σu and Σs to be given as the set of all surface phonemes that appear in the full dataset.

7.3 Comparison systems

There do not appear to be previous systems that perform our generalization task. Therefore, we compared our own system against variants.

We performed an ablation study to determine whether the learned phonology was helpful. We substituted a simplified phonology model where Sθ(s | u) just decays exponentially with the edit distance between s and u; the decay rate was learned by EM as usual. That is, this model uses only the COPY feature of section 6. This baseline system treats phonology as "noisy concatenation" of learned URs, not trying to model its regularity.

We considered an additional ablation study to determine whether the learned URs were helpful. However, we did not come up with a plausible
[Figure 4: line plots of 1-best error rate, cross-entropy (bits), and expected edit distance for German, Dutch, and English at training set sizes N = 200, 400, 600, 800, comparing Noisy Concatenation, Our Method, and Oracle.]

Figure 4: Results on the CELEX datasets (1000 word types) at 4 different training set sizes N. The larger training sets are supersets of the smaller, obtained by continuing to sample without replacement from p. For each training set, the unconnected points evaluate all words ∉ training whose morphemes ∈ training. Meanwhile, the connected points permit comparison across the 4 values of N, by evaluating only on a common test set found by intersecting the 4 unconnected test sets. Each point estimates the metric's expectation over all ways of sampling the 4 training sets; specifically, we plot the sample mean from 10 such runs, with error bars showing a bootstrap estimate of the standard error of the mean. Non-overlapping error bars at a given N always happen to imply that the difference in the two methods' sample means is too extreme to be likely to have arisen by chance (paired permutation test, p < 0.05). Each time we evaluated some training-test split on some metric, we first tuned σ2 (section 3.4) by a coarse grid search where we trained on the first 90% of the training set and evaluated on the remaining 10%.

heuristic for identifying URs in some simpler way. Thus, instead we asked whether the learned URs were as good as hand-constructed URs. Our "oracle" system was allowed to observe gold-standard URs for stems instead of inferring them. This system is still fallible: it must still infer the affix URs by belief propagation, and it must still use MAP-EM to estimate a phonology within our current model family Sθ. Even with supervision, this family will still struggle to model many types of phonology, e.g., ablaut patterns (in Germanic strong verbs) and many stress-related phenomena.

7.4 Results

We graph our results in Figures 3 and 4. When given enough evidence, our method works quite well across the 7 datasets. For 94–98% of held-out words on the CELEX languages (when N = 800), and 77–100% on the phonological exercises, our method's top pick is the correct surface form. Further, the other metrics show that it places most of its probability mass on that form,12 and the rest on highly similar forms. Notably, our method's predictions are nearly as good as if gold stem URs had been supplied (the "oracle" condition). Indeed, it does tend to recover those gold URs (Table 2).

  Phon. Exercises           CELEX
  Maori        95.5         German    99.9
  Catalan      99.5         Dutch     86.3
  Tangale      79.8         English   82.2
  Indonesian   100

Table 2: Percent of training words, weighted by the distribution p, whose 1-best recovered UR (including the boundary #) exactly matches the manual "gold" analysis. Results are averages over all runs (with N = 800 for the CELEX datasets).

Yet there are some residual errors in predicting the SRs. Our phonological learner cannot perfectly learn the UR-to-SR mapping even from many well-supervised pairs (the oracle condition). In the CELEX and Tangale datasets, this is partly due to irregularity in the language itself. However, error analysis suggests we also miss some generalizations due to the imperfections of our current Sθ model (as discussed in sections 3.2 and 6).

12 Cross-entropy < 1 bit means that the correct form has probability > 1/2 on average (using geometric mean).
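Footnote 12 relies on a small identity: since the reported cross-entropy is the arithmetic mean of the per-word surprisals, 2 to the minus cross-entropy equals the geometric mean of the probabilities b(s∗). A short check with made-up numbers:

```python
import math

probs = [0.9, 0.6, 0.4]                                    # made-up per-word beliefs b(s*)
cross_entropy = sum(-math.log2(p) for p in probs) / len(probs)
geo_mean = math.prod(probs) ** (1 / len(probs))
assert abs(2 ** -cross_entropy - geo_mean) < 1e-12
print(cross_entropy < 1, geo_mean > 0.5)                   # the two conditions coincide
```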
When given less evidence, our method's performance is more sensitive to the training sample and is worse on average. This is expected: e.g., a stem's final consonant cannot be reconstructed if it was devoiced (German) or deleted (Maori) in all the training SRs. However, a contributing factor may be the increased error rate of the phonological learner, visible even with oracle data. Thus, we suspect that a Sθ model with better generalization would improve our results at all training sizes. Note that weakening Sθ—allowing only "noisy concatenation"—clearly harms the method, proving the need for true phonological modeling.

8 Related Work

We must impute the inputs to the phonological noisy channel Sθ (URs) because we observe only the outputs (SRs). Other NLP problems of this form include unsupervised text normalization (Yang and Eisenstein, 2013), unsupervised training of HMMs (Christodoulopoulos et al., 2010), and particularly unsupervised lexicon acquisition from phonological data (Elsner et al., 2012). However, unlike these studies, we currently use some indirect supervision—we know each SR's morpheme sequence, though not the actual morphs.

Jarosz (2013, §2) and Tesar (2014, chapters 5–6) review work on learning the phonology Sθ. Phonologists pioneered stochastic-gradient and passive-aggressive training methods—the Gradual Learning Algorithm (Boersma, 1998) and Error-Driven Constraint Demotion (Tesar and Smolensky, 1998)—for structured prediction of the surface word s from the underlying word u. If s is not fully observed during training (layer 4 of Figure 1 is observed, not layer 3), then it can be imputed, a step known as Robust Interpretive Parsing.

Recent papers consider our setting where u = m1#m2#··· is not observed either. The contrast analysis method (Tesar, 2004; Merchant, 2008) in effect uses constraint propagation (Dechter, 2003). That is, it serially eliminates variable values (describing aspects of the URs or the constraint ranking) that are provably incompatible with the data. Constraint propagation is an incomplete method that is not guaranteed to make all logical deductions. We use its probabilistic generalization, loopy belief propagation (Dechter et al., 2010)—which is still approximate but can deal with noise and stochastic irregularity. A further improvement is that we work with string-valued variables, representing uncertainty using WFSMs; this lets us reason about URs of unknown length and unknown alignment to the SRs. (Tesar and Merchant instead used binary variables, one for each segmental feature in each UR—requiring the simplifying assumption that the URs are known except for their segmental features. They assume that SRs are annotated with morph boundaries and that the phonology only changes segmental features, never inserting or deleting segments.) On the other hand, Tesar and Merchant reason globally about the constraint ranking, whereas in this paper, we only locally improve the phonology—we use EM, rather than the full Bayesian approach that treats the parameters ~θ as variables within BP.

Jarosz (2006) is closest to our work in that she uses EM, just as we do, to maximize the probability of observed surface forms whose constituent morphemes (but not morphs) are known.13 Her model is a probabilistic analogue of Apoussidou (2006), who uses a latent-variable structured perceptron. A non-standard aspect of this model (defended by Pater et al. (2012)) is that a morpheme a can stochastically choose different morphs M(a) when it appears in different words. To obtain a single shared morph, one could penalize this distribution's entropy, driving it toward 0 as learning proceeds. Such an approach—which builds on a suggestion by Eisenstat (2009, §5.4)—would loosely resemble dual decomposition (Peng et al., 2015). Unlike our BP approach, it would maximize rather than marginalize over possible underlying morphs.

13 She still assumes that word SRs are annotated with morpheme boundaries, and that a small set of possible morphs is given. These assumptions are relaxed by Eisenstat (2009).

Our work has focused on scaling up inference. For the phonology S, the above papers learn the weights or rankings of just a few plausible constraints (or Jarosz (2006) learns a discrete distribution over all 5! = 120 rankings of 5 constraints), whereas we use Sθ with roughly 50,000 constraints (features) to enable learning of unknown languages. Our S also allows exceptions. The above papers also consider only very restricted sets of morphs, either identifying a small set of plausible morphs or prohibiting segmental insertion/deletion. We use finite-state methods so that it is possible to consider the space Σ∗u of all strings.
On the other hand, we are divided from previous work by our inability to use an OT grammar (Prince and Smolensky, 2004), a stochastic OT grammar (Boersma, 1997), or even a maximum entropy grammar (Goldwater and Johnson, 2003; Dreyer et al., 2008; Eisenstat, 2009). The reason is that our BP method inverts the phonological mapping Sθ to find possible word URs. Given a word SR s, we construct a WFSM (message) that scores every possible UR u ∈ Σ∗u—the score of u is Sθ(s | u). For this to be possible without approximation, Sθ itself must be represented as a WFSM (section 3.2). Unfortunately, the WFSM for a maximum entropy grammar does not compute Sθ but only an unnormalized version, with a different normalizing constant Zu needed for each u. We plan to confront this issue in future work.

In the NLP community, Elsner et al. (2013) resembles our work in many respects. Like us, they recover a latent underlying lexicon (using the same simple prior Mφ) and use EM to learn a phonology (rather similar to our Sθ, though less powerful).14 Unlike us, they do not assume annotation of the (abstract) morpheme sequence, but jointly learn a nonparametric bigram model to discover the morphemes. Their evaluation is quite different, as their aim is actually to recover underlying words from phonemically transcribed child-directed English utterances. However, nothing in their model distinguishes words from morphemes—indeed, sometimes they do find morphemes instead—so their model could be used in our task. For inference, they invert the finite-state Sθ like us to reconstruct a lattice of possible UR strings. However, they do this not within BP but within a block Gibbs sampler that stochastically reanalyzes utterances one at a time. Whereas our BP tries to find a consensus UR for each given morpheme type, their sampler posits morph tokens while trying to reuse frequent morph types, which are interpreted as the morphemes. With observed morphemes (our setting), this sampler would fail to mix.

14 Elsner et al. (2012) used an Sθ quite similar to ours though lacking bigram well-formedness features. Elsner et al. (2013) simplified this for efficiency, disallowing segmental deletion and no longer modeling the context of changes.

Dreyer and Eisner (2009, 2011) like us used loopy BP and MAP-EM to predict morphological SRs. Their 2011 paper was also able to exploit raw text without morphological supervision. However, they directly modeled pairwise finite-state relationships among the surface word forms without using URs. Their model is a joint distribution over n variables: the word SRs of a single inflectional paradigm. Since it requires a fixed n, it does not directly extend to derivational morphology: deriving new words would require adding new variables, which—for an undirected model like theirs—changes the partition function and requires retraining. By contrast, our trained directed model is a productive phonological system that can generate unboundedly many new words (see section 4.1). By analogy, n samples from a Gaussian would be described with a directed model, and inferring the Gaussian parameters predicts any number of future samples n + 1, n + 2, . . . .

Bouchard-Côté et al., in several papers from 2007 through 2013, have used directed graphical models over strings, like ours though without loops, to model diachronic sound change. Sometimes they use belief propagation for inference (Hall and Klein, 2010). Their goal is to recover latent historical forms (conceptually, surface forms) rather than latent underlying forms. The results are evaluated against manual reconstructions.

None of this work has segmented words into morphs, although Dreyer et al. (2008) did segment surface words into latent "regions." Creutz and Lagus (2005) and Goldsmith (2006) segment an unannotated collection of words into reusable morphs, but without modeling contextual sound change, i.e., phonology.

9 Conclusions and Future Work

We have laid out a probabilistic model for generative phonology. This lets us infer likely explanations of a collection of morphologically related surface words, in terms of underlying morphs and productive phonological changes. We do so by applying general algorithms for inference in graphical models (improved in our followup papers: see section 4.4) and for MAP estimation from incomplete data, using weighted finite-state machines to encode uncertainty.

Throughout our presentation, we were careful to point out various limitations of our setup. But in each case, we also outlined how future work could address these limitations within the framework we propose here. Finally, we proposed a detailed scheme for quantitative evaluation of phonological learners. Across 7 different languages, on both small and larger datasets, our learner was able to predict held-out surface forms with low error rates.
Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1423276, and by a Fulbright grant to the first author. The work was completed while the first author was visiting Ludwig Maximilian University of Munich. For useful discussion of presentation, terminology, and related work, we would like to thank action editor Sharon Goldwater, the anonymous reviewers, Reut Tsarfaty, Frank Ferraro, Darcey Riley, Christo Kirov, and John Sylak-Glassman.

References

Diana Apoussidou. 2006. On-line learning of underlying forms. Technical Report ROA-835, Rutgers Optimality Archive.

R. Harald Baayen, Richard Piepenbrock, and Leon Gulikers. 1995. The CELEX lexical database on CD-ROM.

Juliette Blevins. 1994. A phonological and morphological reanalysis of the Maori passive. Te Reo, 37:29–53.

Paul Boersma and Bruce Hayes. 2001. Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry, 32(1):45–86.

Paul Boersma. 1997. How we learn variation, optionality, and probability. In Proceedings of the Institute of Phonetic Sciences of the University of Amsterdam, volume 21, pages 43–58.

Paul Boersma. 1998. How we learn variation, optionality, and probability. In Functional Phonology: Formalizing the Interactions Between Articulatory and Perceptual Drives, chapter 15. Ph.D. Dissertation, University of Amsterdam. Previously appeared in IFA Proceedings (1997), pp. 43–58.

Alexandre Bouchard-Côté, Percy Liang, Thomas L. Griffiths, and Dan Klein. 2007. A probabilistic approach to language change. In Proceedings of NIPS.

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper and Row.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceedings of EMNLP, pages 575–584.

George N. Clements and Elizabeth V. Hume. 1995. The internal organization of speech sounds. In John Goldsmith, editor, Handbook of Phonological Theory. Oxford University Press, Oxford.

Ryan Cotterell and Jason Eisner. 2015. Penalized expectation propagation for graphical models over strings. In Proceedings of NAACL-HLT, pages 932–942, Denver, June. Supplementary material (11 pages) also available.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Proceedings of ACL.

Mathias Creutz and Krista Lagus. 2005. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR05), volume 1.

Rina Dechter, Bozhena Bidyuk, Robert Mateescu, and Emma Rollon. 2010. On the power of belief propagation: A constraint propagation perspective. In Rina Dechter, Hector Geffner, and Joseph Y. Halpern, editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl. College Publications.

Rina Dechter. 2003. Constraint Processing. Morgan Kaufmann.

Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In Proceedings of EMNLP, pages 101–110.

Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 616–627, Edinburgh, July. Supplementary material (9 pages) also available.

Markus Dreyer, Jason R. Smith, and Jason Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In Proceedings of EMNLP, pages 1080–1089.

Markus Dreyer. 2011. A Non-Parametric Model for the Discovery of Inflectional Paradigms from Plain Text Using Graphical Models over Strings. Ph.D. thesis, Johns Hopkins University, Baltimore, MD, April.

Sarah Eisenstat. 2009. Learning underlying forms with MaxEnt. Master's thesis, Brown University, Providence, RI.

Jason Eisner. 2002a. Comprehension and compilation in Optimality Theory. In Proceedings of ACL, pages 56–63, Philadelphia, July.

Jason Eisner. 2002b. Discovering syntactic deep structure via Bayesian statistics. Cognitive Science, 26(3):255–268, May-June.
Jason Eisner. 2015. Should linguists evaluate grammars or grammar learners? In preparation.

Micha Elsner, Sharon Goldwater, and Jacob Eisenstein. 2012. Bootstrapping a unified model of lexical and phonetic acquisition. In Proceedings of ACL, pages 184–193.

Micha Elsner, Sharon Goldwater, Naomi Feldman, and Frank Wood. 2013. A joint learning model of word segmentation, lexical acquisition, and phonetic variability. In Proceedings of EMNLP, pages 42–54.

Alexander M. Fraser, Marion Weller, Aoife Cahill, and Fabienne Cap. 2012. Modeling inflection and word-formation in SMT. In Proceedings of EACL, pages 664–674.

J. Goldsmith. 2006. An algorithm for the unsupervised learning of morphology. Natural Language Engineering, 12(4):353–371.

Sharon Goldwater and Mark Johnson. 2003. Learning OT constraint rankings using a maximum entropy model. In Jennifer Spenader, Anders Eriksson, and Östen Dahl, editors, Proceedings of the Workshop on Variation within Optimality Theory, pages 113–122, Stockholm University.

David Hall and Dan Klein. 2010. Finding cognate groups using phylogenies. In Proceedings of ACL.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Matt Hohensee and Emily M. Bender. 2012. Getting more from morphology in multilingual dependency parsing. In Proceedings of NAACL-HLT, pages 315–326.

Mans Hulden. 2009. Revisiting multi-tape automata for Semitic morphological analysis and generation. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 19–26, March.

Sharon Inkelas and Cheryl Zoll. 2005. Reduplication: Doubling in Morphology. Number 106 in Cambridge Studies in Linguistics. Cambridge University Press.

Roman Jakobson. 1948. Russian conjugation. Word, 4:155–167.

Gaja Jarosz. 2006. Richness of the base and probabilistic unsupervised learning in Optimality Theory. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology, pages 50–59.

Gaja Jarosz. 2013. Learning with hidden structure in optimality theory and harmonic grammar: Beyond robust interpretive parsing. Phonology, 30(01):27–71.

C. Douglas Johnson. 1972. Formal Aspects of Phonological Description. Mouton.

René Kager. 1999. Optimality Theory, volume 2. MIT Press.

Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

Michael J. Kenstowicz and Charles W. Kisseberth. 1979. Generative Phonology. Academic Press San Diego.

András Kornai. 1995. Formal Phonology. Garland Publishing, New York.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of EMNLP, pages 40–51, Singapore, August.

Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528.

Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML, pages 591–598.

Navarré Merchant. 2008. Discovering Underlying Forms: Contrast Pairs and Ranking. Ph.D. thesis, Rutgers University. Available on the Rutgers Optimality Archive as ROA-964.

Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of UAI, pages 467–475.

Joe Pater, Karen Jesney, Robert Staubs, and Brian Smith. 2012. Learning probabilities over underlying representations. In Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology, pages 62–71.

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Nanyun Peng, Ryan Cotterell, and Jason Eisner. 2015. Dual decomposition inference for graphical models over strings. In Proceedings of EMNLP, Lisbon, September. To appear.

Janet Pierrehumbert. 2003. Probabilistic phonology: Discrimination and robustness. In Probabilistic Linguistics, pages 177–228. MIT Press.

Alan Prince and Paul Smolensky. 2004. Optimality Theory: Constraint Interaction in Generative Grammar. Wiley-Blackwell.
Jason A. Riggle. 2004. Generation, Recognition, and Learning in Finite State Optimality Theory. Ph.D. thesis, University of California at Los Angeles.

Jason Riggle. 2012. Phonological features chart (version 12.12). Available online at https://dl.dropboxusercontent.com/u/5956329/Riggle/PhonChart_v1212.pdf (retrieved 2015-08-07).

David Sankoff. 1978. Probability and linguistic variation. Synthese, 37(2):217–238.

Ferdinand de Saussure. 1916. Course in General Linguistics. Columbia University Press. English edition of June 2011, based on the 1959 translation by Wade Baskin.

Paul Smolensky and Géraldine Legendre. 2006. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar (Vol. 1: Cognitive Architecture). MIT Press.

Bruce Tesar and Paul Smolensky. 1998. Learnability in Optimality theory. Linguistic Inquiry, 29(2):229–268.

Bruce Tesar. 2004. Contrast analysis in phonological learning. Technical Report ROA-695, Rutgers Optimality Archive.

Bruce Tesar. 2014. Output-Driven Phonology: Theory and Learning. Cambridge University Press.

Yi Yang and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In EMNLP, pages 61–72.

Reyyan Yeniterzi. 2011. Exploiting morphology in Turkish named entity recognition system. In Proceedings of the ACL Student Session, pages 105–110.