Transactions of the Association for Computational Linguistics, vol. 5, pp. 353–364, 2017. Action Editor: Eric Fosler-Lussier.

Submission batch: 10/2016; Revision batch: 12/2016; Published 10/2017.

© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Unsupervised Learning of Morphological Forests

Jiaming Luo, CSAIL, MIT, j_luo@mit.edu
Karthik Narasimhan, CSAIL, MIT, karthikn@mit.edu
Regina Barzilay, CSAIL, MIT, regina@csail.mit.edu

Abstract

This paper focuses on unsupervised modeling of morphological families, collectively comprising a forest over the language vocabulary. This formulation enables us to capture edge-wise properties reflecting single-step morphological derivations, along with global distributional properties of the entire forest. These global properties constrain the size of the affix set and encourage the formation of tight morphological families. The resulting objective is solved using Integer Linear Programming (ILP) paired with contrastive estimation. We train the model by alternating between optimizing the local log-linear model and the global ILP objective. We evaluate our system on three tasks: root detection, clustering of morphological families, and segmentation. Our experiments demonstrate that our model yields consistent gains in all three tasks compared with the best published results.[1]

1 Introduction

The morphological study of a language inherently draws upon the existence of families of related words. All words within a family can be derived from a common root via a series of transformations, whether inflectional or derivational. Figure 1 depicts one such family, originating from the word faith. This representation can benefit a range of applications, including segmentation, root detection, and clustering of morphological families.

Figure 1: An illustration of a single tree in a morphological forest. pre and suf represent prefixation and suffixation. Each edge has an associated probability for the morphological change.

Using graph terminology, a full morphological assignment of the words in a language can be represented as a forest.[2] Valid forests of morphological families exhibit a number of well-known regularities. At the global level, the number of roots is limited: roots constitute only a small fraction of the vocabulary. A similar constraint applies to the number of possible affixes, which are shared across families. At the local edge level, we prefer derivations that follow regular orthographic patterns and preserve semantic relatedness. We hypothesize that enforcing these constraints as part of the forest induction process will allow us to accurately learn morphological structures in an unsupervised fashion.

To test this hypothesis, we define an objective over the entire forest representation. The proposed objective is designed to maximize the likelihood of local derivations, while constraining the overall number of affixes and encouraging tighter morphological families. We optimize this objective using Integer Linear Programming (ILP), which is commonly employed to handle global constraints. While in prior work ILP has often been employed in supervised settings, we explore its effectiveness in unsupervised learning. We induce a forest by alternating between learning local edge probabilities using a log-linear model, and enforcing global constraints with the ILP-based decoder. With each iteration, the model progresses towards more consistent forests.

We evaluate our model on three tasks: root detection, clustering of morphologically related families, and segmentation. The last task has been extensively studied in the recent literature, providing us with the opportunity to compare the model with multiple unsupervised techniques. On benchmark datasets representing four languages, our model outperforms the baselines, yielding new state-of-the-art results. For instance, we improve segmentation performance on Turkish by 4.4% and on English by 3.7%, relative to the best published results (Narasimhan et al., 2015). Similarly, our model exhibits superior performance on the other two tasks. We also provide an analysis of the model behavior which reveals that most of the gain comes from enforcing global constraints on the number of unique affixes.

[1] Code is available at https://github.com/j-luo93/MorphForest.
[2] The correct mathematical term for the structure in Figure 1 is a directed 1-forest or functional graph. For simplicity, we shall use the terms forest and tree to refer to a directed 1-forest or a directed 1-tree because of the cycle at the root.

2 Related Work

Unsupervised morphological segmentation. Most top-performing algorithms for unsupervised segmentation today center around modeling single-step derivations (Poon et al., 2009; Naradowsky and Toutanova, 2011; Narasimhan et al., 2015). A commonly used log-linear formulation enables these models to consider a rich set of features ranging from orthographic patterns to semantic relatedness. However, these models generally bypass global constraints (Narasimhan et al., 2015) or require performing inference over very large spaces (Poon et al., 2009). As we show in our analysis (Section 5), this omission negatively affects model performance. In contrast, earlier work focuses on modeling global morphological assignment, using generative probabilistic models (Creutz and Lagus, 2007; Snyder and Barzilay, 2008; Goldwater et al., 2009; Sirts and Goldwater, 2013). These models are inherently limited in their ability to incorporate the diverse features that are effectively utilized by local discriminative models. Our proposed approach attempts to combine the advantages of both lines of work, by defining an objective that incorporates both levels of linguistic properties over the entire forest representation, and adopting an alternating training regime for optimization.

Graph-based representations in computational morphology. Variants of a graph-based representation have been used to model various morphological phenomena (Dreyer and Eisner, 2009; Peng et al., 2015; Soricut and Och, 2015; Faruqui et al., 2016). The graph induction methods vary widely depending on the task and the available supervision. The distinctive feature of our work is the use of global constraints to guide the learning of local, edge-level derivations.

ILP for capturing global properties. Integer Linear Programming has been successfully employed to capture global constraints across multiple applications such as information extraction (Roth and Yih, 2001), sentence compression (Clarke and Lapata, 2008), and textual entailment (Berant et al., 2011). In all of these applications, the ILP formulation is used with a supervised classifier. Our work demonstrates that this framework continues to be effective in an unsupervised setting, providing strong guidance for a local, unsupervised classifier.

3 Model

Our model considers a full morphological assignment for all the words in a language, representing it as a forest. Let F = (V, E) be a directed graph where each word corresponds to a node v ∈ V. A directed edge e = (v_c, v_p) ∈ E encodes a single morphological derivation from a parent word v_p to a child word v_c.
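This parent-pointer structure can be stored directly. The following is a minimal illustrative sketch (the function names and the specific family members beyond faith, faithful, and faithless are our own choices, not taken from the paper's released code):

```python
def build_forest(edges):
    """Build a parent-pointer map from (child, parent, label) triples.
    Roots carry a self-edge labelled 'stop', as in Figure 1."""
    return {child: (parent, label) for child, parent, label in edges}

def root_of(forest, word):
    """Follow parent links until the self-edge (the root) is reached."""
    while forest[word][0] != word:
        word = forest[word][0]
    return word

# A toy version of the faith family (pre = prefixation, suf = suffixation).
forest = build_forest([
    ("faith", "faith", "stop"),        # root: self-directed stop edge
    ("faithful", "faith", "suf"),
    ("faithless", "faith", "suf"),
    ("unfaithful", "faithful", "pre"),
])
```

Because every node has exactly one outgoing edge and roots point to themselves, following parent links from any word terminates at its root.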

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
0
6
6
1
5
6
7
5
2
5

/

/
t

l

a
c
_
a
_
0
0
0
6
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

355

Edges also reflect the type of the underlying derivation (e.g., prefixation), and carry an associated probability Pr(e). Note that the root of a tree is always marked with a self-directed (i.e., v_c = v_p) edge associated with the label stop. Figure 1 illustrates a single tree in the forest.

3.1 Inducing morphological forests

We postulate that a valid assignment yields forests with the following properties:

1. Increased edge weights. Edge weights reflect probabilities of single-step derivations based on local features, including orthographic patterns and semantic relatedness. This local information helps identify that the edge (painter, paint) should be preferred over (painter, pain), because -er is a valid suffix and paint is semantically closer to painter.

2. Minimized number of affixes. Prior research has shown that local models tend to greatly overestimate the number of suffixes. For instance, the model of Narasimhan et al. (2015) produces 617 unique affixes when segmenting 10,000 English words. Thus, we explicitly encourage the model towards assignments with the least number of affixes.

3. Minimized number of roots relative to vocabulary size. Similarly, the number of roots, and consequently the number of morphological families, is markedly smaller than the size of the vocabulary.

The first property is local in nature, while the last two are global and embody the principle of Minimum Description Length (MDL). Based on these properties, we formulate an objective function S(F) over a forest F:

    S(F) = − (Σ_{e∈E} log Pr(e)) / |E| + α |Affix| + β |F| / |V|,    (1)

where |·| denotes set cardinality, Affix = {a_k}, k = 1, ..., K, is the set of all affixes, and |F| is the number of trees in F. |E| and |V| are the sizes of the edge set and the vocabulary, respectively. The hyperparameters α and β capture the relative importance of the three terms. By minimizing this objective, we encourage assignments with high edge probabilities (first term), while limiting the number of affixes and morphological families (second and third terms, respectively). This objective can also be viewed as a simple log-likelihood objective regularized by the last two terms in Equation (1).

Figure 2: Illustration of two chosen forest representations. The top forest has only one affix -s, but two roots {pain, paint}. Shown in the bottom forest, choosing the edge (paint, pain) instead of (paint, paint) will introduce another affix -t, while reducing the set of roots to just {pain}.

To illustrate the interaction between local and global constraints in this objective, consider the example in Figure 2. If the model selects a different edge, e.g., (paint, pain) instead, all the terms in Equation (1) will be affected.
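Equation (1) is straightforward to evaluate for a fixed candidate forest. The sketch below is illustrative only (names are ours, not the paper's implementation); it exploits the fact that every word has exactly one outgoing edge, so |E| = |V|:

```python
import math

def forest_score(chosen_edges, alpha, beta):
    """Evaluate S(F) of Equation (1).

    chosen_edges maps each child word to (parent, affix, prob), where
    affix is 'stop' for a root self-edge and prob is Pr(e).
    """
    n = len(chosen_edges)                      # |E| = |V|
    neg_log_lik = -sum(math.log(p) for _, _, p in chosen_edges.values()) / n
    affixes = {a for _, a, _ in chosen_edges.values() if a != "stop"}
    n_trees = sum(1 for _, a, _ in chosen_edges.values() if a == "stop")
    return neg_log_lik + alpha * len(affixes) + beta * n_trees / n
```

Lower scores are better: high edge probabilities shrink the first term, while each distinct affix and each additional tree adds a penalty.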


3.2 Computing local probabilities

We now describe how to parameterize Pr(e), which captures the likelihood of a single-step morphological derivation between two words. Following prior work (Narasimhan et al., 2015), we model this probability using a log-linear model:

    Pr(w, z) ∝ exp(θ · φ(w, z)),    (2)

where θ is the set of parameters to be learned, and φ(w, z) is the feature vector extracted from w and z. Each candidate z is a tuple (string, label), where label refers to the label of the potential edge. As a result, the marginal probability is

    Pr(w) = Σ_{z∈C(w)} Pr(w, z) = Σ_{z∈C(w)} exp(θ · φ(w, z)) / Σ_{w'∈Σ*} Σ_{z'∈C(w')} exp(θ · φ(w', z')),

where Σ* is the set of all possible strings. Computing the sum in the denominator is infeasible. Instead, we make use of contrastive estimation (Smith and Eisner, 2005), substituting Σ* with a (limited) set of neighbor strings N(w) that are orthographically close to w. This technique distributes the probability mass among neighboring words and forces the model to identify meaningful discriminative features. We obtain N(w) by transposing characters in w, following the method described in Narasimhan et al. (2015). Now, for the forest over the set of nodes V, the log-likelihood loss function is defined as:

    L(V; θ) = −Σ_{v∈V} log Pr(v)
            = −Σ_{v∈V} [ log Σ_{z∈C(v)} exp(θ · φ(v, z)) − log Σ_{v'∈N(v)} Σ_{z'∈C(v')} exp(θ · φ(v', z')) ].    (3)

This objective can be minimized by gradient descent.

Space of possible candidates. We only consider assignments where the parent word is strictly shorter than the child word, to prevent cycles of length two or more. In addition to suffixation and prefixation, we also consider three types of transformations introduced in Goldwater and Johnson (2004): repetition, deletion, and modification. We also handle compounding, where two stems are combined to form a new word (e.g., football). One of these stems carries the main semantic meaning of the compound and is considered to be the parent of the word. Note that stems are not considered affixes, so this does not affect the affix list. We allow parents to be words outside V, since many legitimate word forms might never appear in the corpus. For instance, if we have V = {painter, paints}, the optimal solution would add an unseen word paint to the forest, and choose the edges (painter, paint) and (paints, paint).

Features. We use the same set of features shown to be effective in prior work (Narasimhan et al., 2015), including word vector similarity, beginning and ending character bigrams, word frequencies, and affixes. Affix features are automatically extracted from the corpus based on string difference and are thresholded based on frequency. We also include an additional sibling feature that counts how many words are siblings of word w in its tree. Siblings are words that are derived from the same parent, e.g., faithful and faithless, both from the word faith.

3.3 ILP formulation

Minimizing the objective in Equation (1) is challenging because the second and third terms capture discrete global properties of the forest, which prevents us from performing gradient descent directly. Instead, we formulate this optimization problem as Integer Linear Programming (ILP), where these two terms can be cast as constraints.[3]

For each child word v_i ∈ V, we have a bounded set of its candidate outgoing edges C(v_i) = {z_i^j}, where z_i^j is the j-th candidate for v_i. C(v_i) is the same set as defined in Section 3.2. Each edge is associated with p_ij, which is computed as log Pr(z_i^j | v_i). Let x_ij be a binary variable that has value 1 if and only if z_i^j is chosen to be in the forest. Without loss of generality, we assume the first candidate edge is always the self-edge (or stop case), i.e., z_i^1 = (v_i, stop). We also use a set of binary variables {y_k} to indicate whether affix a_k is used at least once in F (i.e., required to explain a morphological change).

[3] If we had prior knowledge of words belonging to the same family, we could frame the problem as growing a Minimum Spanning Tree (MST), and use the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) to solve it. However, this information is not available to us.
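The contrastive loss of Equation (3) can be sketched concretely. In this toy version (names are ours), we take N(w) to be the adjacent-character transpositions of w, and we assume the denominator sums over N(w) together with w itself so that the loss is non-negative; `score(w, z)` stands in for the dot product θ · φ(w, z):

```python
import math

def transposition_neighbors(w):
    """N(w): strings obtained by swapping adjacent characters of w."""
    return {w[:i] + w[i + 1] + w[i] + w[i + 2:] for i in range(len(w) - 1)}

def contrastive_loss(words, candidates, score):
    """Equation (3): negative log of the candidate mass of w, normalized
    over the neighborhood rather than over all strings in Sigma*."""
    total = 0.0
    for w in words:
        num = sum(math.exp(score(w, z)) for z in candidates(w))
        den = sum(math.exp(score(v, z))
                  for v in transposition_neighbors(w) | {w}
                  for z in candidates(v))
        total -= math.log(num) - math.log(den)
    return total
```

Minimizing this loss pushes probability mass away from the orthographically perturbed neighbors and onto the observed word's candidates, which is what forces the features to become discriminative.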


Now let us consider how to derive our ILP formulation using the notation above. Note that |F| is equal to the number of self-edges, Σ_i x_i1, and that a valid forest satisfies |V| = |E|. Combining these pieces, we can rewrite the objective in Equation (1) and arrive at the following ILP formulation:

    minimize over x_ij, y_k:   −(1/|V|) Σ_{ij} x_ij p_ij + α Σ_k y_k + (β/|V|) Σ_i x_i1

    subject to   x_ij, y_k ∈ {0, 1},
                 Σ_j x_ij = 1,  ∀i,    (4)
                 x_ij ≤ y_k,  if a_k is involved in z_i^j.    (5)

Constraint (4) states that exactly one of the candidate edges should be chosen for each word. The last constraint implies that we can only choose a candidate (and construct the corresponding edge) when the involved affix[4] is used at least once in the forest representation.

[4] For English and German, where non-concatenative transformations are possible, such as deletion of an ending e (taking → take), we also include them in Affix.

3.4 Alternating training

The objective function contains two sets of parameters: a continuous weight vector θ that parameterizes edge probabilities, and the binary variables {x_ij} and {y_k} in the ILP. Due to the discordance between continuous and discrete variables, we need to optimize the objective in an alternating manner. Algorithm 1 details the training procedure. After automatically extracting affixes from the corpus, we alternate between learning the local edge probabilities (line 3) and solving the ILP (line 4).

The feedback from solving the ILP with the global constraints can help us refine the learning of local probabilities by removing incorrect affixes (line 5). For instance, automatic extraction based on frequencies can include -ers as an English suffix. This is likely to be eliminated by the ILP, since all occurrences of -ers can be explained away without adding a new affix by concatenating -er and -s, two very common suffixes. After refining the affix set, we remove all candidates that involve any affix discarded by the ILP. This corresponds to reducing the size of C(w) in Equation (3). We then train the log-linear model again using the newly-pruned candidate set. By doing so, we force the model to learn from better contrastive signals, and to focus on affixes of higher quality, resulting in a new set of probabilities {p_ij}. This procedure is repeated until no more affixes are rejected.[5]

4 Experiments

We evaluate our model on three tasks: segmentation, morphological family clustering, and root detection. While the first task has been extensively studied in the prior literature, we consider the two additional tasks to assess the flexibility of the derived representation.

4.1 Morphological segmentation

Data. We choose four languages with distinct morphological properties: English, Turkish, Arabic, and German. Our training data consists of standard datasets used in prior work; statistics for all datasets are summarized in Table 1. Note that for the Arabic test set, we filtered out duplicate words, and we reran the baselines to obtain comparable results. Following Narasimhan et al. (2015), we reduce noise by truncating the training word list to the top K frequent words. In addition, we train word vectors (Mikolov et al., 2013) to obtain cosine similarity features.

Baselines. We compare our approach against the state-of-the-art unsupervised method of Narasimhan et al. (2015), which outperforms a number of alternative approaches (Creutz and Lagus, 2005; Virpioja et al., 2013; Sirts and Goldwater, 2013; Lee et al., 2011; Stallard et al., 2012; Poon et al., 2009). For this baseline, we report the results of the publicly available implementation of the technique (NBJ'15), as well as our own improved reimplementation (NBJ-Imp). Specifically, in NBJ-Imp we expanded the original algorithm to handle compounding, along with the sibling features described in Section 3.2, making it essentially an ablation of our model without ILP and alternating training. We employ grid search to find the optimal hyperparameter setting.[6]

[5] Typically the model converges after 5 rounds.
[6] K ∈ {2500, 5000, 10000}; number of automatically extracted affixes ∈ {100, 200, 300, 400, 500}.
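The decoding step of Section 3.3 is solved with an off-the-shelf ILP solver in practice. Its behavior can be illustrated by exhaustively enumerating the feasible set, which is tractable only for toy vocabularies; this brute-force sketch (names are ours) mirrors constraint (4) by picking exactly one candidate per word, and the y_k variables of constraint (5) by paying α once per distinct affix used:

```python
from itertools import product
import math

def decode_forest(candidates, alpha, beta):
    """Brute-force stand-in for the ILP of Section 3.3.

    candidates[w] is a list of (parent, affix, logp) triples, with
    affix None for the stop self-edge; logp plays the role of p_ij.
    """
    words = list(candidates)
    n = len(words)                                  # |V| = |E|
    best_forest, best_cost = None, float("inf")
    for choice in product(*(candidates[w] for w in words)):
        affixes = {a for _, a, _ in choice if a is not None}
        stops = sum(1 for _, a, _ in choice if a is None)   # number of trees
        cost = (-sum(lp for _, _, lp in choice) / n
                + alpha * len(affixes) + beta * stops / n)
        if cost < best_cost:
            best_cost = cost
            best_forest = {w: (p, a) for w, (p, a, _) in zip(words, choice)}
    return best_forest, best_cost
```

On the Figure 2 example, a small α favors introducing the extra affix -t to merge pain and paint into one family, while a large α keeps paint as a separate root.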


Algorithm 1: Morphological Forest Induction
Input: word list V
Output: forest representation of V
1: Affix ← ExtractAffixes(W)                    ▷ Extract common patterns as affixes from the word list
2: for t ← 1 to T do                            ▷ Alternating training for T iterations
3:   p_ij^t ← ContrastiveEstimation(W, Affix)   ▷ Compute local probabilities, cf. Section 3.2
4:   y*t, F^t ← ILP(p_ij^t)                     ▷ Get indicators for affixes, and the forest, cf. Section 3.3
5:   PruneAffixSet(Affix, y*t)                  ▷ Prune affix set using the output from ILP, cf. Section 3.4
6: return F^T

Language | Train            | Test             | Word vectors
English  | MC-10 (878K)     | MC-05:10 (2212)  | Wikipedia (129M)
Turkish  | MC-10 (617K)     | MC-05:10 (2531)  | BOUN (361M)
Arabic   | Gigaword (3.83M) | ATB (21085)      | Gigaword (1.22G)
German   | MC-10 (2.34M)    | Dsolve (15522)   | Wikipedia (589M)

Table 1: Data statistics (number of words per set): MC-10 = MorphoChallenge 2010, MC-05:10 = aggregated from MorphoChallenge 2005-2010, BOUN = BOUN corpus (Sak et al., 2008), Gigaword = Arabic Gigaword corpus (Parker et al., 2011), ATB = Arabic Treebank (Maamouri et al., 2003). Duplicates in the Arabic test set are filtered. Dsolve is the dataset released by Würzner and Jurish (2015), and for training German vectors, we use the pre-processed Wikipedia dump from Al-Rfou et al. (2013).

We also include a supervised counterpart, which uses the same set of features as NBJ-Imp but has access to gold segmentations during training (we perform 5-fold cross-validation using the same data). We obtain the gold-standard parent-child pairs required for training from the segmented words in a straightforward fashion.

Evaluation metric. Following prior work (Virpioja et al., 2011), we evaluate all models using the standard boundary precision and recall (BPR). This measure assesses the accuracy of individual segmentation points, producing IR-style Precision, Recall, and F1 scores.

Training. For unsupervised training, we use the gradient descent method ADAM (Kingma and Ba, 2014) and optimize over the whole batch of training words. We use the Gurobi solver[7] for the ILP.

4.2 Morphological family clustering

Morphological family clustering is the task of clustering morphologically related word forms. For instance, we want to group paint, paints, and pain into two clusters: {paint, paints} and {pain}. To derive clusters from the forest representation, we assume that all the words in the same tree form a cluster.

Data. To obtain gold information about morphological clusters, we use CELEX (Baayen et al., 1993). Data statistics are summarized in Table 2. We remove words without stems from CELEX.[8]

Language | #Words  | #Clusters | #Words per cluster
English  | 75,416  | 20,249    | 3.72
German   | 367,967 | 28,198    | 13.05

Table 2: Data statistics for the family clustering task (CELEX). We only evaluate on English and German, since these are the languages MorphoChallenge has segmentations for.

Baseline. We compare our model against NBJ-Imp, described above. We select the best variant of our model and of the base model based on their respective performance on the segmentation task.

Evaluation. We use the metrics proposed by Schone and Jurafsky (2000).

[7] http://www.gurobi.com/
[8] An example is aerodrome, where both aero- and drome are affixes.
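These cluster metrics compare, for each word w, its predicted cluster X_w against its gold cluster Y_w, accumulating correct, inserted, and deleted counts. A small sketch (names are ours):

```python
def cluster_prf(pred, gold):
    """Schone and Jurafsky (2000) cluster metric.

    pred[w] and gold[w] are the sets of words clustered with w
    (each set includes w itself).
    """
    C = I = D = 0.0
    for w in gold:
        X, Y = pred[w], gold[w]
        C += len(X & Y) / len(Y)   # correct
        I += len(X - Y) / len(Y)   # inserted
        D += len(Y - X) / len(Y)   # deleted
    precision, recall = C / (C + I), C / (C + D)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Note that over-merging families inflates I and hurts precision, while over-splitting inflates D and hurts recall.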


Specifically, let X_w and Y_w be the clusters for word w in our predictions and in the gold standard, respectively. We compute the number of correct (C), inserted (I), and deleted (D) words for the clusters as follows:

    C = Σ_{w∈W} |X_w ∩ Y_w| / |Y_w|,   I = Σ_{w∈W} |X_w \ Y_w| / |Y_w|,   D = Σ_{w∈W} |Y_w \ X_w| / |Y_w|.

Then we compute precision = C / (C + I), recall = C / (C + D), and F1 = 2 · precision · recall / (precision + recall).

4.3 Root detection

In addition, we evaluate how accurately our model can predict the root of any given word.

Data. We report results on the Chipmunk dataset (Cotterell et al., 2015), which has been used for evaluating supervised models for root detection. Since our model is unsupervised, we report the performance both on the test set only and on the entire dataset, combining the train/test split. Statistics for the dataset are shown in Table 3.

Language | #Words | #Words (test only)
English  | 1675   | 687
Turkish  | 1759   | 763
German   | 1747   | 749

Table 3: Data statistics for the root detection task. Duplicate words are removed.

5 Results

In the following subsections, we report model performance on each of the three evaluation tasks.

5.1 Segmentation

English
  Supervised  0.905 0.813 0.856
  NBJ'15      0.807 0.722 0.762
  NBJ-Imp     0.820 0.726 0.770
  Our model   0.838 0.729 0.780
  +Sibl       0.796 0.739 0.767
  +Comp       0.840 0.761 0.799*
  +Comp,Sibl  0.815 0.774 0.794
Turkish
  Supervised  0.826 0.803 0.815
  NBJ'15      0.743 0.520 0.612
  NBJ-Imp     0.697 0.583 0.635
  Our model   0.717 0.577 0.639
  +Sibl       0.698 0.619 0.656*
  +Comp       0.716 0.581 0.642
  +Comp,Sibl  0.692 0.621 0.655
Arabic
  Supervised  0.904 0.921 0.912
  NBJ'15      0.840 0.724 0.778
  NBJ-Imp     0.866 0.725 0.789
  Our model   0.848 0.769 0.806
  +Sibl       0.829 0.787 0.807*
  +Comp       0.851 0.765 0.806
  +Comp,Sibl  0.881 0.745 0.807*
German[9]
  Supervised  0.823 0.810 0.816
  NBJ'15      0.716 0.275 0.397
  NBJ-Imp     0.790 0.480 0.597
  Our model   0.774 0.540 0.636
  +Sibl       0.711 0.514 0.596
  +Comp       0.777 0.595 0.674*
  +Comp,Sibl  0.701 0.616 0.656

Table 4: Segmentation results (columns: boundary precision, recall, F1) for the supervised model and three unsupervised models: the state-of-the-art system NBJ'15 (Narasimhan et al., 2015), our improved implementation of their system NBJ-Imp, and our model. For our model, we also report results with different feature combinations: +Sibl and +Comp refer to the addition of sibling and compounding features, respectively. Best hyperparameter values for the unsupervised baselines (NBJ'15, NBJ-Imp) are chosen via grid search, while for our model, we use 10K words and the top 500 affixes throughout. * implies statistical significance with p < 0.05 against the NBJ-Imp model using the sign test.[10]

[9] We used cosine similarity features in all experiments. But the root forms of German verbs are rarely used, except in imperative sentences. Consequently, they have barely trained word vectors, contributing to the low recall value. We suspect a better treatment of word vectors can further improve the results.
[10] http://www.mathcracker.com/sign-test.php

From Table 4, we observe that our model consistently outperforms the baselines on all four languages. Compared to NBJ'15, our model has a higher F1 score by 3.7%, 4.4%, 2.9%, and 27.7% on English, Turkish, Arabic, and German, respectively. While the improved implementation NBJ-Imp benefits from the addition of compounding and sibling features, our model still delivers an absolute increase in F1 score, ranging from 1.8% to 7.7%, over NBJ-Imp. Note that our model achieves higher scores even without tuning the threshold K or the number of affixes, whereas the baselines use optimal hyperparameter settings found via grid search.

To understand the importance of the global constraints (the last two terms of Equation (1)), we analyze our model's performance with different values of α and β (see Figure 3). The first constraint, which controls the size of the affix set, plays a more dominant role than the second. By setting α = 0.0, the model scores at best 75.7% on English and 63.2% on Turkish, lower than the baseline. While the value of β also affects the F1 score, its role is secondary in achieving optimal performance.

The results also demonstrate that language properties can greatly affect the feature set choice. For fusional languages such as English, the computation of sibling features is unreliable. For example, two descendants of the same parent spot (spotless and spotty) may not necessarily be identified as such by a simple sibling computation algorithm, since they undergo different changes. In contrast, Turkish is highly agglutinative, with minimal (if any) transformations, but each word can have up to hundreds of related forms. Consequently, sibling features have different effects on English and Turkish, leading to changes of −0.3% and +2.1% in F1 score, respectively.

Figure 3: Heatmaps of α and β for English and Turkish. Darker cells mean higher scores. Models used are +Comp for English and +Sibl for Turkish.

Understanding model behavior. We find that much of the gain in model performance comes from the first two rounds of training. As Figure 4 shows, the improvement mainly stems from solving the ILP in the first round, followed by training the log-linear model in the second round after removing affixes and pruning candidate sets. This is exactly what we expect from the ILP formulation: to globally adjust the forest by reducing the number of unique affixes. We find this to be quite effective: in English, out of 500 prefixes, only 6 remain (de, dis, im, in, re, and un). Similarly, only 72 out of 500 suffixes survive this reduction.

Figure 4: F1 score vs. round of training, for +Comp on English. Training log-linear models and solving the ILP are marked by circles and triangles, respectively. The best result for NBJ-Imp is represented as a dashed horizontal line.

Robustness. We also investigate how robust our model is to the choice of hyperparameters. Figure 3 illustrates that we can obtain a sizable boost over the baseline by choosing α and β within a fairly wide region. Note that α takes on a much smaller value than β, to maintain the two constraint terms (|Affix| and |F|/|V|) at comparable magnitudes. Narasimhan et al. (2015) observe that after including more than K = 10,000 words, the performance of their unsupervised model drops noticeably. In contrast, our model handles training noise more robustly, resulting in a steady boost, or at worst a small drop, in performance with increasing training size (Figure 5). In fact, it scores 83.0% with K = 40,000 on English, a 6.0% increase in absolute value over the baseline.

Figure 5: Performance using bigger training sets: +Comp for English and +Sibl for Turkish. Dashed lines represent the best results for NBJ-Imp (with smaller training sets).

Qualitative analysis. Table 5 shows examples of English words that our model segments correctly, while NBJ'15 fails on them. We present them in three categories (top to bottom) based on the component of our model that contributes to the successful segmentation. The first category benefits from a refinement of the affix set, by removing noisy affixes such as -nce, -ch, and k-. This leads to correct stopping, as in the case of knuckle, or induction of the right suffix, as in divergence. Further, a smaller affix set also leads to more concentrated weights for the remaining affixes. For example, the feature weight for -ive jumps from 0.06 to 0.25, so that the derivation negative → negate is favored, as shown in the second category. Finally, the last category lists some compound words that our model successfully segments.

NBJ-Imp       | Our model
diverge-nce   | diverg-ence
lur-ch        | lurch
k-nuckle      | knuckle
negative      | negat-ive
junks         | junk-s
unreserved    | un-reserv-ed
gaslight-s    | gas-light-s
watercourse-s | water-course-s
expressway    | express-way

Table 5: Some English words that our model segments correctly, which the unsupervised base model fails at.

5.2 Morphological family clustering

We show the results for morphological family clustering in Table 6. For both languages, our model increases precision by a wide margin, with a modest boost for recall as well. This corroborates our findings in the segmentation task, where our model can effectively remove incorrect affixes while still encouraging words to form tight, cohesive families.

Language | Method    | P     | R     | F
English  | NBJ-Imp   | 0.328 | 0.680 | 0.442
         | Our model | 0.895 | 0.715 | 0.795
German   | NBJ-Imp   | 0.207 | 0.421 | 0.278
         | Our model | 0.471 | 0.484 | 0.477

Table 6: Results for morphological family clustering. P = precision, R = recall.

5.3 Root detection

Table 7 summarizes the results for the root detection task. Our model shows consistent improvements over the baseline on all three languages. We also include the results on the test set of two supervised systems: Morfette (Chrupala et al., 2008) and Chipmunk (Cotterell et al., 2015). Morfette is a string transducer while Chipmunk is a segmenter. Both systems have access to morphologically annotated corpora. Our model is quite competitive against Morfette; in fact, it achieves higher accuracy for English and Turkish. Compared with Chipmunk, our model scores 0.65 versus 0.70 on English, bridging the gap significantly. However, the gap to Chipmunk's high accuracy on morphologically complex languages such as Turkish and German suggests that unsupervised root detection remains a hard task.

Language | Method    | Accuracy | Accuracy (test only)
English  | NBJ-Imp   | 0.590    | 0.595
         | Our model | 0.636    | 0.649
         | Morfette  | -        | 0.628
         | Chipmunk  | -        | 0.703
Turkish  | NBJ-Imp   | 0.446    | 0.442
         | Our model | 0.463    | 0.467
         | Morfette  | -        | 0.268
         | Chipmunk  | -        | 0.756
German   | NBJ-Imp   | 0.347    | 0.331
         | Our model | 0.383    | 0.364
         | Morfette  | -        | 0.438
         | Chipmunk  | -        | 0.674

Table 7: Results for root detection. Numbers for Morfette and Chipmunk are reported by Cotterell et al. (2015).

6 Conclusions

In this work, we focus on unsupervised modeling of morphological families, collectively defining a forest over the language vocabulary. This formulation enables us to incorporate both local and global properties of morphological assignment. The resulting objective is solved using Integer Linear Programming (ILP) paired with contrastive estimation. Our experiments demonstrate that our model yields consistent gains in three morphological tasks compared with the best published results.

Acknowledgments

We thank Tao Lei, Yuan Zhang, and the members of the MIT NLP group for helpful discussions and feedback. We are also grateful to the anonymous reviewers for their insightful comments.

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria, August. Association for Computational Linguistics.

R. Harald Baayen, Richard Piepenbrock, and H. van Rijn. 1993. The CELEX lexical database on CD-ROM.

Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 610–619. Association for Computational Linguistics.

Grzegorz Chrupala, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with Morfette. In LREC.

Yoeng-Jin Chu and Tseng-Hong Liu. 1965. On shortest arborescence of a directed graph. Scientia Sinica, 14(10):1396.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, pages 399–429.

Ryan Cotterell, Thomas Müller, Alexander Fraser, and Hinrich Schütze. 2015. Labeled morphological segmentation with semi-Markov models. CoNLL 2015, page 164.

Mathias Creutz and Krista Lagus. 2005. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05), volume 1, pages 51–59.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing (TSLP), 4(1):3.

Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 101–110. Association for Computational Linguistics.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B, 71(4):233–240.

Manaal Faruqui, Ryan McDonald, and Radu Soricut. 2016. Morpho-syntactic lexicon generation using graph-based semi-supervised learning. Transactions of the Association for Computational Linguistics, 4:1–16.

Sharon Goldwater and Mark Johnson. 2004. Priors in Bayesian learning of phonological rules. In Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology, pages 35–42. Association for Computational Linguistics.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. 2011. Modeling syntactic context improves morphological segmentation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 1–9. Association for Computational Linguistics.

Mohamed Maamouri, Ann Bies, Hubert Jin, and Tim Buckwalter. 2003. Arabic Treebank: Part 1 v2.0. Distributed by the Linguistic Data Consortium. LDC Catalog No.: LDC2003T06.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Jason Naradowsky and Kristina Toutanova. 2011. Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 895–904. Association for Computational Linguistics.

Karthik Narasimhan, Regina Barzilay, and Tommi Jaakkola. 2015. An unsupervised method for uncovering morphological chains. Transactions of the Association for Computational Linguistics, 3:157–167.

Robert Parker, David Graff, Ke Chen, Junbo Kong, and Kazuaki Maeda. 2011. Arabic Gigaword fifth edition LDC2011T11. Philadelphia: Linguistic Data Consortium.

Nanyun Peng, Ryan Cotterell, and Jason Eisner. 2015. Dual decomposition inference for graphical models over strings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 917–927, Lisbon, Portugal, September. Association for Computational Linguistics.

Hoifung Poon, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 209–217. Association for Computational Linguistics.

Dan Roth and Wen-tau Yih. 2001. Relational learning via propositional algorithms: An information extraction case study. In International Joint Conference on Artificial Intelligence, pages 1257–1263.

Haşim Sak, Tunga Güngör, and Murat Saraçlar. 2008. Turkish language resources: Morphological parser, morphological disambiguator and web corpus. In Advances in Natural Language Processing, pages 417–427. Springer.

Patrick Schone and Daniel Jurafsky. 2000. Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning - Volume 7, pages 67–72. Association for Computational Linguistics.

Kairit Sirts and Sharon Goldwater. 2013. Minimally-supervised morphological segmentation using adaptor grammars. Transactions of the Association for Computational L
inguistics,1:255–266.NoahASmithandJasonEisner.2005.Contrastiveesti-mation:Traininglog-linearmodelsonunlabeleddata.InProceedingsofthe43rdAnnualMeetingonAssoci-ationforComputationalLinguistics,pages354–362.AssociationforComputationalLinguistics. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 0 6 6 1 5 6 7 5 2 5 / / t l a c _ a _ 0 0 0 6 6 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 364 BenjaminSnyderandReginaBarzilay.2008.Unsuper-visedmultilinguallearningformorphologicalsegmen-tation.InProceedingsofthe46thAnnualMeetingonAssociationforComputationalLinguistics,pages737–745.AssociationforComputationalLinguistics.RaduSoricutandFranzOch.2015.Unsupervisedmor-phologyinductionusingwordembeddings.InPro-ceedingsofthe2015ConferenceoftheNorthAmeri-canChapteroftheAssociationforComputationalLin-guistics:HumanLanguageTechnologies,pages1627–1637.AssociationforComputationalLinguistics.DavidStallard,JacobDevlin,MichaelKayser,YoongKeokLee,andReginaBarzilay.2012.Unsupervisedmorphologyrivalssupervisedmor-phologyforArabicMT.InProceedingsofthe50thAnnualMeetingoftheAssociationforComputationalLinguistics:ShortPapers-Volume2,pages322–327.AssociationforComputationalLinguistics.SamiVirpioja,VilleT.Turunen,SebastianSpiegler,Os-karKohonen,andMikkoKurimo.2011.Empiricalcomparisonofevaluationmethodsforunsupervisedlearningofmorphology.TraitementAutomatiquedesLangues,52(2):45–90.SamiVirpioja,PeterSmit,Stig-ArneGr¨onroos,MikkoKurimo,etal.2013.Morfessor2.0:Pythonimple-mentationandextensionsforMorfessorbaseline.Kay-MichaelW¨urznerandBryanJurish.2015.Dsolve-morphologicalsegmentationforGermanusingcondi-tionalrandomfields.InSystemsandFrameworksforComputationalMorphology,pages94–103.Springer.Transactions of the Association for Computational Linguistics, vol. 5, pp. 353–364, 2017. Action Editor: Eric Fosler-Lussier. image