Transactions of the Association for Computational Linguistics, vol. 3, pp. 157–167, 2015. Action Editor: Yuji Matsumoto.

Submission batch: 9/2014; Revision batch: 12/2014; Revision batch: 2/2015; Published 3/2015.
©2015 Association for Computational Linguistics. Distributed under a CC-BY-NC-SA 4.0 license.

AnUnsupervisedMethodforUncoveringMorphologicalChainsKarthikNarasimhan,ReginaBarzilayandTommiJaakkolaCSAIL,MassachusettsInstituteofTechnology{karthikn,regina,tommi}@csail.mit.eduAbstractMoststate-of-the-artsystemstodayproducemorphologicalanalysisbasedonlyonortho-graphicpatterns.Incontrast,weproposeamodelforunsupervisedmorphologicalanal-ysisthatintegratesorthographicandseman-ticviewsofwords.Wemodelwordforma-tionintermsofmorphologicalchains,frombasewordstotheobservedwords,breakingthechainsintoparent-childrelations.Weuselog-linearmodelswithmorphemeandword-levelfeaturestopredictpossibleparents,in-cludingtheirmodifications,foreachword.Thelimitedsetofcandidateparentsforeachwordrendercontrastiveestimationfeasible.Ourmodelconsistentlymatchesoroutper-formsfivestate-of-the-artsystemsonArabic,EnglishandTurkish.11IntroductionMorphologicallyrelatedwordsexhibitconnectionsatmultiplelevels,rangingfromorthographicalpat-ternstosemanticproximity.Forinstance,thewordsplayingandplayedsharethesamestem,butalsocarrysimilarmeaning.Ideally,allthesecomple-mentarysourcesofinformationwouldbetakenintoaccountwhenlearningmorphologicalstructures.Moststate-of-the-artunsupervisedapproachestomorphologicalanalysisarebuiltprimarilyaroundorthographicpatternsinmorphologically-relatedwords(GoldwaterandJohnson,2004;CreutzandLagus,2007;SnyderandBarzilay,2008;Poonetal.,2009).Intheseapproaches,wordsarecom-monlymodeledasconcatenationsofmorphemes.1Codeisavailableathttps://github.com/karthikncode/MorphoChain.Thismorpheme-centricviewiswell-suitedforun-coveringdistributionalpropertiesofstemsandaf-fixes.Butitisnotwell-equippedtocapturesemanticrelatednessatthewordlevel.Incontrast,earlierapproachesthatcapturese-manticsimilarityinmorphologicalvariantsoper-atesolelyatthewordlevel(SchoneandJuraf-sky,2000;Baronietal.,2002).Giventwocandi-datewords,theproximityisassessedusingstandardword-distributionalmeasuressuchasmutualinfor-mation.However,thefactthatthesemodelsdonotmodelmorphemesdirectlygreatlylimitstheirper-formance.Inthi
spaper,weproposeamodeltointegrateor-thographicandsemanticviews.Ourgoalistobuildachainofderivationsforacurrentwordfromitsbaseform.Forinstance,givenawordplayfully,thecorrespondingchainisplay→playful→playfully.Thewordplayisabaseformofthisderivationasitcannotbereducedanyfurther.Individualderiva-tionsareobtainedbyaddingamorpheme(ex.-ful)toaparentword(ex.play).Thisadditionmaybeimplementedviaasimpleconcatenation,oritmayinvolvetransformations.Ateverystepofthechain,themodelaimstofindaparent-childpair(ex.play-playful)suchthattheparentalsoconstitutesavalidentryinthelexicon.Thisallowsthemodeltodi-rectlycomparethesemanticsimilarityoftheparent-childpair,whilealsoconsideringtheorthographicpropertiesofthemorphemiccombination.Wemodeleachstepofamorphologicalchainbymeansofalog-linearmodelthatenablesustoin-corporateawiderangeoffeatures.Attheseman-ticlevel,weconsidertherelatednessbetweentwowordsusingthecorrespondingvectorembeddings.Attheorthographiclevel,featurescapturewhether


An Unsupervised Method for Uncovering Morphological Chains

Karthik Narasimhan, Regina Barzilay and Tommi Jaakkola
CSAIL, Massachusetts Institute of Technology
{karthikn, regina, tommi}@csail.mit.edu

Abstract

Most state-of-the-art systems today produce morphological analysis based only on orthographic patterns. In contrast, we propose a model for unsupervised morphological analysis that integrates orthographic and semantic views of words. We model word formation in terms of morphological chains, from base words to the observed words, breaking the chains into parent-child relations. We use log-linear models with morpheme and word-level features to predict possible parents, including their modifications, for each word. The limited set of candidate parents for each word renders contrastive estimation feasible. Our model consistently matches or outperforms five state-of-the-art systems on Arabic, English and Turkish.¹

1 Introduction

Morphologically related words exhibit connections at multiple levels, ranging from orthographic patterns to semantic proximity. For instance, the words playing and played share the same stem, but also carry similar meaning. Ideally, all these complementary sources of information would be taken into account when learning morphological structures.

Most state-of-the-art unsupervised approaches to morphological analysis are built primarily around orthographic patterns in morphologically related words (Goldwater and Johnson, 2004; Creutz and Lagus, 2007; Snyder and Barzilay, 2008; Poon et al., 2009). In these approaches, words are commonly modeled as concatenations of morphemes. This morpheme-centric view is well suited for uncovering distributional properties of stems and affixes, but it is not well equipped to capture semantic relatedness at the word level.

¹Code is available at https://github.com/karthikncode/MorphoChain.

In contrast, earlier approaches that capture semantic similarity in morphological variants operate solely at the word level (Schone and Jurafsky, 2000; Baroni et al., 2002). Given two candidate words, their proximity is assessed using standard word-distributional measures such as mutual information. However, the fact that these models do not model morphemes directly greatly limits their performance.

In this paper, we propose a model to integrate orthographic and semantic views. Our goal is to build a chain of derivations for a current word from its base form. For instance, given the word playfully, the corresponding chain is play → playful → playfully. The word play is the base form of this derivation, as it cannot be reduced any further. Individual derivations are obtained by adding a morpheme (e.g., -ful) to a parent word (e.g., play). This addition may be implemented via a simple concatenation, or it may involve transformations. At every step of the chain, the model aims to find a parent-child pair (e.g., play-playful) such that the parent also constitutes a valid entry in the lexicon. This allows the model to directly compare the semantic similarity of the parent-child pair, while also considering the orthographic properties of the morphemic combination.

We model each step of a morphological chain by means of a log-linear model that enables us to incorporate a wide range of features. At the semantic level, we consider the relatedness between two words using the corresponding vector embeddings. At the orthographic level, features capture whether the words in the chain actually occur in the corpus, how affixes are reused, as well as how the words are altered during the addition of morphemes. We use Contrastive Estimation (Smith and Eisner, 2005) to efficiently learn this model in an unsupervised manner. Specifically, we require that each word has greater support among its bounded set of candidate parents than an artificially constructed neighboring word would.

We evaluate our model on datasets in three languages: Arabic, English and Turkish. We compare our performance against five state-of-the-art unsupervised systems: Morfessor Baseline (Virpioja et al., 2013), Morfessor CatMAP (Creutz and Lagus, 2005), AGMorph (Sirts and Goldwater, 2013), the Lee Segmenter (Lee et al., 2011; Stallard et al., 2012) and the system of Poon et al. (2009). Our model consistently equals or outperforms these systems across the three languages. For instance, on English, we obtain an 8.5% gain in F-measure over Morfessor. Our experiments also demonstrate the value of semantic information. While the contribution varies from 3% on Turkish to 11% on the English dataset, it nevertheless improves performance across all the languages.

2 Related Work

Currently, top-performing unsupervised morphological analyzers are based on the orthographic properties of sub-word units (Creutz and Lagus, 2005; Creutz and Lagus, 2007; Poon et al., 2009; Sirts and Goldwater, 2013). Adding semantic information to these systems is not an easy task, as they operate at the level of individual morphemes, rather than morphologically related words.

The value of semantic information has been demonstrated in earlier work on morphological analysis. Schone and Jurafsky (2000) employ an LSA-based similarity measure to identify morphological variants from a list of orthographically close word pairs. The filtered pairs are then used to identify stems and affixes. Based on similar intuition, Baroni et al. (2002) design a method that integrates these sources of information, captured as two word-pair lists ranked based on edit distance and mutual information. These lists are subsequently combined using a deterministic weighting function.

In both of these algorithms, orthographic relatedness is based on simple deterministic rules. Therefore, semantic relatedness plays an essential role in the success of these methods. However, these algorithms do not capture distributional properties of morphemes that are critical to the success of current state-of-the-art algorithms. In contrast, we utilize a single statistical framework that seamlessly combines both sources of information. Moreover, it allows us to incorporate a wide range of additional features.

Our work also relates to the log-linear model for morphological segmentation developed by Poon et al. (2009). They propose a joint model over all words (observations) and their segmentations (hidden), using morphemes and their contexts (character n-grams) as features. Since the space of all possible segmentation sets is huge, learning and inference are quite involved; they use techniques like Contrastive Estimation, sampling and simulated annealing. In contrast, our formulation does not result in such a large search space. For each word, the number of parent candidates is bounded by its length multiplied by the number of possible transformations. Therefore, Contrastive Estimation can be implemented via enumeration, and does not require sampling. Moreover, operating at the level of words (rather than morphemes) enables us to incorporate semantic and word-level features.

Most recently, work by Sirts and Goldwater (2013) uses Adaptor Grammars for minimally supervised segmentation. By defining a morphological grammar consisting of zero or more prefixes, stems and suffixes, they induce segmentations over words in both unsupervised and semi-supervised settings. While their model (AGMorph) builds up a word by combining morphemes in the form of a parse tree, we operate at the word level and build up the final word via intermediate words in the chain.

In other related work, Dreyer and Eisner (2011) tackle the problem of recovering morphological paradigms and inflectional principles. They use a Bayesian generative model with a log-linear framework, using expressive features over pairs of strings. Their work, however, handles a different task from ours and requires a small amount of annotated data to seed the model.

In this work, we make use of semantic information to help morphological analysis. Lee et al. (2011) present a model that takes advantage of syntactic context to perform better morphological segmentation. Stallard et al. (2012) improve on this approach using the technique of Maximum Marginal decoding to reduce noise. Their best system considers entire sentences, while our approach (and the morphological analyzers described above) operates at the vocabulary level without regarding sentence context. Hence, though their work is not directly comparable to ours, it presents an interesting orthogonal view of the problem.

3 Model

3.1 Definitions and Framework

We use morphological chains to model words in the language. A morphological chain is a short sequence of words that starts from the base word and ends up in a morphological variant. Each node in the chain is, by assumption, a valid word. We refer to the word that is morphologically changed as the parent word and its morphological variant as the child word. A word that does not have any morphological parent is a base word (e.g., words like play, chat, run).²

Words in a chain (other than the base word) are created from their parents by adding morphemes (prefixes, suffixes, or other words). For example, a morphological chain that ends up in the word internationally could be nation → national → international → internationally. The base word for this chain is nation. Note that the same word can belong to multiple morphological chains. For example, the word national also appears as part of another chain that ends up in nationalize. These chains are treated separately but with shared statistical support for the common parts. For this reason, our model breaks morphological chains into possible parent-child relations such as (nation, national).

We use a log-linear model for predicting parent-child pairs. A log-linear model allows an easy, efficient way of incorporating several different features pertaining to parent-child relations. In our case, we leverage both orthographic and semantic patterns to encode representative features.

²We distinguish base words from morphological roots, which do not, strictly speaking, have to be valid words in the language.

Table 1: Cosine similarities between word vectors of various segments of the word player and the vector of player.

Segment   Cosine Similarity
p          0.095
pl        -0.037
pla       -0.041
play       0.580
playe      0.000
player     1.000

A log-linear model consists of a set of features represented by a feature vector φ : W × Z → R^d and a corresponding weight vector θ ∈ R^d. Here, W is a set of words and Z is the set of candidates for words in W, which includes the parents as well as their types. Specifically, a candidate is a (parent, type) pair, where the type variable keeps track of the type of morphological change (or the lack thereof if there is no parent) as we go from the parent to the child. In our experiments, Z is obtained by collecting together all sub-words created by splitting observed words in W at all different points. For instance, if we take the word cars, the candidates obtained by splitting would include (car, Suffix), (ca, Suffix), (c, Suffix), (ars, Prefix), (rs, Prefix) and (s, Prefix). Note that the parent may undergo changes as it is joined with the affix, and thus there are more choices for the parent than just the ones obtained by splitting. Hence, to the set of candidates, we also add modified sub-words, where transformations include character repetition (plan → planning), deletion (decide → deciding) or replacement (carry → carried).³ Following the above example for the word cars, we get candidates like (cat, Modify) and (cart, Delete). Each word also has a stop candidate (-, Stop), which is equivalent to considering it as a base word with no parent.

Let us define the probability of a particular word-candidate pair (w ∈ W, z ∈ Z) as P(w, z) ∝ e^(θ·φ(w,z)). The conditional probability of a candidate

³We found that restricting the set of parents to sub-words that are at least half the length of the original word helped improve the performance of the system.
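The candidate set described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper name `candidates` is ours, and we apply the half-length restriction from footnote 3 to suffix parents, so short parents like (s, Prefix) from the running example are filtered out here.

```python
def candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Enumerate (parent, type) candidates for `word`: splits at every
    position, plus parents recovered by undoing repetition, deletion and
    replacement transformations, plus the stop candidate."""
    cands = [("-", "Stop")]  # every word may be a base word with no parent
    n = len(word)
    for i in range(1, n):
        prefix, suffix = word[:i], word[i:]
        if len(prefix) * 2 >= n:  # parents must be at least half the word's length
            cands.append((prefix, "Suffix"))
            # undo character repetition: planning -> plan(n) + ing
            if len(prefix) >= 2 and prefix[-1] == prefix[-2]:
                cands.append((prefix[:-1], "Repeat"))
            # undo deletion: restore a character dropped before the affix
            for c in alphabet:
                cands.append((prefix + c, "Delete"))
            # undo replacement: swap the parent's final character back
            for c in alphabet:
                if c != prefix[-1]:
                    cands.append((prefix[:-1] + c, "Modify"))
        if len(suffix) * 2 >= n:
            cands.append((suffix, "Prefix"))
    return cands
```

On the paper's example cars, this sketch reproduces the candidates (car, Suffix), (cart, Delete) and (cat, Modify), keeping the enumeration small enough for exhaustive scoring.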


z given a word w is then

P(z|w) = e^(θ·φ(w,z)) / Σ_{z′∈C(w)} e^(θ·φ(w,z′)),   z ∈ C(w)

where C(w) ⊂ Z refers to the set of possible candidates (parents and their types) for the word w ∈ W. In order to generate a possible ancestral chain for a word, we recursively predict parents until the word is predicted to be a base word. In our model, these choices are included in the set of candidates for the specific word, and their likelihood is controlled by the type-related features. Details of these and other features are given in Section 3.2.

3.2 Features

This section provides an overview of the features used in our model. The features are defined for a given word w, parent p and type t (recall that a candidate z ∈ Z is the pair (p, t)). For computing some of these features, we use an unannotated list of words with frequencies (details in Section 4). Table 2 provides a summary of the features.

Semantic Similarity  We hypothesize that morphologically related words exhibit semantic similarity. To this end, we introduce a feature that measures the cosine similarity between the word vectors of the word (w⃗) and the parent (p⃗). These word vectors are learned from co-occurrence patterns in a large corpus⁴ (see Section 4 for details). To validate this measure, we computed the cosine similarity between words and their morphological parents from the CELEX2 database (Baayen et al., 1995). On average, the resulting word-parent similarity score is 0.351, compared to 0.116 for randomly chosen word-parent combinations.⁵

⁴For strings which do not have a vector learnt from the corpus, we set the cosine value to -0.5.
⁵The cosine values usually range from around -0.1 to 0.7.

Affixes  A distinctive feature of affixes is their frequent occurrence in multiple words. To capture this pattern, we automatically generate a list of frequently occurring candidate affixes. These candidates are collected by considering the string difference between a word and its parent candidates which appear in the word list. For example, for the word paints, possible suffixes include -s derived from the parent paint, -ts from the parent pain and -ints from the word pa. Similarly, we compile a list of potential prefixes. These two lists are sorted by their frequency and thresholded. For each affix in the lists, we have a corresponding indicator variable. For unseen affixes, we use an UNK (unknown) indicator. These automatically constructed lists act as a proxy for the gold affixes. In English, choosing the top 100 suffixes in this manner gives us 43 correct suffixes (compared against gold suffixes). Table 3 gives some examples of suffixes generated this way.

Table 3: Top ten suffixes in automatically produced suffix lists.

Language   Top suffixes
English    -s, -'s, -d, -ed, -ing, -', -s', -ly, -er, -e
Turkish    -n, -i, -lar, -dir, -a, -den, -de, -in, -leri, -ler
Arabic     -p, -A, -F, -y, -t, -AF, -h, -hA, -yp, -At

Affix Correlation  While the previous feature considers one affix assignment at a time, there is a known correlation between affixes attached to the same stem. For instance, in English, verbs that can be modified by the suffix -ing can also take the related suffix -ed. Therefore, we introduce a feature that measures whether, for a given affix and parent, we also observe in the word list the same parent modified by its related affix. For example, for the pair (walking, walk), the feature instance AffixCorr(ing, ed) is set to 1, because the word walked is in the word list. To construct pairs of related affixes, we compute the correlation between pairs in the auto-generated affix list described previously. This correlation is proportional to the number of stems the two affixes share. For English, examples of such pairs include (inter-, re-), (under-, over-), (-ly, -s), and (-er, -ing).

Presence in Wordlist  We want to bias the model to select parents that constitute valid words.⁶ Moreover, we would like to take into account the frequency of the parent words. We encode this information as the logarithm of their word counts in the word list (WordFreq). For parents not in the word list, we set a binary OutOfVocab feature to 1.

⁶This is not an absolute requirement in the model.
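The affix-candidate collection described above can be sketched as follows. This is an illustrative reconstruction under our own assumptions (function name, counting scheme), not the authors' exact code: for each word, any shorter in-vocabulary word that it starts or ends with contributes the string difference as a suffix or prefix candidate.

```python
from collections import Counter

def affix_candidates(wordlist):
    """Count candidate suffixes and prefixes as string differences between
    each word and in-vocabulary sub-words at either end."""
    vocab = set(wordlist)
    suffixes, prefixes = Counter(), Counter()
    for w in vocab:
        for i in range(1, len(w)):
            left, right = w[:i], w[i:]
            if left in vocab:        # paints = paint + -s, pain + -ts
                suffixes["-" + right] += 1
            if right in vocab:       # redo = re- + do
                prefixes[left + "-"] += 1
    return suffixes, prefixes
```

Sorting the two counters by frequency and keeping the top entries would then yield the thresholded affix lists used for the indicator features.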


Table 2: Summary of the features used in our model, with example feature instances and their values.

Feature type        Word (w)    Candidate (p, t)     Feature                        Value
Cosine              painter     (paint, Suffix)      w⃗ · p⃗                          0.58
Affix               painter     (paint, Suffix)      suffix = er                    1
Affix Correlation   walking     (walk, Suffix)       AffixCorr(ing, ed)             1
Wordlist            painter     (paint, Suffix)      WordFreq                       8.73
                                                     OutOfVocab                     0
Transformations     planning    (plan, Repeat)       type=Repeat × chars=(n, -)     1
                    deciding    (decide, Delete)     type=Delete × chars=(e, -)     1
                    carried     (carry, Modify)      type=Modify × chars=(y, i)     1
Stop                painter     (-, Stop)            begin=pa                       1
                                                     end=er                         1

3.3 Learning

We employ Contrastive Estimation (Smith and Eisner, 2005) and replace the normalization term by a sum over the neighbors of each word. For each word in the language, we create neighboring strings in two sets. For the first set, we transpose a single pair of adjacent characters of the word. We perform this transposition over the first k or the last k characters of the word.⁹ For the second set, we transpose two pairs of characters simultaneously, one from the first k characters and one from the last k.

⁹The performance increases with increasing k until k = 5, after which no gains were observed.

The combined set of artificially constructed words represents the events that we wish to move probability mass away from, in favor of the actually observed words. The neighbors facilitate the learning of good weights for the affix features by providing the required contrast (at both ends of the words) to the actual words in the vocabulary. A remaining concern is that the model may not distinguish an arbitrary substring from a good suffix/prefix. For example, -ng appears in all the words that end with -ing, and could be considered a valid suffix. We include other features to help make this distinction. Specifically, we include features such as word vector similarity and the presence of the parent in the observed word list. For example, in the word painting, the parent candidate paint is likely to occur and also has a high cosine similarity with painting in terms of their word vectors. In contrast, painti does not.

Given the list of words and their neighborhoods, we define the contrastive likelihood as follows:

L_CE(θ, D) = Π_{w*∈D} [ Σ_{z∈C(w*)} e^(θ·φ(w*,z)) / Σ_{w∈N(w*)} Σ_{z∈C(w)} e^(θ·φ(w,z)) ]   (2)

where N(w*) is the neighborhood of w*. This likelihood is much easier to evaluate and optimize. After adding in a standard regularization term, we maximize the following log-likelihood objective:

Σ_{w*∈D} [ log Σ_{z∈C(w*)} e^(θ·φ(w*,z)) − log Σ_{w∈N(w*)} Σ_{z∈C(w)} e^(θ·φ(w,z)) ] − λ||θ||²   (3)

The corresponding gradient can be derived as:

∂L_CE(θ; D)/∂θ_j = Σ_{w*∈D} [ Σ_{z∈C(w*)} φ_j(w*,z) · e^(θ·φ(w*,z)) / Σ_{z∈C(w*)} e^(θ·φ(w*,z)) − Σ_{w∈N(w*)} Σ_{z∈C(w)} φ_j(w,z) · e^(θ·φ(w,z)) / Σ_{w∈N(w*)} Σ_{z∈C(w)} e^(θ·φ(w,z)) ] − 2λθ_j   (4)

We use L-BFGS-B (Byrd et al., 1995) to optimize L_CE(θ; D) with the gradients given above.

3.4 Prediction

Given a test word, we predict a morphological chain in a greedy, step-by-step fashion. In each step, we use the learnt weights to predict the best parent for the current word (from the set of candidates), or choose to stop and declare the current word a base word if the stop case has the highest score. Once we have the chain, we can derive a morphological segmentation by inserting a segmentation point (into the test word) appropriately for each edge in the chain. Algorithms 1 and 2 provide details on the prediction procedure. In both algorithms, type refers to the type of modification (or lack thereof) that the parent undergoes: Prefix/Suffix addition, types of transformation like repetition, deletion or modification, or the Stop case.

Algorithm 1 Procedure to predict a parent for a word
1: procedure PREDICT(word)
2:   candidates ← CANDIDATES(word)
3:   bestScore ← 0
4:   bestCand ← (−, STOP)
5:   for cand ∈ candidates do
6:     features ← FEATURES(word, cand)
7:     score ← MODELSCORE(features)
8:     if score > bestScore then
9:       bestScore ← score
10:      bestCand ← cand
11:  return bestCand
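The transposition neighborhoods used for Contrastive Estimation can be sketched as below. This is our own minimal reading of the description in Section 3.3 (with k = 5 per footnote 9); details such as how overlapping transposition pairs are handled are assumptions, not the authors' code.

```python
from itertools import product

def transpose(w, i):
    """Swap the adjacent characters at positions i and i+1."""
    return w[:i] + w[i + 1] + w[i] + w[i + 2:]

def neighbors(word, k=5):
    """Neighborhood N(word): single transpositions within the first k or
    last k characters, plus simultaneous front+back transposition pairs."""
    n = len(word)
    front = list(range(min(k, n - 1)))         # transposition positions near the start
    back = list(range(max(0, n - k), n - 1))   # transposition positions near the end
    single = {transpose(word, i) for i in front + back}
    double = {transpose(transpose(word, i), j)
              for i, j in product(front, back)
              if abs(i - j) > 1}               # keep the two swapped pairs disjoint
    return (single | double) - {word}
```

Each observed word's probability mass is then contrasted against the pooled candidates of these perturbed strings, as in Equations (2) and (3).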


Algorithm 2 Procedure to predict a morphological chain
1: procedure GETCHAIN(word)
2:   candidate ← PREDICT(word)
3:   parent, type ← candidate
4:   if type = STOP then return [(word, STOP)]
5:   return GETCHAIN(parent) + [(parent, type)]

Table 4: Data corpora and statistics. MC-10 = MorphoChallenge 2010¹⁰, MC-05:10 = MorphoChallenges 2005-10 (aggregated), BOUN = BOUN corpus (Sak et al., 2008), Gigaword = Arabic Gigaword corpus (Parker et al., 2011), ATB = Arabic Treebank (Maamouri et al., 2003).

Lang      Train (#words)      Test (#words)      WordVec (#words)
English   MC-10 (878K)        MC-05:10 (2218)    Wikipedia (129M)
Turkish   MC-10 (617K)        MC-05:10 (2534)    BOUN (361M)
Arabic    Gigaword (3.83M)    ATB (119K)         Gigaword (1.22G)

4 Experimental Setup

Data  We run experiments on three different languages: English, Turkish and Arabic. For each language, we utilize corpora for training, testing and learning word vectors. The training data consists of an unannotated word list with frequency information, while the test data is a set of gold morphological segmentations. For the word vectors, we train the word2vec tool (Mikolov et al., 2013) on large text corpora and obtain 200-dimensional vectors for all three languages. Table 4 provides information about each dataset.

¹⁰http://research.ics.aalto.fi/events/morphochallenge/

Evaluation measure  We test our model on the task of morphological segmentation. We evaluate performance on individual segmentation points, which is standard for this task (Virpioja et al., 2011). We compare predicted segmentations against the gold test data for each language and report overall Precision, Recall and F-1 scores calculated across all segmentation points in the data. As is common in unsupervised segmentation (Poon et al., 2009; Sirts and Goldwater, 2013), we included the test words (without their segmentations) with the training words during parameter learning.

Baselines  We compare our model with five other systems: Morfessor Baseline (Morf-Base), Morfessor CatMap (Morf-Cat), AGMorph, the Lee Segmenter and the system of Poon et al. (2009). Morfessor has achieved excellent performance on the MorphoChallenge dataset, and is widely used for performing unsupervised morphological analysis on various languages, even in fairly recent work (Luong et al., 2013). In our experiments, we employ two variants of the system because their relative performance varies across languages. We use publicly available implementations of these variants (Virpioja et al., 2013; Creutz and Lagus, 2005). We perform several runs with various parameters, and choose the run with the best performance on each language.

We evaluate AGMorph by directly obtaining the posterior grammars from the authors.¹¹ We show results for the Compounding grammar, which we find has the best average performance over the languages. The Lee Segmenter (Lee et al., 2011), improved upon by using Maximum Marginal decoding in Stallard et al. (2012), has achieved excellent performance on the Arabic (ATB) dataset. We perform comparison experiments with model 2 (M2) of the segmenter, which employs latent POS tags and does not require sentence context, which is not available for the other languages in the dataset. We obtained the code for the system and run it on our English and Turkish datasets.¹² We do not have access to an implementation of Poon et al.'s system; hence, we directly report scores from their paper on the ATB dataset and test our model on the same data.

¹¹The grammars were trained using data we provided to them.
¹²We report numbers on Arabic directly from their paper.

5 Results

Table 5 details the performance of the various models on the segmentation task. We can see that our method outperforms both variants of Morfessor, with an absolute gain of 8.5%, 5.1% and 5% in F-score on English, Turkish and Arabic, respectively. On Arabic, we obtain a 2.2% absolute improvement over Poon et al.'s model. AGMorph doesn't segment better than Morfessor on English and Arabic but does very well on Turkish (60.9% F1 compared to our model's 61.2%). This could be due to the fact that the Compounding grammar is well suited to the agglutinative morphology of Turkish and hence provides more gains than for English and Arabic. The Lee Segmenter (M2) performs the best on Arabic (82% F1), but lags behind on English and Turkish. This result is consistent with the fact that the system was optimized for Arabic.

Table 5: Results on unsupervised morphological segmentation; scores are calculated across all segmentation points in the test data. Baselines are in italics. -C = without cosine features, -T = without transformation features, -A = without affix correlation features. Numbers on Arabic for Poon et al. and Lee (M2) are reported directly from their papers.

Lang      Method        Prec     Recall   F-1
English   Morf-Base     0.740    0.623    0.677
          Morf-Cat      0.673    0.587    0.627
          AGMorph       0.696    0.604    0.647
          Lee (M2)      0.825    0.525    0.642
          Model -C      0.555    0.792    0.653
          Model -T      0.831    0.664    0.738
          Model -A      0.810    0.713    0.758
          Full model    0.807    0.722    0.762
Turkish   Morf-Base     0.827    0.362    0.504
          Morf-Cat      0.522    0.607    0.561
          AGMorph       0.878    0.466    0.609
          Lee (M2)      0.787    0.355    0.489
          Model -C      0.516    0.652    0.576
          Model -T      0.665    0.521    0.584
          Model -A      0.668    0.543    0.599
          Full model    0.743    0.520    0.612
Arabic    Morf-Base     0.807    0.204    0.326
          Morf-Cat      0.774    0.726    0.749
          AGMorph       0.672    0.761    0.713
          Poon et al.   0.885    0.692    0.777
          Lee (M2)      –        –        0.820
          Model -C      0.626    0.912    0.743
          Model -T      0.774    0.807    0.790
          Model -A      0.775    0.808    0.791
          Full model    0.770    0.831    0.799

The table also demonstrates the importance of the added semantic information in our model. For all three languages, having the features that utilize cosine similarity provides a significant boost in performance. We also see that the transformation features and affix correlation features play a role in improving the results, although a less important one.

Next, we study the effect of data quality on the prediction of the algorithm. A training set often contains misspellings, abbreviations and truncated words. Thresholding based on frequency is commonly used to reduce this noise. Figure 1 shows the performance of the algorithm as a function of the data size obtained at various degrees of thresholding. We note that the performance of the model on all three languages remains quite stable from about 1000 to 10000 training words, after which the deviations are more apparent. The plot also demonstrates that the model works well even with a small amount of quality data (≈3000 most frequent words).

Figure 1: Model performance vs. data size obtained by frequency thresholding.

Error analysis  We look at a random subset of 50 incorrectly segmented words¹³ in the model's output for each language. Table 7 gives a breakup of errors in all 3 languages due to over- or under-segmentation. Table 6 provides examples of correct and incorrect segmentations predicted by our model.

¹³Words with at least one segmentation point incorrect.
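The segmentation-point evaluation described in Section 4 can be sketched as follows. Function names are illustrative; this is the standard pooled boundary precision/recall computation, not the official MorphoChallenge script.

```python
def seg_points(morphemes):
    """Boundary positions of a segmentation given as a list of morphemes,
    e.g. ["paint", "ing"] -> {5}."""
    points, pos = set(), 0
    for morph in morphemes[:-1]:
        pos += len(morph)
        points.add(pos)
    return points

def prf1(gold, pred):
    """Precision, recall and F-1 over segmentation points, pooled across
    all words in the test set."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gp, pp = seg_points(g), seg_points(p)
        tp += len(gp & pp)   # boundaries predicted and in gold
        fp += len(pp - gp)   # boundaries predicted but not in gold
        fn += len(gp - pp)   # gold boundaries that were missed
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For example, scoring the prediction walke-d against the gold walk-ed yields one false positive and one false negative, so a word segmented entirely at the wrong point hurts both precision and recall.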

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
3
0
1
5
6
6
7
3
8

/

/
t

je

un
c
_
un
_
0
0
1
3
0
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

165

Table 6: Examples of correct and incorrect segmentations produced by our model on the three languages. Correct segmentations are taken directly from gold MorphoChallenge data.

Language   Correct Segmentations                 Incorrect Segmentations
           Word / Segmentation                   Word / Predicted / Correct
English    salvoes / salvo-es                    contempt / con-tempt / contempt
           negotiations / negotiat-ion-s         sterilizing / steriliz-ing / steril-iz-ing
           telephotograph / tele-photo-graph     desolating / desolating / desolat-ing
           unequivocally / un-equivocal-ly       storerooms / storeroom-s / store-room-s
           carsickness's / car-sick-ness-'s      tattlers / tattler-s / tattl-er-s
Turkish    moderni / modern-i                    mektuplaşmalar / mektuplaşma-lar / mektup-laş-ma-lar
           teknolojideki / teknoloji-de-ki       gelecektiniz / gelecek-tiniz / gel-ecek-ti-niz
           burasıydı / bura-sı-ydı               aynalardan / ayna-lar-da-n / ayna-lar-dan
           çizgisine / çiz-gi-si-ne              uyuduğunuzu / uyudu-ğu-nuzu / uyu-duğ-unuz-u
           değişiklikte / değişik-lik-te         dirseğe / dirseğe / dirseğ-e
Arabic     sy$Ark / s-y-$Ark                     wryfAldw / w-ry-fAldw / w-ryfAldw
           nyqwsyA / nyqwsyA                     bHlwlhA / b-Hlwl-h-A / b-Hlwl-hA
           AlmTrwHp / Al-mTrwH-p                 jnwby / jnwb-y / jnwby
           ytEAmlwA / y-tEAml-wA                 wbAyrn / w-bAyr-n / w-bAyrn
           lAtnZr / lA-t-nZr                     rknyp / rknyp / rkny-p

Table 7: Types of errors in analysis of 50 randomly sampled incorrect segmentations for each language. The remaining errors are due to incorrect placement of segmentation points.

Lang      Over-segment   Under-segment
English   10%            86%
Turkish   12%            78%
Arabic    60%            40%

In English, most errors are due to under-segmentation of a word. We find that around 60% of errors are due to roots that undergo transformations while morphing into a variant (see Table 6 for examples). Errors in Turkish are also mostly due to under-segmentation. On further investigation, we find that most such errors (58% of the 78%) are due to parent words either not being in the vocabulary or having a very low word count (≤10). In contrast, we observe a majority of over-segmentation errors in Arabic (60%). This is likely because Arabic has more single-character affixes than the other languages. We find that 56% of errors in Arabic involve a single-character affix, which is much higher than the 24.6% that involve a two-letter affix. In contrast, 25% of errors in English are due to single-character affixes, around the same number as the 24% of errors due to two-letter affixes.

Since our model is an unsupervised one, we make several simplifying assumptions to keep the candidate set size manageable for learning. For instance, we do not explicitly model infixes, since we select parent candidates by only modifying the ends of a word. Also, the root-template morphology of Arabic, a Semitic language, presents a complexity we do not directly handle. For instance, words in Arabic can be formed using specific patterns, known as binyanim (e.g., nZr → yntZr). However, on going through the errors, we find that only 14% are due to these binyanim patterns not being captured.¹⁴ Adding in transformation rules to capture these types of language-specific patterns can help increase both chain and segmentation accuracy.

¹⁴This might be due to the fact that the gold segmentations also do not capture such patterns. For example, the gold segmentation for yntZrwn is given as y-ntZr-wn, even though ntZr is not a valid root.

Analysis of learned distributions  To investigate how decisive the learnt model is, we examine the final probability distribution P(z|w) of parent candidates for the words in the English word list. We observe that the probability of the best candidate (max_z P(z|w)), averaged over all words, is 0.77. We also find that the average entropy of the distributions is 0.65, which is quite low considering that the average number of candidates is 10.76 per word, which would result in a maximum possible entropy of around 2.37 if the distributions were uniform. This demonstrates that the model tends to prefer a single parent for every word,¹⁵ which is exactly the behavior we want.

¹⁵Note that the candidate probability distribution may have more than a single peak in some cases.

Affix analysis  We also analyze the various affixes produced by the model, and compare them with the gold affixes. In particular, we plot the frequency distributions of the affixes¹⁶ obtained from the gold and predicted segmentations for the English test data in Figure 2. From the figure, we can see that our model learns to identify good affixes for the given language. Several of the top affixes predicted are also present in the gold list, and we also observe similarities in the frequency distributions.

Figure 2: Comparison of gold and predicted frequency distributions of the top 15 affixes for English.

¹⁶To conserve space, we only show the distribution of suffixes here, but we observe a similar trend for prefixes.

6 Conclusion

In this work, we have proposed a discriminative model for unsupervised morphological segmentation that seamlessly integrates orthographic and semantic properties of words. We use morphological chains to model the word formation process and show how to employ the flexibility of log-linear models to incorporate both morpheme and word-level features, while handling transformations of parent words. Our model consistently equals or outperforms five state-of-the-art systems on Arabic, English and Turkish. Future directions of work include using better neighborhood functions for contrastive estimation, exploring other views of the data that could be incorporated, examining better prediction schemes and employing morphological chains in other applications in NLP.

Acknowledgements

We thank Kairit Sirts and Yoong Keok Lee for helping run experiments with their unsupervised morphological analyzers, and Yonatan Belinkov for helping with error analysis in Arabic. We also thank the anonymous TACL reviewers and members of MIT's NLP group for their insightful comments and suggestions. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research Laboratory contract number W911NF-12-C-0013. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.


References

R. Baayen, R. Piepenbrock, and L. Gulikers. 1995. CELEX2 LDC96L14. Philadelphia: Linguistic Data Consortium.

Marco Baroni, Johannes Matiasek, and Harald Trost. 2002. Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. CoRR, cs.CL/0205006.

Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190-1208.

Mathias Creutz and Krista Lagus. 2005. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR), pages 106-113.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):3:1-3:34, February.

Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 616-627. Association for Computational Linguistics.

Sharon Goldwater and Mark Johnson. 2004. Priors in Bayesian learning of phonological rules. In Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology (SIGMorPhon '04), pages 35-42, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. 2011. Modeling syntactic context improves morphological segmentation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL '11), pages 1-9, Stroudsburg, PA, USA. Association for Computational Linguistics.

Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL, Sofia, Bulgaria.

Mohamed Maamouri, Ann Bies, Hubert Jin, and Tim Buckwalter. 2003. Arabic Treebank: Part 1 v2.0 LDC2003T06. Philadelphia: Linguistic Data Consortium.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Robert Parker, David Graff, Ke Chen, Junbo Kong, and Kazuaki Maeda. 2011. Arabic Gigaword fifth edition LDC2011T11. Philadelphia: Linguistic Data Consortium.

Hoifung Poon, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL '09), pages 209-217, Stroudsburg, PA, USA. Association for Computational Linguistics.

Haşim Sak, Tunga Güngör, and Murat Saraçlar. 2008. Turkish language resources: Morphological parser, morphological disambiguator and web corpus. In Advances in Natural Language Processing, pages 417-427. Springer.

Patrick Schone and Daniel Jurafsky. 2000. Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning (ConLL '00), pages 67-72, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kairit Sirts and Sharon Goldwater. 2013. Minimally-supervised morphological segmentation using Adaptor Grammars. TACL, 1:255-266.

Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 354-362, Stroudsburg, PA, USA. Association for Computational Linguistics.

Benjamin Snyder and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In The Annual Conference of the Association for Computational Linguistics.

David Stallard, Jacob Devlin, Michael Kayser, Yoong Keok Lee, and Regina Barzilay. 2012. Unsupervised morphology rivals supervised morphology for Arabic MT. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 322-327. Association for Computational Linguistics.

Sami Virpioja, Ville T. Turunen, Sebastian Spiegler, Oskar Kohonen, and Mikko Kurimo. 2011. Empirical comparison of evaluation methods for unsupervised learning of morphology
.TAL,52(2):45–90.SamiVirpioja,PeterSmit,Stig-ArneGr¨onroos,andMikkoKurimo.2013.Morfessor2.0:Pythonimple-mentationandextensionsforMorfessorBaseline.Re-portinAaltoUniversitypublicationseriesSCIENCE+TECHNOLOGY,DepartmentofSignalProcessingandAcoustics,AaltoUniversity,Helsinki,Finlande.