Transactions of the Association for Computational Linguistics, vol. 3, pp. 157–167, 2015. Action Editor: Yuji Matsumoto.
Submission batch: 9/2014; Revision batch: 12/2014; Revision batch: 2/2015; Published 3/2015.
© 2015 Association for Computational Linguistics. Distributed under a CC-BY-NC-SA 4.0 License.
An Unsupervised Method for Uncovering Morphological Chains

Karthik Narasimhan, Regina Barzilay and Tommi Jaakkola
CSAIL, Massachusetts Institute of Technology
{karthikn, regina, tommi}@csail.mit.edu

Abstract

Most state-of-the-art systems today produce morphological analysis based only on orthographic patterns. In contrast, we propose a model for unsupervised morphological analysis that integrates orthographic and semantic views of words. We model word formation in terms of morphological chains, from base words to the observed words, breaking the chains into parent-child relations. We use log-linear models with morpheme and word-level features to predict possible parents, including their modifications, for each word. The limited set of candidate parents for each word renders contrastive estimation feasible. Our model consistently matches or outperforms five state-of-the-art systems on Arabic, English and Turkish.1

1 Code is available at https://github.com/karthikncode/MorphoChain.

1 Introduction

Morphologically related words exhibit connections at multiple levels, ranging from orthographic patterns to semantic proximity. For instance, the words playing and played share the same stem, but also carry similar meaning. Ideally, all these complementary sources of information would be taken into account when learning morphological structures.

Most state-of-the-art unsupervised approaches to morphological analysis are built primarily around orthographic patterns in morphologically-related words (Goldwater and Johnson, 2004; Creutz and Lagus, 2007; Snyder and Barzilay, 2008; Poon et al., 2009). In these approaches, words are commonly modeled as concatenations of morphemes. This morpheme-centric view is well-suited for uncovering distributional properties of stems and affixes. But it is not well-equipped to capture semantic relatedness at the word level.

In contrast, earlier approaches that capture semantic similarity in morphological variants operate solely at the word level (Schone and Jurafsky, 2000; Baroni et al., 2002). Given two candidate words, the proximity is assessed using standard word-distributional measures such as mutual information. However, the fact that these models do not model morphemes directly greatly limits their performance.

In this paper, we propose a model to integrate orthographic and semantic views. Our goal is to build a chain of derivations for a current word from its base form. For instance, given a word playfully, the corresponding chain is play → playful → playfully. The word play is a base form of this derivation as it cannot be reduced any further. Individual derivations are obtained by adding a morpheme (ex. -ful) to a parent word (ex. play). This addition may be implemented via a simple concatenation, or it may involve transformations. At every step of the chain, the model aims to find a parent-child pair (ex. play-playful) such that the parent also constitutes a valid entry in the lexicon. This allows the model to directly compare the semantic similarity of the parent-child pair, while also considering the orthographic properties of the morphemic combination.

We model each step of a morphological chain by means of a log-linear model that enables us to incorporate a wide range of features. At the semantic level, we consider the relatedness between two words using the corresponding vector embeddings. At the orthographic level, features capture whether
the words in the chain actually occur in the corpus, how affixes are reused, as well as how the words are altered during the addition of morphemes. We use Contrastive Estimation (Smith and Eisner, 2005) to efficiently learn this model in an unsupervised manner. Specifically, we require that each word has greater support among its bounded set of candidate parents than an artificially constructed neighboring word would.

We evaluate our model on datasets in three languages: Arabic, English and Turkish. We compare our performance against five state-of-the-art unsupervised systems: Morfessor Baseline (Virpioja et al., 2013), Morfessor CatMAP (Creutz and Lagus, 2005), AGMorph (Sirts and Goldwater, 2013), the Lee Segmenter (Lee et al., 2011; Stallard et al., 2012) and the system of Poon et al. (2009). Our model consistently equals or outperforms these systems across the three languages. For instance, on English, we obtain an 8.5% gain in F-measure over Morfessor. Our experiments also demonstrate the value of semantic information. While the contribution varies from 3% on Turkish to 11% on the English dataset, it nevertheless improves performance across all the languages.

2 Related Work

Currently, top performing unsupervised morphological analyzers are based on the orthographic properties of sub-word units (Creutz and Lagus, 2005; Creutz and Lagus, 2007; Poon et al., 2009; Sirts and Goldwater, 2013). Adding semantic information to these systems is not an easy task, as they operate at the level of individual morphemes, rather than morphologically related words.

The value of semantic information has been demonstrated in earlier work on morphological analysis. Schone and Jurafsky (2000) employ an LSA-based similarity measure to identify morphological variants from a list of orthographically close word pairs. The filtered pairs are then used to identify stems and affixes. Based on similar intuition, Baroni et al. (2002) design a method that integrates these sources of information, captured as two word pair lists, ranked based on edit distance and mutual information. These lists are subsequently combined using a deterministic weighting function.

In both of these algorithms, orthographic relatedness is based on simple deterministic rules. Therefore, semantic relatedness plays an essential role in the success of these methods. However, these algorithms do not capture distributional properties of morphemes that are critical to the success of current state-of-the-art algorithms. In contrast, we utilize a single statistical framework that seamlessly combines both sources of information. Moreover, it allows us to incorporate a wide range of additional features.

Our work also relates to the log-linear model for morphological segmentation developed by Poon et al. (2009). They propose a joint model over all words (observations) and their segmentations (hidden), using morphemes and their contexts (character n-grams) for the features. Since the space of all possible segmentation sets is huge, learning and inference are quite involved. They use techniques like Contrastive Estimation, sampling and simulated annealing. In contrast, our formulation does not result in such a large search space. For each word, the number of parent candidates is bounded by its length multiplied by the number of possible transformations. Therefore, Contrastive Estimation can be implemented via enumeration, and does not require sampling. Moreover, operating at the level of words (rather than morphemes) enables us to incorporate semantic and word-level features.

Most recently, work by Sirts and Goldwater (2013) uses Adaptor Grammars for minimally supervised segmentation. By defining a morphological grammar consisting of zero or more prefixes, stems and suffixes, they induce segmentations over words in both unsupervised and semi-supervised settings. While their model (AGMorph) builds up a word by combining morphemes in the form of a parse tree, we operate at the word level and build up the final word via intermediate words in the chain.

In other related work, Dreyer and Eisner (2011) tackle the problem of recovering morphological paradigms and inflectional principles. They use a Bayesian generative model with a log-linear framework, using expressive features, over pairs of strings. Their work, however, handles a different task from ours and requires a small amount of annotated data to seed the model.

In this work, we make use of semantic information to help morphological analysis. Lee et al. (2011) present a model that takes advantage of syntactic context to perform better morphological segmentation. Stallard et al. (2012) improve on this approach using the technique of Maximum Marginal decoding to reduce noise. Their best system considers entire sentences, while our approach (and the morphological analyzers described above) operates at the vocabulary level without regarding sentence context. Hence, though their work is not directly comparable to ours, it presents an interesting orthogonal view to the problem.

3 Model

3.1 Definitions and Framework

We use morphological chains to model words in the language. A morphological chain is a short sequence of words that starts from the base word and ends up in a morphological variant. Each node in the chain is, by assumption, a valid word. We refer to the word that is morphologically changed as the parent word and its morphological variant as the child word. A word that does not have any morphological parent is a base word (e.g., words like play, chat, run).2

Words in a chain (other than the base word) are created from their parents by adding morphemes (prefixes, suffixes, or other words). For example, a morphological chain that ends up in the word internationally could be nation → national → international → internationally. The base word for this chain is nation. Note that the same word can belong to multiple morphological chains. For example, the word national appears also as part of another chain that ends up in nationalize. These chains are treated separately but with shared statistical support for the common parts. For this reason, our model breaks morphological chains into possible parent-child relations such as (nation, national).

We use a log-linear model for predicting parent-child pairs. A log-linear model allows an easy, efficient way of incorporating several different features pertaining to parent-child relations. In our case, we leverage both orthographic and semantic patterns to encode representative features.

2 We distinguish base words from morphological roots, which do not, strictly speaking, have to be valid words in the language.

Segment | Cosine Similarity
p | 0.095
pl | -0.037
pla | -0.041
play | 0.580
playe | 0.000
player | 1.000

Table 1: Cosine similarities between word vectors of various segments of the word player and the vector of player.

A log-linear model consists of a set of features represented by a feature vector φ : W × Z → R^d and a corresponding weight vector θ ∈ R^d. Here, W is a set of words and Z is the set of candidates for words in W, that includes the parents as well as their types. Specifically, a candidate is a (parent, type) pair, where the type variable keeps track of the type of morphological change (or the lack thereof if there is no parent) as we go from the parent to the child. In our experiments, Z is obtained by collecting together all sub-words created by splitting observed words in W at all different points. For instance, if we take the word cars, the candidates obtained by splitting would include (car, Suffix), (ca, Suffix), (c, Suffix), (ars, Prefix), (rs, Prefix) and (s, Prefix). Note that the parent may undergo changes as it is joined with the affix and thus, there are more choices for the parent than just the ones obtained by splitting. Hence, to the set of candidates, we also add modified sub-words, where transformations include character repetition (plan → planning), deletion (decide → deciding) or replacement (carry → carried).3 Following the above example for the word cars, we get candidates like (cat, Modify) and (cart, Delete). Each word also has a stop candidate (-, Stop), which is equivalent to considering it as a base word with no parent.

Let us define the probability of a particular word-candidate pair (w ∈ W, z ∈ Z) as P(w, z) ∝ e^{θ·φ(w,z)}. The conditional probability of a candidate

3 We found that restricting the set of parents to sub-words that are at least half the length of the original word helped improve the performance of the system.
z given a word w is then

P(z | w) = e^{θ·φ(w,z)} / Σ_{z′ ∈ C(w)} e^{θ·φ(w,z′)},   z ∈ C(w)

where C(w) ⊂ Z refers to the set of possible candidates (parents and their types) for the word w ∈ W. In order to generate a possible ancestral chain for a word, we recursively predict parents until the word is predicted to be a base word. In our model, these choices are included in the set of candidates for the specific word, and their likelihood is controlled by the type related features. Details of these and other features are given in section 3.2.

3.2 Features

This section provides an overview of the features used in our model. The features are defined for a given word w, parent p and type t (recall that a candidate z ∈ Z is the pair (p, t)). For computing some of these features, we use an unannotated list of words with frequencies (details in section 4). Table 2 provides a summary of the features.

Semantic Similarity: We hypothesize that morphologically related words exhibit semantic similarity. To this end, we introduce a feature that measures cosine similarity between the word vectors of the word (~w) and the parent (~p). These word vectors are learned from co-occurrence patterns from a large corpus4 (see section 4 for details). To validate this measure, we computed the cosine similarity between words and their morphological parents from the CELEX2 database (Baayen et al., 1995). On average, the resulting word-parent similarity score is 0.351, compared to 0.116 for randomly chosen word-parent combinations.5

4 For strings which do not have a vector learnt from the corpus, we set the cosine value to be -0.5.
5 The cosine values usually range from around -0.1 to 0.7.

Affixes: A distinctive feature of affixes is their frequent occurrence in multiple words. To capture this pattern, we automatically generate a list of frequently occurring candidate affixes. These candidates are collected by considering the string difference between a word and its parent candidates which appear in the word list. For example, for the word paints, possible suffixes include -s derived from the parent paint, -ts from the parent pain and -ints from the word pa. Similarly, we compile a list of potential prefixes. These two lists are sorted by their frequency and thresholded. For each affix in the lists, we have a corresponding indicator variable. For unseen affixes, we use an UNK (unknown) indicator. These automatically constructed lists act as a proxy for the gold affixes. In English, choosing the top 100 suffixes in this manner gives us 43 correct suffixes (compared against gold suffixes). Table 3 gives some examples of suffixes generated this way.

Language | Top suffixes
English | -s, -'s, -d, -ed, -ing, -', -s', -ly, -er, -e
Turkish | -n, -i, -lar, -dir, -a, -den, -de, -in, -leri, -ler
Arabic | -p, -A, -F, -y, -t, -AF, -h, -hA, -yp, -At

Table 3: Top ten suffixes in automatically produced suffix lists.

Affix Correlation: While the previous feature considers one affix assignment at a time, there is a known correlation between affixes attached to the same stem. For instance, in English, verbs that can be modified by the suffix -ing can also take the related suffix -ed. Therefore, we introduce a feature that measures, for a given affix and parent, whether we also observe in the word list the same parent modified by its related affix. For example, for the pair (walking, walk), the feature instance AffixCorr(ing, ed) is set to 1, because the word walked is in the WordList. To construct pairs of related affixes, we compute the correlation between pairs in the auto-generated affix list described previously. This correlation is proportional to the number of stems the two affixes share. For English, examples of such pairs include (inter-, re-), (under-, over-), (-ly, -s), and (-er, -ing).

Presence in Wordlist: We want to bias the model to select parents that constitute valid words.6 Moreover, we would like to take into account the frequency of the parent words. We encode this information as the logarithm of their word counts in the word list (WordFreq). For parents not in the word list, we set a binary OutOfVocab feature to 1.

6 This is not an absolute requirement in the model.
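The parent prediction above can be sketched as a small log-linear model: φ(w, z) is a sparse feature map and P(z | w) is a softmax over the candidate set C(w). This is a minimal illustration, not the authors' implementation: the feature templates, hand-set weights and tiny candidate list below are hypothetical stand-ins for the full feature set of Table 2 (in the paper, θ is learned with Contrastive Estimation).

```python
import math

def conditional_prob(word, candidates, featurize, theta):
    """P(z | w) = exp(theta . phi(w, z)) / sum over z' in C(w) of exp(theta . phi(w, z'))."""
    scores = []
    for z in candidates:
        phi = featurize(word, z)  # sparse feature dict for the (word, candidate) pair
        scores.append(sum(theta.get(f, 0.0) * v for f, v in phi.items()))
    m = max(scores)  # subtract max before exponentiating, for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return {z: e / total for z, e in zip(candidates, exps)}

def featurize(word, z):
    # Hypothetical, heavily simplified feature map: a per-type bias plus a
    # suffix indicator; the real model also uses cosine similarity, affix
    # correlation and word-list features.
    parent, kind = z
    phi = {"bias_" + kind: 1.0}
    if kind == "Suffix":
        phi["suffix=" + word[len(parent):]] = 1.0
    return phi

# Hand-set illustrative weights: reward the frequent suffix -er.
theta = {"suffix=er": 2.0, "bias_Stop": 0.5}
cands = [("paint", "Suffix"), ("painte", "Suffix"), ("-", "Stop")]
probs = conditional_prob("painter", cands, featurize, theta)
best = max(probs, key=probs.get)  # (paint, Suffix) wins under these weights
```

Because each word has only a bounded candidate set, the normalizer is an explicit sum; this is what makes Contrastive Estimation by enumeration feasible, as noted in Section 2.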
Feature type | Word (w) | Candidate (p, t) | Feature | Value
Cosine | painter | (paint, Suffix) | ~w · ~p | 0.58
Affix | painter | (paint, Suffix) | suffix=er | 1
Affix Correlation | walking | (walk, Suffix) | AffixCorr(ing, ed) | 1
Wordlist | painter | (paint, Suffix) | WordFreq | 8.73
 | | | OutOfVocab | 0
Transformations | planning | (plan, Repeat) | type=Repeat × chars=(n, -) | 1
 | deciding | (decide, Delete) | type=Delete × chars=(e, -) | 1
 | carried | (carry, Modify) | type=Modify × chars=(y, i) | 1
Stop | painter | (-, Stop) | begin=pa | 1
 | | | end=er | 1

Table 2: Summary of the features used in our model, with sample word-candidate pairs.
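The candidate sets C(w) used throughout Section 3 (splits at different points, plus Repeat, Delete and Modify transformations and a Stop candidate) can be sketched as follows. This is an illustrative reconstruction, assuming the half-length restriction of footnote 3 and a lowercase ASCII alphabet; the paper's actual candidate generation may differ in details.

```python
def candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Enumerate (parent, type) candidates for a word, as in Section 3.1."""
    cands = {("-", "Stop")}  # treat the word as a base word with no parent
    n = len(word)
    for i in range(n // 2, n):  # parent must be at least half the word's length
        prefix = word[:i]
        cands.add((prefix, "Suffix"))        # cars -> (car, Suffix)
        cands.add((word[n - i:], "Prefix"))  # cars -> (rs, Prefix)
        # Repeat: undo a doubled final consonant, planning -> (plan, Repeat)
        if len(prefix) >= 2 and prefix[-1] == prefix[-2]:
            cands.add((prefix[:-1], "Repeat"))
        for c in alphabet:
            # Delete: restore a dropped character, deciding -> (decide, Delete)
            cands.add((prefix + c, "Delete"))
            # Modify: undo a replaced character, carried -> (carry, Modify)
            cands.add((prefix[:-1] + c, "Modify"))
    return cands
```

On the paper's running example, candidates("cars") contains (car, Suffix) from a plain split, plus modified sub-words such as (cat, Modify) and (cart, Delete); the number of candidates is bounded by the word's length times the number of transformation types (times the alphabet size for the character-level ones), which keeps enumeration cheap.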
Algorithm 2 Procedure to predict a morphological chain
1: procedure GETCHAIN(word)
2:   candidate ← PREDICT(word)
3:   parent, type ← candidate
4:   if type = STOP then return [(word, STOP)]
5:   return GETCHAIN(parent) + [(parent, type)]

Lang | Train (#words) | Test (#words) | WordVec (#words)
English | MC-10 (878K) | MC-05:10 (2218) | Wikipedia (129M)
Turkish | MC-10 (617K) | MC-05:10 (2534) | BOUN (361M)
Arabic | Gigaword (3.83M) | ATB (119K) | Gigaword (1.22G)

Table 4: Data corpora and statistics. MC-10 = MorphoChallenge 2010,10 MC-05:10 = MorphoChallenges 2005-10 (aggregated), BOUN = BOUN corpus (Sak et al., 2008), Gigaword = Arabic Gigaword corpus (Parker et al., 2011), ATB = Arabic Treebank (Maamouri et al., 2003)

4 Experimental Setup

Data: We run experiments on three different languages: English, Turkish and Arabic. For each language, we utilize corpora for training, testing and learning word vectors. The training data consists of an unannotated word list with frequency information, while the test data is a set of gold morphological segmentations. For the word vectors, we train the word2vec tool (Mikolov et al., 2013) on large text corpora and obtain 200-dimensional vectors for all three languages. Table 4 provides information about each dataset.

Evaluation measure: We test our model on the task of morphological segmentation. We evaluate performance on individual segmentation points, which is standard for this task (Virpioja et al., 2011). We compare predicted segmentations against the gold test data for each language and report overall Precision, Recall and F-1 scores calculated across all segmentation points in the data. As is common in unsupervised segmentation (Poon et al., 2009; Sirts and Goldwater, 2013), we included the test words (without their segmentations) with the training words during parameter learning.

10 http://research.ics.aalto.fi/events/morphochallenge/

Baselines: We compare our model with five other systems: Morfessor Baseline (Morf-Base), Morfessor CatMap (Morf-Cat), AGMorph, the Lee Segmenter and the system of Poon et al. (2009). Morfessor has achieved excellent performance on the MorphoChallenge dataset, and is widely used for performing unsupervised morphological analysis on various languages, even in fairly recent work (Luong et al., 2013). In our experiments, we employ two variants of the system because their relative performance varies across languages. We use publicly available implementations of these variants (Virpioja et al., 2013; Creutz and Lagus, 2005). We perform several runs with various parameters, and choose the run with the best performance on each language.

We evaluate AGMorph by directly obtaining the posterior grammars from the authors.11 We show results for the Compounding grammar, which we find has the best average performance over the languages.

The Lee Segmenter (Lee et al., 2011), improved upon by using Maximum Marginal decoding in Stallard et al. (2012), has achieved excellent performance on the Arabic (ATB) dataset. We perform comparison experiments with model 2 (M2) of the segmenter, which employs latent POS tags, and does not require sentence context, which is not available for other languages in the dataset. We obtained the code for the system, and run it on our English and Turkish datasets.12

We do not have access to an implementation of Poon et al.'s system; hence, we directly report scores from their paper on the ATB dataset and test our model on the same data.

11 The grammars were trained using data we provided to them.
12 We report numbers on Arabic directly from their paper.

5 Results

Table 5 details the performance of the various models on the segmentation task. We can see that our method outperforms both variants of Morfessor,
with an absolute gain of 8.5%, 5.1% and 5% in F-score on English, Turkish and Arabic, respectively. On Arabic, we obtain a 2.2% absolute improvement over Poon et al.'s model. AGMorph doesn't segment better than Morfessor on English and Arabic but does very well on Turkish (60.9% F1 compared to our model's 61.2%). This could be due to the fact that the Compounding grammar is well suited to the agglutinative morphology in Turkish and hence provides more gains than for English and Arabic. The Lee Segmenter (M2) performs the best on Arabic (82% F1), but lags behind on English and Turkish. This result is consistent with the fact that the system was optimized for Arabic.

Lang | Method | Prec | Recall | F-1
English | Morf-Base | 0.740 | 0.623 | 0.677
 | Morf-Cat | 0.673 | 0.587 | 0.627
 | AGMorph | 0.696 | 0.604 | 0.647
 | Lee (M2) | 0.825 | 0.525 | 0.642
 | Model -C | 0.555 | 0.792 | 0.653
 | Model -T | 0.831 | 0.664 | 0.738
 | Model -A | 0.810 | 0.713 | 0.758
 | Full model | 0.807 | 0.722 | 0.762
Turkish | Morf-Base | 0.827 | 0.362 | 0.504
 | Morf-Cat | 0.522 | 0.607 | 0.561
 | AGMorph | 0.878 | 0.466 | 0.609
 | Lee (M2) | 0.787 | 0.355 | 0.489
 | Model -C | 0.516 | 0.652 | 0.576
 | Model -T | 0.665 | 0.521 | 0.584
 | Model -A | 0.668 | 0.543 | 0.599
 | Full model | 0.743 | 0.520 | 0.612
Arabic | Morf-Base | 0.807 | 0.204 | 0.326
 | Morf-Cat | 0.774 | 0.726 | 0.749
 | AGMorph | 0.672 | 0.761 | 0.713
 | Poon et al. | 0.885 | 0.692 | 0.777
 | Lee (M2) | – | – | 0.820
 | Model -C | 0.626 | 0.912 | 0.743
 | Model -T | 0.774 | 0.807 | 0.790
 | Model -A | 0.775 | 0.808 | 0.791
 | Full model | 0.770 | 0.831 | 0.799

Table 5: Results on unsupervised morphological segmentation; scores are calculated across all segmentation points in the test data. Baselines are in italics. -C = without cosine features, -T = without transformation features, -A = without affix correlation features. Numbers on Arabic for Poon et al. and Lee (M2) are reported directly from their papers.

The table also demonstrates the importance of the added semantic information in our model. For all three languages, having the features that utilize cosine similarity provides a significant boost in performance. We also see that the transformation features and affix correlation features play a role in improving the results, although a less important one.

Next, we study the effect of data quality on the prediction of the algorithm. A training set often contains misspellings, abbreviations and truncated words. Thresholding based on frequency is commonly used to reduce this noise. Figure 1 shows the performance of the algorithm as a function of the data size obtained at various degrees of thresholding. We note that the performance of the model on all three languages remains quite stable from about 1000 to 10000 training words, after which the deviations are more apparent. The plot also demonstrates that the model works well even with a small amount of quality data (≈ 3000 most frequent words).

Figure 1: Model performance vs data size obtained by frequency thresholding

Error analysis: We look at a random subset of 50 incorrectly segmented words13 in the model's output for each language. Table 7 gives a breakup of errors in all 3 languages due to over- or under-segmentation. Table 6 provides examples of correct and incorrect segmentations predicted by our model.

13 Words with at least one segmentation point incorrect
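The segmentation-point Precision, Recall and F-1 reported in Table 5 can be computed as below: a sketch of the standard boundary-based evaluation (Virpioja et al., 2011), with illustrative segmentations rather than the actual gold data.

```python
def boundary_set(segments):
    """Character positions of morpheme boundaries, e.g. ['play','ful','ly'] -> {4, 7}."""
    points, pos = set(), 0
    for seg in segments[:-1]:  # the word's end is not a boundary
        pos += len(seg)
        points.add(pos)
    return points

def evaluate(predicted, gold):
    """Precision/Recall/F-1 pooled over all segmentation points in the data."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        p, g = boundary_set(pred), boundary_set(ref)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Illustrative example: over-segmenting 'players' costs precision, not recall.
prec, rec, f1 = evaluate([["play", "er", "s"]], [["play", "ers"]])
```

Pooling the counts over all words before computing the ratios (rather than averaging per-word scores) matches the "across all segmentation points" convention stated in the Evaluation measure paragraph.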
Language | Correct segmentations (word → segmentation) | Incorrect segmentations (word: predicted / correct)
English | salvoes → salvo-es | contempt: con-tempt / contempt
 | negotiations → negotiat-ion-s | sterilizing: steriliz-ing / steril-iz-ing
 | telephotograph → tele-photo-graph | desolating: desolating / desolat-ing
 | unequivocally → un-equivocal-ly | storerooms: storeroom-s / store-room-s
 | carsickness's → car-sick-ness-'s | tattlers: tattler-s / tattl-er-s
Turkish | moderni → modern-i | mektuplaşmalar: mektuplaşma-lar / mektup-laş-ma-lar
 | teknolojideki → teknoloji-de-ki | gelecektiniz: gelecek-tiniz / gel-ecek-ti-niz
 | burasıydı → bura-sı-ydı | aynalardan: ayna-lar-da-n / ayna-lar-dan
 | çizgisine → çiz-gi-si-ne | uyuduğunuzu: uyudu-ğu-nuzu / uyu-duğ-unuz-u
 | değişiklikte → değişik-lik-te | dirseğe: dirseğe / dirseğ-e
Arabic | sy$Ark → s-y-$Ark | wryfAldw: w-ry-fAldw / w-ryfAldw
 | nyqwsyA → nyqwsyA | bHlwlhA: b-Hlwl-h-A / b-Hlwl-hA
 | AlmTrwHp → Al-mTrwH-p | jnwby: jnwb-y / jnwby
 | ytEAmlwA → y-tEAml-wA | wbAyrn: w-bAyr-n / w-bAyrn
 | lAtnZr → lA-t-nZr | rknyp: rknyp / rkny-p

Table 6: Examples of correct and incorrect segmentations produced by our model on the three languages. Correct segmentations are taken directly from gold MorphoChallenge data.

Lang | Over-segment | Under-segment
English | 10% | 86%
Turkish | 12% | 78%
Arabic | 60% | 40%

Table 7: Types of errors in analysis of 50 randomly sampled incorrect segmentations for each language. The remaining errors are due to incorrect placement of segmentation points.

In English, most errors are due to under-segmentation of a word. We find that around 60% of errors are due to roots that undergo transformations while morphing into a variant (see table 6 for examples). Errors in Turkish are also mostly due to under-segmentation. On further investigation, we find that most such errors (58% of the 78%) are due to parent words either not in vocabulary or having a very low word count (≤ 10). In contrast, we observe a majority of over-segmentation errors in Arabic (60%). This is likely because of Arabic having more single character affixes than the other languages. We find that 56% of errors in Arabic involve a single-character affix, which is much higher than the 24.6% that involve a two-letter affix. In contrast, 25% of errors in English are due to single character affixes, around the same number as the 24% of errors due to two-letter affixes.

Since our model is an unsupervised one, we make several simplifying assumptions to keep the candidate set size manageable for learning. For instance, we do not explicitly model infixes, since we select parent candidates by only modifying the ends of a word. Also, the root-template morphology of Arabic, a Semitic language, presents a complexity we do not directly handle. For instance, words in Arabic can be formed using specific patterns (known as binyanim) (ex. nZr → yntZr). However, on going through the errors, we find that only 14% are due to these binyanim patterns not being captured.14 Adding in transformation rules to capture these types of language-specific patterns can help increase both chain and segmentation accuracy.

14 This might be due to the fact that the gold segmentations also do not capture such patterns. For example, the gold segmentation for yntZrwn is given as y-ntZr-wn, even though ntZr is not a valid root.

Analysis of learned distributions: To investigate how decisive the learnt model is, we examine the final probability distribution P(z|w) of parent candidates for the words in the English word list. We observe that the probability of the best candidate (max_z P(z|w)), averaged over all words, is 0.77. We also find that the average entropy of the distributions is 0.65, which is quite low considering that the average number of candidates is 10.76 per word, which would result in a max possible entropy of around 2.37 if the distributions were uniform. This demonstrates that the model tends to prefer a single parent for every word,15 which is exactly the behavior we want.

15 Note that the candidate probability distribution may have more than a single peak in some cases.

Affix analysis: We also analyze the various affixes produced by the model, and compare with the gold affixes. Particularly, we plot the frequency distributions of the affixes16 obtained from the gold and predicted segmentations for the English test data in figure 2. From the figure, we can see that our model learns to identify good affixes for the given language. Several of the top affixes predicted are also present in the gold list, and we also observe similarities in the frequency distributions.

16 To conserve space, we only show the distribution of suffixes here, but we observe a similar trend for prefixes.

Figure 2: Comparison of gold and predicted frequency distributions of the top 15 affixes for English

6 Conclusion

In this work, we have proposed a discriminative model for unsupervised morphological segmentation that seamlessly integrates orthographic and semantic properties of words. We use morphological chains to model the word formation process and show how to employ the flexibility of log-linear models to incorporate both morpheme and word-level features, while handling transformations of parent words. Our model consistently equals or outperforms five state-of-the-art systems on Arabic, English and Turkish. Future directions of work include using better neighborhood functions for contrastive estimation, exploring other views of the data that could be incorporated, examining better prediction schemes and employing morphological chains in other applications in NLP.

Acknowledgements

We thank Kairit Sirts and Yoong Keok Lee for helping run experiments with their unsupervised morphological analyzers, and Yonatan Belinkov for helping with error analysis in Arabic. We also thank the anonymous TACL reviewers and members of MIT's NLP group for their insightful comments and suggestions. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research Laboratory contract number W911NF-12-C-0013. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.
References

R. Baayen, R. Piepenbrock, and L. Gulikers. 1995. CELEX2 LDC96L14. Philadelphia: Linguistic Data Consortium.

Marco Baroni, Johannes Matiasek, and Harald Trost. 2002. Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. CoRR, cs.CL/0205006.

Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208.

Mathias Creutz and Krista Lagus. 2005. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR), pages 106–113.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process., 4(1):3:1–3:34, February.

Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a dirichlet process mixture model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 616–627. Association for Computational Linguistics.

Sharon Goldwater and Mark Johnson. 2004. Priors in bayesian learning of phonological rules. In Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology, SIGMorPhon '04, pages 35–42, Stroudsburg, Pennsylvania, USA. Association for Computational Linguistics.

Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. 2011. Modeling syntactic context improves morphological segmentation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL '11, pages 1–9, Stroudsburg, Pennsylvania, USA. Association for Computational Linguistics.

Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL, Sofia, Bulgaria.

Mohamed Maamouri, Ann Bies, Hubert Jin, and Tim Buckwalter. 2003. Arabic Treebank: Part 1 v2.0 LDC2003T06. Philadelphia: Linguistic Data Consortium.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Robert Parker, David Graff, Ke Chen, Junbo Kong, and Kazuaki Maeda. 2011. Arabic Gigaword fifth edition LDC2011T11. Philadelphia: Linguistic Data Consortium.

Hoifung Poon, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 209–217, Stroudsburg, Pennsylvania, USA. Association for Computational Linguistics.

Haşim Sak, Tunga Güngör, and Murat Saraçlar. 2008. Turkish language resources: Morphological parser, morphological disambiguator and web corpus. In Advances in natural language processing, pages 417–427. Springer.

Patrick Schone and Daniel Jurafsky. 2000. Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning - Volume 7, ConLL '00, pages 67–72, Stroudsburg, Pennsylvania, USA. Association for Computational Linguistics.

Kairit Sirts and Sharon Goldwater. 2013. Minimally-supervised morphological segmentation using adaptor grammars. TACL, 1:255–266.

Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 354–362, Stroudsburg, Pennsylvania, USA. Association for Computational Linguistics.

Benjamin Snyder and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In The Annual Conference of the Association for Computational Linguistics.

David Stallard, Jacob Devlin, Michael Kayser, Yoong Keok Lee, and Regina Barzilay. 2012. Unsupervised morphology rivals supervised morphology for arabic mt. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 322–327. Association for Computational Linguistics.

Sami Virpioja, Ville T. Turunen, Sebastian Spiegler, Oskar Kohonen, and Mikko Kurimo. 2011. Empirical comparison of evaluation methods for unsupervised learning of morphology. TAL, 52(2):45–90.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Report in Aalto University publication series SCIENCE + TECHNOLOGY, Department of Signal Processing and Acoustics, Aalto University, Helsinki, Finland.