Transactions of the Association for Computational Linguistics, vol. 6, pp. 33–48, 2018. Action Editor: Regina Barzilay.
Submission batch: 5/2016; Revision batch: 10/2016; Published 1/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Joint Semantic Synthesis and Morphological Analysis of the Derived Word

Ryan Cotterell
Department of Computer Science
Johns Hopkins University
ryan.cotterell@jhu.edu

Hinrich Schütze
CIS
LMU Munich
inquiries@cislmu.org

Abstract

Much like sentences are composed of words, words themselves are composed of smaller units. For example, the English word questionably can be analyzed as question+able+ly. However, this structural decomposition of the word does not directly give us a semantic representation of the word's meaning. Since morphology obeys the principle of compositionality, the semantics of the word can be systematically derived from the meaning of its parts. In this work, we propose a novel probabilistic model of word formation that captures both the analysis of a word w into its constituent segments and the synthesis of the meaning of w from the meanings of those segments. Our model jointly learns to segment words into morphemes and compose distributional semantic vectors of those morphemes. We experiment with the model on English CELEX data and German DErivBase (Zeller et al., 2013) data. We show that jointly modeling semantics increases both segmentation accuracy and morpheme F1 by between 3% and 5%. Additionally, we investigate different models of vector composition, showing that recurrent neural networks yield an improvement over simple additive models. Finally, we study the degree to which the representations correspond to a linguist's notion of morphological productivity.

1 Introduction

In most languages, words decompose further into smaller units, termed morphemes. For example, the English word questionably can be analyzed as question+able+ly. This structural decomposition of the word, however, by itself is not a semantic representation of the word's meaning;[1] we further require an account of how to synthesize the meaning from the decomposition. Fortunately, words, just like phrases, to a large extent obey the principle of compositionality: the semantics of the word can be systematically derived from the meaning of its parts.[2] In this work, we propose a novel joint probabilistic model of word formation that captures both structural decomposition of a word w into its constituent segments and the synthesis of w's meaning from the meaning of those segments.

[1] There are many different linguistic and computational theories for interpreting the structural decomposition of a word. For example, un- often signifies negation and its effect on semantics can then be modeled by theories based on logic. This work addresses the question of structural decomposition and semantic synthesis in the general framework of distributional semantics.

[2] Morphological research in theoretical and computational linguistics often focuses on noncompositional or less compositional phenomena, simply because compositional derivation poses fewer interesting research problems. It is also true that, just as many frequent multiword units are not completely compositional, many frequent derivations (e.g., refusal, fitness) are not completely compositional. An indication that non-lexicalized derivations are usually compositional is the fact that standard dictionaries like OUP editors (2010) list derivational affixes with their compositional meaning, without a hedge that they can also occur as part of only partially compositional forms. See also Haspelmath and Sims (2013), §5.3.6.

Morphological segmentation is a structured prediction task that seeks to break a word up into its constituent morphemes. The output segmentation has been shown to aid a diverse set of applications, such as automatic speech recognition (Afify et al., 2006), keyword spotting (Narasimhan et al., 2014), machine translation (Clifton and Sarkar, 2011) and parsing (Seeker and Çetinoğlu, 2015). In contrast to much of this prior work, we focus on supervised segmentation, i.e., we provide the model with gold segmentations during training time. Instead of surface segmentation, our model performs canonical segmentation (Cotterell et al., 2016a; Cotterell et al., 2016b; Kann et al., 2016), i.e., it allows the induction of orthographic changes together with the segmentation, which is not typical. For the example questionably, our model can restore the deleted characters le, yielding the canonical segments question, able and ly.

In this work, our primary contribution lies in the integration of continuous semantic vectors into supervised morphological segmentation: we present a joint model of morphological analysis and semantic synthesis at the word level. We experimentally investigate three novel aspects of our model.

• First, we show that jointly modeling continuous representations of the semantics of morphemes and words allows us to improve morphological analysis. On the English portion of CELEX (Baayen et al., 1993), we achieve a 5 point improvement in segmentation accuracy and a 3 point improvement in morpheme F1. On the German DErivBase dataset we achieve a 3 point improvement in segmentation accuracy and a 3 point improvement in morpheme F1.

• Second, we explore improved models of vector composition for synthesizing word meaning. We find a recurrent neural network improves over previously proposed additive models. Moreover, we find that more syntactically oriented vectors (Levy and Goldberg, 2014a) are better suited for morphology than bag-of-words (BOW) models.

• Finally, we explore the productivity of English derivational affixes in the context of distributional semantics.

2 Derivational Morphology

Two important goals of morphology, the linguistic study of the internal structure of words, are to describe the relation between different words in the lexicon and to decompose them into morphemes, the smallest linguistic unit bearing meaning. Morphology can be divided into two types: inflectional and derivational. Inflectional morphology is the set of processes through which the word form outwardly displays syntactic information, e.g., verb tense. It follows that an inflectional affix typically neither changes the part-of-speech (POS) nor the semantics of the word. For example, the English verb to run takes various forms: run, runs, ran and running, all of which convey "moving by foot quickly", but appear in complementary syntactic contexts.

Derivation deals with the formation of new words that have semantic shifts in meaning (often including POS) and is tightly intertwined with lexical semantics (Light, 1996). Consider the example of the English noun discontentedness, which is derived from the adjective discontented. It is true that both words share a close semantic relationship, but the transformation is clearly more than a simple inflectional marking of syntax. Indeed, we can go one step further and define a chain of words content ↦ contented ↦ discontented ↦ discontentedness.

In the computational literature, derivational morphology has received less attention than inflectional. There are, however, two bodies of work on derivation in computational linguistics. First, there is a series of papers that explore the relation between lexical semantics and derivation (Lazaridou et al., 2013; Zeller et al., 2014; Padó et al., 2015; Kisselew et al., 2015). All of these assume a gold morphological analysis and primarily focus on the effect of derivation on distributional semantics. The second body of work, e.g., the unsupervised morphological segmenter MORFESSOR (Creutz and Lagus, 2007), does not deal with semantics and makes no distinction between inflectional and derivational morphology.[3] Even though the boundary between inflectional and derivational morphology is a continuum rather than a rigid divide (Haspelmath and Sims, 2013), there is still the clear distinction that derivation changes meaning whereas inflection does not. Our goal in this paper is to develop an account of how the meaning of a word form can be computed jointly, combining these two lines of work.

[3] Narasimhan et al. (2015) also make no distinction between inflectional and derivational morphology, but their model is an exception in that it includes vector similarity as a semantic feature. See §5 for discussion.

Productivity and Semantic Coherence. We highlight two related issues in derivation that motivated the development of our model: productivity
and semantic coherence. Roughly, a productive affix is one that can still actively be employed to form new words in a language. For example, the English nominalizing affix ness (red ↦ red+ness) can be attached to just about any adjective, including novel forms. In contrast, the archaic English nominalizing affix th (dear ↦ dear+th, heal ↦ heal+th, steal ↦ steal+th) does not allow us to form new words such as cheapth. This is a crucial issue in derivational morphology since we would not in general want to analyze new words as having been formed from non-productive endings; e.g., we do not want to analyze hearth as hear+th (or wugth as wug+th). Relations such as those between heal and health are lexicalized since they no longer can be derived by productive processes (Bauer, 1983).

Under a generative treatment (Chomsky, 1965) of morphology, productivity becomes a central notion since a grammar needs to account for active word formation processes in the language (Aronoff, 1976). Defining productivity precisely, however, is tricky; Aronoff (1976) writes, "one of the central mysteries of derivational morphology … [is that] … though many things are possible in morphology, some are more possible than others." Nevertheless, speakers often have clear intuitions about which affixes in the language are productive.[4]

[4] It is also important to distinguish productivity from creativity, a non-rule-governed form of word formation (Lyons, 1977). As an example of creativity, consider the creation of portmanteaux, e.g., dramedy and soundscape.

Related to productivity is the notion of semantic coherence. The principle of compositionality (Frege, 1892; Heim and Kratzer, 1998) applies to interpretation of words just as it does to phrases. Indeed, compositionality is often taken to be a signal for productivity (Aronoff, 1976). When deciding whether to further decompose a word, asking whether the parts sum up to the whole is often a good indicator. In the case of questionably ↦ question+able+ly, the compositional meaning is "in a manner that could be questioned", which corresponds to the meaning of the word. Contrast this with the word unquiet, which means "restless", rather than "not quiet" and the compound blackmail, which does not refer to a letter written in black ink.

The model we will describe in §3 is a joint model of both semantic coherence and segmentation; that is, an analysis is judged not only by character-level features, but also by the degree to which the word is semantically compositional. Implicit in such a treatment is the desire to only segment a word if the segmentation is derived from a productive process. While most prior work on morphological segmentation has not explicitly modeled productivity,[5] we believe, from a computational modeling perspective, segmenting only productive affixes is preferable. This is analogous to the modeling of phrase compositionality in embedding models, where it can be better to not further decompose noncompositional multiword units like named entities and idiomatic expressions; see, e.g., Mikolov et al. (2013b), Wang et al. (2014), Yin and Schütze (2015), Yaghoobzadeh and Schütze (2015), and Hashimoto and Tsuruoka (2016).[6]

[5] Note that segmenters such as MORFESSOR utilize the principle of minimum description length, which implicitly encodes productivity, in order to guide segmentation.

[6] As a reviewer points out, productivity of an affix and semantic coherence of the words formed from it are not perfectly aligned. Nonproductive affixes can produce semantically coherent words, e.g., warm ↦ warm+th. Productive affixes can produce semantically incoherent words, e.g., canny ↦ un+canny. Again, this is analogous to multiword units. However, there is a strong correlation and our experiments show that relying on it gives good results.

In this paper, we refer to the semantic aspect of the model either as semantic synthesis or as coherence. These are two ways of looking at semantics that are related as follows. If the synthesis (i.e., composition) of the meaning of the derived form from the meaning of its parts is a regular application of the linguistic rules of derivation, then the meaning so constructed is coherent. These are the cases where a joint model is expected to be beneficial for both segmentation and interpretation.

3 A Joint Model

From an NLP perspective, canonical segmentation (Naradowsky and Goldwater, 2009; Cotterell et al., 2016b) is the task that seeks to algorithmically decompose a word into its canonical sequence of morphemes. It is a version of morphological segmentation that requires the learner to handle orthographic changes that take place during word formation. We believe this is a more natural formulation of morphological analysis, especially for the processing
of derivational morphology, as it draws heavily on linguistic notions (see §2).

[Figure 1 labels: surface form "unquestionably"; underlying form "unquestionablely"; segmentation into prefix, stem and suffixes; vector composition; target.]

Figure 1: A depiction of the joint model that makes the relation between the three factors and the observed surface form explicit. We show a simple additive model of composition for ease of explication.

The main innovation we present is the augmentation of canonical segmentation to take into account semantic coherence and productivity. Consider the word hypercuriosity and its canonical segmentation hyper+curious+ity; this canonical segmentation seeks to decompose the word into its constituent morphemes and account for orthographic changes. This amounts to a structural decomposition of the word, i.e., how do we break up the string of characters into chunks? This is similar to the decomposition of a sentence into a parse tree. However, it is also natural to consider the semantic compositionality of a word, i.e., how is the meaning of the word synthesized from the meaning of the individual morphemes?

We consider both of these questions together in a single model, where we would like to place high probability on canonical segmentations that are also semantically coherent. Returning to hypercuriosity, we could further decompose it into hyper+cure+ous+ity in analogy to, say, vice ↦ vicious. Nothing about the surface form of curious alone gives us a strong cue that we should rule out the segmentation cure+ous. Turning to distributional semantics, however, it is the case that the contexts in which curious occurs are quite different from those in which cure occurs. This gives us a strong cue which segmentation is correct.

Formally, given a word string $w \in \Sigma^*$, where $\Sigma$ is a discrete alphabet of characters (in English this could be as simple as the 26-letter lowercase alphabet), and a word vector $v \in V$, where $V$ is a set of low-dimensional word embeddings, we define the model as:

$$p(v, s, l, u \mid w) = \frac{1}{Z_\theta(w)} \exp\Big( -\frac{1}{2\sigma^2} \lVert v - C_\beta(s, l) \rVert_2^2 + f(s, l, u)^\top \eta + g(u, w)^\top \omega \Big). \tag{1}$$

This model is composed of three factors: the composition factor ($-\frac{1}{2\sigma^2} \lVert v - C_\beta(s, l) \rVert_2^2$), the segmentation factor $f$ and the transduction factor $g$. The parameters of the model are $\theta = \{\beta, \eta, \omega\}$, the function $C_\beta$ composes morpheme vectors together, $s$ is the segmentation, $l$ is the labeling of the segments, $u$ is the underlying representation and $Z_\theta(w)$ is the partition function. Note that the conditional distribution $p(v \mid s, l, u, w)$ is Gaussian distributed by construction. A visualization of our model is found in Figure 1. This model is a conditional random field (CRF) that is mixed, i.e., it is defined over both discrete and continuous random variables (Koller and Friedman, 2009). We restrict the range of $u$ to be a subset of $\Sigma^{|w|+k}$, where $k$ is an insertion limit (Dreyer, 2011). In this work, we take $k = 5$. Explicitly, the partition function is defined as

$$Z_\theta(w) = \int \sum_{l', s', u'} \exp\Big( -\frac{1}{2\sigma^2} \lVert v' - C_\beta(s', l') \rVert_2^2 + f(s', l', u')^\top \eta + g(u', w)^\top \omega \Big) \, dv', \tag{2}$$

which is guaranteed to be finite.[7]

[7] Since we have capped the insertion limit, we have a finite number of values that $u$ can take for any $w$. Thus, it follows that we have a finite number of canonical segmentations $s$. Hence we take a finite number of Gaussian integrals. These integrals all converge since we have fixed the covariance matrix as $\sigma^2 I$, which is positive definite.

A CRF is simply the globally renormalized product of several non-negative factors (Sutton and McCallum, 2006). Our model is composed of three: transduction, segmentation and composition factors. We describe each in turn.

3.1 Transduction Factor

The first factor we consider is the transduction factor, $\exp\big(g(u, w)^\top \omega\big)$, which scores a surface
representation (SR) $w$, the character string observed in raw text, and an underlying representation (UR), a character string with orthographic processes reversed. The aim of this factor is to place high weight on good pairs, e.g., the pair ($w$ = questionably, $u$ = questionablely), so we can accurately restore character-level changes.

We encode this portion of the model as a weighted finite-state machine for ease of computation. This factor generalizes probabilistic edit distance (Ristad and Yianilos, 1998) by looking at additional input and output context; see Cotterell et al. (2014) for details. As mentioned above and in contrast to Cotterell et al. (2014), we bound the insertion limit in the edit distance model.[8] Computing the score between two strings $u$ and $w$ requires a dynamic program that runs in $O(|u| \cdot |w|)$. This is a generalization of the forward algorithm for Hidden Markov Models (HMMs) (Rabiner, 1989).

[8] As our transduction model is an unnormalized factor in a CRF, we do not require the local normalization discussed in Cotterell et al. (2014): a weight on an edge may be any non-negative real number since we will renormalize later. The underlying model, however, remains the same.

We employ standard feature templates for the task that look at features of edit operations, e.g., substitute i for y, in varying context granularities. See Cotterell et al. (2016b) for details. Recent work has also explored weighting of WFST arcs with scores computed by LSTMs (Hochreiter and Schmidhuber, 1997), obviating the need for human selection of feature templates (Rastogi et al., 2016).

3.2 Segmentation Factor

The second factor is the segmentation factor: $\exp\big(f(s, l, u)^\top \eta\big)$. The goal of this factor is to score a segmentation $s$ of a UR $u$. In our example, it scores the input-output pair ($u$ = questionablely, $s$ = question+able+ly). It additionally scores a labeling of the segmentation. Our label set in this work is $L$ = {stem, prefix, suffix}. The proper labeling of the segmentation above is $l$ = question:stem + able:suffix + ly:suffix. The labeling is critical for our composition functions $C_\beta$ (Cotterell et al., 2015): which vectors are used depends on the label given to the segment; e.g., the vectors of the prefix "post" and the stem "post" are different.

We can view this factor as an unnormalized first-order semi-CRF (Sarawagi and Cohen, 2005). Computation of the factor again requires dynamic programming. The algorithm is a different generalization of the forward algorithm for HMMs, one that extends it to the semi-Markov case. This algorithm runs in $O(|u|^2 \cdot |L|^2)$.

Table 1: Composition models $C_\beta(s, l)$ used in this and prior work. The representation of the word is $h_N$ for the dynamic and $c$ for the non-dynamic models. Note that for the dynamic models $h_0$ is a learned parameter.

model      composition function
stem       $c = \sum_{i=1}^{N} \mathbb{1}_{l_i = \text{stem}} \, m^{l_i}_{s_i}$
mult       $c = \bigodot_{i=1}^{N} m^{l_i}_{s_i}$
add        $c = \sum_{i=1}^{N} m^{l_i}_{s_i}$
wadd       $c = \sum_{i=1}^{N} \alpha_i \, m^{l_i}_{s_i}$
fulladd    $c = \sum_{i=1}^{N} U_i \, m^{l_i}_{s_i}$
LDS        $h_i = X h_{i-1} + U m^{l_i}_{s_i}$
RNN        $h_i = \tanh(X h_{i-1} + U m^{l_i}_{s_i})$

Features. We again use standard feature templates for the task. We create atomic indicator features for the individual segments. We then conjoin the atomic features with left and right context features as well as the label to create more complex feature templates. We also include transition features that fire on pairs of sequential labels. See Cotterell et al. (2015) for details. Recent work has also shown that a neural parameterization can remove the need for manual feature design (Kong et al., 2016).

3.3 Composition Factor

The composition factor takes the form of an unnormalized multivariate Gaussian density: $\exp\big(-\frac{1}{2\sigma^2} \lVert v - C_\beta(s, l) \rVert_2^2\big)$, where the mean is computed by the (potentially non-linear) composition function (see Table 1) and the covariance matrix $\sigma^2 I$ is a diagonal matrix. The goal of the composition function $C_\beta(s, l)$ is to stitch together morpheme embeddings to approximate the vector of the entire word.

The simplest form of the composition function $C_\beta(s, l)$ is add, an additive model of the morphemes. See Table 1: each vector $m^{l_i}_{s_i}$ refers to a morpheme-specific, label-dependent embedding.
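The add and RNN rows of Table 1, together with the Gaussian composition factor, can be illustrated with a minimal pure-Python sketch. The function names (`compose_add`, `compose_rnn`, `composition_log_factor`), the two-dimensional toy vectors and the matrices below are assumptions made for illustration only; they are not the embeddings or parameters used in the experiments.

```python
import math


def compose_add(morpheme_vecs):
    """add: c = sum_i m_i, the element-wise sum of the morpheme vectors."""
    dim = len(morpheme_vecs[0])
    return [sum(vec[d] for vec in morpheme_vecs) for d in range(dim)]


def matvec(matrix, vec):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def compose_rnn(morpheme_vecs, X, U, h0):
    """RNN: h_i = tanh(X h_{i-1} + U m_i); the word representation is h_N."""
    h = list(h0)
    for m in morpheme_vecs:
        xh = matvec(X, h)
        um = matvec(U, m)
        h = [math.tanh(a + b) for a, b in zip(xh, um)]
    return h


def composition_log_factor(v, c, sigma2):
    """Log of the unnormalized Gaussian factor: -||v - c||^2 / (2 sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(v, c))
    return -sq_dist / (2.0 * sigma2)


# Hypothetical 2-d embeddings for question:stem, able:suffix, ly:suffix.
toy_vecs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
c_add = compose_add(toy_vecs)

# Hypothetical RNN parameters: X = 0 and U = identity, so h_N = tanh(m_N).
X_toy = [[0.0, 0.0], [0.0, 0.0]]
U_toy = [[1.0, 0.0], [0.0, 1.0]]
h_rnn = compose_rnn(toy_vecs, X_toy, U_toy, h0=[0.0, 0.0])

# The factor is largest (log value 0) when the composed vector equals v.
log_factor_exact = composition_log_factor(c_add, c_add, sigma2=0.5)
log_factor_off = composition_log_factor([0.0, 0.0], c_add, sigma2=0.5)
```

The log factor shrinks as the composed vector moves away from the observed word vector, which is how a segmentation whose morphemes compose poorly is penalized under the joint model.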
If $l_i$ = stem, then $s_i$ represents a stem morpheme. Given that our segmentation is canonical, an $s_i$ that is a stem generally itself is an entry in the lexicon and $v(s_i) \in V$. If $v(s_i) \notin V$, then we set $v(s_i)$ to 0.[9] We optimize over vectors with $l_i \in$ {prefix, suffix} as they correspond to bound morphemes.

[9] This is not changed in training, so all such $v(s_i)$ are 0 in the final model. Clearly, this could be improved in future work as a reviewer points out, e.g., by setting such $v(s_i)$ to an average of a suitably chosen set of known word vectors.

We also consider a more expressive composition model, a recurrent neural network (RNN). Let $N$ be the number of segments. Then $C_\beta(s, l) = h_N$ where $h_i$ is a hidden vector, defined by the recursion:[10] $h_i = \tanh\big(X h_{i-1} + U m^{l_i}_{s_i}\big)$ (Elman, 1990). Again, we optimize the morpheme embeddings $m^{l_i}_{s_i}$ only when $l_i \neq$ stem along with the other parameters of the RNN, i.e., the matrices $U$ and $X$.

[10] We do not explore more complex RNNs, e.g., LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014a) as words in our data have $\leq 7$ morphemes. These architectures make the learning of long distance dependencies easier, but are no more powerful than an Elman RNN, at least in theory. Note that perhaps if applied to languages with richer derivational morphology than English, considering more complex neural architectures would make sense.

4 Inference and Learning

Exact inference is intractable since we allow arbitrary segment-level features on the canonicalized word forms $u$. Since the semi-CRF factor has features that fire on substrings, we would need a dynamic programming state for each substring of each of the exponentially many settings of $u$; this breaks the dynamic program. We thus turn to approximate inference through an importance sampling routine (Rubinstein and Kroese, 2011).

4.1 Inference by Importance Sampling

Rather than considering all underlying orthographic forms $u$ and segmentations $s$, we sample from a tractable proposal distribution $q$, a distribution over canonical segmentations. In the following equations we omit the dependence on $w$ for notational brevity and define $h(l, s, u) = f(s, l, u) + g(u, w)$. Crucially, the partition function $Z_\theta(w)$ is not a function of parameter subvector $\beta$ and its gradient with respect to $\beta$ is 0.[11] Recall that computing the gradient of the log-partition function is equivalent to the problem of marginal inference (Wainwright and Jordan, 2008). We derive our estimator as follows:

$$\nabla_\theta \log Z = \mathbb{E}_{(l,s,u) \sim p}\big[h(l, s, u)\big] \tag{3}$$
$$= \sum_{l,s,u} p(l, s, u) \, h(l, s, u) \tag{4}$$
$$= \sum_{l,s,u} q(l, s, u) \, \frac{p(l, s, u)}{q(l, s, u)} \, h(l, s, u) \tag{5}$$
$$= \mathbb{E}_{(l,s,u) \sim q}\left[\frac{p(l, s, u)}{q(l, s, u)} \, h(l, s, u)\right], \tag{6}$$

where we have omitted the dependence on $w$ (which we condition on) and $v$ (which we marginalize out). So long as $q$ has support everywhere $p$ does (i.e., $p(l, s, u) > 0 \Rightarrow q(l, s, u) > 0$), the estimate is unbiased. Unfortunately, we can only efficiently compute $p(l, s, u)$ up to a constant factor, $p(l, s, u) = \bar{p}(l, s, u) / Z'_\theta(w)$. Thus, we use the indirect importance sampling estimator,

$$\frac{1}{\sum_{i=1}^{M} w^{(i)}} \sum_{i=1}^{M} w^{(i)} \, h(l^{(i)}, s^{(i)}, u^{(i)}), \tag{7}$$

where $(l^{(1)}, s^{(1)}, u^{(1)}), \ldots, (l^{(M)}, s^{(M)}, u^{(M)}) \overset{\text{i.i.d.}}{\sim} q$ and importance weights $w^{(i)}$ are defined as:

$$w^{(i)} = \frac{\bar{p}(l^{(i)}, s^{(i)}, u^{(i)})}{q(l^{(i)}, s^{(i)}, u^{(i)})}. \tag{8}$$

This indirect estimator is biased, but consistent.[12]

[11] The subvector $\beta$ is responsible for computing only the mean of the Gaussian factor and thus has no impact on its normalization coefficient (Murphy, 2012).

[12] Informally, the indirect importance sampling estimate converges to the true expectation as $M \to \infty$ (the definition of statistical consistency).

Proposal Distribution. The success of importance sampling depends on the choice of a "good" proposal distribution, i.e., one that ideally is close to $p$. Since we are fully supervised at training time, we have the option of training locally normalized distributions for the individual components. Concretely, we train two proposal distributions $q_1(u \mid w)$ and $q_2(l, s \mid u)$ that take the form of a WFST and a semi-CRF, respectively, using features identical to the joint model.
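The indirect (self-normalized) estimator of Eqs. (7) and (8) can be sketched on a toy discrete space. Everything below is a hypothetical stand-in: `self_normalized_is`, the three-outcome space, the uniform proposal and the unnormalized target are illustrative assumptions, not the WFST and semi-CRF proposals described above.

```python
import math
import random


def self_normalized_is(sample_q, log_q, log_p_bar, h, num_samples, rng):
    """Estimate E_p[h(x)] from samples of q, given only unnormalized p_bar."""
    samples = [sample_q(rng) for _ in range(num_samples)]
    # Log importance weights: log p_bar(x) - log q(x), as in Eq. (8).
    log_w = [log_p_bar(x) - log_q(x) for x in samples]
    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels in the ratio of Eq. (7).
    max_log_w = max(log_w)
    weights = [math.exp(lw - max_log_w) for lw in log_w]
    total = sum(weights)
    dim = len(h(samples[0]))
    estimate = [0.0] * dim
    for w_i, x in zip(weights, samples):
        h_x = h(x)
        for d in range(dim):
            estimate[d] += w_i * h_x[d]
    return [e / total for e in estimate]


# Toy setup: outcomes {0, 1, 2}, unnormalized target weights (1, 2, 1),
# so the normalized target is (0.25, 0.5, 0.25) and E_p[x] = 1.
P_BAR = {0: 1.0, 1: 2.0, 2: 1.0}


def toy_sample_q(rng):
    return rng.choice([0, 1, 2])  # uniform proposal


def toy_log_q(x):
    return math.log(1.0 / 3.0)


def toy_log_p_bar(x):
    return math.log(P_BAR[x])


def toy_h(x):
    return [float(x)]  # one-dimensional "feature vector"


rng = random.Random(0)
is_estimate = self_normalized_is(
    toy_sample_q, toy_log_q, toy_log_p_bar, toy_h, num_samples=20000, rng=rng
)
```

Because the weights are renormalized by their sum, the unknown constant relating $\bar{p}$ to $p$ cancels, which is what makes the estimator usable when only the unnormalized score is available.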
Each of these distributions is tractable: we can compute the marginals with dynamic programming and thus sample efficiently. To draw samples $(l, s, u) \sim q$, we sample sequentially from $q_1$ and then $q_2$, conditioned on the output of $q_1$.

4.2 Learning

We optimize the log-likelihood of the model using ADAGRAD (Duchi et al., 2011), which is SGD with a special per-parameter learning rate. The full gradient of the objective for one training example is:

$$\nabla_\theta \log p(v, s, l, u \mid w) = f(s, l, u)^\top + g(u, w)^\top + \frac{1}{\sigma^2} \big(v - C_\beta(s, l)\big) \nabla_\theta C_\beta(s, l) - \nabla_\theta \log Z_\theta(w), \tag{9}$$

where we use the importance sampling algorithm described in §4.1 to approximate the gradient of the log-partition function, following Bengio and Senecal (2003). Note that $\nabla_\theta C_\beta(s, l)$ depends on the composition function used. In the most complicated case when $C_\beta$ is an RNN, we can compute $\nabla_\beta C_\beta(s, l)$ efficiently with backpropagation through time (Werbos, 1990). We take $M = 10$ importance samples; using so few samples can lead to a poor estimate of the gradient, but for our application it suffices. We employ $L_2$ regularization.

4.3 Decoding

Decoding the model is also intractable. To approximate the solution, we again employ importance sampling. We take $M = 10{,}000$ importance samples and select the highest weighted sample.

5 Related Work

The idea that vector semantics is useful for morphological segmentation is not new. Count vectors (Salton, 1971; Turney and Pantel, 2010) have been shown to be beneficial in the unsupervised induction of morphology (Schone and Jurafsky, 2000; Schone and Jurafsky, 2001). Embeddings were shown to act similarly (Soricut and Och, 2015). Our method differs from this line of research in two key ways. (i) We present a probabilistic model of the process of synthesizing the word's meaning from the meaning of its morphemes. Prior work was either not probabilistic or did not explicitly model morphemes. (ii) Our method is supervised and focuses on derivation. Schone and Jurafsky (2000) and Soricut and Och (2015), being fully unsupervised, do not distinguish between inflection and derivation and Schone and Jurafsky (2001) focus on inflection. More recently, Narasimhan et al. (2015) look at the unsupervised induction of "morphological chains" with semantic vectors as a crucial feature. Their goal is to jointly figure out an ordering of word formation and a morphological segmentation, e.g., play ↦ playful ↦ playfulness. While it is a rich model like ours, theirs differs in that it is unsupervised and uses vectors as features, rather than explicitly treating vector composition. All of the above work focuses on surface segmentation and not canonical segmentation, as we do.

A related line of work that has different goals concerns morphological generation. Two recent papers that address this problem using deep learning are Faruqui et al. (2016a) and Faruqui et al. (2016b). In an older line of work, Yarowsky and Wicentowski (2000) and Wicentowski (2002) exploit log frequency ratios of inflectionally related forms to tease apart that, e.g., the past tense of sing is not singed, but instead sang. Related work by Dreyer and Eisner (2011) uses a Dirichlet process to model a corpus as a "mixture of a paradigm", allowing for the semi-supervised incorporation of distributional semantics into a structured model of inflectional paradigm completion.

Our work is also related to recent attempts to integrate morphological knowledge into general embedding models. For example, Botha and Blunsom (2014) train a log-bilinear language model that models the composition of morphological structure. Likewise, Luong et al. (2013) train a recursive neural network (Goller and Küchler, 1996) over a heuristically derived tree structure to learn morphological composition over continuous vectors. Our work is different in that we learn a joint model of segmentation and composition. Moreover, supervised morphological analysis can drastically outperform unsupervised analysis (Ruokolainen et al., 2013).

Early work by Kay (1977) can be interpreted as finite-state canonical segmentation, but it neither addresses nor experimentally evaluates the question of joint modeling of morphological analysis and semantic synthesis. Moreover, we may view canonicalization as an orthographic analogue to phonology. On this interpretation, the finite-state systems of Kaplan and Kay (1994), which computationally apply SPE-style phonological rules (Chomsky and Halle, 1968), may be run backwards to get canonical underlying forms.

Table 2: Results for the canonical morphological segmentation task on English and German. Standard deviation is given in parentheses. We compare against two baselines that do not make use of semantic vectors: (i) "Semi-CRF (Baseline)", a semi-CRF that cannot account for orthographic changes and (ii) "Joint (Baseline)", a version of our joint model without vectors. We also compare against an oracle version with access to gold URs ("Joint+UR (Oracle)", "Joint+UR+Vec (Oracle)"), revealing that the toughest part of the canonical segmentation task is reversing the orthographic changes.

                                 dev                                        test
     Model                       Acc          F1           Edit            Acc          F1           Edit
EN   Semi-CRF (Baseline)         0.55 (.018)  0.75 (.014)  0.80 (.043)     0.54 (.018)  0.75 (.014)  0.78 (.034)
     Joint (Baseline)            0.77 (.011)  0.87 (.007)  0.41 (.029)     0.77 (.013)  0.87 (.007)  0.43 (.029)
     Joint+Vec (This Work)       0.83 (.014)  0.91 (.008)  0.31 (.019)     0.82 (.020)  0.90 (.011)  0.32 (.038)
     Joint+UR (Oracle)           0.94 (.015)  0.96 (.009)  0.07 (.016)     0.94 (.011)  0.96 (.007)  0.07 (.011)
     Joint+UR+Vec (Oracle)       0.95 (.011)  0.97 (.007)  0.05 (.013)     0.95 (.023)  0.97 (.006)  0.05 (.025)
DE   Semi-CRF (Baseline)         0.39 (.062)  0.68 (.039)  1.15 (.230)     0.39 (.058)  0.68 (.042)  1.14 (.240)
     Joint (Baseline)            0.79 (.107)  0.88 (.069)  0.40 (.313)     0.79 (.099)  0.87 (.063)  0.41 (.282)
     Joint+Vec (This Work)       0.82 (.102)  0.90 (.067)  0.33 (.312)     0.82 (.096)  0.90 (.061)  0.33 (.282)
     Joint+UR (Oracle)           0.86 (.108)  0.90 (.070)  0.25 (.288)     0.86 (.100)  0.90 (.064)  0.25 (.268)
     Joint+UR+Vec (Oracle)       0.87 (.106)  0.92 (.069)  0.20 (.285)     0.88 (.096)  0.93 (.062)  0.19 (.263)

6 Experiments and Results

We conduct experiments on English and German derivational morphology. We analyze our joint model's ability to segment words into their canonical morphemes as well as its ability to compositionally derive vectors for new words. Finally, we explore the relationship between distributional semantics and morphological productivity.

For English, we use the pretrained vectors of Levy and Goldberg (2014a) for all experiments. For German, we train word2vec skip-gram vectors on the German Wikipedia. We first describe our English dataset, the subset of the English portion of the CELEX lexical database (Baayen et al., 1993) that was selected by Lazaridou et al. (2013); the dataset contains 10,000 forms. This allows for comparison with previously proposed methods. We make two modifications. (i) Lazaridou et al. (2013) make the two-morpheme assumption: every word is composed of exactly two morphemes. In general, this is not true, so we further segment all complex words in the corpus. For example, friendless+ness is further segmented into friend+less+ness. To nevertheless allow for fair comparison, we provide versions of our experiments with and without the two-morpheme assumption where appropriate. (ii) Lazaridou et al. (2013) only provide a single train/test split. As we require a held-out development set for hyperparameter tuning, we randomly allocate a portion of the training data to select the hyperparameters and then retrain the model using these parameters on the original train split. We also report 10-fold cross validation results in addition to Lazaridou et al.'s train/test split.

Our German dataset is taken from Zeller et al. (2013) and is described in Cotterell et al. (2016b). It, again, consists of 10,000 derivational forms. We report results on 10-fold cross validation.

6.1 Experiment 1: Canonical Segmentation

For our first experiment, we test whether jointly modeling the continuous representations allows us to segment words more accurately. We assume that we are given an embedding for the target word. We estimate the model $p(v, s, l, u \mid w)$ as described in §4 with $L_2$ regularization $\lambda \lVert \theta \rVert_2^2$. To evaluate, we decode the distribution $p(s, l, u \mid v, w)$. We perform approximate MAP inference with importance sampling, taking the sample with the highest score.
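The approximate MAP step can be sketched generically: draw candidates from a proposal and keep the one with the highest unnormalized model score. The sketch below assumes that "score" means the unnormalized log-score of the joint model; the function `approx_map_decode`, the candidate analyses of questionably and their scores are all made up for illustration.

```python
import random


def approx_map_decode(sample_q, log_score, num_samples, rng):
    """Return the sampled candidate with the highest unnormalized log-score."""
    best, best_score = None, float("-inf")
    for _ in range(num_samples):
        candidate = sample_q(rng)
        score = log_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Hypothetical candidate analyses of "questionably" with made-up log-scores.
TOY_SCORES = {
    ("question", "able", "ly"): -1.0,
    ("questionable", "ly"): -2.5,
    ("quest", "ion", "ably"): -6.0,
}
TOY_CANDIDATES = sorted(TOY_SCORES)


def toy_sample_q(rng):
    return rng.choice(TOY_CANDIDATES)  # uniform stand-in for the proposal q


def toy_log_score(candidate):
    return TOY_SCORES[candidate]


rng = random.Random(0)
best_analysis, best_log_score = approx_map_decode(
    toy_sample_q, toy_log_score, num_samples=200, rng=rng
)
```

The quality of the returned analysis depends on the proposal covering the high-scoring region: a candidate that is never sampled can never be selected, which is one reason a large number of samples is drawn at decoding time.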
In these experiments, we use the RNN with the dependency vectors, the combination of which performs best on vector approximation in §6.2.

We follow the experimental design of Cotterell et al. (2016b). We compare against two baselines (marked "Baseline" in Table 2): (i) a "Semi-CRF" segmenter that cannot account for orthographic changes and (ii) the full "Joint" model of Cotterell et al. (2016b).[13] We additionally consider an "Oracle" setting, where we give the model the gold underlying orthographic form ("UR") at both training and test time. This gives us insight into the performance of the transduction factor of our model, i.e., how much could we benefit from a richer model. Our hyperparameters are (i) the regularization coefficient $\lambda$ and (ii) $\sigma^2$, the variance of the Gaussian factor. We use grid search to tune them: $\lambda \in \{0.0, 10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{5}\}$, $\sigma^2 \in \{0.25, 0.5, 0.75, 1.0\}$.

[13] i.e., a model without the Gaussian factor that scores vectors.

Metrics. We use three metrics to evaluate segmentation accuracy. Note that the evaluation of canonical segmentation is hard since a system may return a sequence of morphemes whose concatenation is not the same length as the concatenation of the gold morphemes. This rules out metrics for surface segmentation like border F1 (Kurimo et al., 2010), which require the strings to be of the same length. We now define the metrics. (i) Segmentation accuracy measures whether every single canonical morpheme in the returned sequence is correct. It is inflexible: closer answers are penalized the same as more distant answers. (ii) Morpheme F1 (van den Bosch and Daelemans, 1999) takes the predicted sequence of canonical morphemes, turns it into a set, computes precision and recall in the standard way and based on that then computes F1. This metric gives credit if some of the canonical morphemes were correct. (iii) Levenshtein distance joins the canonical segments with a special symbol # into a single string and computes the Levenshtein distance between predicted and gold strings.

Discussion. Results in Table 2 show that jointly modeling semantic coherence improves our ability to analyze words. For test, our proposed joint model ("This Work") outperforms the baseline supervised canonical segmenter, which is state-of-the-art for the task, by .05 (resp. .03) on accuracy and .03 (resp. .03) on F1 for English (resp. German). We also find that when we give the joint model an oracle UR the vectors generally help less: .01 (resp. .02) on accuracy and .01 (resp. .03) on F1 for English (resp. German). This indicates that the chief boon the vector composition factor provides lies in selection of an appropriate UR. Moreover, the up to .15 difference in English between systems with and without the oracle UR suggests that reversing orthographic changes is a particularly difficult part of the task, at least for English.

6.2 Experiment 2: Vector Approximation

We adopt the experimental design of Lazaridou et al. (2013). Its aim is to approximate a vector of a derivationally complex word using a learned model of composition. As Lazaridou et al. (2013) assume a gold morphological analysis, we compare two settings: (i) oracle morphological analysis and (ii) inferred morphological analysis. To the best of our knowledge, (ii) is a novel experimental condition that no previous work has addressed.

Table 3: Vector approximation (measured by mean cosine similarity) both with ("oracle") and without ("joint", "char") gold morphology. Surprisingly, joint models are close in performance to models with gold morphology.

                 EN                                       DE
                 BOW2          BOW5          DEPs          SG
                 dev    test   dev    test   dev    test   dev    test
oracle   stem    .403   .402   .374   .376   .422   .422   .400   .405
         add     .635   .635   .541   .542   .787   .785   .712   .711
         LDS     .660   .660   .566   .568   .806   .804   .717   .718
         RNN     .660   .660   .565   .567   .807   .806   .707   .712
joint    stem    .399   .400   .371   .372   .411   .412   .394   .398
         add     .625   .625   .524   .525   .782   .781   .705   .704
         LDS     .648   .648   .547   .547   .799   .797   .712   .711
         RNN     .649   .647   .547   .546   .801   .799   .706   .708
char     GRU     .586   .585   .452   .452   .769   .768   .675   .667
         LSTM    .586   .586   .455   .455   .768   .767   .677   .666

We consider four composition models (see Table 1). (i) stem, using just the stem vector. This baseline tells us what happens if we make the incorrect assumption that derivation behaves like inflection and is not meaning-changing. (ii) add, a purely additive model. This is arguably the simplest way of combining the vectors of the morphemes. (iii) LDS, a linear dynamical system. This is arguably the simplest sequence model. (iv) A (simple) RNN. Recur-
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
0
0
3
1
5
6
7
6
1
2
/
/
t
l
a
c
_
a
_
0
0
0
0
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
42
allHRLR-lessin-un-Lazaridoustem.47.52.32.22.39.33mult.39.43.28.23.34.33dil..48.53.33.30.45.41wadd.50.55.38.24.40.34fulladd.56.61.41.38.47.44lexfunc.54.58.42.44.45.46BOW2stem.43.44.38.32.43.51add.65.67.61.60.64.67LDS.67.69.62.61.66.67RNN.67.69.60.60.65.66c-GRU.59.60.55.59.55.57c-LSTM.52.53.50.55.50.50BOW5stem.40.43.33.27.37.46add.56.59.51.46.55.59LDS.58.61.51.48.57.60RNN.58.61.50.48.56.58c-GRU.45.47.42.42.43.45c-LSTM.46.47.43.43.45.46DEPsstem.46.45.49.38.57.67add.79.79.77.78.80.80LDS.80.81.77.79.81.81RNN.81.82.77.79.80.81c-GRU.75.76.72.78.74.75c-LSTM.75.76.71.77.72.73Table4:Vectorapproximation(measuredbymeancosinesimilarity)withgoldmorphologyonthetrain/testsplitofLazaridouetal.(2013).HR/LR=high/low-relatednesswords.SeeLazaridouetal.(2013)fordetails.rentneuralnetworksarecurrentlythemostwidelyusednonlinearsequencemodelandsimpleRNNsarethesimplestsuchmodels.Partofthemotivationforconsideringaricherclassofmodelsliesinourremovalofthetwo-morphemeassumption.Indeed,itisunclearthatthewaddandfulladdmodels(MitchellandLapata,2008)areusefulmodelsinthegeneralcaseofmulti-morphemicwords—theweightsaretiedbyposition,i.e.,thefirstmorpheme’svector(beitaprefixorstem)isalwaysmultipliedbythesamematrix.ComparisonwithLazaridouetal.TocomparewithLazaridouetal.(2013),weusetheirexacttrain/testsplit.ThoseresultsarereportedinTable4.Thisdatasetenforcesthatallwordsarecomposedofexactlytwomorphemes.Thus,awordlikeunques-tionablyissegmentedasun+questionably,with-outfurtherdecomposition.ThevectorsemployedbyLazaridouetal.(2013)arehigh-dimensionalcountvectorsderivedfromlemmatizedandPOStaggedtextwithabefore-and-afterwindowofsize2.Theythenapplypointwisemutualinforma-tion(PMI)weightinganddimensionalityreductionbynon-negativematrixfactorization.Incontrast,weemployWORD2VEC(Mikolovetal.,2013a),amodelthatisalsointerpretableasthefactorizationofaPMImatrix(LevyandGoldberg,2014b).Wecon-siderthreeWORD2VECmodels:twobag-of-word(BOW)modelswithbefore-and-afterwindowsofsize2and5andDEPs(LevyandGoldberg,2014a),adependency-base
dmodelwhosecontextisderivedfromdependencyparsesratherthanBOW.Ingeneral,theresultsindicatethatthekeytobettervectorapproximationisnotarichermodelofcomposition,butratherliesinthevectorsthem-selves.Wefindthatourbestmodel,theRNN,onlymarginallyedgesouttheLDS.Additionally,lookingatthe“all”columnandtheDEPsvectors,thesim-pleadditivemodelisonly≤.02lowerthanLDS.Incomparison,weobservelargedifferencesbetweenthevectors.TheRNN+DEPsmodelis.23bet-terthantheBOW5models(.81vs..58),.14betterthantheBOW2models(.81vs..67)and.25bet-terthanLazaridouetal.’sbestmodel(.81vs..56).AwidercontextforBOW(5insteadof2)yieldsworseresults.Thissuggeststhatsyntacticinfor-mationoratleastpositionalinformationisneces-saryforimprovedmodelsofmorphemecomposi-tion.Thetestvectorsareannotatedforrelatedness,whichisaproxyforsemanticcoherence.HR(high-relatedness)wordswerejudgedtobemorecompo-sitionalthanLR(low-relatedness)words.Character-LevelNeuralRetrofitting.Asafur-therstrongbaseline,weconsideraretrofitting(Faruquietal.,2015)approachbasedoncharacter-levelrecurrentneuralnetworks.Recently,runningarecurrentnetoverthecharacterstreamhasbecomeapopularwayofincorporatingsubwordinformationintoamodel—empiricalgainshavebeenobservedinadiversesetofNLPtasks:POStagging(dosSantosandZadrozny,2014;Lingetal.,2015),pars-ing(Ballesterosetal.,2015)andlanguagemodeling
(Kim et al., 2016). To the best of our knowledge, character-level retrofitting is a novel approach. Given a vector $v$ for a word form $w$, we seek a function to minimize the following objective

$$\frac{1}{2}\,\lVert v - h_N \rVert_2^2, \tag{10}$$

where $h_N$ is the final hidden state of a recurrent neural architecture, i.e.,

$$h_i = \sigma(A h_{i-1} + B w_i), \tag{11}$$

where $\sigma$ is a non-linearity, $w_i$ is the $i$th character in $w$, $h_{i-1}$ is the previous hidden state and $A$ and $B$ are matrices. While we have defined the architecture for a vanilla RNN, we experiment with two more advanced recurrent architectures, GRUs (Cho et al., 2014b) and LSTMs (Hochreiter and Schmidhuber, 1997), as well as deep variants (Sutskever et al., 2014; Gillick et al., 2016; Firat et al., 2016). Importantly, this model has no knowledge of morphology: it can only rely on representations it extracts from the characters. This gives us a clear ablation on the benefit of adding structured morphological knowledge. We optimize the depth and the size of the hidden units on development data using a coarse-grained grid search. We found that a depth of 2 and hidden units of size 100 (in both LSTM and GRU) performed best. We trained all models for 100 iterations of Adam (Kingma and Ba, 2015) with L2 regularization with regularization coefficient 0.01.

Table 4 shows that the two character-level models ("c-GRU" and "c-LSTM") perform much worse than our models. This indicates that supervised morphological analysis produces higher-quality vector representations than "knowledge-poor" character-level models. However, we note that these character-level models have fewer parameters than our morpheme-level models, since there are many more morphemes in a language than characters.

Oracle Morphology. In general, the two-morpheme assumption is incorrect. We consider an expanded setting of the task of Lazaridou et al. (2013), in which we fully decompose the word, e.g., unquestionably ↦ un+question+able+ly. These results are reported in Table 3 (top block, "oracle"). We report mean cosine similarity. Standard deviations $s$ for 10-fold cross-validation (not shown) are small ($\leq .012$) with two exceptions: $s = .044$ for the DEPs-joint-stem results (.411 and .412).

The multi-morphemic results mirror those of the bi-morphemic setting of Lazaridou et al. (2013). (i) RNN+DEPs attains an average cosine similarity of around .80 for English. Numbers for German are lower, around .70. (ii) The RNN only marginally edges out LDS for English and is slightly worse for German. Again, this is not surprising as we are modeling short sequences. (iii) Certain embeddings lend themselves more naturally to derivational compositionality: BOW2 is better than BOW5, and DEPs is the clear winner.

Inferred Morphology. The final setting we consider is the vector approximation task without gold morphology. In this case, we rely on the full joint model $p(v, s, l, u \mid w)$. At evaluation, we are interested in the marginal distribution $p(v \mid w) = \sum_{s,l,u} p(v, s, l, u \mid w)$. We then use importance sampling to approximate the mean of this marginal distribution as the predicted embedding, i.e.,

$$\hat{v} = \int v \, p(v \mid w) \, dv \tag{12}$$

$$\approx \frac{1}{\sum_{i=1}^{M} w^{(i)}} \sum_{i=1}^{M} w^{(i)} \, C_\beta(l^{(i)}, s^{(i)}), \tag{13}$$

where $w^{(i)}$ are the importance weights defined in Equation 8 and $l^{(i)}$ and $s^{(i)}$ are the $i$th sampled labeling and segmentation, respectively.

Discussion. Surprisingly, Table 3 (joint) shows that relying on the inferred morphology does not drastically affect the results. Indeed, we are often within .01 of the result with gold morphology. Our method can be viewed as a retrofitting procedure (Faruqui et al., 2015), so this result is useful: it indicates that joint semantic synthesis and morphological analysis produces high-quality vectors.

6.3 Experiment 3: Derivational Productivity

We now delve into the relation between distributional semantics and morphological productivity. The extent to which jointly modeling semantics aids morphological analysis will be determined by the inherent compositionality of the words within the vector space. We break down our results on the vector approximation task with gold morphology using the
dependency vectors and the RNN composer in Figure 2 by selected affixes. We observe a wide range of scores: the most compositional ending, -ly, gives rise to cosine similarities that are 20 points higher than those of the least compositional, -er.

[Figure 2 (box plot). x-axis: Affix, with tick labels -ly, -ness, -ize, -able, in-, -ful, -ous, -less, -ic, -ist, -y, un-, -ity, re-, -ion, -ment, -al, -er; y-axis: Average Cosine Similarity, from 0.2 to 1.0.]

Figure 2: The box plot breaks down the cosine similarity between the approximated vector and the target vector by affix (using gold morphology). We have ordered the affixes such that the better approximated vectors are on the left.

On the left end of Figure 2 we see extremely productive suffixes. The affix -ize is used productively with relatively obscure words in the sciences, e.g., Rao-Blackwellize. Likewise, the affix -ness can be applied to almost any adjective without restriction, e.g., Poissonness 'degree to which data have a Poisson distribution'. On the right end, we find -ment, -er and re-. The affix -ment is borderline productive (Bauer, 1983); modern English tends to form novel nominalizations with -ness or -ity. More interesting are re- and -er, both of which are very productive in English. For -er, many of the words bringing down the average are simply non-compositional. For example, homer 'home run in baseball' is not derived from home+er; this is an error in the data. We also see examples like cutter. It has a compositional reading (e.g., "box cutter"), but also frequently occurs in the non-compositional meaning 'type of boat'. Finally, proper nouns like Homer and Turner end in -er, and in our experiments we computed vectors for lowercased words. The affix re- similarly has a large number of non-compositional cases, e.g., remove, relocate, remark. Indeed, to get the compositional reading of remove, the first syllable (rather than the second) is typically stressed to emphasize the prefix.

We finally note several limitations of this experiment. (i) The ability of our models, even the recurrent neural network, to model transformations between vectors is limited. (ii) Our vectors are far from perfect; e.g., sparseness in the training data affects quality and some of the words in our corpus are rare. (iii) Semantic coherence is not the only criterion for productivity. An example is -th in English. As noted earlier, it is compositional in a word like warmth, but it cannot be used to form new words.

7 Conclusion

We have presented a model of the semantics and structure of derivationally complex words. To the best of our knowledge, this is the first attempt to jointly consider, within a single model, (i) the morphological decomposition of the word form and (ii) the semantic coherence of the resulting analysis. We found that directly modeling coherence increases segmentation accuracy, improving over a strong baseline. Also, our models show state-of-the-art performance on the derivational vector approximation task introduced by Lazaridou et al. (2013). Future work will focus on the extension of the method to more complex instances of derivational morphology, e.g., compounding and reduplication, and on the extension to additional languages. We also plan to explore the relation between derivation and distributional semantics in greater detail.
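For concreteness, the three segmentation metrics defined in §6.1 (segmentation accuracy, morpheme F1 and Levenshtein distance over #-joined segments) can be sketched per word as below. This is a minimal illustration in Python, not the evaluation script behind Table 2. The function names are hypothetical, and two details are assumptions the text does not specify: F1 is set to 0 when the predicted and gold sets share no morpheme, and scores are computed for a single word (how they are aggregated over a test set is left open).

```python
from typing import Sequence


def segmentation_accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """1.0 only if every canonical morpheme matches, in order; else 0.0."""
    return 1.0 if list(pred) == list(gold) else 0.0


def morpheme_f1(pred: Sequence[str], gold: Sequence[str]) -> float:
    """F1 over the *sets* of predicted and gold canonical morphemes."""
    pred_set, gold_set = set(pred), set(gold)
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0  # assumption: no shared morpheme gives F1 = 0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insert, delete, substitute; unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def segment_edit_distance(pred: Sequence[str], gold: Sequence[str],
                          sep: str = "#") -> int:
    """Join segments with the special symbol and compare the two strings."""
    return levenshtein(sep.join(pred), sep.join(gold))


gold = ["question", "able", "ly"]
pred = ["question", "ably"]  # a hypothetical system output
print(segmentation_accuracy(pred, gold))  # 0.0
print(morpheme_f1(pred, gold))
print(segment_edit_distance(pred, gold))  # 3
```

In this example the prediction recovers only the stem, so accuracy is 0 while morpheme F1 still gives partial credit (precision 1/2, recall 1/3); the edit distance of 3 reflects the three characters by which "question#ably" differs from "question#able#ly".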
Acknowledgments. The first author was supported by a DAAD Long-Term Research Grant and an NDSEG fellowship and the second by a Volkswagenstiftung Opus Magnum grant. We would also like to thank action editor Regina Barzilay for suggesting several changes we incorporated into the work and the three anonymous reviewers.

References

Mohamed Afify, Ruhi Sarikaya, Hong-Kwang Jeff Kuo, Laurent Besacier, and Yuqing Gao. 2006. On the use of morphological analysis for dialectal Arabic speech recognition. In Ninth International Conference on Spoken Language Processing.
Mark Aronoff. 1976. Word Formation in Generative Grammar. MIT Press.
Harald Baayen, Richard Piepenbrock, and Hedderik van Rijn. 1993. The CELEX lexical database on CD-ROM.
Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 349–359, Lisbon, Portugal, September. Association for Computational Linguistics.
Laurie Bauer. 1983. English Word-Formation. Cambridge University Press.
Yoshua Bengio and Jean-Sébastien Senecal. 2003. Quick training of probabilistic neural nets by importance sampling. In Proceedings of the Ninth International Conference on Artificial Intelligence and Statistics.
Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In International Conference on Machine Learning, pages 1899–1907.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. Workshop on Syntax, Semantics and Structure in Statistical Translation.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing.
Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row.
Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press.
Ann Clifton and Anoop Sarkar. 2011. Combining morpheme-based machine translation with post-processing morpheme prediction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 32–42, Portland, Oregon, USA, June. Association for Computational Linguistics.
Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 625–630, Baltimore, Maryland, June. Association for Computational Linguistics.
Ryan Cotterell, Thomas Müller, Alexander Fraser, and Hinrich Schütze. 2015. Labeled morphological segmentation with semi-Markov models. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 164–174, Beijing, China, July. Association for Computational Linguistics.
Ryan Cotterell, Arun Kumar, and Hinrich Schütze. 2016a. Morphological segmentation inside-out. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2325–2330, Austin, Texas, November. Association for Computational Linguistics.
Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016b. A joint model of orthography and morphological segmentation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 664–669, San Diego, California, June. Association for Computational Linguistics.
Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):3.
Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In International Conference on Machine Learning, pages 1818–1826.
Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 616–627. Association for Computational Linguistics, July.
Markus Dreyer. 2011. A Non-parametric Model for the Discovery of Inflectional Paradigms from Plain Text using Graphical Models over Strings. Ph.D. thesis, Johns Hopkins University.
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606–1615, Denver, Colorado, May–June. Association for Computational Linguistics.
Manaal Faruqui, Ryan McDonald, and Radu Soricut. 2016a. Morpho-syntactic lexicon generation using graph-based semi-supervised learning. Transactions of the Association for Computational Linguistics, 4:1–16.
Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016b. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California, June. Association for Computational Linguistics.
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California, June. Association for Computational Linguistics.
Gottlob Frege. 1892. Über Begriff und Gegenstand. Vierteljahresschrift für wissenschaftliche Philosophie, 16:192–205.
Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1296–1306, San Diego, California, June. Association for Computational Linguistics.
Christoph Goller and Andreas Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks.
Kazuma Hashimoto and Yoshimasa Tsuruoka. 2016. Adaptive joint learning of compositional and non-compositional phrase embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 205–215, Berlin, Germany, August. Association for Computational Linguistics.
Martin Haspelmath and Andrea Sims. 2013. Understanding Morphology. Routledge.
Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar. Blackwell.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2016. Neural morphological analysis: Encoding-decoding canonical segments. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 961–967, Austin, Texas, November. Association for Computational Linguistics.
Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.
Martin Kay. 1977. Morphological and syntactic analysis. Linguistic Structures Processing, 5:131.
Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2741–2749.
Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Max Kisselew, Sebastian Padó, Alexis Palmer, and Jan Šnajder. 2015. Obtaining a better understanding of distributional models of German derivational morphology. In Proceedings of the 11th International Conference on Computational Semantics, pages 58–63.
Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Segmental recurrent neural networks. In 4th International Conference on Learning Representations.
Mikko Kurimo, Sami Virpioja, Ville Turunen, and Krista Lagus. 2010. Morpho Challenge competition 2005–2010: Evaluations and results. In Special Interest Group on Computational Morphology and Phonology.
Angeliki Lazaridou, Marco Marelli, Roberto Zamparelli, and Marco Baroni. 2013. Compositional-ly derived representations of morphologically complex words in distributional semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1517–1526, Sofia, Bulgaria, August. Association for Computational Linguistics.
Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308,
Baltimore, Maryland, June. Association for Computational Linguistics.
Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.
Marc Light. 1996. Morphological cues for lexical semantics. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 25–31, Santa Cruz, California, USA, June. Association for Computational Linguistics.
Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530, Lisbon, Portugal, September. Association for Computational Linguistics.
Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria, August. Association for Computational Linguistics.
John Lyons. 1977. Semantics. Cambridge University Press.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In 2nd International Conference on Learning Representations.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of Association of Computational Linguistics, pages 236–244, Columbus, Ohio, June. Association for Computational Linguistics.
Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Jason Naradowsky and Sharon Goldwater. 2009. Improving morphology induction by learning spelling rules. In Twenty-first International Joint Conference on Artificial Intelligence, pages 1531–1536.
Karthik Narasimhan, Damianos Karakos, Richard Schwartz, Stavros Tsakalidis, and Regina Barzilay. 2014. Morphological segmentation for keyword spotting. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 880–885, Doha, Qatar, October. Association for Computational Linguistics.
Karthik Narasimhan, Regina Barzilay, and Tommi Jaakkola. 2015. An unsupervised method for uncovering morphological chains. Transactions of the Association for Computational Linguistics, 3:157–167.
OUP editors. 2010. New Oxford American Dictionary. Oxford University Press.
Sebastian Padó, Alexis Palmer, Max Kisselew, and Jan Šnajder. 2015. Measuring semantic content to assess asymmetry in derivation. In Proceedings of the 11th International Conference on Computational Semantics.
Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the Institute of Electrical and Electronics Engineers, 77(2):257–286.
Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. 2016. Weighting finite-state transductions with neural context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 623–633, San Diego, California, June. Association for Computational Linguistics.
Eric S. Ristad and Peter N. Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532.
Reuven Y. Rubinstein and Dirk P. Kroese. 2011. Simulation and the Monte Carlo Method. John Wiley & Sons.
Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 29–37, Sofia, Bulgaria, August. Association for Computational Linguistics.
Gerard Salton, editor. 1971. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall.
Sunita Sarawagi and William W. Cohen. 2005. Semi-Markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems, pages 1185–1192.
Patrick Schone and Daniel Jurafsky. 2000. Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of the 4th Conference on Computational Natural Language Learning, pages 67–72. Association for Computational Linguistics.
Patrick Schone and Daniel Jurafsky. 2001. Knowledge-free induction of inflectional morphologies. In Proceedings of the Second Meeting of the North American
Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–9. Association for Computational Linguistics.
Wolfgang Seeker and Özlem Çetinoğlu. 2015. A graph-based lattice dependency parser for joint morphological segmentation and syntactic analysis. Transactions of the Association for Computational Linguistics, 3:359–373.
Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1627–1637, Denver, Colorado, May–June. Association for Computational Linguistics.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
Charles Sutton and Andrew McCallum. 2006. An introduction to conditional random fields for relational learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning, pages 93–128. MIT Press.
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.
Antal van den Bosch and Walter Daelemans. 1999. Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 285–292, College Park, Maryland, USA, June. Association for Computational Linguistics.
Martin J. Wainwright and Michael I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.
Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1591–1601, Doha, Qatar, October. Association for Computational Linguistics.
Paul J. Werbos. 1990. Backpropagation through time: What it does and how to do it. Proceedings of the Institute of Electrical and Electronics Engineers, 78(10):1550–1560.
Richard Wicentowski. 2002. Modeling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. Ph.D. thesis, Johns Hopkins University.
Yadollah Yaghoobzadeh and Hinrich Schütze. 2015. Corpus-level fine-grained entity typing using contextual information. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 715–725, Lisbon, Portugal, September. Association for Computational Linguistics.
David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. In The 38th Annual Meeting of the Association for Computational Linguistics.
Wenpeng Yin and Hinrich Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 901–911, Denver, Colorado, May–June. Association for Computational Linguistics.
Britta D. Zeller, Jan Šnajder, and Sebastian Padó. 2013. DErivBase: Inducing and evaluating a derivational morphology resource for German. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1201–1211, Sofia, Bulgaria, August. Association for Computational Linguistics.
Britta D. Zeller, Sebastian Padó, and Jan Šnajder. 2014. Towards semantic validation of a derivational lexicon. In Proceedings of the 25th International Conference on Computational Linguistics, pages 1728–1739, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.