Transactions of the Association for Computational Linguistics, vol. 6, pp. 33–48, 2018. Action Editor: Regina Barzilay.
Submission batch: 5/2016; Revision batch: 10/2016; Published 1/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 License.

Joint Semantic Synthesis and Morphological Analysis of the Derived Word

Ryan Cotterell, Department of Computer Science, Johns Hopkins University, ryan.cotterell@jhu.edu
Hinrich Schütze, CIS, LMU Munich, inquiries@cislmu.org

Abstract

Much like sentences are composed of words, words themselves are composed of smaller units. For example, the English word questionably can be analyzed as question+able+ly. However, this structural decomposition of the word does not directly give us a semantic representation of the word's meaning. Since morphology obeys the principle of compositionality, the semantics of the word can be systematically derived from the meaning of its parts. In this work, we propose a novel probabilistic model of word formation that captures both the analysis of a word w into its constituent segments and the synthesis of the meaning of w from the meanings of those segments. Our model jointly learns to segment words into morphemes and compose distributional semantic vectors of those morphemes. We experiment with the model on English CELEX data and German DErivBase (Zeller et al., 2013) data. We show that jointly modeling semantics increases both segmentation accuracy and morpheme F1 by between 3% and 5%. Additionally, we investigate different models of vector composition, showing that recurrent neural networks yield an improvement over simple additive models. Finally, we study the degree to which the representations correspond to a linguist's notion of morphological productivity.

1 Introduction

In most languages, words decompose further into smaller units, termed morphemes. For example, the English word questionably can be analyzed as question+able+ly. This structural decomposition of the word, however, by itself is not a semantic representation of the word's meaning;¹ we further require an account of how to synthesize the meaning from the decomposition. Fortunately, words—just like phrases—to a large extent obey the principle of compositionality: the semantics of the word can be systematically derived from the meaning of its parts.² In this work, we propose a novel joint probabilistic model of word formation that captures both structural decomposition of a word w into its constituent segments and the synthesis of w's meaning from the meaning of those segments.

Morphological segmentation is a structured prediction task that seeks to break a word up into its constituent morphemes. The output segmentation has been shown to aid a diverse set of applications, such as automatic speech recognition (Afify et al., 2006), keyword spotting (Narasimhan et al., 2014), machine translation (Clifton and Sarkar, 2011) and parsing (Seeker and Çetinoğlu, 2015). In contrast to much of this prior work, we focus on supervised segmentation, i.e., we provide the model with gold segmentations during training time.

¹ There are many different linguistic and computational theories for interpreting the structural decomposition of a word. For example, un- often signifies negation and its effect on semantics can then be modeled by theories based on logic. This work addresses the question of structural decomposition and semantic synthesis in the general framework of distributional semantics.

² Morphological research in theoretical and computational linguistics often focuses on noncompositional or less compositional phenomena—simply because compositional derivation poses fewer interesting research problems. It is also true that—just as many frequent multiword units are not completely compositional—many frequent derivations (e.g., refusal, fitness) are not completely compositional. An indication that non-lexicalized derivations are usually compositional is the fact that standard dictionaries like OUP editors (2010) list derivational affixes with their compositional meaning, without a hedge that they can also occur as part of only partially compositional forms. See also Haspelmath and Sims (2013), §5.3.6.

Instead of surface segmentation, our model performs canonical segmentation (Cotterell et al., 2016a; Cotterell et al., 2016b; Kann et al., 2016), i.e., it allows the induction of orthographic changes together with the segmentation, which is not typical. For the example questionably, our model can restore the deleted characters le, yielding the canonical segments question, able and ly. In this work, our primary contribution lies in the integration of continuous semantic vectors into supervised morphological segmentation—we present a joint model of morphological analysis and semantic synthesis at the word level. We experimentally investigate three novel aspects of our model.

• First, we show that jointly modeling continuous representations of the semantics of morphemes and words allows us to improve morphological analysis. On the English portion of CELEX (Baayen et al., 1993), we achieve a 5 point improvement in segmentation accuracy and a 3 point improvement in morpheme F1. On the German DErivBase dataset we achieve a 3 point improvement in segmentation accuracy and a 3 point improvement in morpheme F1.

• Second, we explore improved models of vector composition for synthesizing word meaning. We find a recurrent neural network improves over previously proposed additive models. Moreover, we find that more syntactically oriented vectors (Levy and Goldberg, 2014a) are better suited for morphology than bag-of-words (BOW) models.

• Finally, we explore the productivity of English derivational affixes in the context of distributional semantics.

2 Derivational Morphology

Two important goals of morphology, the linguistic study of the internal structure of words, are to describe the relation between different words in the lexicon and to decompose them into morphemes, the smallest linguistic units bearing meaning. Morphology can be divided into two types: inflectional and derivational. Inflectional morphology is the set of processes through which the word form outwardly displays syntactic information, e.g., verb tense. It follows that an inflectional affix typically neither changes the part-of-speech (POS) nor the semantics of the word. For example, the English verb to run takes various forms: run, runs, ran and running, all of which convey "moving by foot quickly", but appear in complementary syntactic contexts.

Derivation deals with the formation of new words that have semantic shifts in meaning (often including POS) and is tightly intertwined with lexical semantics (Light, 1996). Consider the example of the English noun discontentedness, which is derived from the adjective discontented. It is true that both words share a close semantic relationship, but the transformation is clearly more than a simple inflectional marking of syntax. Indeed, we can go one step further and define a chain of words content ↦ contented ↦ discontented ↦ discontentedness.

In the computational literature, derivational morphology has received less attention than inflectional. There are, however, two bodies of work on derivation in computational linguistics. First, there is a series of papers that explore the relation between lexical semantics and derivation (Lazaridou et al., 2013; Zeller et al., 2014; Padó et al., 2015; Kisselew et al., 2015). All of these assume a gold morphological analysis and primarily focus on the effect of derivation on distributional semantics. The second body of work, e.g., the unsupervised morphological segmenter MORFESSOR (Creutz and Lagus, 2007), does not deal with semantics and makes no distinction between inflectional and derivational morphology.³ Even though the boundary between inflectional and derivational morphology is a continuum rather than a rigid divide (Haspelmath and Sims, 2013), there is still the clear distinction that derivation changes meaning whereas inflection does not. Our goal in this paper is to develop an account of how the meaning of a word form can be computed jointly, combining these two lines of work.

Productivity and Semantic Coherence. We highlight two related issues in derivation that motivated the development of our model: productivity and semantic coherence.

³ Narasimhan et al. (2015) also make no distinction between inflectional and derivational morphology, but their model is an exception in that it includes vector similarity as a semantic feature. See §5 for discussion.

Roughly, a productive affix is one that can still actively be employed to form new words in a language. For example, the English nominalizing affix ness (red ↦ red+ness) can be attached to just about any adjective, including novel forms. In contrast, the archaic English nominalizing affix th (dear ↦ dear+th, heal ↦ heal+th, steal ↦ steal+th) does not allow us to form new words such as cheapth. This is a crucial issue in derivational morphology since we would not in general want to analyze new words as having been formed from non-productive endings; e.g., we do not want to analyze hearth as hear+th (or wugth as wug+th). Relations such as those between heal and health are lexicalized since they no longer can be derived by productive processes (Bauer, 1983).

Under a generative treatment (Chomsky, 1965) of morphology, productivity becomes a central notion since a grammar needs to account for active word formation processes in the language (Aronoff, 1976). Defining productivity precisely, however, is tricky; Aronoff (1976) writes, "one of the central mysteries of derivational morphology … [is that] … though many things are possible in morphology, some are more possible than others." Nevertheless, speakers often have clear intuitions about which affixes in the language are productive.⁴

Related to productivity is the notion of semantic coherence. The principle of compositionality (Frege, 1892; Heim and Kratzer, 1998) applies to interpretation of words just as it does to phrases. Indeed, compositionality is often taken to be a signal for productivity (Aronoff, 1976). When deciding whether to further decompose a word, asking whether the parts sum up to the whole is often a good indicator. In the case of questionably ↦ question+able+ly, the compositional meaning is "in a manner that could be questioned", which corresponds to the meaning of the word. Contrast this with the word unquiet, which means "restless", rather than "not quiet", and the compound blackmail, which does not refer to a letter written in black ink.

The model we will describe in §3 is a joint model of both semantic coherence and segmentation; that is, an analysis is judged not only by character-level features, but also by the degree to which the word is semantically compositional. Implicit in such a treatment is the desire to only segment a word if the segmentation is derived from a productive process. While most prior work on morphological segmentation has not explicitly modeled productivity,⁵ we believe, from a computational modeling perspective, segmenting only productive affixes is preferable. This is analogous to the modeling of phrase compositionality in embedding models, where it can be better to not further decompose noncompositional multiword units like named entities and idiomatic expressions; see, e.g., Mikolov et al. (2013b), Wang et al. (2014), Yin and Schütze (2015), Yaghoobzadeh and Schütze (2015), and Hashimoto and Tsuruoka (2016).⁶

In this paper, we refer to the semantic aspect of the model either as semantic synthesis or as coherence. These are two ways of looking at semantics that are related as follows. If the synthesis (i.e., composition) of the meaning of the derived form from the meaning of its parts is a regular application of the linguistic rules of derivation, then the meaning so constructed is coherent. These are the cases where a joint model is expected to be beneficial for both segmentation and interpretation.

3 A Joint Model

From an NLP perspective, canonical segmentation (Naradowsky and Goldwater, 2009; Cotterell et al., 2016b) is the task that seeks to algorithmically decompose a word into its canonical sequence of morphemes. It is a version of morphological segmentation that requires the learner to handle orthographic changes that take place during word formation. We believe this is a more natural formulation of morphological analysis—especially for the processing of derivational morphology—as it draws heavily on linguistic notions (see §2).

⁴ It is also important to distinguish productivity from creativity—a non-rule-governed form of word formation (Lyons, 1977). As an example of creativity, consider the creation of portmanteaux, e.g., dramedy and soundscape.

⁵ Note that segmenters such as MORFESSOR utilize the principle of minimum description length, which implicitly encodes productivity, in order to guide segmentation.

⁶ As a reviewer points out, productivity of an affix and semantic coherence of the words formed from it are not perfectly aligned. Nonproductive affixes can produce semantically coherent words, e.g., warm ↦ warm+th. Productive affixes can produce semantically incoherent words, e.g., canny ↦ un+canny. Again, this is analogous to multiword units. However, there is a strong correlation and our experiments show that relying on it gives good results.

[Figure 1 shows the surface form unquestionably, its underlying form unquestionablely, the labeled segmentation un:prefix + question:stem + able:suffix + ly:suffix, and the vector composition producing the target embedding.]

Figure 1: A depiction of the joint model that makes the relation between the three factors and the observed surface form explicit. We show a simple additive model of composition for ease of explication.

The main innovation we present is the augmentation of canonical segmentation to take into account semantic coherence and productivity. Consider the word hypercuriosity and its canonical segmentation hyper+curious+ity; this canonical segmentation seeks to decompose the word into its constituent morphemes and account for orthographic changes. This amounts to a structural decomposition of the word, i.e., how do we break up the string of characters into chunks? This is similar to the decomposition of a sentence into a parse tree. However, it is also natural to consider the semantic compositionality of a word, i.e., how is the meaning of the word synthesized from the meaning of the individual morphemes?

We consider both of these questions together in a single model, where we would like to place high probability on canonical segmentations that are also semantically coherent. Returning to hypercuriosity, we could further decompose it into hyper+cure+ous+ity in analogy to, say, vice ↦ vicious. Nothing about the surface form of curious alone gives us a strong cue that we should rule out the segmentation cure+ous. Turning to distributional semantics, however, it is the case that the contexts in which curious occurs are quite different from those in which cure occurs. This gives us a strong cue which segmentation is correct.

Formally, given a word string w ∈ Σ*, where Σ is a discrete alphabet of characters (in English this could be as simple as the 26 letter lowercase alphabet), and a word vector v ∈ V, where V is a set of low-dimensional word embeddings, we define the model as

p(v, s, l, u | w) = (1 / Z_θ(w)) exp( −(1/(2σ²)) ||v − C_β(s, l)||₂² + f(s, l, u)ᵀη + g(u, w)ᵀω )    (1)

This model is composed of three factors: the composition factor exp(−(1/(2σ²)) ||v − C_β(s, l)||₂²), the segmentation factor f and the transduction factor g. The parameters of the model are θ = {β, η, ω}, the function C_β composes morpheme vectors together, s is the segmentation, l is the labeling of the segments, u is the underlying representation and Z_θ(w) is the partition function. Note that the conditional distribution p(v | s, l, u, w) is Gaussian distributed by construction. A visualization of our model is found in Figure 1.

This model is a conditional random field (CRF) that is mixed, i.e., it is defined over both discrete and continuous random variables (Koller and Friedman, 2009). We restrict the range of u to be a subset of Σ^(|w|+k), where k is an insertion limit (Dreyer, 2011). In this work, we take k = 5. Explicitly, the partition function is defined as

Z_θ(w) = ∫ Σ_{l′,s′,u′} exp( −(1/(2σ²)) ||v′ − C_β(s′, l′)||₂² + f(s′, l′, u′)ᵀη + g(u′, w)ᵀω ) dv′    (2)

which is guaranteed to be finite.⁷

A CRF is simply the globally renormalized product of several non-negative factors (Sutton and McCallum, 2006). Our model is composed of three: transduction, segmentation and composition factors—we describe each in turn.

⁷ Since we have capped the insertion limit, we have a finite number of values that u can take for any w. Thus, it follows that we have a finite number of canonical segmentations s. Hence we take a finite number of Gaussian integrals. These integrals all converge since we have fixed the covariance matrix as σ²I, which is positive definite.
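To make the three-factor structure of Equation 1 concrete, here is a minimal sketch of the unnormalized log-score (the quantity inside the exp, i.e., log p up to log Z_θ(w)). The function and argument names are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def unnormalized_log_score(v, mu, f_feats, g_feats, eta, omega, sigma2):
    """Log of the product of the three factors in Eq. (1), up to log Z_theta(w).

    v       : observed word embedding
    mu      : C_beta(s, l), the composed morpheme vectors of a candidate analysis
    f_feats : feature vector f(s, l, u) of the labeled segmentation
    g_feats : feature vector g(u, w) of the transduction (UR, SR) pair
    """
    composition = -np.sum((v - mu) ** 2) / (2.0 * sigma2)  # Gaussian composition factor
    return composition + f_feats @ eta + g_feats @ omega   # plus the two linear factors
```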

3.1 Transduction Factor

The first factor we consider is the transduction factor exp(g(u, w)ᵀω), which scores a surface representation (SR) w, the character string observed in raw text, and an underlying representation (UR) u, a character string with orthographic processes reversed. The aim of this factor is to place high weight on good pairs, e.g., the pair (w = questionably, u = questionablely), so we can accurately restore character-level changes. We encode this portion of the model as a weighted finite-state machine for ease of computation. This factor generalizes probabilistic edit distance (Ristad and Yianilos, 1998) by looking at additional input and output context; see Cotterell et al. (2014) for details. As mentioned above and in contrast to Cotterell et al. (2014), we bound the insertion limit in the edit distance model.⁸ Computing the score between two strings u and w requires a dynamic program that runs in O(|u| · |w|). This is a generalization of the forward algorithm for Hidden Markov Models (HMMs) (Rabiner, 1989).

We employ standard feature templates for the task that look at features of edit operations, e.g., substitute i for y, in varying context granularities. See Cotterell et al. (2016b) for details. Recent work has also explored weighting of WFST arcs with scores computed by LSTMs (Hochreiter and Schmidhuber, 1997), obviating the need for human selection of feature templates (Rastogi et al., 2016).

3.2 Segmentation Factor

The second factor is the segmentation factor exp(f(s, l, u)ᵀη). The goal of this factor is to score a segmentation s of a UR u. In our example, it scores the input-output pair (u = questionablely, s = question+able+ly). It additionally scores a labeling of the segmentation. Our label set in this work is L = {stem, prefix, suffix}. The proper labeling of the segmentation above is l = question:stem + able:suffix + ly:suffix. The labeling is critical for our composition functions C_β (Cotterell et al., 2015): which vectors are used depends on the label given to the segment; e.g., the vectors of the prefix "post" and the stem "post" are different.

model     composition function
stem      c = Σ_{i=1}^N 1{l_i = stem} m_{l_i s_i}
mult      c = ⊙_{i=1}^N m_{l_i s_i}
add       c = Σ_{i=1}^N m_{l_i s_i}
wadd      c = Σ_{i=1}^N α_i m_{l_i s_i}
fulladd   c = Σ_{i=1}^N U_i m_{l_i s_i}
LDS       h_i = X h_{i−1} + U m_{l_i s_i}
RNN       h_i = tanh(X h_{i−1} + U m_{l_i s_i})

Table 1: Composition models C_β(s, l) used in this and prior work. The representation of the word is h_N for the dynamic and c for the non-dynamic models. Note that for the dynamic models h₀ is a learned parameter.

We can view this factor as an unnormalized first-order semi-CRF (Sarawagi and Cohen, 2005). Computation of the factor again requires dynamic programming. The algorithm is a different generalization of the forward algorithm for HMMs, one that extends it to the semi-Markov case. This algorithm runs in O(|u|² · |L|²).

Features. We again use standard feature templates for the task. We create atomic indicator features for the individual segments. We then conjoin the atomic features with left and right context features as well as the label to create more complex feature templates. We also include transition features that fire on pairs of sequential labels. See Cotterell et al. (2015) for details. Recent work has also showed that a neural parameterization can remove the need for manual feature design (Kong et al., 2016).

3.3 Composition Factor

The composition factor takes the form of an unnormalized multivariate Gaussian density exp(−(1/(2σ²)) ||v − C_β(s, l)||₂²), where the mean is computed by the (potentially non-linear) composition function (see Table 1) and the covariance matrix σ²I is a diagonal matrix. The goal of the composition function C_β(s, l) is to stitch together morpheme embeddings to approximate the vector of the entire word.

The simplest form of the composition function C_β(s, l) is add, an additive model of the morphemes. See Table 1: each vector m_{l_i s_i} refers to a morpheme-specific, label-dependent embedding.

⁸ As our transduction model is an unnormalized factor in a CRF, we do not require the local normalization discussed in Cotterell et al. (2014)—a weight on an edge may be any non-negative real number since we will renormalize later. The underlying model, however, remains the same.
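To make Table 1 concrete, here is a minimal sketch of the add and RNN composers, assuming the label-dependent morpheme embeddings m_{l_i s_i} are given as a list of NumPy arrays; the function names are ours, not the paper's.

```python
import numpy as np

def compose_add(morph_vecs):
    """add (Table 1): c = sum_i m_{l_i s_i}."""
    return np.sum(morph_vecs, axis=0)

def compose_rnn(morph_vecs, X, U, h0):
    """RNN (Table 1): h_i = tanh(X h_{i-1} + U m_{l_i s_i}); the word vector is h_N."""
    h = h0  # h_0 is a learned parameter in the paper's model
    for m in morph_vecs:
        h = np.tanh(X @ h + U @ m)
    return h
```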

If l_i = stem, then s_i represents a stem morpheme. Given that our segmentation is canonical, an s_i that is a stem is generally itself an entry in the lexicon and v(s_i) ∈ V. If v(s_i) ∉ V, then we set v(s_i) to 0.⁹ We optimize over vectors with l_i ∈ {prefix, suffix} as they correspond to bound morphemes.

We also consider a more expressive composition model, a recurrent neural network (RNN). Let N be the number of segments. Then C_β(s, l) = h_N where h_i is a hidden vector, defined by the recursion:¹⁰ h_i = tanh(X h_{i−1} + U m_{l_i s_i}) (Elman, 1990). Again, we optimize the morpheme embeddings m_{l_i s_i} only when l_i ≠ stem along with the other parameters of the RNN, i.e., the matrices U and X.

4 Inference and Learning

Exact inference is intractable since we allow arbitrary segment-level features on the canonicalized word forms u. Since the semi-CRF factor has features that fire on substrings, we would need a dynamic programming state for each substring of each of the exponentially many settings of u; this breaks the dynamic program. We thus turn to approximate inference through an importance sampling routine (Rubinstein and Kroese, 2011).

4.1 Inference by Importance Sampling

Rather than considering all underlying orthographic forms u and segmentations s, we sample from a tractable proposal distribution q—a distribution over canonical segmentations. In the following equations we omit the dependence on w for notational brevity and define h(l, s, u) = f(s, l, u) + g(u, w). Crucially, the partition function Z_θ(w) is not a function of parameter subvector β and its gradient with respect to β is 0.¹¹ Recall that computing the gradient of the log-partition function is equivalent to the problem of marginal inference (Wainwright and Jordan, 2008). We derive our estimator as follows:

∇_θ log Z = E_{(l,s,u)∼p}[h(l, s, u)]    (3)
          = Σ_{l,s,u} p(l, s, u) h(l, s, u)    (4)
          = Σ_{l,s,u} q(l, s, u) (p(l, s, u) / q(l, s, u)) h(l, s, u)    (5)
          = E_{(l,s,u)∼q}[ (p(l, s, u) / q(l, s, u)) h(l, s, u) ]    (6)

where we have omitted the dependence on w (which we condition on) and v (which we marginalize out). So long as q has support everywhere p does (i.e., p(l, s, u) > 0 ⇒ q(l, s, u) > 0), the estimate is unbiased. Unfortunately, we can only efficiently compute p(l, s, u) up to a constant factor, p(l, s, u) = p̄(l, s, u)/Z′_θ(w). Thus, we use the indirect importance sampling estimator,

(1 / Σ_{i=1}^M w^{(i)}) Σ_{i=1}^M w^{(i)} h(l^{(i)}, s^{(i)}, u^{(i)})    (7)

where (l^{(1)}, s^{(1)}, u^{(1)}), …, (l^{(M)}, s^{(M)}, u^{(M)}) are drawn i.i.d. from q and the importance weights w^{(i)} are defined as:

w^{(i)} = p̄(l^{(i)}, s^{(i)}, u^{(i)}) / q(l^{(i)}, s^{(i)}, u^{(i)})    (8)

This indirect estimator is biased, but consistent.¹²

Proposal Distribution. The success of importance sampling depends on the choice of a "good" proposal distribution, i.e., one that ideally is close to p. Since we are fully supervised at training time, we have the option of training locally normalized distributions for the individual components. Concretely, we train two proposal distributions q₁(u | w) and q₂(l, s | u) that take the form of a WFST and a semi-CRF, respectively, using features identical to the joint model.

⁹ This is not changed in training, so all such v(s_i) are 0 in the final model. Clearly, this could be improved in future work as a reviewer points out, e.g., by setting such v(s_i) to an average of a suitably chosen set of known word vectors.

¹⁰ We do not explore more complex RNNs, e.g., LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Cho et al., 2014a), as words in our data have ≤ 7 morphemes. These architectures make the learning of long distance dependencies easier, but are no more powerful than an Elman RNN, at least in theory. Note that perhaps if applied to languages with richer derivational morphology than English, considering more complex neural architectures would make sense.

¹¹ The subvector β is responsible for computing only the mean of the Gaussian factor and thus has no impact on its normalization coefficient (Murphy, 2012).

¹² Informally, the indirect importance sampling estimate converges to the true expectation as M → ∞ (the definition of statistical consistency).
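The indirect estimator of Equations 7 and 8 is a self-normalized importance sampler; a minimal sketch follows, with all function names being our illustrative assumptions rather than the authors' code.

```python
import numpy as np

def indirect_is_estimate(samples, log_pbar, log_q, h):
    """Estimate E_p[h] via Eqs. (7)-(8) from analyses sampled i.i.d. from q.

    samples  : list of (l, s, u) analyses drawn from the proposal q
    log_pbar : function giving the unnormalized model log-score log p-bar(l, s, u)
    log_q    : function giving the proposal log-probability log q(l, s, u)
    h        : function mapping an analysis to its feature vector h(l, s, u)
    """
    log_w = np.array([log_pbar(x) - log_q(x) for x in samples])
    log_w -= log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()                  # self-normalization: biased but consistent
    return sum(w_i * h(x) for w_i, x in zip(w, samples))
```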

Each of these distributions is tractable—we can compute the marginals with dynamic programming and thus sample efficiently. To draw samples (l, s, u) ∼ q, we sample sequentially from q₁ and then q₂, conditioned on the output of q₁.

4.2 Learning

We optimize the log-likelihood of the model using ADAGRAD (Duchi et al., 2011), which is SGD with a special per-parameter learning rate. The full gradient of the objective for one training example is:

∇_θ log p(v, s, l, u | w) = f(s, l, u)ᵀ + g(u, w)ᵀ − (1/σ²)(v − C_β(s, l)) ∇_θ C_β(s, l) − ∇_θ log Z_θ(w)    (9)

where we use the importance sampling algorithm described in §4.1 to approximate the gradient of the log-partition function, following Bengio and Senecal (2003). Note that ∇_θ C_β(s, l) depends on the composition function used. In the most complicated case, when C_β is an RNN, we can compute ∇_β C_β(s, l) efficiently with backpropagation through time (Werbos, 1990). We take M = 10 importance samples; using so few samples can lead to a poor estimate of the gradient, but for our application it suffices. We employ L2 regularization.

4.3 Decoding

Decoding the model is also intractable. To approximate the solution, we again employ importance sampling. We take M = 10,000 importance samples and select the highest weighted sample.
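A minimal sketch of this decoding scheme, under the assumption that sampling from the proposal and scoring are available as the (hypothetical) functions below: draw M analyses, score each by its log importance weight, and keep the best.

```python
import math

def decode(w, sample_q, log_pbar, log_q, M=10_000):
    """Approximate MAP decoding by importance sampling (Section 4.3)."""
    best, best_score = None, -math.inf
    for _ in range(M):
        l, s, u = sample_q(w)                              # (l, s, u) ~ q(. | w)
        score = log_pbar(l, s, u, w) - log_q(l, s, u, w)   # log importance weight
        if score > best_score:
            best, best_score = (l, s, u), score
    return best
```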
5 Related Work

The idea that vector semantics is useful for morphological segmentation is not new. Count vectors (Salton, 1971; Turney and Pantel, 2010) have been shown to be beneficial in the unsupervised induction of morphology (Schone and Jurafsky, 2000; Schone and Jurafsky, 2001). Embeddings were shown to act similarly (Soricut and Och, 2015). Our method differs from this line of research in two key ways. (i) We present a probabilistic model of the process of synthesizing the word's meaning from the meaning of its morphemes. Prior work was either not probabilistic or did not explicitly model morphemes. (ii) Our method is supervised and focuses on derivation. Schone and Jurafsky (2000) and Soricut and Och (2015), being fully unsupervised, do not distinguish between inflection and derivation, and Schone and Jurafsky (2001) focus on inflection. More recently, Narasimhan et al. (2015) look at the unsupervised induction of "morphological chains" with semantic vectors as a crucial feature. Their goal is to jointly figure out an ordering of word formation and a morphological segmentation, e.g., play ↦ playful ↦ playfulness. While it is a rich model like ours, theirs differs in that it is unsupervised and uses vectors as features, rather than explicitly treating vector composition. All of the above work focuses on surface segmentation and not canonical segmentation, as we do.

A related line of work that has different goals concerns morphological generation. Two recent papers that address this problem using deep learning are Faruqui et al. (2016a) and Faruqui et al. (2016b). In an older line of work, Yarowsky and Wicentowski (2000) and Wicentowski (2002) exploit log frequency ratios of inflectionally related forms to tease apart that, e.g., the past tense of sing is not singed, but instead sang. Related work by Dreyer and Eisner (2011) uses a Dirichlet process to model a corpus as a "mixture of a paradigm", allowing for the semi-supervised incorporation of distributional semantics into a structured model of inflectional paradigm completion.

Our work is also related to recent attempts to integrate morphological knowledge into general embedding models. For example, Botha and Blunsom (2014) train a log-bilinear language model that models the composition of morphological structure. Likewise, Luong et al. (2013) train a recursive neural network (Goller and Küchler, 1996) over a heuristically derived tree structure to learn morphological composition over continuous vectors. Our work is different in that we learn a joint model of segmentation and composition. Moreover, supervised morphological analysis can drastically outperform unsupervised analysis (Ruokolainen et al., 2013).

Early work by Kay (1977) can be interpreted as finite-state canonical segmentation, but it neither addresses nor experimentally evaluates the question of joint modeling of morphological analysis and semantic synthesis. Moreover, we may view canonicalization as an orthographic analogue to phonology.

On this interpretation, the finite-state systems of Kaplan and Kay (1994), which computationally apply SPE-style phonological rules (Chomsky and Halle, 1968), may be run backwards to get canonical underlying forms.

6 Experiments and Results

We conduct experiments on English and German derivational morphology. We analyze our joint model's ability to segment words into their canonical morphemes as well as its ability to compositionally derive vectors for new words. Finally, we explore the relationship between distributional semantics and morphological productivity.

For English, we use the pretrained vectors of Levy and Goldberg (2014a) for all experiments. For German, we train word2vec skip-gram vectors on the German Wikipedia. We first describe our English dataset, the subset of the English portion of the CELEX lexical database (Baayen et al., 1993) that was selected by Lazaridou et al. (2013); the dataset contains 10,000 forms. This allows for comparison with previously proposed methods. We make two modifications. (i) Lazaridou et al. (2013) make the two-morpheme assumption: every word is composed of exactly two morphemes. In general, this is not true, so we further segment all complex words in the corpus. For example, friendless+ness is further segmented into friend+less+ness. To nevertheless allow for fair comparison, we provide versions of our experiments with and without the two-morpheme assumption where appropriate. (ii) Lazaridou et al. (2013) only provide a single train/test split. As we require a held-out development set for hyperparameter tuning, we randomly allocate a portion of the training data to select the hyperparameters and then retrain the model using these parameters on the original train split. We also report 10-fold cross validation results in addition to Lazaridou et al.'s train/test split. Our German dataset is taken from Zeller et al. (2013) and is described in Cotterell et al. (2016b). It, again, consists of 10,000 derivational forms. We report results on 10-fold cross validation.

6.1 Experiment 1: Canonical Segmentation

For our first experiment, we test whether jointly modeling the continuous representations allows us to segment words more accurately. We assume that we are given an embedding for the target word. We estimate the model p(v, s, l, u | w) as described in §4 with L2 regularization λ||θ||₂². To evaluate, we decode the distribution p(s, l, u | v, w). We perform approximate MAP inference with importance sampling—taking the sample with the highest score.

                          dev                                    test
    Model                 Acc          F1           Edit         Acc          F1           Edit
EN  Semi-CRF (Baseline)   0.55 (.018)  0.75 (.014)  0.80 (.043)  0.54 (.018)  0.75 (.014)  0.78 (.034)
    Joint (Baseline)      0.77 (.011)  0.87 (.007)  0.41 (.029)  0.77 (.013)  0.87 (.007)  0.43 (.029)
    Joint+Vec (This Work) 0.83 (.014)  0.91 (.008)  0.31 (.019)  0.82 (.020)  0.90 (.011)  0.32 (.038)
    Joint+UR (Oracle)     0.94 (.015)  0.96 (.009)  0.07 (.016)  0.94 (.011)  0.96 (.007)  0.07 (.011)
    Joint+UR+Vec (Oracle) 0.95 (.011)  0.97 (.007)  0.05 (.013)  0.95 (.023)  0.97 (.006)  0.05 (.025)
DE  Semi-CRF (Baseline)   0.39 (.062)  0.68 (.039)  1.15 (.230)  0.39 (.058)  0.68 (.042)  1.14 (.240)
    Joint (Baseline)      0.79 (.107)  0.88 (.069)  0.40 (.313)  0.79 (.099)  0.87 (.063)  0.41 (.282)
    Joint+Vec (This Work) 0.82 (.102)  0.90 (.067)  0.33 (.312)  0.82 (.096)  0.90 (.061)  0.33 (.282)
    Joint+UR (Oracle)     0.86 (.108)  0.90 (.070)  0.25 (.288)  0.86 (.100)  0.90 (.064)  0.25 (.268)
    Joint+UR+Vec (Oracle) 0.87 (.106)  0.92 (.069)  0.20 (.285)  0.88 (.096)  0.93 (.062)  0.19 (.263)

Table 2: Results for the canonical morphological segmentation task on English and German. Standard deviation is given in parentheses. We compare against two baselines that do not make use of semantic vectors: (i) "Semi-CRF (Baseline)", a semi-CRF that cannot account for orthographic changes, and (ii) "Joint (Baseline)", a version of our joint model without vectors. We also compare against an oracle version with access to gold URs ("Joint+UR (Oracle)", "Joint+UR+Vec (Oracle)"), revealing that the toughest part of the canonical segmentation task is reversing the orthographic changes.

                EN: BOW2       EN: BOW5       EN: DEPs       DE: SG
                dev    test    dev    test    dev    test    dev    test
oracle  stem    .403   .402    .374   .376    .422   .422    .400   .405
        add     .635   .635    .541   .542    .787   .785    .712   .711
        LDS     .660   .660    .566   .568    .806   .804    .717   .718
        RNN     .660   .660    .565   .567    .807   .806    .707   .712
joint   stem    .399   .400    .371   .372    .411   .412    .394   .398
        add     .625   .625    .524   .525    .782   .781    .705   .704
        LDS     .648   .648    .547   .547    .799   .797    .712   .711
        RNN     .649   .647    .547   .546    .801   .799    .706   .708
char    GRU     .586   .585    .452   .452    .769   .768    .675   .667
        LSTM    .586   .586    .455   .455    .768   .767    .677   .666

Table 3: Vector approximation (measured by mean cosine similarity) both with ("oracle") and without ("joint", "char") gold morphology. Surprisingly, joint models are close in performance to models with gold morphology.

In these experiments, we use the RNN with the dependency vectors, the combination of which performs best on vector approximation in §6.2. We follow the experimental design of Cotterell et al. (2016b). We compare against two baselines (marked "Baseline" in Table 2): (i) a "Semi-CRF" segmenter that cannot account for orthographic changes and (ii) the full "Joint" model of Cotterell et al. (2016b).¹³ We additionally consider an "Oracle" setting, where we give the model the gold underlying orthographic form ("UR") at both training and test time. This gives us insight into the performance of the transduction factor of our model, i.e., how much could we benefit from a richer model. Our hyperparameters are (i) the regularization coefficient λ and (ii) σ², the variance of the Gaussian factor. We use grid search to tune them: λ ∈ {0.0, 10¹, 10², 10³, 10⁴, 10⁵}, σ² ∈ {0.25, 0.5, 0.75, 1.0}.

Metrics. We use three metrics to evaluate segmentation accuracy. Note that the evaluation of canonical segmentation is hard since a system may return a sequence of morphemes whose concatenation is not the same length as the concatenation of the gold morphemes. This rules out metrics for surface segmentation like border F1 (Kurimo et al., 2010), which require the strings to be of the same length. We now define the metrics. (i) Segmentation accuracy measures whether every single canonical morpheme in the returned sequence is correct. It is inflexible: closer answers are penalized the same as more distant answers. (ii) Morpheme F1 (van den Bosch and Daelemans, 1999) takes the predicted sequence of canonical morphemes, turns it into a set, computes precision and recall in the standard way and based on that then computes F1. This metric gives credit if some of the canonical morphemes were correct. (iii) Levenshtein distance joins the canonical segments with a special symbol # into a single string and computes the Levenshtein distance between predicted and gold strings. A code sketch of the latter two metrics follows this section.

Discussion. Results in Table 2 show that jointly modeling semantic coherence improves our ability to analyze words. For test, our proposed joint model ("This Work") outperforms the baseline supervised canonical segmenter, which is state-of-the-art for the task, by .05 (resp. .03) on accuracy and .03 (resp. .03) on F1 for English (resp. German). We also find that when we give the joint model an oracle UR the vectors generally help less: .01 (resp. .02) on accuracy and .01 (resp. .03) on F1 for English (resp. German). This indicates that the chief boon the vector composition factor provides lies in selection of an appropriate UR. Moreover, the up to .15 difference in English between systems with and without the oracle UR suggests that reversing orthographic changes is a particularly difficult part of the task, at least for English.

6.2 Experiment 2: Vector Approximation

We adopt the experimental design of Lazaridou et al. (2013). Its aim is to approximate a vector of a derivationally complex word using a learned model of composition. As Lazaridou et al. (2013) assume a gold morphological analysis, we compare two settings: (i) oracle morphological analysis and (ii) inferred morphological analysis. To the best of our knowledge, (ii) is a novel experimental condition that no previous work has addressed.

We consider four composition models (see Table 1). (i) stem, using just the stem vector. This baseline tells us what happens if we make the incorrect assumption that derivation behaves like inflection and is not meaning-changing. (ii) add, a purely additive model. This is arguably the simplest way of combining the vectors of the morphemes. (iii) LDS, a linear dynamical system. This is arguably the simplest sequence model. (iv) A (simple) RNN. Recurrent neural networks are currently the most widely used nonlinear sequence model and simple RNNs are the simplest such models.

¹³ I.e., a model without the Gaussian factor that scores vectors.
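As a worked illustration of the metrics defined in §6.1 above, here is a sketch of morpheme F1 and the Levenshtein-based metric; segmentations are assumed to be lists of canonical morpheme strings, and the function names are ours.

```python
def morpheme_f1(pred, gold):
    """Morpheme F1: turn both morpheme sequences into sets, then compute F1."""
    p, g = set(pred), set(gold)
    if not p or not g:
        return 0.0
    prec, rec = len(p & g) / len(p), len(p & g) / len(g)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def levenshtein_metric(pred, gold):
    """Join canonical segments with '#' and compute plain Levenshtein distance."""
    a, b = "#".join(pred), "#".join(gold)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]
```

For example, on pred = ["question", "able", "ly"] against the identical gold analysis, morpheme_f1 returns 1.0 and levenshtein_metric returns 0.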

            model     all   HR    LR    -less  in-   un-
Lazaridou   stem      .47   .52   .32   .22    .39   .33
            mult      .39   .43   .28   .23    .34   .33
            dil.      .48   .53   .33   .30    .45   .41
            wadd      .50   .55   .38   .24    .40   .34
            fulladd   .56   .61   .41   .38    .47   .44
            lexfunc   .54   .58   .42   .44    .45   .46
BOW2        stem      .43   .44   .38   .32    .43   .51
            add       .65   .67   .61   .60    .64   .67
            LDS       .67   .69   .62   .61    .66   .67
            RNN       .67   .69   .60   .60    .65   .66
            c-GRU     .59   .60   .55   .59    .55   .57
            c-LSTM    .52   .53   .50   .55    .50   .50
BOW5        stem      .40   .43   .33   .27    .37   .46
            add       .56   .59   .51   .46    .55   .59
            LDS       .58   .61   .51   .48    .57   .60
            RNN       .58   .61   .50   .48    .56   .58
            c-GRU     .45   .47   .42   .42    .43   .45
            c-LSTM    .46   .47   .43   .43    .45   .46
DEPs        stem      .46   .45   .49   .38    .57   .67
            add       .79   .79   .77   .78    .80   .80
            LDS       .80   .81   .77   .79    .81   .81
            RNN       .81   .82   .77   .79    .80   .81
            c-GRU     .75   .76   .72   .78    .74   .75
            c-LSTM    .75   .76   .71   .77    .72   .73

Table 4: Vector approximation (measured by mean cosine similarity) with gold morphology on the train/test split of Lazaridou et al. (2013). HR/LR = high/low-relatedness words. See Lazaridou et al. (2013) for details.

Part of the motivation for considering a richer class of models lies in our removal of the two-morpheme assumption. Indeed, it is unclear that the wadd and fulladd models (Mitchell and Lapata, 2008) are useful models in the general case of multi-morphemic words—the weights are tied by position, i.e., the first morpheme's vector (be it a prefix or stem) is always multiplied by the same matrix.

Comparison with Lazaridou et al. To compare with Lazaridou et al. (2013), we use their exact train/test split. Those results are reported in Table 4. This dataset enforces that all words are composed of exactly two morphemes. Thus, a word like unquestionably is segmented as un+questionably, without further decomposition. The vectors employed by Lazaridou et al. (2013) are high-dimensional count vectors derived from lemmatized and POS tagged text with a before-and-after window of size 2. They then apply pointwise mutual information (PMI) weighting and dimensionality reduction by non-negative matrix factorization. In contrast, we employ WORD2VEC (Mikolov et al., 2013a), a model that is also interpretable as the factorization of a PMI matrix (Levy and Goldberg, 2014b). We consider three WORD2VEC models: two bag-of-words (BOW) models with before-and-after windows of size 2 and 5 and DEPs (Levy and Goldberg, 2014a), a dependency-based model whose context is derived from dependency parses rather than BOW.

In general, the results indicate that the key to better vector approximation is not a richer model of composition, but rather lies in the vectors themselves. We find that our best model, the RNN, only marginally edges out the LDS. Additionally, looking at the "all" column and the DEPs vectors, the simple additive model is only ≤ .02 lower than LDS. In comparison, we observe large differences between the vectors. The RNN+DEPs model is .23 better than the BOW5 models (.81 vs. .58), .14 better than the BOW2 models (.81 vs. .67) and .25 better than Lazaridou et al.'s best model (.81 vs. .56). A wider context for BOW (5 instead of 2) yields worse results. This suggests that syntactic information, or at least positional information, is necessary for improved models of morpheme composition. The test vectors are annotated for relatedness, which is a proxy for semantic coherence. HR (high-relatedness) words were judged to be more compositional than LR (low-relatedness) words.

Character-Level Neural Retrofitting. As a further strong baseline, we consider a retrofitting (Faruqui et al., 2015) approach based on character-level recurrent neural networks. Recently, running a recurrent net over the character stream has become a popular way of incorporating subword information into a model—empirical gains have been observed in a diverse set of NLP tasks: POS tagging (dos Santos and Zadrozny, 2014; Ling et al., 2015), parsing (Ballesteros et al., 2015) and language modeling (Kim et al., 2016).

To the best of our knowledge, character-level retrofitting is a novel approach. Given a vector v for a word form w, we seek a function to minimize the following objective

(1/2) ||v − h_N||₂²    (10)

where h_N is the final hidden state of a recurrent neural architecture, i.e.,

h_i = σ(A h_{i−1} + B w_i)    (11)

where σ is a non-linearity, w_i is the ith character in w, h_{i−1} is the previous hidden state and A and B are matrices. While we have defined the architecture for a vanilla RNN, we experiment with two more advanced recurrent architectures: GRUs (Cho et al., 2014b) and LSTMs (Hochreiter and Schmidhuber, 1997), as well as deep variants (Sutskever et al., 2014; Gillick et al., 2016; Firat et al., 2016). Importantly, this model has no knowledge of morphology—it can only rely on representations it extracts from the characters. This gives us a clear ablation on the benefit of adding structured morphological knowledge. We optimize the depth and the size of the hidden units on development data using a coarse-grained grid search. We found a depth of 2 and hidden units of size 100 (in both LSTM and GRU) performed best. We trained all models for 100 iterations of Adam (Kingma and Ba, 2015) with L2 regularization with regularization coefficient 0.01.

Table 4 shows that the two character-level models ("c-GRU" and "c-LSTM") perform much worse than our models. This indicates that supervised morphological analysis produces higher-quality vector representations than "knowledge-poor" character-level models. However, we note that these character-level models have fewer parameters than our morpheme-level models—there are many more morphemes in a language than characters.

Oracle Morphology. In general, the two-morpheme assumption is incorrect. We consider an expanded setting of Lazaridou et al. (2013)'s task, in which we fully decompose the word, e.g., unquestionably ↦ un+question+able+ly. These results are reported in Table 3 (top block, "oracle"). We report mean cosine similarity. Standard deviations s for 10-fold cross-validation (not shown) are small (.012) with two exceptions: s = .044 for the DEPs-joint-stem results (.411 and .412). The multi-morphemic results mirror those of the bi-morphemic setting of Lazaridou et al. (2013). (i) RNN+DEPs attains an average cosine similarity of around .80 for English. Numbers for German are lower, around .70. (ii) The RNN only marginally edges out LDS for English and is slightly worse for German. Again, this is not surprising as we are modeling short sequences. (iii) Certain embeddings lend themselves more naturally to derivational compositionality: BOW2 is better than BOW5, DEPs is the clear winner.

Inferred Morphology. The final setting we consider is the vector approximation task without gold morphology. In this case, we rely on the full joint model p(v, s, l, u | w). At evaluation, we are interested in the marginal distribution p(v | w) = Σ_{s,l,u} p(v, s, l, u | w). We then use importance sampling to approximate the mean of this marginal distribution as the predicted embedding, i.e.,

v̂ = ∫ v p(v | w) dv    (12)
  ≈ (1 / Σ_{i=1}^M w^{(i)}) Σ_{i=1}^M w^{(i)} C_β(s^{(i)}, l^{(i)})    (13)

where w^{(i)} are the importance weights defined in Equation 8 and l^{(i)} and s^{(i)} are the ith sampled labeling and segmentation, respectively.

Discussion. Surprisingly, Table 3 (joint) shows that relying on the inferred morphology does not drastically affect the results. Indeed, we are often within .01 of the result with gold morphology. Our method can be viewed as a retrofitting procedure (Faruqui et al., 2015), so this result is useful: it indicates that joint semantic synthesis and morphological analysis produces high-quality vectors.
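A sketch of the predicted-embedding computation in Equations 12–13, reusing the importance weights of Equation 8; compose stands in for C_β, and the names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def predict_embedding(samples, log_weights, compose):
    """Importance-sampled approximation of the mean of p(v | w) (Eqs. 12-13).

    samples     : list of sampled (l, s) labelings and segmentations
    log_weights : log importance weights log w^(i) from Eq. (8)
    compose     : composition function C_beta mapping (s, l) to a mean vector
    """
    w = np.exp(log_weights - np.max(log_weights))  # stabilized exponentiation
    w /= w.sum()                                   # self-normalize as in Eq. (13)
    return sum(w_i * compose(s, l) for w_i, (l, s) in zip(w, samples))
```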

6.3 Experiment 3: Derivational Productivity

We now delve into the relation between distributional semantics and morphological productivity. The extent to which jointly modeling semantics aids morphological analysis will be determined by the inherent compositionality of the words within the vector space. We break down our results on the vector approximation task with gold morphology, using the dependency vectors and the RNN composer, in Figure 2 by selected affixes.

[Figure 2 is a boxplot over the affixes -ly, -ness, -ize, -able, in-, ful, -ous, -less, -ic, -ist, -y, un-, ity, re-, ion, -ment, -al, -er (x-axis) against average cosine similarity, ranging from 0.2 to 1.0 (y-axis).]

Figure 2: The boxplot breaks down the cosine similarity between the approximated vector and the target vector by affix (using gold morphology). We have ordered the affixes such that the better approximated vectors are on the left.

We observe a wide range of scores: the most compositional ending, ly, gives rise to cosine similarities that are 20 points higher than those of the least compositional, er.

On the left end of Figure 2 we see extremely productive suffixes. The affix ize is used productively with relatively obscure words in the sciences, e.g., Rao-Blackwellize. Likewise, the affix ness can be applied to almost any adjective without restriction, e.g., Poissonness 'degree to which data have a Poisson distribution'. On the right end, we find -ment, -er and re-. The affix -ment is borderline productive (Bauer, 1983)—modern English tends to form novel nominalizations with ness or ity. More interesting are re- and er, both of which are very productive in English. For er, many of the words bringing down the average are simply non-compositional. For example, homer 'home run in baseball' is not derived from home+er—this is an error in the data. We also see examples like cutter. It has a compositional reading (e.g., "box cutter"), but also frequently occurs in the non-compositional meaning 'type of boat'. Finally, proper nouns like Homer and Turner end in er, and in our experiments we computed vectors for lowercased words. The affix re- similarly has a large number of non-compositional cases, e.g., remove, relocate, remark. Indeed, to get the compositional reading of remove, the first syllable (rather than the second) is typically stressed to emphasize the prefix.

We finally note several limitations of this experiment. (i) The ability of our models—even the recurrent neural network—to model transformations between vectors is limited. (ii) Our vectors are far from perfect; e.g., sparseness in the training data affects quality and some of the words in our corpus are rare. (iii) Semantic coherence is not the only criterion for productivity. An example is -th in English. As noted earlier, it is compositional in a word like warmth, but it cannot be used to form new words.

7 Conclusion

We have presented a model of the semantics and structure of derivationally complex words. To the best of our knowledge, this is the first attempt to jointly consider, within a single model, (i) the morphological decomposition of the word form and (ii) the semantic coherence of the resulting analysis. We found that directly modeling coherence increases segmentation accuracy, improving over a strong baseline. Also, our models show state-of-the-art performance on the derivational vector approximation task introduced by Lazaridou et al. (2013).

Future work will focus on the extension of the method to more complex instances of derivational morphology, e.g., compounding and reduplication, and on the extension to additional languages. We also plan to explore the relation between derivation and distributional semantics in greater detail.

Acknowledgments. The first author was supported by a DAAD Long-Term Research Grant and an NDSEG fellowship and the second by a Volkswagenstiftung Opus Magnum grant. We would also like to thank action editor Regina Barzilay for suggesting several changes we incorporated into the work and the three anonymous reviewers.

References

Mohamed Afify, Ruhi Sarikaya, Hong-Kwang Jeff Kuo, Laurent Besacier, and Yuqing Gao. 2006. On the use of morphological analysis for dialectal Arabic speech recognition. In Ninth International Conference on Spoken Language Processing.

Mark Aronoff. 1976. Word Formation in Generative Grammar. MIT Press.

Harald Baayen, Richard Piepenbrock, and Hedderik van Rijn. 1993. The CELEX lexical database on CD-ROM.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 349–359, Lisbon, Portugal, September. Association for Computational Linguistics.

Laurie Bauer. 1983. English Word-Formation. Cambridge University Press.

Yoshua Bengio and Jean-Sébastien Senecal. 2003. Quick training of probabilistic neural nets by importance sampling. In Proceedings of the Ninth International Conference on Artificial Intelligence and Statistics.

Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In International Conference on Machine Learning, pages 1899–1907.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. Workshop on Syntax, Semantics and Structure in Statistical Translation.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row.

Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press.

Ann Clifton and Anoop Sarkar. 2011. Combining morpheme-based machine translation with post-processing morpheme prediction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 32–42, Portland, Oregon, USA, June. Association for Computational Linguistics.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 625–630, Baltimore, Maryland, June. Association for Computational Linguistics.

Ryan Cotterell, Thomas Müller, Alexander Fraser, and Hinrich Schütze. 2015. Labeled morphological segmentation with semi-Markov models. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 164–174, Beijing, China, July. Association for Computational Linguistics.

Ryan Cotterell, Arun Kumar, and Hinrich Schütze. 2016a. Morphological segmentation inside-out. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2325–2330, Austin, Texas, November. Association for Computational Linguistics.

Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016b. A joint model of orthography and morphological segmentation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 664–669, San Diego, California, June. Association for Computational Linguistics.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):3.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In International Conference on Machine Learning, pages 1818–1826.

Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 616–627. Association for Computational Linguistics, July.

Markus Dreyer. 2011. A Non-parametric Model for the Discovery of Inflectional Paradigms from Plain Text using Graphical Models over Strings. Ph.D. thesis, Johns Hopkins University.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606–1615, Denver, Colorado, May–June. Association for Computational Linguistics.

Manaal Faruqui, Ryan McDonald, and Radu Soricut. 2016a. Morpho-syntactic lexicon generation using graph-based semi-supervised learning. Transactions of the Association for Computational Linguistics, 4:1–16.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016b. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California, June. Association for Computational Linguistics.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California, June. Association for Computational Linguistics.

Gottlob Frege. 1892. Über Begriff und Gegenstand. Vierteljahresschrift für wissenschaftliche Philosophie, 16:192–205.

Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1296–1306, San Diego, California, June. Association for Computational Linguistics.

Christoph Goller and Andreas Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks.

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2016. Adaptive joint learning of compositional and non-compositional phrase embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 205–215, Berlin, Germany, August. Association for Computational Linguistics.

Martin Haspelmath and Andrea Sims. 2013. Understanding Morphology. Routledge.

Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar. Blackwell.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2016. Neural morphological analysis: Encoding-decoding canonical segments. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 961–967, Austin, Texas, November. Association for Computational Linguistics.

Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

Martin Kay. 1977. Morphological and syntactic analysis. Linguistic Structures Processing, 5:131.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2741–2749.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Max Kisselew, Sebastian Padó, Alexis Palmer, and Jan Šnajder. 2015. Obtaining a better understanding of distributional models of German derivational morphology. In Proceedings of the 11th International Conference on Computational Semantics, pages 58–63.

Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Segmental recurrent neural networks. In 4th International Conference on Learning Representations.

Mikko Kurimo, Sami Virpioja, Ville Turunen, and Krista Lagus. 2010. Morpho Challenge competition 2005–2010: Evaluations and results. In Special Interest Group on Computational Morphology and Phonology.

Angeliki Lazaridou, Marco Marelli, Roberto Zamparelli, and Marco Baroni. 2013. Compositional-ly derived representations of morphologically complex words in distributional semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1517–1526, Sofia, Bulgaria, August. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308, Baltimore, Maryland, June. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.

Marc Light. 1996. Morphological cues for lexical semantics. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 25–31, Santa Cruz, California, USA, June. Association for Computational Linguistics.

Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530, Lisbon, Portugal, September. Association for Computational Linguistics.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria, August. Association for Computational Linguistics.

John Lyons. 1977. Semantics. Cambridge University Press.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In International Conference on Learning Representations.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of Association of Computational Linguistics, pages 236–244, Columbus, Ohio, June. Association for Computational Linguistics.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.

Jason Naradowsky and Sharon Goldwater. 2009. Improving morphology induction by learning spelling rules. In Twenty-first International Joint Conference on Artificial Intelligence, pages 1531–1536.

Karthik Narasimhan, Damianos Karakos, Richard Schwartz, Stavros Tsakalidis, and Regina Barzilay. 2014. Morphological segmentation for keyword spotting. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 880–885, Doha, Qatar, October. Association for Computational Linguistics.

Karthik Narasimhan, Regina Barzilay, and Tommi Jaakkola. 2015. An unsupervised method for uncovering morphological chains. Transactions of the Association for Computational Linguistics, 3:157–167.

OUP editors. 2010. New Oxford American Dictionary. Oxford University Press.

Sebastian Padó, Alexis Palmer, Max Kisselew, and Jan Šnajder. 2015. Measuring semantic content to assess asymmetry in derivation. In Proceedings of the 11th International Conference on Computational Semantics.

Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the Institute of Electrical and Electronics Engineers, 77(2):257–286.

Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. 2016. Weighting finite-state transductions with neural context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 623–633, San Diego, California, June. Association for Computational Linguistics.

Eric S. Ristad and Peter N. Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532.

Reuven Y. Rubinstein and Dirk P. Kroese. 2011. Simulation and the Monte Carlo Method. John Wiley & Sons.

Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 29–37, Sofia, Bulgaria, August. Association for Computational Linguistics.

Gerard Salton, editor. 1971. The SMART Retrieval System—Experiments in Automatic Document Processing. Prentice Hall.

Sunita Sarawagi and William W. Cohen. 2005. Semi-Markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems, pages 1185–1192.

Patrick Schone and Daniel Jurafsky. 2000. Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of the 4th Conference on Computational Natural Language Learning, pages 67–72. Association for Computational Linguistics.

Patrick Schone and Daniel Jurafsky. 2001. Knowledge-free induction of inflectional morphologies. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–9. Association for Computational Linguistics.

Wolfgang Seeker and Özlem Çetinoğlu. 2015. A graph-based lattice dependency parser for joint morphological segmentation and syntactic analysis. Transactions of the Association for Computational Linguistics, 3:359–373.

Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1627–1637, Denver, Colorado, May–June. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Charles Sutton and Andrew McCallum. 2006. An introduction to conditional random fields for relational learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning, pages 93–128. MIT Press.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Antal van den Bosch and Walter Daelemans. 1999. Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 285–292, College Park, Maryland, USA, June. Association for Computational Linguistics.

Martin J. Wainwright and Michael I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1591–1601, Doha, Qatar, October. Association for Computational Linguistics.

Paul J. Werbos. 1990. Backpropagation through time: What it does and how to do it. Proceedings of the Institute of Electrical and Electronics Engineers, 78(10):1550–1560.

Richard Wicentowski. 2002. Modeling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. Ph.D. thesis, Johns Hopkins University.

Yadollah Yaghoobzadeh and Hinrich Schütze. 2015. Corpus-level fine-grained entity typing using contextual information. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 715–725, Lisbon, Portugal, September. Association for Computational Linguistics.

David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. In The 38th Annual Meeting of the Association for Computational Linguistics.

Wenpeng Yin and Hinrich Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 901–911, Denver, Colorado, May–June. Association for Computational Linguistics.

Britta D. Zeller, Jan Šnajder, and Sebastian Padó. 2013. DErivBase: Inducing and evaluating a derivational morphology resource for German. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1201–1211, Sofia, Bulgaria, August. Association for Computational Linguistics.

Britta D. Zeller, Sebastian Padó, and Jan Šnajder. 2014. Towards semantic validation of a derivational lexicon. In Proceedings of the 25th International Conference on Computational Linguistics, pages 1728–1739, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.