Transactions of the Association for Computational Linguistics, vol. 5, pp. 87–99, 2017. Action Editor: Chris Quirk. Submission batch: 6/2016; Revision batch: 10/2016; Published 3/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Context Gates for Neural Machine Translation

Zhaopeng Tu†, Yang Liu‡, Zhengdong Lu†, Xiaohua Liu†, Hang Li†
†Noah's Ark Lab, Huawei Technologies, Hong Kong
{tu.zhaopeng,lu.zhengdong,liuxiaohua3,hangli.hl}@huawei.com
‡Department of Computer Science and Technology, Tsinghua University, Beijing
liuyang2011@tsinghua.edu.cn

Abstract

In neural machine translation (NMT), generation of a target word depends on both source and target contexts. We find that source contexts have a direct impact on the adequacy of a translation while target contexts affect the fluency. Intuitively, generation of a content word should rely more on the source context and generation of a functional word should rely more on the target context. Due to the lack of effective control over the influence from source and target contexts, conventional NMT tends to yield fluent but inadequate translations. To address this problem, we propose context gates which dynamically control the ratios at which source and target contexts contribute to the generation of target words. In this way, we can enhance both the adequacy and fluency of NMT with more careful control of the information flow from contexts. Experiments show that our approach significantly improves upon a standard attention-based NMT system by +2.3 BLEU points.

input  jīnnián qián liǎng yuè guǎngdōng gāoxīn jìshù chǎnpǐn chūkǒu 37.6 yì měiyuán
NMT    in the first two months of this year, the export of new high level technology product was UNK-billion us dollars
⇓src   china's guangdong hi-tech exports hit 58 billion dollars
⇓tgt   china's export of high and new hi-tech exports of the export of the export of the export of the export of the export of the export of the export of the export of ···

Table 1: Source and target contexts are highly correlated to translation adequacy and fluency, respectively. ⇓src and ⇓tgt denote halving the contributions from the source and target contexts when generating the translation, respectively.

1 Introduction

Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) has made significant progress in the past several years. Its goal is to construct and utilize a single large neural network to accomplish the entire translation task. One great advantage of NMT is that the translation system can be completely constructed by learning from data without human involvement (cf., feature engineering in statistical machine translation (SMT)). The encoder-decoder architecture is widely employed (Cho et al., 2014; Sutskever et al., 2014), in which the encoder summarizes the source sentence into a vector representation, and the decoder generates the target sentence word-by-word from the vector representation. The representation of the source sentence and the representation of the partially generated target sentence (translation) at each position are referred to as source context and target context, respectively. The generation of a target word is determined jointly by the source context and target context.

Several techniques in NMT have proven to be very effective, including gating (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and attention (Bahdanau et al., 2015), which can model long-distance dependencies and complicated align…
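The gating mechanism in the abstract is straightforward to sketch. Below is a minimal PyTorch module, assuming a decoder step that already has an attention-derived source context s_t, the previous target word embedding y_prev, and the previous decoder state h_prev; the projection layers, dimensions, and tanh combination are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Minimal sketch of a context gate: an element-wise gate z decides,
    per dimension, how much the source context versus the target-side
    context feeds the next decoder state. Wiring is an assumption."""

    def __init__(self, src_dim, tgt_dim, hid_dim):
        super().__init__()
        self.gate = nn.Linear(src_dim + tgt_dim + hid_dim, hid_dim)
        self.proj_src = nn.Linear(src_dim, hid_dim, bias=False)
        self.proj_tgt = nn.Linear(tgt_dim + hid_dim, hid_dim, bias=False)

    def forward(self, s_t, y_prev, h_prev):
        # s_t: attention-derived source context; y_prev: previous target
        # word embedding; h_prev: previous decoder hidden state.
        z = torch.sigmoid(self.gate(torch.cat([s_t, y_prev, h_prev], dim=-1)))
        src = self.proj_src(s_t)
        tgt = self.proj_tgt(torch.cat([y_prev, h_prev], dim=-1))
        # z weights the source contribution, (1 - z) the target side.
        return torch.tanh(z * src + (1.0 - z) * tgt)
```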
Transactions of the Association for Computational Linguistics, vol. 5, pp. 73–86, 2017. Action Editor: Eric Fosler-Lussier. Submission batch: 8/2016; Revision batch: 11/2016; Published 2/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

A Generative Model of Phonotactics

Richard Futrell, Brain and Cognitive Sciences, Massachusetts Institute of Technology, futrell@mit.edu
Adam Albright, Department of Linguistics, Massachusetts Institute of Technology, albright@mit.edu
Peter Graff, Intel Corporation, graffmail@gmail.com
Timothy J. O'Donnell, Department of Linguistics, McGill University, timothy.odonnell@mcgill.ca

Abstract

We present a probabilistic model of phonotactics, the set of well-formed phoneme sequences in a language. Unlike most computational models of phonotactics (Hayes and Wilson, 2008; Goldsmith and Riggle, 2012), we take a fully generative approach, modeling a process where forms are built up out of subparts by phonologically-informed structure building operations. We learn an inventory of subparts by applying stochastic memoization (Johnson et al., 2007; Goodman et al., 2008) to a generative process for phonemes structured as an and-or graph, based on concepts of feature hierarchy from generative phonology (Clements, 1985; Dresher, 2009). Subparts are combined in a way that allows tier-based feature interactions. We evaluate our models' ability to capture phonotactic distributions in the lexicons of 14 languages drawn from the WOLEX corpus (Graff, 2012). Our full model robustly assigns higher probabilities to held-out forms than a sophisticated N-gram model for all languages. We also present novel analyses that probe model behavior in more detail.

1 Introduction

People have systematic intuitions about which sequences of sounds would constitute likely or unlikely words in their language: Although blick is not an English word, it sounds like it could be, while bnick does not (Chomsky and Halle, 1965). Such intuitions reveal that speakers are aware of the restrictions on sound sequences which can make up possible morphemes in their language—the phonotactics of the language. Phonotactic restrictions mean that each language uses only a subset of the logically, or even articulatorily, possible strings of phonemes. Admissible phoneme combinations, on the other hand, typically recur in multiple morphemes, leading to redundancy.

It is widely accepted that phonotactic judgments may be gradient: the nonsense word blick is better as a hypothetical English word than bwick, which is better than bnick (Hayes and Wilson, 2008; Albright, 2009; Daland et al., 2011). To account for such graded judgements, there have been a variety of probabilistic (or, more generally, weighted) models proposed to handle phonotactic learning and generalization over the last two decades (see Daland et al. (2011) and below for review). However, inspired by optimality-theoretic approaches to phonology, the most linguistically informed and successful such models have been constraint-based—formulating the problem of phonotactic generalization in terms of restrictions that penalize illicit combinations of sounds (e.g., ruling out ∗bn-).

In this paper, by contrast, we adopt a generative approach to modeling phonotactic structure. Our approach harkens back to early work on the sound structure of lexical items which made use of morpheme structure rules or conditions (Halle, 1959; Stanley, 1967; Booij, 2011; Rasin and Katzir, 2014). Such approaches explicitly attempted to model the…
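Stochastic memoization, the paper's central tool for learning an inventory of subparts, can be sketched as a Chinese-restaurant-process wrapper around a base generative process: earlier draws are probabilistically reused, with reuse probability growing with past usage (in the sense of Goodman et al., 2008). The concentration parameter alpha and the toy syllable generator below are assumptions of this sketch, not details from the paper.

```python
import random
from collections import Counter

def stochastically_memoize(base_sample, alpha=1.0):
    """Wrap base_sample so previously generated values are reused with
    probability proportional to how often they have been used before;
    a fresh draw happens with probability alpha / (alpha + total)."""
    counts = Counter()

    def sample():
        total = sum(counts.values())
        if random.random() < alpha / (alpha + total):
            value = base_sample()          # open a new "table": fresh draw
        else:
            r = random.uniform(0, total)   # reuse, proportional to usage
            for value, c in counts.items():
                r -= c
                if r <= 0:
                    break
        counts[value] += 1
        return value

    return sample

# Usage: memoize a toy CV-syllable generator; frequently drawn subparts
# get cached and reused, yielding an emergent inventory of chunks.
syllable = stochastically_memoize(
    lambda: random.choice("ptk") + random.choice("aiu"))
forms = ["".join(syllable() for _ in range(2)) for _ in range(20)]
```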
Transactions of the Association for Computational Linguistics, vol. 5, pp. 45–58, 2017. Action Editor: Brian Roark. Submission batch: 5/2016; Revision batch: 9/2016; Published 1/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Shift-Reduce Constituent Parsing with Neural Lookahead Features

Jiangming Liu and Yue Zhang
Singapore University of Technology and Design, 8 Somapah Road, Singapore, 487372
{jiangming_liu,yue_zhang}@sutd.edu.sg

Abstract

Transition-based models can be fast and accurate for constituent parsing. Compared with chart-based models, they leverage richer features by extracting history information from a parser stack, which consists of a sequence of non-local constituents. On the other hand, during incremental parsing, constituent information on the right hand side of the current word is not utilized, which is a relative weakness of shift-reduce parsing. To address this limitation, we leverage a fast neural model to extract lookahead features. In particular, we build a bidirectional LSTM model, which leverages full sentence information to predict the hierarchy of constituents that each word starts and ends. The results are then passed to a strong transition-based constituent parser as lookahead features. The resulting parser gives 1.3% absolute improvement in WSJ and 2.3% in CTB compared to the baseline, giving the highest reported accuracies for fully-supervised parsing.

1 Introduction

Transition-based constituent parsers are fast and accurate, performing incremental parsing using a sequence of state transitions in linear time. Pioneering models rely on a classifier to make local decisions, searching greedily for local transitions to build a parse tree (Sagae and Lavie, 2005). Zhu et al. (2013) use a beam search framework, which preserves linear time complexity of greedy search, while alleviating the disadvantage of error propagation. The model gives state-of-the-art accuracies at a speed of 89 sentences per second on the standard WSJ benchmark (Marcus et al., 1993).

Zhu et al. (2013) exploit rich features by extracting history information from a parser stack, which consists of a sequence of non-local constituents. However, due to the incremental nature of shift-reduce parsing, the right-hand side constituents of the current word cannot be used to guide the action at each step. Such lookahead features (Tsuruoka et al., 2011) correspond to the outside scores in chart parsing (Goodman, 1998), which has been effective for obtaining improved accuracies. To leverage such information for improving shift-reduce parsing, we propose a novel neural model to predict the constituent hierarchy related to each word before parsing.

Our idea is inspired by the work of Roark and Hollingshead (2009) and Zhang et al. (2010b), which shows that shallow syntactic information gathered over the word sequence can be utilized for pruning chart parsers, improving chart parsing speed without sacrificing accuracies. For example, Roark and Hollingshead (2009) predict constituent boundary information on words as a pre-processing step, and use such information to prune the chart. Since such information is much lighter-weight compared to full parsing, it can be predicted relatively accurately using sequence labellers.

Different from Roark and Hollingshead (2009), we collect lookahead constituent information for shift-reduce parsing, rather than pruning information for chart parsing. Our main concern is improving the accuracy rather than improving the speed. Accordingly, our model should predict the constituent hierarchy for each word rather than simple boundary information. For example, in Figure 1(a), the constituent hierarchy that the word "The" starts is "S→NP", and the constituent hierarchy that the word "table" ends is "S→VP→NP→PP→NP".
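The lookahead model can be pictured as a bidirectional LSTM labeller with two output heads, one for the constituent chain each word starts and one for the chain it ends. Treating each chain (e.g. "S→NP") as a single atomic label, as in the sketch below, is a simplifying assumption; the paper predicts the hierarchies with a more elaborate scheme.

```python
import torch
import torch.nn as nn

class HierarchyTagger(nn.Module):
    """Sketch of a BiLSTM lookahead-feature model: for every word,
    predict the chain of constituents it starts and the chain it
    ends, using full-sentence context from both directions."""

    def __init__(self, vocab_size, n_start_labels, n_end_labels,
                 emb_dim=100, hid_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True,
                              batch_first=True)
        self.start_head = nn.Linear(2 * hid_dim, n_start_labels)
        self.end_head = nn.Linear(2 * hid_dim, n_end_labels)

    def forward(self, word_ids):
        h, _ = self.bilstm(self.emb(word_ids))   # (batch, seq, 2*hid)
        # Per-word scores over "starts" and "ends" hierarchy labels;
        # these become lookahead features for the shift-reduce parser.
        return self.start_head(h), self.end_head(h)
```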
Transactions of the Association for Computational Linguistics, vol. 5, pp. 31–44, 2017. Action Editor: Hwee Tou Ng. Submission batch: 8/2016; Revision batch: 10/2016; Published 1/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction

Ashutosh Modi¹³, Ivan Titov²⁴, Vera Demberg¹³, Asad Sayeed¹³, Manfred Pinkal¹³
¹{ashutosh,vera,asayeed,pinkal}@coli.uni-saarland.de  ²titov@uva.nl
³Universität des Saarlandes, Germany  ⁴ILLC, University of Amsterdam, the Netherlands

Abstract

Recent research in psycholinguistics has provided increasing evidence that humans predict upcoming content. Prediction also affects perception and might be a key to robustness in human language processing. In this paper, we investigate the factors that affect human prediction by building a computational model that can predict upcoming discourse referents based on linguistic knowledge alone vs. linguistic knowledge jointly with common-sense knowledge in the form of scripts. We find that script knowledge significantly improves model estimates of human predictions. In a second study, we test the highly controversial hypothesis that predictability influences referring expression type but do not find evidence for such an effect.

1 Introduction

Being able to anticipate upcoming content is a core property of human language processing (Kutas et al., 2011; Kuperberg and Jaeger, 2016) that has received a lot of attention in the psycholinguistic literature in recent years. Expectations about upcoming words help humans comprehend language in noisy settings and deal with ungrammatical input. In this paper, we use a computational model to address the question of how different layers of knowledge (linguistic knowledge as well as common-sense knowledge) influence human anticipation.

Here we focus our attention on semantic predictions of discourse referents for upcoming noun phrases. This task is particularly interesting because it allows us to separate the semantic task of anticipating an intended referent and the processing of the actual surface form. For example, in the context of "I ordered a medium sirloin steak with fries. Later, the waiter brought …", there is a strong expectation of a specific discourse referent, i.e., the referent introduced by the object NP of the preceding sentence, while the possible referring expression could be either the steak I had ordered, the steak, our food, or it. Existing models of human prediction are usually formulated using the information-theoretic concept of surprisal. In recent work, however, surprisal is usually not computed for DRs, which represent the relevant semantic unit, but for the surface form of the referring expressions, even though there is an increasing amount of literature suggesting that human expectations at different levels of representation have separable effects on prediction and, as a consequence, that the modelling of only one level (the linguistic surface form) is insufficient (Kuperberg and Jaeger, 2016; Kuperberg, 2016; Zarcone et al., 2016). The present model addresses this shortcoming by explicitly modelling and representing common-sense knowledge and conceptually separating the semantic (discourse referent) and the surface level (referring expression) expectations.

Our discourse referent prediction task is related to the NLP task of coreference resolution, but it substantially differs from that task in the following ways: 1) we use only the incrementally available left context, while coreference resolution uses the full text; 2) coreference resolution tries to identify the DR for a given target NP in context, while we look at the expectations of DRs based only on the context…
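Since the model's job is to put a probability distribution over candidate discourse referents, and prediction effort is conventionally quantified by surprisal, the core computation can be sketched as follows. The log-linear form, the feature interface, and all names here are illustrative assumptions, not the paper's actual architecture.

```python
import math

def referent_distribution(candidates, weights, features):
    """Sketch: score each candidate discourse referent with a
    log-linear model over (hypothetical) linguistic and script
    features, then normalize. features(dr) returns a dict of
    feature values for candidate dr; weights maps feature names
    to learned weights."""
    scores = {dr: sum(weights.get(f, 0.0) * v
                      for f, v in features(dr).items())
              for dr in candidates}
    z = sum(math.exp(s) for s in scores.values())
    return {dr: math.exp(s) / z for dr, s in scores.items()}

def surprisal(p):
    # Information-theoretic surprisal of the realized referent, in bits.
    return -math.log2(p)
```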
Transactions of the Association for Computational Linguistics, vol. 6, pp. 511–527, 2018. Action Editor: Alexander Koller. Submission batch: 1/2018; Revision batch: 5/2018; Published 8/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Probabilistic Verb Selection for Data-to-Text Generation

Dell Zhang†¹, Jiahao Yuan‡, Xiaoling Wang‡², and Adam Foster†
†Birkbeck, University of London, Malet Street, London WC1E 7HX, UK
‡Shanghai Key Lab of Trustworthy Computing, East China Normal University, 3663 North Zhongshan Road, Shanghai 200062, China
¹dell.z@ieee.org, ²xlwang@sei.ecnu.edu.cn

Abstract

In data-to-text Natural Language Generation (NLG) systems, computers need to find the right words to describe phenomena seen in the data. This paper focuses on the problem of choosing appropriate verbs to express the direction and magnitude of a percentage change (e.g., in stock prices). Rather than simply using the same verbs again and again, we present a principled data-driven approach to this problem based on Shannon's noisy-channel model so as to bring variation and naturalness into the generated text. Our experiments on three large-scale real-world news corpora demonstrate that the proposed probabilistic model can be learned to accurately imitate human authors' pattern of usage around verbs, outperforming the state-of-the-art method significantly.

1 Introduction

Natural Language Generation (NLG) is a fundamental task in Artificial Intelligence (AI) (Russell and Norvig, 2009). It aims to automatically turn structured data into prose (Reiter, 2007; Belz and Kow, 2009)—the opposite of the better-known field of Natural Language Processing (NLP) that transforms raw text into structured data (e.g., a logical form or a knowledge base) (Jurafsky and Martin, 2009). Being dubbed "algorithmic authors" or "robot journalists", NLG systems have attracted a lot of attention in recent years, thanks to the rise of big data (Wright, 2015). The use of NLG in financial services has been growing very fast.

One particularly important NLG problem for summarizing financial or business data is to automatically generate textual descriptions of trends between two data points (such as stock prices). In this paper, we elect to use relative percentages rather than absolute numbers to describe the change from one data point to another. This is because an absolute number might be considered small in one case but large in another, depending on the unit and the context (Krifka, 2007; Smiley et al., 2016). For example, 1000 British pounds are worth much more than 1000 Japanese yen; a rise of 100 US dollars in car price might be negligible but the same amount of increase in bike price would be significant.

Given two data points (e.g., on a stock chart), the percentage change can always be calculated easily. The challenge is to select the appropriate verb for any percentage change. For example, in newspapers, we often see headlines like "Apple's stock had jumped 34% this year in anticipation of the next iPhone…" and "Microsoft's profit climbed 28% with shift to Web-based software…". The journalists writing such news stories use descriptive language such as the verbs like jump and climb to express the direction and magnitude of a percentage change. It is of course possible to simply keep using the same neutral verbs, e.g., increase and decrease for upward and downward changes respectively, again and again, as in most existing data-to-text NLG systems. However, the generated text would sound much more natural if computers could use a variety of verbs suitable in the context like human authors do.

Expressions of percentage changes are readily available in many natural language text datasets and…
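The noisy-channel decomposition mentioned in the abstract amounts to choosing the verb v maximizing P(v | x) ∝ P(x | v) · P(v) for an observed percentage change x. The counting-based sketch below approximates P(x | v) with a smoothed per-verb histogram over binned changes; the binning and add-one smoothing are simplifications of this sketch, and the paper's estimation details may differ.

```python
import math
from collections import Counter, defaultdict

class NoisyChannelVerbSelector:
    """Sketch of noisy-channel verb choice from corpus counts."""

    def __init__(self, bin_width=5.0, n_bins=100):
        self.bin_width = bin_width
        self.n_bins = n_bins                     # smoothing constant
        self.verb_counts = Counter()             # for the prior P(v)
        self.bin_counts = defaultdict(Counter)   # verb -> bin -> count

    def _bin(self, change_pct):
        return round(change_pct / self.bin_width)

    def observe(self, verb, change_pct):
        # Called once per (verb, percentage-change) pair seen in news text.
        self.verb_counts[verb] += 1
        self.bin_counts[verb][self._bin(change_pct)] += 1

    def select(self, change_pct):
        total = sum(self.verb_counts.values())
        b = self._bin(change_pct)

        def log_posterior(v):
            log_prior = math.log(self.verb_counts[v] / total)
            log_like = math.log((self.bin_counts[v][b] + 1) /
                                (self.verb_counts[v] + self.n_bins))
            return log_prior + log_like

        return max(self.verb_counts, key=log_posterior)
```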
Transactions of the Association for Computational Linguistics, vol. 6, pp. 543–555, 2018. Action Editor: Mark Steedman. Submission batch: 1/2018; Revision batch: 5/2018; Published 8/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Planning, Inference and Pragmatics in Sequential Language Games

Fereshte Khani, Stanford University, fereshte@stanford.edu
Noah D. Goodman, Stanford University, ngoodman@stanford.edu
Percy Liang, Stanford University, pliang@cs.stanford.edu

Abstract

We study sequential language games in which two players, each with private information, communicate to achieve a common goal. In such games, a successful player must (i) infer the partner's private information from the partner's messages, (ii) generate messages that are most likely to help with the goal, and (iii) reason pragmatically about the partner's strategy. We propose a model that captures all three characteristics and demonstrate their importance in capturing human behavior on a new goal-oriented dataset we collected using crowdsourcing.

[Figure 1: board diagram omitted. The exchanged messages are: Pletter: "square"; Pdigit: "circle"; Pletter: click(1,3). The hypothesized reasoning clouds read, in order: Planning: "Let me first try square, which is just one possibility."; Inference: "The square's letter must be B."; Pragmatics: "The square's digit cannot be 2."]

Figure 1: A game of InfoJigsaw played by two human players. One of the players (Pletter) only sees the letters, while the other one (Pdigit) only sees the digits. Their goal is to identify the goal object, B2, by exchanging a few words. The clouds show the hypothesized role of planning, inference, and pragmatics in the players' choice of utterances. In this game, the bottom object is the goal (position (1,3)).

1 Introduction

Human communication is extraordinarily rich. People routinely choose what to say based on their goals (planning), figure out the state of the world based on what others say (inference), all while taking into account that others are strategizing agents too (pragmatics). All three aspects have been studied in both the linguistics and AI communities. For planning, Markov Decision Processes and their extensions can be used to compute utility-maximizing actions via forward-looking recurrences (e.g., Vogel et al. (2013a)). For inference, model-theoretic semantics (Montague, 1973) provides a mechanism for utterances to constrain possible worlds, and this has been implemented recently in semantic parsing (Matuszek et al., 2012; Krishnamurthy and Kollar, 2013). Finally, for pragmatics, the cooperative principle of Grice (1975) can be realized by models in which a speaker simulates a listener—e.g., Franke (2009) and Frank and Goodman (2012).

There have been a few previous efforts in the language games literature to combine the three aspects. Hawkins et al. (2015) proposed a model of communication between a questioner and an answerer based on only one round of question answering. Vogel et al. (2013b) proposed a model of two agents playing a restricted version of the game from the Cards Corpus (Potts, 2012), where the agents only communicate once.¹ In this work, we seek to capture all three aspects in a single, unified framework which allows…

¹ Specifically, two agents must both co-locate with a specific card. The agent which finds the card sooner shares the card location information with the other agent.
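The pragmatics ingredient can be illustrated with one round of Rational Speech Acts reasoning in the style of Frank and Goodman (2012), cited above: a literal listener, a soft-max speaker who simulates that listener, and a pragmatic listener who inverts the speaker. The matrix form below, with a uniform prior over worlds and a rationality parameter alpha, is a generic RSA sketch rather than the paper's full sequential model.

```python
import numpy as np

def rsa_listener(lexicon, alpha=1.0):
    """lexicon[u, w] = 1 if utterance u is literally true of world w
    (each utterance must be true of at least one world). Returns the
    pragmatic listener distribution P(w | u) as a matrix of the same
    shape. alpha is an assumed rationality parameter."""
    # Literal listener L0: condition a uniform world prior on truth.
    l0 = lexicon / lexicon.sum(axis=1, keepdims=True)
    # Pragmatic speaker S1: soft-max of log L0, normalized over
    # utterances (axis 0) for each world.
    s1 = np.exp(alpha * np.log(l0 + 1e-12))
    s1 /= s1.sum(axis=0, keepdims=True)
    # Pragmatic listener L1: Bayes over the speaker with uniform prior.
    return s1 / s1.sum(axis=1, keepdims=True)

# Toy check: with utterances {ambiguous, specific} over two worlds,
# hearing the ambiguous utterance shifts belief toward the world the
# specific utterance cannot describe (a scalar-implicature effect).
print(rsa_listener(np.array([[1.0, 1.0], [0.0, 1.0]])))
```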
Transactions of the Association for Computational Linguistics, vol. 6, pp. 529–541, 2018. Action Editor: Holger Schwenk. Submission batch: 8/2017; Revision batch: 1/2018; Published 8/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Neural Lattice Language Models

Jacob Buckman, Language Technologies Institute, Carnegie Mellon University, jacobbuckman@gmail.com
Graham Neubig, Language Technologies Institute, Carnegie Mellon University, gneubig@cs.cmu.edu

Abstract

In this work, we propose a new language modeling paradigm that has the ability to perform both prediction and moderation of information flow at multiple granularities: neural lattice language models. These models construct a lattice of possible paths through a sentence and marginalize across this lattice to calculate sequence probabilities or optimize parameters. This approach allows us to seamlessly incorporate linguistic intuitions—including polysemy and the existence of multi-word lexical items—into our language model. Experiments on multiple language modeling tasks show that English neural lattice language models that utilize polysemous embeddings are able to improve perplexity by 9.95% relative to a word-level baseline, and that a Chinese model that handles multi-character tokens is able to improve perplexity by 20.94% relative to a character-level baseline.

[Figure 1: lattice diagram omitted. The example sentence "dogs chased the small cat" is decomposed into overlapping single- and multi-word tokens such as "the_small", "the_small_cat", "small_cat", "dogs_chased", "chased_the", "dogs_chased_the", and "chased_the_small".]

Figure 1: Lattice decomposition of a sentence and its corresponding lattice language model probability calculation.

1 Introduction

Neural network models have recently contributed towards a great amount of progress in natural language processing. These models typically share a common backbone: recurrent neural networks (RNN), which have proven themselves to be capable of tackling a variety of core natural language processing tasks (Hochreiter and Schmidhuber, 1997; Elman, 1990). One such task is language modeling, in which we estimate a probability distribution over sequences of tokens that corresponds to observed sentences (§2). Neural language models, particularly models conditioned on a particular input, have many applications including in machine translation (Bahdanau et al., 2016), abstractive summarization (Chopra et al., 2016), and speech processing (Graves et al., 2013). Similarly, state-of-the-art language models are almost universally based on RNNs, particularly long short-term memory (LSTM) networks (Jozefowicz et al., 2016; Inan et al., 2017; Merity et al., 2016).

While powerful, LSTM language models usually do not explicitly model many commonly-accepted linguistic phenomena. As a result, standard models lack linguistically informed inductive biases, potentially limiting their accuracy, particularly in low-data scenarios (Adams et al., 2017; Koehn and Knowles, 2017). In this work, we present a novel modification to the standard LSTM language modeling framework that allows us to incorporate some varieties of these linguistic intuitions seamlessly: neural lattice language models (§3.1). Neural lattice language models define a lattice over possible paths through a sentence, and maximize the marginal probability over all paths that lead to generating the reference sentence, as shown in Fig. 1. Depending on how we define these paths, we can incorporate different assumptions about how language should be modeled.

In the particular instantiations of neural lattice language models covered by this paper, we focus on two properties of language that could potentially be of use in language modeling: the existence of multi-word lexical units (Zgusta, 1967) (§4.1) and poly…
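Marginalizing over all lattice paths is a forward dynamic program in log space, sketched below. Scoring an edge from the identity of its start node alone is a deliberate simplification of this sketch: a real neural lattice model conditions each token's probability on the hidden state of the whole path prefix.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def lattice_log_prob(n_nodes, edges, edge_logprob):
    """Forward pass over a sentence lattice. Nodes 0..n_nodes-1 are
    positions between tokens (0 = start, n_nodes-1 = sentence end);
    edges[j] lists (i, token) pairs meaning "token spans node i to
    node j". edge_logprob(i, token) scores that token given a prefix
    ending at node i (a simplification; see lead-in)."""
    alpha = [-math.inf] * n_nodes       # log marginal prob of each node
    alpha[0] = 0.0
    for j in range(1, n_nodes):
        terms = [alpha[i] + edge_logprob(i, tok)
                 for i, tok in edges[j] if alpha[i] > -math.inf]
        if terms:
            alpha[j] = logsumexp(terms)   # sum over all incoming paths
    return alpha[-1]                      # log marginal sentence prob
```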
Transactions of the Association for Computational Linguistics, vol. 6, pp. 497–510, 2018. Action Editor: Phil Blunsom. Submission batch: 1/2018; Revision batch: 3/2018; Published 7/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Low-Rank RNN Adaptation for Context-Aware Language Modeling

Aaron Jaech and Mari Ostendorf
Department of Electrical Engineering, University of Washington
185 Stevens Way, Paul Allen Center AE100R, Seattle, WA
{ajaech,ostendor}@uw.edu

Abstract

A context-aware language model uses location, user and/or domain metadata (context) to adapt its predictions. In neural language models, context information is typically represented as an embedding and it is given to the RNN as an additional input, which has been shown to be useful in many applications. We introduce a more powerful mechanism for using context to adapt an RNN by letting the context vector control a low-rank transformation of the recurrent layer weight matrix. Experiments show that allowing a greater fraction of the model parameters to be adjusted has benefits in terms of perplexity and classification for several different types of context.

1 Introduction

In many language modeling applications, the speech or text is associated with some metadata or contextual information. For example, in speech recognition, if a user is speaking to a personal assistant then the system might know the time of day or the identity of the task that the user is trying to accomplish. If the user takes a picture of a sign to translate it with their smartphone, the system would have contextual information related to the geographic location and the user's preferred language. The context-aware language model targets these types of applications with a model that can adapt its predictions based on the provided contextual information.

There has been much work on using context information to adapt language models. Here, we are interested in contexts described by metadata (vs. word history or related documents) and in neural network approaches due to their flexibility for representing diverse types of contexts. Specifically, we focus on recurrent neural networks (RNNs) due to their widespread use.

The standard approach to adapt an RNN language model is to concatenate the context representation with the word embedding at the input to the RNN (Mikolov and Zweig, 2012). Optionally, the context embedding is also concatenated with the output from the recurrent layer to adapt the softmax layer. This basic strategy has been adopted for various types of adaptation such as for LM personalization (Wen et al., 2013; Li et al., 2016), adapting to television show genres (Chen et al., 2015), and adapting to long range dependencies in a document (Ji et al., 2016), etc.

We propose a more powerful mechanism for using a context vector, which we call the FactorCell. Rather than simply using context as an additional input, it is used to control a factored (low-rank) transformation of the recurrent layer weight matrix. The motivation is that allowing a greater fraction of the model parameters to be adjusted in response to the input context will produce a model that is more adaptable and responsive to that context.

We evaluate the resulting models in terms of context-dependent perplexity and context classification accuracy on six tasks reflecting different types of context variables, comparing to baselines that represent the most popular methods for using context in neural models. We choose tasks where context is specified by metadata, rather than text samples as used in many prior studies. The combination…
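One plausible reading of "the context vector controls a low-rank transformation of the recurrent weight matrix" is sketched below: a context embedding generates a rank-r additive update to the shared weights. The tensor shapes and the purely additive form are assumptions of this sketch, not necessarily the paper's exact FactorCell parameterization.

```python
import torch

def adapted_recurrent_weight(W, Z_left, Z_right, c):
    """Context-controlled low-rank weight adaptation (sketch).
        W:       (n_in, n_out)   shared base recurrent weights
        Z_left:  (n_in, r, d_c)  maps the context into a left factor
        Z_right: (r, n_out)      shared right factor
        c:       (d_c,)          context embedding
    Returns W plus a rank-r update generated from c."""
    left = torch.einsum('ird,d->ir', Z_left, c)   # context-specific (n_in, r)
    return W + left @ Z_right                     # adapted (n_in, n_out)

# Usage inside a recurrent step (illustrative): the adapted matrix
# replaces the fixed recurrent weights for this context.
# h_next = torch.tanh(x @ adapted_recurrent_weight(W, Zl, Zr, c))
```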
Transactions of the Association for Computational Linguistics, vol. 6, pp. 483–495, 2018. Action Editor: Hinrich Schütze. Submission batch: 11/2017; Revision batch: 3/2018; Published 7/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Linear Algebraic Structure of Word Senses, with Applications to Polysemy

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski
Computer Science Department, Princeton University
35 Olden St, Princeton, NJ 08540
{arora,yuanzhil,yingyul,tengyu,risteski}@cs.princeton.edu

Abstract

Word embeddings are ubiquitous in NLP and information retrieval, but it is unclear what they represent when the word is polysemous. Here it is shown that multiple word senses reside in linear superposition within the word embedding and simple sparse coding can recover vectors that approximately capture the senses. The success of our approach, which applies to several embedding methods, is mathematically explained using a variant of the random walk on discourses model (Arora et al., 2016). A novel aspect of our technique is that each extracted word sense is accompanied by one of about 2000 "discourse atoms" that gives a succinct description of which other words co-occur with that word sense. Discourse atoms can be of independent interest, and make the method potentially more useful. Empirical tests are used to verify and support the theory.

1 Introduction

Word embeddings are constructed using Firth's hypothesis that a word's sense is captured by the distribution of other words around it (Firth, 1957). Classical vector space models (see the survey by Turney and Pantel (2010)) use simple linear algebra on the matrix of word-word co-occurrence counts, whereas recent neural network and energy-based models such as word2vec use an objective that involves a nonconvex (thus, also nonlinear) function of the word co-occurrences (Bengio et al., 2003; Mikolov et al., 2013a; Mikolov et al., 2013b).

This nonlinearity makes it hard to discern how these modern embeddings capture the different senses of a polysemous word. The monolithic view of embeddings, with the internal information extracted only via inner product, is felt to fail in capturing word senses (Griffiths et al., 2007; Reisinger and Mooney, 2010; Iacobacci et al., 2015). Researchers have instead sought to capture polysemy using more complicated representations, e.g., by inducing separate embeddings for each sense (Murphy et al., 2012; Huang et al., 2012). These embedding-per-sense representations grow naturally out of classic Word Sense Induction or WSI (Yarowsky, 1995; Schutze, 1998; Reisinger and Mooney, 2010; Di Marco and Navigli, 2013) techniques that perform clustering on neighboring words.

The current paper goes beyond this monolithic view, by describing how multiple senses of a word actually reside in linear superposition within the standard word embeddings (e.g., word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014)). By this we mean the following: consider a polysemous word, say tie, which can refer to an article of clothing, or a drawn match, or a physical act. Let's take the usual viewpoint that tie is a single token that represents monosemous words tie1, tie2, …. The theory and experiments in this paper strongly suggest that word embeddings computed using modern techniques such as GloVe and word2vec satisfy:

    v_tie ≈ α1·v_tie1 + α2·v_tie2 + α3·v_tie3 + ···   (1)

where coefficients αi's are nonnegative and v_tie1, v_tie2, etc., are the hypothetical embeddings of…
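The recovery recipe implied by Equation (1) is sparse coding: approximate every embedding as a sparse combination of a small shared dictionary of discourse atoms. The sketch below substitutes scikit-learn's mini-batch dictionary learner for the k-SVD-style solver often used for this task; the atom count of about 2000 follows the abstract, while the five-atoms-per-word sparsity level is an assumption of this sketch.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def discourse_atoms(X, n_atoms=2000, n_senses=5):
    """Sparse-code an embedding matrix X of shape (n_words, dim):
    each word vector becomes a sparse combination of shared atoms.
    Returns (codes, atoms); the nonzero entries of a word's code row
    point at the atoms approximating its senses."""
    learner = MiniBatchDictionaryLearning(
        n_components=n_atoms,
        transform_algorithm='omp',            # sparse recovery step
        transform_n_nonzero_coefs=n_senses)   # atoms allowed per word
    codes = learner.fit_transform(X)          # (n_words, n_atoms)
    atoms = learner.components_               # (n_atoms, dim)
    return codes, atoms
```

Each atom can then be described succinctly by listing the vocabulary words whose embeddings have the highest cosine similarity to it, mirroring the paper's use of atoms as summaries of co-occurring words.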
Transactions of the Association for Computational Linguistics, vol. 6, pp. 451–465, 2018. Action Editor: Brian Roark. Submission batch: 12/2017; Revision batch: 5/2018; Published 7/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction

Daniela Gerz¹, Ivan Vulić¹, Edoardo Ponti¹, Jason Naradowsky³, Roi Reichart², Anna Korhonen¹
¹Language Technology Lab, DTAL, University of Cambridge
²Faculty of Industrial Engineering and Management, Technion, IIT
³Johns Hopkins University
¹{dsg40,iv250,ep490,alk23}@cam.ac.uk  ²roiri@ie.technion.ac.il  ³narad@jhu.edu

Abstract

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and datasets are publicly available.

1 Introduction

Language Modeling (LM) is a key NLP task, serving as an important component for applications that require some form of text generation, such as machine translation (Vaswani et al., 2013), speech recognition (Mikolov et al., 2010), dialogue generation (Serban et al., 2016), or summarisation (Filippova et al., 2015).

A traditional recurrent neural network (RNN) LM setup operates on a limited closed vocabulary of words (Bengio et al., 2003; Mikolov et al., 2010). The limitation arises due to the model learning parameters exclusive to single words. A standard training procedure for neural LMs gradually modifies the parameters based on contextual/distributional information: each occurrence of a word token in training data contributes to the estimate of a word vector (i.e., model parameters) assigned to this word type. Low-frequency words therefore often have incorrect estimates, not having moved far from their random initialisation. A common strategy for dealing with this issue is to simply exclude the low-quality parameters from the model (i.e., to replace them with the <unk> placeholder), leading to only a subset of the vocabulary being represented by the model.

This limited vocabulary assumption enables the model to bypass the problem of unreliable word estimates for low-frequency and unseen words, but it does not resolve it. The assumption is far from ideal, partly due to the Zipfian nature of each language (Zipf, 1949), and its limitation is even more pronounced for morphologically-rich languages (MRLs): these languages inherently generate a plethora of words by their morphological systems. As a consequence, there will be a large number of words for which a standard RNN LM cannot guarantee a reliable word estimate. Since gradual parameter estimation based on contextual information is not feasible for rare phenomena…
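One simple, commonly used way to inject subword-level information into a word vector, in the general spirit of the method described above, is to add hashed character n-gram embeddings to the word embedding (a fastText-style bag of n-grams). This sketch only illustrates that mechanism; the paper integrates subword information into LM training in its own way.

```python
import torch
import torch.nn as nn

class SubwordInformedEmbedding(nn.Module):
    """Word vector enriched with character n-gram vectors: rare and
    unseen words still receive informative parameters through the
    n-grams they share with frequent words."""

    def __init__(self, vocab_size, n_buckets, dim, n_min=3, n_max=5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.ngram_emb = nn.Embedding(n_buckets, dim)  # hashed n-grams
        self.n_buckets, self.n_min, self.n_max = n_buckets, n_min, n_max

    def forward(self, word_id, word_str):
        # word_id: 0-dim LongTensor; word_str: the word's surface form.
        s = "<" + word_str + ">"               # boundary markers
        grams = [s[i:i + n]
                 for n in range(self.n_min, self.n_max + 1)
                 for i in range(len(s) - n + 1)]
        idx = torch.tensor([hash(g) % self.n_buckets for g in grams])
        return self.word_emb(word_id) + self.ngram_emb(idx).sum(dim=0)
```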
Transactions of the Association for Computational Linguistics, vol. 6, pp. 467–481, 2018. Action Editor: Jordan Boyd-Graber. Submission batch: 11/2017; Revision batch: 2/2018; Published 7/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Detecting Institutional Dialog Acts in Police Traffic Stops

Vinodkumar Prabhakaran, Stanford Univ., CA; Camilla Griffiths, Stanford Univ., CA; Hang Su, UC Berkeley, CA; Prateek Verma, Stanford Univ., CA; Nelson Morgan, ICSI Berkeley, CA; Jennifer L. Eberhardt, Stanford Univ., CA; Dan Jurafsky, Stanford Univ., CA
{vinodkpg,camillag,jleberhardt,jurafsky}@stanford.edu, {suhang3240,prateek119}@gmail.com, morgan@uprise.org

Abstract

We apply computational dialog methods to police body-worn camera footage to model conversations between police officers and community members in traffic stops. Relying on the theory of institutional talk, we develop a labeling scheme for police speech during traffic stops, and a tagger to detect institutional dialog acts (Reasons, Searches, Offering Help) from transcribed text at the turn (78% F-score) and stop (89% F-score) level. We then develop speech recognition and segmentation algorithms to detect these acts at the stop level from raw camera audio (81% F-score, with even higher accuracy for crucial acts like conveying the reason for the stop). We demonstrate that the dialog structures produced by our tagger could reveal whether officers follow law enforcement norms like introducing themselves, explaining the reason for the stop, and asking permission for searches. This work may therefore inform and aid efforts to ensure the procedural justice of police-community interactions.

1 Introduction

Improving the relationship between police officers and the communities they serve is a critical societal goal. We propose to study this relationship by applying NLP techniques to conversations between officers and community members in traffic stops. Traffic stops are one of the most common forms of police contact with community members, with 10% of U.S. adults pulled over every year (Langton and Durose, 2013). Yet past research on what people experience during these traffic stops has mainly been limited to self-reported behavior and post-hoc narratives (Lundman and Kaufman, 2003; Engel, 2005; Brunson, 2007; Epp et al., 2014).

The rapid adoption of body-worn cameras by police departments in the U.S. (laws in 60% of states in the U.S. encourage the use of body cameras) and across the world has provided unprecedented insight into traffic stops.¹ While footage from these cameras is used as evidence in contentious cases, the unstructured nature and immense volume of video data means that most of this footage is untapped.

Recent work by Voigt et al. (2017) demonstrated that body-worn camera footage could be used not just as evidence in court, but as data. They developed algorithms to automatically detect the degree of respect that officers communicated to drivers in close to 1,000 routine traffic stops captured on camera. It was the first study to use machine learning techniques to extract insights from this footage.

This footage can be further used to unearth the structure of police-community interactions and gain a more comprehensive picture of the traffic stop as an everyday institutional practice. For instance, knowing which requests the officer makes, whether and when they introduce themselves or explain the reason for the stop is a novel way to measure procedural justice; a set of fairness principles recommended by the President's Task Force on 21st Century Policing,² and endorsed by police departments across the U.S.

¹ https://en.wikipedia.org/wiki/Body_worn_video_(police_equipment)
² http://www.theiacp.org/TaskForceReport
Transactions of the Association for Computational Linguistics, vol. 6, pp. 391–406, 2018. Action Editor: Katrin Erk. Submission batch: 8/2017; Revision batch: 12/2017; Published 6/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Measuring the Evolution of a Scientific Field through Citation Frames

David Jurgens, University of Michigan, jurgens@umich.edu
Srijan Kumar, Stanford University, srijan@stanford.edu
Raine Hoover, Stanford University, raine@stanford.edu
Dan McFarland, Stanford University, dmcfarla@stanford.edu
Dan Jurafsky, Stanford University, jurafsky@stanford.edu

Abstract

Citations have long been used to characterize the state of a scientific field and to identify influential works. However, writers use citations for different purposes, and this varied purpose influences uptake by future scholars. Unfortunately, our understanding of how scholars use and frame citations has been limited to small-scale manual citation analysis of individual papers. We perform the largest behavioral study of citations to date, analyzing how scientific works frame their contributions through different types of citations and how this framing affects the field as a whole. We introduce a new dataset of nearly 2,000 citations annotated for their function, and use it to develop a state-of-the-art classifier and label the papers of an entire field: Natural Language Processing. We then show how differences in framing affect scientific uptake and reveal the evolution of the publication venues and the field as a whole. We demonstrate that authors are sensitive to discourse structure and publication venue when citing, and that how a paper frames its work through citations is predictive of the citation count it will receive. Finally, we use changes in citation framing to show that the field of NLP is undergoing a significant increase in consensus.

[Figure 1: example sentence "Unlike CITE, we use the method of CITE, which has been used previously for parsing (CITE)." The three citations are labeled Contrast, Use, and Background, respectively.]

Figure 1: Examples of citation functionality.

1 Introduction

Authors use citations to frame their contributions and connect to an intellectual lineage (Latour, 1987). An author's scientific frame employs citations in multiple ways (Figure 1) so as to build a strong and multifaceted argument. These differences in citations have been examined extensively within the context of a single paper (Swales, 1986; White, 2004; Ding et al., 2014). However, we know relatively little about how these citation frames develop over time within a field and what impact they have on scientific uptake.

Answering these questions has been largely hindered by the lack of a dataset showing how citations function at the field scale. Here, we perform the first field-scale study of citation framing by first developing a state-of-the-art method for automatically classifying citation function and then applying this method to an entire field's literature to quantify the effects and evolution of framing.

Analyzing large-scale changes in citation framing requires an accurate method for classifying the function a citation plays towards furthering an argument. Due to the difficulty of interpreting citation intent, many prior works performed manual analysis (Moravcsik and Murugesan, 1975; Swales, 1990; Harwood, 2009) and only recently have automated approaches been developed (Teufel et al., 2006b; Valenzuela et al., 2015). Here, we unify core aspects of several prior citation annotation schemes (White, 2004; Ding et al., 2014; Hernández-Alvarez and Gomez, 2016). Using this scheme, we create…
Transactions of the Association for Computational Linguistics, vol. 6, pp. 329–342, 2018. Action Editor: Ivan Titov. Submission batch: 1/2018; Revision batch: 3/2018; Published 5/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Native Language Cognate Effects on Second Language Lexical Choice

Ella Rabinovich⋆♦, Yulia Tsvetkov†, Shuly Wintner⋆
⋆Department of Computer Science, University of Haifa
♦IBM Research
†Language Technologies Institute, Carnegie Mellon University
ellarabi@gmail.com, ytsvetko@cs.cmu.edu, shuly@cs.haifa.ac.il

Abstract

We present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are able to reconstruct the phylogenetic language tree of the Indo-European language family solely from the frequencies of specific lexical items in the English of authors with various native languages. We quantitatively analyze non-native lexical choice, highlighting cognate facilitation as one of the important phenomena shaping the language of non-native speakers.

1 Introduction

Acquisition of vocabulary and semantic knowledge of a second language, including appropriate word choice and awareness of subtle word meaning contours, are recognized as a notoriously hard task, even for advanced non-native speakers. When non-native authors produce utterances in a foreign language (L2), these utterances are marked by traces of their native language (L1). Such traces are known as transfer effects, and they can be phonological (a foreign accent), morphological, lexical, or syntactic. Specifically, psycholinguistic research has shown that the choice of lexical items is influenced by the author's L1, and that non-native speakers tend to choose words that happen to have cognates in their native language.

Cognates are words in two languages that share both a similar meaning and a similar phonetic (and, sometimes, also orthographic) form, due to a common ancestor in some protolanguage. The definition is sometimes also extended to words that have similar forms and meanings due to borrowing. Most studies on cognate facilitation have been conducted with few human subjects, focusing on few words, and the experimental setup was such that participants were asked to produce lexical choices in an artificial setting. We demonstrate that cognates affect lexical choice in L2 spontaneous production on a much larger scale.

Using a new and unique large corpus of non-native English that we introduce as part of this work, we identify a focus set of over 1000 words, and show that they are distributed very differently across the "Englishes" of authors with various L1s. Importantly, we go to great lengths to guarantee that these words do not reflect specific properties of the various native languages, the cultures associated with them, or the topics that may be relevant for particular geographic regions. Rather, these are "ordinary" words, with very little culture-specific weight, that happen to have synonyms in English that may reflect cognates in some L1s, but not all of them. Consequently, they are used differently by authors with different linguistic backgrounds, to the extent that the authors' L1s can be identified through their use of the words in the focus set. The signal of L1 is so powerful, that we are able to reconstruct a linguistic typology tree from the distribution of these words in the Englishes witnessed in the corpus.

We propose a methodology for creating a focus set of highly frequent, unbiased words that we expect to be distributed differently across different Englishes simply because they happen to have synonyms with different etymologies, even though they…
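The tree-reconstruction step described above can be sketched with ordinary agglomerative clustering: represent each L1 group by the relative frequencies of the focus-set words in its English productions and cluster the resulting vectors. Ward linkage over Euclidean distances is an illustrative choice of this sketch, not necessarily the paper's.

```python
from scipy.cluster.hierarchy import dendrogram, linkage

def phylogeny_from_frequencies(freq_matrix, languages):
    """Reconstruct a language tree from lexical-choice signatures.
        freq_matrix: (n_languages, n_focus_words) array of relative
                     frequencies of focus-set words per L1 group
        languages:   list of L1 names, one per row
    Returns the dendrogram structure, which can be compared against
    the gold Indo-European phylogeny."""
    tree = linkage(freq_matrix, method='ward')   # agglomerative merge order
    return dendrogram(tree, labels=languages, no_plot=True)
```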
Transactions of the Association for Computational Linguistics, vol. 6, pp. 317–328, 2018. Action Editor: Katrin Erk. Submission batch: 6/2017; Revision batch: 9/2017; Published 5/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

The NarrativeQA Reading Comprehension Challenge

Tomáš Kočiský†‡, Jonathan Schwarz†, Phil Blunsom†‡, Chris Dyer†, Karl Moritz Hermann†, Gábor Melis†, Edward Grefenstette†
†DeepMind  ‡University of Oxford
{tkocisky,schwarzjn,pblunsom,cdyer,kmh,melisgl,etg}@google.com

Abstract

Reading comprehension (RC)—in contrast to information retrieval—requires integrating information and reasoning about events, entities, and their relations across a full document. Question answering is conventionally used to assess RC ability, in both artificial agents and children learning to read. However, existing RC datasets and tasks are dominated by questions that can be solved by selecting answers using superficial information (e.g., local context similarity or global term frequency); they thus fail to test for the essential integrative aspect of RC. To encourage progress on deeper comprehension of language, we present a new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts. These tasks are designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience. We show that although humans solve the tasks easily, standard RC models struggle on the tasks presented here. We provide an analysis of the dataset and the challenges it presents.

Title: Ghostbusters II
Question: How is Oscar related to Dana?
Answer: her son
Summary snippet: …Peter's former girlfriend Dana Barrett has had a son, Oscar…
Story snippet:
  DANA (setting the wheel brakes on the buggy) Thank you, Frank. I'll get the hang of this eventually.
  She continues digging in her purse while Frank leans over the buggy and makes funny faces at the baby, OSCAR, a very cute nine-month old boy.
  FRANK (to the baby) Hiya, Oscar. What do you say, slugger?
  FRANK (to Dana) That's a good-looking kid you got there, Ms. Barrett.

Figure 1: Example question–answer pair. The snippets here were extracted by humans from summaries and the full text of movie scripts or books, respectively, and are not provided to the model as supervision or at test time. Instead, the model will need to read the full text and locate salient snippets based solely on the question and its reading of the document in order to generate the answer.

1 Introduction

Natural language understanding seeks to create models that read and comprehend text. A common strategy for assessing the language understanding capabilities of comprehension models is to demonstrate that they can answer questions about documents they read, akin to how reading comprehension is tested in children when they are learning to read. After reading a document, a reader usually cannot reproduce the entire text from memory, but often can answer questions about underlying narrative elements of the document: the salient entities, events, places, and the relations between them. Thus, testing understanding requires the creation of questions that examine high-level abstractions instead of just facts occurring in one sentence at a time. Unfortunately, superficial questions about a document may often be answered successfully (by both humans and machines) using a shallow pattern match…
Transactions of the Association for Computational Linguistics, vol. 6, pp. 287–302, 2018. Action Editor: Katrin Erk. Submission batch: 10/2017; Revision batch: 2/2018; Published 5/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Constructing Datasets for Multi-hop Reading Comprehension Across Documents

Johannes Welbl¹, Pontus Stenetorp¹, Sebastian Riedel¹²
¹University College London, ²Bloomsbury AI
{j.welbl,p.stenetorp,s.riedel}@cs.ucl.ac.uk

Abstract

Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently no resources exist to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence—effectively performing multi-hop, alias multi-step, inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced,¹ and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that one can integrate information across documents. However, both models struggle to select relevant information; and providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 54.5% on an annotated test set, compared to human performance at 85.0%, leaving ample room for improvement.

1 Introduction

Devising computer systems capable of answering questions about knowledge described using text has…

[Figure residue, truncated in the source: "The Hanging Gardens, in [Mumbai], also known as Pherozeshah Mehta Gardens, are terraced gardens … They provide sunset views over the [Arabian Sea] …" / "Mumbai (also known as Bombay, …"]

¹ Available at http://qangaroo.cs.ucl.ac.uk
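The dataset-construction idea of chaining documents can be pictured as search over a bipartite graph of documents and the entities they mention: start from the query entity and hop through documents until one containing the candidate answer is reached. The breadth-first sketch below is only a cartoon of that methodology; the paper's actual procedure imposes additional constraints.

```python
from collections import deque

def find_support_chain(start_entity, answer, doc_entities):
    """doc_entities maps doc_id -> set of entity strings mentioned in
    that document. Returns a list of documents forming one multi-hop
    chain from start_entity to a document mentioning the answer, or
    None if no chain exists."""
    entity_docs = {}
    for d, ents in doc_entities.items():
        for e in ents:
            entity_docs.setdefault(e, set()).add(d)

    queue = deque([(start_entity, [])])
    seen = {start_entity}
    while queue:
        entity, chain = queue.popleft()
        for d in entity_docs.get(entity, ()):
            if d in chain:
                continue
            if answer in doc_entities[d]:
                return chain + [d]      # each document is one "hop"
            for e in doc_entities[d]:
                if e not in seen:
                    seen.add(e)
                    queue.append((e, chain + [d]))
    return None
```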
Transactions of the Association for Computational Linguistics, vol. 6, pp. 269–285, 2018. Action Editor: Diana McCarthy. Submission batch: 11/2017; Revision batch: 2/2018; Published 5/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Bootstrap Domain-Specific Sentiment Classifiers from Unlabeled Corpora

Andrius Mudinas, Dell Zhang, and Mark Levene
Department of Computer Science and Information Systems
Birkbeck, University of London
London WC1E 7HX, UK
andrius@dcs.bbk.ac.uk, dell.z@ieee.org, mark@dcs.bbk.ac.uk

Abstract

There is often the need to perform sentiment classification in a particular domain where no labeled document is available. Although we could make use of a general-purpose off-the-shelf sentiment classifier…
Transactions of the Association for Computational Linguistics, vol. 6, pp. 241–252, 2018. Action Editor: Brian Roark. Submission batch: 9/2017; Revision batch: 12/2017; Published 4/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results

Matt Crane
David R. Cheriton School of Computer Science, University of Waterloo
matt.crane@uwaterloo.ca

Abstract

"Based on theoretical reasoning it has been suggested that the reliability of findings published in the scientific literature decreases with the popularity of a research field" (Pfeiffer and Hoffmann, 2009). As we know, deep learning is very popular and the ability to reproduce results is an important part of science. There is growing concern within the deep learning community about the reproducibility of results that are presented. In this paper we present a number of controllable, yet unreported, effects that can substantially change the effectiveness of a sample model, and thusly the reproducibility of those results. Through these environmental effects we show that the commonly held belief that distribution of source code is all that is needed for reproducibility is not enough. Source code without a reproducible environment does not mean anything at all. In addition the range of results produced from these effects can be larger than the majority of incremental improvement reported.

1 Introduction

The recent "reproducibility crisis" (Baker, 2016) in various scientific fields (particularly Psychology and Social Sciences) indicates that some introspection is needed in all fields, particularly those that are experimental by nature. The efforts of Collberg's repeatability studies highlight the state of affairs within the computer systems research community (Moraila et al., 2014; Collberg et al., 2015).¹ Other fields have also begun to push for more stringent presentation of results; for example, the information retrieval community has been aware for some time of the issues surrounding weak baselines (Armstrong et al., 2009) and more recently reproducibility (Arguello et al., 2016; Lin et al., 2016).

The issue of reproducibility in the deep-learning community has also started to become a growing concern, with the need for replicable and reproducible results being included in a list of challenges for the ACL (Nivre, 2017). In reinforcement learning, Henderson et al. (2017) showed that there are a number of effects that would change the results obtained by published authors and call for more rigorous testing, and reporting, of state-of-the-art methods. There is also an ongoing project by OpenAI to provide baselines in reinforcement learning that are reproduced from published descriptions, but even they admit that their scores are only "roughly on par with the scores in published papers."² Reimers and Gurevych (2017) investigated over 50,000 combinations of hyper-parameter settings, such as word embedding sources and the optimizer, across five different NLP tasks and found that these settings have a significant impact on both the variability and the relative effectiveness of models.

In this paper we present a number of controllable environment settings that often go unreported, and illustrate that these are factors that can cause irreproducibility of results as presented in the literature. These environmental factors have an effect on the effectiveness of neural networks due to the non-convexity of the optimization surface, meaning that…

¹ http://reproducibility.cs.arizona.edu
² https://blog.openai.com/openai-baselines-dqn/
Transactions of the Association for Computational Linguistics, vol. 6, pp. 225–240, 2018. Action Editor: Philipp Koehn. Submission batch: 10/2017; Revision batch: 2/2018; Published 4/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Scheduled Multi-Task Learning: From Syntax to Translation

Eliyahu Kiperwasser*, Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel, elikip@gmail.com
Miguel Ballesteros, IBM Research, 1101 Kitchawan Road, Route 134, Yorktown Heights, NY 10598, U.S., miguel.ballesteros@ibm.com

Abstract

Neural encoder-decoder models of machine translation have achieved impressive results, while learning linguistic knowledge of both the source and target languages in an implicit end-to-end manner. We propose a framework in which our model begins learning syntax and translation interleaved, gradually putting more focus on translation. Using this approach, we achieve considerable improvements in terms of BLEU score on a relatively large parallel corpus (WMT14 English to German) and a low-resource (WIT German to English) setup.

1 Introduction

Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014) has recently become the state-of-the-art approach to machine translation (Bojar et al., 2016). One of the main advantages of neural approaches is the impressive ability of RNNs to act as feature extractors over the entire input (Kiperwasser and Goldberg, 2016), rather than focusing on local information. Neural architectures are able to extract linguistic properties from the input sentence in the form of morphology (Belinkov et al., 2017) or syntax (Linzen et al., 2016).

Nonetheless, as shown in Dyer et al. (2016) and Dyer (2017), systems that ignore explicit linguistic structures are incorrectly biased and they tend to make overly strong linguistic generalizations. Providing explicit linguistic information (Dyer et al., 2016; Kuncoro et al., 2017; Niehues and Cho, 2017; Sennrich and Haddow, 2016; Eriguchi et al., 2017; Aharoni and Goldberg, 2017; Nadejde et al., 2017; Bastings et al., 2017; Matthews et al., 2018) has proven to be beneficial, achieving higher results in language modeling and machine translation.

Multi-task learning (MTL) consists of being able to solve synergistic tasks with a single model by jointly training multiple tasks that look alike. The final dense representations of the neural architectures encode the different objectives, and they leverage the information from each task to help the others. For example, tasks like multiword expression detection and part-of-speech tagging have been found very useful for others like combinatory categorical grammar (CCG) parsing, chunking and super-sense tagging (Bingel and Søgaard, 2017).

In order to perform accurate translations, we proceed by analogy to humans. It is desirable to acquire a deep understanding of the languages; and, once this is acquired it is possible to learn how to translate gradually and with experience (including revisiting and re-learning some aspects of the languages). We propose a similar strategy by introducing the concept of Scheduled Multi-Task Learning (Section 4) in which we propose to interleave the different tasks.

In this paper, we propose to learn the structure of language (through syntactic parsing and part-of-speech tagging) with a multi-task learning strategy with the intentions of improving the performance of tasks like machine translation that use that structure and make generalizations. We achieve considerable improvements in terms of BLEU score on a relatively large parallel corpus (WMT14 English to Ger…

* Work carried out during summer internship at IBM Research.
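The scheduling idea, interleaving syntax tasks with translation and gradually shifting focus to translation, can be sketched as a per-step task sampler whose translation probability grows over training. The linear ramp and the task names below are illustrative assumptions; the paper explores its own concrete schedules.

```python
import random

def pick_task(step, total_steps, aux_tasks=('pos-tagging', 'parsing')):
    """Scheduled multi-task sampler (sketch): early in training,
    auxiliary syntax tasks are drawn often; by the end, nearly every
    step is a translation step."""
    progress = step / total_steps                  # goes from 0 to 1
    p_translation = min(1.0, 0.5 + 0.5 * progress) # ramps 0.5 -> 1.0
    if random.random() < p_translation:
        return 'translation'
    return random.choice(list(aux_tasks))

# Usage: at each training step, run one batch of whichever task is
# sampled, sharing the encoder across tasks.
schedule = [pick_task(t, 100000) for t in range(100000)]
```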