Transactions of the Association for Computational Linguistics, vol. 3, pp. 545–558, 2015. Action Editor: Jason Eisner.

Transactions of the Association for Computational Linguistics, vol. 3, pp. 545–558, 2015. Action Editor: Jason Eisner.
Submission batch: 5/2015; Revision batch: 10/2015; Published 11/2015.

2015 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.

c
(cid:13)

ImitationLearningofAgenda-basedSemanticParsersJonathanBerantStanfordUniversityyonatan@cs.stanford.eduPercyLiangStanfordUniversitypliang@cs.stanford.eduAbstractSemanticparsersconventionallyconstructlogicalformsbottom-upinafixedorder,re-sultinginthegenerationofmanyextraneouspartiallogicalforms.Inthispaper,wecom-bineideasfromimitationlearningandagenda-basedparsingtotrainasemanticparserthatsearchespartiallogicalformsinamorestrate-gicorder.Empirically,ourparserreducesthenumberofconstructedpartiallogicalformsbyanorderofmagnitude,andobtainsa6x-9xspeedupoverfixed-orderparsing,whilemain-tainingcomparableaccuracy.1IntroductionSemanticparsing,thetaskofmappingnaturallanguagetosemanticrepresentations(e.g.,log-icalforms),hasemergedinrecentyearsasapromisingparadigmfordevelopingquestionan-sweringsystems(ZelleandMooney,1996;Zettle-moyerandCollins,2005;WongandMooney,2007;Kwiatkowskietal.,2010;Liangetal.,2011)andothernaturallanguageinterfaces(ZettlemoyerandCollins,2007;Tellexetal.,2011;Matuszeketal.,2012).Recently,therehavebeentwoma-jortrends:Thefirstistoscalesemanticparsingtolargeknowledgebases(KB)suchasFreebase(CaiandYates,2013;Kwiatkowskietal.,2013;BerantandLiang,2014).Thesecondistolearnsemanticparserswithoutrelyingonannotatedlogicalforms,butinsteadontheirdenotations(answers)(Clarkeetal.,2010;Liangetal.,2011);thislessenstheanno-tationburdenandhasbeeninstrumentalinfuelingthefirsttrend(Berantetal.,2013).whatcitywasabrahamlincolnbornin3915083622020>1MType.City(cid:117)PlaceOfBirthOf.AbeLincolnType.Loc(cid:117)ContainedBy.LincolnTownAbeLincolnLincolnTownUSSLincolnType.CityType.LocPlaceOfBirthOfPlacesLivedFigure1:Aparsingchartfortheutterance“whatcitywasabrahamlincolnbornin”.Numbersinchartcellsindicatethenumberofpossiblesemanticparsesconstructedoverthatspan,andarrowspointtosomeofthelogicalformsthatwerecon-structed.Therearemorethanonemillionpossiblesemanticparsesforthisutterance.Inthispaper,weareinterestedintrainingseman-ticparsersfromdenotationsonlargeKBs.Thechal-lengeinthissettingisthatthevocabularyofthetargetlogicallanguageoftencontainsthousandsoflogicalpredicates,andthereisamismatchbetweenthestructureofthenaturallanguageandthelogicallanguage.Asaresult,thespaceofpossibleseman-ticparsesforevenashortutterancegrowsquickly.Forexample,considertheutterance“whatcitywasabrahamlincolnbornin”.Figure1illustratesthenumberofpossiblesemanticparsesthatcanbecon-structedoversomeoftheutterancespans.Justbycombiningsemanticparsesoverthespans“city”,“lincoln”and“born”wealreadyobtain362·391·20possibleparses;attheroot,wegetoveramillionparses.1Theambiguityoflanguagethusresultsina1Evenwhentypeconstraintsareusedtopruneparses,westillproducemorethanamillionpossibleparsesattheroot.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

546

Root:Type.City(cid:117)PlaceOfBirthOf.AbeLincolnwhatSet:Type.CitycitywasSet:PlaceOfBirthOf.AbeLincolnEntity:AbeLincolnabrahamlincolnBinary:PlaceOfBirthOfborninJoinIntersectLexLexLexFigure2:Anexamplesemanticparse,orderivation,fortheutterance“whatcitywasabrahamlincolnbornin”.Eachnodeinthetreehasacategory(e.g.,ENTITY)andalogicalform(e.g.,AbeLincoln).hardsearchproblem.Tomanagethiscombinatorialexplosion,pastapproaches(KrishnamurthyandMitchell,2012;Kwiatkowskietal.,2013;Berantetal.,2013)usedbeamsearch,wherethenumberofparses(seeFig-ure2)foreachchartcell(par exemple.,(SET,3:5))iscappedatK.Typicalbottom-upparsingisemployed,wherewebuildallparsesforspansoflengthnbeforen+1,etc.Thisfixed-orderparsingstrategycon-structsmanyunnecessaryparsesthough.Forex-ample,itwouldcreateKparsesforthecategoryENTITYandthespanover“lincoln”,generatingthelogicalformUSSLincoln,althoughitisunlikelythatthisentitywouldbeinthefinallogicalform.Toovercometheproblemswithfixed-orderpars-ing,weturntoagenda-basedparsing(Kay,1986;CaraballoandCharniak,1998;KleinandManning,2003;PaulsandKlein,2009;AuliandLopez,2011).Inagenda-basedparsing,anagenda(priorityqueue)holdspartialparsesthatcanbeconstructednext.Ateachstep,theparsewiththehighestpriorityispoppedfromtheagendaandputintothechart.Thisgivestheparserfullcontroloverthesequenceofparsesconstructed.Butimportantly,agenda-basedparsingrequiresagoodscoringfunctionthatcanranknotjustfullparsesbutalsopartialparsesontheagenda.Howdoweobtainsuchascoringfunction?Tothisend,weborrowideasfromimitationlearn-ingforstructuredprediction(Daumeetal.,2009;Rossetal.,2011;GoldbergandNivre,2013;Changetal.,2015).Specifically,wecastagenda-basedse-manticparsingasaMarkovdecisionprocess,wherethegoalistolearnapolicy,thatgivenastate(i.e.,thecurrentchartandagenda),choosesthebestnextaction(i.e.,theparsetopopfromtheagenda).Thesupervisionsignalisusedtogenerateasequenceoforacleactions,fromwhichthemodelistrained.OurworkbearsastrongresemblancetoJiangetal.(2012),whoappliedimitationlearningtoagenda-basedparsing,butinthecontextofsyntacticpars-ing.However,twonewchallengesariseinseman-ticparsing.First,syntacticparsingassumesgoldparses,fromwhichitiseasytoderiveanoracleac-tionsequence.Incontrast,wetrainfromquestion-answerpairsonly(ratherthanparsetreesorevenlogicalforms),sogeneratinganoraclesequenceismorechallenging.Second,semanticparsersexploreamuchlargersearchspacethansyntacticparsers,duetothehighlevelofuncertaintywhentranslatingtologicalform.Thus,weholdabeamofparsesforeachchartcell,andmodifylearningforthissetup.Wegainfurtherefficiencybyintroducingalazyagenda,whichreducesthenumberofparsesthatneedtobescored.Forexample,thesingleactionofprocessing“born”,requiresplacing391logicalformsontheagenda,althoughonlyfewofthemwillbeused.Ourlazyagendaholdsderivationstreams,whichimplicitlyrepresenta(possiblyin-finite!)groupofrelatedparsesasasingleagendaitem,andlazilymaterializeparsesasneeded.Em-pirically,thisreducesthenumberofparsesthatarescoredattrainingtimeby35%.Last,wemakemodelingcontributionsbyaug-mentingthefeaturesetpresentedbyBerantetal.(2013)withnewfeaturesthatimprovethemappingofphrasestoKBpredicates.Weevaluateouragenda-basedparserontheWE-BQUESTIONSdataset(Berantetal.,2013)againstafixed-orderparser,andobservethatourparserre-ducesthenumberofparsingactionsbyanorderofmagnitude,achievesa6x-9xspeedup,andobtainsacomparableaccuracyof49.7%.Toconclude,thispaperdescribesthreecontribu-tions:D'abord,anovelagenda-basedsemanticparserthatlearnstochoosegoodparsingactions,train-ingfromquestion-answerpairsonly;Deuxième,alazyagendathatpacksparsesinstreamsandreducesthenumberofgeneratedparses;Last,modelingchangesthatsubstantiallyimproveaccuracy.2SemanticParsingTaskWhileouragenda-basedsemanticparserappliesmorebroadly,ourexpositionwillbebasedonour

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

547

primarymotivation,questionansweringonaknowl-edgebase.Thesemanticparsingtaskisdefinedasfollows:Given(je)aknowledgebase(KB)K,(ii)agrammarG(definedshortly),et(iii)atrainingsetofquestion-answerpairs{(xi,yi)}ni=1,outputase-manticparserthatmapsnewquestionsxtoanswersyvialatentlogicalformsz.WenowbrieflydescribetheKBandlogicalformsusedinthispaper.LetEdenoteasetofentities(e.g.,AbeLincoln),andletPdenoteasetofproperties(e.g.,PlaceOfBirthOf).AknowledgebaseKisasetofassertions(e1,p,e2)∈E×P×E(par exemple.,(Hodgenville,PlaceOfBirthOf,AbeLincoln)).WeusetheFreebaseKB(Google,2013),whichhas41Mentities,19Kproperties,and596Massertions.ToquerytheKB,weusethelogicallanguagesimpleλ-DCS.Insimpleλ-DCS,anentity(e.g.,AbeLincoln)denotesthesingletonsetcontainingthatentity;thisisaspecialcaseofaunarypred-icate.Aproperty(aspecialcaseofabinarypredicate)canbejoinedwithaunarypredicate;e.g.,PlaceOfBirthOf.AbeLincolndenotesallen-titiesthataretheplaceofbirthofAbrahamLin-coln.Wealsohaveintersection:Type.CityuPlaceOfBirthOf.AbeLincolndenotesthesetofen-titiesthatarebothcitiesandtheplaceofbirthofAbrahamLincoln.WewriteJzKKforthedenotationofalogicalformzwithrespecttoaKBK.Foraformaldescriptionofλ-DCS,seeLiang(2013).3GrammarsandSemanticFunctionsSincewearelearningsemanticparsersfromdeno-tations,wecannotinduceagrammarfromprovidedlogicalforms(Kwiatkowskietal.,2010).Plutôt,weassumeasmallandflexiblegrammarthatspec-ifiesthespaceoflogicalforms.Thegrammarcon-sistsofabackboneCFG,butisatypicalinthateachruleisaugmentedwithasemantic(composition)functionthatproducesavaryingnumberofderiva-tionsusingarbitrarycontext.Thisflexibilitypro-videsproceduralcontroloverthegenerationoflogi-calforms.Formally,agrammarisatuple(V,N,R.),whereVisasetofterminals(words),Nisasetofcate-gories(suchasBINARY,ENTITY,SETandROOTinFigure2,whereROOTistherootcategory),andRisarulesetofbinaryandunaryrules,explainedbelow.Abinaryruler∈RhastheformA→BC[F],whereA∈Nistheleft-hand-side,BC∈N2istheright-hand-side(RHS),andfisasemanticfunc-tion,explainedbelow.Givenanutterancex,thegrammardefinesasetofderivations(semanticparsetrees)overeveryspanxi:j=(wi,wi+1,…,wj−1).DefineDtobethesetofallderivations,andletdAi:jbeaderivationoverthespanxi:jofcategoryA.Giventhederiva-tionsdBi:kanddCk:jandtheruler=A→BC[F],thesemanticfunctionf:D×D→2Dpro-ducesasetofderivationsf(dBi:k,dCk:j)overxi:jwithcategoryA.Inwords,thesemanticfunc-tiontakestwochildderivationsasinputandpro-ducesasetofcandidateoutputderivations.Foreachoutputderivationd,letd.rbetheruleused(SET→ENTITYBINARY[JOIN])andd.zbethelogicalformconstructedbyf,usuallycreatedbycombiningthelogicalformsofthechildderiva-tions(PlaceOfBirthOf.AbeLincoln).Thiscom-pletesourdescriptionofbinaryrules;unaryrulesA→B[F]andlexicalrulesA→w[F]arehandledsimilarly,wherew∈V+isasequenceofterminals.Figure3demonstratestheflexibilityofseman-ticfunctions.TheJOINsemanticfunctiontakesaderivationwhoselogicalformisabinarypredicate,andaderivationwhoselogicalformisaunarypred-icate,andperformsajoinoperation.LEXtakesaderivationrepresentingaphraseandoutputsmanycandidatederivations.INTERSECTtakestwoderiva-tionsandattemptstointersecttheirlogicalforms(asdefinedinSection2).Inthisspecificcase,noout-putderivationsareproducedbecausetheKBtypesforType.CityandReleaseDateOf.LincolnFilmdonotmatch.IncontrastwithCFGrulesforsyntacticpars-ing,ruleswithsemanticfunctionsgeneratesetsofderivationsratherthanasinglederivation.Weallowsemanticfunctionstoperformarbitraryoperationsonthechildderivations,accessexternalresourcessuchasFreebasesearchAPIandtheKB.Inprac-tice,ourgrammaremploys11semanticfunctions;inadditiontoJOIN,LEX,andINTERSECT,weuseBRIDGE,whichimplementsthebridgingoperation(seeSection8)fromBerantetal.(2013),aswellasonesthatrecognizedatesandfilterderivationsbasedonpart-of-speechtags,namedentitytags,etc..

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

548

Join(cid:18)Entity:AbeLincolnabrahamlincoln,Binary:PlaceOfBirthOfborn(cid:19)=(cid:40)Set:PlaceOfBirthOf.AbeLincolnEntity:AbeLincolnabrahamlincolnBinary:PlaceOfBirthOfborn(cid:41)Lex(lincoln)=(cid:40)Entity:AbeLincolnlincoln,Entity:LincolnFilmlincoln,…(cid:41)Intersect(cid:18)Set:Type.Citycity,Set:ReleaseDateOf.LincolnFilmabrahamlincolnborn(cid:19)={}Figure3:Asemanticfunction(weshowJOIN,LEXandINTERSECT)takesoneortwochildderivationsandreturnsasetofpossiblederivations.4Fixed-orderParsingWenowdescribefixed-orderparsingwithbeamsearch,whichhasbeenthecommonpracticeinpastwork(KrishnamurthyandMitchell,2012;Kwiatkowskietal.,2013;Berantetal.,2013).Letxbetheinpututterance.WecallderivationsdROOT0:|X|,spanningtheutterancexandwithrootcate-gory,rootderivations,andallotherderivationspar-tialderivations.Givenascoringfunctions:D→R,abottom-upfixed-orderparseriteratesoverspansxi:jofincreasinglengthnandcategoriesA∈N,andusesthegrammartogeneratederivationsbasedonderivationsofsubspans.Weusebeamsearch,inwhichforeveryspanxi:jandeverycategoryAwekeepabeamthatstoresuptoKderivationsinachartcellHAi:j(wheredifferentderivationsusuallycorrespondtodifferentlogicalforms).WedenotebyHthesetofderivationsinanychartcell.Afixed-orderparserisguaranteedtocomputetheKhighest-scoringderivationsifthefollowingtwoconditionshold:(je)allsemanticfunctionsreturnex-actlyonederivation,et(ii)thescoringfunctiondecomposes—thatis,thereisafunctionsrule:R→Rsuchthatforeveryruler=A→BC[F],thescoreofaderivationproducedbytheruleiss(dAi:j)=s(dBi:k)+s(dCk:j)+srule(r).Malheureusement,thetwoconditionsgenerallydonotholdinseman-ticparsing.Forexample,theINTERSECTfunctionreturnsanemptysetwhentype-checkingfails,vi-olatingcondition(je),andthescoringfunctionsof-tendependsonthedenotationsizeoftheconstructedlogicalform,violatingcondition(ii).Ingeneral,wewanttheflexibilityofhavingthescoringfunctiondependonthelogicalformsandsub-derivations,andthereforewewillnotbeconcernedwithexactnessinthispaper.Notethatwecouldaugmentthecat-egoriesNwiththelogicalform,butthiswouldin-creasethenumberofcategoriesexponentially.Model.Wefocusonlinearscoringfunctions:s(d)(d),whereφ(d)∈RFisthefeaturevectorandθ∈RFistheparametervectortobelearned.GivenanysetofderivationsD⊆D,wecandefinethecorrespondinglog-lineardistribution:(d|D)=exp{φ(d)}Pd0∈Dexp{φ(d0)}.(1)Learning.Thetrainingdataconsistsofasetofutterance-denotation(question-answer)pairs{(xi,yi)}ni=1.Tolearnθ,weuseanonlinelearn-ingalgorithm,whereoneach(xi,yi),weusebeamsearchbasedonthecurrentparameterstoconstructasetofrootderivationsDi=HROOT0:|X|,andthentakeagradientsteponthefollowingobjective:Oi(je)=logp(yi|xi)(2)=logXd∈Dipθ(d|Di)R.(d)+λkθk1,(3)whereR(d)[0,1]isarewardfunctionthatmea-suresthecompatibilityofthepredicteddenotationJd.zKKandthetruedenotationyi,2.Wemarginalizeoverlatentderivations,whichareweightedbytheircompatibilitywiththeobserveddenotationyi.Themaindrawbackoffixed-orderparsingisthattoobtaintheKrootderivationsDi,theparsermustfirstconstructKderivationsforallspansandallcat-egories,manyofwhichwillnotmakeitintoanyrootderivationd∈Di.Next,wedescribeagenda-basedparsing,whosegoalistogivetheparserbettercon-trolovertheconstructedderivations.2Jd.zKKandyiarebothsetsofentities,soRistheF1score.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

549

removeaddFigure4:Aschematicillustrationofaexecutingaparsingaction,specifiedbyaderivationontheagenda.First,weremoveitfromtheagendaandputitinthechart.Then,combineitwithotherchartderivationstoproducenewderivations,whichareaddedbacktotheagenda.Algorithm1Agenda-basedparsing1:procedurePARSE(X)2:INITAGENDA()3:alors que|Q|>0∧|HROOT0:|X||k>jandr=B→AC[F]∈Rdo14:fordCj,k∈HCj:kdo15:Q.addAll(F(dAi:j,dCj:k))16:forkθ.Computingthebestactionargmaxapθ(un|s)sim-plyinvolvespoppingfromthepriorityqueue.Ahistoryh=(s1,a1,…,aT,sT+1)(seeFig-ure5)isasequenceofstatesandactions,suchthats1hasanemptychartandaninitialagenda,andsT+1isaterminalstatereachedafterperformingthechartactioninwhichwechoosearootderivationaTfromHROOT0:|X|(Algorithm1).Thepolicyforchoosingparsingactionsinducesadistributionoverhistoriespθ(h)=QTt=1pθ(à|st).Atahighlevel,ourpolicyistrainedusingimi-tationlearningtomimicanoraclethattakesaopti-malactionateverystep(Daumeetal.,2009;Rossetal.,2011).Becauseinsemanticparsingwetrainfromquestionsandanswers,wedonothaveaccesstoanoracle.Instead,wefirstparsexbysamplingahistoryfromthecurrentpolicypθ;letd∗betherootderivationwithhighestrewardoutoftheKrootderivationsconstructed(voir(2)).Wethengenerateatargethistoryhtargetfromd∗usingtwoideas—localreweightingandhistorycompression,whichweex-plainshortly.Thepolicyparametersθarethenup-datedasfollows:θ←θ+ηR(htarget)TXt=1δt(htarget),(4)δt(h)=∇θlogpθ(à|st)(5)(à)−Epθ(a0t|st)[φ(a0t)].TherewardR(h)=R(aT)[0,1]measuresthecompatibilityofthereturnedderivation(voir(2)),andηisthelearningrate.4Notethatwhileourfea-4Notethatunlikestandardpolicygradient,ourupdatesarenotinvariant(eveninexpectation)toshiftingtherewardbyaconstant.Ourupdatesdonotmaximizereward,buttherewardmerelyprovidesawaytomodulatethemagnitudeoftheup-dates.st−1stst+1at−1atFigure5:Aschematicillustrationofa(partial)historyofstatesandactions.Eachellipserepresentsastate(chartandagenda),andtheredpathmarkstheactionschosen.turesφ(un)dependontheactiononly,theupdateruletakesintoaccountallactionsthatareontheagenda.Localreweighting.Giventhereferenced∗,letI[aind∗]indicatewhetheranactionaisasub-derivationofd∗.Wesamplehtargetfromthelo-callyreweighteddistributionp+wθ(un|s)∝pθ(un|s)·exp{βI[aind∗]}forsomeβ>0.Thisisamul-tiplicativeinterpolationofthemodeldistributionpθandtheoracle.Whenβishigh,thisreducestosam-plingfromtheavailableactionsind∗.Whennoor-acleactionsareavailable,thisreducestosamplingfrompθ.Theprobabilityofahistoryisdefinedasp+wθ(h)=QTt=1p+wθ(à|st).RecallweconstructKrootderivations.Aprob-lemwithlocalreweightingisthatafteraddingd∗tothechart,therearenomoreoracleactionsontheagendaandallsubsequentactionsaresimplysam-pledfromthemodel.Wefoundthatupdatingto-wardstheseactionshurtsaccuracy.Toavoidthisproblem,weproposeperforminghistorycompres-sion,describednext.Historycompression.Givend∗,wecandefineforeveryhistoryhasequenceofindices(t1,t2,…)suchthatI[atiind∗]=1foreveryi.Then,thecompressedhistoryc(h)=(st1,at1,st2,at2,…)isasequenceofstatesandactionssuchthatallactionschoosesub-derivationsofd∗.Notethatc(h)isnota“realhistory”inthesensethattak-ingactionatidoesnotnecessarilyresultinstatesti+1.InFigure6,thecompressedhistoryc(h)=(s1,a1,s3,a3,s4,a4,s5).Wecannowsampleatargethistoryhtargetfor(4)fromadistributionovercompressedhistories,

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

551

a4=d∗h:a1a3s1s2s3s4s5a1a2a3a4Figure6:Anexamplehistoryofstatesandactions,whereac-tionsthatarepartofthereferencederivationd∗=a4areinred.Thecompressedhistoryisc(h)=(s1,a1,s3,a3,s4,a4,s5).Algorithm2LearningalgorithmprocedureLEARN({xi,yi}ni=1)θ←0foreachiterationτandexampleidoh0←PARSE(,xi)d∗←CHOOSEORACLE(h0)htarget←PARSE(p+cwθ,xi)θ←θ+ητ,i·R(htarget)PTt=1δt(htarget)p+cθ(h)=Ph0:c(h0)=hpθ(h),wherewemarginal-izeoverallhistoriesthathavethesamecompressedhistory.Tosamplefromp+cθ,wesampleh0∼pθandreturnhtarget=c(h0).Thiswillprovideahis-torycontainingonlyactionsleadingtotheoracled∗.Inourfullmodel,wesampleahistoryfromp+cwθ,whichcombineslocalreweightingandhis-torycompression:wesampleh0∼p+wθandre-turnhtarget=c(h0).WeempiricallyanalyzelocalreweightingandhistorycompressioninSection9.Inpractice,wesetβlargeenoughsothatthebe-haviorofp+cwθisasfollows:wefirstconstructthereferenced∗bysamplingoracleactions.Aftercon-structingd∗,nooracleactionsareontheagenda,soweconstructK−1morerootderivations,samplingfrompθ(butnotetheseactionsarenotpartofthere-turnedcompressedhistory).Enfin,thelastactionchoosesd∗fromtheKderivations.Algorithm2summarizeslearning.Weinitializeourparameterstozero,andthenparseeachexam-plebysamplingahistoryfrompθ.WechoosethederivationwithhighestrewardinHROOT0:|X|astheref-erencederivationd∗.Thisdefinesp+cwθ,whichwesamplefromtoupdateparameters.Thelearningrateητ,iissetusingAdaGrad(Duchietal.,2010).6.2RelatedapproachesOurmethodisrelatedtopolicygradientinreinforce-mentlearning(Suttonetal.,1999):ifin(4),wesamplefromthemodeldistributionpθwithoutanoracle,thenourupdateisexactlythepolicygradi-entupdate,whichmaximizestheexpectedrewardEpθ(h)[R.(h)].Wedonotusepolicygradientsincethegradientisalmostzeroduringthebeginningoftraining,leadingtoslowconvergence.Thiscorrob-oratesJiangetal.(2012).OurmethodextendsJiangetal.(2012)toseman-ticparsing,whichposesthefollowingchallenges:(un)Wetrainfromdenotations,andmustobtainaref-erencetoguidelearning.(b)Tocombatlexicalun-certaintywemaintainabeamofsizeKineachpars-ingstate(weshowthisisimportantinSection9).(c)Weintroducehistorycompression,whichfocusesthelearnerontheactionsthatproducethecorrectderivationratherthanincorrectonesonthebeam.Interestingly,Jiangetal.(2012)foundthatimitationlearningdidnotworkwell,andobtainedimprove-mentsfrominterpolatingwithpolicygradient.Wefoundthatimitationlearningworkedwell,andin-terpolatingwithpolicygradientdidnotofferfurtherimprovements.ApossibleexplanationisthattheuncertaintypreservedintheKderivationsineachchartcellallowedimitationlearningtogeneralizeproperly,comparedtoJiangetal.(2012),whohadjustasingleitemineachchartcell.7LazyAgendaAswesawinSection3,asinglesemanticfunction(e.g.,LEX,BRIDGE)cancreatehundredsofderiva-tions.Scoringallthesederivationswhenaddingthemtotheagendaiswasteful,becausemosthavelowprobability.Inthissection,weassumesemanticfunctionsreturnaderivationstream,i.e.,anitera-torthatlazilycomputesderivationsondemand.OurlazyagendaGwillholdderivationstreamsratherthanderivations,andtheactualagendaQwillbedefinedonlyimplicitly.TheintuitionissimilartolazyK-bestparsing(HuangandChiang,2005),butisappliedtoagenda-basedsemanticparsing.Ourmainassumptionisthateveryderivationstreamg=[d1,d2,…],issortedbydecreasingscore:s(d1)≥s(d2)≥···(inpractice,thisisonlyapproximatedasweexplainattheendofthissec-tion).Wedefinethescoreofaderivationstreamass(g)=s(d1).AttesttimetheonlychangetoAl-gorithm1isinline4,whereinsteadofpoppingthe

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

552

Gs(g[1])|g|U[d1]710.88[d2,d3,d4,…]510011.92Gs(g[1])|g|U[d1]710.88[d2]510.12[d3]110.006[d4,…]−2980.004Figure7:Unrollingaderivationwhere(cid:15)=0.01and|G+|=1.Thestreaminredontheleftviolatesthestoppingcondition,andsoweunrolltwoderivationsuntilallstreamssatisfythecondition.highestscoringderivation,wepopthehighestscor-ingderivationstreamandprocessthefirstderivationonthestream.Then,wefeaturizeandscorethenextderivationonthestreamifthestreamisnotempty,andpushthestreambacktotheagenda.Thisguar-anteeswewillobtainthehighestscoringderivationineveryparsingaction.However,duringtrainingwesamplefromadistri-butionoverderivations,notjustreturntheargmax.Samplingfromthedistributionoverstreamscanbequiteinaccurate.Supposetheagendacontainstwoderivationstreams:g1containsonederivationwithscore1andg2contains50derivationswithscore0.Thenwewouldassigng1probabilitye1e1+e0=0.73insteadofthetruemodelprobabilitye1e1+50e0=0.05.Theissueisthatthefirstderivationofgisnotindicativeoftheactualprobabilitymassing.Oursolutionissimple:beforesampling(line4inAlgorithm1),weprocesstheagendatoguaranteethatthesumofprobabilitiesofallunscoredderiva-tionsissmallerthan(cid:15).LetGbethelazyagendaandG+⊆Gbethesubsetofderivationstreamsthatcontainmorethanonederivation(whereunscoredderivationsexist).Ifforeveryg∈G+,pθ(g)=Pd∈gpθ(d)(cid:15)|G+|,thentheprobabilitysumofallunscoredderivationissmall:Pg∈G+p(g)(cid:15).Toguaranteethatpθ(g)(cid:15)|G+|,weunrollgun-tilthisstoppingconditionissatisfied.Unrollingastreamfromg=[d1,d2,…]meanspoppingd1fromg,constructingasingletonderivationstreamgnew=[d1],pushinggnewtotheagendaandscoringtheremainingstreambasedonthenextderivations(g)=s(d2)(Figure7).Tocheckifp(g)(cid:15)|G+|,wedefinethefollow-ingupperboundUonp(g),whichisbasedonthenumberofderivationsinthestream|g|:(g)=Pd∈ges(d)Pg0∈GPd0∈g0es(d0)|g|es(g[1])Pg0∈Ges(g0[1])=Uwhereg[1]isthefirstderivationing.CheckingthatU≤(cid:15)|G+|iseasy,sinceitisbasedonlyonthefirstderivationofeverystream.Onceallstreamsmeetthiscriterion,weknowthatthetotalunscoredprob-abilityislessthan(cid:15).Aslearningprogresses,therearebemanylowprobabilityderivationswhichwecanskipentirely.Thelastmissingpieceisensuringthatstreamsaresortedwithoutexplicitlyscoringallderivations.Wemakeabestefforttopreservethisproperty.Sortingderivationstreams.Allderivationsinastreamghavethesamechildderivations,astheywereconstructedbyoneapplicationofaseman-ticfunctionf.Thus,thedifferenceintheirscoresisonlyduetonewfeaturescreatedwhenapplyingf.Wecandecomposethesenewfeaturesintotwodisjointfeaturesets.Onesetincludesfeaturesthatdependonthegrammarruleonlyandareindepen-dentoftheinpututterancex,andanotheralsode-pendsonx.Forexample,thesemanticfunctionf=LEXmapsphrases,suchas“bornin”,tolog-icalforms,suchasPlaceOfBirthOf.MostfeaturesextractedbyLEXdonotdependonx:thecon-junctionof“bornin”andPlaceOfBirthOf,thefre-quencyofthephrase“bornin”inacorpus,etc.However,somefeaturesmaydependonxaswell.Forexample,ifxis“whatcitywasabrahamlin-colnbornin”,wecanconjoinPlaceOfBirthOfwiththefirsttwowords“whatcity”.Asanotherex-ample,thesemanticfunctionBRIDGEtakesunarypredicates,suchasAbeLincoln,andjoinsthemwithanytype-compatiblebinarytoproducelogi-calforms,suchasPlaceOfBirthOf.AbeLincoln.Again,afeaturesuchasthenumberofassertionsinKthatcontainPlaceOfBirthOfdoesnotde-pendonx,whileafeaturethatconjoinstheintro-ducedbinary(PlaceOfBirthOf)withthemainverb(“born”),doesdependonx(seeSection8).Ourstrategyistopre-computeallfeaturesthatareindependentofxbeforetraining,5andsortstreams5ForLEX,thisrequiresgoingoveralllexiconentriesonce.ForBRIDGE,thisrequiresgoingonceovertheKB.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

553

basedonthesefeaturesonly,asanapproximationforthetrueorder.Let’sassumethatderivationsre-turnedbyanapplicationofasemanticfunctionfareparameterizedbyanauxiliarysetB.Forexample,whenapplyingLEXon“bornin”,Bwillincludealllexicalentriesthatmap“bornin”toabinarypredicate.WhenapplyingBRIDGEonAbeLincoln,Bwillincludeallbinarypredicatesthataretype-compatiblewithAbeLincoln.Weequipeachb∈BwithafeaturevectorφB(b)(computedbeforetrain-ing)ofallfeaturesthatareindependentofx.ThisgivesrisetoascoresB(b)=φB(b)>θthatdependsonthesemanticfunctiononly.Thus,wecansortBbeforeparsing,sothatwhenthefunctionfiscalled,wedonotneedtoinstantiatethederivations.NotethattheparametersθandthussBchangeduringlearning,sowere-sortBaftereveryiteration(ofgoingthroughalltrainingexamples),yieldinganapproximationtothetrueorderingofB.Inpractice,featuresextractedbyLEXdependmostlyonthelex-icalentryitselfandourapproximationisaccurate,whileforBRIDGEsomefeaturesdependonx,asweexplainnext.8FeaturesThefeaturesetinourmodelincludesallfeaturesde-scribedinBerantetal.(2013).6Inaddition,weaddnewlexicalizedfeaturesthatconnectnaturallan-guagephrasestobinarypredicates.InBerantetal.(2013),abinarypredicateisgeneratedusingalexiconconstructedofflineviaalignment,orthroughthebridgingoperation.Asmentionedabove,bridgingallowsustojoinunarypredicateswithbinarypredicatesthataretype-compatible,evenwhennowordintheutterancetrig-gersthebinarypredicate.Forexample,giventheut-terance“whatmoneytotaketosrilanka”,theparserwillidentifytheentitySriLanka,andbridgingwillproposeallpossiblebinaries,includingCurrency.Weaddafeaturetemplatethatconjoinsbinariessuggestedbybridging(Currency)withallcontentwordlemmas(“what”,“money”,“take”).Afterobservingenoughexamples,weexpectthefeaturecorrespondingto“money”andCurrencytobeup-weighted.Generatingfreelyandreweightingusing6Asinpreviouswork,somefeaturesusethefactthatthespellingofKBpredicatesisoftensimilartoEnglishwords.featurescanbeviewedasasoftwaytoexpandthelexiconduringtraining,similartolexicongeneration(ZettlemoyerandCollins,2005).Notethatthisfea-turedependsontheutterancex,andisnotusedforsortingstreams(Section7).Enfin,eachfeatureisactuallyduplicated:onecopyfireswhenchoosingderivationsontheagenda(Algorithm1,line4),andtheothercopyfireswhenchoosingthefinalrootderivation(line6).Wefoundthattheincreasedexpressivityfromseparatingfea-turesimprovesaccuracy.9ExperimentsWeevaluateoursemanticparserontheWEBQUES-TIONSdataset(Berantetal.,2013),whichcon-tains5,810question-answerpairs.Thequestionsareaboutpopulartopics(e.g.,“whatmoviesdoestaylorlautnerplayin?»)andanswersaresetsofentitiesobtainedthroughcrowdsourcing(allquestionsareanswerablebyFreebase).Weusetheprovidedtrain-testsplitandperformthreerandom80%-20%splitsofthetrainingdatafordevelopment.WeperformlexicallookupforFreebaseentitiesusingtheFreebaseSearchAPIandobtain20can-didateentitiesforeverynamedentityidentifiedbyStanfordCoreNLP(Manningetal.,2014).WeusethelexiconreleasedbyBerantetal.(2013)tore-trieveunaryandbinarypredicates.Weexecuteλ-DCSlogicalformsbyconvertingthemtoSPARQLandqueryingourlocalVirtuoso-backedcopyofFreebase.Duringtraining,weuseL1regularization,andcrudelytunehyperparametersonthedevelop-mentset(beamsizeK=200,toleranceforthelazyagenda(cid:15)=0.01,localreweightingβ=1000,andL1regularizationstrengthλ=10−5).Weevaluatedoursemanticparserusingthere-wardofthepredictions,i.e.,averageF1scoreonpredictedvs.trueentitiesoveralltestexamples.79.1MainResultsTable1providesourkeyresultcomparingthefixed-orderparser(FIXEDORDER)andourproposedagenda-basedparser(AGENDAIL).Inallsubse-quenttables,Train,Dev.,andTestdenotetrain-ing,developmentandtestaccuracies,|Act.|denotes7Weusetheofficialevaluationscriptfromhttp://www-nlp.stanford.edu/software/sempre/.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

554

TestTrain|Act.||Feat.|TimeFIXEDORDER49.660.618,12718,1271,782AGENDAIL49.761.11,3461,814291Table1:Testsetresultsforthestandardfixed-orderparser(FIXEDORDER)andournewagenda-basedparser(AGENDAIL),whichsubstantiallyreducesparsingtimeandthenumberofparsingactionsatnocosttoaccuracy.SystemAuthorsAcc.YV14YaoandVan-Durme(2014)35.4BCFL13Berantetal.(2013)35.7BDZZ14Baoetal.(2014)37.5BWC14Bordesetal.(2014)39.2BL14BerantandLiang(2014)39.9YDZR14Yangetal.(2014)41.3BWC14+BL14Bordesetal.(2014)41.8WYWH14Wangetal.(2014)45.3WYWH14Wangetal.(2014)45.3YCHG15Yihetal.(2015)52.5FIXEDORDERthiswork49.6AGENDAILthiswork49.7Table2:ResultsontheWEBQUESTIONStestset.theaveragenumberofparsingactions(popsfromagendainAGENDAILandderivationsplacedonchartinFIXEDORDER)perutterance,|Feat.|de-notestheaveragenumberoffeaturizedderivationsperutterance,andTimeisaverageparsingtimeinmilliseconds.WefoundthatAGENDAILis6xfasterthanFIXE-DORDER,performs13xfewerparsingactions,andreducesthenumberoffeaturizedderivationsbyanorderofmagnitude,withoutlossofaccuracy.Table2presentstestsetresultsofoursystems,comparedtorecentlypublishedresults.Wenotethatmostsystemsperformquestionansweringwith-outsemanticparsing.Ourfixed-orderparser,FIXE-DORDER,andagenda-basedparser,AGENDAIL,obtainanaccuracyof49.6and49.7respectively.Thisimprovesaccuracycomparedtoallprevioussystems,exceptforarecentlypublishedsemanticparserpresentedbyYihetal.(2015),whoseaccu-racyis52.5.Weattributeouraccuracyimprovementcomparedtoprevioussystemstothenewfeaturesandchangestothemodel,aswediscussbelow.BCFL13alsousedafixed-orderparser,butob-tainedlowerperformance.Themaindifferencesbe-tweenthesystemsarethat(je)ourmodelincludesnewfeatures(Section8)combinedwithL1regular-ization,(ii)weusetheFreebasesearchAPIratherthanstringmatching,et(iii)ourgrammargener-AlgorithmDev.|Act.||Feat.|TimeAGENDAIL48.01,4211,912214FIXEDORDER49.118,25918,2591,972AGENDA45.96,2116,320419FIXED+AGENDA47.16,2816,615775α=100047.811,27911,2791,216α=10035.63,8583,858174α=1027.01,6041,60478p+wθ43.31,7062,121238p+cθ36.83,7584,278358pθ1.212,30215,5241,497-BINARYANDLEMMA40.51,5612,110167Table3:DevelopmentsetresultsforvariantsofAGENDAIL.atesalargerspaceofderivations.9.2AnalysisTogaininsightintooursystemcomponents,weper-formextensiveexperimentsonthedevelopmentset.Comparisonwithfixed-orderparsing.Figure8comparesaccuracy,speedattesttime,andnumberofderivationsforAGENDAILandFIXEDORDER.ForAGENDAIL,weshowboththenumberofderivationspoppedfromtheagenda,aswellasnum-berofderivationsscored,whichisslightlyhigherduetoscoredderivationsontheagenda.Weob-servethatforsmallbeamsizes,AGENDAILsub-stantiallyoutperformsFIXEDORDER.ThisissinceAGENDAILexploitssmallbeamsmoreefficientlyinintermediateparsingstates.Forlargebeamsperfor-manceissimilar.Intermsofspeedandnumberofderivations,weseethatAGENDAILisdramaticallymoreefficientthanFIXEDORDER:withbeamsize200–400,itisroughlyasefficientasFIXEDORDERwithbeamsize10–20.Forthechosenbeamsize(K=200),AGENDAILis9xfasterthanFIXE-DORDER.ForK=1,performanceispoorforAGENDAILandzeroforFIXEDORDER.Thishighlightsthein-herentdifficultyofmappingtologicalformscom-paredtomoreshallowtasks,asmaintainingjustasinglebestderivationforeachparsingstateisnotsufficient.AcommonvariantonbeamparsingistoreplacethefixedbeamsizeKwithathresholdα,andpruneanyderivationwhoseprobabilityisatleastαtimessmallerthanthebestderivationinthatstate(Zhangetal.,2010;Bodenstabetal.,2011).Weimple-mentedthisbaselineandcomparedittoAGENDAIL

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

555

1102050100200400beam size01020304050accuracyFixedOrderAgendaIL1102050100200400beam size0.00.51.01.52.02.53.03.54.0time (sec)FixedOrderAgendaIL1102050100200400beam size051015202530#derivations (thousands)FixedOrderAgendaIL (scored)AgendaIL (popped)Figure8:ComparingAGENDAILandFIXEDORDERforvariousbeamsizes(gauche:accuracy,middle:parsingtimeattesttimeinseconds,droite:numberofthousandsofderivationsscoredandpopped).Thex-axisisonalogarithmicscale.andFIXEDORDERinTable3.Weseethatforα=1000,wegetafasteralgorithm,butaminordropinperformancecomparedtoFIXEDORDER.However,thisbaselinestillfeaturizes6xmorederivationsandis6xslowerthanAGENDAIL.Impactoflearning.TheAGENDAbaselineusesanagenda-basedparsertoapproximatethegradientsof(2).Thatis,weupdateparametersasinFIXE-DORDER,butsearchforKrootderivationsusingtheagenda-basedparser,describedinAlgorithm1(wherewepopthehighestscoringderivation).WeobservethatAGENDAfeaturizes3xmorederiva-tionscomparedtoAGENDAIL,andresultsina2.1dropinaccuracy.Thisdemonstratestheimportanceofexplicitlylearningtochoosecorrectactionsdur-ingintermediatestagesofparsing.Sinceonthedevelopmentset,FIXEDORDERoutperformedAGENDAILby1.1points,weim-plementedFIXED+AGENDA,whereafixed-orderparserisusedattrainingtime,butanagenda-basedparserisusedattesttime.Thisparserfeaturized3.5xmorederivationscomparedtoAGENDAIL,is3.5xslower,andhasslightlyloweraccuracy.RecallthatAGENDAILsamplesahistoryfromp+cwθ,thatis,usinglocalreweightingandhistorycompression.Table3showstheimpactofsamplingfromp+wθ(localreweighting),p+cθ(historycompres-sion),anddirectlyfrompθ,whichreducestopolicygradient.Weobservethatsamplingfrompθdirectlyaccordingtopolicygradientresultsinverylowac-curacy,asthisproducesderivationswithzerore-wardmostofthetime.Bothlocalreweightingandhistorycompressionaloneimproveaccuracy(localAcc.|Feat.|TimeTr.Dev.Tr.Dev.Tr.Dev.(cid:15)=10256.446.71,6502,1211,159345(cid:15)=10−159.547.02,0431,8901,425279(cid:15)=10−261.048.02,6001,9121,830214(cid:15)=10−361.448.53,0631,7402,110220NOSTREAM60.047.64,0494,9313,155293Table4:Accuracy,numberoffeaturizedderivations,andpars-ingtimeforboththetrainingsetanddevelopmentsetwhenvaryingthevalueofthetoleranceparameter(cid:15).reweightingismoreimportant),butbothperformworsethanAGENDAIL.Impactoflazyagenda.Wenowexaminethecon-tributionofthelazyagenda.Notethatthelazyagendaaffectstrainingtimemuchmorethantesttimefortworeasons:(un)attesttimeweonlyneedtopopthehighestscoringderivation,andtheoverheadofapriorityqueueonlygrowslogarithmicallywiththesizeoftheagenda.Duringtraining,weneedtakeafullpassovertheagendawhensampling,andthusthenumberofitemsontheagendaisimportant;(b)attesttimeweneverunrollderivationstreams,onlypopthehighestscoringderivation(seeSection7).Inbrief,usingthelazyagendaresultsina1.5xspeedupattrainingtime.Tounderstandthesavingsofthelazyagenda,wevarythevalueofthetoleranceparameter(cid:15).Quand(cid:15)isveryhigh,wewillneverunrollderivationstreams,becauseforallderivationstreamsU≤(cid:15)|G+|(Section7).Thiswillbefast,butsamplingcouldbeinaccurate.As(cid:15)decreases,weunrollmorederivations.WealsocomparedtotheNOSTREAMbaseline,wheretheagendaholdsderivationsratherthanderivationstreams.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

556

AgendaIL:FixedOrder:whatcurrencydoesjamaicaaccept?2912127102085020222490018642140400whatcurrencydoesjamaicaaccept?54865248,71459532111,099761331,519403360441931593558Figure9:Numberofderivationsineverychartcellfortheutterance“whatcurrencydoesjamaicaaccept?”.AGENDAILreducesthenumberofderivationsinchartcellscomparedtoFIXEDORDER.Table4showstheresultsoftheseexperiments.Naturally,thenumberoffeaturizedderivationsintrainingincreasesas(cid:15)decreases.Inparticular,NOSTREAMresultsina2.5xincreaseinnumberoffeaturizedderivationscomparedtonounrolling((cid:15)=102),and1.5xincreasecomparedto(cid:15)=10−2,whichisthechosenvalue.Similarly,averagetrain-ingtimeisabout1.5xslowerforNOSTREAMcom-paredto(cid:15)=10−2.Accuracydoesnotchangemuchforvariousval-uesof(cid:15).Evenwhen(cid:15)=102,accuracydecreasesbyonly1.8pointscomparedto(cid:15)=10−3.Unexpect-edly,NOSTREAMyieldsaslightdropinaccuracy.Featureablation.Table3showsanablationtestonthenewfeaturetemplateweintroducedthatconjoinsbinariesandlemmasduringbridging(-BINARYANDLEMMA).Removingthisfeaturetem-platesubstantiallyreducesaccuracycomparedtoAGENDAIL,highlightingtheimportanceoflearningnewlexicalassociationsduringtraining.Example.Asafinalexample,Figure9showstyp-icalparsechartsforAGENDAILandFIXEDORDER.AGENDAILgeneratesonly1,198derivations,whileFIXEDORDERconstructs15,543derivations,manyofwhichareunnecessary.Insummary,wedemonstratedthattraininganagenda-basedparsertochoosegoodparsingactionsthroughimitationlearningdramaticallyimprovesef-ficiencyandspeedattesttime,whilemaintainingcomparableaccuracy.10DiscussionandRelatedWorkLearning.Inthispaper,wesampledhistoriesfromadistributionthattriestotargetthereferencederivationd∗wheneverpossible.Workinimita-tionlearning(AbbeelandNg,2004;Daumeetal.,2009;Rossetal.,2011;GoldbergandNivre,2013)hasshownthatinterpolatingwiththemodel(corre-spondingtosmallerβ)canimprovegeneralization.Wewereunabletoimproveaccuracybyannealingβfrom1000to0,sounderstandingthisdynamicre-mainsanopenquestion.Parsing.Inthispaper,weavoidedcomputingKderivationsineachchartcellusinganagendaandlearningascoringfunctionforchoosingagendaitems.Acomplementaryandpurelyalgorithmicso-lutionislazyK-bestparsing(HuangandChiang,2005),orcubegrowing(HuangandChiang,2007),whichdonotinvolvelearningoranagenda.Simi-lartoourwork,cubegrowingapproximatesthebestderivationsineachchartcellinthecasewherefea-turesdonotdecomposeWorkinthepastattemptedtospeedupinferenceusingasimplemodelthatistrainedseparatelyandusedtoprunethehypothesesconsideredbythemainparsingmodel(Bodenstabetal.,2011;FitzGeraldetal.,2013).Weontheotherhandspeedupinferencebytrainingasinglemodelthatlearnstofollowgoodparsingactions.Workinagenda-basedsyntacticparsing(KleinandManning,2003;PaulsandKlein,2009)focusedonA*algorithmswhereeachderivationhasaprior-itybasedonthederivationscore(insidescore),andacompletionestimate(outsidescore).Goodesti-matesfortheoutsidescoreresultinadecreaseinthenumberofderivations.Currentlyactionsdependontheinsidescore,butwecouldaddfeaturesbasedonchartderivationstoprovide“outside”information.Addingsuchfeatureswouldpresentcomputationalchallengesasscoresontheagendawouldhavetobeupdatedastheagendaandchartaremodified.Semanticparsinghasbeengainingmomentuminrecentyears,butstilltherehasbeenrelativelylit-tleworkondevelopingfasteralgorithms,especiallycomparedtosyntacticparsing(Huang,2008;Kum-merfeldetal.,2010;RushandPetrov,2012;Lewis

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

557

andSteedman,2014).Whilewehaveobtainedsig-nificantspeedups,wehopetoencouragenewideasthatexploitthestructureofsemanticparsingtoyieldbetteralgorithms.Reproducibility.Allcode,8data,andexperimentsforthispaperareavailableontheCodaLabplatformathttps://www.codalab.org/worksheets/0x8fdfad310dd84b7baf683b520b4b64d5/.AcknowledgmentsWethanktheanonymousreviewersandtheactioneditor,JasonEisner,fortheirthoroughreviewsandconstructivefeedback.Wealsogratefullyacknowl-edgethesupportoftheDARPACommunicatingwithComputers(CwC)programunderAROprimecontractno.W911NF-15-1-0462.ReferencesP.AbbeelandA.Ng.2004.Apprenticeshiplearningviainversereinforcementlearning.InInternationalConferenceonMachineLearning(ICML).M.AuliandA.Lopez.2011.EfficientCCGparsing:A*versusadaptivesupertagging.InAssociationforComputationalLinguistics(ACL).J.Bao,N.Duan,M.Zhou,andT.Zhao.2014.Knowledge-basedquestionansweringasmachinetranslation.InAssociationforComputationalLinguis-tics(ACL).J.BerantandP.Liang.2014.Semanticparsingviapara-phrasing.InAssociationforComputationalLinguis-tics(ACL).J.Berant,A.Chou,R.Frostig,andP.Liang.2013.SemanticparsingonFreebasefromquestion-answerpairs.InEmpiricalMethodsinNaturalLanguagePro-cessing(EMNLP).N.Bodenstab,A.Dunlop,K.Hall,andB.Roark.2011.Beam-widthpredictionforefficientcontext-freepars-ing.InAssociationforComputationalLinguistics(ACL),pages440–449.A.Bordes,S.Chopra,andJ.Weston.2014.Questionansweringwithsubgraphembeddings.InEmpiricalMethodsinNaturalLanguageProcessing(EMNLP).Q.CaiandA.Yates.2013.Large-scalesemanticparsingviaschemamatchingandlexiconextension.InAsso-ciationforComputationalLinguistics(ACL).8OursystemusestheSEMPREtoolkit(http://nlp.stanford.edu/software/sempre).S.A.CaraballoandE.Charniak.1998.Newfiguresofmeritforbest-firstprobabilisticchartparsing.Compu-tationalLinguistics,24:275–298.K.Chang,A.Krishnamurthy,A.Agarwal,H.Daume,andJ.Langford.2015.Learningtosearchbetterthanyourteacher.arXiv.J.Clarke,D.Goldwasser,M.Chang,andD.Roth.2010.Drivingsemanticparsingfromtheworld’sre-sponse.InComputationalNaturalLanguageLearn-ing(CoNLL),pages18–27.H.Daume,J.Langford,andD.Marcu.2009.Search-basedstructuredprediction.MachineLearning,75:297–325.J.Duchi,E.Hazan,andY.Singer.2010.Adaptivesubgradientmethodsforonlinelearningandstochas-ticoptimization.InConferenceonLearningTheory(COLT).N.FitzGerald,Y.Artzi,andL.S.Zettlemoyer.2013.Learningdistributionsoverlogicalformsforrefer-ringexpressiongeneration.InEmpiricalMethodsinNaturalLanguageProcessing(EMNLP),pages1914–1925.Y.GoldbergandJ.Nivre.2013.Trainingdeterminis-ticparserswithnon-deterministicoracles.Transac-tionsoftheAssociationforComputationalLinguistics(TACL),1.Google.2013.Freebasedatadumps(2013-06-09).https://developers.google.com/freebase/data.L.HuangandD.Chiang.2005.Betterk-bestparsing.InProceedingsoftheNinthInternationalWorkshoponParsingTechnology,pages53–64.L.HuangandD.Chiang.2007.Forestrescoring:Fasterdecodingwithintegratedlanguagemodels.InAssoci-ationforComputationalLinguistics(ACL).L.Huang.2008.Forestreranking:Discriminativepars-ingwithnon-localfeatures.InAssociationforCom-putationalLinguistics(ACL).J.Jiang,A.Teichert,J.Eisner,andH.Daume.2012.Learnedprioritizationfortradingoffaccuracyandspeed.InAdvancesinNeuralInformationProcessingSystems(NIPS).M.Kay.1986.AlgorithmSchemataandDataStruc-turesinSyntacticProcessing.ReadingsinNaturalLanguageProcessing.D.KleinandC.Manning.2003.A*parsing:Fastex-actviterbiparseselection.InHumanLanguageTech-nologyandNorthAmericanAssociationforComputa-tionalLinguistics(HLT/NAACL).J.KrishnamurthyandT.Mitchell.2012.Weaklysuper-visedtrainingofsemanticparsers.InEmpiricalMeth-odsinNaturalLanguageProcessingandComputa-tionalNaturalLanguageLearning(EMNLP/CoNLL),pages754–765.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
5
7
1
5
6
6
8
0
0

/

/
t

je

un
c
_
un
_
0
0
1
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

558

J.Kummerfeld,J.Roesner,T.Dawborn,J.Haggerty,J.Curran,andS.Clark.2010.Fasterparsingbysupertaggeradaptation.InAssociationforComputa-tionalLinguistics(ACL).T.Kwiatkowski,L.Zettlemoyer,S.Goldwater,andM.Steedman.2010.InducingprobabilisticCCGgrammarsfromlogicalformwithhigher-orderunifi-cation.InEmpiricalMethodsinNaturalLanguageProcessing(EMNLP),pages1223–1233.T.Kwiatkowski,E.Choi,Y.Artzi,andL.Zettlemoyer.2013.Scalingsemanticparserswithon-the-flyontol-ogymatching.InEmpiricalMethodsinNaturalLan-guageProcessing(EMNLP).M.LewisandM.Steedman.2014.A*CCGparsingwithasupertag-factoredmodel.InEmpiricalMethodsinNaturalLanguageProcessing(EMNLP).P.Liang,M.I.Jordan,andD.Klein.2011.Learningdependency-basedcompositionalsemantics.InAs-sociationforComputationalLinguistics(ACL),pages590–599.P.Liang.2013.Lambdadependency-basedcomposi-tionalsemantics.arXiv.C.D.Manning,M.Surdeanu,J.Bauer,J.Finkel,S.J.Bethard,andD.McClosky.2014.TheStanfordCoreNLPnaturallanguageprocessingtoolkit.InACLsystemdemonstrations.C.Matuszek,N.FitzGerald,L.Zettlemoyer,L.Bo,andD.Fox.2012.Ajointmodeloflanguageandper-ceptionforgroundedattributelearning.InInter-nationalConferenceonMachineLearning(ICML),pages1671–1678.A.PaulsandD.Klein.2009.K-bestA*parsing.InAs-sociationforComputationalLinguistics(ACL),pages958–966.S.Ross,G.Gordon,andA.Bagnell.2011.Areductionofimitationlearningandstructuredpredictiontono-regretonlinelearning.InArtificialIntelligenceandStatistics(AISTATS).A.RushandS.Petrov.2012.Vinepruningforefficientmulti-passdependencyparsing.InHumanLanguageTechnologyandNorthAmericanAssociationforCom-putationalLinguistics(HLT/NAACL).R.Sutton,D.McAllester,S.Singh,andY.Mansour.1999.Policygradientmethodsforreinforcementlearningwithfunctionapproximation.InAdvancesinNeuralInformationProcessingSystems(NIPS).S.Tellex,T.Kollar,S.Dickerson,M.R.Walter,A.G.Banerjee,S.J.Teller,andN.Roy.2011.Understand-ingnaturallanguagecommandsforroboticnavigationandmobilemanipulation.InAssociationfortheAd-vancementofArtificialIntelligence(AAAI).Z.Wang,S.Yan,H.Wang,andX.Huang.2014.AnoverviewofMicrosoftdeepQAsystemonStan-fordWebQuestionsbenchmark.Technicalreport,Mi-crosoftResearch.Y.W.WongandR.J.Mooney.2007.Learningsyn-chronousgrammarsforsemanticparsingwithlambdacalculus.InAssociationforComputationalLinguis-tics(ACL),pages960–967.M.Yang,N.Duan,M.Zhou,andH.Rim.2014.Jointre-lationalembeddingsforknowledge-basedquestionan-swering.InEmpiricalMethodsinNaturalLanguageProcessing(EMNLP).X.YaoandB.Van-Durme.2014.Informationextractionoverstructureddata:QuestionansweringwithFree-base.InAssociationforComputationalLinguistics(ACL).W.Yih,M.Chang,X.He,andJ.Gao.2015.Semanticparsingviastagedquerygraphgeneration:Questionansweringwithknowledgebase.InAssociationforComputationalLinguistics(ACL).M.ZelleandR.J.Mooney.1996.Learningtoparsedatabasequeriesusinginductivelogicprogramming.InAssociationfortheAdvancementofArtificialIntel-ligence(AAAI),pages1050–1055.L.S.ZettlemoyerandM.Collins.2005.Learningtomapsentencestologicalform:Structuredclassifica-tionwithprobabilisticcategorialgrammars.InUncer-taintyinArtificialIntelligence(UAI),pages658–666.L.S.ZettlemoyerandM.Collins.2007.Onlinelearn-ingofrelaxedCCGgrammarsforparsingtologicalform.InEmpiricalMethodsinNaturalLanguagePro-cessingandComputationalNaturalLanguageLearn-ing(EMNLP/CoNLL),pages678–687.Y.Zhang,B.Ahn,S.Clark,C.V.Wyk,J.R.Curran,andL.Rimell.2010.Chartpruningforfastlexicalised-grammarparsing.InInternationalConferenceonComputationalLinguistics(COLING).
Télécharger le PDF