Transactions of the Association for Computational Linguistics, vol. 4, pp. 431–444, 2016. Action Editor: David Chiang.
Submission batch: 3/2016; Revision batch: 5/2016; Published 7/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Many Languages, One Parser

Waleed Ammar♦  George Mulcaire♥  Miguel Ballesteros♠♦  Chris Dyer♦  Noah A. Smith♥
♦School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
♥Computer Science & Engineering, University of Washington, Seattle, WA, USA
♠NLP Group, Pompeu Fabra University, Barcelona, Spain
wammar@cs.cmu.edu, gmulc@uw.edu, miguel.ballesteros@upf.edu, cdyer@cs.cmu.edu, nasmith@cs.washington.edu

Abstract

We train one multilingual model for dependency parsing and use it to parse sentences in several languages. The parsing model uses (i) multilingual word clusters and embeddings; (ii) token-level language information; and (iii) language-specific features (fine-grained POS tags). This input representation enables the parser not only to parse effectively in multiple languages, but also to generalize across languages based on linguistic universals and typological similarities, making it more effective to learn from limited annotations. Our parser's performance compares favorably to strong baselines in a range of data scenarios, including when the target language has a large treebank, a small treebank, or no treebank for training.

1 Introduction

Developing tools for processing many languages has long been an important goal in NLP (Rösner, 1988; Heid and Raab, 1989),[1] but it was only when statistical methods became standard that massively multilingual NLP became economical. The mainstream approach for multilingual NLP is to design language-specific models. For each language of interest, the resources necessary for training the model are obtained (or created), and separate parameters are fit for each language separately. This approach is simple and grants the flexibility of customizing the model and features to the needs of each language, but it is suboptimal for theoretical and practical reasons. Theoretically, the study of linguistic typology tells us that many languages share morphological, phonological, and syntactic phenomena (Bender, 2011); hence, the mainstream approach misses an opportunity to exploit relevant supervision from typologically related languages. Practically, it is inconvenient to deploy or distribute NLP tools that are customized for many different languages because, for each language of interest, we need to configure, train, tune, monitor, and occasionally update the model. Furthermore, code-switching or code-mixing (mixing more than one language in the same discourse), which is pervasive in some genres, in particular social media, presents a challenge for monolingually-trained NLP models (Barman et al., 2014).[2]

In parsing, the availability of homogeneous syntactic dependency annotations in many languages (McDonald et al., 2013; Nivre et al., 2015b; Agić et al., 2015; Nivre et al., 2015a) has created an opportunity to develop a parser that is capable of parsing sentences in multiple languages, addressing these theoretical and practical concerns.[3]

[1] As of 2007, the total number of native speakers of the hundred most popular languages only accounts for 85% of the world's population (Wikipedia, 2016).
[2] While our parser can be used to parse input with code-switching, we have not evaluated this capability due to the lack of appropriate data.
[3] Although multilingual dependency treebanks have been available for a decade via the 2006 and 2007 CoNLL shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007), the treebank of each language was annotated independently and with its own annotation conventions.


A multilingual parser can potentially replace an array of language-specific monolingually-trained parsers (for languages with a large treebank). The same approach has been used in low-resource scenarios (with no treebank or a small treebank in the target language), where indirect supervision from auxiliary languages improves the parsing quality (Cohen et al., 2011; McDonald et al., 2011; Zhang and Barzilay, 2015; Duong et al., 2015a; Duong et al., 2015b; Guo et al., 2016), but these models may sacrifice accuracy on source languages with a large treebank. In this paper, we describe a model that works well for both low-resource and high-resource scenarios.

We propose a parsing architecture that takes as input sentences in several languages,[4] optionally predicting the part-of-speech (POS) tags and input language. The parser is trained on the union of available universal dependency annotations in different languages. Our approach integrates and critically relies on several recent developments related to dependency parsing: universal POS tagsets (Petrov et al., 2012), cross-lingual word clusters (Täckström et al., 2012), selective sharing (Naseem et al., 2012), universal dependency annotations (McDonald et al., 2013; Nivre et al., 2015b; Agić et al., 2015; Nivre et al., 2015a), advances in neural network architectures (Chen and Manning, 2014; Dyer et al., 2015), and multilingual word embeddings (Gardner et al., 2015; Guo et al., 2016; Ammar et al., 2016). We show that our parser compares favorably to strong baselines trained on the same treebanks in three data scenarios: when the target language has a large treebank (Table 3), a small treebank (Table 7), or no treebank (Table 8). Our parser is publicly available.[5]

2 Overview

Our goal is to train a dependency parser for a set of target languages Lt, given universal dependency annotations in a set of source languages Ls. Ideally, we would like to have training data in all target languages (i.e., Lt ⊆ Ls), but we are also interested in the case where the sets of source and target languages are disjoint (i.e., Lt ∩ Ls = ∅). When all languages in Lt have a large treebank, the mainstream approach has been to train one monolingual parser per target language and route sentences of a given language to the corresponding parser at test time. In contrast, our approach is to train one parsing model with the union of treebanks in Ls, then use this single trained model to parse text in any language in Lt, hence the name "Many Languages, One Parser" (MALOPA). MALOPA strikes a balance between: (1) enabling cross-lingual model transfer via language-invariant input representations, i.e., coarse POS tags, multilingual word embeddings and multilingual word clusters, and (2) tweaking the behavior of the parser depending on the current input language via language-specific representations, i.e., fine-grained POS tags and language embeddings.

In addition to universal dependency annotations in source languages (see Table 1), we use the following data resources for each language in L = Lt ∪ Ls:
• universal POS annotations for training a POS tagger,[6]
• a bilingual dictionary with another language in L for adding cross-lingual lexical information,[7]
• language typology information,[8]
• language-specific POS annotations,[9] and
• a monolingual corpus.[10]

Novel contributions of this paper include: (i) using one parser instead of an array of monolingually-trained parsers without sacrificing accuracy on languages with a large treebank, (ii) an effective neural network architecture for using language embeddings to improve multilingual parsing, and (iii) a study of how automatic language identification affects the performance of a multilingual dependency parser.

[4] We discuss data requirements in the next section.
[5] https://github.com/clab/language-universal-parser
[6] See §3.6 for details.
[7] Our best results make use of this resource. We require that all languages in L are (transitively) connected. The bilingual dictionaries we used are based on unsupervised word alignments of parallel corpora, as described in Guo et al. (2016). See §3.3 for details.
[8] See §3.4 for details.
[9] Our best results make use of this resource. See §3.5 for details.
[10] This is only used for training word embeddings with 'multiCCA,' 'multiCluster' and 'translation-invariance' methods in Table 6. We do not use this resource when we compare to previous work.


While not the primary focus of this paper, we also show that a variant of our parser outperforms previous work on multi-source cross-lingual parsing in low-resource scenarios, where languages in Lt have a small treebank (see Table 7) or where Lt ∩ Ls = ∅ (see Table 8). In the small treebank setup with 3,000 token annotations, we show that our parser consistently outperforms a strong monolingual baseline with 5.7 absolute LAS (labeled attachment score) points per language, on average.

         German (de)     English (en)    Spanish (es)    French (fr)     Italian (it)    Portuguese (pt)  Swedish (sv)
UDT 2.0
 train   14118 (264906)  39832 (950028)  14138 (375180)  14511 (351233)  6389 (149145)   9600 (239012)    4447 (66631)
 dev.    801 (12215)     1703 (40117)    1579 (40950)    1620 (38328)    399 (9541)      1211 (29873)     493 (9312)
 test    1001 (16339)    2416 (56684)    300 (8295)      300 (6950)      400 (9187)      1205 (29438)     1219 (20376)
UD 1.2
 train   14118 (269626)  12543 (204586)  14187 (382436)  14552 (355811)  11699 (249307)  8800 (201845)    4303 (66645)
 dev.    799 (12512)     2002 (25148)    1552 (41975)    1596 (39869)    489 (11656)     271 (4833)       504 (9797)
 test    977 (16537)     2077 (25096)    274 (8128)      298 (7210)      489 (11719)     288 (5867)       1219 (20377)
tags     –               50              –               –               36              866              134

Table 1: Number of sentences (tokens) in each treebank split in Universal Dependency Treebanks (UDT) version 2.0 and Universal Dependencies (UD) version 1.2 for the languages we experiment with. The last row gives the number of unique language-specific fine-grained POS tags used in a treebank.

3 Parsing Model

Recent advances suggest that recurrent neural networks, especially long short-term memory (LSTM) architectures, are capable of learning useful representations for modeling problems of sequential nature (Graves et al., 2013; Sutskever et al., 2014). In this section, we describe our language-universal parser, which extends the stack LSTM (S-LSTM) parser of Dyer et al. (2015).

3.1 Transition-based Parsing with S-LSTMs

This section briefly reviews Dyer et al.'s S-LSTM parser, which we modify in the following sections. The core parser can be understood as the sequential manipulation of three data structures:
• a buffer (from which we read the token sequence),
• a stack (which contains partially-built parse trees), and
• a list of actions previously taken by the parser.

The parser uses the arc-standard transition system (Nivre, 2004).[11] At each time step t, a transition action is applied that alters these data structures according to Table 2.

Along with the discrete transitions of the arc-standard system, the parser computes vector representations for the buffer, stack and list of actions at time step t, denoted $b_t$, $s_t$, and $a_t$, respectively.[12] The parser state at time t is given by:

$p_t = \max\{0, W[s_t; b_t; a_t] + W_{\text{bias}}\}$    (1)

where the matrix $W$ and the vector $W_{\text{bias}}$ are learned parameters. The matrix $W$ is multiplied by the vector $[s_t; b_t; a_t]$ created by the concatenation of $s_t$, $b_t$, $a_t$. The parser state $p_t$ is then used to define a categorical distribution over possible next actions $z$:[13]

$p(z \mid p_t) = \frac{\exp(g_z^\top p_t + q_z)}{\sum_{z'} \exp(g_{z'}^\top p_t + q_{z'})}$    (2)

where $g_z$ and $q_z$ are parameters associated with action $z$. The selected action is then used to update the buffer, stack and list of actions, and to compute $b_{t+1}$, $s_{t+1}$ and $a_{t+1}$ accordingly. The model is trained to maximize the log-likelihood of correct actions. At test time, the parser greedily chooses the most probable action in every time step until a complete parse tree is produced. The following sections describe our extensions of the core parser. More details about the core parser can be found in Dyer et al. (2015).

[11] In a preprocessing step, we transform nonprojective trees in the training treebanks to pseudo-projective trees using the "baseline" scheme in (Nivre and Nilsson, 2005). We evaluate against the original nonprojective test set.
[12] A stack-LSTM module is used to compute the vector representation for each data structure, as detailed in Dyer et al. (2015).
[13] The total number of actions is 1 + 2 × the number of unique dependency labels in the treebank used for training, but we only consider actions which meet the arc-standard preconditions in Fig. 2.
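To make the transition system concrete, here is a minimal Python sketch of the arc-standard manipulations summarized in Table 2 (shown below); it is our own illustrative reimplementation, not the released MALOPA code, and `choose_action` stands in for the learned action distribution of Eq. 2.

```python
def shift(stack, buffer):
    # Move the front of the buffer onto the stack (top of stack = end of list).
    stack.append(buffer.pop(0))

def reduce_right(stack, arcs, label):
    # Stack (top first) u, v, S  ->  u, S, adding the arc u -label-> v
    # (the top item u becomes the head of the second item v).
    u, v = stack.pop(), stack.pop()
    arcs.append((u, label, v))
    stack.append(u)

def reduce_left(stack, arcs, label):
    # Stack (top first) u, v, S  ->  v, S, adding the arc u <-label- v
    # (the second item v becomes the head of the top item u).
    u, v = stack.pop(), stack.pop()
    arcs.append((v, label, u))
    stack.append(v)

def parse(tokens, choose_action):
    """Greedy parsing loop. `choose_action` must return ('SHIFT', None),
    ('REDUCE-RIGHT', label) or ('REDUCE-LEFT', label), and is assumed
    to respect the arc-standard preconditions (see footnote [13])."""
    stack, buffer, arcs = [], list(tokens), []
    while buffer or len(stack) > 1:
        action, label = choose_action(stack, buffer)
        if action == 'SHIFT':
            shift(stack, buffer)
        elif action == 'REDUCE-RIGHT':
            reduce_right(stack, arcs, label)
        else:
            reduce_left(stack, arcs, label)
    return arcs  # list of (head, label, dependent) triples
```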


Stack_t    Buffer_t   Action            Dependency    Stack_{t+1}   Buffer_{t+1}
u, v, S    B          REDUCE-RIGHT(r)   u →(r) v      u, S          B
u, v, S    B          REDUCE-LEFT(r)    u ←(r) v      v, S          B
S          u, B       SHIFT             —             u, S          B

Table 2: Parser transitions indicating the action applied to the stack and buffer at time t and the resulting stack and buffer at time t+1.

3.2 Token Representations

The vector representations of input tokens feed into the stack-LSTM modules of the buffer and the stack. For monolingual parsing, we represent each token by concatenating the following vectors:
• a fixed, pretrained embedding of the word type,
• a learned embedding of the word type,
• a learned embedding of the Brown cluster,
• a learned embedding of the fine-grained POS tag,
• a learned embedding of the coarse POS tag.

For multilingual parsing with MALOPA, we start with a simple delexicalized model where the token representation only consists of learned embeddings of coarse POS tags, which are shared across all languages to enable model transfer. In the following subsections, we enhance the token representation in MALOPA to include lexical embeddings, language embeddings, and fine-grained POS embeddings.

3.3 Lexical Embeddings

Previous work has shown that sacrificing lexical features amounts to a substantial decrease in the performance of a dependency parser (Cohen et al., 2011; Täckström et al., 2012; Tiedemann, 2015; Guo et al., 2015). Therefore, we extend the token representation in MALOPA by concatenating learned embeddings of multilingual word clusters, and pretrained multilingual embeddings of word types.

Multilingual Brown clusters. Before training the parser, we estimate Brown clusters of English words and project them via word alignments to words in other languages. This is similar to the 'projected clusters' method in Täckström et al. (2012). To go from Brown clusters to embeddings, we ignore the hierarchy within Brown clusters and assign a unique parameter vector to each cluster.

Multilingual word embeddings. We also use Guo et al.'s (2016) 'robust projection' method to pretrain multilingual word embeddings. The first step in 'robust projection' is to learn embeddings for English words using the skip-gram model (Mikolov et al., 2013). Then, we compute an embedding of non-English words as the weighted average of English word embeddings, using word alignment probabilities as weights. The last step computes an embedding of non-English words which are not aligned to any English words by averaging the embeddings of all words within an edit distance of 1 in the same language; a sketch follows below. We experiment with two other methods, 'multiCCA' and 'multiCluster,' both proposed by Ammar et al. (2016), for pretraining multilingual word embeddings in §4.1. 'MultiCCA' uses a linear operator to project pretrained monolingual embeddings in each language (except English) to the vector space of pretrained English word embeddings, while 'multiCluster' uses the same embedding for translationally-equivalent words in different languages. The results in Table 6 illustrate that the three methods perform similarly on this task.

3.4 Language Embeddings

While many languages, especially ones that belong to the same family, exhibit some similar syntactic phenomena (e.g., all languages have subjects, verbs, and objects), substantial syntactic differences abound. Some of these differences are easy to characterize (e.g., subject-verb-object vs. verb-subject-object, prepositions vs. postpositions, adjective-noun vs. noun-adjective), while others are subtle (e.g., number and positions of negation morphemes). It is not at all clear how to translate descriptive facts about a language's syntax into features for a parser. Consequently, training a language-universal parser on treebanks in multiple source languages requires caution.
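The following is a minimal sketch of the 'robust projection' idea described above, under our own simplifying assumptions; the function names and data formats (dicts mapping words to vectors and to alignment probabilities) are ours, not those of Guo et al.'s implementation.

```python
import numpy as np

def edit_distance_one(a, b):
    # True iff a and b differ by exactly one substitution, insertion, or deletion.
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def robust_projection(en_vecs, align_probs, vocab):
    """en_vecs: English word -> np.array embedding (skip-gram pretrained).
    align_probs: foreign word -> list of (english_word, probability).
    Returns embeddings for the foreign vocabulary `vocab`."""
    vecs = {}
    # Step 1: alignment-weighted average of English word embeddings.
    for w in vocab:
        pairs = [(e, p) for e, p in align_probs.get(w, []) if e in en_vecs]
        if pairs:
            total = sum(p for _, p in pairs)
            vecs[w] = sum(p * en_vecs[e] for e, p in pairs) / total
    # Step 2: unaligned words average the embeddings of in-language words
    # at edit distance 1 (e.g., 'playz' from 'plays' and 'play').
    projected = dict(vecs)
    for w in vocab:
        if w not in vecs:
            neighbors = [v for u, v in projected.items() if edit_distance_one(w, u)]
            if neighbors:
                vecs[w] = np.mean(neighbors, axis=0)
    return vecs
```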


While exposing the parser to a diverse set of syntactic patterns across many languages has the potential to improve its performance in each, dependency annotations in one language will, in some ways, contradict those in typologically different languages. For instance, consider a context where the next word on the buffer is a noun, and the top word on the stack is an adjective, followed by a noun. Treebanks of languages where postpositive adjectives are typical (e.g., French) will often teach the parser to predict REDUCE-LEFT, while those of languages where prepositive adjectives are more typical (e.g., English) will teach the parser to predict SHIFT.

Inspired by Naseem et al. (2012), we address this problem by informing the parser about the input language it is currently parsing. Let l be the input vector representation of a particular language. We consider three definitions for l:[14]
• one-hot encoding of the language ID,
• one-hot encoding of individual word-order properties,[15] and
• averaged one-hot encoding of WALS typological properties (including word-order properties).[16]

It is worth noting that the first definition (language ID) turns out to work best in our experiments. We use a hidden layer with tanh nonlinearity to compute the language embedding l′ as:

$l' = \tanh(L\,l + L_{\text{bias}})$

where the matrix $L$ and the vector $L_{\text{bias}}$ are additional model parameters. We modify the parsing architecture as follows:
• include l′ in the token representation (which feeds into the stack-LSTM modules of the buffer and the stack as described in §3.1),
• include l′ in the action vector representation (which feeds into the stack-LSTM module that represents previous actions as described in §3.1), and
• redefine the parser state at time t as $p_t = \max\{0, W[s_t; b_t; a_t; l'] + W_{\text{bias}}\}$.

Intuitively, the first two modifications allow the input language to influence the vector representation of the stack, the buffer and the list of actions. The third modification allows the input language to influence the parser state, which in turn is used to predict the next action. In preliminary experiments, we found that adding the language embeddings at the token and action level is important. We also experimented with computing more complex functions of $(s_t, b_t, a_t, l')$ to define the parser state, but they did not help.

3.5 Fine-grained POS Tag Embeddings

Tiedemann (2015) shows that omitting fine-grained POS tags significantly hurts the performance of a dependency parser. However, those fine-grained POS tagsets are defined monolingually and are only available for a subset of the languages with universal dependency treebanks.

We extend the token representation to include a fine-grained POS embedding (in addition to the coarse POS embedding). We stochastically drop out the fine-grained POS embedding for each token with 50% probability (Srivastava et al., 2014) so that the parser can make use of fine-grained POS tags when available but stay reliable when the fine-grained POS tags are missing. A sketch of the resulting token representation follows below.

[14] The files which contain these definitions are available at https://github.com/clab/language-universal-parser/tree/master/typological_properties.
[15] The World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) is an online portal documenting typological properties of 2,679 languages (as of July 2015). We use the same set of WALS features used by Zhang and Barzilay (2015), namely 82A (order of subject and verb), 83A (order of object and verb), 85A (order of adposition and noun phrase), 86A (order of genitive and noun), and 87A (order of adjective and noun).
[16] Some WALS features are not annotated for all languages. Therefore, we use the average value of all languages in the same genus. We rescale all values to be in the range [−1, 1].
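As a concrete illustration of §3.2-§3.5, here is a sketch of how the multilingual token representation and the modified parser state could be assembled; the dimensions, names, and NumPy formulation are our own assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def language_embedding(l, L, L_bias):
    # l' = tanh(L l + L_bias): l is the language's input vector (e.g., a
    # one-hot language ID or averaged WALS features); L, L_bias are learned.
    return np.tanh(L @ l + L_bias)

def token_representation(word_pretrained, cluster, coarse_pos, fine_pos,
                         l_prime, train=True):
    # Concatenate lexical, POS, and language embeddings. The fine-grained
    # POS embedding is zeroed with 50% probability during training so the
    # parser stays reliable when fine-grained tags are missing (§3.5).
    if train and rng.random() < 0.5:
        fine_pos = np.zeros_like(fine_pos)
    return np.concatenate([word_pretrained, cluster, coarse_pos,
                           fine_pos, l_prime])

def parser_state(s_t, b_t, a_t, l_prime, W, W_bias):
    # Modified Eq. 1: p_t = max{0, W [s_t; b_t; a_t; l'] + W_bias}.
    return np.maximum(0.0, W @ np.concatenate([s_t, b_t, a_t, l_prime]) + W_bias)
```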


3.6 Predicting POS Tags

The model discussed thus far conditions on the POS tags of words in the input sentence. However, gold POS tags may not be available in real applications (e.g., parsing the web). Here, we describe two modifications to (i) model both POS tagging and dependency parsing, and (ii) increase the robustness of the parser to incorrect POS predictions.

Tagging model. Let $x_1, \ldots, x_n$, $y_1, \ldots, y_n$, $z_1, \ldots, z_{2n}$ be the sequence of words, POS tags, and parsing actions, respectively, for a sentence of length n. We define the joint distribution of a POS tag sequence and parsing actions given a sequence of words as follows:

$P(y_1, \ldots, y_n, z_1, \ldots, z_{2n} \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} p(y_i \mid x_1, \ldots, x_n) \times \prod_{j=1}^{2n} p(z_j \mid x_1, \ldots, x_n, y_1, \ldots, y_n, z_1, \ldots, z_{j-1})$

where $p(z_j \mid \ldots)$ is defined in Eq. 2, and $p(y_i \mid x_1, \ldots, x_n)$ uses a bidirectional LSTM (Graves et al., 2013). Huang et al. (2015) show that the performance of a bidirectional LSTM POS tagger is on par with a conditional random field tagger.

We use slightly different token representations for tagging and parsing in the same model. For tagging, we construct the token representation by concatenating the embeddings of the word type (pretrained), the Brown cluster and the input language. This token representation feeds into the bidirectional LSTM, followed by a softmax layer (at each position) which defines a categorical distribution over possible POS tags. For parsing, we construct the token representation by further concatenating the embeddings of predicted POS tags. This token representation feeds into the stack-LSTM modules of the buffer and stack components of the transition-based parser. This multi-task learning setup enables us to predict both POS tags and dependency trees in the same model. We note that pretrained word embeddings, cluster embeddings and language embeddings are shared for tagging and parsing.

Block dropout. We use an independently developed variant of word dropout (Iyyer et al., 2015), which we call block dropout (sketched below). The token representation used for parsing includes the embedding of predicted POS tags, which may be incorrect. We introduce another modification which makes the parser more robust to incorrect POS tag predictions, by stochastically zeroing out the entire embedding of the POS tag. While training the parser, we replace the POS embedding vector e with another vector (of the same dimensionality) stochastically computed as $e' = (1 - b)/\mu \times e$, where $b \in \{0, 1\}$ is a Bernoulli-distributed random variable with parameter µ which is initialized to 1.0 (i.e., always drop out, setting b = 1, e′ = 0), and is dynamically updated to match the error rate of the POS tagger on the development set. At test time, we never drop out the predicted POS embedding, i.e., e′ = e. Intuitively, this method extends the dropout method (Srivastava et al., 2014) to address structured noise in the input layer.
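Here is a minimal sketch of block dropout as described above; it follows the paper's equation $e' = (1-b)/\mu \times e$ literally, while the function signature and the way µ is supplied are our own assumptions (in the paper, µ is updated dynamically to track the tagger's error rate on the development set).

```python
import numpy as np

rng = np.random.default_rng(0)

def block_dropout(e, mu, train=True):
    # e' = (1 - b)/mu * e with b ~ Bernoulli(mu): with probability mu the
    # entire predicted-POS embedding is zeroed; otherwise it is rescaled.
    # At test time the embedding is passed through unchanged (e' = e).
    if not train or mu <= 0.0:
        return e
    b = 1 if rng.random() < mu else 0
    return np.zeros_like(e) if b else e / mu
```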


4 Experiments

In this section, we evaluate the MALOPA approach in three data scenarios: when the target language has a large treebank (Table 3), a small treebank (Table 7) or no treebank (Table 8).

Data. For experiments where the target language has a large treebank, we use the standard data splits for German (de), English (en), Spanish (es), French (fr), Italian (it), Portuguese (pt) and Swedish (sv) in the latest release (version 1.2) of Universal Dependencies (Nivre et al., 2015a), and experiment with both gold and predicted POS tags. For experiments where the target language has no treebank, we use the standard splits for these languages in the older universal dependency treebanks v2.0 (McDonald et al., 2013) and use gold POS tags, following the baselines (Zhang and Barzilay, 2015; Guo et al., 2016). Table 1 gives the number of sentences and words annotated for each language in both versions. In a preprocessing step, we lowercase all tokens and remove multi-word annotations and language-specific dependency relations. We use the same multilingual Brown clusters and multilingual embeddings of Guo et al. (2016), kindly provided by the authors.

Optimization. We follow Dyer et al. (2015) in parameter initialization and optimization.[17] However, when training the parser on multiple languages in MALOPA, instead of updating the parameters with the gradient of individual sentences, we use mini-batch updates which include one sentence sampled uniformly (without replacement) from each language's treebank, until all sentences in the smallest treebank are used (which concludes an epoch); a sketch of this sampling scheme appears below. We repeat the same process in following epochs. We found this to help prevent one source language with a larger treebank (e.g., German) from dominating parameter updates at the expense of other source languages with a smaller treebank (e.g., Swedish).

4.1 Target Languages with a Treebank (Lt = Ls)

Here, we evaluate our MALOPA parser when the target language has a treebank.

Baseline. For each target language, the strong baseline we use is a monolingually-trained S-LSTM parser with a token representation which concatenates: pretrained word embeddings (50 dimensions),[18] learned word embeddings (50 dimensions), coarse (universal) POS tag embeddings (12 dimensions), fine-grained (language-specific, when available) POS tag embeddings (12 dimensions), and embeddings of Brown clusters (12 dimensions), and uses a two-layer S-LSTM for each of the stack, the buffer and the list of actions. We independently train one baseline parser for each target language, and share no model parameters. This baseline, denoted 'monolingual' in Tables 3 and 7, achieves a UAS score of 93.0 and a LAS score of 91.5 when trained on the English Penn Treebank, which is comparable to Dyer et al. (2015).

MALOPA. We train MALOPA on the concatenation of the training sections of all seven languages. To balance the development set, we only concatenate the first 300 sentences of each language's development section.

Token representations. The first MALOPA parser we evaluate uses only coarse POS embeddings to construct the token representation.[19] As shown in Table 3, this parser consistently underperforms the monolingual baselines, with a gap of 12.5 LAS points on average.

Augmenting the token representation with lexical embeddings (both multilingual word clusters and pretrained multilingual word embeddings, as described in §3.3) substantially improves the performance of MALOPA, recovering 83% of the gap in average performance.

We experimented with three ways to include language information in the token representation, namely 'language ID', 'word order' and 'full typology' (see §3.4 for details), and found all three to improve the performance of MALOPA, giving LAS scores of 83.5, 83.2 and 82.5, respectively.

LAS                 de    en    es    fr    it    pt    sv    average
monolingual         79.3  85.9  83.7  81.7  88.7  85.7  83.5  84.0
MALOPA              70.4  69.3  72.4  71.1  78.0  74.1  65.4  71.5
+ lexical           76.7  82.0  82.7  81.2  87.6  82.1  81.2  81.9
+ language ID       78.6  84.2  83.4  82.4  89.1  84.2  82.6  83.5
+ fine-grained POS  78.9  85.4  84.3  82.4  89.0  86.2  84.5  84.3

Table 3: Dependency parsing: labeled attachment scores (LAS) for monolingually-trained parsers and MALOPA in the fully supervised scenario where Lt = Ls. Note that we use Universal Dependencies version 1.2, which only includes annotations for ∼13,000 English sentences, which explains the relatively low scores in English. When we instead use the universal dependency treebanks version 2.0, which includes annotations for ∼40,000 English sentences (originally from the English Penn Treebank), we achieve a UAS score of 93.0 and a LAS score of 91.5.

[17] We use stochastic gradient updates with an initial learning rate of $\eta_0 = 0.1$ in epoch #0, and update the learning rate in following epochs as $\eta_t = \eta_0/(1 + 0.1t)$. We clip the $\ell_2$ norm of the gradient to avoid "exploding" gradients. Unlabeled attachment score (UAS) on the development set determines early stopping. Parameters are initialized with uniform samples in $\pm\sqrt{6/(r+c)}$ where r and c are the sizes of the previous and following layer in the neural network (Glorot and Bengio, 2010). The standard deviations of the labeled attachment score (LAS) due to random initialization in individual target languages are 0.36 (de), 0.40 (en), 0.37 (es), 0.46 (fr), 0.47 (it), 0.41 (pt) and 0.24 (sv). The standard deviation of the average LAS scores across languages is 0.17.
[18] These embeddings are treated as fixed inputs to the parser, and are not optimized towards the parsing objective. We use the same embeddings used in Guo et al. (2016).
[19] We use the same number of dimensions for the coarse POS embeddings as in the monolingual baselines. The same applies to all other types of embeddings used in MALOPA.
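The Optimization paragraph above motivates balanced sampling across treebanks; a minimal sketch of that scheme follows (the generator interface and the treebank format, a dict from language code to a list of sentences, are our own assumptions).

```python
import random

def multilingual_batches(treebanks, seed=0):
    """Yield mini-batches containing one sentence sampled uniformly,
    without replacement, from each language's treebank; an epoch ends
    when the smallest treebank is exhausted, then sampling restarts."""
    rng = random.Random(seed)
    while True:  # each iteration of this loop is one training epoch
        shuffled = {lang: rng.sample(sents, len(sents))
                    for lang, sents in treebanks.items()}
        epoch_len = min(len(s) for s in shuffled.values())
        for i in range(epoch_len):
            yield [shuffled[lang][i] for lang in shuffled]
```

Because every batch holds exactly one sentence per source language, a large treebank (e.g., German) cannot dominate the parameter updates.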


Recall %            left  right  root  short  long  nsubj*  dobj  conj  *comp  case  *mod
monolingual         89.9  95.2   86.4  92.9   81.1  77.3    75.5  66.0  45.6   93.3  77.0
MALOPA              85.4  93.3   80.2  91.2   73.3  57.3    62.7  64.2  34.0   90.7  69.6
+ lexical           89.9  93.8   84.5  92.6   78.6  73.3    73.4  66.9  35.3   91.6  75.3
+ language ID       89.1  94.7   86.6  93.2   81.4  74.7    73.0  71.2  48.2   92.8  76.3
+ fine-grained POS  89.5  95.7   87.8  93.6   82.0  74.7    74.9  69.7  46.0   93.3  76.3

Table 4: Recall of some classes of dependency attachments/relations in German.

language ID  coarse POS  de    en    es    fr    it    pt    sv    average
gold         gold        78.6  84.2  83.4  82.4  89.1  84.2  82.6  83.5
predicted    gold        78.5  80.2  83.4  82.1  88.9  83.9  82.5  82.7
gold         predicted   71.2  79.9  80.5  78.5  85.0  78.4  75.5  78.4
predicted    predicted   70.8  74.1  80.5  78.2  84.7  77.1  75.5  77.2

Table 5: Effect of automatically predicting language ID and POS tags with MALOPA on LAS scores.

It is noteworthy that the model benefits more from language ID than from typological properties. Using 'language ID,' we recover another 12% of the original gap.

Finally, the best configuration of MALOPA adds fine-grained POS embeddings to the token representation.[20] Surprisingly, adding fine-grained POS embeddings improves the performance even for some languages where fine-grained POS tags are not available (e.g., Spanish). This parser outperforms the monolingual baseline in five out of seven target languages, and wins on average by 0.3 LAS points. We emphasize that this model is only trained once on all languages, and the same model is used to parse the test set of each language, which simplifies the distribution or deployment of multilingual parsing software.

Qualitative analysis. To gain a better understanding of the model behavior, we analyze certain classes of dependency attachments/relations in German, which has notably flexible word order, in Table 4. We consider the recall of left attachments (where the head word precedes the dependent word in the sentence), right attachments, root attachments, short attachments (with distance = 1), long attachments (with distance > 6), as well as the following relation groups: nsubj* (nominal subjects: nsubj, nsubjpass), dobj (direct object: dobj), conj (conjunct: conj), *comp (clausal complements: ccomp, xcomp), case (clitics and adpositions: case), *mod (modifiers of a noun: nmod, nummod, amod, appos), neg (negation modifier: neg).[21]

Findings. We found that each of the three improvements (lexical embeddings, language embeddings and fine-grained POS embeddings) tends to improve recall for most classes. MALOPA underperforms (compared to the monolingual baseline) in some classes: nominal subjects, direct objects and modifiers of a noun. Nevertheless, MALOPA outperforms the baseline in some important classes such as root, long attachments and conjunctions.

Predicting language IDs and POS tags. In Table 3, we assume that both the gold language ID of the input language and gold POS tags are given at test time. However, this assumption is not realistic in practical applications. Here, we quantify the degradation in parsing accuracy when language ID and POS tags are only given at training time, but must be predicted at test time. We do not use fine-grained POS tags in these experiments because some languages use a very large fine-grained POS tagset (e.g., 866 unique tags in Portuguese).

[20] Fine-grained POS tags were only available for English, Italian, Portuguese and Swedish. Other languages reuse the coarse POS tags as fine-grained tags instead of padding the extra dimensions in the token representation with zeros.
[21] For each group, we report recall of both the attachment and relation weighted by the number of instances in the gold annotation. A detailed description of each relation can be found at http://universaldependencies.org/u/dep/index.html


In order to predict language ID, we use the langid.py library (Lui and Baldwin, 2012)[22] and classify individual sentences in the test sets to one of the seven languages of interest, using the default models included in the library. The macro average language ID prediction accuracy on the test set across sentences is 94.7%. In order to predict POS tags, we use the model described in §3.6 with both input and hidden LSTM dimensions of 60, and with block dropout. The macro average accuracy of the POS tagger is 93.3%. Table 5 summarizes the four configurations: {gold language ID, predicted language ID} × {gold POS tags, predicted POS tags}. The performance of the parser suffers mildly (–0.8 LAS points) when using predicted language IDs, but more (–5.1 LAS points) when using predicted POS tags.

As an alternative approach to predicting POS tags, we trained the Stanford POS tagger, for each target language, on the coarse POS tag annotations in the training section of the universal dependency treebanks,[23] then replaced the gold POS tags in the test set of each language with predictions of the monolingual tagger. The resulting degradation in parsing performance between gold vs. predicted POS tags is –6.0 LAS points (on average, compared to a degradation of –5.1 LAS points in Table 5). The disparity in parsing results with gold vs. predicted POS tags is an important open problem, and has been previously discussed by Tiedemann (2015).

The predicted POS results in Table 5 use block dropout. Without using block dropout, we lose an extra 0.2 LAS points in both configurations using predicted POS tags.

Different multilingual embeddings. Several methods have been proposed for pretraining multilingual word embeddings. We compare three of them:
• multiCCA (Ammar et al., 2016) uses a linear operator to project pretrained monolingual embeddings in each language (except English) to the vector space of pretrained English word embeddings.
• multiCluster (Ammar et al., 2016) uses the same embedding for translationally-equivalent words in different languages.
• robust projection (Guo et al., 2015) first pretrains monolingual English word embeddings, then defines the embedding of a non-English word as the weighted average embedding of English words aligned to the non-English word (in a parallel corpus). The embedding of a non-English word which is not aligned to any English words is defined as the average embedding of words with a unit edit distance in the same language (e.g., 'playz' is the average of 'plays' and 'play').[24]

All embeddings are trained on the same data and use the same number of dimensions (100).[25] Table 6 illustrates that the three methods perform similarly on this task. Aside from Table 6, in this paper, we exclusively use the robust projection multilingual embeddings trained in Guo et al. (2016).[26] The 'robust projection' result in Table 6 (which uses 100 dimensions) is comparable to the last row in Table 3 (which uses 50 dimensions).

multilingual embeddings  UAS   LAS
multiCluster             87.7  84.1
multiCCA                 87.8  84.4
robust projection        87.8  84.2

Table 6: Effect of multilingual embedding estimation method on multilingual parsing with MALOPA. UAS and LAS scores are macro-averaged across seven target languages.

[22] https://github.com/saffsd/langid.py
[23] We used version 3.6.0 of the Stanford POS tagger, with the following pre-packaged configuration files: german-fast-caseless.tagger.props (de), english-caseless-left3words-distsim.tagger.props (en), spanish.tagger.props (es), french.tagger.props (fr). We reused french.tagger.props for (it, pt, sv).
[24] Our implementation of this method can be found at https://github.com/gmulcaire/average-embeddings.
[25] We share the embedding files at https://github.com/clab/language-universal-parser/tree/master/pretrained_embeddings.
[26] The embeddings were kindly provided by the authors of Guo et al. (2016) at https://drive.google.com/file/d/0B1z04ix6jD_DY3lMN2Ntdy02NFU/view
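For reference, the language-ID step described above can be reproduced with langid.py's stock interface; restricting the classifier to the seven languages of interest mirrors the paper's setup, though the exact call sequence here is our own illustration.

```python
import langid  # pip install langid

# Use the library's default models, constrained to the seven languages
# of interest, as in the paper's setup.
langid.set_languages(['de', 'en', 'es', 'fr', 'it', 'pt', 'sv'])

lang, score = langid.classify("Der Parser funktioniert für viele Sprachen.")
print(lang, score)  # e.g., 'de' with a confidence score
```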


LAS           de    es    fr    it    sv
monolingual   58.0  64.7  63.0  68.7  57.6
Duong et al.  61.8  70.5  67.2  71.3  62.5
MALOPA        63.4  70.5  69.1  74.1  63.4

Table 7: Small (3,000 token) target treebank setting: language-universal dependency parser performance.

Small target treebank. Duong et al. (2015b) considered a setup where the target language has a small treebank of ∼3,000 tokens, and the source language (English) has a large treebank of ∼205,000 tokens. The parser proposed in Duong et al. (2015b) is a neural network parser based on Chen and Manning (2014), which shares most of the parameters between English and the target language, and uses an $\ell_2$ regularizer to tie the lexical embeddings of translationally-equivalent words. While not the primary focus of this paper,[27] we compare our proposed method to that of Duong et al. (2015b) on five target languages for which multilingual Brown clusters are available from Guo et al. (2016). For each target language, we train the parser on the English training data in the UD version 1.0 corpus (Nivre et al., 2015b) and a small treebank in the target language.[28] Following Duong et al. (2015b), in this setup, we only use gold coarse POS tags, we do not use any development data in the target languages (we use the English development set instead), and we subsample the English training data in each epoch to the same number of sentences in the target language. We use the same hyperparameters specified before for the single MALOPA parser and each of the monolingual baselines. Table 7 shows that our method outperforms Duong et al. (2015b) by 1.4 LAS points on average. Our method consistently outperforms the monolingual baselines in this setup, with an average improvement of 5.7 absolute LAS points.

4.2 Target Languages without a Treebank (Lt ∩ Ls = ∅)

McDonald et al. (2011) established that, when no treebank annotations are available in the target language, training on multiple source languages outperforms training on one (i.e., multi-source model transfer outperforms single-source model transfer). In this section, we evaluate the performance of our parser in this setup. We use two strong baseline multi-source model transfer parsers with no supervision in the target language:
• Zhang and Barzilay (2015) is a graph-based arc-factored parsing model with a tensor-based scoring function. It takes typological properties of a language as input. We compare to the best reported configuration (i.e., the column titled "OURS" in Table 5 of Zhang and Barzilay, 2015).
• Guo et al. (2016) is a transition-based neural-network parsing model based on Chen and Manning (2014). It uses multilingual embeddings and Brown clusters as lexical features. We compare to the best reported configuration (i.e., the column titled "MULTI-PROJ" in Table 1 of Guo et al., 2016).

Following Guo et al. (2016), for each target language, we train the parser on six other languages in the Google universal dependency treebanks version 2.0[29] (de, en, es, fr, it, pt, sv, excluding whichever is the target language), and we use gold coarse POS tags. Our parser uses the same word embeddings and word clusters used in Guo et al. (2016), and does not use any typology information.[30] The results in Table 8 show that, on average, our parser outperforms both baselines by more than 1 point in LAS, and gives the best LAS results in four (out of six) languages.

[27] The setup cost involved in recruiting linguists and developing and revising annotation guidelines to annotate a new language ought to be higher than the cost of annotating 3,000 tokens. After investing so many resources in a language, we believe it is unrealistic to stop the annotation effort after only 3,000 tokens.
[28] We thank Long Duong for sharing the processed, subsampled training corpora in each target language at https://github.com/longdt219/universal_dependency_parser/tree/master/data/universal-dep/universal-dependencies-1.0.
[29] https://github.com/ryanmcd/uni-dep-tb/
[30] In preliminary experiments, we found language embeddings to hurt the performance of the parser for target languages without a treebank.


LAS                        de    es    fr    it    pt    sv    average
Zhang and Barzilay (2015)  54.1  68.3  68.8  69.4  72.5  62.5  65.9
Guo et al. (2016)          55.9  73.0  71.0  71.2  78.6  69.5  69.3
MALOPA                     57.1  74.6  73.9  72.5  77.0  68.1  70.5

Table 8: Dependency parsing: labeled attachment scores (LAS) for multi-source transfer parsers in the simulated low-resource scenario where Lt ∩ Ls = ∅.

5 Related Work

Our work builds on the model transfer approach, which was pioneered by Zeman and Resnik (2008), who trained a parser on a source language treebank then applied it to parse sentences in a target language. Cohen et al. (2011) and McDonald et al. (2011) trained unlexicalized parsers on treebanks of multiple source languages and applied the parser to different languages. Naseem et al. (2012), Täckström et al. (2013), and Zhang and Barzilay (2015) used language typology to improve model transfer. To add lexical information, Täckström et al. (2012) used multilingual word clusters, while Xiao and Guo (2014), Guo et al. (2015), Søgaard et al. (2015) and Guo et al. (2016) used multilingual word embeddings. Duong et al. (2015b) used a neural network based model, sharing most of the parameters between two languages, and used an $\ell_2$ regularizer to tie the lexical embeddings of translationally-equivalent words. We incorporate these ideas in our framework, while proposing a novel neural architecture for embedding language typology (see §3.4), and use a variant of word dropout (Iyyer et al., 2015) for consuming noisy structured inputs. We also show how to replace an array of monolingually trained parsers with one multilingually-trained parser without sacrificing accuracy, which is related to Vilares et al. (2016). Neural network parsing models which preceded Dyer et al. (2015) include Henderson (2003), Titov and Henderson (2007), Henderson and Titov (2010) and Chen and Manning (2014). Related to lexical features in cross-lingual parsing is Durrett et al. (2012), who defined lexico-syntactic features based on bilingual lexicons. Other related work includes Östling (2015), which may be used to induce more useful typological properties to inform multilingual parsing.

Another popular approach for cross-lingual supervision is to project annotations from the source language to the target language via a parallel corpus (Yarowsky et al., 2001; Hwa et al., 2005) or via automatically-translated sentences (Tiedemann et al., 2014). Ma and Xia (2014) used entropy regularization to learn from both parallel data (with projected annotations) and unlabeled data in the target language. Rasooli and Collins (2015) trained an array of target-language parsers on fully annotated trees, by iteratively decoding sentences in the target language with incomplete annotations. One research direction worth pursuing is to find synergies between the model transfer approach and the annotation projection approach.

6 Conclusion

We presented MALOPA, a single parser trained on a multilingual set of treebanks. We showed that this parser, equipped with language embeddings and fine-grained POS embeddings, on average outperforms monolingually-trained parsers for target languages with a treebank. This pattern of results is quite encouraging. Although languages may share underlying syntactic properties, individual parsing models must behave quite differently, and our model allows this while sharing parameters across languages. The value of this sharing is more pronounced in scenarios where the target language's training treebank is small or non-existent, where our parser outperforms previous cross-lingual multi-source model transfer methods.

Acknowledgments

Waleed Ammar is supported by the Google fellowship in natural language processing. Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA). Part of this material is based upon work supported by a subcontract with Raytheon


BBN Technologies Corp. under DARPA Prime Contract No. HR0011-15-C-0013, and part of this research was supported by a Google research award to Noah Smith. We thank Jiang Guo for sharing the multilingual word embeddings and multilingual word clusters. We thank Lori Levin, Ryan McDonald, Jörg Tiedemann, Yulia Tsvetkov, and Yuan Zhang for helpful discussions. Last but not least, we thank the anonymous TACL reviewers for their valuable feedback.

References

Željko Agić, Maria Jesus Aranzabe, Aitziber Atutxa, Cristina Bosco, Jinho Choi, Marie-Catherine de Marneffe, Timothy Dozat, Richárd Farkas, Jennifer Foster, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Yoav Goldberg, Jan Hajič, Anders Trærup Johannsen, Jenna Kanerva, Juha Kuokkala, Veronika Laippala, Alessandro Lenci, Krister Lindén, Nikola Ljubešić, Teresa Lynn, Christopher Manning, Héctor Alonso Martínez, Ryan McDonald, Anna Missilä, Simonetta Montemagni, Joakim Nivre, Hanna Nurmi, Petya Osenova, Slav Petrov, Jussi Piitulainen, Barbara Plank, Prokopis Prokopidis, Sampo Pyysalo, Wolfgang Seeker, Mojgan Seraji, Natalia Silveira, Maria Simi, Kiril Simov, Aaron Smith, Reut Tsarfaty, Veronika Vincze, and Daniel Zeman. 2015. Universal dependencies 1.1. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. arXiv:1602.01925v2.

Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media. In EMNLP Workshop on Computational Approaches to Code Switching.

Emily M. Bender. 2011. On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology, 6(3):1–26.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proc. of CoNLL.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proc. of EMNLP.

Shay B. Cohen, Dipanjan Das, and Noah A. Smith. 2011. Unsupervised structure prediction with non-parallel multilingual guidance. In Proc. of EMNLP.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015a. Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. In Proc. of ACL-IJCNLP.

Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015b. A neural network model for low-resource universal dependency parsing. In Proc. of EMNLP.

Greg Durrett, Adam Pauls, and Dan Klein. 2012. Syntactic transfer using a bilingual lexicon. In Proc. of EMNLP.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proc. of ACL.

Matt Gardner, Kejun Huang, Evangelos Papalexakis, Xiao Fu, Partha Talukdar, Christos Faloutsos, Nicholas Sidiropoulos, and Tom Mitchell. 2015. Translation invariant word embeddings. In Proc. of EMNLP.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proc. of AISTATS.

Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proc. of ICASSP.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proc. of ACL.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2016. A representation learning framework for multi-source transfer parsing. In Proc. of AAAI.

Ulrich Heid and Sybille Raab. 1989. Collocations in multilingual generation. In Proc. of EACL.

James Henderson and Ivan Titov. 2010. Incremental sigmoid belief networks for grammar learning. Journal of Machine Learning Research, 11:3541–3570.

James Henderson. 2003. Inducing history representations for broad coverage statistical parsing. In Proc. of NAACL-HLT.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv:1508.01991.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(03):311–325.

Mohit Iyyer, Varun Manjunatha, Jordan L. Boyd-Graber, and Hal Daumé. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proc. of ACL.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proc. of ACL.

Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proc. of ACL.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proc. of EMNLP.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proc. of ACL.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. of ICLR.

Tahira Naseem, Regina Barzilay, and Amir Globerson. 2012. Selective sharing for multilingual dependency parsing. In Proc. of ACL.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proc. of ACL.

Joakim Nivre, Johan Hall, Sandra Kubler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proc. of CoNLL.

Joakim Nivre, Željko Agić, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Miguel Ballesteros, John Bauer, Kepa Bengoetxea, Riyaz Ahmad Bhat, Cristina Bosco, Sam Bowman, Giuseppe G. A. Celano, Miriam Connor, Marie-Catherine de Marneffe, Arantza Diaz de Ilarraza, Kaja Dobrovoljc, Timothy Dozat, Tomaž Erjavec, Richárd Farkas, Jennifer Foster, Daniel Galbraith, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Yoav Goldberg, Berta Gonzales, Bruno Guillaume, Jan Hajič, Dag Haug, Radu Ion, Elena Irimia, Anders Johannsen, Hiroshi Kanayama, Jenna Kanerva, Simon Krek, Veronika Laippala, Alessandro Lenci, Nikola Ljubešić, Teresa Lynn, Christopher Manning, Cătălina Mărănduc, David Mareček, Héctor Martínez Alonso, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Anna Missilä, Verginica Mititelu, Yusuke Miyao, Simonetta Montemagni, Shunsuke Mori, Hanna Nurmi, Petya Osenova, Lilja Øvrelid, Elena Pascual, Marco Passarotti, Cenel-Augusto Perez, Slav Petrov, Jussi Piitulainen, Barbara Plank, Martin Popel, Prokopis Prokopidis, Sampo Pyysalo, Loganathan Ramasamy, Rudolf Rosa, Shadi Saleh, Sebastian Schuster, Wolfgang Seeker, Mojgan Seraji, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Kiril Simov, Aaron Smith, Jan Štěpánek, Alane Suhr, Zsolt Szántó, Takaaki Tanaka, Reut Tsarfaty, Sumire Uematsu, Larraitz Uria, Viktor Varga, Veronika Vincze, Zdeněk Žabokrtský, Daniel Zeman, and Hanzhi Zhu. 2015a. Universal dependencies 1.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.

Joakim Nivre, Cristina Bosco, Jinho Choi, Marie-Catherine de Marneffe, Timothy Dozat, Richárd Farkas, Jennifer Foster, Filip Ginter, Yoav Goldberg, Jan Hajič, Jenna Kanerva, Veronika Laippala, Alessandro Lenci, Teresa Lynn, Christopher Manning, Ryan McDonald, Anna Missilä, Simonetta Montemagni, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Maria Simi, Aaron Smith, Reut Tsarfaty, Veronika Vincze, and Daniel Zeman. 2015b. Universal dependencies 1.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together.

Robert Östling. 2015. Word order typology through multilingual word alignment. In Proc. of ACL-IJCNLP.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proc. of LREC.

Mohammad Sadegh Rasooli and Michael Collins. 2015. Density-driven cross-lingual transfer of dependency parsers. In Proc. of EMNLP.

Dietmar Rösner. 1988. The generation system of the SEMSYN project: Towards a task-independent generator for German. Advances in Natural Language Generation, 2.

Anders Søgaard, Željko Agić, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders Johannsen. 2015. Inverted indexing for cross-lingual NLP. In Proc. of ACL-IJCNLP 2015.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proc. of NAACL-HLT.

Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 1:1–12.


Jörg Tiedemann, Željko Agić, and Joakim Nivre. 2014. Treebank translation for cross-lingual parser induction. In Proc. of CoNLL.

Jörg Tiedemann. 2015. Cross-lingual dependency parsing with universal dependencies and predicted POS labels. In Proc. of International Conference on Dependency Linguistics (Depling).

Ivan Titov and James Henderson. 2007. Constituent parsing with incremental sigmoid belief networks. In Proc. of ACL.

David Vilares, Carlos Gómez-Rodríguez, and Miguel A. Alonso. 2016. One model, two languages: Training bilingual parsers with harmonized treebanks. arXiv:1507.08449v2.

Wikipedia. 2016. List of languages by number of native speakers. http://bit.ly/1LUP5kJ. Accessed: 2016-01-26.

Min Xiao and Yuhong Guo. 2014. Distributed word representation learning for cross-lingual dependency parsing. In Proc. of CoNLL.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proc. of HLT.

Daniel Zeman and Philip Resnik. 2008. Cross-language parser adaptation between related languages. In Proc. of IJCNLP.

Yuan Zhang and Regina Barzilay. 2015. Hierarchical low-rank tensors for multilingual transfer parsing. In Proc. of EMNLP.
PDF Herunterladen