Transactions of the Association for Computational Linguistics, vol. 5, pp. 247–261, 2017. Action Editor: Hinrich Schütze.
Submission batch: 12/2015; Revision batch: 5/2016; 11/2016; Published 7/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Sparse Coding of Neural Word Embeddings for Multilingual Sequence Labeling

Gábor Berend
Department of Informatics
University of Szeged
2 Árpád tér, 6720 Szeged, Hungary
berendg@inf.u-szeged.hu

Abstract

In this paper we propose and carefully evaluate a sequence labeling framework which solely utilizes sparse indicator features derived from dense distributed word representations. The proposed model obtains (near) state-of-the-art performance for both part-of-speech tagging and named entity recognition for a variety of languages. Our model relies only on a few thousand sparse coding-derived features, without applying any modification of the word representations employed for the different tasks. The proposed model has favorable generalization properties as it retains over 89.8% of its average POS tagging accuracy when trained on 1.2% of the total available training data, i.e. 150 sentences per language.

1 Introduction

Determining the linguistic structure of natural language texts based on rich hand-crafted features has a long history in natural language processing. The focus of traditional approaches has mostly been on building linguistic analyzers for a particular kind of analysis, which often leads to the incorporation of extensive linguistic and/or domain knowledge for defining the feature space. Consequently, traditional models easily become language and/or task specific, resulting in poor generalization properties.

A new research direction has emerged recently that aims at building more general models that require far less feature engineering or none at all. These advancements in natural language processing, pioneered by Bengio et al. (2003), followed by Collobert and Weston (2008), Collobert et al. (2011) and Mikolov et al. (2013a) among others, employ a different philosophy. The objective of these works is to find representations for linguistic phenomena in an unsupervised manner by relying on large amounts of text.

Natural language phenomena are extremely sparse by their nature, whereas continuous word embeddings employ dense representations of words. In our paper we empirically verify via rigorous experiments that turning these dense representations into a much sparser (yet denser than one-hot encoding) form can keep the most salient parts of word representations that are highly suitable for sequence models.

Furthermore, our experiments reveal that our proposed model performs substantially better than traditional feature-rich models in the absence of abundant training data. Our proposed model also has the advantage of performing well on multiple sequence labeling tasks without any modification in the applied word representations, thanks to the sparse features derived from continuous word representations.

Our work aims at introducing a novel sequence labeling model solely utilizing features derived from the sparse coding of continuous word embeddings. Even though sparse coding has previously been utilized in NLP (Faruqui et al., 2015; Chen et al., 2016), to the best of our knowledge, we are the first to propose a sequence labeling framework incorporating it, with the following contributions:

• We show that the proposed sparse representation is general, as sequence labeling models trained on it achieve (near) state-of-the-art performance for both POS tagging and NER.
Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00059/1567456/tacl_a_00059.pdf by guest on 08 September 2023
• We show that the representation is general in the other sense, that it produces reasonable results for more than 40 treebanks for POS tagging,
• rigorously compare different sparse coding approaches in conjunction with differently trained continuous word embeddings,
• highlight the favorable generalization properties of our model in settings when access to a very limited training corpus is assumed,
• release the sparse word representations determined for our experiments at https://begab.github.io/sparse_embeds to ensure the replicability of our results and to foster further multilingual NLP research.

2 Related work

The line of research introduced in this paper relies on distributed word representations (Al-Rfou et al., 2013) and dictionary learning for sparse coding (Mairal et al., 2010), and also shows close resemblance to (Faruqui et al., 2015).

2.1 Distributed word representations

Distributed word representations assign some relatively low-dimensional, dense vectors to each word in a corpus such that words with similar context and meaning tend to have similar representations. From an algebraic point of view, the embedding of word $i$ having index $idx_i$ in a vocabulary $V$ can be thought of as the result of a matrix-vector multiplication $W \mathbf{1}_i$, where the $i$th column of matrix $W \in \mathbb{R}^{k \times |V|}$ contains the $k$-dimensional ($k \ll |V|$) embedding for word $i$ and vector $\mathbf{1}_i \in \mathbb{R}^{|V|}$ is the one-hot representation of word $i$. The one-hot representation of word $i$ is a vector which contains zeros for all of its entries except for index $idx_i$, where it stores a one.

Depending on how the columns of $W$ (i.e. the word embeddings) get determined, we could distinguish a plethora of approaches (Bengio et al., 2003; Lebret and Collobert, 2014; Mnih and Kavukcuoglu, 2013; Collobert and Weston, 2008; Mikolov et al., 2013a; Pennington et al., 2014).

Prediction-based distributed word embedding approaches such as word2vec (Mikolov et al., 2013a) have been conjectured to have superior performance over count-based word representations (Baroni et al., 2014). However, as Lebret and Collobert (2015), Levy et al. (2015) and Qu et al. (2015) point out, count-based distributional models can perform on par with prediction-based distributed word embedding models. Levy et al. (2015) illustrate that the effectiveness of neural word embeddings largely depends on the selection of model hyperparameters and other design choices.

According to these findings, in order to avoid any hassles of tuning the hyperparameters of the word embedding model employed, we primarily use the publicly available pre-trained polyglot word embeddings of Al-Rfou et al. (2013) instead, without any task specific modification for our experiments.

A key thing to note is that polyglot word embeddings are not tailored toward any specific language analysis task such as POS tagging or NER. These word embeddings are instead trained in a manner favoring the word analogy task introduced by Mikolov et al. (2013c). The polyglot project distributes word embeddings for more than 100 languages. Al-Rfou et al. (2013) also report results on POS tagging; however, the word representations they apply for these experiments are different from the task-agnostic representations they made publicly available.

There has been previous research on training neural networks for learning distributed word representations for various specific language analysis tasks. Collobert et al. (2011) propose neural network architectures for four natural language processing tasks, i.e. POS tagging, named entity recognition, semantic role labeling and chunking. Collobert et al. (2011) train word representations on large amounts of unannotated texts from Wikipedia, then update the pre-trained word representations for the individual tasks. Our approach is different in that we do not update our word representations for the different tasks and, most importantly, that we successfully use the features derived from sparse coding in a log-linear model instead of a neural network architecture. A final difference to (Collobert et al., 2011) is that we experiment with a much wider range of languages, while they report results for English only.

Qu et al. (2015) evaluate the impacts of choosing different embedding methods on four sequence labeling tasks, i.e. POS tagging, NER, syntactic
chunking and multiword expression identification. The hand-crafted features they employ for POS tagging and NER are the same as in Collobert et al. (2011) and Turian et al. (2010).

2.2 Sparse coding

The general goal of sparse coding is to express signals in the form of a sparse linear combination of basis vectors, and the task of finding an appropriate set of basis vectors is referred to as the dictionary learning problem (Mairal et al., 2010). Generally, given a data matrix $X \in \mathbb{R}^{k \times n}$ with its $i$th column $x_i$ representing the $i$th $k$-dimensional signal, the task is to find $D \in \mathbb{R}^{k \times m}$ and $\alpha \in \mathbb{R}^{m \times n}$ such that $X \approx D\alpha$. This can be formalized into an $\ell_1$-regularized linear least-squares minimization problem having the form

$$\min_{D \in \mathcal{C},\, \alpha} \frac{1}{2n} \sum_{i=1}^{n} \left( \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \right), \qquad (1)$$

with $\mathcal{C}$ being the convex set of matrices of column vectors having an $\ell_2$ norm of at most one, matrix $D$ acting as the shared dictionary across the signals, and the columns of the sparse matrix $\alpha$ containing the coefficients for the linear combinations of each of the $n$ observed signals.

Performing sparse coding of word embeddings has recently been proposed by Faruqui et al. (2015); however, the objective function they optimize differs from (1). In Section 4, we compare the effects of employing different sparse coding paradigms, including the ones in (Faruqui et al., 2015).

In their work, Yogatama et al. (2015) proposed an efficient learning algorithm for determining hierarchically organized sparse word representations using stochastic proximal methods. Most recently, Sun et al. (2016) have proposed an online learning algorithm using regularized dual averaging to directly obtain $\ell_1$-regularized continuous bag of words (CBOW) representations (Mikolov et al., 2013a) without the need to determine dense CBOW representations first.

3 Sequence labeling framework

This section introduces the sequence labeling framework we use for both POS tagging and NER. Since our goal is to measure the effectiveness of sparse word embeddings alone, we do not apply any features based on gazetteers, capitalization patterns or character suffixes.

As described previously, word embedding methods turn a high-dimensional (i.e., as many dimensions as words in the vocabulary) and extremely sparse (i.e. containing only one non-zero element at the vocabulary index of the word it represents) one-hot encoded representation of words into a dense embedding of much lower dimensionality $k$.

In our work, instead of using the low-dimensional dense word embeddings, we use a dictionary learning approach to obtain sparse codings for the embedded word representations. Formally, given the lookup matrix $W \in \mathbb{R}^{k \times |V|}$ which contains the embedding vectors, we learn $D \in \mathbb{R}^{k \times m}$, the dictionary matrix shared across all the embedding vectors, and $\alpha \in \mathbb{R}^{m \times |V|}$, containing sparse linear combination coefficients for each of the word embeddings, so that $\|W - D\alpha\|_F^2 + \lambda \|\alpha\|_1$ is minimized.

Once the dictionary matrix $D$ is learned, the sparse linear combination coefficients $\alpha_i$ can easily be determined for a word embedding vector $w_i$ by solving an $\ell_1$-regularized linear least-squares minimization problem (Mairal et al., 2010). We define features based on vector $\alpha_i$ by taking the signs and indices of its non-zero coefficients, that is

$$f(w_i) = \{\operatorname{sign}(\alpha_i[j])\, j \mid \alpha_i[j] \neq 0\}, \qquad (2)$$

where $\alpha_i[j]$ denotes the $j$th coefficient in the sparse vector $\alpha_i$. For instance, if the only non-zero coefficients of $\alpha_i$ were a positive one at index 3 and a negative one at index 17, then $f(w_i) = \{+3, -17\}$. The intuition behind this feature is that words with similar meaning are expected to use an overlapping set of basis vectors from dictionary $D$. Incorporating the signs of coefficients into the feature function can help to distinguish cases when a basis vector takes part in the reconstruction of a word representation "destructively" or "constructively".

When assigning features to a target word at some position within a sentence, we determine the same set of feature functions for the target word itself and its neighboring words of window size 1. Experiments with window size 2 were also performed. However, we omit these results for brevity as they do not substantially differ from those obtained with a window size of 1.

We then use the previously described set of features in a linear chain CRF (Lafferty et al., 2001)
using CRFsuite (Okazaki, 2007) with its default settings for hyperparameters, i.e., the coefficients of 1.0 and 0.001 for $\ell_1$ and $\ell_2$ regularization, respectively.

4 Experiments

We rely on the SPArse Modeling Software¹ (SPAMS) (Mairal et al., 2010) for performing sparse coding of distributed word representations. For dictionary learning as formulated in Equation 1, one should choose $m$ and $\lambda$, controlling the number of the basis vectors and the regularization coefficient affecting the sparsity of $\alpha$, respectively. Starting with $m = 256$ and doubling it at each iteration, our preliminary investigations showed a steady growth in the usefulness of sparse word representations as a function of $m$, plateauing at $m = 1024$. We set $m$ to that value for further experiments.

¹ http://spams-devel.gforge.inria.fr/

4.1 Baseline methods

Brown clustering. Various studies have identified Brown clustering (Brown et al., 1992) as a useful source of feature generation for sequence labeling tasks (Ratinov and Roth, 2009; Turian et al., 2010; Owoputi et al., 2013; Stratos and Collins, 2015; Derczynski et al., 2015). We should note that sparse coding can also be viewed as a kind of clustering that, unlike Brown clustering, has the capability of assigning word forms to multiple clusters at a time (corresponding to the non-zero coefficients in $\alpha$).

We thus define a linear chain CRF relying on features from the Brown cluster identifier of words as one of our baseline approaches. Since Brown clustering defines a hierarchical clustering over words, cluster supersets can easily function as features. We generate features from length-$p$ ($p \in \{4, 6, 10, 20\}$) prefixes of Brown cluster identifiers, similar to Ratinov and Roth (2009) and Turian et al. (2010).

In our experiments we use the implementation by Liang (2005) for performing Brown clustering². We provide the very same Wikipedia articles as input text for determining Brown clusters that are used for training the polyglot³ word embeddings. We also set the number of Brown clusters to be identified to 1024, which is the number of basis vectors applied during sparse coding (cf. $D \in \mathbb{R}^{64 \times 1024}$).

² https://github.com/percyliang/brown-cluster
³ https://sites.google.com/site/rmyeid/projects/polyglot

Feature-rich representation. We report results relying on linear chain CRFs that assign a standard state-of-the-art feature-rich representation to sequences. We apply the very same features and feature templates included in the POS tagging model of CRFSuite⁴. We summarize these features in Table 1, where ⊕ denotes the binary operator which defines features as a combination of word forms at different (not necessarily contiguous) positions of a sentence.

⁴ http://github.com/chokkan/crfsuite/blob/master/example/pos.py

#    Level   Feature name
1    char    isNumber(w_t)
2    char    isTitleCase(w_t)
3    char    isNonAlnum(w_t)
4    char    prefix(w_t, i),  1 ≤ i ≤ 4
5    char    suffix(w_t, i),  1 ≤ i ≤ 4
6    word    w_{t+j},  −2 ≤ j ≤ 2
7    word    w_t ⊕ w_{t+i},  1 ≤ i ≤ 9
8    word    w_t ⊕ w_{t−i},  1 ≤ i ≤ 9
9    word    ⊕_{i=t+j}^{t+j+1} w_i,  −2 ≤ j ≤ 1
10   word    ⊕_{i=t+j}^{t+j+2} w_i,  −2 ≤ j ≤ 0
11   word    ⊕_{i=t+j−1}^{t+j+2} w_i,  −1 ≤ j ≤ 0
12   word    ⊕_{i=t−2}^{t+2} w_i

Table 1: Features and feature templates applied by our feature-rich baseline for target word w_t. ⊕ is a binary operator forming a feature from words and their relative positions by combining them together.

We use the same pool of features described in Table 1 for both POS tagging and NER. The reason why we do not adjust the feature-rich representation employed as our baseline for the different tasks is that we do not alter our representation in any way when using our sparse coding-based model either.

Note that features #1 through #5 in Table 1 operate at the character level, whereas our proposed framework solely uses features derived from the sparse coding of word forms. We thus distinguish two feature-rich baselines, i.e. FR_{w+c}, including both word and character-level features, and FR_w, treating word forms as atomic units to derive features from.

Using dense word representations. As our ultimate goal is to demonstrate the usefulness of sparse
features derived from dense word representations, it is important to address the question of whether sparse word representations are more beneficial for sequence labeling tasks compared to their dense counterparts. To this end, we developed a model similar to the one proposed in Section 3, except for using the original dense word representations for inducing features.

According to this modification, we made the following change in our feature function: instead of calculating Equation (2) for some word $i$, the modified feature function we use for this baseline is

$$f(w_i) = \{ j : w_i[j] \mid \forall j \in \{1, \ldots, k\} \}.$$

That is, instead of relying on the non-zero values in $\alpha_i$, each word is characterized by its $k$ real-valued coordinates in the embedding space. In order to notationally distinguish sparse and dense representations, we add the subscript SC when we refer to a sparse coded version of some word embedding (e.g. SG_SC).

4.2 POS tagging experiments

Even though it is reasonable to assume that languages share a common coarse set of linguistic categories, linguistic resources have had their own notations for part-of-speech tags. The first notable attempt to canonize the multiple tag sets was the Google universal part-of-speech tags introduced by Petrov et al. (2012), in which the POS tags of various tagging schemes were mapped to 12 language-independent part-of-speech tags.

The recent initiative of universal dependencies (UD) (Nivre, 2015) aims to provide a unified notation for multiple linguistic phenomena, including part-of-speech tags as well. The POS tag set proposed for UD has 17 categories, which partially overlap with those defined by Petrov et al. (2012).

4.2.1 Experiments using CoNLL 2006/07 data

We use 12 treebanks in the CoNLL-X format from the CoNLL-2006/07 (Buchholz and Marsi, 2006; Nivre et al., 2007) shared tasks. The complete list of the treebanks included in our experiments is presented in Table 2.

Language   Source
bg         BTB/CoNLL06 (2005)
da         DDT/CoNLL06 (2004)
de         Tiger/CoNLL06 (2002)
en         Penn Treebank (1993)
es         Cast3LB/CoNLL06 (2008)
hu         Szeged Treebank/CoNLL07 (2005)
it         ISST/CoNLL07 (2003)
nl         Alpino/CoNLL06 (2002)
pt         Floresta Sintá(c)tica/CoNLL06 (2002)
sl         SDT/CoNLL06 (2006)
sv         Talbanken05/CoNLL06 (2006)
tr         METU-Sabanci/CoNLL07 (2003)

Table 2: Treebanks used for POS tagging experiments from the CoNLL 2006/07 shared task.

We rely on the official scripts released by Petrov et al. (2012)⁵ for mapping the treebank specific POS tags to the Google universal POS tags in order to obtain results comparable across languages. For our experiments we used the original CoNLL-X train/test splits of the treebanks.

⁵ https://github.com/slavpetrov/universal-pos-tags

[Figure 1: bar chart; x-axis: language (bg, da, de, en, es, hu, it, nl, pt, sl, sv, tr, Avg.); y-axis: Coverage (%); series: Token, Word form.]
Figure 1: Token and word form-level coverages of the word vectors against the combined train/test sets of the CoNLL-2006/07 POS tagging datasets.

A key factor for the efficiency of our proposed model resides in the coverage of word embeddings, i.e. the proportion of tokens/word forms for which a distributed representation is determined. Figure 1 depicts these coverage scores calculated over the merged training and test sets for the different languages.

Figure 1 reveals that a distributed representation is defined for a substantial proportion of tokens (around 90% for the majority of languages, except for Turkish, where it is 5 points lower). Token coverages of the word embeddings are most likely affected by the morphological richness of the languages and the elaborateness of the corresponding Wikipedia articles used for training word embeddings.
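The two coverage notions discussed above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function name `coverage`, the toy token list and the toy vocabulary are hypothetical, and matching is done on exact, case-sensitive surface forms, which may differ from the normalization actually applied in the experiments.

```python
from collections import Counter


def coverage(tokens, vocabulary):
    """Return (token_coverage, word_form_coverage) as fractions in [0, 1].

    tokens: iterable of surface forms from the merged train/test data.
    vocabulary: set of word forms that have an embedding.
    Assumption: exact, case-sensitive matching with no normalization.
    """
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return 0.0, 0.0
    # Token-level: every occurrence of a covered word form counts.
    covered_tokens = sum(c for w, c in counts.items() if w in vocabulary)
    # Word form-level: each distinct form counts once.
    covered_forms = sum(1 for w in counts if w in vocabulary)
    return covered_tokens / total_tokens, covered_forms / len(counts)


# Hypothetical toy data: 5 tokens, 4 distinct word forms, 1 form uncovered.
tokens = ["the", "cat", "saw", "the", "zyzzyva"]
vocab = {"the", "cat", "saw"}
token_cov, form_cov = coverage(tokens, vocab)
print(f"token coverage: {token_cov:.2%}, word form coverage: {form_cov:.2%}")
```

Because frequent word forms are more likely to have an embedding, token-level coverage is typically higher than word form-level coverage, as in this toy case (4 of 5 tokens versus 3 of 4 forms).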
[Figure 2: twelve panels, (a) bg, (b) da, (c) de, (d) en, (e) es, (f) hu, (g) it, (h) nl, (i) pt, (j) sl, (k) sv, (l) tr; x-axis: Sparsity; y-axis: Per token accuracy; series: polyglot_SC, CBOW_SC, SG_SC, Glove_SC, FR_{w+c}, FR_w, Brown.]
Figure 2: POS tagging results on the CoNLL 2006/07 treebanks evaluating against universal POS tags. Ticks are placed for λ = 0.05, 0.1, 0.2, 0.3, 0.4, 0.5. The x-axis shows the sparsity of the representations.

Comparing word embeddings. Our motivation for choosing polyglot word embeddings as input to sparse coding is that they are publicly available for a variety of languages. However, distributed word representations trained in any other reasonable manner can serve as input to our approach. In order to investigate if some of the popular word embedding techniques seem favorable for our algorithm, we conduct experiments using alternatively trained embeddings, i.e. skip-gram (SG), continuous bag-of-words (CBOW) and Glove.

So that the utility of the different word embeddings is not conflated with other factors, we train them on the same Wikipedia dumps used for training the polyglot word vectors. We choose further hyperparameters identically to polyglot, i.e. we train 64-dimensional dense word representations using a symmetric context window of size 2 for both SG/CBOW⁶ and Glove⁷.

⁶ https://code.google.com/archive/p/word2vec/
⁷ http://nlp.stanford.edu/projects/glove/

Figure 2 includes POS tagging accuracies over the 12 treebanks from the CoNLL 2006/07 shared tasks evaluated against Google Universal POS tags. Instead of reporting results as a function of λ, we present accuracies as a function of the different sparsity levels induced by different λ values. Figure 2 demonstrates that POS tagging performance is quite insensitive to the choice of λ unless it yields some extreme sparsity level (>99.5%). Figure 2 also reveals that the usage of
polyglot_SC word representations tends to produce superior results over all alternative representations we experiment with. Furthermore, models using polyglot_SC consistently outperform the FR_w and Brown clustering-based baselines.

Models relying on CBOW_SC and SG_SC representations have an average tagging accuracy of 93.74 and 93.63, respectively, and they typically perform better than the baseline using Brown clustering, which has an average tagging performance of 93.27. Although utilizing Glove embeddings produces the lowest scores (91.92 on average), its scores still surpass those of the FR_w baseline for all languages except for Turkish.

The average tagging performance over the 12 languages when relying on features based on polyglot_SC is only 1.3 points below that of FR_{w+c} (i.e. 94.4 versus 95.7). Recall that FR_{w+c} uses a feature-rich representation, whereas our proposed model uses only O(m) features, i.e. it is tied to the number of the basis vectors employed for sparse coding. Furthermore, our model does not employ word identity features, nor does it rely on character-level features of words.

               bg     da     de     en     es     hu     it     nl     pt     sl     sv     tr     Avg.
polyglot_SC   96.04  95.71  96.33  97.20  96.14  92.92  95.21  93.43  95.96  94.10  94.36  85.93  94.44
CBOW_SC       95.10  95.35  95.61  97.08  95.75  92.17  94.51  92.61  95.42  92.96  93.18  85.12  93.74
SG_SC         94.67  95.49  95.47  96.91  95.29  91.97  94.11  93.12  95.28  92.63  93.60  84.99  93.63
Glove_SC      93.16  93.63  94.61  96.10  93.36  88.62  92.88  90.16  94.65  90.31  92.19  83.36  91.92
(a) Results obtained using sparse word representations (λ = 0.1, m = 1024).

               bg     da     de     en     es     hu     it     nl     pt     sl     sv     tr     Avg.
polyglot      92.11  93.03  93.10  94.80  94.64  89.23  92.90  90.07  94.36  89.36  89.14  81.33  91.17
CBOW          90.19  90.36  88.46  91.22  91.55  86.07  87.11  88.09  92.45  87.82  87.00  79.30  88.30
SG            88.10  88.84  86.48  90.19  91.34  84.38  85.09  85.11  91.77  88.17  84.48  78.72  86.89
Glove         83.10  81.95  83.07  86.64  84.65  77.34  79.98  78.54  86.62  80.91  78.77  76.77  81.53
(b) Results obtained using dense word representations.

Table 3: Performances of sparse and dense word representations for POS tagging over the 12 CoNLL-X datasets.

Analyzing the effects of window size. Hyperparameters for training word representations can greatly impact their quality, as also concluded by Levy et al. (2015). We thus investigate if providing a larger context window size during the training of CBOW, SG and Glove embeddings can improve their performance in our model.

According to Figure 3, applying a context window size of 2 for training the word embeddings tends to produce better overall POS tagging accuracies than applying a larger window size of 10. Differences are the most pronounced in the case of the skip-gram representation, confirming the findings of Lin et al. (2015), i.e. embedding models that model short-range context are more effective for POS tagging.

[Figure 3: three panels, CBOW_SC, SG_SC and Glove_SC; x-axis: λ (0.05, 0.1, 0.2, 0.3, 0.4, 0.5); y-axis: Accuracy; series: w = 2, w = 10.]
Figure 3: Overview of POS tagging accuracies over the 12 CoNLL-X datasets when relying on sparse coded versions of alternative word embeddings trained with context window sizes of 2 and 10.

Comparing dense and sparse representations. Unless stated otherwise, we use λ = 0.1 for the experiments below in accordance with Figure 2. Table 3 demonstrates that performances obtained by models using dense word representations as features are consistently inferior to those of models relying on sparse word representations.

In Table 3b, we can see that polyglot embeddings perform the best for dense representations as well. When using dense features, the CBOW representation-based model tends to produce better results than SG embeddings, by a margin of 1.4 points on average. This performance gap between the two word2vec variants vanishes, however, when dense representations are replaced by their sparse counterparts. Table 3 also reveals that
sparse word representations improve average POS tagging accuracy by 3.3, 5.4, 6.7 and 10.4 points for polyglot, CBOW, SG and Glove word representations, respectively.

Comparing the effects of training corpus size. We also investigate the generalization characteristics of the proposed representation by training models that have access to substantially different amounts of training data per language. We distinguish three scenarios, i.e. when using only the first 150, the first 1,500 and all the available training sentences from each corpus. Figure 4 illustrates the average POS tagging accuracy over the 12 CoNLL-X datasets for different amounts of training data and models.

model            bg     da     de     en     es     hu     it     nl     pt     sl     sv     tr     Avg.
polyglot_SC     96.04  95.71  96.33  97.20  96.14  92.92  95.21  93.43  95.96  94.10  94.36  85.93  94.44
FR_w            92.55  91.68  95.86  96.99  92.31  86.33  89.32  88.79  93.28  87.12  91.51  83.50  90.77
FR_{w+c}        97.20  96.67  98.42  97.74  96.43  95.36  95.94  94.47  97.73  93.90  95.56  89.63  95.75
# train sents.  12823   5190  39216  39832   3306   6035   3110  13349   9071   1534  11042   4997  12458
(a) Results obtained with different models when all the training corpora were used.

model            bg     da     de     en     es     hu     it     nl     pt     sl     sv     tr     Avg.
polyglot_SC     88.20  94.04  93.47  95.76  95.63  91.15  94.19  87.28  94.60  94.12  91.14  83.23  91.90
FR_w            79.63  87.75  85.58  90.93  89.87  80.01  86.60  74.40  89.13  86.93  80.16  77.59  85.05
FR_{w+c}        88.71  93.52  95.77  94.59  95.42  92.74  93.66  84.94  95.13  93.82  88.56  84.92  91.82
train sents. %  11.70  28.90   3.82   3.77  45.37  24.86  48.23  11.24  16.54  97.78  13.58  30.02  12.04
(b) Results obtained with different models when the first 1,500 sentences of the training corpora were used.

model            bg     da     de     en     es     hu     it     nl     pt     sl     sv     tr     Avg.
polyglot_SC     76.46  89.51  88.29  90.46  91.32  86.51  89.13  75.24  90.74  86.67  82.50  71.17  84.83
FR_w            62.44  74.88  72.46  78.10  77.80  67.20  75.45  56.67  79.38  72.46  65.13  61.38  70.28
FR_{w+c}        74.87  83.34  89.64  85.75  85.88  83.54  84.99  69.28  87.52  83.88  76.71  67.40  81.07
train sents. %   1.17   2.89   0.38   0.38   4.54   2.49   4.82   1.12   1.65   9.78   1.36   3.00   1.20
(c) Results obtained with different models when the first 150 sentences of the training corpora were used.

Table 4: Comparison of models based on different amounts of training data. Bold numbers indicate the best results for a given training regime (i.e. either training on 150/1,500/all training sentences). polyglot_SC uses m = 1024, λ = 0.1.

[Figure 4: line chart; x-axis: Training sentences (150, 1500, all); y-axis: Accuracy; series: FR_w, FR_{w+c}, polyglot_SC.]
Figure 4: Average tagging accuracies over the 12 CoNLL-X languages using varying amounts of training sentences.

Table 4 further reveals that the average performance of polyglot_SC is 14.55 and 3.76 points better compared to the FR_w and FR_{w+c} baselines, respectively, when using only 1.2% of all the available training data, i.e. 150 sentences per language. By discarding 98.8% of the training data, polyglot_SC obtains 89.8% of its average performance compared to the scenario when it has access to all the training sentences. However, under the same scenario the FR_{w+c} and FR_w models only manage to preserve 85% and 77% of their original performance, respectively.

Our model performs on par with FR_{w+c} and has a 6.85 points advantage over FR_w with a training corpus of 1,500 sentences. FR_{w+c} has an average advantage of 1.3 points over polyglot_SC when we provide access to all training data during training; nevertheless, FR_w still underperforms polyglot_SC in that setting by 3.67 points.

Comparing sparse coding techniques. Next, we compare different sparse coding approaches on the
pre-trained polyglot word representations. The recent work of Faruqui et al. (2015) formulated alternative approaches to determine sparse word representations. One of the objective functions Faruqui et al. (2015) apply is

$$\min_{D,\, \alpha} \frac{1}{2n} \sum_{i=1}^{n} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 + \tau \|D\|_2^2. \qquad (3)$$

The main difference between Eq. 1 and Eq. 3 is that the latter does not explicitly constrain $D$ to be a member of the convex set of matrices comprising column vectors having a pre-defined upper bound on their norm. In order to implicitly control for the norms of the basis vectors, Faruqui et al. (2015) apply an additional regularization term affected by an extra parameter $\tau$ in their objective function.

Faruqui et al. (2015) also formulated a constrained objective function of the form

$$\min_{D \in \mathbb{R}^{k \times m}_{\geq 0},\; \alpha \in \mathbb{R}^{m \times |V|}_{\geq 0}} \frac{1}{2n} \sum_{i=1}^{n} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 + \tau \|D\|_2^2, \qquad (4)$$

for which a non-negativity constraint on the elements of $\alpha$ (but no constraint on $D$) is imposed. When using the objective functions introduced by Faruqui et al. (2015), we use the default $\tau = 10^{-5}$ value. Notationally, we distinguish the sparse coding approaches based on the equation they use as their objective function, i.e. SC-$i$, $i \in \{1, 3, 4\}$.

We applied λ = 0.05 for SC-1 and λ = 0.5 for SC-3 and SC-4 in order to obtain word representations of comparable average sparsity levels across the 12 languages, i.e. 95.3%, 94.5% and 95.2%, respectively (cf. the left of Figure 5). The right of Figure 5 further illustrates the spread of POS tagging accuracies over the 12 CoNLL-X treebanks when using models that rely on different sparse coding strategies with comparable sparsity levels.

[Figure 5: two panels; x-axis: Sparse coding approach (SC1, SC3, SC4); y-axes: % Sparsity (left) and POS tagging accuracy (right).]
Figure 5: Comparison of the POS tagging accuracies of different sparse coding techniques with comparable average sparseness levels over the 12 CoNLL-X languages.

Although Murphy et al. (2012) mention non-negativity as a desired property of word representations for cognitive plausibility, Figure 5 reveals that our sequence labeling model cannot benefit from it, as the average POS tagging accuracy for SC-4 is 0.7 points below that of the SC-3 approach. The average performances when applying SC-1 and SC-3 are nearly identical, with a 0.18 point difference between the two.

[Figure 6: (a) ℓ2 norms and (b) relative frequencies of basis vectors, shown per language (bg, da, de, en, es, hu, it, nl, pt, sl, sv, tr) for SC1, SC3 and SC4.]
Figure 6: Characteristics of the different sparse coding techniques over the 12 CoNLL-X languages.

It is instructive to analyze the patterns different sparse coding approaches exhibit. Even though the objective functions used by the different approaches are similar, the decompositions obtained by them convey rather different sparsity structures. Figure 6a illustrates that there exists substantial variation in the length of the basis vectors obtained by SC-3 and SC-4, both within and across languages. However, SC-1 produces practically no variation in the length of the basis vectors comprising $D$ due to the constraint present in the objective function it employs. Figure 6b shows similar differences in the relative frequency of basis vectors taking part in the reconstruction of word embeddings.

Figure 7 shows a strong correlation between the $\ell_2$ norm of basis vectors and the relative number of times a non-zero coefficient is assigned to them in $\alpha$ for SC-3 and SC-4, but not for SC-1. It can be further noted from Figure 7 that the norms
of the basis vectors determined by SC-3 and SC-4 are often orders of magnitude larger than those determined by SC-1. This effect, however, can be naturally mitigated by increasing $\tau$.

[Figure 7: scatter plot; x-axis: ℓ2 norm; y-axis: Relative frequency; series: SC1, SC3, SC4.]
Figure 7: Relative frequency of basis vectors receiving nonzero coefficients in α as a function of their ℓ2 norm.

Overall, the different approaches convey comparable POS tagging accuracies but different decompositions due to the differences in the objective functions they employ. Experiments described below are conducted using the objective function in Eq. 1.

4.2.2 Experiments using UD treebanks

For POS tagging we also experiment with UD v1.2 (Nivre et al., 2015) treebanks. We used the default train-test splits of the treebanks, not utilizing the development sets for fine-tuning performance on any of the languages during our experiments.

We omitted the Japanese treebank as the words in it are stripped off due to licensing issues. Also, there is no polyglot vector released for Old Church Slavonic and Gothic. Even though polyglot word representations are released for Arabic, they were of no practical use as they contain unvocalized surface forms of tokens, in contrast to the vocalized forms in UD v1.2. For this reason, we discarded the Arabic treebank, as less than 30% of its tokens could be associated with a representation.

By omitting these 4 languages from our experiments we are finally left with 33 treebanks for 29 languages. We note that for the Ancient Greek treebanks (grc*) we use word embeddings trained on Modern Greek. We should add that there are 4 languages (related to 6 treebanks) for which polyglot word vectors are accessible; however, the Wikipedia dumps used for training them are not distributed. For this reason, Brown clustering-based baselines are missing for the affected treebanks.

We report our results on UD v1.2 in Table 5. Recall that the default behavior of our sparse coding-based models (SC in Table 5) is that they do not handle word identity as an explicit feature. We now investigate how much word identity features contribute on their own and also when used in conjunction with sparse coding-derived features.

To this end we introduce a simple linear chain CRF model generating features solely on the identity of the current word and the ones surrounding it (WI in Table 5). Likewise, we define a model that relies on WI and SC features simultaneously (WI+SC). Table 5 reveals that SC outperforms WI by a large margin and that combining the two feature sets together yields some further improvements over SC scores.

We also present in Table 5 the state-of-the-art results of the bidirectional LSTM models by Plank et al. (2016) for comparative purposes. Note that the authors reported results only on a subset of UD v1.2 (i.e. treebanks with at least 60k tokens), for which reason we can include their results on 21 treebanks. Out of these 21 UD v1.2 treebanks there are 15 and 20 cases, respectively, for which SC and WI+SC produce better results than bi-LSTM_w. Only FR_{w+c} and bi-LSTM_{w+c}, models which enjoy the additional benefit of employing character-level features besides word-level ones, are capable of outperforming SC and WI+SC.

4.3 Named entity recognition experiments

Besides the POS tagging experiments, we investigated if the very same features as the ones applied for POS tagging can be utilized in a different sequence labeling task, namely named entity recognition. In order to evaluate our approach, we obtained the English, Spanish and Dutch datasets from the 2002 and 2003 CoNLL shared tasks on multilingual Named Entity Recognition (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003).

We use the train-test splits provided by the or-
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
0
5
9
1
5
6
7
4
5
6
/
/
t
yo
a
C
_
a
_
0
0
0
5
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
257
BaselineusingWordSparseWordsandcharactersWordsonlyIdentitycodingTokenTreebankbi-LSTMw+cFRw+cbi-LSTMwFRwBrown(Wisconsin)(CAROLINA DEL SUR)WI+SCcoveragebg98.2596.8895.1290.4093.3690.7595.3395.6392.64cs97.9398.0393.7793.0991.9893.4095.1395.8392.42da95.9494.7091.9687.4192.4587.5193.3293.2993.96de93.1191.7390.3385.7388.5285.9089.1190.7392.75el—96.77—90.9195.9691.5396.9197.1295.80en94.6193.5292.1089.2891.4089.3693.0393.4797.61es95.3494.3793.6090.9393.8391.3194.4394.6997.08et—84.83—75.4284.5276.7885.5686.3080.40eu94.9193.0388.0083.36—84.8390.1990.6390.98fa96.8996.1395.3193.9895.0494.4595.9196.1197.80fi95.1892.9387.9582.3185.9883.1788.8089.1984.37fiftb—91.84—86.9182.8681.5786.9187.8883.92fr96.0495.3094.4492.8092.4292.8893.5294.9692.06ga—89.64—84.32—85.2188.2288.8288.80grc—93.57—84.3557.1384.4470.2785.0443.58grcproiel—96.39—90.7349.4191.0167.1791.3845.74he95.9293.9193.3790.1793.7990.3394.3895.2892.03hi96.6495.9695.9994.3294.6194.2595.3796.0996.40hr95.5994.1889.2482.9192.2283.5292.8593.5392.45hu—92.88—73.6991.0875.6389.4789.4790.07id92.7993.3290.4887.2991.3988.0391.7192.0297.09it97.6496.9296.5793.6294.9293.4395.7096.2894.99la—92.03—77.75—79.9985.4986.3483.03laitt—98.78—97.69—97.7495.4397.7792.23laproiel—95.89—90.53—90.8490.1492.4285.21nl92.0788.7984.9681.1184.2881.2784.3285.1092.28no97.7796.5394.3991.5894.2991.8795.4295.6794.53pl96.6295.2789.7384.4191.1384.5793.5793.9594.19pt97.4896.5994.2490.6993.7491.1194.0095.5092.53ro—86.46—76.3289.9375.9688.9988.2793.06sl97.7895.2891.0984.4390.2484.9292.6592.7092.14sv96.3094.9493.3288.8493.5088.9494.4694.6292.50ta—85.37—68.02—70.6981.2581.8085.35Avg.95.9994.7692.4088.7791.9589.0593.1593.7393.59Table5:PertokenPOStaggingaccuraciesfor33UDtreebanks.ForsparsecodingSPAMSisusedonpolyglotvectorswithλ=0.1andm=1024.Resultsinboldarebetterthananyofbi-LSTMw,FRwandBrownmodels(i.e.thebaselinesusingfeaturesbasedonwordsonly).Averageiscalculatedoverthe20highlightedtreebanksforwhichthereareresultsineverycolumn.Thebi-LSTMresultsarefromPlanketal.(2016).ganiz
ersandreportourNERresultsusingtheF1scoresbasedontheofficialevaluationscriptoftheCoNLLsharedtask.SimilartoCollobertetal.(2011)wealsoapplythe17-tagIOBEStaggingschemeduringtrainingandinference.ThebestF1scoresreportedforEnglishbyCollobertetal.(2011)withoutemployingadditionalunlabeledtextstoenhancetheirlanguagemodelis81.47.Whenpre-trainingtheirneurallanguagemodelonlargeamountsofWikipediatextstheyreportanF1scoreof87.58.Figure8includesourNERresultsobtainedus-ingdifferentwordembeddingrepresentationsasin-putforsparsecodinganddifferentlevelsofspar-sity.SimilartoourPOStaggingexperiments,usingpolyglotSCvectorstendtoperformbestforNERaswell.However,asubstantialdifferencecomparedtothePOStaggingresultsisthatNERperformances
do not degrade even for extreme levels of sparsity. Also, the sparse coding-based models perform much better when compared to the FRw+c baseline.

Figure 8: NER results relying on sparse coding of different word representations. The x-axis shows the sparsity of the representations with ticks at λ = 0.05, 0.1, 0.2, 0.3, 0.4, 0.5. Panels: (a) en, (b) es, (c) nl; y-axis: F1 score; series: polyglotSC, CBOWSC, SGSC, GloveSC, FRw+c, FRw, Brown.

(a) Sparse (m = 1024, λ = 0.1)
            en     es     nl     Avg.
polyglotSC  82.92  77.03  72.66  77.54
CBOWSC      83.40  75.51  71.36  76.76
SGSC        82.83  75.22  70.86  76.30
GloveSC     82.31  75.78  69.85  75.98

(b) Dense
            en     es     nl     Avg.
polyglot    78.80  70.13  65.58  71.50
CBOW        72.68  64.49  64.80  67.32
SG          74.68  66.17  63.95  68.27
Glove       74.33  65.11  57.73  65.72

Table 6: Comparison of the performance of sparse and dense word representations for NER.

In Table 6, we compare the effectiveness of models relying on sparse and dense word representations for NER. In order not to fine-tune hyperparameters for a particular experiment, similarly to our previous choices m and λ are set to 1024 and 0.1, respectively. Results in Table 6 are in line with those reported in Table 3 for POS tagging.

5 Conclusion

In this paper we show that it is possible to train sequence models that perform nearly as well as the best existing models on a variety of languages for both POS tagging and NER. Our approach does not require word identity features to perform reliably; furthermore, it is capable of achieving comparable results to traditional feature-rich models. We also illustrate the advantageous generalization property of our model as it retained 89.8% of its original average POS tagging accuracy when trained on only 1.2% of the total accessible training sentences.

As Mikolov et al. (2013b) pointed out the similarities of continuous word embeddings across languages, we think that our proposed model could be employed not just in multilingual, but also in cross-lingual language analysis settings. In fact, we will investigate its feasibility in our future work. Finally, we have made the sparse coded word embedding vectors publicly available in order to facilitate the reproducibility of our results and to foster multilingual and cross-lingual research.

Acknowledgement

The author would like to thank the TACL editors and the anonymous reviewers for their valuable feedback and suggestions.

References

Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. “Floresta sintá(c)tica”: a treebank for Portuguese. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), pages 1698–1703. European Language Resources Association (ELRA).

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192. Association for Computational Linguistics.

Nart B. Atalay, Kemal Oflazer, and Bilge Say. 2003. The annotation process in the Turkish treebank. In
Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC), pages 33–38. Association for Computational Linguistics.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247. Association for Computational Linguistics.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, pages 24–41.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, CoNLL-X ’06, pages 149–164. Association for Computational Linguistics.

Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, and Zhi Jin. 2016. Compressing neural language models by sparse word representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 226–235. Association for Computational Linguistics.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 160–167. Association for Computing Machinery.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

Dóra Csendes, János Csirik, Tibor Gyimóthy, and András Kocsor. 2005. The Szeged Treebank. In Text, Speech and Dialogue, 8th International Conference, TSD 2005 Proceedings, pages 123–131.

Leon Derczynski, Sean Chester, and Kenneth Bøgh. 2015. Tune your brown clustering, please. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 110–117. INCOMA Ltd. Shoumen, Bulgaria.

Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdeněk Žabokrtský, and Andreja Žele. 2006. Towards a Slovene dependency treebank. In Proceedings of the Fifth International Language Resources and Evaluation Conference, LREC 2006, pages 1388–1391. European Language Resources Association (ELRA).

Simonetta Montemagni et al. 2003. Building the Italian syntactic-semantic treebank. In Building and using Parsed Corpora, Language and Speech series, pages 189–210. Kluwer.

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015. Sparse overcomplete word vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1491–1500. Association for Computational Linguistics.

Matthias T. Kromann, Line Mikkelsen, and Stine Kern Lynge. 2004. Danish dependency treebank.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289. Morgan Kaufmann Publishers Inc.

Rémi Lebret and Ronan Collobert. 2014. Word embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–490. Association for Computational Linguistics.

Rémi Lebret and Ronan Collobert. 2015. Rehabilitation of count-based models for word vector representations. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 417–429. Springer.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225.

Percy Liang. 2005. Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology.

Chu-Cheng Lin, Waleed Ammar, Chris Dyer, and Lori Levin. 2015. Unsupervised POS induction with word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1311–1316. Association for Computational Linguistics.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. 2010. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013c. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS ’13, pages 3111–3119. Curran Associates Inc.

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems 26, pages 2265–2273. Curran Associates, Inc.

Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING 2012, pages 1933–1950. The COLING 2012 Organizing Committee.

Joakim Nivre, Jens Nilsson, and Johan Hall. 2006. Talbanken05: A Swedish treebank with phrase structure and dependency annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), pages 1392–1395. European Language Resources Association (ELRA).

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932. Association for Computational Linguistics.

Joakim Nivre et al. 2015. Universal dependencies 1.2. http://hdl.handle.net/11234/1-1548. LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague.

Joakim Nivre, 2015. Towards a Universal Grammar for Natural Language Processing, pages 3–16. Springer International Publishing.

Naoaki Okazaki. 2007. CRFsuite: a fast implementation of Conditional Random Fields (CRFs).

Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 380–390. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2089–2096. European Language Resources Association (ELRA).

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412–418. Association for Computational Linguistics.

Lizhen Qu, Gabriela Ferraro, Liyuan Zhou, Weiwei Hou, Nathan Schneider, and Timothy Baldwin. 2015. Big data small data, in domain out-of domain, known word unknown word: The impact of word representations on sequence labelling tasks. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 83–93. Association for Computational Linguistics.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL ’09, pages 147–155. Association for Computational Linguistics.

Mariona Taulé, M. Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-08), pages 96–101. European Language Resources Association (ELRA).

Kiril Simov and Petya Osenova. 2005. Extending the annotation of BulTreeBank: Phase 2. In The Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), pages 173–184.

Karl Stratos and Michael Collins. 2015. Simple semi-supervised POS tagging. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 79–87. Association for Computational Linguistics.

Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2016. Sparse word embeddings using ℓ1 regularized online learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2915–2921. AAAI Press/International Joint Conferences on Artificial Intelligence.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, pages 142–147. Association for Computational Linguistics.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2002, pages 155–158.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 384–394. Association for Computational Linguistics.

Leonoor van der Beek, Gosse Bouma, Jan Daciuk, Tanja Gaustad, Robert Malouf, Gertjan van Noord, Robbert Prins, and Begoña Villada. 2002. Chapter 5. The Alpino dependency treebank. In Algorithms for Linguistic Processing NWO PIONIER Progress Report.

Dani Yogatama, Manaal Faruqui, Chris Dyer, and Noah Smith. 2015. Learning word representations with hierarchical sparse coding. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 87–96. PMLR.
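As a supplementary illustration of the analysis in Section 4.2.1 (Figures 6 and 7), the following minimal sketch shows how the two quantities discussed there, the ℓ2 norm of each basis vector in D and the relative frequency with which it receives a nonzero coefficient in α, could be computed. It is not the code used for the experiments. The function names, the layout of the matrices (basis vectors as columns of D, one column of α per word), the definition of sparsity as the share of zero coefficients, and the toy numbers are all assumptions made for illustration.

```python
import math


def basis_vector_stats(D, alpha):
    """Return (l2 norms, relative usage frequencies) of the basis vectors.

    Assumed layout (hypothetical, for illustration only):
      D     -- list of d rows with m entries each; column j is basis vector j
      alpha -- list of m rows with n entries each; column k encodes word k
    """
    d, m = len(D), len(D[0])
    # l2 norm of every column of D
    norms = [math.sqrt(sum(D[i][j] ** 2 for i in range(d))) for j in range(m)]
    # share of words whose code assigns a nonzero coefficient to basis vector j
    rel_freq = [sum(1 for a in row if a != 0) / len(row) for row in alpha]
    return norms, rel_freq


def sparsity(alpha):
    """Fraction of coefficients in alpha that are exactly zero (assumed definition)."""
    total = sum(len(row) for row in alpha)
    zeros = sum(1 for row in alpha for a in row if a == 0)
    return zeros / total


# Toy data: two basis vectors in two dimensions and codes for four words.
D = [[3.0, 0.0],
     [4.0, 2.0]]
alpha = [[1.0, 0.0, 2.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]

norms, rel_freq = basis_vector_stats(D, alpha)
print(norms)            # l2 norm per basis vector
print(rel_freq)         # relative frequency of nonzero coefficients per basis vector
print(sparsity(alpha))  # share of zero entries in alpha
```

Plotting rel_freq against norms for each sparse coding variant would give a picture of the kind summarized in Figure 7, under the layout assumptions stated above.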
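Section 4.3 mentions a 17-tag IOBES scheme for NER. With the four entity types of the CoNLL 2002 and 2003 shared tasks (PER, ORG, LOC, MISC), this count follows from four prefixes (B, I, E, S) per type plus the O tag. The sketch below is only an illustration of that scheme, not the preprocessing code of the paper: it assumes the input labels are already in IOB2 form (every entity starts with a B- tag), and the function names are hypothetical.

```python
ENTITY_TYPES = ("PER", "ORG", "LOC", "MISC")  # CoNLL 2002/2003 entity types


def iobes_tagset(types=ENTITY_TYPES):
    """Enumerate the IOBES label set: O plus B-, I-, E-, S- for each type."""
    return ["O"] + [f"{prefix}-{t}" for t in types for prefix in "BIES"]


def iob2_to_iobes(tags):
    """Convert a sentence's IOB2 labels to IOBES labels (assumes valid IOB2 input)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{etype}"  # does the same entity go on?
        if prefix == "B":
            # single-token entities become S-, longer ones keep B-
            out.append(("B-" if continues else "S-") + etype)
        else:
            # the last token of a multi-token entity becomes E-
            out.append(("I-" if continues else "E-") + etype)
    return out


print(len(iobes_tagset()))
print(iob2_to_iobes(["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"]))
```

The E- and S- labels make entity boundaries explicit in the label itself, which is the usual motivation for preferring IOBES over plain IOB labels in sequence labeling.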