Transactions of the Association for Computational Linguistics, vol. 4, pp. 357–370, 2016. Action Editor: Masaaki Nagata.
Submission batch: 11/2015; Revision batch: 3/2016; Published 7/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
Named Entity Recognition with Bidirectional LSTM-CNNs

Jason P.C. Chiu
University of British Columbia
jsonchiu@gmail.com

Eric Nichols
Honda Research Institute Japan Co., Ltd.
e.nichols@jp.honda-ri.com

Abstract

Named entity recognition is a challenging task that has traditionally required large amounts of knowledge in the form of feature engineering and lexicons to achieve high performance. In this paper, we present a novel neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering. We also propose a novel method of encoding partial lexicon matches in neural networks and compare it to existing approaches. Extensive evaluation shows that, given only tokenized text and publicly available word embeddings, our system is competitive on the CoNLL-2003 dataset and surpasses the previously reported state of the art performance on the OntoNotes 5.0 dataset by 2.13 F1 points. By using two lexicons constructed from publicly-available sources, we establish new state of the art performance with an F1 score of 91.62 on CoNLL-2003 and 86.28 on OntoNotes, surpassing systems that employ heavy feature engineering, proprietary lexicons, and rich entity linking information.

1 Introduction

Named entity recognition is an important task in NLP. High performance approaches have been dominated by applying CRF, SVM, or perceptron models to hand-crafted features (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015). However, Collobert et al. (2011b) proposed an effective neural network model that requires little feature engineering and instead learns important features from word embeddings trained on large quantities of unlabelled text, an approach made possible by recent advancements in unsupervised learning of word embeddings on massive amounts of data (Collobert and Weston, 2008; Mikolov et al., 2013) and neural network training algorithms permitting deep architectures (Rumelhart et al., 1986).

Unfortunately there are many limitations to the model proposed by Collobert et al. (2011b). First, it uses a simple feed-forward neural network, which restricts the use of context to a fixed sized window around each word, an approach that discards useful long-distance relations between words. Second, by depending solely on word embeddings, it is unable to exploit explicit character level features such as prefix and suffix, which could be useful especially with rare words where word embeddings are poorly trained. We seek to address these issues by proposing a more powerful neural network model.

A well-studied solution for a neural network to process variable length input and have long term memory is the recurrent neural network (RNN) (Goller and Kuchler, 1996). Recently, RNNs have shown great success in diverse NLP tasks such as speech recognition (Graves et al., 2013), machine translation (Cho et al., 2014), and language modeling (Mikolov et al., 2011). The long-short term memory (LSTM) unit with the forget gate allows highly non-trivial long-distance dependencies to be easily learned (Gers et al., 2000). For sequential labelling tasks such as NER and speech recognition, a bi-directional LSTM model can take into account an effectively infinite amount of context on both sides of a word and eliminates the problem of limited context that applies to any feed-forward model (Graves et al., 2013). While LSTMs have been studied in the past for the NER task by Hammerton (2003), the lack of computational power (which led to the use
of very small models) and quality word embeddings limited their effectiveness.

Convolutional neural networks (CNN) have also been investigated for modeling character-level information, among other NLP tasks. Santos et al. (2015) and Labeau et al. (2015) successfully employed CNNs to extract character-level features for use in NER and POS-tagging respectively. Collobert et al. (2011b) also applied CNNs to semantic role labeling, and variants of the architecture have been applied to parsing and other tasks requiring tree structures (Blunsom et al., 2014). However, the effectiveness of character-level CNNs has not been evaluated for English NER. While we considered using character-level bi-directional LSTMs, which were recently proposed by Ling et al. (2015) for POS-tagging, preliminary evaluation shows that they do not perform significantly better than CNNs while being more computationally expensive to train.

Our main contribution lies in combining these neural network models for the NER task. We present a hybrid model of bi-directional LSTMs and CNNs that learns both character- and word-level features, presenting the first evaluation of such an architecture on well-established English language evaluation datasets. Furthermore, as lexicons are crucial to NER performance, we propose a new lexicon encoding scheme and matching algorithm that can make use of partial matches, and we compare it to the simpler approach of Collobert et al. (2011b). Extensive evaluation shows that our proposed method establishes a new state of the art on both the CoNLL-2003 NER shared task and the OntoNotes 5.0 datasets.

[Figure 1 diagram: for the example sentence "We saw paintings of Picasso", each word's word embedding, additional word features, and CNN-extracted character features feed a forward LSTM and a backward LSTM, then the output layers ("Out"), giving tag scores and the best tag sequence O O O O S-PER.]

Figure 1: The (unrolled) BLSTM for tagging named entities. Multiple tables look up word-level feature vectors. The CNN (Figure 2) extracts a fixed length feature vector from character-level features. For each word, these vectors are concatenated and fed to the BLSTM network and then to the output layers (Figure 3).

[Figure 2 diagram: the characters of "Picasso", with padding on both sides, are mapped to character embeddings and additional character features, then passed through a convolution and a max layer to give the CNN-extracted character features.]

Figure 2: The convolutional neural network extracts character features from each word. The character embedding and (optionally) the character type feature vector are computed through lookup tables. Then, they are concatenated and passed into the CNN.

2 Model

Our neural network is inspired by the work of Collobert et al. (2011b), where lookup tables transform discrete features such as words and characters into continuous vector representations, which are then concatenated and fed into a neural network. Instead of a feed-forward network, we use the bi-directional long-short term memory (BLSTM) network. To induce character-level features, we use a convolutional neural network, which has been successfully applied to Spanish and Portuguese NER (Santos et al., 2015) and German POS-tagging (Labeau et al., 2015).

2.1 Sequence-labelling with BLSTM

Following the speech-recognition framework outlined by Graves et al. (2013), we employed
a stacked¹ bi-directional recurrent neural network with long short-term memory units to transform word features into named entity tag scores. Figures 1, 2, and 3 illustrate the network in detail.

The extracted features of each word are fed into a forward LSTM network and a backward LSTM network. The output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output.

We tried minor variants of output layer architecture and selected the one that performed the best in preliminary experiments.

[Figure 3 diagram: the forward and backward LSTM outputs each pass through a Linear layer and a Log-Softmax layer, and the two results are added.]

Figure 3: The output layers ("Out" in Figure 1) decode output into a score for each tag category.

2.2 Extracting Character Features Using a Convolutional Neural Network

For each word we employ a convolution and a max layer to extract a new feature vector from the per-character feature vectors such as character embeddings (Section 2.3.2) and (optionally) character type (Section 2.5). Words are padded with a number of special PADDING characters on both sides depending on the window size of the CNN.

The hyper-parameters of the CNN are the window size and the output vector size.

| Category | SENNA | DBpedia |
|---|---|---|
| Location | 36,697 | 709,772 |
| Miscellaneous | 4,722 | 328,575 |
| Organization | 6,440 | 231,868 |
| Person | 123,283 | 1,074,363 |
| Total | 171,142 | 2,344,578 |

Table 1: Number of entries for each category in the SENNA lexicon and our DBpedia lexicon.

| Dataset | Train | Dev | Test |
|---|---|---|---|
| CoNLL-2003 | 204,567 (23,499) | 51,578 (5,942) | 46,666 (5,648) |
| OntoNotes 5.0 / CoNLL-2012 | 1,088,503 (81,828) | 147,724 (11,066) | 152,728 (11,257) |

Table 2: Dataset sizes in number of tokens (entities).

2.3 Core Features

2.3.1 Word Embeddings

Our best model uses the publicly available 50-dimensional word embeddings released by Collobert et al. (2011b)², which were trained on Wikipedia and the Reuters RCV-1 corpus.

We also experimented with two other sets of published embeddings, namely Stanford's GloVe embeddings³ trained on 6 billion words from Wikipedia and Web text (Pennington et al., 2014) and Google's word2vec embeddings⁴ trained on 100 billion words from Google News (Mikolov et al., 2013).

In addition, as we hypothesized that word embeddings trained on in-domain text may perform better, we also used the publicly available GloVe (Pennington et al., 2014) program and an in-house re-implementation⁵ of the word2vec (Mikolov et al., 2013) program to train word embeddings on Wikipedia and Reuters RCV1 datasets as well.⁶

¹ For each direction (forward and backward), the input is fed into multiple layers of LSTM units connected in sequence (i.e. LSTM units in the second layer take in the output of the first layer, and so on); the number of layers is a tuned hyper-parameter. Figure 1 shows only one unit for simplicity.
² http://ml.nec-labs.com/senna/
³ http://nlp.stanford.edu/projects/glove/
⁴ https://code.google.com/p/word2vec/
⁵ We used our in-house reimplementation to train word vectors because it uses distributed processing to train much quicker than the publicly-released implementation of word2vec and its performance on the word analogy task was higher than reported by Mikolov et al. (2013).
⁶ While Collobert et al. (2011b) used Wikipedia text from 2007, we used Wikipedia text from 2011.

Following Collobert et al. (2011b), all words are lower-cased before passing through the lookup table
to convert to their corresponding embeddings. The pre-trained embeddings are allowed to be modified during training.⁷

| Text | Hayao | Tada | , | commander | of | the | Japanese | North | China | Area | Army |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LOC | - | - | - | - | - | B | I | - | S | - | - |
| MISC | - | - | - | S | B | B | I | S | S | S | S |
| ORG | - | - | - | - | - | B | I | B | I | I | E |
| PERS | B | E | - | - | - | - | - | - | S | - | - |

Figure 4: Example of how lexicon features are applied. The B, I, E markings indicate that the token matches the Begin, Inside, and End token of an entry in the lexicon. S indicates that the token matches a single-token entry.

2.3.2 Character Embeddings

We randomly initialized a lookup table with values drawn from a uniform distribution with range [−0.5, 0.5] to output a character embedding of 25 dimensions. The character set includes all unique characters in the CoNLL-2003 dataset⁸ plus the special tokens PADDING and UNKNOWN. The PADDING token is used for the CNN, and the UNKNOWN token is used for all other characters (which appear in OntoNotes). The same set of random embeddings was used for all experiments.⁹

2.4 Additional Word-level Features

2.4.1 Capitalization Feature

As capitalization information is erased during lookup of the word embedding, we evaluate Collobert's method of using a separate lookup table to add a capitalization feature with the following options: allCaps, upperInitial, lowercase, mixedCaps, noinfo (Collobert et al., 2011b). This method is compared with the character type feature (Section 2.5) and character-level CNNs.

2.4.2 Lexicons

Most state of the art NER systems make use of lexicons as a form of external knowledge (Ratinov and Roth, 2009; Passos et al., 2014). For each of the four categories (Person, Organization, Location, Misc) defined by the CoNLL 2003 NER shared task, we compiled a list of known named entities from DBpedia (Auer et al., 2007), by extracting all descendants of DBpedia types corresponding to the CoNLL categories.¹⁴ We did not construct separate lexicons for the OntoNotes tagset because correspondences between DBpedia categories and its tags could not be found in many instances. In addition, for each entry we first removed parentheses and all text contained within, then stripped trailing punctuation,¹⁵ and finally tokenized it with the Penn Treebank tokenization script for the purpose of partial matching. Table 1 shows the size of each category in our lexicon compared to Collobert's lexicon, which we extracted from their SENNA system.

Figure 4 shows an example of how the lexicon features are applied.¹⁶ For each lexicon category, we match every n-gram (up to the length of the longest lexicon entry) against entries in the lexicon. A match is successful when the n-gram matches the prefix or suffix of an entry and is at least half the length of the entry. Because of the high potential for spurious matches, for all categories except Person, we discard partial matches less than 2 tokens in length. When there are multiple overlapping matches within the same category, we prefer exact matches over partial matches, and then longer matches over shorter matches, and finally earlier matches in the sentence over later matches. All matches are case insensitive.

⁷ Preliminary experiments showed that modifiable vectors performed better than so-called "frozen vectors."
⁸ Upper and lower case letters, numbers, and punctuation marks.
⁹ We did not experiment with other settings because the English character set is small enough that effective embeddings could be learned directly from the task data.
¹⁰ By increments of 50.
¹¹ Determined by evaluating dev set performance.
¹² Probability of discarding any LSTM output node.
¹³ Mini-batch size was excluded from the round 2 particle swarm hyper-parameter search space due to time constraints.
¹⁴ The Miscellaneous category was populated by entities of the DBpedia categories Artifact and Work.
¹⁵ The punctuation stripped was period, comma, semi-colon, colon, forward slash, backward slash, and question mark.
¹⁶ As can be seen in this example, the lexicons, in particular Miscellaneous, still contain a lot of noise.

For each token in the match, the feature is en-
coded in BIOES annotation (Begin, Inside, Outside, End, Single), indicating the position of the token in the matched entry. In other words, B will not appear in a suffix-only partial match, and E will not appear in a prefix-only partial match.

As we will see in Section 4.5, we found that this more sophisticated method outperforms the method presented by Collobert et al. (2011b), which treats partial and exact matches equally, allows prefix but not suffix matches, allows very short partial matches, and marks tokens with YES/NO.

In addition, since Collobert et al. (2011b) released their lexicon with their SENNA system, we also applied their lexicon to our model for comparison and investigated using both lexicons simultaneously as distinct features. We found that the two lexicons complement each other and improve performance on the CoNLL-2003 dataset.

Our best model uses the SENNA lexicon with exact matching and our DBpedia lexicon with partial matching, with BIOES annotation in both cases.

| Hyper-parameter | CoNLL-2003 (Round 2) Final | CoNLL-2003 (Round 2) Range | OntoNotes 5.0 (Round 1) Final | OntoNotes 5.0 (Round 1) Range |
|---|---|---|---|---|
| Convolution width | 3 | [3, 7] | 3 | [3, 9] |
| CNN output size | 53 | [15, 84] | 20 | [15, 100] |
| LSTM state size | 275 | [100, 500] | 200 | [100, 400]¹⁰ |
| LSTM layers | 1 | [1, 4] | 2 | [2, 4] |
| Learning rate | 0.0105 | [10^-3, 10^-1.8] | 0.008 | [10^-3.5, 10^-1.5] |
| Epochs¹¹ | 80 | - | 18 | - |
| Dropout¹² | 0.68 | [0.25, 0.75] | 0.63 | [0, 1] |
| Mini-batch size | 9 | -¹³ | 9 | [5, 14] |

Table 3: Hyper-parameter search space and final values used for all experiments.

| Round | CoNLL-2003 | OntoNotes 5.0 |
|---|---|---|
| 1 | 93.82 (±0.15) | 84.57 (±0.27) |
| 2 | 94.03 (±0.23) | 84.47 (±0.29) |

Table 4: Development set F1 score performance of the best hyper-parameter settings in each optimization round.

2.5 Additional Character-level Features

A lookup table was used to output a 4-dimensional vector representing the type of the character (upper case, lower case, punctuation, other).

2.6 Training and Inference

2.6.1 Implementation

We implement the neural network using the torch7 library (Collobert et al., 2011a). Training and inference are done on a per-sentence level. The initial states of the LSTM are zero vectors. Except for the character and word embeddings whose initialization has been described previously, all lookup tables are randomly initialized with values drawn from the standard normal distribution.

2.6.2 Objective Function and Inference

We train our network to maximize the sentence-level log-likelihood from Collobert et al. (2011b).¹⁷

First, we define a tag-transition matrix $A$ where $A_{i,j}$ represents the score of jumping from tag $i$ to tag $j$ in successive tokens, and $A_{0,i}$ as the score for starting with tag $i$. This matrix of parameters is also learned. Define $\theta$ as the set of parameters for the neural network, and $\theta' = \theta \cup \{A_{i,j}\ \forall i,j\}$ as the set of all parameters to be trained. Given an example sentence $[x]_1^T$ of length $T$, and defining $[f_\theta]_{i,t}$ as the score output by the neural network for the $t$-th word and $i$-th tag given parameters $\theta$, the score of a sequence of tags $[i]_1^T$ is given as the sum of network and transition scores:

$$S([x]_1^T, [i]_1^T, \theta') = \sum_{t=1}^{T} \left( A_{[i]_{t-1},[i]_t} + [f_\theta]_{[i]_t,t} \right)$$

¹⁷ Much later, we discovered that training with cross entropy objective while performing Viterbi decoding to restrict output to valid tag sequences also appears to work just as well.
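The sequence score defined in Section 2.6.2 can be illustrated with a minimal sketch. This is not the authors' torch7 code: the function name `sequence_score`, the list-of-lists layout, and the separate `start` vector standing in for $A_{0,i}$ are assumptions made for illustration. Note that the code indexes network scores as `emissions[t][i]` (word first, tag second), the reverse of the subscript order in $[f_\theta]_{i,t}$.

```python
from typing import Optional, Sequence


def sequence_score(
    emissions: Sequence[Sequence[float]],    # emissions[t][i]: network score of tag i at word t
    transitions: Sequence[Sequence[float]],  # transitions[i][j]: score A_{i,j} of moving from tag i to tag j
    start: Sequence[float],                  # start[i]: score A_{0,i} of starting with tag i (assumed layout)
    tags: Sequence[int],                     # one tag index per word
) -> float:
    """Sum of transition and network scores for one tag sequence."""
    if len(tags) != len(emissions):
        raise ValueError("need exactly one tag per word")
    score = 0.0
    previous: Optional[int] = None
    for t, tag in enumerate(tags):
        # The first word uses the start score; later words use A_{prev, tag}.
        transition = start[tag] if previous is None else transitions[previous][tag]
        score += transition + emissions[t][tag]
        previous = tag
    return score


if __name__ == "__main__":
    # Toy example with two words and two tags (all values are made up).
    emissions = [[1.0, 0.0], [0.5, 2.0]]
    transitions = [[0.1, -0.2], [0.3, 0.4]]
    start = [0.0, -1.0]
    print(sequence_score(emissions, transitions, start, [0, 1]))
```

For the toy values above, the sequence [0, 1] scores (0.0 + 1.0) + (−0.2 + 2.0), approximately 2.8.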
Then, letting $[y]_1^T$ be the true tag sequence, the sentence-level log-likelihood is obtained by normalizing the above score over all possible tag sequences $[j]_1^T$ using a softmax:

$$\log P([y]_1^T \mid [x]_1^T, \theta') = S([x]_1^T, [y]_1^T, \theta') - \log \sum_{\forall [j]_1^T} e^{S([x]_1^T, [j]_1^T, \theta')}$$

This objective function and its gradients can be efficiently computed by dynamic programming (Collobert et al., 2011b).

At inference time, given neural network outputs $[f_\theta]_{i,t}$ we use the Viterbi algorithm to find the tag sequence $[i]_1^T$ that maximizes the score $S([x]_1^T, [i]_1^T, \theta')$.

| Model | CoNLL-2003 Prec. | CoNLL-2003 Recall | CoNLL-2003 F1 | OntoNotes 5.0 Prec. | OntoNotes 5.0 Recall | OntoNotes 5.0 F1 |
|---|---|---|---|---|---|---|
| FFNN + emb + caps + lex | 89.54 | 89.80 | 89.67 (±0.24) | 74.28 | 73.61 | 73.94 (±0.43) |
| BLSTM | 80.14 | 72.81 | 76.29 (±0.29) | 79.68 | 75.97 | 77.77 (±0.37) |
| BLSTM-CNN | 83.48 | 83.28 | 83.38 (±0.20) | 82.58 | 82.49 | 82.53 (±0.40) |
| BLSTM-CNN + emb | 90.75 | 91.08 | 90.91 (±0.20) | 85.99 | 86.36 | 86.17 (±0.22) |
| BLSTM-CNN + emb + lex | 91.39 | 91.85 | 91.62 (±0.33) | 86.04 | 86.53 | 86.28 (±0.26) |
| Collobert et al. (2011b) | - | - | 88.67 | - | - | - |
| Collobert et al. (2011b) + lexicon | - | - | 89.59 | - | - | - |
| Huang et al. (2015) | - | - | 90.10 | - | - | - |
| Ratinov and Roth (2009)¹⁸ | 91.20 | 90.50 | 90.80 | 82.00 | 84.95 | 83.45 |
| Lin and Wu (2009) | - | - | 90.90 | - | - | - |
| Finkel and Manning (2009)¹⁹ | - | - | - | 84.04 | 80.86 | 82.42 |
| Suzuki et al. (2011) | - | - | 91.02 | - | - | - |
| Passos et al. (2014)²⁰ | - | - | 90.90 | - | - | 82.24 |
| Durrett and Klein (2014) | - | - | - | 85.22 | 82.89 | 84.04 |
| Luo et al. (2015)²¹ | 91.50 | 91.40 | 91.20 | - | - | - |

Table 5: Results of our models, with various feature sets, compared to other published results. The three sections are, in order, our models, published neural network models, and published non-neural network models. For the features, emb = Collobert word embeddings, caps = capitalization feature, lex = lexicon features from both SENNA and DBpedia lexicons. For F1 scores, standard deviations are in parentheses.

2.6.3 Tagging Scheme

The output tags are annotated with BIOES (which stands for Begin, Inside, Outside, End, Single, indicating the position of the token in the entity) as this scheme has been reported to outperform others such as BIO (Ratinov and Roth, 2009).

2.6.4 Learning Algorithm

Training is done by mini-batch stochastic gradient descent (SGD) with a fixed learning rate. Each mini-batch consists of multiple sentences with the same number of tokens. We found applying dropout to the output nodes²² of each LSTM layer (Pham et al., 2014) was quite effective in reducing overfitting (Section 4.4). We explored other more sophisticated optimization algorithms such as momentum (Nesterov, 1983), AdaDelta (Zeiler, 2012), and RMSProp (Hinton et al., 2012), and in preliminary experiments they did not improve upon plain SGD.

3 Evaluation

Evaluation was performed on the well-established CoNLL-2003 NER shared task dataset (Tjong Kim Sang and De Meulder, 2003) and the much larger but less-studied OntoNotes 5.0 dataset (Hovy et al., 2006; Pradhan et al., 2013). Table 2 gives an overview of these two different datasets.

For each experiment, we report the average and standard deviation of 10 successful trials.

¹⁸ OntoNotes results taken from (Durrett and Klein, 2014).
¹⁹ Evaluation on OntoNotes 5.0 done by Pradhan et al. (2013).
²⁰ Not directly comparable as they evaluated on an earlier version of the corpus with a different data split.
²¹ Numbers taken from the original paper (Luo et al., 2015). While the precision, recall, and F1 scores are clearly inconsistent, it is unclear in which way they are incorrect.
²² Adding dropout to inputs seems to have an adverse effect.
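The dynamic programming mentioned in Section 2.6.2 can be sketched as follows: a forward recursion computes the log of the normalizing sum over all tag sequences, and the Viterbi recursion finds the highest-scoring sequence. This is an illustrative sketch in plain Python, not the authors' torch7 implementation. All function names, the `start` vector standing in for $A_{0,i}$, and the `emissions[t][i]` layout are assumptions, and sentences are assumed to be non-empty.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions: Matrix, transitions: Matrix,
               start: Sequence[float], tags: Sequence[int]) -> float:
    """Score S of one tag sequence: transition scores plus network scores."""
    score = start[tags[0]] + emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions: Matrix, transitions: Matrix,
                  start: Sequence[float]) -> float:
    """Forward recursion: log of the sum of exp(S) over all tag sequences."""
    n = len(start)
    alpha = [start[j] + emissions[0][j] for j in range(n)]
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n)])
            + emissions[t][j]
            for j in range(n)
        ]
    return logsumexp(alpha)


def log_likelihood(emissions: Matrix, transitions: Matrix,
                   start: Sequence[float], gold: Sequence[int]) -> float:
    """Sentence-level log-likelihood of the gold tag sequence."""
    return (path_score(emissions, transitions, start, gold)
            - log_partition(emissions, transitions, start))


def viterbi(emissions: Matrix, transitions: Matrix,
            start: Sequence[float]) -> Tuple[List[int], float]:
    """Highest-scoring tag sequence and its score."""
    n = len(start)
    delta = [start[j] + emissions[0][j] for j in range(n)]
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        pointers: List[int] = []
        new_delta: List[float] = []
        for j in range(n):
            # Best previous tag for ending at tag j on word t.
            best_i = max(range(n), key=lambda i: delta[i] + transitions[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] + transitions[best_i][j] + emissions[t][j])
        backpointers.append(pointers)
        delta = new_delta
    last = max(range(n), key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]
```

Both recursions cost on the order of T·K² operations for T words and K tags, instead of enumerating all K^T sequences. Because the log-likelihood is a normalized score, exponentiating it and summing over every possible tag sequence gives 1, which is a convenient sanity check on small inputs.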
| Features | BLSTM CoNLL | BLSTM OntoNotes | BLSTM-CNN CoNLL | BLSTM-CNN OntoNotes | BLSTM-CNN + lex CoNLL | BLSTM-CNN + lex OntoNotes |
|---|---|---|---|---|---|---|
| none | 76.29 (±0.29) | 77.77 (±0.37) | 83.38 (±0.20) | 82.53 (±0.40) | 87.77 (±0.29) | 83.82 (±0.19) |
| emb | 88.23 (±0.23) | 82.72 (±0.23) | 90.91 (±0.20) | 86.17 (±0.22) | 91.62 (±0.33) | 86.28 (±0.26) |
| emb + caps | 90.67 (±0.16) | 86.19 (±0.25) | 90.98 (±0.18) | 86.35 (±0.28) | 91.55 (±0.19)* | 86.28 (±0.32)* |
| emb + caps + lex | 91.43 (±0.17) | 86.21 (±0.16) | 91.55 (±0.19)* | 86.28 (±0.32)* | 91.55 (±0.19)* | 86.28 (±0.32)* |
| emb + char | - | - | 90.88 (±0.48) | 86.08 (±0.40) | 91.44 (±0.23) | 86.34 (±0.18) |
| emb + char + caps | - | - | 90.88 (±0.31) | 86.41 (±0.22) | 91.48 (±0.23) | 86.33 (±0.26) |

Table 6: F1 score results of BLSTM and BLSTM-CNN models with various additional features; emb = Collobert word embeddings, char = character type feature, caps = capitalization feature, lex = lexicon features. Note that starred results are repeated for ease of comparison.

3.1 Dataset Preprocessing

For all datasets, we performed the following preprocessing:

• All digit sequences are replaced by a single "0".
• Before training, we group sentences by word length into mini-batches and shuffle them.

In addition, for the OntoNotes dataset, in order to handle the Date, Time, Money, Percent, Quantity, Ordinal, and Cardinal named entity tags, we split tokens before and after every digit.

3.2 CoNLL 2003 Dataset

The CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003) consists of newswire from the Reuters RCV1 corpus tagged with four types of named entities: location, organization, person, and miscellaneous. As the dataset is small compared to OntoNotes, we trained the model on both the training and development sets after performing hyper-parameter optimization on the development set.

3.3 OntoNotes 5.0 Dataset

Pradhan et al. (2013) compiled a core portion of the OntoNotes 5.0 dataset for the CoNLL-2012 shared task and described a standard train/dev/test split, which we use for our evaluation. Following Durrett and Klein (2014), we applied our model to the portion of the dataset with gold-standard named entity annotations; the New Testaments portion was excluded for lacking gold-standard annotations. This dataset is much larger than CoNLL-2003 and consists of text from a wide variety of sources, such as broadcast conversation, broadcast news, newswire, magazine, telephone conversation, and Web text.

3.4 Hyper-parameter Optimization

We performed two rounds of hyper-parameter optimization and selected the best settings based on development set performance.²³ Table 3 shows the final hyper-parameters, and Table 4 shows the dev set performance of the best models in each round.

In the first round, we performed random search and selected the best hyper-parameters over the development set of the CoNLL-2003 data. We evaluated around 500 hyper-parameter settings. Then, we took the same settings and tuned the learning rate and epochs on the OntoNotes development set.²⁴

For the second round, we performed independent hyper-parameter searches on each dataset using Optunity's implementation of particle swarm (Claesen et al.,), as there is some evidence that it is more efficient than random search (Clerc and Kennedy, 2002). We evaluated 500 hyper-parameter settings this round as well. As we later found that training occasionally fails (Section 3.5) and that results vary considerably from run to run, we ran the top 5 settings from each dataset for 10 trials each and selected the best one based on averaged dev set performance.

For CoNLL-2003, we found that particle swarm produced better hyper-parameters than random search. However, surprisingly, for OntoNotes particle swarm was unable to produce better hyper-parameters than those from the ad-hoc approach in round 1. We also tried tuning the CoNLL-2003 hyper-parameters from round 2 for OntoNotes and that was not any better²⁵ either.

²³ Hyper-parameter optimization was done with the BLSTM-CNN + emb + lex feature set, as it had the best performance.
²⁴ Selected based on dev set performance of a few runs.
²⁵ The result is 84.41 (±0.33) on the OntoNotes dev set.

We trained CoNLL-2003 models for a large num-
ber of epochs because we observed that the models did not exhibit overtraining and instead continued to slowly improve on the development set long after reaching near 100% accuracy on the training set. In contrast, despite OntoNotes being much larger than CoNLL-2003, training for more than about 18 epochs causes performance on the development set to decline steadily due to overfitting.

| Word Embeddings | CoNLL-2003 | OntoNotes |
|---|---|---|
| Random 50d | 87.77 (±0.29) | 83.82 (±0.19) |
| Random 300d | 87.84 (±0.23) | 83.76 (±0.37) |
| GloVe 6B 50d | 91.09 (±0.15) | 86.25 (±0.24) |
| GloVe 6B 300d | 90.71 (±0.21) | 86.26 (±0.30) |
| Google 100B 300d | 90.60 (±0.23) | 85.34 (±0.25) |
| Collobert 50d | 91.62 (±0.33) | 86.28 (±0.26) |
| Our GloVe 50d | 91.41 (±0.21) | 86.24 (±0.35) |
| Our Skip-gram 50d | 90.76 (±0.23) | 85.70 (±0.29) |

Table 7: F1 scores when the Collobert word vectors are replaced. We tried 50- and 300-dimensional random vectors (Random 50d, Random 300d); GloVe's released vectors trained on 6 billion words (GloVe 6B 50d, GloVe 6B 300d); Google's released 300-dimensional vectors trained on 100 billion words from Google News (Google 100B 300d); and 50-dimensional GloVe and word2vec skip-gram vectors that we trained on Wikipedia and Reuters RCV-1 (Our GloVe 50d, Our Skip-gram 50d).

3.5 Excluding Failed Trials

On the CoNLL-2003 dataset, while BLSTM models completed training without difficulty, the BLSTM-CNN models fail to converge around 5∼10% of the time depending on feature set. Similarly, on OntoNotes, 1.5% of trials fail. We found that using a lower learning rate reduces the failure rate. We also tried clipping gradients and using AdaDelta, and each of them was effective at eliminating such failures by itself. AdaDelta, however, made training more expensive with no gain in model performance.

In any case, for all experiments we excluded trials where the final F1 score on a subset of training data falls below a certain threshold, and continued to run trials until we obtained 10 successful ones.

For CoNLL-2003, we excluded trials where the final F1 score on the development set was less than 95; there was no ambiguity in selecting the threshold as every trial scored either above 98 or below 90. For OntoNotes, the threshold was an F1 score of 80 on the last 5,000 sentences of the training set; every trial scored either above 80 or below 75.

3.6 Training and Tagging Speed

On an Intel Xeon E5-2697 processor, training takes about 6 hours while tagging the test set takes about 12 seconds for CoNLL-2003. The times are 10 hours and 60 seconds respectively for OntoNotes.

4 Results and Discussion

Table 5 shows the results for all datasets. To the best of our knowledge, our best models have surpassed the previous highest reported F1 scores for both CoNLL-2003 and OntoNotes. In particular, with no external knowledge other than word embeddings, our model is competitive on the CoNLL-2003 dataset and establishes a new state of the art for OntoNotes, suggesting that given enough data, the neural network automatically learns the relevant features for NER without feature engineering.

4.1 Comparison with FFNNs

We re-implemented the FFNN model of Collobert et al. (2011b) as a baseline for comparison. Table 5 shows that while performing reasonably well on CoNLL-2003, FFNNs are clearly inadequate for OntoNotes, which has a larger domain, showing that LSTM models are essential for NER.

4.2 Character-level CNNs vs. Character Type and Capitalization Features

The comparison of models in Table 6 shows that on CoNLL-2003, BLSTM-CNN models significantly²⁶ outperform the BLSTM models when given the same feature set. This effect is smaller and not statistically significant on OntoNotes when capitalization features are added. Adding character type and capitalization features to the BLSTM-CNN models degrades performance for CoNLL and mostly improves performance on OntoNotes, suggesting character-level CNNs can replace hand-crafted character features in some cases, but systems with weak lexicons may benefit from character features.

²⁶ Wilcoxon rank sum test, p < 0.05 when comparing the four BLSTM models with the corresponding BLSTM-CNN models using the same feature set. The Wilcoxon rank sum test was selected for its robustness against small sample sizes when the distribution is unknown.

| Dropout | CoNLL-2003 Dev | CoNLL-2003 Test | OntoNotes 5.0 Dev | OntoNotes 5.0 Test |
|---|---|---|---|---|
| - | 93.72 (±0.10) | 90.76 (±0.22) | 82.02 (±0.49) | 84.06 (±0.50) |
| 0.10 | 93.85 (±0.18) | 90.87 (±0.31) | 83.01 (±0.39) | 84.94 (±0.25) |
| 0.30 | 94.08 (±0.17) | 91.09 (±0.18) | 83.61 (±0.32) | 85.44 (±0.33) |
| 0.50 | 94.19 (±0.18) | 91.14 (±0.35) | 84.35 (±0.23) | 86.36 (±0.28) |
| 0.63 | - | - | 84.47 (±0.23) | 86.29 (±0.25) |
| 0.68 | 94.31 (±0.15) | 91.23 (±0.16) | - | - |
| 0.70 | 94.31 (±0.24) | 91.17 (±0.37) | 84.56 (±0.40) | 86.17 (±0.25) |
| 0.90 | 94.17 (±0.17) | 90.67 (±0.17) | 81.38 (±0.19) | 82.16 (±0.18) |

Table 8: F1 score results with various dropout values. Models were trained using only the training set for each dataset. All other experiments use dropout = 0.68 for CoNLL-2003 and dropout = 0.63 for OntoNotes 5.0.

4.3 Word Embeddings

Table 5 and Table 7 show that we obtain a large, significant²⁷ improvement when trained word embeddings are used, as opposed to random embeddings, regardless of the additional features used. This is consistent with the results of Collobert et al. (2011b).

Table 7 compares the performance of different word embeddings in our best model in Table 5 (BLSTM-CNN + emb + lex). For CoNLL-2003, the publicly available GloVe and Google embeddings are about one point behind Collobert's embeddings. For OntoNotes, GloVe embeddings perform close to Collobert embeddings while Google embeddings are again one point behind. In addition, 300-dimensional embeddings present no significant improvement over 50-dimensional embeddings, a result previously reported by Turian et al. (2010).

One possible reason that Collobert embeddings perform better than other publicly available embeddings on CoNLL-2003 is that they are trained on the Reuters RCV-1 corpus, the source of the CoNLL-2003 dataset, whereas the other embeddings are not.²⁸ On the other hand, we suspect that Google's embeddings perform poorly because of vocabulary mismatch: in particular, Google's embeddings were trained in a case-sensitive manner, and embeddings for many common punctuation marks and symbols were not provided. To test these hypotheses, we performed experiments with new word embeddings trained using GloVe and word2vec, with vocabulary list and corpus similar to Collobert et al. (2011b). As shown in Table 7, our GloVe embeddings improved significantly²⁹ over publicly available embeddings on CoNLL-2003, and our word2vec skip-gram embeddings improved significantly³⁰ over Google's embeddings on OntoNotes.

Due to time constraints we did not perform new hyper-parameter searches with any of the word embeddings. As word embedding quality depends on hyper-parameter choice during their training (Pennington et al., 2014), and also, in our NER neural network, hyper-parameter choice is likely sensitive to the type of word embeddings used, optimizing them all will likely produce better results and provide a fairer comparison of word embedding quality.

4.4 Effect of Dropout

Table 8 compares the result of various dropout values for each dataset. The models are trained using only the training set for each dataset to isolate the effect of dropout on both dev and test sets. All other hyper-parameters and features remain the same as our best model in Table 5. In both datasets and on both dev and test sets, dropout is essential for state of the art performance, and the improvement is statistically significant.³¹ Dropout is optimized on the dev set, as described in Section 3.4. Hence, the chosen value may not be the best-performing in Table 8.

²⁷ Wilcoxon rank sum test, p < 0.001.
²⁸ To make a direct comparison to Collobert et al. (2011b), we do not exclude the CoNLL-2003 NER task test data from the word vector training data. While it is possible that this difference could be responsible for the disparate performance of word vectors, the CoNLL-2003 training data comprises only 20k out of 800 million words, or roughly 0.0025% of the total data; in an unsupervised training scheme, the effects are likely negligible.
²⁹ Wilcoxon rank sum test, p < 0.01.
³⁰ Wilcoxon rank sum test, p < 0.01.
³¹ Wilcoxon rank sum test, no dropout vs. best setting: p < 0.001 for the CoNLL-2003 test set, p < 0.0001 for the OntoNotes 5.0 test set, p < 0.0005 for all others.

[Figure 5 heat maps, one panel for CoNLL and one for OntoNotes. Axes: Entity Tag and Lexicon Match (LOC, MISC, ORG, PER, Any).]

Figure 5: Fraction of named entities of each tag category matched completely by entries in each lexicon category of the SENNA/DBpedia combined lexicon. White = higher fraction.

4.5 Lexicon Features

Table 6 shows that on the CoNLL-2003 dataset, using features from both the SENNA lexicon and our proposed DBpedia lexicon provides a significant³² improvement and allows our model to clearly surpass the previous state of the art.

Unfortunately the difference is minuscule for OntoNotes, most likely because our lexicon does not match DBpedia categories well. Figure 5 shows that on CoNLL-2003, lexicon coverage is reasonable and matches the tag set for everything except the catch-all MISC category. For example, LOC entries in the lexicon match mostly LOC named entities and vice versa. However, on OntoNotes, the matches are noisy and the correspondence between lexicon match and tag category is quite ambiguous. For example, all lexicon categories have spurious matches in unrelated named entities like CARDINAL, and LOC, GPE, and LANGUAGE entities all get a lot of matches from the LOC category in the lexicon. In addition, named entities in categories like NORP, ORG, LAW, PRODUCT receive little coverage. The lower coverage, noise, and ambiguity all contribute to the disappointing performance. This suggests that the DBpedia lexicon construction method needs to be improved. A reasonable place to start would be the DBpedia category to OntoNotes NE tag mappings.

In order to isolate the contribution of each lexicon and matching method, we compare different sources and matching methods on a BLSTM-CNN model with randomly initialized word embeddings and no other features or sources of external knowledge. Table 9 shows the results. In this weakened model, both lexicons contribute significant³³ improvements over the baseline.

Compared to the SENNA lexicon, our DBpedia lexicon is noisier but has broader coverage, which explains why when applying it using the same method as Collobert et al. (2011b), it performs worse on CoNLL-2003 but better on OntoNotes, a dataset containing many more obscure named entities. However, we suspect that the method of Collobert et al. (2011b) is not noise resistant and therefore unsuitable for our lexicon because it fails to distinguish exact and partial matches³⁴ and does not set a minimum length for partial matching.³⁵ Instead, when we apply our superior partial matching algorithm and BIOES encoding with our DBpedia lexicon, we gain a significant³⁶ improvement, allowing our lexicon to perform similarly to the SENNA lexicon. Unfortunately, as we could not reliably remove partial entries from the SENNA lexicon, we were unable to investigate whether or not our lexicon matching method would help in that lexicon.

In addition, using both lexicons together as distinct features provides a further improvement³⁷ on CoNLL-2003, which we suspect is because the lexicons are complementary; the SENNA lexicon is relatively clean and tailored to newswire, whereas the DBpedia lexicon is noisier but has high coverage.

³² Wilcoxon rank sum test, p < 0.001.
³³ Wilcoxon rank sum test, p < 0.05 for SENNA-Exact-BIOES, p < 0.005 for all others.
³⁴ We achieve this by using BIOES encoding and prioritizing exact matches over partial matches.
³⁵ Matching only the first word of a long entry is not very useful; this is not a problem in the SENNA lexicon because 99% of its entries contain only 3 tokens or less.
³⁶ Wilcoxon rank sum test, p < 0.001.
³⁷ Wilcoxon rank sum test, p < 0.001.

| Lexicon | Matching | Encoding | CoNLL-2003 | OntoNotes |
|---|---|---|---|---|
| No lexicon | - | - | 83.38 (±0.20) | 82.53 (±0.40) |
| SENNA | Exact | YN | 86.21 (±0.39) | 83.24 (±0.33) |
| SENNA | Exact | BIOES | 86.14 (±0.48) | 83.01 (±0.52) |
| DBpedia | Exact | YN | 84.93 (±0.30) | 83.15 (±0.26) |
| DBpedia | Exact | BIOES | 85.02 (±0.23) | 83.39 (±0.39) |
| DBpedia | Partial | YN | 85.72 (±0.45) | 83.25 (±0.33) |
| DBpedia | Partial | BIOES | 86.18 (±0.56) | 83.97 (±0.38) |
| DBpedia | Collobert's method | | 85.01 (±0.31) | 83.24 (±0.26) |
| Both | Best combination | | 87.77 (±0.29) | 83.82 (±0.19) |

Table 9: Comparison of lexicon and matching/encoding methods over the BLSTM-CNN model employing random embeddings and no other features. When using both lexicons, the best combination of matching and encoding is Exact-BIOES for SENNA and Partial-BIOES for DBpedia. Note that the SENNA lexicon already contains "partial entries" so exact matching in that case is really just a more primitive form of partial matching.

4.6 Analysis of OntoNotes Performance

Table 10 shows the per-genre breakdown of the OntoNotes results. As expected, our model performs best on clean text like broadcast news (BN) and newswire (NW), and worst on noisy text like telephone conversation (TC) and Web text (WB). Our model also substantially improves over previous work on all genres except TC, where the small size of the training data likely hinders learning. Finally, the performance characteristics of our model appear to be quite different from those of the previous CRF models (Finkel and Manning, 2009; Durrett and Klein, 2014), likely because we apply a completely different machine learning method.

³⁸ We downloaded their publicly released software and model to perform the per-genre evaluation.

5 Related Research

Named entity recognition is a task with a long history. In this section, we summarize the works we compare with and that influenced our approach.

5.1 Named Entity Recognition

Most recent approaches to NER have been characterized by the use of CRF, SVM, and perceptron models, where performance is heavily dependent on feature engineering. Ratinov and Roth (2009) used non-local features, a gazetteer extracted from Wikipedia, and Brown-cluster-like word representations, and achieved an F1 score of 90.80 on
CoNLL-2003.LinandWu(2009)surpassedthemwithoutusingagazetteerbyinsteadusingphrasefeaturesobtainedbyperformingk-meansclusteringoveraprivatedatabaseofsearchenginequerylogs.Passosetal.(2014)obtainednearlythesameperformanceusingonlypublicdatabytrainingphrasevectorsintheirlexicon-infusedskip-grammodel.Inordertocombattheproblemofsparsefeatures,Suzukietal.(2011)employedlarge-scaleunlabelleddatatoper-formfeaturereductionandachievedanF1scoreof91.02onCoNLL-2003,whichisthecurrentstateoftheartforsystemswithoutexternalknowledge.TraininganNERsystemtogetherwithrelatedtaskssuchasentitylinkinghasrecentlybeenshowntoimprovethestateoftheart.DurrettandKlein(2014)combinedcoreferenceresolution,entitylink-ing,andNERintoasingleCRFmodelandaddedcross-taskinteractionfactors.TheirsystemachievedstateoftheartresultsontheOntoNotesdataset,buttheydidnotevaluateontheCoNLL-2003datasetduetolackofcoreferenceannotations.Luoetal.(2015)achievedstateoftheartresultsonCoNLL-2003bytrainingajointmodelovertheNERandentitylinkingtasks,thepairoftaskswhoseinter-dependenciescontributedthemosttotheworkofDurrettandKlein(2014).5.2NERwithNeuralNetworksWhilemanyapproachesinvolveCRFmodels,therehasalsobeenalonghistoryofresearchinvolvingneuralnetworks.Earlyattemptswerehinderedby l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 1 0 4 1 5 6 7 3 9 2 / / t l a c _ a _ 0 0 1 0 4 p d . 
f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 368 ModelBCBNMZNWTCWBTestsetsize(#tokens)32,57623,55718,26051,66711,01519,348Testsetsize(#entities)1,6972,1841,1634,6963801,137FinkelandManning(2009)78.6687.2982.4585.5067.2772.56DurrettandKlein(2014)3878.8887.3982.4687.6072.6876.17BLSTM-CNN81.2686.8779.9485.2767.8272.11BLSTM-CNN+emb85.0589.9384.3188.3572.4477.90BLSTM-CNN+emb+lex85.2389.9384.4588.3972.3978.38Table10:PergenreF1scoresonOntoNotes.BC=broadcastconversation,BN=broadcastnews,MZ=magazine,NW=newswire,TC=telephoneconversation,WB=blogsandnewsgroupslackofcomputationalpower,scalablelearningalgo-rithms,andhighqualitywordembeddings.Petasisetal.(2000)usedafeed-forwardneuralnetworkwithonehiddenlayeronNERandachievedstate-of-the-artresultsontheMUC6dataset.TheirapproachusedonlyPOStagandgazetteertagsforeachword,withnowordembeddings.Hammerton(2003)attemptedNERwithasingle-directionLSTMnetworkandacombinationofwordvectorstrainedusingself-organizingmapsandcon-textvectorsobtainedusingprinciplecomponentanalysis.However,whileourmethodoptimizeslog-likelihoodandusessoftmax,theyusedadifferentoutputencodingandoptimizedanunspecifiedob-jectivefunction.Hammerton’s(2003)reportedre-sultswereonlyslightlyabovebaselinemodels.Muchlater,withtheadventofneuralwordembeddings,Collobertetal.(2011b)presentedSENNA,whichemploysadeepFFNNandwordembeddingstoachievenearstateoftheartresultsonPOStagging,chunking,NER,andSRL.Webuildontheirapproach,sharingthewordembeddings,fea-tureencodingmethod,andobjectivefunctions.Recently,Santosetal.(2015)presentedtheirCharWNNnetwork,whichaugmentstheneuralnet-workofCollobertetal.(2011b)withcharacterlevelCNNs,andtheyreportedimprovedperformanceonSpanishandPortugueseNER.Wehavesuccessfullyincorporatedcharacter-levelCNNsintoourmodel.Therehavebeenvariousothersimilararchitec-tureproposedforvarioussequentiallabelingNLPtasks.Huangetal.(2015)usedaBLSTMforthePOS-tagging,chunking,andNERtasks,buttheyemployedheavyfeatureengineeringinsteadofusingaCNNtoautomaticallyextractcharacter-level
features.Labeauetal.(2015)usedaBRNNwithcharacter-levelCNNstoperformGermanPOS-tagging;ourmodeldiffersinthatweusethemorepowerfulLSTMunit,whichwefoundtoperformbetterthanRNNsinpreliminaryexperiments,andthatweemploywordembeddings,whichismuchmoreimportantinNERthaninPOStagging.Lingetal.(2015)usedbothword-andcharacter-levelBLSTMstoestablishthecurrentstateoftheartforEnglishPOStagging.WhileusingBLSTMsin-steadofCNNsallowsextractionofmoresophisti-catedcharacter-levelfeatures,wefoundinprelim-inaryexperimentsthatforNERitdidnotperformsignificantlybetterthanCNNsandwassubstantiallymorecomputationallyexpensivetotrain.6ConclusionWehaveshownthatourneuralnetworkmodel,whichincorporatesabidirectionalLSTMandacharacter-levelCNNandwhichbenefitsfromrobusttrainingthroughdropout,achievesstate-of-the-artresultsinnamedentityrecognitionwithlittlefeatureengineering.OurmodelimprovesoverpreviousbestreportedresultsontwomajordatasetsforNER,sug-gestingthatthemodeliscapableoflearningcom-plexrelationshipsfromlargeamountsofdata.Preliminaryevaluationofourpartialmatchinglexiconalgorithmsuggeststhatperformancecouldbefurtherimprovedthroughmoreflexibleappli-cationofexistinglexicons.Evaluationofexistingwordembeddingssuggeststhatthedomainoftrain-ingdataisasimportantasthetrainingalgorithm.Moreeffectiveconstructionandapplicationoflexiconsandwordembeddingsareareasthatrequiremoreresearch.Inthefuture,wewouldalsoliketoextendourmodeltoperformsimilartaskssuchasextendedtagsetNERandentitylinking. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 1 0 4 1 5 6 7 3 9 2 / / t l a c _ a _ 0 0 1 0 4 p d . 
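As an illustration of the exact-versus-partial lexicon matching with BIOES encoding evaluated in Section 4.5 and Table 9, the following is a minimal sketch for a single lexicon category. It is not the implementation used in the experiments. The function names, the half-entry-length condition for partial matches, the default minimum partial length of 2 tokens, and the tie-breaking order after "exact before partial" (longer before shorter, earlier before later) are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Entry = Tuple[str, ...]          # one lexicon entry as lowercased tokens
Span = Tuple[int, int, bool]     # (start, end_exclusive, is_exact)


def find_matches(tokens: List[str], lexicon: Set[Entry],
                 min_partial_len: int = 2) -> List[Span]:
    """Collect exact and partial lexicon matches for one category.

    Assumption: a partial match is an n-gram equal to a prefix or suffix
    of a longer entry, covering at least half of that entry and at least
    `min_partial_len` tokens.
    """
    lowered = [t.lower() for t in tokens]
    max_len = max((len(e) for e in lexicon), default=0)
    spans: List[Span] = []
    for start in range(len(lowered)):
        for length in range(1, min(max_len, len(lowered) - start) + 1):
            ngram = tuple(lowered[start:start + length])
            if ngram in lexicon:
                spans.append((start, start + length, True))
                continue
            if length < min_partial_len:
                continue  # too short to count as a partial match
            for entry in lexicon:
                if (len(entry) > length and 2 * length >= len(entry)
                        and (entry[:length] == ngram
                             or entry[-length:] == ngram)):
                    spans.append((start, start + length, False))
                    break
    return spans


def encode_bioes(tokens: List[str], lexicon: Set[Entry],
                 min_partial_len: int = 2) -> List[str]:
    """Return one BIOES lexicon-match tag per token ("O" = no match)."""
    spans = find_matches(tokens, lexicon, min_partial_len)
    # Exact matches first, then longer spans, then earlier spans (assumed order).
    spans.sort(key=lambda s: (not s[2], -(s[1] - s[0]), s[0]))
    tags = ["O"] * len(tokens)
    for start, end, _ in spans:
        if any(tags[i] != "O" for i in range(start, end)):
            continue  # overlaps a higher-priority match
        if end - start == 1:
            tags[start] = "S"
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "E"
    return tags


if __name__ == "__main__":
    toy_lexicon = {("hong", "kong"),
                   ("university", "of", "british", "columbia")}
    sentence = "He left Hong Kong for British Columbia".split()
    print(encode_bioes(sentence, toy_lexicon))
    # ['O', 'O', 'B', 'E', 'O', 'B', 'E']
```

In this toy example "Hong Kong" is an exact match, while "British Columbia" matches only the suffix of a longer entry and is therefore a partial match. Raising `min_partial_len` to 3 would drop the partial match and leave those two tokens tagged "O", which shows how a minimum length filters short, potentially spurious partial matches.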
Acknowledgments

This research was supported by Honda Research Institute Japan Co., Ltd. The authors would like to thank Collobert et al. (2011b) for releasing SENNA with its word vectors and lexicon, the torch7 framework contributors, and Andrey Karpathy for the reference LSTM implementation.

References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer.

Phil Blunsom, Edward Grefenstette, Nal Kalchbrenner, et al. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111. Association for Computational Linguistics.

Marc Claesen, Jaak Simm, Dusan Popovic, Yves Moreau, and Bart De Moor. Easy hyperparameter search using Optunity. In Proceedings of the International Workshop on Technical Computing for Machine Learning and Mathematical Engineering.

Maurice Clerc and James Kennedy. 2002. The particle swarm - explosion, stability, and convergence in a multidimensional complex space. Evolutionary Computation, IEEE Transactions on, 6(1):58–73.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. 2011a. Torch7: A Matlab-like environment for machine learning. In Proceedings of BigLearn, NIPS Workshop, number EPFL-CONF-192376.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011b. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490.

Jenny Rose Finkel and Christopher D Manning. 2009. Joint parsing and named entity recognition. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 326–334. Association for Computational Linguistics.

Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471.

Christoph Goller and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Neural Networks, 1996., IEEE International Conference on, volume 1, pages 347–352. IEEE.

Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649.

James Hammerton. 2003. Named entity recognition with long short-term memory. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 172–175. Association for Computational Linguistics.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. 2012. Lecture 6e: RMSProp: divide the gradient by a running average of its recent magnitude. In Neural Networks for Machine Learning. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60. Association for Computational Linguistics.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

Matthieu Labeau, Kevin Löser, and Alexandre Allauzen. 2015. Non-lexical neural architecture for fine-grained POS tagging. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 232–237. Association for Computational Linguistics.

Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1030–1038. Association for Computational Linguistics.

Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530. Association for Computational Linguistics.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 879–888. Association for Computational Linguistics.

Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Cernocky. 2011. RNNLM - recurrent neural network language modeling toolkit. In Proceedings of the 2011 ASRU Workshop, pages 196–201.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Twenty-seventh Annual Conference on Advances in Neural Information Processing Systems, pages 3111–3119.

Yurii Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376.

Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 78–86. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

G Petasis, S Petridis, G Paliouras, V Karkaletsis, S J Perantonis, and C D Spyropoulos. 2000. Symbolic and neural learning for named-entity recognition. In Proceedings of the Symposium on Computational Intelligence and Learning, pages 58–66. Citeseer.

Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. 2014. Dropout improves recurrent neural networks for handwriting recognition. In Proceedings of the 14th International Conference on Frontiers in Handwriting Recognition, pages 285–290. IEEE.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152. Association for Computational Linguistics.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics.

David Rumelhart, Geoffrey Hinton, and Ronald Williams. 1986. Learning representations by back-propagating errors. Nature, pages 323–533.

Cícero Santos, Victor Guimaraes, RJ Niterói, and Rio de Janeiro. 2015. Boosting named entity recognition with neural character embeddings. In Proceedings of the Fifth Named Entities Workshop, pages 25–33.

Jun Suzuki, Hideki Isozaki, and Masaaki Nagata. 2011. Learning condensed feature representations from large unsupervised data sets for supervised learning. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, pages 636–641. Association for Computational Linguistics.

Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147. Association for Computational Linguistics.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.

Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701.