Transactions of the Association for Computational Linguistics, 2 (2014) 15–26. Action Editor: Sharon Goldwater.
Submitted 9/2013; Revised 11/2013; Published 2/2014. © 2014 Association for Computational Linguistics.
FLORS: Fast and Simple Domain Adaptation for Part-of-Speech Tagging

Tobias Schnabel, Department of Computer Science, Cornell University, tbs49@cornell.edu
Hinrich Schütze, Center for Information & Language Processing, University of Munich, inquiries@cislmu.org

Abstract

We present FLORS, a new part-of-speech tagger for domain adaptation. FLORS uses robust representations that work especially well for unknown words and for known words with unseen tags. FLORS is simpler and faster than previous domain adaptation methods, yet it has significantly better accuracy than several baselines.

1 Introduction

In this paper we describe FLORS, a part-of-speech (POS) tagger that is Fast in training and tagging, uses LOcal context only (as opposed to finding the optimal tag sequence for the entire sentence), performs Robustly on target domains (TDs) in unsupervised domain adaptation (DA) and is Simple in architecture and feature representation.

FLORS constructs a robust representation of the local context of the word v that is to be tagged. This representation consists of distributional features, suffixes and word shapes of v and its local neighbors. We show that it has two advantages. First, since the main predictors used by FLORS are distributional features (not the word's identity), FLORS predicts unseen tags of known words better than prior work on DA for POS. Second, since FLORS uses representations computed from unlabeled text, representations of unknown words are in principle of the same type as representations of known words; this property of FLORS results in better performance on unknown words compared to prior work. These two advantages are especially beneficial for TDs that contain high rates of unseen tags of known words and high rates of unknown words.

We show that FLORS achieves excellent DA tagging results on the five domains of the SANCL 2012 shared task (Petrov and McDonald, 2012) and outperforms three state-of-the-art taggers on Blitzer et al.'s (2006) biomedical data.

FLORS is also simpler and faster than other POS DA methods. It is simple in that the input representation consists of three simple types of features: distributional count features and two types of binary features, suffix and shape features. Many other word representations that are used for improving generalization (e.g., Brown et al., 1992; Collobert et al., 2011) are costly to train or have difficulty handling unknown words. Our representations are fast to build and can be created on the fly for unknown words that occur during testing.

The learning architecture is simple and fast as well. We train k binary one-vs-all classifiers that use local context only and no sequence information (where k is the number of tags). Thus, tagging complexity is O(k). Many other learning setups for DA are more complex; e.g., they learn representations (as opposed to just counting), they learn several classifiers for different subclasses of words (e.g., known vs. unknown) or they combine left-to-right and right-to-left taggings.

The next two sections describe experimental data, setup and results. Results are discussed in Section 4. We compare FLORS to alternative word representations in Section 5 and to related work in Section 6. Section 7 presents our conclusions.

2 Experimental data and setup

Data. Our source domain is the Penn Treebank (Marcus et al., 1993) of Wall Street Journal (WSJ) text.
Following Blitzer et al. (2006), we use sections 2-21 for training and 100,000 WSJ sentences from 1988 as unlabeled data in training.

We evaluate on six different TDs. The first five TDs (newsgroups, weblogs, reviews, answers, emails) are from the SANCL shared task (Petrov and McDonald, 2012). Additionally, the SANCL dataset contains sections 22 and 23 of the WSJ for in-domain development and testing, respectively. Each SANCL TD has an unlabeled training set of 100,000 sentences and development and test sets of about 1000 labeled sentences each. The sixth TD is BIO, the Penn BioTreebank dataset distributed by Blitzer. It consists of dev and test sets of 500 sentences each and 100,000 unlabeled sentences.

Classification setup. Similar to SVMTool (Giménez and Màrquez, 2004) and Choi and Palmer (2012) (henceforth: C&P), we use local context only for tagging instead of performing sequence classification. For a word w occurring as token v_i in a sentence, we build a feature vector for a local window of size 2l+1 around v_i. The representation of the object to be classified is this feature vector and the target class is the POS tag of v_i.

We use the linear L2-regularized L2-loss SVM implementation provided by LIBLINEAR (Fan et al., 2008) to train k one-vs-all classifiers on the training set, where k is the number of POS tags in the training set (in our case k = 45). We train with untuned default parameters; in particular, C = 1. In the special case of linear SVMs, the value of C does not need to be tuned exhaustively as the solution remains constant after C has reached a certain threshold value C* (Keerthi and Lin, 2003). Training can easily be parallelized by giving each binary SVM its own thread.

Windows. The local context for tagging token v_i is a window of size 2l+1 centered around v_i: (v_{i−l}, ..., v_i, ..., v_{i+l}). We pad sentences on either side with ⟨BOUNDARY⟩ to ensure sufficient context for all words. Given a mapping f from words to feature vectors (see below), the representation F of a token v_i is the concatenation of the 2l+1 word vectors in its window:

F(v_i) = f(v_{i−l}) ⊕ ... ⊕ f(v_{i+l})

where ⊕ is vector concatenation.

Word features. We represent each word w by four components: (i) counts of left neighbors, (ii) counts of right neighbors, (iii) binary suffix features and (iv) binary shape features. These four components are concatenated:

f(w) = f_left(w) ⊕ f_right(w) ⊕ f_suffix(w) ⊕ f_shape(w)

We consider these sources of information equally important and normalize each of the four component vectors to unit length. Normalization also has a beneficial effect on SVM training time because it alleviates numerical problems (Fan et al., 2008).

Distributional features. We follow a long tradition of older (Finch and Chater, 1992; Schütze, 1993; Schütze, 1995) and newer (Huang and Yates, 2009) work on creating distributional features for POS tagging based on local left and right neighbors. Specifically, the ith entry x_i of f_left(w) is the weighted number of times that the indicator word c_i occurs immediately to the left of w:

x_i = tf(freq(bigram(c_i, w)))

where c_i is the word with frequency rank i in the corpus, freq(bigram(c_i, w)) is the number of times the bigram "c_i w" occurs in the corpus and we weight the non-zero frequencies logarithmically: tf(x) = 1 + log(x). tf-weighting has been used by other researchers (Huang and Yates, 2009) and showed good performance in our own previous work. f_right(w) is defined analogously. We restrict the set of indicator words to the n = 500 most frequent words in the corpus. To avoid zero vectors, we add an entry x_{n+1} to each vector that counts omitted contexts:

x_{n+1} = tf( Σ_{j: j>n} freq(bigram(c_j, w)) )

We compute distributional vectors on the joint corpus D_ALL of all labeled and unlabeled text of source domain and TD. The text is preprocessed by lowercasing everything (which is often done when computing word representations, e.g., by Turian et al. (2010)) and by padding sentences with ⟨BOUNDARY⟩ tokens.

Suffix features. Suffixes are promising for DA because basic morphology rules are the same in different domains. In contrast to other work on tagging (e.g., Ratnaparkhi (1996), Toutanova et al. (2003), Miller et al. (2007)), we simply use all (lowercase) suffixes to avoid the need for selecting a subset of suffixes; and we treat all words equally, as opposed to using suffix features for only a subset of words. For suffix s, we set the dimension corresponding to s in f_suffix(w) to 1 if lowercased w ends in s and to 0 otherwise. Note that w is a suffix of itself.[1]

[1] One could also compute these suffixes for _w (w prefixed by underscore) instead of for w to include words as distinguishable special suffixes. We test this alternative in Table 2, line 15.
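To make this construction concrete, the following is a minimal Python sketch of the distributional and suffix components (the shape component follows the same binary-indicator pattern). The identifiers (build_stats, f_dist, f_suffix and so on) are ours, not from the paper; the sketch simply follows the definitions above: tf weighting of bigram counts, the extra entry x_{n+1} for omitted contexts, one binary indicator per suffix, and unit-length normalization of each component.

```python
import math
from collections import Counter, defaultdict

import numpy as np

BOUNDARY = "<BOUNDARY>"

def tf(x):
    # Logarithmic weighting of non-zero frequencies: tf(x) = 1 + log(x).
    return 1.0 + math.log(x) if x > 0 else 0.0

def build_stats(corpus, n=500):
    # corpus: iterable of sentences, each a list of lowercased tokens.
    # Returns the n most frequent words (the indicator words) and the
    # left/right bigram count tables that the x_i are computed from.
    freq = Counter()
    left = defaultdict(Counter)    # left[w][c]: c occurs directly left of w
    right = defaultdict(Counter)   # right[w][c]: c occurs directly right of w
    for sent in corpus:
        padded = [BOUNDARY] + sent + [BOUNDARY]
        freq.update(padded)
        for a, b in zip(padded, padded[1:]):
            left[b][a] += 1
            right[a][b] += 1
    indicators = [w for w, _ in freq.most_common(n)]
    return indicators, left, right

def unit(v):
    # Normalize one component vector to unit length.
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def f_dist(w, indicators, neighbors):
    # Entry i: weighted count of indicator word c_i next to w; the extra
    # last entry collects all omitted (non-indicator) contexts.
    counts = neighbors[w]
    indicator_set = set(indicators)
    x = np.zeros(len(indicators) + 1)
    for i, c in enumerate(indicators):
        x[i] = tf(counts[c])
    x[-1] = tf(sum(f for c, f in counts.items() if c not in indicator_set))
    return unit(x)

def f_suffix(w, suffix_index):
    # suffix_index: dict mapping every suffix seen in training to a dimension.
    # One binary indicator per suffix of lowercased w (w is a suffix of itself).
    x = np.zeros(len(suffix_index))
    for k in range(len(w)):
        j = suffix_index.get(w[k:])
        if j is not None:
            x[j] = 1.0
    return unit(x)
```

Because f_dist only reads the count tables, an unknown word receives a representation of exactly the same form as a known word, provided it occurs in the unlabeled target-domain text.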
Shape features. We use the Berkeley parser word signatures (Petrov and Klein, 2007). Each word is mapped to a bit string encompassing 16 binary indicators that correspond to different orthographic (e.g., does the word contain a digit, hyphen, uppercase character) and morphological (e.g., does the word end in -ed or -ing) features. There are 50 unique signatures in WSJ. We set the dimension of f_shape(w) that corresponds to the signature of w to 1 and all other dimensions to 0. We note that the shape features we use were designed for English and probably would have to be adjusted for other languages.

Baselines. We address the problem of unsupervised domain adaptation for POS tagging. For this problem, we consider three types of baselines: (i) high-performing publicly available systems, (ii) the taggers used at SANCL and (iii) POS DA results published for BIO.

Most of our experiments use taggers from category (i) because we can ensure that experimental conditions are directly comparable. The four baselines in category (i) are shown in Table 1. Three have near state-of-the-art performance on WSJ: SVMTool (Giménez and Màrquez, 2004), Stanford (Toutanova et al., 2003) (a bidirectional MEMM) and C&P. TnT (Brants, 2000) is included as a representative of fast and simple HMM taggers. In addition, C&P is a tagger that has been extensively tested in DA scenarios with excellent results. Unless otherwise stated, we train all models using their default configuration files. We use the optimized parameter configuration published by C&P for the C&P model.

| # | model    | classifier  | features                                                    |
|---|----------|-------------|-------------------------------------------------------------|
| 1 | TnT      | HMM         | p−{0,1,2}, v0, suffixes (for OOVs)                          |
| 2 | Stanford | bidir. MEMM | p±{0,1,2}, v±{0,1}, affixes, orthography                    |
| 3 | SVMTool  | SVM         | p±{0,1,2,3}, v±{0,1,2,3}, affixes, orthography, word length |
| 4 | C&P      | SVM         | p±{0,1,2,3}, v±{0,1,2,3}, affixes, orthography              |
| 5 | FLORS    | SVM         | distributions of v±{0,1,2}, suffixes, orthography           |

Table 1: Overview of baseline taggers and FLORS. v_i: token, p_i: POS tag. Positions included in the sets of token indices are relative to the position i of the word v0 to be tagged; e.g., p±{0,1,2} is short for {p−0, p−1, p−2, p0, p1, p2}. To represent tokens v_i, models 1–4 use vocabulary indices and FLORS uses distributional representations. Models 2–4 use combinations of features (e.g., tag-word) as well.

Test set results will be compared with the SANCL taggers (category (ii)) at the end of Section 3. As far as category (iii) is concerned, most work on POS DA has been evaluated on BIO. We discuss our concerns about the BIO evaluation sets in Section 4, but also show that FLORS beats previously published results on BIO as well (see Table 6).

3 Experimental results

We train k binary SVM classifiers on the training set. A token in the test set is classified by building its feature vector, running the classifiers on it and then assigning it to the POS class whose one-vs-all LIBLINEAR classifier returns the largest score.
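A sketch of this training and tagging loop is shown below. It assumes a word-level feature function f that returns a 1×d scipy sparse row vector (e.g., the component vectors from the earlier sketch, concatenated and converted with csr_matrix); the helper names token_vector, train_tagger and tag are ours. We use scikit-learn's LinearSVC as a stand-in for calling LIBLINEAR directly: it wraps LIBLINEAR and by default trains one-vs-rest L2-regularized L2-loss linear SVMs.

```python
from scipy.sparse import hstack, vstack
from sklearn.svm import LinearSVC

L = 2                      # window size 2l+1 = 5
BOUNDARY = "<BOUNDARY>"

def token_vector(words, i, f):
    # F(v_i) = f(v_{i-l}) ⊕ ... ⊕ f(v_{i+l}), with <BOUNDARY> padding.
    padded = [BOUNDARY] * L + words + [BOUNDARY] * L
    return hstack([f(padded[i + k]) for k in range(2 * L + 1)])

def train_tagger(tagged_sentences, f):
    # tagged_sentences: iterable of (words, tags) pairs.
    X, y = [], []
    for words, tags in tagged_sentences:
        for i, t in enumerate(tags):
            X.append(token_vector(words, i, f))
            y.append(t)
    # Untuned default C=1; one binary one-vs-rest classifier per tag.
    clf = LinearSVC(C=1.0)
    clf.fit(vstack(X), y)
    return clf

def tag(clf, words, f):
    # Local-context classification: each token is assigned independently
    # to the class with the largest one-vs-rest decision score, so no
    # sequence decoding is needed.
    X = vstack([token_vector(words, i, f) for i in range(len(words))])
    return list(clf.predict(X))
```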
Results for ALL accuracy (accuracy for all tokens) and OOV accuracy (accuracy for tokens not occurring in the labeled WSJ data) are reported in Table 2. Results with an asterisk are significantly worse than a column's best result using McNemar's test (p < .001). We use the same test and p-value throughout this paper.

The basic FLORS model (Table 2, line 5) uses window size 5 (l = 2). Each word in the window has 1002 distributional features (501 left and right), 91,161 suffix features and 50 shape features. The final feature vector for a token has a dimensionality of about 500,000, but is very sparse.

| #  | model          | newsgroups      | reviews         | weblogs         | answers         | emails          | wsj             |
|----|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| 1  | TnT            | 88.66* / 54.73* | 90.40* / 56.75* | 93.33* / 74.17* | 88.55* / 48.32* | 88.14* / 58.09* | 95.75* / 88.30  |
| 2  | Stanford       | 89.11* / 56.02* | 91.43* / 58.66* | 94.15* / 77.13* | 88.92* / 49.30* | 88.68* / 58.42* | 96.83 / 90.25   |
| 3  | SVMTool        | 89.14* / 53.82* | 91.30* / 54.20* | 94.21* / 76.44* | 88.96* / 47.25* | 88.64* / 56.37* | 96.63 / 87.96   |
| 4  | C&P            | 89.51* / 57.23* | 91.58* / 59.67* | 94.41* / 78.46* | 89.08* / 48.46* | 88.74* / 58.62* | 96.78 / 88.65   |
| 5  | FLORS basic    | 90.86 / 66.42   | 92.95 / 75.29   | 94.71 / 83.64   | 90.30 / 62.15   | 89.44* / 62.61  | 96.59 / 90.37   |
| 6  | n=250          | 90.93 / 67.03   | 92.93 / 75.45   | 94.69 / 83.69   | 90.29 / 62.20   | 89.63 / 63.43   | 96.56* / 89.45  |
| 7  | v±{0,1,2}, n=0 | 89.14* / 55.59* | 91.80* / 66.31* | 93.40* / 72.55* | 89.47* / 55.82* | 88.21* / 57.83* | 96.29* / 85.55* |
| 8  | no suffixes    | 90.60 / 65.17   | 92.74 / 71.94*  | 94.77 / 84.92   | 89.77* / 58.71* | 89.30* / 62.09  | 96.28* / 88.88  |
| 9  | no shapes      | 89.70* / 63.10* | 92.24* / 68.70* | 92.60* / 74.72* | 89.55* / 59.08* | 89.63 / 64.17   | 95.52* / 83.94* |
| 10 | v±{1,2}, n=0   | 90.61* / 65.95  | 92.76* / 75.56  | 94.62 / 84.62   | 90.23 / 61.87   | 89.40* / 63.82  | 96.51* / 90.02  |
| 11 | no suffixes    | 90.66 / 64.78*  | 92.88 / 75.08   | 94.83 / 84.52   | 90.36 / 61.92   | 89.42 / 62.74   | 96.64 / 89.45   |
| 12 | no shapes      | 90.74 / 67.03   | 93.02 / 75.88   | 94.57 / 83.83   | 90.23 / 61.73   | 89.41* / 63.49  | 96.57 / 90.25   |
| 13 | l=1            | 90.44* / 63.62* | 92.69* / 75.72  | 94.48* / 84.03  | 90.02* / 62.66  | 89.17* / 62.71  | 96.44* / 88.65  |
| 14 | L-to-R         | 90.56* / 66.08  | 92.97 / 75.40   | 94.57 / 83.79   | 90.43 / 62.80   | 89.43 / 63.13   | 96.53* / 90.94  |
| 15 | voc. indices   | 90.93 / 66.64   | 92.91 / 75.03   | 94.71 / 84.08   | 90.27 / 61.92   | 89.37* / 62.26  | 96.63 / 90.60   |

Table 2: Tagging accuracy of four baselines and FLORS on the dev sets (each cell: ALL / OOV accuracy). The table is structured as follows: baselines (lines 1–4), basic FLORS setup (lines 5–6), effect of omitting one of the three feature types if the word to be tagged is changed compared to the basic FLORS setup (lines 7–9) and if the word to be tagged is not changed compared to basic FLORS (lines 10–12), effect of three important configuration choices on tagging accuracy: window size (line 13), inclusion of prior tagging decision (line 14) and vocabulary index (line 15). n: number of indicator words. 2l+1: size of the local context window. Lines 10–12: only the neighbors of v0 are modified compared to basic (line 5). Lines 7–9: all five token representations (including v0) are modified.

FLORS outperforms all baselines on the five TDs (line 5 vs. lines 1–4). Only in-domain on WSJ are three baselines slightly superior. The baselines are slightly better on ALL accuracy because they were designed for tagging in-domain data and use feature sets that have been found to work well on the source domain. Generally, C&P performs best for DA among the baselines. On answers and WSJ, however, Stanford has better overall accuracies. These results are in line with C&P.

On lines 6–15, we investigate how different modifications of the basic FLORS model affect performance. First, we examine the effect of leaving out components of the representation: distributional features (f_left(w), f_right(w)), suffixes (f_suffix(w)) and shape features (f_shape(w)).

Distributional features boost performance in all domains: ALL and OOV accuracies are consistently worse for n = 0 (line 7) than for n ∈ {250, 500} (lines 6 & 5). FLORS with n = 250 has better OOV accuracies in 5 of 6 domains. However, ALL accuracy for FLORS with n = 500 is better in the majority of domains. The main result of this comparison is that FLORS does not seem to be very sensitive to the value of n if n is large enough.

Shape features also improve results in all domains, with one exception: emails (lines 9 vs. 5). For emails, shape features decrease ALL accuracy by .19 and OOV accuracy by 1.56. This may be due to the fact that many OOVs are NNP/NN and that tagging conventions for NNP/NN vary between domains. See Section 4 for discussion.

Performance benefits from suffixes in all domains but weblogs (lines 8 vs. 5). Weblogs contain many foreign names such as Abdul and Yasim. For these words, shapes apparently provide better information for classification than suffixes. ALL accuracies suffer little when leaving out suffixes, but the feature space is much smaller: about 3000 dimensions. Thus, for domains where we expect few OOVs, omitting suffix features could be considered.

Lines 7–9 omit one of the components of f(v_i) for all five words in the local context: i ∈ {−2,−1,0,1,2}. Lines 10–12 omit the same components for the neighbor words only (i.e., i ∈ {−2,−1,1,2}) and leave f(v_0) unchanged. 14 of the 6×3 ALL accuracies on lines 10–12 are worse than FLORS basic, 4 are better. The largest differences are .25 for newsgroups and .19 for reviews (lines 5 vs. 10), but differences for the other domains are negligible.
This shows that the most important feature representation is that of v_0 (not surprisingly) and that the distributional features of the other words can be omitted at the cost of some loss in accuracy if a small average number of active features is desired.

Another FLORS parameter is the size of the local context. Surprisingly, OOV accuracies benefit a bit in four domains if we reduce l from 2 to 1 (lines 13 vs. 5). However, ALL accuracy consistently drops in all six domains. This argues for using l = 2, i.e., a window size of 5.

Results for left-to-right (L-to-R) tagging are given on line 14. Similar to SVMTool and C&P, each sentence is tagged from left to right and previous tagging decisions are used for the current classification. In this setting, we use the previous tag p_{i−1} as one additional feature in the feature vector of v_i. The effect of left-to-right is similar to the effect of omitting suffixes: OOV accuracies go up in some domains, but ALL accuracies decrease (except for an increase of .02 for reviews). This is in line with the experiments in (Schnabel and Schütze, 2013), where sequential information in a CRF was not robust across domains. OOV tagging may benefit from correct previous tags because the larger left context that is indirectly made available by left-to-right tagging compensates partially for the lack of information about the OOV word.

In contrast to standard approaches to POS tagging, the FLORS basic representation does not contain vocabulary indices. Line 15 shows what happens if we add them; the dimensionality of the feature vector is increased by 5|V| (where V is the training set vocabulary) and in training one binary feature is set to one for each of the five local context words. Performance is almost indistinguishable from FLORS basic, suggesting that only using suffixes, which can be viewed as "ambiguous" vocabulary indices (e.g., "at" is on for "at", "mat", "hat", "laundromat" etc.), is sufficient.

In summary, we find that distributional features, word signatures and suffixes all contribute to successful POS DA. Factors with only minor impact on performance are the number of indicator words used for the distributional representations, the window size l and the tagging scheme (L-to-R vs. non-L-to-R). Unknown words and known words behave differently with respect to certain feature choices. The different behavior of unknown and known words suggests that training and optimizing two separate models, an approach used by SVMTool, would further increase tagging accuracy. Note that there has been at least one publication (Schnabel and Schütze, 2013) on optimizing a separate model for unknown words that has in some cases better performance on OOV accuracy than what we publish here.[2] However, this would complicate the architecture of FLORS. We opted for a maximally simple model in this paper, potentially at the cost of some performance.

Test set results. Table 3 reports results on the test sets. FLORS again performs significantly better on all five TDs, both on ALL and OOV. Only in-domain on WSJ is ALL performance worse.

Finally, we compare our results to the POS taggers for which performance was reported at SANCL 2012 (Petrov and McDonald, 2012, Table 4). Constituency-based parsers, which also tag words as a by-product of deriving complete parse trees, are excluded from the comparison because they are trained on a richer representation, the syntactic structure of sentences.[3] FLORS' results are better than the best non-parsing-based results at SANCL 2012, which were accuracies of 92.32 on newsgroups (HIT), 90.65 on reviews (HIT) and 91.07 on answers (IMS-1).

[2] Schnabel and Schütze (2013) report OOV accuracies of 56.62 (newsgroups), 64.61 (reviews), 71.86 (weblogs), 54.28 (answers), 61.05 (emails) and 64.64 (BIO) for their basic model and even higher OOV accuracies if parameters are optimized on a per-domain basis.

[3] DCU-Paris13 is listed in the dependency parser tables, but DCU-Paris13 results are derived from a constituency parser. DCU also developed sophisticated preprocessing rules for the different domains, which can be viewed as a kind of manual domain adaptation.
| # | model       | newsgroups      | reviews         | weblogs         | answers         | emails          | wsj            |
|---|-------------|-----------------|-----------------|-----------------|-----------------|-----------------|----------------|
| 1 | TnT         | 90.85* / 56.60* | 89.67* / 50.98* | 91.37* / 62.65* | 89.36* / 51.82* | 87.38* / 55.12* | 96.57* / 86.27 |
| 2 | Stanford    | 91.25* / 57.96* | 90.30* / 51.87* | 92.32* / 67.85* | 89.74* / 53.41* | 87.77* / 57.10* | 97.43 / 88.71  |
| 3 | SVMTool     | 91.21* / 54.40* | 90.01* / 45.05* | 92.05* / 63.59* | 89.90* / 51.07* | 87.74* / 53.23* | 97.26 / 86.47  |
| 4 | C&P         | 91.68* / 60.58* | 90.42* / 51.12* | 92.22* / 66.91* | 89.90* / 53.31* | 87.91* / 54.47* | 97.44 / 88.20  |
| 5 | FLORS basic | 92.41 / 66.91   | 92.25 / 70.87   | 93.14 / 75.32   | 91.17 / 67.93   | 88.67 / 61.09   | 97.11* / 87.79 |

Table 3: Tagging accuracy of four baselines and FLORS on the test sets (each cell: ALL / OOV accuracy).

|                                 | newsgroups | reviews | weblogs | answers | emails | wsj   | bio   |
|---------------------------------|------------|---------|---------|---------|--------|-------|-------|
| pct tokens: unknown tag         | 0.31       | 0.06    | 0.00    | 0.25    | 0.80   | 0.00  | 0.98  |
| pct tokens: OOV                 | 10.34      | 6.84    | 8.45    | 8.53    | 10.56  | 2.72  | 19.86 |
| pct tokens: unseen word+tag     | 2.44       | 2.22    | 1.46    | 2.91    | 3.47   | 0.61  | 2.50  |
| accuracy: TnT                   | 0.00       | 0.00    | 0.00    | 0.00    | 0.00   | 0.00  | 0.00  |
| accuracy: Stanford              | 3.66       | 5.74    | 9.40    | 5.46    | 2.77   | 15.23 | 4.64  |
| accuracy: SVMTool               | 0.00       | 0.16    | 0.00    | 0.00    | 0.10   | 0.00  | 0.00  |
| accuracy: C&P                   | 14.47      | 14.75   | 20.51   | 13.37   | 10.29  | 38.07 | 8.98  |
| accuracy: FLORS basic           | 21.06      | 21.97   | 21.65   | 17.19   | 15.13  | 41.12 | 12.69 |

Table 4: Top: percentage of unknown tags, OOVs and unseen word+tag combinations (i.e., known words tagged with unseen tags) in the dev sets. Bottom: tagging accuracy on unseen word+tag.

4 Discussion

Advantages of FLORS representation. As we can see in Table 1, the main representational difference between FLORS and the other taggers is that the FLORS representation does not include vocabulary indices of the word to be tagged or its neighbors; the FLORS vector only consists of distributional, suffix and shape features.

This is an obvious advantage for OOVs. In other representational schemes, OOVs have representations that are fundamentally different from known words, since their vocabulary index does not occur in the training set and cannot be used for prediction. In contrast, given enough unlabeled TD data, FLORS represents known and unknown words in essentially the same way and prediction of the correct tag is easier. This explanation is supported by the experiments in Table 2: FLORS beats all other systems on OOVs, even in-domain on WSJ.

In our analysis we found that apart from better handling of OOVs there is a second beneficial effect of distributional representations: they facilitate the correct tagging of known words occurring with tags unseen in the training set, which we call unseen word+tags. Table 4 gives statistics on this case and shows that unseen word+tags occur at least twice as often out-of-domain (e.g., 1.46% for weblogs) as in-domain (.61% for WSJ). The bottom part of the table shows the performance of the five taggers on unseen word+tags. FLORS is the top performer on all seven domains, with large differences of more than 5% in some domains.

The explanation is similar to the OOV case: FLORS does not restrict the set of possible POS tags of a word. The other taggers in Table 2 use the vocabulary index of the word to be tagged and will therefore give a strong preference to seen tags. Since FLORS uses distributional features, it can more easily assign an unseen tag as long as it is compatible with the overall pattern of distribution, suffixes and shapes typical of the tag. C&P also perform relatively well on unseen word+tag due to the ambiguity classes in their model, but FLORS representations are better for every domain. We take these results to mean that constraints on a word's possible POS tags may well be helpful for in-domain data, but for out-of-domain data an overly strong bias for a word's observed tags is harmful.

It is important to stress that representations similar to FLORS representations have been used for a long time; we would expect many of them to have similar advantages for unseen word+tags. E.g., Brown clusters (Brown et al., 1992) and word embeddings (Collobert et al., 2011) are similar to FLORS in this respect. However, FLORS representations are extracted by simple counting whereas the computation of Brown clusters or word embeddings is much more expensive. The speed with which FLORS representations can be computed is particularly beneficial when taggers need to be adapted to new domains. FLORS can easily adapt its representations on the fly: as each new occurrence of a word is encountered, the counts that are the basis for the x_i can simply be incremented.
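As a small illustration, with the count tables from the sketch in Section 2, this on-the-fly adaptation amounts to a few increments per new unlabeled sentence (observe is our name for this hypothetical helper, not from the paper):

```python
def observe(sentence, left, right):
    # Increment the bigram counts underlying the x_i for one new
    # (lowercased) unlabeled sentence; the next call to f_dist then
    # reflects the updated target-domain statistics, with no retraining
    # of the representations.
    padded = ["<BOUNDARY>"] + sentence + ["<BOUNDARY>"]
    for a, b in zip(padded, padded[1:]):
        left[b][a] += 1
        right[a][b] += 1
```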
We present a direct comparison of FLORS representations with other representations in Section 5.

"Local context" vs. sequence classification. The most common approach to POS tagging is to tag a sentence with its most likely sequence; in contrast, independent tagging of local context is not guaranteed to find the best sequence. Recent work on English suggests that window-based tagging can perform as well as sequence-based methods (Liang et al., 2008; Collobert et al., 2011). Toutanova et al. (2003) report similar results. In our experiments, we also did not find consistent improvements when we incorporated sequence constraints (Table 2, line 14). However, there may be languages and applications involving long-distance relationships where local-context classification is suboptimal.

Local-context classification has two advantages compared to sequence classification. (i) It simplifies the classification and tagging setup: we can use any existing statistical classifier. Sequence classification limits the range of methods that can be applied; e.g., it is difficult to find a good CRF implementation that can handle real-valued features, which are of critical importance for our representation. (ii) The time complexity of FLORS in tagging is O(skf), where s is the length of the sentence, k is the number of tags and f is the number of non-zero features per local-context representation. In contrast, sequence decoding complexity is O(sk²f). This difference is not of practical importance for standard English POS sets, but it could be an argument against sequence classification for tagging problems with much larger tag sets.

In summary, replacing sequence classification with local-context classification is attractive for large-scale, practical tagging.

What DA can and cannot do. Despite the superior DA tagging results we report for FLORS in this paper, there is still a gap of 2%–7% (depending on the domain) between in-domain WSJ accuracy and DA accuracy on SANCL. In our analysis of this gap, we found some evidence that DA performance can be further improved, especially as more unlabeled TD data becomes available. But we also found two reasons for low performance that unsupervised DA cannot do anything about: differences in tagsets (unknown tags) and differences in annotation guidelines.

Table 4 shows that unknown tags occur in five of the seven TDs at rates between 0% (weblogs) and 1% (BIO). Each token that is tagged with an unknown tag is necessarily an error in unsupervised DA. Furthermore, the unknown tag can also impact tagging accuracy in the local context,[4] so the unknown tag rates in Table 4 are probably lower bounds for the error that is due to unknown tags. Based on these considerations, it is not surprising that tagging accuracy (e.g., of FLORS basic) and unknown tag rate are correlated, as we can see in Tables 2, 4 and 6; e.g., we get the highest accuracies in the two domains that do not have unknown tags (weblogs and WSJ) and the lowest accuracy in the domain with the highest rate (BIO).

Since unknown tags cannot be predicted correctly, one could simply report accuracy on known tags. However, given the negative effect of unknown tags on tagging accuracy of the local context in which they occur, excluding unknown tags does not fully address the problem. For this reason, it is probably best to keep the common practice of simply reporting accuracy on all tokens, including unknown tags. But the percentages of unknown tags should also be reported for each dataset as a basis for a more accurate interpretation of results.

Another type of error that cannot be avoided in unsupervised DA is due to differences in annotation guidelines. There are a few such problems in SANCL; e.g., filenames like "Services.doc" are annotated as NN in the email domain. But their distributional and grammatical behavior is more similar to NNPs; as a consequence, most filenames are incorrectly tagged. In general, it is difficult to discriminate NNs from NNPs. The Penn Treebank annotation guidelines (Santorini, 1990) are compatible with either tag in many cases and it may simply be impossible to write annotation guidelines that avoid these problems (cf. Manning (2011)). NN-NNP inconsistencies are especially problematic for OOV tagging since most OOVs are NNs or NNPs.

[4] For example, there is a special tag ADD in the web domain for web addresses. The last two words of the sentence "I would like to host my upcoming website to/IN Liquidweb.com/ADD" are mistagged by the Stanford tagger as "...to/TO Liquidweb.com/VB". So the missing tag in this case also affects the tagging of surrounding words.
| tag  | bio dev OOV | bio dev ALL | wsj train ALL |
|------|-------------|-------------|---------------|
| NN   | 62.4        | 25.4        | 14.4          |
| JJ   | 15.9        | 8.9         | 6.2           |
| NNS  | 10.2        | 7.5         | 6.3           |
| NNP  | 0.5         | 0.2         | 9.5           |
| NNPS | 0.0         | 0.0         | 0.3           |

Table 5: Frequency of some tags (percent of tokens) for bio dev and wsj train.

While the amount of inconsistent annotation is limited for SANCL, it is a serious problem for BIO. Table 5 shows that the proportion of NNPs in BIO is less than a tenth of that in WSJ (.2 in BIO vs. 9.5 in WSJ). This is due to the fact that many bio-specific names, in particular genes, are annotated as NN. In contrast, the distributionally and orthographically most similar names in WSJ are tagged as NNP. For example, we find "One cell was teased out, and its DNA/NNP extracted" in WSJ vs. "DNA/NN was isolated" in BIO.

|             | standard setup  | NNP→NN          |
|-------------|-----------------|-----------------|
| TnT         | 87.49* / 59.08* | 91.75* / 78.33* |
| Stanford    | 88.46* / 62.55* | 92.36* / 79.19* |
| SVMTool     | 88.33* / 61.30* | 92.47 / 79.46*  |
| C&P         | 87.82* / 60.60* | 92.06* / 79.30* |
| FLORS basic | 88.90 / 64.74   | 92.91 / 82.58   |
| n=250       | 88.90 / 64.51   | 92.93 / 82.47   |
| n=0         | 87.27* / 57.75* | 90.91* / 73.57* |
| no suffixes | 88.09* / 62.20* | 91.98* / 79.27* |
| no shapes   | 87.78* / 59.82* | 91.81* / 77.31* |
| l=1         | 89.12 / 65.52   | 92.99 / 82.90   |

Table 6: Tagging accuracy on bio dev (each cell: ALL / OOV accuracy). NNP→NN results were obtained by replacing NNPs with NNs.

Given this large discrepancy in the frequency of the tag NNP, which arguably is due to different annotation guidelines, not to underlying differences between the two genres, BIO should probably not be used for evaluating DA. This is why we did not include it in our comparison in Table 2.

For the sake of completeness, we provide tagging accuracies for BIO in Table 6, "standard setup". The results are in line with SANCL results: FLORS beats the baselines on ALL and OOV accuracies. However, if we build the NN bias into our model by simply replacing all NNP tags with NN tags, then accuracy goes up by 4% on ALL and by almost 20% on OOV. Even TnT, the most basic tagger, achieves an ALL/OOV accuracy of 91.75/78.33, better than any method in the standard setup. These accuracies are well above those in (Blitzer et al., 2006) and (Huang and Yates, 2010).

Since simply replacing NNPs with NNs has such a large effect, BIO cannot be used sensibly for evaluating DA methods. In practice, it is not possible to separate "true" improvements due to generically better DA from elements of the proposed method that simply introduce a negative bias for NNP.
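A sketch of this evaluation variant follows, assuming flat lists of gold and predicted tags; the helper name and the additional NNPS→NNS mapping are our assumptions (the paper only states that NNPs were replaced with NNs):

```python
def accuracy(gold, pred, collapse_nnp=False):
    # Token-level tagging accuracy; with collapse_nnp=True, proper-noun
    # tags are mapped to common-noun tags in both gold and predicted
    # sequences before scoring (the "NNP→NN" setup of Table 6).
    mapping = {"NNP": "NN", "NNPS": "NNS"}  # NNPS handling is our assumption
    if collapse_nnp:
        gold = [mapping.get(t, t) for t in gold]
        pred = [mapping.get(t, t) for t in pred]
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```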
In summary, when comparing different DA methods, caution should be exercised in the choice of domains. In particular, the effect of unknown tags should be made transparent and the gold standards should be analyzed to determine whether the task addressed in the TD differs significantly in some aspects from that addressed in the source domain.

5 Comparison of word representations

Our approach to DA is an instance of representation learning: we aim to find representations that are robust across domains. In this section, we compare FLORS with two other widely used representation learning methods: (i) Brown clusters (Brown et al., 1992) and (ii) C&W embeddings, the word embeddings of Collobert et al. (2011). We use f_dist(w) = f_left(w) ⊕ f_right(w) to refer to our own distributional word representations (see Section 2).

The perhaps oldest and most frequently used low-dimensional representation of words is based on Brown clusters. Typically, prefixes of Brown clusters (Brown et al., 1992) are added to increase the robustness of POS taggers (e.g., Toutanova et al. (2003)). Computational costs are high (quadratic in the vocabulary size) although the computation can be parallelized (Uszkoreit and Brants, 2008).

More recently, general word representations (Collobert et al., 2011; Turian et al., 2010) have been used for robust POS tagging. These word representations are typically trained on a large amount of unlabeled text and fine-tuned for specific NLP tasks. Similar to Brown clusters, they are low-dimensional and can be used as features in many NLP tasks, either alone or in combination with other features.

To compare f_dist(w) (our distributional representations) with Brown clusters, we induced 1000 Brown clusters on the joint corpus data D_ALL (see Section 2) using the publicly available implementation of Liang (2005). We padded sentences with ⟨BOUNDARY⟩ tokens on each side and used path prefixes of length 4, 6, 10 and 20 as features for each word (cf. Ratinov and Roth (2009), Turian et al. (2010)).
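For concreteness, here is a minimal sketch of how such prefix features can be read off, assuming a hypothetical paths dict from words to the binary path strings produced by the cluster induction (the function name and feature encoding are ours, not from the paper):

```python
def brown_prefix_features(word, paths, lengths=(4, 6, 10, 20)):
    # One categorical feature per path-prefix length, e.g. ("brown6", "011010").
    # Words without a cluster path (e.g., words removed by frequency cutoffs
    # during induction) get no Brown features at all.
    path = paths.get(word)
    if path is None:
        return []
    return [("brown%d" % k, path[:k]) for k in lengths]
```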
C&W embeddings are provided by Collobert et al. (2011): 50-dimensional vectors for 130,000 words from WSJ, trained on Wikipedia. Similar to our distributional representations f_dist(w), the embeddings also contain a ⟨BOUNDARY⟩ token (which they call PADDING). Moreover, they have a special embedding for unknown words (called UNKNOWN) which we use whenever we encounter a word that is not in their lookup table. We preprocess our raw tokens the same way they do (lowercase and replace sequences of digits by "0") before we look up a representation during training and testing.

We replaced the distributional features in our basic setup by either Brown cluster features or C&W embeddings. Table 7 repeats lines 5 and 7 of Table 2 and gives results of the modified FLORS setup. All three representations improve both ALL and OOV accuracies in all domains. f_dist outperforms Brown in all cases except for OOV on emails. Brown may suffer from noisy data; cleaning methods have been used in the literature (Liang, 2005; Turian et al., 2010), but they are not unproblematic since a large part of the available data is lost, which results in more unknown words. Brown and f_dist can be directly compared since they were trained on exactly the same data.

f_dist and C&W are harder to compare directly because there are many differences. (i) C&W is trained on a much larger dataset. One consequence of this is that OOV accuracy on WSJ may be higher because some words that are unknown for other methods are actually known to C&W. (ii) C&W vectors are not trained on the SANCL TD datasets; this gives f_dist an advantage. (iii) C&W vectors are not trained on the WSJ. Again, this could give f_dist an advantage. (iv) C&W and f_dist are fundamentally different in the way they handle unknown words. C&W has a limited vocabulary and must replace all words not in this vocabulary by the token UNKNOWN. In contrast, f_dist can create a meaningful individual representation for any OOV word it encounters.

Our FLORS tagger provides the best ALL accuracies in all domains but WSJ, where C&W has the best results. The good performance of C&W is rather unsurprising since the embeddings were created for the 130,000 most frequent words of the WSJ and thus cover the WSJ domain much better. Also, WSJ was used to tune parameters during development. As with our previous experiments, OOV results on emails seem slightly more sensitive to parameter choices than on other domains (recall the discussion of this issue in Section 4).

In summary, we have shown that f_dist representations work better for POS DA than Brown clusters. Furthermore, the evidence we have presented suggests that f_dist representations are comparable in performance to C&W embeddings, if not better, for POS DA. The most important difference between f_dist and Brown/C&W is that f_dist representations are much simpler and much faster to compute. They are simpler because they are just slightly transformed counts, in contrast to the other two approaches, which solve complex optimization problems. f_dist can be computed efficiently through simple incrementation in one pass through the corpus. In contrast, the other two approaches are an order of magnitude slower.

6 Related work

Unsupervised DA methods can be broadly put into four categories: representation learning and constraint-based frameworks, which require some tailoring to a task, and instance weighting and bootstrapping, which can be more generally applied to a wide range of problems. Since many approaches are application-specific, we focus on the ones that have been applied to POS tagging.

Representation learning. We already discussed two important approaches to representation learning in Section 5: C&W embeddings and Brown clusters. Blitzer et al.'s (2006) structural correspondence learning (SCL) supports DA by creating similar representations for correlated features in the pivot feature space. This is a potentially powerful method. FLORS is simpler in that correlations are made directly accessible to the supervised learner.
| # | representation          | newsgroups      | reviews         | weblogs         | answers         | emails          | wsj             |
|---|-------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| 1 | FLORS f_dist(w), n=500  | 90.86 / 66.42   | 92.95 / 75.29   | 94.71 / 83.64   | 90.30 / 62.15   | 89.44 / 62.61   | 96.59 / 90.37   |
| 2 | f_dist(w), n=0          | 89.14* / 55.59* | 91.80* / 66.31* | 93.40* / 72.55* | 89.47* / 55.82* | 88.21* / 57.83* | 96.29* / 85.55* |
| 3 | C&W for f_dist(w)       | 90.57 / 64.57   | 92.54* / 72.48* | 94.51 / 80.58*  | 90.23 / 60.99   | 89.44 / 63.13   | 96.72 / 90.48   |
| 4 | Brown for f_dist(w)     | 90.34* / 62.41* | 92.23* / 71.47* | 94.45 / 81.76   | 89.71* / 56.28* | 89.02* / 63.20  | 96.48* / 87.50  |

Table 7: Tagging accuracy of different word representations on the dev sets (each cell: ALL / OOV accuracy). Line 1 corresponds to FLORS basic. n: number of indicator words.

Moreover, FLORS representations consist of simple counts, whereas SCL solves a separate optimization problem for each pivot feature.

Umansky-Pesin et al. (2010) derive distributional information for OOVs by running web queries. This approach is slow since it depends on a search engine. Ganchev et al. (2012) successfully use search logs. This is a promising enhancement for FLORS.

Huang and Yates (2009) evaluate CRFs with distributional features. They examine lower-dimensional feature representations using SVD or the latent states of an unsupervised HMM. They find better accuracies for their HMM method than Blitzer et al. (2006); however, they do not compare them against a CRF baseline using distributional features. In later work, Huang and Yates (2010) add the latent states of multiple, differently trained HMMs as features to their CRF. Huang and Yates (2012) argue that finding an optimal feature representation is computationally intractable and propose a new framework that allows prior knowledge to be integrated into representation learning.

Latent sequence states are a form of word representation. Thus, it would be interesting to compare them to the non-sequence-based distributional representation that FLORS uses.

Constraint-based methods. Rush et al. (2012) use global constraints on OOVs to improve out-of-domain tagging. Although constraints ensure consistency, they require careful manual engineering. Distributional features can also be seen as a form of constraint since feature weights will be shared among all words.

Subramanya et al. (2010) construct a graph to encourage similar n-grams to be tagged similarly, resulting in moderate gains in one domain, but no gains on BIO when compared to self-training. The reason could be an insufficient amount of unsupervised data for BIO (100,000 sentences). Our approach does not seem to suffer from this problem.

Bootstrapping. Both self-training (McClosky et al., 2006), which uses one classification model, and co-training (Blum and Mitchell, 1998), which uses two or more models, have been applied to POS tagging. Self-training usually improves a POS baseline only slightly if at all (Huang et al., 2009; Huang and Yates, 2010). Devising features based on labeled instances (instead of training on them) has been more successful (Florian et al., 2004; Søgaard, 2011).

Chen et al. (2011) use co-training for DA. In each round of their algorithm, both new training instances from the unlabeled data and new features are added. Their model is limited to binary classification. The co-training method of Kübler and Baucom (2011) trains several taggers and adds sentences from the TD to the training set on which they agree. They report slight, but statistically significant increases in accuracy for POS tagging of dialogue data.

Instance weighting. Instance weighting formalizes DA as the problem of having data from different probability distributions in each domain. The goal is to make these two distributions align by using instance-specific weights during training. Jiang and Zhai (2007) propose a framework that integrates prior knowledge from different datasets into the learning objective by weights. In related work, C&P train generalized and domain-specific models. An input sentence is tagged by the model that is most similar to the sentence. FLORS could be easily extended along these lines, an experiment we plan for the future.

In terms of the basic classification setup, our POS tagger is most similar to the SVM-based approaches of Giménez and Màrquez (2004) and C&P. However, we do not use a left-to-right approach when tagging sentences. Moreover, SVMTool trains two separate models, one for OOVs and one for known words. FLORS only has a single model.
In addition, we do not make use of ambiguity classes, token-tag dictionaries and rare feature thresholds. Instead, we rely only on three types of features: distributional representations, suffixes and word shapes.

The local-context-only approach of SVMTool, C&P and FLORS is different from standard sequence classification such as MEMMs (e.g., Ratnaparkhi (1996), Toutanova et al. (2003), Tsuruoka and Tsujii (2005)) and CRFs (e.g., Collins (2002)). Sequence models are more powerful in theory, but this may not be an advantage in DA because the subtle dependencies they exploit may not hold across domains.

7 Conclusion

We have presented FLORS, a new POS tagger for DA. FLORS uses robust representations that work especially well for unknown words and for known words with unseen tags. FLORS is simpler and faster than previous DA methods, yet we were able to demonstrate that it has significantly better accuracy than several baselines.

Acknowledgments. This work was supported by DFG (Deutsche Forschungsgemeinschaft).

References

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In EMNLP, pages 120–128.

Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100.

Thorsten Brants. 2000. TnT: A statistical part-of-speech tagger. In ANLP, pages 224–231.

Peter F. Brown, Peter V. de Souza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.

Minmin Chen, Kilian Q. Weinberger, and John Blitzer. 2011. Co-training for domain adaptation. In NIPS, pages 1–9.

Jinho D. Choi and Martha Palmer. 2012. Fast and robust part-of-speech tagging using dynamic model selection. In ACL: Short Papers, pages 363–367.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, pages 1–8.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874.

Steven Finch and Nick Chater. 1992. Bootstrapping syntactic categories using statistical methods. In Background and Experiments in Machine Learning of Natural Language, pages 229–235.

Radu Florian, Hany Hassan, Abraham Ittycheriah, Hongyan Jing, Nanda Kambhatla, Xiaoqiang Luo, Nicolas Nicolov, and Salim Roukos. 2004. A statistical model for multilingual entity detection and tracking. In HLT-NAACL, pages 1–8.

Kuzman Ganchev, Keith Hall, Ryan McDonald, and Slav Petrov. 2012. Using search-logs to improve query tagging. In ACL: Short Papers, pages 238–242.

Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In LREC, pages 43–46.

Fei Huang and Alexander Yates. 2009. Distributional representations for handling sparsity in supervised sequence-labeling. In ACL-IJCNLP, pages 495–503.

Fei Huang and Alexander Yates. 2010. Exploring representation-learning approaches to domain adaptation. In DANLP, pages 23–30.

Fei Huang and Alexander Yates. 2012. Biased representation learning for domain adaptation. In EMNLP-CoNLL, pages 1313–1323.

Zhongqiang Huang, Vladimir Eidelman, and Mary Harper. 2009. Improving a simple bigram HMM part-of-speech tagger by latent annotation and self-training. In NAACL-HLT: Short Papers, pages 213–216.

Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In ACL, pages 264–271.

S. Sathiya Keerthi and Chih-Jen Lin. 2003. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689.

Sandra Kübler and Eric Baucom. 2011. Fast domain adaptation for part of speech tagging for dialogues. In RANLP, pages 41–48.

Percy Liang, Hal Daumé III, and Dan Klein. 2008. Structure compilation: trading structure for features. In ICML, pages 592–599.

Percy Liang. 2005. Semi-supervised learning for natural language processing. Master's thesis, Massachusetts Institute of Technology.

Christopher D. Manning. 2011. Part-of-speech tagging from 97% to 100%: Is it time for some linguistics? In CICLing, pages 171–189.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Reranking and self-training for parser adaptation. In ACL, pages 337–344.

John Miller, Manabu Torii, and Vijay K. Shanker. 2007. Building domain-specific taggers without annotated (domain) data. In EMNLP-CoNLL, pages 1103–1111.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL, pages 404–411.

Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 Shared Task on Parsing the Web. Notes of the 1st SANCL Workshop.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL, pages 147–155.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In EMNLP, pages 133–142.

Alexander M. Rush, Roi Reichart, Michael Collins, and Amir Globerson. 2012. Improved parsing and POS tagging using inter-sentence consistency constraints. In EMNLP-CoNLL, pages 1434–1444.

Beatrice Santorini. 1990. Part-of-speech tagging guidelines for the Penn Treebank project (3rd revision, 2nd printing). Technical report, Department of Linguistics, University of Pennsylvania.

Tobias Schnabel and Hinrich Schütze. 2013. Towards robust cross-domain domain adaptation for part-of-speech tagging. In IJCNLP, pages 198–206.

Hinrich Schütze. 1993. Part-of-speech induction from scratch. In ACL, pages 251–258.

Hinrich Schütze. 1995. Distributional part-of-speech tagging. In EACL, pages 141–148.

Anders Søgaard. 2011. Semisupervised condensed nearest neighbor for part-of-speech tagging. In ACL: Short Papers, pages 48–52.

Amarnag Subramanya, Slav Petrov, and Fernando Pereira. 2010. Efficient graph-based semi-supervised learning of structured tagging models. In EMNLP, pages 167–176.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL-HLT, pages 173–180.

Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2005. Bidirectional inference with the easiest-first strategy for tagging sequence data. In EMNLP-HLT, pages 467–474.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL, pages 384–394.

Shulamit Umansky-Pesin, Roi Reichart, and Ari Rappoport. 2010. A multi-domain web-based algorithm for POS tagging of unknown words. In COLING, pages 1274–1282.

Jakob Uszkoreit and Thorsten Brants. 2008. Distributed word clustering for large scale class-based language modeling in machine translation. In ACL, pages 755–762.