Transactions of the Association for Computational Linguistics, 2 (2014) 55-66. Action Editor: Lillian Lee.
Submitted 9/2013; Revised 12/2013; Published 2/2014. © 2014 Association for Computational Linguistics.
Cross-lingual Projected Expectation Regularization for Weakly Supervised Learning

Mengqiu Wang and Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305 USA
{mengqiu,manning}@cs.stanford.edu

Abstract

We consider a multilingual weakly supervised learning scenario where knowledge from annotated corpora in a resource-rich language is transferred via bitext to guide the learning in other languages. Past approaches project labels across bitext and use them as features or gold labels for training. We propose a new method that projects model expectations rather than labels, which facilitates transfer of model uncertainty across language boundaries. We encode expectations as constraints and train a discriminative CRF model using Generalized Expectation Criteria (Mann and McCallum, 2010). Evaluated on standard Chinese-English and German-English NER datasets, our method demonstrates F1 scores of 64% and 60% when no labeled data is used. Attaining the same accuracy with supervised CRFs requires 12k and 1.5k labeled sentences. Furthermore, when combined with labeled examples, our method yields significant improvements over state-of-the-art supervised methods, achieving the best reported numbers to date on the Chinese OntoNotes and German CoNLL-03 datasets.

1 Introduction

Supervised statistical learning methods have enjoyed great popularity in Natural Language Processing (NLP) over the past decade. The success of supervised methods depends heavily upon the availability of large amounts of annotated training data. Manual curation of annotated corpora is a costly and time consuming process. To date, most annotated resources reside within the English language, which hinders the adoption of supervised learning methods in many multilingual environments.

To minimize the need for annotation, significant progress has been made in developing unsupervised and semi-supervised approaches to NLP (Collins and Singer 1999; Klein 2005; Liang 2005; Smith 2006; Goldberg 2010; inter alia). More recent paradigms for semi-supervised learning allow modelers to directly encode knowledge about the task and the domain as constraints to guide learning (Chang et al., 2007; Mann and McCallum, 2010; Ganchev et al., 2010). However, in a multilingual setting, coming up with effective constraints requires extensive knowledge of the foreign[1] language.

Bilingual parallel text (bitext) lends itself as a medium to transfer knowledge from a resource-rich language to foreign languages. Yarowsky and Ngai (2001) project labels produced by an English tagger to the foreign side of bitext, then use the projected labels to learn an HMM model. More recent work applied the projection-based approach to more language pairs, and further improved performance through the use of type-level constraints from a tag dictionary and feature-rich generative or discriminative models (Das and Petrov, 2011; Täckström et al., 2013).

In our work, we propose a new projection-based method that differs in two important ways. First, we never explicitly project the labels. Instead, we project expectations over the labels.

[1] For experimental purposes, we designate English as the resource-rich language, and other languages of interest as "foreign". In our experiments, we simulate the resource-poor scenario using Chinese and German, even though in reality these two languages are quite rich in resources.

This projection acts as a soft constraint over the labels, which allows us to transfer more information and uncertainty across language boundaries. Secondly, we encode the expectations as constraints and train a model by minimizing the divergence between model expectations and projected expectations in a Generalized Expectation (GE) Criteria (Mann and McCallum, 2010) framework.

We evaluate our approach on Named Entity Recognition (NER) tasks for English-Chinese and English-German language pairs on standard public datasets. We report results in two settings: a weakly supervised setting where no labeled data or only a small amount of labeled data is available, and a semi-supervised setting where labeled data is available, but we can gain predictive power by learning from unlabeled bitext.

2 Related Work

Most semi-supervised learning approaches embody the principle of learning from constraints. There are two broad categories of constraints: multi-view constraints, and external knowledge constraints.

Examples of methods that explore multi-view constraints include self-training (Yarowsky, 1995; McClosky et al., 2006),[2] co-training (Blum and Mitchell, 1998; Sindhwani et al., 2005), multi-view learning (Ando and Zhang, 2005; Carlson et al., 2010), and discriminative and generative model combination (Suzuki and Isozaki, 2008; Druck and McCallum, 2010).

An early example of using knowledge as constraints in weakly-supervised learning is the work by Collins and Singer (1999). They showed that the addition of a small set of "seed" rules greatly improves a co-training style unsupervised tagger. Chang et al. (2007) proposed a constraint-driven learning (CODL) framework where constraints are used to guide the selection of the best self-labeled examples to be included as additional training data in an iterative EM-style procedure. The kind of constraints used in applications such as NER are ones like "the words CA, Australia, NY are LOCATION" (Chang et al., 2007). Notice the similarity of this particular constraint to the kinds of features one would expect to see in a discriminative MaxEnt model. The difference is that instead of learning the validity (or weight) of this feature from labeled examples, since we do not have them, we can constrain the model using our knowledge of the domain. Druck et al. (2009) also demonstrated that in an active learning setting where the annotation budget is limited, it is more efficient to label features than examples. Other sources of knowledge include lexicons and gazetteers (Druck et al., 2007; Chang et al., 2007).

While it is straightforward to see how resources such as a list of city names can give a lot of mileage in recognizing locations, we are also exposed to the danger of over-committing to hard constraints. For example, it becomes problematic with city names that are ambiguous, such as Augusta, Georgia.[3] To soften these constraints, Mann and McCallum (2010) proposed the Generalized Expectation (GE) Criteria framework, which encodes constraints as a regularization term over some score function that measures the divergence between the model's expectation and the target expectation. The connection between GE and CODL is analogous to the relationship between hard (Viterbi) EM and soft EM, as illustrated by Samdani et al. (2012).

Another closely related work is the Posterior Regularization (PR) framework by Ganchev et al. (2010). In fact, as Bellare et al. (2009) have shown, in a discriminative model these two methods optimize exactly the same objective.[4] The two differ in optimization details: PR uses an EM algorithm to approximate the gradients, which avoids the expensive computation of a covariance matrix between features and constraints, whereas GE directly calculates the gradient. However, later results (Druck, 2011) have shown that using the Expectation Semiring techniques of Li and Eisner (2009), one can compute the exact gradients of GE in a Conditional Random Field (CRF) (Lafferty et al., 2001) at costs no greater than computing the gradients of an ordinary CRF.

[2] A multi-view interpretation of self-training is that the self-tagged additional data offers new views to learners trained on existing labeled data.
[3] This is a city in the state of Georgia in the USA, famous for its golf courses. It is ambiguous since both Augusta and Georgia can also be used as person names.
[4] The different terminology employed by GE and PR may be confusing to discerning readers, but "expectation" in the context of GE means the same thing as "marginal posterior" in PR.

And empirically, GE tends to perform more accurately than PR (Bellare et al., 2009; Druck, 2011).

Obtaining appropriate knowledge resources for constructing constraints remains a bottleneck in applying GE and PR to new languages. However, a number of past works recognize parallel bitext as a rich source of linguistic constraints, naturally captured in the translations. As a result, bitext has been effectively utilized for unsupervised multilingual grammar induction (Alshawi et al., 2000; Snyder et al., 2009), parsing (Burkett and Klein, 2008), and sequence labeling (Naseem et al., 2009).

A number of recent works also explored bilingual constraints in the context of simultaneous bilingual tagging, and showed that enforcing agreements between language pairs gives superior results to monolingual tagging (Burkett et al., 2010; Che et al., 2013; Wang et al., 2013a). Burkett et al. (2010) also demonstrated an uptraining (Petrov et al., 2010) setting where tag-induced bitext can be used as additional monolingual training data to improve monolingual taggers. A major drawback of this approach is that it requires readily-trained tagging models in each language, which makes a weakly supervised setting infeasible. Another intricacy of this approach is that it only works when the two models have comparable strength, since mutual agreements are enforced between them.

Projection-based methods can be very effective in weakly-supervised scenarios, as demonstrated by Yarowsky and Ngai (2001), and Xi and Hwa (2005). One problem with projected labels is that they are often too noisy to be directly used as training signals. To mitigate this problem, Das and Petrov (2011) designed a label propagation method to automatically induce a tag lexicon for the foreign language to smooth the projected labels. Fossum and Abney (2005) filter out projection noise by combining projections from multiple source languages. However, this approach is not always viable since it relies on having parallel bitext from multiple source languages. Li et al. (2012) proposed the use of crowd-sourced Wiktionary as an additional resource for inducing tag lexicons. More recently, Täckström et al. (2013) combined token-level and type-level constraints to constrain legitimate label sequences and recalibrate the probability distribution in a CRF. The tag dictionaries used for POS tagging are analogous to the gazetteers and name lexicons used for NER by Chang et al. (2007).

Our work is also closely related to Ganchev et al. (2009). They used a two-step projection method similar to Das and Petrov (2011) for dependency parsing. Instead of using the projected linguistic structures as ground truth (Yarowsky and Ngai, 2001), or as features in a generative model (Das and Petrov, 2011), they used them as constraints in a PR framework. Our work differs by projecting expectations rather than Viterbi one-best labels. We also choose the GE framework over PR. Experiments in Bellare et al. (2009) and Druck (2011) suggest that in a discriminative model (like ours), GE is more accurate than PR. More recently, Ganchev and Das (2013) further extended this line of work to directly train discriminative sequence models using cross-lingual projection with PR. The types of constraints applied in this new work are similar to the ones in the monolingual PR setting proposed by Ganchev et al. (2010), where the total counts of labels of a particular kind are expected to match some fraction of the projected total counts. Our work differs in that we enforce expectation constraints at the token level, which gives tighter guidance in learning the model.

3 Approach

Given bitext between English and a foreign language, our goal is to learn a CRF model in the foreign language from little or no labeled data. Our method performs Cross-Lingual Projected Expectation Regularization (CLiPER).

For every aligned sentence pair in the bitext, we first compute the posterior marginal at each word position on the English side using a pre-trained English CRF tagger; then for each aligned English word, we project its posterior marginal as expectations to the aligned word position on the foreign side. Figure 1 shows a snippet of a sentence from a real corpus. Notice that if we were to directly project the Viterbi best assignment from English to Chinese, all three Chinese words that are named entities would have gotten the wrong tags. But projecting the English CRF model expectations preserves some uncertainties, informing the Chinese model that there is a 40% chance that "中国日报" (China Daily) is an organization in this context.
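The projection step just described can be sketched in a few lines. The following is our own illustration, not the authors' released code; the function and variable names are hypothetical. It assumes the English tagger exposes per-token marginals and that word alignments arrive as (English index, foreign index) pairs; it also shows the "hard" argmax variant discussed later in Section 3.2.

```python
from collections import defaultdict

LABELS = ["O", "PER", "LOC", "ORG", "GPE"]  # OntoNotes label set used in the paper

def project_expectations(en_marginals, alignments, hard=False):
    """Project English per-token posterior marginals onto foreign tokens.

    en_marginals: list of dicts, one per English token, mapping label -> probability.
    alignments: iterable of (en_index, fr_index) pairs from an automatic aligner.
    Returns a dict fr_index -> {label: expectation}, defined only for foreign
    positions with at least one aligned English word (others get no constraint).
    """
    buckets = defaultdict(list)
    for en_i, fr_i in alignments:
        buckets[fr_i].append(en_marginals[en_i])

    targets = {}
    for fr_i, dists in buckets.items():
        # Average expectations when several English words align to one foreign word.
        avg = {l: sum(d[l] for d in dists) / len(dists) for l in LABELS}
        if hard:
            # "Hard" expectation: all probability mass on the argmax label.
            best = max(avg, key=avg.get)
            avg = {l: 1.0 if l == best else 0.0 for l in LABELS}
        targets[fr_i] = avg
    return targets
```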

[Figure 1: Diagram illustrating the projection of model expectations from English to Chinese, for the aligned pair "... a reception in Luobu Linka ... met with representatives of Zhongguo Ribao" / "在罗布林卡举行的招待会……会见了中国日报代表". The posterior probabilities assigned by the English CRF model are shown above each English word (for example, PER: 0.5925 and ORG: 0.4060 over the words of "Zhongguo Ribao"); automatically induced word alignments are shown in red; the correct projected labels for Chinese words are shown in green, and incorrect labels are shown in red.]

We would like to learn a CRF model in the foreign language that has expectations similar to the projected expectations from English. To this end, we adopt the Generalized Expectation (GE) Criteria framework introduced by Mann and McCallum (2010). In the remainder of this section, we follow the notation used in (Druck, 2011) to explain our approach.

3.1 CLiPER

The general idea of GE is that we can express our preferences over models through constraint functions. A desired model should satisfy the imposed constraints by matching the expectations on these constraint functions with some target expectations (attained by external knowledge like lexicons or, in our case, knowledge transferred from English).

We define a constraint function $\phi_{i,l_j}$ for each word position $i$ and output label assignment $l_j$; $\phi_{i,l_j} = 0$ encodes the constraint that position $i$ cannot take label $l_j$. The set $\{l_1, \cdots, l_m\}$ denotes all possible label assignments for each $y_i$, and $m$ is the number of label values. $A_i$ is the set of English words aligned to Chinese word $i$. $\phi_{i,l_j}$ is defined for all positions $i$ such that $A_i \neq \emptyset$; in other words, the constraint function applies only to Chinese word positions that have at least one aligned English word. Each $\phi_{i,l_j}(\mathbf{y})$ can be treated as a Bernoulli random variable, and we concatenate the set of all $\phi_{i,l_j}$ into a random vector $\boldsymbol{\phi}(\mathbf{y})$, where $\phi_k = \phi_{i,l_j}$ if $k = i * m + j$. We drop the $(\mathbf{y})$ in $\boldsymbol{\phi}(\mathbf{y})$ for simplicity.

The target expectation over $\phi_{i,l_j}$, denoted $\tilde{\phi}_{i,l_j}$, is the expectation of assigning label $l_j$ to the English words in $A_i$ under the English conditional probability model. When multiple English words are aligned to the same foreign word, we average the expectations.

The expectation over $\boldsymbol{\phi}$ under a conditional probability model $P(\mathbf{y}|\mathbf{x};\theta)$ is denoted as $E_{P(\mathbf{y}|\mathbf{x};\theta)}[\boldsymbol{\phi}]$, and simplified to $E_\theta[\boldsymbol{\phi}]$ whenever it is unambiguous. The conditional probability model $P(\mathbf{y}|\mathbf{x};\theta)$ in our case is defined as a standard linear-chain CRF:[5]

$$P(\mathbf{y}|\mathbf{x};\theta) = \frac{1}{Z(\mathbf{x};\theta)} \exp\left(\sum_i^n \theta^\top \mathbf{f}(\mathbf{x}, y_i, y_{i-1})\right)$$

where $\mathbf{f}$ is a set of feature functions, $\theta$ are the matching parameters to learn, and $n = |\mathbf{x}|$.

The objective function to maximize in a standard CRF is the log probability over a collection of labeled documents:

$$\mathcal{L}_{CRF}(\theta) = \sum_{a=1}^{a'} \log P(\mathbf{y}^*_a|\mathbf{x}_a;\theta) \qquad (1)$$

where $a'$ is the number of labeled sentences and $\mathbf{y}^*$ is an observed label sequence.

[5] We simplify notation by dropping the L2 regularizer in the CRF definition, but apply it in our experiments.
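Before moving to the GE objective, the toy check below makes the constraint-vector bookkeeping concrete. It is our own illustration under simplified assumptions, not the paper's implementation: it enumerates all label sequences of a tiny arbitrary distribution instead of a real CRF, and verifies that $E_\theta[\boldsymbol{\phi}]$ is exactly the stack of per-token marginals $P(y_i = l_j)$ at aligned positions, which forward-backward would supply in a real linear-chain CRF.

```python
import itertools
import numpy as np

n, m = 3, 2        # toy sentence length and number of label values
aligned = [0, 2]   # positions i with A_i nonempty (i.e., with aligned English words)

def phi(y):
    """Constraint vector: phi[i*m + j] = 1 iff y_i == l_j, for aligned positions i."""
    v = np.zeros(n * m)
    for i in aligned:
        v[i * m + y[i]] = 1.0
    return v

# Any distribution over label sequences stands in for the CRF P(y|x; theta) here.
rng = np.random.default_rng(0)
p = rng.random(m ** n)
p /= p.sum()
seqs = list(itertools.product(range(m), repeat=n))

# E_theta[phi] computed by brute-force enumeration ...
E_phi = sum(pi * phi(y) for pi, y in zip(p, seqs))

# ... equals the stacked token-level marginals P(y_i = l_j) at aligned positions.
for i in aligned:
    for j in range(m):
        marginal = sum(pi for pi, y in zip(p, seqs) if y[i] == j)
        assert np.isclose(E_phi[i * m + j], marginal)
```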

The objective function to maximize in GE is defined as the sum, over all unlabeled examples on the foreign side of the bitext (denoted $\mathbf{x}_b$), of some cost function $S$ between the model expectation over $\boldsymbol{\phi}$ ($E_\theta[\boldsymbol{\phi}]$) and the target expectation ($\tilde{\boldsymbol{\phi}}$). We choose $S$ to be the negative $L_2^2$ squared error sum,[6] defined as:

$$\mathcal{L}_{GE}(\theta) = \sum_{b=1}^{n'} S\left(E_{P(\mathbf{y}_b|\mathbf{x}_b;\theta)}[\boldsymbol{\phi}(\mathbf{y}_b)], \tilde{\boldsymbol{\phi}}_b\right) = \sum_{b=1}^{n'} -\left\|\tilde{\boldsymbol{\phi}}_b - E_\theta[\boldsymbol{\phi}(\mathbf{y}_b)]\right\|_2^2 \qquad (2)$$

where $n'$ is the total number of unlabeled bitext sentence pairs.

When both labeled and bitext training data are available, the joint objective is the sum of Eqns. 1 and 2, each computed over the labeled training data and the foreign half of the bitext, respectively. We can optimize this joint objective by computing the gradients and using a gradient-based optimization method such as L-BFGS. The gradient of $\mathcal{L}_{CRF}$ decomposes down to the gradients over each labeled training example $(\mathbf{x}, \mathbf{y}^*)$. Computing the gradient of $\mathcal{L}_{GE}$ decomposes down to the gradients of $S(E_{P(\mathbf{y}|\mathbf{x}_b;\theta)}[\boldsymbol{\phi}])$ for each unlabeled foreign sentence $\mathbf{x}_b$ and the constraints $\boldsymbol{\phi}$ over this example. The gradients can be calculated as:

$$\frac{\partial}{\partial\theta} S(E_\theta[\boldsymbol{\phi}]) = -\frac{\partial}{\partial\theta}\left(\tilde{\boldsymbol{\phi}} - E_\theta[\boldsymbol{\phi}]\right)^\top \left(\tilde{\boldsymbol{\phi}} - E_\theta[\boldsymbol{\phi}]\right) = 2\left(\tilde{\boldsymbol{\phi}} - E_\theta[\boldsymbol{\phi}]\right)^\top \left(\frac{\partial}{\partial\theta} E_\theta[\boldsymbol{\phi}]\right)$$

We denote the penalty vector $2(\tilde{\boldsymbol{\phi}} - E_\theta[\boldsymbol{\phi}])^\top$ by $\mathbf{u}$. $\frac{\partial}{\partial\theta} E_\theta[\boldsymbol{\phi}]$ is a matrix in which each column contains the gradients for a particular model feature $\theta$ with respect to all constraint functions $\boldsymbol{\phi}$. It can be computed as:

$$\begin{aligned}
\frac{\partial}{\partial\theta} E_\theta[\boldsymbol{\phi}]
&= \sum_{\mathbf{y}} \boldsymbol{\phi}(\mathbf{y}) \frac{\partial}{\partial\theta} P(\mathbf{y}|\mathbf{x};\theta) \\
&= \sum_{\mathbf{y}} \boldsymbol{\phi}(\mathbf{y}) \frac{\partial}{\partial\theta}\left(\frac{1}{Z(\mathbf{x};\theta)} \exp\left(\theta^\top \mathbf{f}(\mathbf{x},\mathbf{y})\right)\right) \\
&= \sum_{\mathbf{y}} \boldsymbol{\phi}(\mathbf{y}) \left(\frac{1}{Z(\mathbf{x};\theta)}\left(\frac{\partial}{\partial\theta} \exp\left(\theta^\top \mathbf{f}(\mathbf{x},\mathbf{y})\right)\right) + \exp\left(\theta^\top \mathbf{f}(\mathbf{x},\mathbf{y})\right)\left(\frac{\partial}{\partial\theta} \frac{1}{Z(\mathbf{x};\theta)}\right)\right) \\
&= \sum_{\mathbf{y}} \boldsymbol{\phi}(\mathbf{y}) \left(P(\mathbf{y}|\mathbf{x};\theta)\,\mathbf{f}(\mathbf{x},\mathbf{y})^\top - P(\mathbf{y}|\mathbf{x};\theta) \sum_{\mathbf{y}'} P(\mathbf{y}'|\mathbf{x};\theta)\,\mathbf{f}(\mathbf{x},\mathbf{y}')^\top\right) \\
&= \sum_{\mathbf{y}} P(\mathbf{y}|\mathbf{x};\theta)\,\boldsymbol{\phi}(\mathbf{y})\,\mathbf{f}(\mathbf{x},\mathbf{y})^\top - \left(\sum_{\mathbf{y}} P(\mathbf{y}|\mathbf{x};\theta)\,\boldsymbol{\phi}(\mathbf{y})\right)\left(\sum_{\mathbf{y}} P(\mathbf{y}|\mathbf{x};\theta)\,\mathbf{f}(\mathbf{x},\mathbf{y})^\top\right) \\
&= \mathrm{COV}_{P(\mathbf{y}|\mathbf{x};\theta)}\left(\boldsymbol{\phi}(\mathbf{y}), \mathbf{f}(\mathbf{x},\mathbf{y})\right) \qquad (3) \\
&= E_\theta[\boldsymbol{\phi}\mathbf{f}^\top] - E_\theta[\boldsymbol{\phi}]\,E_\theta[\mathbf{f}]^\top \qquad (4)
\end{aligned}$$

Eqn. 3 gives the intuition of how optimization works in GE. In each iteration of L-BFGS, the model parameters are updated according to their covariance with the constraint features, scaled by the difference between the current expectation and the target expectation. The term $E_\theta[\boldsymbol{\phi}\mathbf{f}^\top]$ in Eqn. 4 can be computed using a dynamic programming (DP) algorithm, but solving it directly requires us to store a matrix of the same dimension as $\mathbf{f}^\top$ in each step of the DP. We can reduce the complexity by using the same trick as in (Li and Eisner, 2009) for computing the Expectation Semiring. The resulting algorithm has complexity $O(nm^2)$, which is the same as the standard forward-backward inference algorithm for CRFs. (Druck, 2011, p. 93) gives the full details of this derivation.

[6] In general, other loss functions such as KL-divergence can also be used for $S$. We found $L_2^2$ to work well in practice.
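To make Eqns. 3 and 4 concrete, the toy check below (our own illustration under simplified assumptions, not the paper's CRF code) uses a generic log-linear model over sequences, enumerating all sequences instead of running forward-backward or the expectation-semiring trick, and verifies numerically that the Jacobian of $E_\theta[\boldsymbol{\phi}]$ equals the covariance between the constraint functions and the features.

```python
import itertools
import numpy as np

# Toy log-linear model over label sequences: P(y; theta) proportional to exp(theta . f(y)).
rng = np.random.default_rng(0)
n, m, d = 3, 2, 4                    # tokens, labels, feature dimension
seqs = list(itertools.product(range(m), repeat=n))
F = rng.normal(size=(len(seqs), d))  # f(y) for every sequence, fixed at random
PHI = rng.integers(0, 2, size=(len(seqs), 3)).astype(float)  # 3 binary constraint fns

def expectations(theta):
    """Return E[phi] and the covariance matrix of Eqn. 4 under P(y; theta)."""
    p = np.exp(F @ theta)
    p /= p.sum()
    E_phi = PHI.T @ p
    # E[phi f^T] - E[phi] E[f]^T  (a 3 x d matrix)
    cov = (PHI * p[:, None]).T @ F - np.outer(E_phi, p @ F)
    return E_phi, cov

theta = rng.normal(size=d)
E_phi, cov = expectations(theta)

# Finite-difference check of d/d(theta) E[phi] = COV(phi, f), column by column.
eps = 1e-6
for k in range(d):
    t2 = theta.copy()
    t2[k] += eps
    numeric = (expectations(t2)[0] - E_phi) / eps
    assert np.allclose(numeric, cov[:, k], atol=1e-4)
```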

3.2 Hard vs. Soft Projection

Projecting expectations instead of one-best label assignments from English to the foreign language can be thought of as a soft version of the method described in (Das and Petrov, 2011) and (Ganchev et al., 2009). Soft projection has its advantage: when the English model is not certain about its predictions, we do not have to commit to the current best prediction. The foreign model has more freedom to form its own beliefs, since any marginal distribution it produces would deviate from a flat distribution by just about the same amount. In general, preserving uncertainty till later is a strategy that has benefited many NLP tasks (Finkel et al., 2006). Hard projection can also be treated as a special case in our framework. We can simply recalibrate the posterior marginals of English by assigning probability mass 1 to the most likely outcome, and zeroing everything else out, effectively taking the argmax of the marginal at each word position. We refer to this version of expectation as the "hard" expectation. In the hard projection setting, GE training resembles a "project-then-train" style semi-supervised CRF training scheme (Yarowsky and Ngai, 2001; Täckström et al., 2013). In such a training scheme, we project the one-best predictions of the English CRF to the foreign side through word alignments, then include the newly "tagged" foreign data as additional training data for a standard CRF in the foreign language.

Rather than projecting labels on a per-word basis, Yarowsky and Ngai (2001) also explored an alternative method for the noun-phrase (NP) bracketing task that amounts to projecting the spans of NPs, based on the observation that individual NPs tend to retain their sequential spans across translations. We experimented with the same method for NER, but found that this method of projecting the NE spans does not help in reducing noise and actually lowers model performance.

Besides the difference of projecting expectations rather than hard labels, our method and the "project-then-train" scheme also differ by optimizing different objectives: the CRF optimizes the maximum conditional likelihood of the observed label sequence, whereas GE minimizes the squared error between the model's expectation and the "hard" expectation based on the observed label sequence. In the case where the squared error loss is replaced with a KL-divergence loss, GE has the same effect as marginalizing out all positions with unknown projected labels, allowing more robust learning of uncertainties in the model. As we will show in the experimental results in Section 4.2, soft projection in combination with the GE objective significantly outperforms the project-then-train style CRF training scheme.

       O       PER    LOC   ORG    GPE
O      291339  39     114   1128   1221
PER    1263    6721   5     67     3
LOC    409     23     546   123    133
ORG    2423    143    52    8387   196
GPE    566     239    69    668    6604

       O      PER    LOC    ORG    MISC
O      81209  24     38     155    103
PER    77     5725   4      169    10
LOC    49     40     3743   127    60
ORG    178    102    142    4075   91
MISC   175    41     30     114    1826

Table 1: Raw counts in the error confusion matrix of the English CRF models. The top table contains the counts on OntoNotes test data, and the bottom table contains CoNLL-03 test data counts. Rows are the true labels and columns are the observed labels. For example, the item at row 2, column 3 of the top table reads: we observed 5 times where the true label should be PERSON, but the English CRF model output the label LOCATION.

3.3 Source-side Noise

An additional source of noise comes from errors generated by the source-side English CRF models. We know that the English CRF model gives an F1 score of 81.68% on the OntoNotes dataset for the English-Chinese experiment, and 90.45% on the CoNLL-03 dataset for the English-German experiment. We present a simple way of modeling English-side noise by picturing the following process: the labels assigned by the English CRF model (denoted $y$) are a noised version of the true labels (denoted $y^*$). We can recover the probability of the true labels by marginalizing over the observed labels: $P(y^*|\mathbf{x}) = \sum_y P(y^*|y)\,P(y|\mathbf{x})$. $P(y|\mathbf{x})$ is the posterior probability given by the CRF model, and we can approximate $P(y^*|y)$ by the column-normalized error confusion matrix shown in Table 1.

This source-side noise model is likely to be overly simplistic. Generally speaking, we could build a much more sophisticated noising model for the source side, possibly conditioning on context, or capturing higher-order label sequences.
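A small numeric sketch of this marginalization follows; the counts are illustrative only (a reduced three-label confusion matrix, not the counts of Table 1).

```python
import numpy as np

# Rows of C are true labels, columns are observed labels, as in Table 1;
# column-normalizing C approximates P(true | observed).
C = np.array([[9000.,  30.,  20.],   # counts: true O observed as O / PER / LOC
              [ 110., 800.,   5.],   # true PER
              [  90.,  10., 400.]])  # true LOC
P_true_given_obs = C / C.sum(axis=0, keepdims=True)  # column-normalize

# Posterior over observed labels from the English CRF at one token.
p_obs = np.array([0.1, 0.7, 0.2])

# Marginalize out the observed label: P(y*|x) = sum_y P(y*|y) P(y|x).
p_true = P_true_given_obs @ p_obs
assert np.isclose(p_true.sum(), 1.0)
```

The denoised marginal p_true would then replace the raw English marginal before projection.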

4 Experiments

We conduct experiments on Chinese and German NER. We evaluate CLiPER in two learning settings: weakly supervised and semi-supervised. In the weakly supervised setting, we simulate the condition of having no labeled training data, and evaluate the model learned from bitext alone. We then vary the amount of labeled data available to the model, and examine the model's learning curve. In the semi-supervised setting, we assume our model has access to the full labeled data; our goal is to improve the performance of the supervised method by learning from additional bitext.

4.1 Dataset and Setup

We used the latest version of the Stanford NER Toolkit[7] as our base CRF model in all experiments. Features for the English, Chinese and German CRFs are documented extensively in (Che et al., 2013) and (Faruqui and Padó, 2010) and omitted here for brevity. It is worth noting that the current Stanford NER models include recent improvements from semi-supervised learning approaches that induce distributional similarity features from large word clusters. These models represent the current state-of-the-art in supervised methods, and serve as a very strong baseline.

For the Chinese NER experiments, we follow the same setup as Che et al. (2013) and evaluate on the latest OntoNotes (v4.0) corpus (Hovy et al., 2006).[8] A total of 8,249 sentences from the parallel Chinese and English Penn Treebank portion[9] are reserved for evaluation. Odd-numbered documents are used as the development set, and even-numbered documents are held out as the blind test set. The rest of OntoNotes annotated with NER tags is used to train the English and Chinese CRF base taggers. There are about 16k and 39k labeled sentences for Chinese and English training, respectively. The English CRF tagger trained on this training corpus gives an F1 score of 81.68% on the OntoNotes test set. Four entity types[10] are used for both Chinese and English with an IO tagging scheme.[11] The English-Chinese bitext comes from the Foreign Broadcast Information Service corpus (FBIS).[12] We randomly sampled 80k parallel sentence pairs to use as bitext in our experiments. It is first sentence aligned using the Champollion Tool Kit,[13] then word aligned with the Berkeley Aligner.[14]

For the German NER experiments, we evaluate using the standard CoNLL-03 NER corpus (Sang and Meulder, 2003). The labeled training set has 12k and 15k sentences, containing four entity types.[15] An English CRF model is also trained on the CoNLL-03 English data with the same entity types. For bitext, we used a randomly sampled set of 40k parallel sentences from the de-en portion of the News Commentary dataset.[16] The English CRF tagger trained on the CoNLL-03 English training corpus gives an F1 score of 90.4% on the CoNLL-03 test set.

We report typed entity precision (P), recall (R) and F1 score. Statistical significance tests are done using a paired bootstrap resampling method with 1000 iterations, averaged over 5 runs. We compare against three recent approaches that were introduced in Section 2: the semi-supervised learning method using factored bilingual models with Gibbs sampling (Wang et al., 2013a); bilingual NER using Integer Linear Programming (ILP) with bilingual constraints, by Che et al. (2013); and the constraint-driven bilingual-reranking approach (Burkett et al., 2010). The code from (Che et al., 2013) and (Wang et al., 2013a) is publicly available.[17] Code from (Burkett et al., 2010) was obtained through personal communication.

Since the objective function in Eqn. 2 is non-convex, we adopted the early stopping training scheme from (Turian et al., 2010), as follows: after each iteration in L-BFGS training, the model is evaluated against the development set; the training procedure is terminated if no improvements have been made in 20 iterations.

[7] http://www-nlp.stanford.edu/ner
[8] LDC catalogue No.: LDC2011T03
[9] File numbers: chtb0001-0325, ectb1001-1078
[10] PERSON, LOCATION, ORGANIZATION and GPE.
[11] We did not adopt the commonly seen BIO tagging scheme (Ramshaw and Marcus, 1999), because when projected across swapping word alignments, the "B-" and "I-" tag distinction may not be well-preserved and may introduce additional noise.
[12] The FBIS corpus is a collection of radio newscasts and contains translations of openly available news and information from media sources outside the United States. The LDC catalogue No. is LDC2003E14.
[13] champollion.sourceforge.net
[14] code.google.com/p/berkeleyaligner
[15] PERSON, LOCATION, ORGANIZATION and MISCELLANEOUS.
[16] http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
[17] https://github.com/stanfordnlp/CoreNLP
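The paired bootstrap test used above follows a standard recipe; the sketch below is our own illustration of that recipe (the authors' exact implementation is not shown in the paper, and the per-sentence count arrays are hypothetical inputs).

```python
import numpy as np

def paired_bootstrap_f1(stats_a, stats_b, iters=1000, seed=0):
    """Paired bootstrap significance test on F1.

    stats_a / stats_b: arrays of per-sentence (tp, fp, fn) counts for systems
    A and B on the same sentences. Returns the fraction of resamples in which
    A's F1 is not better than B's (a p-value-like score; small values suggest
    A significantly outperforms B).
    """
    rng = np.random.default_rng(seed)
    stats_a, stats_b = np.asarray(stats_a), np.asarray(stats_b)
    n = len(stats_a)
    losses = 0
    for _ in range(iters):
        idx = rng.integers(0, n, size=n)  # resample sentences with replacement
        f1 = []
        for s in (stats_a[idx], stats_b[idx]):
            tp, fp, fn = s.sum(axis=0)
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            f1.append(2 * p * r / (p + r) if p + r else 0.0)
        losses += f1[0] <= f1[1]
    return losses / iters
```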

4.2 Weakly Supervised Results

Figures 2a and 2b show the results of the weakly supervised learning experiments. Quite remarkably, on the Chinese test set, our proposed method (CLiPER) achieves an F1 score of 64.4% with 80k bitext, when no labeled training data is used. In contrast, the supervised CRF baseline would require as much as 12k labeled sentences to attain the same accuracy. Results on the German test set are less striking. With no labeled data and 40k of bitext, CLiPER performs at an F1 of 60.0%, the equivalent of using 1.5k labeled examples in the supervised setting. When combined with 1k labeled examples, the performance of CLiPER reaches 69%, a gain of over 5% absolute over the supervised CRF. We also notice that the supervised CRF model learns much faster in German than in Chinese. This result is not too surprising, since it is well recognized that Chinese NER is more challenging than German or English. The best supervised results for Chinese are 10-20% (F1 score) behind the best German and English supervised results. Chinese NER relies more on lexicalized features, and therefore needs more labeled data to achieve good coverage. The results suggest that CLiPER is very effective at transferring lexical knowledge from English to Chinese.

Figures 2c and 2d compare soft GE projection with hard GE projection and the "project-then-train" style CRF training scheme (cf. Section 3.2). We observe that both soft and hard GE projection significantly outperform the "project-then-train" style training scheme. The difference is especially pronounced on the Chinese results when fewer labeled examples are available. Soft projection gives better accuracy than hard projection when no labeled data is available, and also has a faster learning rate.

Incorporating source-side noise using the method described in Section 3.3 gives a small improvement on Chinese with no supervised data, increasing the F1 score from 64.40% to 65.50%. This improvement is statistically significant at the 92% confidence level. However, on the German data, we observe a tiny decrease in F1 score with no statistical significance, dropping from 59.88% to 59.66%. A likely explanation of the difference is that the English CRF model in the English-Chinese experiment, which is trained on OntoNotes data, has a much higher error rate (18.32%) than the English CRF model in the English-German experiment trained on CoNLL-03 (9.55%). Therefore, modeling noise in the English-Chinese case is likely to have a greater effect than in the English-German case.

4.3 Semi-supervised Results

In the semi-supervised experiments, we let the CRF model use the full set of labeled examples in addition to the unlabeled bitext. Results on the test set are shown in Table 2. All semi-supervised baselines are tested with the same number of unlabeled bitext sentences as CLiPER in each language. The "project-then-train" semi-supervised training scheme severely hurts performance on Chinese, but gives a small improvement on German. Moreover, on Chinese it learns to achieve high precision but at a significant loss in recall. On German its behavior is the opposite. Such drastic and erratic imbalance suggests that this method is not robust or reliable. The other three semi-supervised baselines (rows 3-5) all show improvements over the CRF baseline, consistent with their reported results. CLiPERs gives the best results on both Chinese and German, yielding statistically significant improvements over all baselines except for CWD13 on German. The hard projection version of CLiPER also gives a sizable gain over the CRF. However, in comparison, CLiPERs is superior.

The improvement of CLiPERs over the CRF on the Chinese test set is over 2.8% in absolute F1. The improvement over the CRF on German is almost a percent. To our knowledge, these are the best reported numbers on the OntoNotes Chinese and CoNLL-03 German datasets.

4.4 Efficiency

Another advantage of our proposed approach is efficiency. Because we eliminated the previous multi-stage "uptraining" paradigm, instead integrating the semi-supervised and supervised objectives into one joint objective, we are able to attain significant speed improvements over all methods except CRFptt. Table 3 shows the required training time.

[Figure 2: six panels. (a) Chinese Test; (b) German Test; (c) Soft vs. Hard on Chinese Test; (d) Soft vs. Hard on German Test. In panels (a)-(d) the horizontal axis is the number of labeled training sentences [k] and the vertical axis is the F1 score [%] on the test set; (a) and (b) plot supervised CRF against CLiPER-soft, while (c) and (d) plot CRF projection, CLiPER-hard, and CLiPER-soft. (e) "[高岗] 纪念碑在 [横山] 落成" / "A monument commemorating [Vice President Gao Gang PER] was completed in [Hengshan LOC]" (the word preceding "monument" is a PERSON); (f) "[碛口] [毛主席] 东渡 [黄河] 纪念碑简介" / "Introduction of [Qikou LOC] [Chairman Mao PER] [Yellow River LOC] crossing monument" (the word preceding "monument" is a LOCATION). Caption: The top four figures show performance curves of CLiPER with varying amounts of available labeled training data in a weakly supervised setting. Vertical axes show the F1 score on the test set. Performance curves of the supervised CRF and the "project-then-train" CRF are plotted for comparison. The bottom two figures are examples of aligned sentence pairs in Chinese and English.]

          Chinese                      German
          P      R      F1            P      R      F1
CRF       79.09  63.59  70.50         86.69  71.30  78.25
CRFptt    84.01  45.29  58.85         81.50  75.56  78.41
BPBK10    79.25  65.67  71.83         84.00  72.17  77.64
CWD13     81.31  65.50  72.55         85.99  72.98  78.95
WCD13a    80.31  65.78  72.33         85.98  72.37  78.59
WCD13b    78.55  66.54  72.05         85.19  72.98  78.62
CLiPERh   83.67  64.80  73.04 §‡      86.52  72.02  78.61 ∗
CLiPERs   82.57  65.99  73.35 §†⋆◊∗   87.11  72.56  79.17 ‡⋆∗§

Table 2: Test set Chinese and German NER results. The best number in each column is highlighted in bold. CRF is the supervised baseline. CRFptt is the "project-then-train" semi-supervised scheme for CRF. BPBK10 is (Burkett et al., 2010), CWD13 is (Che et al., 2013), WCD13a is (Wang et al., 2013a), and WCD13b is (Wang et al., 2013b). CLiPERs and CLiPERh are the soft and hard projections. § indicates F1 scores that are statistically significantly better than the CRF baseline at the 99.5% confidence level; ⋆ marks significance over CRFptt with 99.5% confidence; † and ‡ mark significance over WCD13 with 99.9% and 94% confidence; ◊ marks significance over CWD13 with 99.7% confidence; ∗ marks significance over BPBK10 with 99.9% confidence.

          Chinese  German
CRF       19m30s   7m15s
CRFptt    34m2s    12m45s
CWD13     3h17m    1h1m
WCD13a    16h42m   4h49m
WCD13b    16h42m   4h49m
BPBK10    6h16m    2h42m
CLiPERh   1h28m    16m30s
CLiPERs   1h40m    18m51s

Table 3: Timing statistics during model training.

5 Discussions

Figures 2e and 2f give two examples of cross-lingual projection methods in action. Both examples have a named entity that immediately precedes the word "纪念碑" (monument) in the Chinese sentence. In Figure 2e, the word "高岗" has the literal meaning of a hillock located at a high position, but it also happens to be the name of a former vice president of China. Without having previously observed this word as a person name in the labeled training data, the CRF model does not have enough evidence to believe that this is a PERSON instead of a LOCATION. But the aligned words in English ("Gao Gang") are clearly part of a person name, as they are preceded by a title ("Vice President"). The English model has a high expectation that the Chinese word aligned to "Gao Gang" is also a PERSON. Therefore, projecting the English expectations to Chinese provides a strong clue to help disambiguate this word. Figure 2f gives another example: the word "黄河" (Huang He, the Yellow River of China) can be confused with a person name, since "黄" (Huang or Hwang) is also a common Chinese last name.[18] Again, knowing the translation in English, which has the indicative word "River" in it, helps disambiguation.

The CRFptt and CLiPERh methods successfully labeled these two examples correctly, but failed to produce the correct label for the example in Figure 1. On the other hand, a model trained with the CLiPERs method does correctly label both entities in Figure 1, demonstrating the merits of the soft projection method.

6 Conclusion

We introduced a domain and language independent semi-supervised method for training discriminative models by projecting expectations across bitext. Experiments on Chinese and German NER show that our method, learned over bitext alone, can rival the performance of supervised models trained with thousands of labeled examples. Furthermore, applying our method in a setting where all labeled examples are available also shows improvements over state-of-the-art supervised methods. Our experiments also showed that soft expectation projection is preferable to hard projection. This technique can be generalized to all sequence labeling tasks, and can be extended to include more complex constraints. For future work, we plan to apply this method to more language pairs and also explore data selection strategies and the modeling of alignment uncertainties.

[18] In fact, a people search of the name 黄河 on the most popular Chinese social network (renren.com) returns over 13,000 matches.

Acknowledgments

The authors would like to thank Jennifer Gillenwater for a discussion that inspired this work, Behrang Mohit and Nathan Schneider for their help with the Arabic NER data, and David Burkett for providing the source code of their work for comparison. We would also like to thank editor Lillian Lee and the three anonymous reviewers for their valuable comments and suggestions. We gratefully acknowledge the support of the U.S. Defense Advanced Research Projects Agency (DARPA) Broad Operational Language Translation (BOLT) program through IBM. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA or the US government.

References

Hiyan Alshawi, Srinivas Bangalore, and Shona Douglas. 2000. Head-transducer models for speech translation and their automatic acquisition from bilingual data. Machine Translation, 15.

Rie Kubota Ando and Tong Zhang. 2005. A high-performance semi-supervised learning method for text chunking. In Proceedings of ACL.

Kedar Bellare, Gregory Druck, and Andrew McCallum. 2009. Alternating projections for learning with expectation constraints. In Proceedings of UAI.

Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of COLT.

David Burkett and Dan Klein. 2008. Two languages are better than one (for syntactic parsing). In Proceedings of EMNLP.

David Burkett, Slav Petrov, John Blitzer, and Dan Klein. 2010. Learning better monolingual models with unannotated bilingual text. In Proceedings of CoNLL.

Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of WSDM.

Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2007. Guiding semi-supervision with constraint-driven learning. In Proceedings of ACL.

Wanxiang Che, Mengqiu Wang, and Christopher D. Manning. 2013. Named entity recognition with bilingual constraints. In Proceedings of NAACL.

Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of EMNLP.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL.

Gregory Druck and Andrew McCallum. 2010. High-performance semi-supervised learning using discriminatively constrained generative models. In Proceedings of ICML.

Gregory Druck, Gideon Mann, and Andrew McCallum. 2007. Leveraging existing resources using generalized expectation criteria. In Proceedings of the NIPS Workshop on Learning Problem Design.

Gregory Druck, Burr Settles, and Andrew McCallum. 2009. Active learning by labeling features. In Proceedings of EMNLP.

Gregory Druck. 2011. Generalized Expectation Criteria for Lightly Supervised Learning. Ph.D. thesis, University of Massachusetts Amherst.

Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a German named entity recognizer with semantic generalization. In Proceedings of KONVENS.

Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of EMNLP.

Victoria Fossum and Steven Abney. 2005. Automatically inducing a part-of-speech tagger by projecting from multiple source languages across aligned corpora. In Proceedings of IJCNLP.

Kuzman Ganchev and Dipanjan Das. 2013. Cross-lingual discriminative learning of sequence models with posterior regularization. In Proceedings of EMNLP.

Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of ACL.

Kuzman Ganchev, Joao Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. JMLR, 10:2001-2049.

Andrew B. Goldberg. 2010. New Directions in Semi-supervised Learning. Ph.D. thesis, University of Wisconsin-Madison.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of NAACL-HLT.

Dan Klein. 2005. The Unsupervised Learning of Natural Language Structure. Ph.D. thesis, Stanford University.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of EMNLP.

Shen Li, Joao Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of EMNLP-CoNLL.

Percy Liang. 2005. Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.

Gideon Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR, 11:955-984.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of NAACL-HLT.

Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. Multilingual part-of-speech tagging: Two unsupervised approaches. JAIR, 36:1076-9757.

Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and Hiyan Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Proceedings of EMNLP.

Lance A. Ramshaw and Mitchell P. Marcus. 1999. Text chunking using transformation-based learning. Natural Language Processing Using Very Large Corpora, 11:157-176.

Rajhans Samdani, Ming-Wei Chang, and Dan Roth. 2012. Unified expectation maximization. In Proceedings of NAACL.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of CoNLL.

Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. 2005. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views.

Noah A. Smith. 2006. Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text. Ph.D. thesis, Johns Hopkins University.

Benjamin Snyder, Tahira Naseem, and Regina Barzilay. 2009. Unsupervised multilingual grammar induction. In Proceedings of ACL.

Jun Suzuki and Hideki Isozaki. 2008. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proceedings of ACL.

Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. In Proceedings of ACL.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL.

Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. 2013a. Effective bilingual constraints for semi-supervised learning of named entity recognizers. In Proceedings of AAAI.

Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. 2013b. Joint word alignment and bilingual named entity recognition using dual decomposition. In Proceedings of ACL.

Chenhai Xi and Rebecca Hwa. 2005. A backoff model for bootstrapping resources for non-English languages. In Proceedings of HLT-EMNLP.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of NAACL.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL.