Transactions of the Association for Computational Linguistics, vol. 4, pp. 507–519, 2016. Action Editor: Jason Eisner.
Submission batch: 3/2016; Revision batch: 5/2016; Published 11/2016.

© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Minimally Supervised Number Normalization

Kyle Gorman and Richard Sproat
Google, Inc.
111 8th Ave., New York, NY, USA

Abstract

We propose two models for verbalizing numbers, a key component in speech recognition and synthesis systems. The first model uses an end-to-end recurrent neural network. The second model, drawing inspiration from the linguistics literature, uses finite-state transducers constructed with a minimal amount of training data. While both models achieve near-perfect performance, the latter model can be trained using several orders of magnitude less data than the former, making it particularly useful for low-resource languages.

1 Introduction

Many speech and language applications require text tokens to be converted from one form to another. For example, in text-to-speech synthesis, one must convert digit sequences (32) into number names (thirty-two), and appropriately verbalize date and time expressions (12:47 → twelve forty-seven) and abbreviations (kg → kilograms) while handling allomorphy and morphological concord (e.g., Sproat, 1996). Quite a bit of recent work on SMS (e.g., Beaufort et al., 2010) and text from social media sites (e.g., Yang and Eisenstein, 2013) has focused on detecting and expanding novel abbreviations (e.g., cn u plz hlp). Collectively, such conversions all fall under the rubric of text normalization (Sproat et al., 2001), but this term means radically different things in different applications. For instance, it is not necessary to detect and verbalize dates and times when preparing social media text for downstream information extraction, but this is essential for speech applications.

While expanding novel abbreviations is also important for speech (Roark and Sproat, 2014), numbers, times, dates, measure phrases and the like are far more common in a wide variety of text genres. Following Taylor (2009), we refer to categories such as cardinal numbers, times, and dates—each of which is semantically well-circumscribed—as semiotic classes. Some previous work on text normalization proposes minimally-supervised machine learning techniques for normalizing specific semiotic classes, such as abbreviations (e.g., Chang et al., 2002; Pennell and Liu, 2011; Roark and Sproat, 2014). This paper continues this tradition by contributing minimally-supervised models for normalization of cardinal number expressions (e.g., ninety-seven). Previous work on this semiotic class includes formal linguistic studies by Corstius (1968) and Hurford (1975) and computational models proposed by Sproat (1996; 2010) and Kanis et al. (2005). Of all semiotic classes, numbers are by far the most important for speech, as cardinal (and ordinal) numbers are not only semiotic classes in their own right, but knowing how to verbalize numbers is important for most of the other classes: one cannot verbalize times, dates, measures, or currency expressions without knowing how to verbalize that language's numbers as well.

One computational approach to number name verbalization (Sproat, 1996; Kanis et al., 2005) employs a cascade of two finite-state transducers (FSTs). The first FST factors the integer, expressed as a digit sequence, into sums of products of powers of ten (i.e., in the case of a base-ten number system). This is composed with a second FST that defines how the numeric factors are verbalized, and may also handle allomorphy or morphological concord in languages that require it.
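As an illustration of this division of labor (not the FST implementation used in the systems discussed here; the function and table names below are our own), the two steps can be sketched in plain Python for English citation forms:

    def english_factors(n):
        """Factor n (0 < n < 10**6) into the sums of products of powers of
        ten underlying its English verbalization, e.g. 97000 -> [90, 7, 1000]."""
        parts = []
        if n >= 1000:
            parts += english_factors(n // 1000) + [1000]
            n %= 1000
        if n >= 100:
            parts += [n // 100, 100]
            n %= 100
        if n >= 20:
            parts += [n // 10 * 10]  # decades: 20, 30, ..., 90
            n %= 10
        if n:
            parts += [n]  # 1-19 are simplex number names in English
        return parts

    # The second step replaces each numeric factor with a number name
    # (only the names needed for the example are listed here).
    NAMES = {7: "seven", 90: "ninety", 100: "hundred", 1000: "thousand"}

    def verbalize(n):
        return " ".join(NAMES[f] for f in english_factors(n))

    print(english_factors(97000))  # [90, 7, 1000]
    print(verbalize(97000))        # ninety seven thousand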

Number names can be relatively easy (as in English) or complex (as in Russian; Sproat, 2010), and thus these FSTs may be relatively easy or quite difficult to develop. While the Google text-to-speech (TTS) (see Ebden and Sproat, 2014) and automatic speech recognition (ASR) systems depend on hand-built number name grammars for about 70 languages, developing these grammars for new languages requires extensive research and labor. For some languages, a professional linguist can develop a new grammar in as little as a day, but other languages may require days or weeks of effort. We have also found that it is very common for these hand-written grammars to contain difficult-to-detect errors; indeed, the computational models used in this study revealed several long-standing bugs in hand-written number grammars.

The amount of time, effort, and expertise required to produce error-free number grammars leads us to consider machine learning solutions. Yet it is important to note that number verbalization poses a dauntingly high standard of accuracy compared to nearly all other speech and language tasks. While one might forgive a TTS system that reads the ambiguous abbreviation plz as plaza rather than the intended please, it would be inexcusable for the same system to ever read 72 as four hundred seventy two, even if it rendered the vast majority of numbers correctly.

To set the stage for this work, we first (§2–3) briefly describe several experiments with a powerful and popular machine learning technique, namely recurrent neural networks (RNNs). When provided with a large corpus of parallel data, these systems are highly accurate, but may still produce occasional errors, rendering them unusable for applications like TTS. In order to give the reader some background on the relevant linguistic issues, we then review some cross-linguistic properties of cardinal number expressions and propose a finite-state approach to number normalization informed by these linguistic properties (§4). The core of the approach is an algorithm for inducing language-specific number grammar rules. We evaluate this technique on data from four languages.

Figure 1: The neural net architecture for the preliminary Russian cardinal number experiments. Purple LSTM layers perform forwards transitions and green LSTM layers perform backwards transitions. The output is produced by a CTC layer with a softmax activation function. Input tokens are characters and output tokens are words.

2 Preliminary experiment with recurrent neural networks

As part of a separate strand of research, we have been experimenting with various recurrent neural network (RNN) architectures for problems in text normalization. In one set of experiments, we trained RNNs to learn a mapping from digit sequences marked with morphosyntactic (case and gender) information to their expression as Russian cardinal number names. The motivation for choosing Russian is that the number name system of this language, like that of many Slavic languages, is quite complicated, and therefore serves as a good test of the abilities of any text normalization system.

The architecture used was similar to a network employed by Rao et al. (2015) for grapheme-to-phoneme conversion, a superficially similar sequence-to-sequence mapping problem. We used a recurrent network with an input layer, four hidden feed-forward LSTM layers (Hochreiter and Schmidhuber, 1997), and a connectionist temporal classification (CTC) output layer with a softmax activation function (Graves et al., 2006).¹ Two of the hidden layers modeled forward sequences and the other two backward sequences. There were 32 input nodes—corresponding to characters—and 153 output nodes—corresponding to predicted number name words. Each of the hidden layers had 256 nodes. The full architecture is depicted in Figure 1.

¹ Experiments with a non-CTC softmax output layer yielded consistently poor results, and we do not report them here.
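Purely as an illustration of this kind of architecture (our choice of framework and of incidental details such as the embedding size; it is not the implementation used in these experiments), a comparable bidirectional-LSTM-plus-CTC model can be sketched in PyTorch:

    import torch
    import torch.nn as nn

    class NumberVerbalizer(nn.Module):
        """Character-to-word model loosely following the description above:
        four LSTM layers (two forward, two backward; 256 nodes each) and a
        CTC-trained softmax output layer. All sizes are illustrative."""

        def __init__(self, n_chars=32, n_words=153, hidden=256):
            super().__init__()
            self.embed = nn.Embedding(n_chars, hidden)
            # A 2-layer bidirectional LSTM gives 2 forward + 2 backward layers.
            self.lstm = nn.LSTM(hidden, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_words + 1)  # +1 for the CTC blank

        def forward(self, x):                   # x: (batch, time) character ids
            h, _ = self.lstm(self.embed(x))     # (batch, time, 2 * hidden)
            return self.out(h).log_softmax(-1)  # log-probs over output words

    model = NumberVerbalizer()
    ctc = nn.CTCLoss(blank=0)
    x = torch.randint(0, 32, (8, 20))           # a fake batch of inputs
    logp = model(x).permute(1, 0, 2)            # CTC wants (time, batch, label)
    y = torch.randint(1, 154, (8, 5))           # fake target word sequences
    loss = ctc(logp, y,
               torch.full((8,), 20, dtype=torch.long),
               torch.full((8,), 5, dtype=torch.long))
    loss.backward()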

The system was trained on 22M unique digit sequences ranging from one to one million; these were collected by applying an existing TTS text normalization system to several terabytes of web text. Each training example consisted of a digit sequence, gender and case features, and the Russian cardinal number verbalization of that number. Thus, for example, the system has to learn to produce the feminine instrumental form of 60. Examples of these mappings are shown in Table 1, and the various inflected forms of a single cardinal number are given in Table 2. In preliminary experiments, it was discovered that short digit sequences were poorly modeled due to undersampling, so an additional 240,000 short sequence samples (of three or fewer digits) were added to compensate. 2.2M examples (10%) were held out as a development set.

The system was trained for one day, after which it had a 0% label error rate (LER) on the development data set. When decoding 240,000 tokens of held-out test data with this model, we achieved very high accuracy (LER < .0001). The few remaining errors, however, are a serious obstacle to using this system for TTS. The model appears to make no mistakes applying inflectional suffixes to unseen data. Plausibly, this task was made easier by our positioning of the morphological feature string at the end of the input, making it local to the output inflectional suffix (at least for the last word in the number expression). But it does make errors with respect to the numeric value of the expression. For example, for 9801 plu. ins. (девятью тысячами восьмьюстами одними), the system produced девятью тысячами семьюстами одними (9701 plu. ins.): the morphology is correct, but the numeric value is wrong.²

This pattern of errors was exactly the opposite of what we want for speech applications. One might forgive a TTS system that reads 9801 with the correct numeric value but in the wrong case form: a listener would likely notice the error but would usually not be misled about the message being conveyed. In contrast, reading it as nine thousand seven hundred and one is completely unacceptable, as this would actively mislead the listener.

It is worth pointing out that the training set used here—22M examples—was quite large, and we were only able to obtain such a large amount of labeled data because we already had a high-quality hand-built grammar designed to do exactly this transduction. It is simply unreasonable to expect that one could obtain this amount of parallel data for a new language (e.g., from naturally-occurring examples, or from speech transcriptions). This problem is especially acute for low-resource languages (i.e., most of the world's languages), where data is by definition scarce, but where it is also hard to find high-quality linguistic resources or expertise, and where a machine learning approach is thus most needed.

In conclusion, the system does not perform as well as we demand, nor is it in any case a practical solution due to the large amount of training data needed. The RNN appears to have done an impressive job of learning the complex inflectional morphology of Russian, but it occasionally chooses the wrong number names altogether.

² The exact number of errors and their particular details varied from run to run.

Table 1: Example input and output data (and glosses) for the Russian RNN experiments.

5 neu. gen. → пяти (five)
24 mas. acc. → двадцать четыре (twenty-four)
99 plu. ins. → девяноста девятью (ninety-nine)
11 fem. nom. → одиннадцать (eleven)
81 fem. gen. → восьмидесяти одной (eighty-one)
60 fem. ins. → шестьюдесятью (sixty)
91 neu. ins. → девяноста одним (ninety-one)
3 mas. gen. → трёх (three)

Table 2: Inflectional forms of the cardinal number "60" in Russian.

шестьдесят nominative (nom.)
шестидесяти genitive (gen.)
шестидесяти dative (dat.)
шестьдесят accusative (acc.)
шестьюдесятью instrumental (ins.)
шестидесяти prepositional (pre.)

3 Number normalization with RNNs

For the purposes of more directly comparing the performance of RNNs with the methods we report on below, we chose to ignore the issue of allomorphy and morphological concord, which appears to be "easy" for generic sequence models like RNNs, and focus instead on verbalizing number expressions in whatever morphological category represents the language's citation form.

3.1 Data and general approach

For our experiments we used three parallel data sets where the target number name was in citation form (in Russian, nominative case):

• A large set consisting of 28,000 examples extracted from several terabytes of web text using an existing TTS text normalization system
• A medium set of 9,000 randomly-generated examples (for details, see Appendix A)
• A minimal set of 300 examples, consisting of the counting numbers up to 200, and 100 carefully-chosen examples engineered to cover a wide variety of phenomena

A separate set of 1,000 randomly-generated examples were held out for evaluation. The minimal set was intended to be representative of the sort of data one might obtain from a native speaker when asked to provide all the essential information about number names in their language.³

In these experiments we used two different RNN models. The first was the same LSTM architecture as above (henceforth referred to as "LSTM"), except that the numbers of input and output nodes were 13 and 53, respectively, due to the smaller input and output vocabularies. The second was a TensorFlow-based RNN with an attention mechanism (Mnih et al., 2014), using an overall architecture similar to that used in a system for end-to-end speech recognition (Chan et al., 2016). Specifically, we used a 4-layer pyramidal bidirectional LSTM reader that reads input characters, a layer of 256 attentional units, and a 2-layer decoder that produces word sequences. The reader is referred to Chan et al., 2016 for further details. Henceforth we refer to this model as "Attention". All models were trained for 24 hours, at which point they were determined to have converged.

³ Note that the native speaker in question merely needs to be able to answer questions of the form "how do you say '23' in your language?"; they do not need to be linguistically trained. In contrast, hand-built grammars require at least some linguistic sophistication on the part of the grammarian.

3.2 Results and discussion

Results for these experiments on a test corpus of 1,000 random examples are given in Table 3. The RNN with attention clearly outperformed the LSTM in that it performed perfectly with both the medium and large training sets, whereas the LSTM made a small percentage of errors. Note that since the numbers were in citation form, there was little room for the LSTM to make inflectional errors, and the errors it made were all of the "silly" variety, in which the output simply denotes the wrong number. But neither system was capable of learning valid transductions given just 300 training examples.⁴

Table 3: Accuracies on a test corpus of 1,000 random Russian citation-form number-name examples for the two RNN architectures. "Overlap" indicates the percentage of the test examples that are also found in the training data.

Training size   LSTM Acc.   Attention Acc.   Overlap
28,000          0.999       1.000            56%
9,000           0.994       1.000            0%
300             <0.001      <0.001           <1%

We draw two conclusions from these results. First, even a powerful machine learning model known to be applicable to a wide variety of problems may not be appropriate for all superficially-similar problems. Second, it remains to be seen whether any RNN could be designed to learn effectively from an amount of data as small as our smallest training set. Learning from minimal data sets is of great practical concern, and we will proceed to provide a plausible solution to this problem below. We note again that very low error rates do not ensure that a system is usable, since not all errors are equally forgivable.

⁴ The failure of the RNNs to generalize from the minimal training set suggests they are evidently not expressive enough for the sort of "clever" inference that is needed to generalize from so little data. It is plausible that an alternative RNN architecture could learn with such a small amount of data, though we leave it to future research to discover just what such an architecture might be. In an attempt to provide the RNNs with additional support, we also performed an evaluation with the minimal training set in which inputs were encoded so that each decimal position above 0 was represented with a letter (A for 10, B for 100, and so forth). Thus 2034 was represented as 2C3A4. In principle, this ought to have prevented errors which fail to take positional information into account. Unfortunately, this made no difference whatsoever.

4 Number normalization with finite-state transducers

The problem of number normalization naturally decomposes into two subproblems: factorization and verbalization of the numeric factors. We first consider the latter problem, the simpler of the two.

Let λ be the set of number names in the target language, and let ν be the set of numerals, the integers denoted by a number name. Then let L : ν∗ → λ∗ be a transducer which replaces a sequence of numerals with a sequence of number names. For instance, for English, L will map 90 7 to ninety seven. In languages where there are multiple allomorphs or case forms for a numeral, L will be non-functional (i.e., one-to-many); we return to this issue shortly. In nearly all cases, however, there are no more than a few dozen numerals in ν,⁵ and no more than a few names in λ for the equivalent numeral in ν. Therefore, we assume it is possible to construct L with minimal effort and minimal knowledge of the language. Indeed, all the information needed to construct L for the experiments conducted in this paper can be found in English-language Wikipedia articles.

The remaining subproblem, factorization, is responsible for converting digit sequences to numeral factors. In English, for example, 97000 is factored as 90 7 1000. Factorization is also language-specific. In Standard French, for example, there is no simplex number name for '90'; instead this is realized as quatre-vingt-dix "four twenty ten", and thus 97000 (quatre-vingt-dix-sept mille) is factored as 4 20 10 7 1000. It is not a priori obvious how one might go about learning language-specific factorizations. For inspiration, we turn to a lesser-known body of linguistics research focusing on number grammars.

⁵ At worst, a small number of languages, such as several Indic languages of North India, effectively use unique numerals for all counting numbers up to 100.

Hurford (1975) surveys cross-linguistic properties of number naming and proposes a syntactic representation which directly relates verbalized number names to the corresponding integers. Hurford interprets complex number constructions as arithmetic expressions in which operators (and the parentheses indicating associativity) have been elided. By far the two most common arithmetic operations are multiplication and addition.⁶ In French, for example, the expression dix-sept, literally 'ten seven', denotes 17, the sum of its terms, and quatre-vingt(s), literally 'four twenty', refers to 80, the product of its terms. These may be combined, as in quatre-vingt-dix-sept. To visualize arithmetic operations and associativities, we henceforth write factorizations using s-expressions—pre-order serializations of k-ary trees—with numeral terminals and arithmetic operator non-terminals. For example, quatre-vingt-dix-sept is written (+ (* 4 20) 10 7).

Within any language there are cues to this elided arithmetic structure. In some languages, some or all addends are separated by a word translated as and. In other languages it is possible to determine whether terms are to be multiplied or summed depending on their relative magnitudes. In French (as in English), for instance, an expression X Y usually is interpreted as a product if X < Y, as in quatre-vingt(s) '80', and as a sum if X > Y, as in vingt-quatre '24'. Thus the problem of number denormalization—that is, recovering the integer denoted by a verbalized number—can be thought of as a special case of grammar induction from pairs of natural language expressions and their denotations (e.g., Kwiatkowski et al., 2011).

⁶ Some languages make use of half-counting, or multiplication by one half (e.g., Welsh hanner cant, '50', literally 'half hundred'), or back-counting, i.e., subtraction (e.g., Latin undevīgintī, '19', literally 'one from twenty'; Menninger, 1969, 94f.). But these do not reduce the generality of the approach here.
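To make the elided-arithmetic reading concrete, the following minimal Python sketch (ours, for exposition only) evaluates factorizations written as nested tuples, mirroring the s-expressions above:

    from functools import reduce
    import operator

    OPS = {"+": operator.add, "*": operator.mul}

    def evaluate(tree):
        """Evaluate a factorization given as a nested tuple, e.g.
        ("+", ("*", "4", "20"), "10", "7") for quatre-vingt-dix-sept."""
        if isinstance(tree, str):  # a numeral terminal
            return int(tree)
        op, *args = tree           # an elided arithmetic operator
        return reduce(OPS[op], (evaluate(a) for a in args))

    print(evaluate(("+", ("*", "4", "20"), "10", "7")))  # 97
    print(evaluate(("*", ("+", "90", "7"), "1000")))     # 97000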

4.1 FST model

The complete model consists of four components:

1. A language-independent covering grammar F, transducing from integers expressed as digit sequences to the set of possible factorizations for that integer
2. A language-specific numeral map M, transducing from digit sequences to numerals
3. A language-specific verbalization grammar G, accepting only those factorizations which are licit in the target language
4. A language-specific lexical map L, transducing from sequences of numerals (e.g., 20) to number names (already defined)

As the final component, the lexical map L, has already been described, we proceed to describe the remaining three components of the system.

4.1.1 Finite-state transducer algorithms

While we assume the reader has some familiarity with FSTs, we first provide a brief review of a few key algorithms we employ below.

Our FST model is constructed using composition, denoted by the ◦ operator. When both arguments to composition are transducers, composition is equivalent to chaining the two relations described. For example, if A transduces string x to string y, and B transduces y to z, then A ◦ B transduces from string x to string z. When the left-hand side of composition is a transducer and the right-hand side is an acceptor, then their composition produces a transducer in which the range of the left-hand side relation is intersected with the set of strings accepted by the right-hand side argument. Thus if A transduces string x to strings {y, z}, and B accepts y, then A ◦ B transduces from x to y.

We make use of two other fundamental operations, namely inversion and projection. Every transducer A has an inverse denoted by A⁻¹, which is the transducer such that A⁻¹(y) → x if and only if A(x) → y. Any transducer A also has input and output projections denoted by πi(A) and πo(A), respectively. If the transducer A has the domain α∗ and the range β∗, then πi(A) is the acceptor over α∗ which accepts x if and only if A(x) → y for some y ∈ β∗; output projection is defined similarly. The inverse, input projection, and output projection of an FST (or a pushdown transducer) are computed by swapping and/or copying the input or output labels of each arc in the machine. See Mohri et al., 2002 for more details on these and other finite-state transducer algorithms.

4.1.2 Covering grammar

Let A be an FST which, when repeatedly applied to an arithmetic s-expression string, produces the s-expression's value. For example, one application of A to (+ (* 4 20) 10 7) produces (+ 80 10 7), and a second application produces 97. Let μ be the set of s-expression markup symbols {'(', ')', '+', '*'} and Δ be the set {0, 1, 2, …, 9}. Then

F : Δ∗ → (μ ∪ Δ)∗ = A⁻¹ ◦ A⁻¹ ◦ A⁻¹ ◦ … (1)

is an FST which transduces an integer expressed as a digit string to all its candidate factorizations expressed as s-expression strings.⁷ Let C(d) = πo(d ◦ F), which maps from a digit sequence d to the set of all possible factorizations—in any language—of that digit sequence, encoded as s-expressions. For example, C(97) contains strings such as:

(+ 90 7)
(+ 80 10 7)
(+ (* 4 20) 10 7)
…

⁷ In practice, our s-expressions never have a depth exceeding five, so we assume F = A⁻¹ ◦ A⁻¹ ◦ A⁻¹ ◦ A⁻¹ ◦ A⁻¹.
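The semantics of these operations can be made concrete with a small Python sketch (ours; real implementations such as the OpenFst/OpenGrm libraries operate on automata rather than enumerated relations) that models a transducer extensionally as a finite set of (input, output) string pairs:

    def compose(a, b):
        """A ◦ B: relates x to z whenever A maps x to y and B maps y to z."""
        return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}

    def invert(a):
        """A⁻¹: A⁻¹(y) → x if and only if A(x) → y."""
        return {(y, x) for (x, y) in a}

    def project_in(a):
        """πi(A): the input-side acceptor, here a set of strings."""
        return {x for (x, _) in a}

    def project_out(a):
        """πo(A): the output-side acceptor."""
        return {y for (_, y) in a}

    def acceptor(strings):
        """An acceptor is the identity relation over its language."""
        return {(s, s) for s in strings}

    # A transduces 97 to two candidate factorizations; B accepts only one of
    # them, so A ◦ B transduces 97 to that one (range intersection).
    A = {("97", "(+ 90 7)"), ("97", "(+ (* 4 20) 10 7)")}
    B = acceptor({"(+ 90 7)"})
    assert compose(A, B) == {("97", "(+ 90 7)")}
    assert project_out(compose(A, B)) == {"(+ 90 7)"}
    assert project_in(A) == {"97"}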

4.1.3 Grammar inference

Let M : (μ ∪ Δ)∗ → ν∗ be a transducer which deletes all markup symbols in μ and replaces sequences of integers expressed as digit sequences with the appropriate numerals in ν. Let D(l) = πi(M ◦ L ◦ l), which maps from a verbalization l to the set of all s-expressions which contain l as terminals. For example, D(4 20 10 7) contains:

(+ 4 20 10 7)
(+ 4 20 (* 10 7))
(+ (* 4 20) 10 7)
…

Then, given (d, l) where d ∈ Δ∗ is an integer expressed as a digit sequence, and l ∈ λ∗ is d's verbalization, their intersection

E(d, l) = C(d) ◦ D(l) (2)

will contain the factorization(s) of d that verbalize as l. In most cases, E will contain exactly one path for a given (d, l) pair. For instance, if d is 97000 and l is ninety seven thousand, E(d, l) is (* (+ 90 7) 1000).

We can use E to induce a context-free grammar (CFG) which accepts only those number verbalizations present in the target language. The simplest possible such CFG uses '*' and '+' as non-terminal labels, and the elements in the domain of L (e.g., 20) as terminals. The grammar will then consist of binary productions extracted from the s-expression derivations produced by E. Table 4 provides a fragment of such a grammar.

Table 4: A fragment of an English number grammar which accepts factorizations of the numbers {7, 90, 97, 7000, 90000, 97000}. S represents the start symbol, and '|' denotes disjunction. Note that this fragment is regular rather than context-free, though this is rarely the case for complete grammars.

S → (7 | 90 | * | +)
* → (7 | 90 | +) 1000
+ → 90 7

With this approach, we face the familiar issues of ambiguity and sparsity. Concerning the former, the output of E is not unique for all inputs. We address this either by applying normal form constraints on the set of permissible productions, or by ignoring ambiguous examples during induction. One case of ambiguity involves addition with 0 or multiplication by 1, both identity operations that leave the identity element (i.e., 0 or 1) free to associate either to the left or to the right. From our perspective, this ambiguity is spurious, so we stipulate that identity elements may only be siblings to (i.e., on the right-hand side of a production with) another terminal. Thus an expression like one thousand one hundred can only be parsed as (+ (* 1 1000) (* 1 100)). But not all ambiguity can be handled by normal form constraints. Some expressions are ambiguous due to the presence of "palindromes" in the verbalization string. For instance, two hundred two can either be parsed as (+ 2 (* 100 2)) or (+ (* 2 100) 2). The latter derivation is "correct" insofar as it follows the syntactic patterns of other English number expressions, but there is no way to determine this except with reference to the very language-specific patterns we are attempting to learn. Therefore we ignore such expressions during grammar induction, forcing the relevant rules to be induced from unambiguous expressions. Similarly, multiplication and addition are associative, so expressions like three hundred thousand can be binarized either as (* (* 3 100) 1000) or (* 3 (* 100 1000)), though both derivations are equally "correct". Once again we ignore such ambiguous expressions, instead extracting the relevant rules from unambiguous expressions.

Since we only admit two non-terminal labels, the vast majority of our rules contain numeral terminals on their right-hand sides, and as a result, the number of rules tends to be roughly proportional to the size of the terminal vocabulary. Thus it is common that we have observed, for example, thirteen thousand and fourteen million but not fourteen thousand or thirteen million, and as a result, the CFG may be deficient simply due to sparsity in the training data, particularly in languages with large terminal vocabularies. To enhance our ability to generalize from a small number of examples, we optionally insert preterminal labels during grammar induction to form classes of terminals assumed to pattern together in all productions. For instance, by introducing 'teen' and 'power_of_ten' preterminals, all four of the previous expressions are generated by the same top-level production. The full set of preterminal labels we use here is shown in Table 5.

Table 5: Optional preterminal rules.

digit → (2 | 3 | 4 | … 9)
teen → (11 | 12 | 13 | … 19)
decade → (20 | 30 | 40 | … 90)
century → (200 | 300 | 400 | … 900)
power_of_ten → (1000 | 10000 | …)
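The production-extraction step itself can be sketched in a few lines of Python (our illustration of the idea, not the authors' code): parse each unambiguous s-expression produced by E and record one production per internal node, with '+' and '*' as the only non-terminals.

    def tokenize(s):
        return s.replace("(", " ( ").replace(")", " ) ").split()

    def parse(tokens):
        """Recursive-descent reader for s-expression token lists."""
        tok = tokens.pop(0)
        if tok != "(":
            return tok                  # a numeral terminal
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)                   # consume ")"
        return node

    def extract(tree, rules):
        """Add one production per internal node; return the node's label."""
        if isinstance(tree, str):
            return tree
        op, *children = tree            # '+' or '*' non-terminal
        rules.add((op, tuple(extract(c, rules) for c in children)))
        return op

    rules = set()
    for sexpr in ["(+ 90 7)", "(* (+ 90 7) 1000)"]:
        extract(parse(tokenize(sexpr)), rules)
    print(sorted(rules))
    # [('*', ('+', '1000')), ('+', ('90', '7'))]

The two extracted productions correspond to "* → + 1000" and "+ → 90 7" in the grammar fragment of Table 4.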

In practice, obtaining productions using E is inefficient: it is roughly equivalent to a naïve algorithm which generates all possible derivations for the numerals given, then filters out all of those which do not evaluate to the expected total, violate the aforementioned normal form constraints, or are otherwise ambiguous. This fails to take advantage of top-down constraints derived from the particular structure of the problem. For example, the naïve algorithm entertains many candidate parses for quatre-vingt-dix-sept '97' where the root is '*' and the first child is '4', despite the fact that no such hypothesis is viable as 4 is not a divisor of 97.

We inject arithmetic constraints into the grammar induction procedure, as follows. The inputs to the modified algorithm are tuples of the form (T, ν0, …, νn) where T is the numeric value of the expression and ν0, …, νn are the n + 1 numerals in the verbalization. Consider a hypothesized numeric value of the leftmost child of the root, T0…i, which dominates ν0, …, νi, where i < n. […]

Results and discussion

The results were excellent for all four languages. There were no errors at all in English, Georgian, and Khmer with either data set. While there were a few errors in Russian, crucially all were agreement errors rather than errors in the factorization itself, exactly the opposite pattern of error to the ones we observed with the LSTM model. For example, 70,477,170 was rendered as семьдесят миллион четыреста семьдесят семь тысяч сто семьдесят; the second word should be миллионов, the genitive plural form. More surprisingly, verbalizers trained on the 300 examples of the minimal data set performed just as well as ones trained with two orders of magnitude more labeled data.

5 Discussion

We presented two approaches to number normalization. The first used a general RNN architecture that has been used for other sequence mapping problems, and the second an FST-based system that uses a fair amount of domain knowledge. The RNN approach can achieve very high accuracy, but with two caveats: it requires a large amount of training data, and the errors it makes may result in the wrong number. The FST-based solution on the other hand can learn from a tiny data set, and never makes that particularly pernicious type of error. The small size of training data needed and the high accuracy make this a particularly attractive approach for low-resource scenarios. In fact, we suspect that the FST model could be made to learn from a smaller number of examples than the 300 that make up the "minimal" set. Finding the minimum number of examples necessary to cover the entire number grammar appears to be a case of the set cover problem—which is NP-complete (Karp, 1972)—but it is plausible that a greedy algorithm could identify an even smaller training set.

The grammar induction method used for the FST verbalizer is near to the simplest imaginable such procedure: it treats rules as well-formed if and only if they have at least one unambiguous occurrence in the training data. More sophisticated induction methods could be used to improve both generalization and robustness to errors in the training data. Generalization might be improved by methods that "hallucinate" unobserved productions (Mohri and Roark, 2006), and robustness could be improved using manual or automated tree annotation (e.g., Klein and Manning, 2003; Petrov and Klein, 2007). We leave this for future work.

Above, we focused solely on cardinal numbers, and specifically their citation forms. However, in all four languages studied here, ordinal numbers share the same factorization and differ only superficially from cardinals. In this case, the ordinal number verbalizer can be constructed by applying a trivial transduction to the cardinal number verbalizer. However, it is an open question whether this is a universal or whether there may be some languages in which the discrepancy is much greater, so that separate methods are necessary to construct the ordinal verbalizer.
The FST verbalizer does not provide any mechanism for verbalization of numbers in morphological contexts other than citation form. One possibility would be to use a discriminative model to select the most appropriate morphological variant of a number in context. We also leave this for future work.

One desirable property of the FST-based system is that FSTs (and PDTs) are trivially invertible: if one builds a transducer that maps from digit sequences to number names, one can invert it, resulting in a transducer that maps number names to digit sequences. (Invertibility is not a property of any RNN solution.) This allows one, with the help of the appropriate target-side language model, to convert a normalization system into a denormalization system, that maps from spoken to written form rather than from written to spoken. During ASR decoding, for example, it is often preferable to use spoken representations (e.g., twenty-three) rather than the written forms (e.g., 23), and then perform denormalization on the resulting transcripts so they can be displayed to users in a more-readable form (Shugrina, 2010; Vasserman et al., 2015). In ongoing work we are evaluating FST verbalizers for use in ASR denormalization.

6 Conclusions

We have described two approaches to number normalization, a key component of speech recognition and synthesis systems. The first used a recurrent neural network and large amounts of training data, but very little knowledge about the problem space. The second used finite-state transducers and a learning method totally specialized for this domain but which requires very little training data. While the former approach is certainly more appealing given current trends in NLP, only the latter is feasible for low-resource languages which most need an automated approach to text normalization.

To be sure, we have not demonstrated that RNNs—or similar models—are inapplicable to this problem, nor does it seem possible to do so. However, number normalization is arguably a sequence-to-sequence transduction problem, and RNNs have been shown to be viable end-to-end solutions for similar problems, including grapheme-to-phoneme conversion (Rao et al., 2015) and machine translation (Sutskever et al., 2014), so one might reasonably have expected them to have performed better without making the "silly" errors that we observed. Much of the recent rhetoric about deep learning suggests that neural networks obviate the need for incorporating detailed knowledge of the problem to be solved; instead, one merely needs to feed pairs consisting of inputs and the required outputs, and the system will self-organize to learn the desired mapping (Graves and Jaitly, 2014). While that is certainly a desirable ideal, for this problem one can achieve a much more compact and data-efficient solution if one is willing to exploit knowledge of the domain.

Acknowledgements

Thanks to Cyril Allauzen, Jason Eisner, Michael Riley, Brian Roark, and Ke Wu for helpful discussion, and to Navdeep Jaitly and Haşim Sak for assistance with RNN modeling. All finite-state models were constructed using the OpenGrm libraries (http://opengrm.org).

References

Cyril Allauzen and Michael Riley. 2012. A pushdown transducer extension for the OpenFst library. In CIAA, pages 66–77.

Richard Beaufort, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In ACL, pages 770–779.

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP, pages 4960–4964.

Jeffrey T. Chang, Hinrich Schütze, and Russ B. Altman. 2002. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9(6):612–620.

H. Brandt Corstius, editor. 1968. Grammars for number names. D. Reidel, Dordrecht.

Peter Ebden and Richard Sproat. 2014. The Kestrel TTS text normalization system. Natural Language Engineering, 21(3):1–21.
Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In ICML, pages 1764–1772.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

James R. Hurford. 1975. The Linguistic Theory of Numerals. Cambridge University Press, Cambridge.

Jakub Kanis, Jan Zelinka, and Luděk Müller. 2005. Automatic number normalization in inflectional languages. In SPECOM, pages 663–666.

Richard M. Karp. 1972. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum, New York.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In ACL, pages 423–430.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In EMNLP, pages 1512–1523.

Karl Menninger. 1969. Number Words and Number Symbols. MIT Press, Cambridge. Translation of Zahlwort und Ziffer, published by Vanderhoeck & Ruprecht, Breslau, 1934.

Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS, pages 2204–2212.

Mehryar Mohri and Brian Roark. 2006. Probabilistic context-free grammar induction based on structural zeros. In NAACL, pages 312–319.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88.

Deana Pennell and Yang Liu. 2011. Toward text message normalization: Modeling abbreviation generation. In ICASSP, pages 5364–5367.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In NAACL, pages 404–411.

Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In ICASSP, pages 4225–4229.

Brian Roark and Richard Sproat. 2014. Hippocratic abbreviation expansion. In ACL, pages 364–369.

Maria Shugrina. 2010. Formatting time-aligned ASR transcripts for readability. In NAACL, pages 198–206.

Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words. Computer Speech and Language, 15(3):287–333.

Richard Sproat. 1996. Multilingual text analysis for text-to-speech synthesis. Natural Language Engineering, 2(4):369–380.

Richard Sproat. 2010. Lightly supervised learning of text normalization: Russian number names. In IEEE Workshop on Speech and Language Technology, pages 436–441.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

Paul Taylor. 2009. Text to Speech Synthesis. Cambridge University Press, Cambridge.

Lucy Vasserman, Vlad Schogol, and Keith Hall. 2015. Sequence-based class tagging for robust transcription in ASR. In INTERSPEECH, pages 473–477.

William A. Woods. 1970. Transition network grammars for natural language analysis. Communications of the ACM, 13(10):591–606.

Yi Yang and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In EMNLP, pages 61–72.
A Random sampling procedure

Randomly-generated data sets were produced by sampling from a Yule-Simon distribution with ρ = 1, then rounding each sample's k trailing digits, where k is a random variable in the discrete uniform distribution U{0, n} and n is the order of the sampled number. Duplicate samples were then removed. The following R function implements this procedure.

    require(VGAM)

    EPSILON <- 1e-12

    rnumbers <- function(n) {
      x <- ryules(n, rho = 1)
      num.digits <- floor(log10(x + EPSILON)) + 1
      sig.digits <- ceiling(runif(n, min = 0, max = num.digits))
      unique(signif(x, sig.digits))
    }

B Parse generation algorithm

The following algorithm is used to generate parses from parallel (written/spoken) data. It depends upon a procedure GetSubtrees(…) generating all possible labeled binary subtrees given a sequence of terminals, which is left as an exercise to the reader.

    1:  procedure GetOracles(T, v0, …, vn)   ▷ Total T, terminals v0, …, vn.
    2:    if n = 1 then
    3:      if eval(v0) = T then
    4:        yield s-expression v0
    5:      end if
    6:      return
    7:    end if
    8:    for i ∈ 1 … n − 1 do               ▷ Size of left child.
    9:      for all L ∈ GetSubtrees(v0, …, vi) do
    10:       TL ← eval(L)
    11:       TR ← T − TL                    ▷ Hypothesizes + root.
    12:       if TR > 0 then
    13:         for all R ∈ GetOracles(TR, vi+1, …, vn) do
    14:           yield s-expression (+ L R)
    15:         end for
    16:       end if
    17:       TR ← T / TL                    ▷ Hypothesizes * root.
    18:       if TR ∈ ℕ then                 ▷ "is a natural number".
    19:         for all R ∈ GetOracles(TR, vi+1, …, vn) do
    20:           yield s-expression (* L R)
    21:         end for
    22:       end if
    23:     end for
    24:   end for
    25: end procedure
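For concreteness, here is a runnable Python rendering of this procedure (ours; since GetSubtrees is left as an exercise above, the version below is our own filling-in), using nested tuples for s-expressions:

    def eval_tree(t):
        """Evaluate an s-expression given as an int or ('+'|'*', left, right)."""
        if isinstance(t, int):
            return t
        op, left, right = t
        l, r = eval_tree(left), eval_tree(right)
        return l + r if op == "+" else l * r

    def get_subtrees(vs):
        """All labeled binary trees over the terminal sequence vs."""
        if len(vs) == 1:
            yield vs[0]
            return
        for i in range(1, len(vs)):
            for left in get_subtrees(vs[:i]):
                for right in get_subtrees(vs[i:]):
                    yield ("+", left, right)
                    yield ("*", left, right)

    def get_oracles(total, vs):
        """Yield parses of the numerals vs consistent with numeric value total."""
        if len(vs) == 1:
            if eval_tree(vs[0]) == total:
                yield vs[0]
            return
        for i in range(1, len(vs)):             # size of the left child
            for left in get_subtrees(vs[:i]):
                tl = eval_tree(left)
                tr = total - tl                 # hypothesize a '+' root
                if tr > 0:
                    for right in get_oracles(tr, vs[i:]):
                        yield ("+", left, right)
                if tl > 0 and total % tl == 0:  # hypothesize a '*' root: T/TL ∈ ℕ
                    for right in get_oracles(total // tl, vs[i:]):
                        yield ("*", left, right)

    print(list(get_oracles(97, [4, 20, 10, 7])))
    # [('+', ('*', 4, 20), ('+', 10, 7)), ('+', ('+', ('*', 4, 20), 10), 7)]

Note that the two results for quatre-vingt-dix-sept are the two binarizations of the same flat sum, an instance of the associativity ambiguity discussed in §4.1.3.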
