计算语言学协会会刊, 卷. 4, PP. 87–98, 2016. 动作编辑器: Alexander Clark.
提交批次: 7/2015; 修改批次: 11/2015; 已发表 4/2016.
2016 计算语言学协会. 根据 CC-BY 分发 4.0 执照.
C
(西德:13)
LearningTier-basedStrictly2-LocalLanguagesAdamJardineandJeffreyHeinzUniversityofDelaware{ajardine,heinz}@udel.eduAbstractTheTier-basedStrictly2-Local(TSL2)lan-guagesareaclassofformallanguageswhichhavebeenshowntomodellong-distancephonotacticgeneralizationsinnaturallan-guage(Heinzetal.,2011).Thispaperin-troducestheTier-basedStrictly2-LocalIn-ferenceAlgorithm(2TSLIA),thefirstnon-enumerativelearnerfortheTSL2languages.Weprovethe2TSLIAisguaranteedtocon-vergeinpolynomialtimeonadatasamplewhosesizeisboundedbyaconstant.1IntroductionThisworkpresentstheTier-basedStrictly2-LocalInferenceAlgorithm(2TSLIA),anefficientlearn-ingalgorithmforaclassofTier-basedStrictlyLo-cal(TSL)formallanguages(Heinzetal.,2011).ATSLclassisdeterminedbytwoparameters:thetier,orsubsetofthealphabet,andthepermissibletierk-factors,whicharethelegalsequencesoflengthkallowedinthestring,onceallnon-tiersymbolshavebeenremoved.TheTier-basedStrictly2-Local(TSL2)languagesarethoseinwhichk=2.Aswillbediscussedbelow,theTSLlanguagesareofinteresttophonologybecausetheycanmodelawidevarietyoflong-distancephonotacticpatternsfoundinnaturallanguage(Heinzetal.,2011;Mc-MullinandHansson,即将推出).OneexampleisderivedfromLatinliquiddissimilation,inwhichtwolscannotappearinawordunlessthereisanrintervening,regardlessofdistance.Forexam-ple,floralis‘floral’iswell-formedbutnot*mili-talis(cf.militaris‘military’).Asexplainedinsec-tions2and4,thiscanbemodeledwithpermissible2-factorsoveratierconsistingoftheliquids{我,r}.Forlong-distancephonotactics,kcanbefixedto2,butitisdoesnotappearthatthetiercanbefixedsincelanguagesemployavarietyofdifferenttiers.Thispresentsaninterestinglearningproblem:Givenafixedk,howcananalgorithminducebothatierandasetofpermissibletierk-factorsfrompositivedata?Thereissomerelatedworkwhichaddressesthisquestion.GoldsmithandRiggle(2012),buildingonworkbyGoldsmithandXanthos(2009),presentamethodbasedonmutualinformationforlearn-ingtiersandsubsequentlylearningharmonypat-terns.Thispaperdiffersinthatitsmethodsarerootedfirmlyingrammaticalinferenceandformallanguagetheory(delaHiguera,2010).Forinstance,incontrasttotheresultspresentedthere,weprovethekindsofpatterns2TSLIAsucceedsonandthekindofdatasufficientforittodoso.Nonetheless,thereisrelevantworkincomputa-tionallearningtheory:金子(1967)provedthatanyfiniteclassoflanguagesisidentifiableinthelimitviaanenumerationmethod.Givenafixedalphabetandafixedk,thenumberofpossibletiersandper-missibletierk-factorsisfinite,andthuslearnableinthisway.However,suchlearnersaregrosslyinef-ficient.Noprovably-correct,non-enumerative,ef-ficientlearnerforboththetierandpermissibletierk-factorparametershaspreviouslybeenproposed.Thisworkfillsthisgapwithanalgorithmwhichlearnstheseparameterswhenk=2frompositivedataintimepolynomialinthesizeofthedata.Finally,Jardine(2016)presentsasimplifiedver-
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
t
我
A
C
_
A
_
0
0
0
8
5
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
88
sionof2TSLIAandreportstheresultsofsomesim-ulations.Unlikethatpaper,thepresentworkpro-videsthefullmathematicaldetailsandproofs.Thesimulationsarediscussedinthediscussionsection.Thispaperisstructuredasfollows.§2moti-vatestheworkwithexamplesfromnaturallanguagephonology.§3outlinesthebasicconceptsanddef-initionstobeusedthroughoutthepaper.§4definestheTSLlanguagesanddiscussestheirproperties.§5detailsthe2TSLIAandprovesitlearnstheTSL2classinpolynomialtimeanddata.§6discussesfu-turework,and§7concludes.2LinguisticmotivationTheprimarymotivationforstudyingtheTSLlan-guagesandtheirlearnabilitycomesfromtheirrel-evancetophonotactic(wordwell-formedness)pat-ternsinnaturallanguagephonology.Manyphono-tacticpatternsbelongtoeithertheStrictlyLocal(SL)语言(McNaughtonandPapert,1971)ortheStrictlyPiecewise(SP)语言(Rogersetal.,2010).1AnexampleofphonotacticknowledgewhichisSLisChomskyandHalle’s(ChomskyandHalle,1965)observationthatEnglishspeakerswillclassifyblickasapossiblewordofEnglishwhilerejectingbnickasimpossible.ThisisSLbecauseitcanbedescribedas*bnbeingunacceptableasaword-initialsequence.SPlanguagescandescribelong-distancedependenciesbasedonprecedencere-lationshipsbetweensounds(suchasconsonanthar-monyinSarcee,inwhichsmaynotfollowaSany-whereinaword,butmayprecedeone(厨师,1984))(Heinz,2010A).还,theSLandSPlanguagesareefficientlylearnable(Garc´ıaetal.,1990;Heinz,2010A;Heinz,2010乙;HeinzandRogers,2013).然而,therearesomelong-distancepatternswhichcannotbedescribedpurelywithprecedencerelationships.OneexampleisapatternfromLatin,inwhichincertaincasesanlcannotfollowanotherlunlessanrintervenes,nomatterthedistancebe-tweenthem(詹森,1974;Odden,1994;Heinz,2010A).Thiscanbeseeninthe-alisadjectivalsuf-fix(Example1),whichappearsas-arisiftheworditattachestoalreadycontainsanl((d)通过(F)1Therelationshipbetweentheseformallanguageclassesandhumancognition(linguisticandotherwise)isdiscussedinmoredetailinRogersandPullum(2011),RogersandHauser(2010),Rogersetal.(2013),andLai(2013;2015).inExample1),exceptincaseswherethereisanin-terveningr,inwhichitappearsagainas-alis((G)通过(我)inExample1).Intheexamples,thesoundsinquestionareunderlinedforemphasis(datafromHeinz(2010A)),andfor(d)通过(F),illicitformsaregiven,markedwithastar(*).Example1.a.navalis‘naval’b.episcopalis‘episcopal’c.infinitalis‘negative’d.solaris‘solar’(*solalis)e.lunaris‘lunar’(*lunalis)f.militaris‘military’(*militalis)g.floralis‘floral’h.sepulkralis‘funereal’i.litoralis‘oftheshore’Thisnon-localalternatingpatternisnotSPbe-causeSPlanguagescannotexpressblockingeffects(Heinz,2010A).然而,itcanbedescribedwithaTSLgrammarinwhichthetieris{r,我}andthepermissibletier2-factorsdonotinclude*lland*rr.Thisyieldsexactlythesetofstringsinwhichanlisalwaysimmediately(disregardingallsoundsbesides{r,我})followedbyanr,andviceversa.FormallearningalgorithmsforSLandSPlan-guagescanprovideamodelforhumanlearningofSLandSPsoundpatterns(Heinz,2010A).TSLlan-guagesarealsosimilarlylearnable,giventhestip-ulationthatboththetierandkarefixed.Fornat-urallanguage,thevalueforkneverseemstogoabove2(Heinzetal.,2011).然而,tiersvaryinhumanlanguage—TSLpatternsoccurbothwithdifferentkindsofvowels(Nevins,2010)andcon-sonants(Suzuki,1998;Bennett,2013).Forexam-ple,inTurkishthetieristheentirevowelinven-tory(ClementsandSezer,1982),whileinFinnishitisvowelsexcept/i,e/(Ringen,1975).InSamalaconsonantharmony,thetierissibilants(RoseandWalker,2004),whereasinKoorete,thetierissibi-lantsand{乙,r,G,d}(McMullinandHansson,forth-coming).因此,itisofinteresttounderstandhowboththetierandpermissibletier2-factorsforTSL2grammarsmightbelearnedefficiently.3PreliminariesBasicknowledgeofsettheoryisassumed.LetS1−S2denotethesettheoreticdifferenceofsetsS1and
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
t
我
A
C
_
A
_
0
0
0
8
5
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
89
S2.LetP(S)denotethepowersetofSandPfin(S)bethesetofallfinitesubsetsofS.LetΣdenoteasetofsymbols,referredtoasthealphabet,andletastringoverΣbeafinitese-quenceofsymbolsfromthatalphabet.Thelengthofastringwwillbedenoted|w|.Letλdenotetheemptystring;|λ|=0.LetΣ∗(Kleenestar)repre-sentallstringsoverthisalphabet,andΣkrepresentallstringsoflengthk.Concatenationoftwostringswandv(orsymbolsσ1andσ2)willbewrittenwv(σ1σ2).Specialbeginning(哦)andend(n)symbols(哦,n6∈Σ)willoftenbeusedtomarkthebegin-ningandendofwords;thealphabetΣaugmentedwiththese,Σ∪{哦,n},willbedenotedΣon.Astringu∈Σ∗issaidtobeafactororsub-stringofanotherstringwiftherearetwootherstringsv,v0∈Σ∗suchthatw=vuv0.Wecalluak-factorofwifitisafactorofwand|你|=k.Letfack:Σ∗→P(Σ≤k)beafunctionmappingstringstotheirk-factors,wherefack(w)equals{你|uisak-factorofw}如果|w|>kandequals{w}otherwise.Forexample,fac2(aab)={aa,ab}andfac8(aab)={aab}.Weextendthek-factorfunc-tiontolanguages;fack(L)=Sw∈Lfack(w).3.1Grammars,语言,andlearningAlanguage(orstringset)LisasubsetofΣ∗.IfLisfinite,让|L|denotethecardinalityofL,andlet||L||denotethesizeofL,whichisdefinedtobePw∈L|w|.LetL1·L2denotetheconcatenationofthelanguagesL1andL2,i.e.,thepairwiseconcate-nationofeachwordinL1toeachwordinL2.Fornotationalsimplicity,theconcatenationofasingle-tonlanguage{w}toanotherlanguageL2(orL1to{w})willbewrittenwL2(L1w).Agrammarisafiniterepresentationofapos-siblyinfinitelanguage.AclassLoflanguagesisrepresentedbyaclassRofrepresentationsifeveryr∈RisoffinitesizeandthereisanamingfunctionL:R→Lwhichisbothtotalandsurjective.Thelearningparadigmusedinthispaperisidenti-ficationinthelimitlearningparadigm(金子,1967),withpolynomialboundsontimeanddata(delaHiguera,1997).Thisparadigmhastwocomplemen-taryaspects.First,itrequiresthattheinformationwhichdistinguishesalearningtargetfromotherpo-tentialtargetsbepresentintheinputforalgorithmstosuccessfullylearn.Second,itrequiressuccessfulalgorithmstoreturnahypothesisintimepolynomialofthesizeofthesample,andthatthesizeofthesam-pleitselfmustbepolynomialinthesizeofgrammar.Thedefinitionofthelearningparadigm(Defini-tion3)dependsonsomepreliminarynotions.Definition1.LetLbeaclassoflanguagesrepre-sentedbysomeclassRofrepresentations.1.AninputsampleIforalanguageL∈LisafinitesetofdataconsistentwithL,thatistosayI⊆L.2.A(L,右)-learningalgorithmAisaprogramthattakesasinputasampleforalanguageL∈LandoutputsarepresentationfromR.Thenotionofcharacteristicsampleisintegral.Definition2(Characteristicsample).Fora(L,右)-learningalgorithmA,asampleCSisacharacteris-ticsampleofalanguageL∈LifforallsamplesIforLsuchthatCS⊆I,AreturnsarepresentationrsuchthatL(r)=L.Nowthelearningparadigmcanbedefined.Definition3(Identificationinpolynomialtimeanddata).AclassLoflanguagesisidentifiableinpolynomialtimeanddataifthereexistsa(L,右)-learningalgorithmAandtwopolynomialsp()andq()suchthat:1.ForanysampleIofsizemforL∈L,Are-turnsahypothesisr∈RinO(p(米))time.2.Foreachrepresentationr∈Rofsizen,thereexistsacharacteristicsampleofrforAofsizeatmostO(q(n)).4Tier-basedStrictlyLocalLanguagesThissectionintroducestheTier-basedStrictlyLocal(TSL)语言(Heinzetal.,2011).TheTSLlan-guagesgeneralizetheSLlanguages(McNaughtonandPapert,1971;Garc´ıaetal.,1990;Caron,2000),andassuchthesewillbrieflybediscussedfirst.4.1TheStrictlyLocalLanguagesTheSLlanguagescanbedefinedasfollows(Heinzetal.,2011):Definition4(SLlanguages).AlanguageLisStrictlyk-Local(SLk)iffthereexistsafi-nitesetS⊆fack(oΣ∗n)suchthatL={w∈Σ∗|fack(own)⊆S}
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
t
我
A
C
_
A
_
0
0
0
8
5
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
90
SuchasetSissometimesreferredtoastheper-missiblek-factorsorpermissiblesubstringsofL.Forexample,letL={λ,ab,abab,ababab,…}.ThisLcanbedescribedwithasetofpermissible2-factorsS={在,oa,ab,ba,BN}becauseevery2-factorofeverywordinLisinS;因此,LisStrictly2-Local(abbreviatedSL2).AsasetSofpermissiblek-factorsisfiniteitcanalsobeviewedasaSLgrammarwhereL(S)={w∈Σ∗|fack(own)⊆S}.AnelementsofanSLgrammarSforalanguageLisuseful(resp.use-less)iffs∈fack(L)(s6∈fack(L)).AcanonicalSLgrammarcontainsnouselesselements.Intheexampleabove,aa6∈Sandaa6∈fac2(w)foranyw∈L.Suchastringisreferredtoasaforbiddenk-factororarestrictiononL.Thesetofforbiddenk-factorsRisfack(oΣ∗n)−S.Think-ingaboutthegrammarintermsofSorintermsofRisequivalent,butinsomecasesitissimplertorefertooneratherthantheother,soweshalluseboth.AnySLkclassoflanguagesislearnablewithpolynomialboundsontimeanddataifkisknowninadvance(Garc´ıaetal.,1990;Heinz,2010乙).TheclassofSLklanguages(foreachk)belongstoacollectionoflanguageclassescalledstringex-tensionlanguageclasses(Heinz,2010乙).Thedis-cussionabovepresentsSLklanguagesfromthisper-spective.Theselanguageclasseshavemanydesir-ablelearningproperties,duetotheirunderlyinglat-ticestructure(Heinzetal.,2012).4.2TheTSLlanguagesTheTSLlanguagescanbethoughtofasafurtherpa-rameterizationonthek-factorfunctionwhereacer-tainsubsetofthealphabettakespartinthegrammarandallothersymbolsinthealphabetareignored.ThisspecialsubsetisreferredtoasatierT⊆Σ.Symbolsnotonthetierareremovedfromconsider-ationofthegrammarbyanerasingfunctionET:Σ∗→T∗definedasET(σ0σ1…σn)=u0u1…unwhereui=σiifσi∈T;elseui=λ.Forexample,ifΣ={A,乙,C}andT={A,C}thenET(bbabbcbba)=aca.WecanthendefineatierversionfacT-koffackasfacT-k(w)=fack(oET(w)n).这里,oandnarebuiltintothefunctionastheyarealwaystreatedaspartofthetier.Continu-ingtheexamplefromabove,facT-2(bbabbcbba)={oa,交流电,加州,一个}.facT-kcanbeextendedtolan-guagesaswithfackabove.TheTSLlanguagescannowbedefinedparalleltotheSLlanguages(thefollowingdefinitionisequiv-alenttotheoneinHeinzetal.(2011)):Definition5(TSLlanguages).AlanguageLisTier-basedStrictlyk-LocaliffthereexistsasubsetT⊆ΣofthealphabetandafinitesetS⊆fack(oT∗n)suchthatL={w∈Σ∗|facT-k(w)⊆S}ParalleltoSLgrammarsabove,hT,SicanbethoughtofasaTSLgrammarofL.Likewise,theforbiddentiersubstrings(ortierrestrictions)RissimplythesetfacT-k(Σ∗)−S.Finally,hT,SiiscanonicalifScontainsnouselesselements(i.e.,s∈S⇔s∈facT-k(L(hT,和)))andRisnonempty(thissecondrestrictionisexplainedbelow).Forexample,letΣ={A,乙,C}andT={A,C}asaboveandletS={在,oa,oc,交流电,一个,加州,cc,中文}.PluggingtheseintoDefinition5,weobtainalanguageLwhichonlycontainsthosestringswithouttheforbidden2-factoraaontierT.Thesearewordswhichmaycontainbsin-terspersedwithasandcsprovidedthatnoapre-cedesanotherwithoutaninterveningc.Forexam-ple,bbabbcbba∈Lbutbbabbbbabb6∈L,becauseET(bbabbbbabb)=aaandaa∈fac2(oaan)butaa6∈S.LiketheclassofSLklanguages,theclassofTSLlanguages(forfixedTandk)isastringextensionlanguageclass.TherelationshipoftheTSLlan-guagestoothersub-Regularlanguageclasses(Mc-NaughtonandPapert,1971;Rogersetal.,2013)isstudiedinHeinzetal.(2011).GivenafixedkandT,SiseasilylearnableinthesamewayastheSLlanguages(Heinzetal.,2011).然而,asdiscussedabove,inthecaseofnaturallanguagephonology,itisnotclearthatinformationaboutTisavailableapriori.LearningbothTandSsimultaneouslyisthusaninterestingproblem.Thisproblemadmitsatechnical,butunsatisfying,solution.ThenumberofsubsetsTsuchthatT⊆Σisfinite,sothenumberofTSLklanguagesgivenafixedkisfinite.Itisalreadyknownthatanyfiniteclassoflanguageswhichcanbeenumeratedcanbeidentifiedinthelimitbyanalgorithmwhichchecksthroughtheenumeration(金子,1967).然而,giventhecognitiverelevanceofTSLlanguages,它
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
t
我
A
C
_
A
_
0
0
0
8
5
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
91
isofinteresttopursueasmarter,computationallyefficientmethodforlearningthem.WhataretheconsequencesofvaryingT?WhenT=Σ,theresultisanSLklanguage,becausetheerasingfunctionoperatesvacuously(Heinzetal.,2011).反过来,whenT=∅,byDefinition5,Siseither∅or{在}.TheformerobtainstheemptylanguagewhilethelatterobtainsΣ∗.ByDefinition5,Σ∗canbedescribedwithanyT⊆ΣaslongasS=fac2(oT∗n).Insuchagrammar,noneofthemembersofTserveanypurpose;thuswestipulatethatforacanonicalgrammarRisnonempty.Importantly,amemberofTmayfailtobelongtoastringinRandstillserveapurpose.Forex-ample,letΣ={A,乙,C},T={A,C},andS=fac2(oT∗n)-{aa}.Becausecappearsinnofor-biddentiersubstringsinR,itisfreelydistributedinL=L(hT,和).然而,itmakesadifferenceinthelanguage,becauseaca∈Lbutaba6∈L.Ifc(andtherelevanttiersubstringsinS)weremiss-ingfromthetier,neitherabanoracawouldbeinL.Thiscanbethoughtofa‘blocking’functionofc,be-causeitallowssequencesa…aeventhoughaa∈R.WemaynowreturntotheLatinexamplefrom§2inalittlemoredetail.TheLatinpattern,inwhichrsandlsmustalternate,regardlessofotherinterveningsounds.ThiscanbecapturedbyaTSLgrammarinwhichT={我,r}andS={ol,或者,lr,rl,rr,rn,ln}.Thiscapturesthegen-eralizationthat,ignoringallsoundsbesideslandr,landlareneverallowedtobeadjacent.There-mainderofthepaperdiscusseshowsuchagrammar,includingT,maybeinducedfrompositivedata.5AlgorithmThissectionintroducestheTier-basedStrictly2-LocalInferenceAlgorithm(2TSLIA),whichin-ducesbothatierandasetofpermissibletier2-factorsfrompositivedata.First,in§5.1,thecon-ceptofapath,whichiscrucialtothelearner,isdefined.§5.3introducesanddescribesthealgo-rithm,and§5.4definesadistinguishingexampleandprovesthatitisacharacteristicsampleforwhichthealgorithmisguaranteedtoconverge.Timeanddatacomplexityforthealgorithmarediscussedthere.5.1PathsFirst,wedefinetheconceptof2-paths.Howthisconceptmightbegeneralizedtokisdiscussedin§6.Pathsdenoteprecedencerelationsbetweensymbolsinastring,buttheyarefurtheraugmentedwithsetsofinterveningsymbols.Formally,a2-pathisa3-tuplehx,Z,yiwherexandyareasymbolinΣonandZisasubsetofΣ.The2-pathsofastringw=σ0σ1…σn,denotedpaths2(w),arepaths2(w)=(西德:8)hσi,Z,σji(西德:12)(西德:12)我
下载pdf