Operazioni dell'Associazione per la Linguistica Computazionale, vol. 4, pag. 87–98, 2016. Redattore di azioni: Alexander Clark.
Lotto di invio: 7/2015; Lotto di revisione: 11/2015; Pubblicato 4/2016.
2016 Associazione per la Linguistica Computazionale. Distribuito sotto CC-BY 4.0 licenza.
C
(cid:13)
LearningTier-basedStrictly2-LocalLanguagesAdamJardineandJeffreyHeinzUniversityofDelaware{ajardine,heinz}@udel.eduAbstractTheTier-basedStrictly2-Local(TSL2)lan-guagesareaclassofformallanguageswhichhavebeenshowntomodellong-distancephonotacticgeneralizationsinnaturallan-guage(Heinzetal.,2011).Thispaperin-troducestheTier-basedStrictly2-LocalIn-ferenceAlgorithm(2TSLIA),thefirstnon-enumerativelearnerfortheTSL2languages.Weprovethe2TSLIAisguaranteedtocon-vergeinpolynomialtimeonadatasamplewhosesizeisboundedbyaconstant.1IntroductionThisworkpresentstheTier-basedStrictly2-LocalInferenceAlgorithm(2TSLIA),anefficientlearn-ingalgorithmforaclassofTier-basedStrictlyLo-cal(TSL)formallanguages(Heinzetal.,2011).ATSLclassisdeterminedbytwoparameters:thetier,orsubsetofthealphabet,andthepermissibletierk-factors,whicharethelegalsequencesoflengthkallowedinthestring,onceallnon-tiersymbolshavebeenremoved.TheTier-basedStrictly2-Local(TSL2)languagesarethoseinwhichk=2.Aswillbediscussedbelow,theTSLlanguagesareofinteresttophonologybecausetheycanmodelawidevarietyoflong-distancephonotacticpatternsfoundinnaturallanguage(Heinzetal.,2011;Mc-MullinandHansson,forthcoming).OneexampleisderivedfromLatinliquiddissimilation,inwhichtwolscannotappearinawordunlessthereisanrintervening,regardlessofdistance.Forexam-ple,floralis‘floral’iswell-formedbutnot*mili-talis(cf.militaris‘military’).Asexplainedinsec-tions2and4,thiscanbemodeledwithpermissible2-factorsoveratierconsistingoftheliquids{l,R}.Forlong-distancephonotactics,kcanbefixedto2,butitisdoesnotappearthatthetiercanbefixedsincelanguagesemployavarietyofdifferenttiers.Thispresentsaninterestinglearningproblem:Givenafixedk,howcananalgorithminducebothatierandasetofpermissibletierk-factorsfrompositivedata?Thereissomerelatedworkwhichaddressesthisquestion.GoldsmithandRiggle(2012),buildingonworkbyGoldsmithandXanthos(2009),presentamethodbasedonmutualinformationforlearn-ingtiersandsubsequentlylearningharmonypat-terns.Thispaperdiffersinthatitsmethodsarerootedfirmlyingrammaticalinferenceandformallanguagetheory(delaHiguera,2010).Forinstance,incontrasttotheresultspresentedthere,weprovethekindsofpatterns2TSLIAsucceedsonandthekindofdatasufficientforittodoso.Nonetheless,thereisrelevantworkincomputa-tionallearningtheory:Gold(1967)provedthatanyfiniteclassoflanguagesisidentifiableinthelimitviaanenumerationmethod.Givenafixedalphabetandafixedk,thenumberofpossibletiersandper-missibletierk-factorsisfinite,andthuslearnableinthisway.However,suchlearnersaregrosslyinef-ficient.Noprovably-correct,non-enumerative,ef-ficientlearnerforboththetierandpermissibletierk-factorparametershaspreviouslybeenproposed.Thisworkfillsthisgapwithanalgorithmwhichlearnstheseparameterswhenk=2frompositivedataintimepolynomialinthesizeofthedata.Finally,Jardine(2016)presentsasimplifiedver-
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
T
l
UN
C
_
UN
_
0
0
0
8
5
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
88
sionof2TSLIAandreportstheresultsofsomesim-ulations.Unlikethatpaper,thepresentworkpro-videsthefullmathematicaldetailsandproofs.Thesimulationsarediscussedinthediscussionsection.Thispaperisstructuredasfollows.§2moti-vatestheworkwithexamplesfromnaturallanguagephonology.§3outlinesthebasicconceptsanddef-initionstobeusedthroughoutthepaper.§4definestheTSLlanguagesanddiscussestheirproperties.§5detailsthe2TSLIAandprovesitlearnstheTSL2classinpolynomialtimeanddata.§6discussesfu-turework,and§7concludes.2LinguisticmotivationTheprimarymotivationforstudyingtheTSLlan-guagesandtheirlearnabilitycomesfromtheirrel-evancetophonotactic(wordwell-formedness)pat-ternsinnaturallanguagephonology.Manyphono-tacticpatternsbelongtoeithertheStrictlyLocal(SL)languages(McNaughtonandPapert,1971)ortheStrictlyPiecewise(SP)languages(Rogersetal.,2010).1AnexampleofphonotacticknowledgewhichisSLisChomskyandHalle’s(ChomskyandHalle,1965)observationthatEnglishspeakerswillclassifyblickasapossiblewordofEnglishwhilerejectingbnickasimpossible.ThisisSLbecauseitcanbedescribedas*bnbeingunacceptableasaword-initialsequence.SPlanguagescandescribelong-distancedependenciesbasedonprecedencere-lationshipsbetweensounds(suchasconsonanthar-monyinSarcee,inwhichsmaynotfollowaSany-whereinaword,butmayprecedeone(Cook,1984))(Heinz,2010UN).Also,theSLandSPlanguagesareefficientlylearnable(Garc´ıaetal.,1990;Heinz,2010UN;Heinz,2010B;HeinzandRogers,2013).Tuttavia,therearesomelong-distancepatternswhichcannotbedescribedpurelywithprecedencerelationships.OneexampleisapatternfromLatin,inwhichincertaincasesanlcannotfollowanotherlunlessanrintervenes,nomatterthedistancebe-tweenthem(Jensen,1974;Odden,1994;Heinz,2010UN).Thiscanbeseeninthe-alisadjectivalsuf-fix(Example1),whichappearsas-arisiftheworditattachestoalreadycontainsanl((D)through(F)1Therelationshipbetweentheseformallanguageclassesandhumancognition(linguisticandotherwise)isdiscussedinmoredetailinRogersandPullum(2011),RogersandHauser(2010),Rogersetal.(2013),andLai(2013;2015).inExample1),exceptincaseswherethereisanin-terveningr,inwhichitappearsagainas-alis((G)through(io)inExample1).Intheexamples,thesoundsinquestionareunderlinedforemphasis(datafromHeinz(2010UN)),andfor(D)through(F),illicitformsaregiven,markedwithastar(*).Example1.a.navalis‘naval’b.episcopalis‘episcopal’c.infinitalis‘negative’d.solaris‘solar’(*solalis)e.lunaris‘lunar’(*lunalis)f.militaris‘military’(*militalis)g.floralis‘floral’h.sepulkralis‘funereal’i.litoralis‘oftheshore’Thisnon-localalternatingpatternisnotSPbe-causeSPlanguagescannotexpressblockingeffects(Heinz,2010UN).Tuttavia,itcanbedescribedwithaTSLgrammarinwhichthetieris{R,l}andthepermissibletier2-factorsdonotinclude*lland*rr.Thisyieldsexactlythesetofstringsinwhichanlisalwaysimmediately(disregardingallsoundsbesides{R,l})followedbyanr,andviceversa.FormallearningalgorithmsforSLandSPlan-guagescanprovideamodelforhumanlearningofSLandSPsoundpatterns(Heinz,2010UN).TSLlan-guagesarealsosimilarlylearnable,giventhestip-ulationthatboththetierandkarefixed.Fornat-urallanguage,thevalueforkneverseemstogoabove2(Heinzetal.,2011).Tuttavia,tiersvaryinhumanlanguage—TSLpatternsoccurbothwithdifferentkindsofvowels(Nevins,2010)andcon-sonants(Suzuki,1998;Bennett,2013).Forexam-ple,inTurkishthetieristheentirevowelinven-tory(ClementsandSezer,1982),whileinFinnishitisvowelsexcept/i,e/(Ringen,1975).InSamalaconsonantharmony,thetierissibilants(RoseandWalker,2004),whereasinKoorete,thetierissibi-lantsand{B,R,G,D}(McMullinandHansson,forth-coming).Così,itisofinteresttounderstandhowboththetierandpermissibletier2-factorsforTSL2grammarsmightbelearnedefficiently.3PreliminariesBasicknowledgeofsettheoryisassumed.LetS1−S2denotethesettheoreticdifferenceofsetsS1and
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
T
l
UN
C
_
UN
_
0
0
0
8
5
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
89
S2.LetP(S)denotethepowersetofSandPfin(S)bethesetofallfinitesubsetsofS.LetΣdenoteasetofsymbols,referredtoasthealphabet,andletastringoverΣbeafinitese-quenceofsymbolsfromthatalphabet.Thelengthofastringwwillbedenoted|w|.Letλdenotetheemptystring;|λ|=0.LetΣ∗(Kleenestar)repre-sentallstringsoverthisalphabet,andΣkrepresentallstringsoflengthk.Concatenationoftwostringswandv(orsymbolsσ1andσ2)willbewrittenwv(σ1σ2).Specialbeginning(o)andend(N)symbols(o,n6∈Σ)willoftenbeusedtomarkthebegin-ningandendofwords;thealphabetΣaugmentedwiththese,Σ∪{o,N},willbedenotedΣon.Astringu∈Σ∗issaidtobeafactororsub-stringofanotherstringwiftherearetwootherstringsv,v0∈Σ∗suchthatw=vuv0.Wecalluak-factorofwifitisafactorofwand|tu|=k.Letfack:Σ∗→P(Σ≤k)beafunctionmappingstringstotheirk-factors,wherefack(w)equals{tu|uisak-factorofw}if|w|>kandequals{w}otherwise.Forexample,fac2(aab)={aa,ab}andfac8(aab)={aab}.Weextendthek-factorfunc-tiontolanguages;fack(l)=Sw∈Lfack(w).3.1Grammars,languages,andlearningAlanguage(orstringset)LisasubsetofΣ∗.IfLisfinite,let|l|denotethecardinalityofL,andlet||l||denotethesizeofL,whichisdefinedtobePw∈L|w|.LetL1·L2denotetheconcatenationofthelanguagesL1andL2,i.e.,thepairwiseconcate-nationofeachwordinL1toeachwordinL2.Fornotationalsimplicity,theconcatenationofasingle-tonlanguage{w}toanotherlanguageL2(orL1to{w})willbewrittenwL2(L1w).Agrammarisafiniterepresentationofapos-siblyinfinitelanguage.AclassLoflanguagesisrepresentedbyaclassRofrepresentationsifeveryr∈RisoffinitesizeandthereisanamingfunctionL:R→Lwhichisbothtotalandsurjective.Thelearningparadigmusedinthispaperisidenti-ficationinthelimitlearningparadigm(Gold,1967),withpolynomialboundsontimeanddata(delaHiguera,1997).Thisparadigmhastwocomplemen-taryaspects.First,itrequiresthattheinformationwhichdistinguishesalearningtargetfromotherpo-tentialtargetsbepresentintheinputforalgorithmstosuccessfullylearn.Second,itrequiressuccessfulalgorithmstoreturnahypothesisintimepolynomialofthesizeofthesample,andthatthesizeofthesam-pleitselfmustbepolynomialinthesizeofgrammar.Thedefinitionofthelearningparadigm(Defini-tion3)dependsonsomepreliminarynotions.Definition1.LetLbeaclassoflanguagesrepre-sentedbysomeclassRofrepresentations.1.AninputsampleIforalanguageL∈LisafinitesetofdataconsistentwithL,thatistosayI⊆L.2.A(l,R)-learningalgorithmAisaprogramthattakesasinputasampleforalanguageL∈LandoutputsarepresentationfromR.Thenotionofcharacteristicsampleisintegral.Definition2(Characteristicsample).Fora(l,R)-learningalgorithmA,asampleCSisacharacteris-ticsampleofalanguageL∈LifforallsamplesIforLsuchthatCS⊆I,AreturnsarepresentationrsuchthatL(R)=L.Nowthelearningparadigmcanbedefined.Definition3(Identificationinpolynomialtimeanddata).AclassLoflanguagesisidentifiableinpolynomialtimeanddataifthereexistsa(l,R)-learningalgorithmAandtwopolynomialsp()andq()suchthat:1.ForanysampleIofsizemforL∈L,Are-turnsahypothesisr∈RinO(P(M))time.2.Foreachrepresentationr∈Rofsizen,thereexistsacharacteristicsampleofrforAofsizeatmostO(q(N)).4Tier-basedStrictlyLocalLanguagesThissectionintroducestheTier-basedStrictlyLocal(TSL)languages(Heinzetal.,2011).TheTSLlan-guagesgeneralizetheSLlanguages(McNaughtonandPapert,1971;Garc´ıaetal.,1990;Caron,2000),andassuchthesewillbrieflybediscussedfirst.4.1TheStrictlyLocalLanguagesTheSLlanguagescanbedefinedasfollows(Heinzetal.,2011):Definition4(SLlanguages).AlanguageLisStrictlyk-Local(SLk)iffthereexistsafi-nitesetS⊆fack(oΣ∗n)suchthatL={w∈Σ∗|fack(own)⊆S}
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
T
l
UN
C
_
UN
_
0
0
0
8
5
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
90
SuchasetSissometimesreferredtoastheper-missiblek-factorsorpermissiblesubstringsofL.Forexample,letL={λ,ab,abab,ababab,…}.ThisLcanbedescribedwithasetofpermissible2-factorsS={SU,oa,ab,ba,bn}becauseevery2-factorofeverywordinLisinS;così,LisStrictly2-Local(abbreviatedSL2).AsasetSofpermissiblek-factorsisfiniteitcanalsobeviewedasaSLgrammarwhereL(S)={w∈Σ∗|fack(own)⊆S}.AnelementsofanSLgrammarSforalanguageLisuseful(resp.use-less)iffs∈fack(l)(s6∈fack(l)).AcanonicalSLgrammarcontainsnouselesselements.Intheexampleabove,aa6∈Sandaa6∈fac2(w)foranyw∈L.Suchastringisreferredtoasaforbiddenk-factororarestrictiononL.Thesetofforbiddenk-factorsRisfack(oΣ∗n)−S.Think-ingaboutthegrammarintermsofSorintermsofRisequivalent,butinsomecasesitissimplertorefertooneratherthantheother,soweshalluseboth.AnySLkclassoflanguagesislearnablewithpolynomialboundsontimeanddataifkisknowninadvance(Garc´ıaetal.,1990;Heinz,2010B).TheclassofSLklanguages(foreachk)belongstoacollectionoflanguageclassescalledstringex-tensionlanguageclasses(Heinz,2010B).Thedis-cussionabovepresentsSLklanguagesfromthisper-spective.Theselanguageclasseshavemanydesir-ablelearningproperties,duetotheirunderlyinglat-ticestructure(Heinzetal.,2012).4.2TheTSLlanguagesTheTSLlanguagescanbethoughtofasafurtherpa-rameterizationonthek-factorfunctionwhereacer-tainsubsetofthealphabettakespartinthegrammarandallothersymbolsinthealphabetareignored.ThisspecialsubsetisreferredtoasatierT⊆Σ.Symbolsnotonthetierareremovedfromconsider-ationofthegrammarbyanerasingfunctionET:Σ∗→T∗definedasET(σ0σ1…σn)=u0u1…unwhereui=σiifσi∈T;elseui=λ.Forexample,ifΣ={UN,B,C}andT={UN,C}thenET(bbabbcbba)=aca.WecanthendefineatierversionfacT-koffackasfacT-k(w)=fack(oET(w)N).Here,oandnarebuiltintothefunctionastheyarealwaystreatedaspartofthetier.Continu-ingtheexamplefromabove,facT-2(bbabbcbba)={oa,ac,ca,an}.facT-kcanbeextendedtolan-guagesaswithfackabove.TheTSLlanguagescannowbedefinedparalleltotheSLlanguages(thefollowingdefinitionisequiv-alenttotheoneinHeinzetal.(2011)):Definition5(TSLlanguages).AlanguageLisTier-basedStrictlyk-LocaliffthereexistsasubsetT⊆ΣofthealphabetandafinitesetS⊆fack(oT∗n)suchthatL={w∈Σ∗|facT-k(w)⊆S}ParalleltoSLgrammarsabove,hT,SicanbethoughtofasaTSLgrammarofL.Likewise,theforbiddentiersubstrings(ortierrestrictions)RissimplythesetfacT-k(Σ∗)−S.Finally,hT,SiiscanonicalifScontainsnouselesselements(i.e.,s∈S⇔s∈facT-k(l(hT,Si)))andRisnonempty(thissecondrestrictionisexplainedbelow).Forexample,letΣ={UN,B,C}andT={UN,C}asaboveandletS={SU,oa,oc,ac,an,ca,cc,cn}.PluggingtheseintoDefinition5,weobtainalanguageLwhichonlycontainsthosestringswithouttheforbidden2-factoraaontierT.Thesearewordswhichmaycontainbsin-terspersedwithasandcsprovidedthatnoapre-cedesanotherwithoutaninterveningc.Forexam-ple,bbabbcbba∈Lbutbbabbbbabb6∈L,becauseET(bbabbbbabb)=aaandaa∈fac2(oaan)butaa6∈S.LiketheclassofSLklanguages,theclassofTSLlanguages(forfixedTandk)isastringextensionlanguageclass.TherelationshipoftheTSLlan-guagestoothersub-Regularlanguageclasses(Mc-NaughtonandPapert,1971;Rogersetal.,2013)isstudiedinHeinzetal.(2011).GivenafixedkandT,SiseasilylearnableinthesamewayastheSLlanguages(Heinzetal.,2011).Tuttavia,asdiscussedabove,inthecaseofnaturallanguagephonology,itisnotclearthatinformationaboutTisavailableapriori.LearningbothTandSsimultaneouslyisthusaninterestingproblem.Thisproblemadmitsatechnical,butunsatisfying,solution.ThenumberofsubsetsTsuchthatT⊆Σisfinite,sothenumberofTSLklanguagesgivenafixedkisfinite.Itisalreadyknownthatanyfiniteclassoflanguageswhichcanbeenumeratedcanbeidentifiedinthelimitbyanalgorithmwhichchecksthroughtheenumeration(Gold,1967).Tuttavia,giventhecognitiverelevanceofTSLlanguages,Esso
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
T
l
UN
C
_
UN
_
0
0
0
8
5
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
91
isofinteresttopursueasmarter,computationallyefficientmethodforlearningthem.WhataretheconsequencesofvaryingT?WhenT=Σ,theresultisanSLklanguage,becausetheerasingfunctionoperatesvacuously(Heinzetal.,2011).Conversely,whenT=∅,byDefinition5,Siseither∅or{SU}.TheformerobtainstheemptylanguagewhilethelatterobtainsΣ∗.ByDefinition5,Σ∗canbedescribedwithanyT⊆ΣaslongasS=fac2(oT∗n).Insuchagrammar,noneofthemembersofTserveanypurpose;thuswestipulatethatforacanonicalgrammarRisnonempty.Importantly,amemberofTmayfailtobelongtoastringinRandstillserveapurpose.Forex-ample,letΣ={UN,B,C},T={UN,C},andS=fac2(oT∗n)−{aa}.Becausecappearsinnofor-biddentiersubstringsinR,itisfreelydistributedinL=L(hT,Si).Tuttavia,itmakesadifferenceinthelanguage,becauseaca∈Lbutaba6∈L.Ifc(andtherelevanttiersubstringsinS)weremiss-ingfromthetier,neitherabanoracawouldbeinL.Thiscanbethoughtofa‘blocking’functionofc,be-causeitallowssequencesa…aeventhoughaa∈R.WemaynowreturntotheLatinexamplefrom§2inalittlemoredetail.TheLatinpattern,inwhichrsandlsmustalternate,regardlessofotherinterveningsounds.ThiscanbecapturedbyaTSLgrammarinwhichT={l,R}andS={ol,O,lr,rl,rr,rn,ln}.Thiscapturesthegen-eralizationthat,ignoringallsoundsbesideslandr,landlareneverallowedtobeadjacent.There-mainderofthepaperdiscusseshowsuchagrammar,includingT,maybeinducedfrompositivedata.5AlgorithmThissectionintroducestheTier-basedStrictly2-LocalInferenceAlgorithm(2TSLIA),whichin-ducesbothatierandasetofpermissibletier2-factorsfrompositivedata.First,in§5.1,thecon-ceptofapath,whichiscrucialtothelearner,isdefined.§5.3introducesanddescribesthealgo-rithm,and§5.4definesadistinguishingexampleandprovesthatitisacharacteristicsampleforwhichthealgorithmisguaranteedtoconverge.Timeanddatacomplexityforthealgorithmarediscussedthere.5.1PathsFirst,wedefinetheconceptof2-paths.Howthisconceptmightbegeneralizedtokisdiscussedin§6.Pathsdenoteprecedencerelationsbetweensymbolsinastring,buttheyarefurtheraugmentedwithsetsofinterveningsymbols.Formally,a2-pathisa3-tuplehx,Z,yiwherexandyareasymbolinΣonandZisasubsetofΣ.The2-pathsofastringw=σ0σ1…σn,denotedpaths2(w),arepaths2(w)=(cid:8)hσi,Z,σji(cid:12)(cid:12)io
Scarica il pdf