Transactions of the Association for Computational Linguistics, vol. 4, pp. 87–98, 2016. Action Editor: Alexander Clark.
Submission batch: 7/2015; Revision batch: 11/2015; Published 4/2016.
2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
c
(cid:13)
LearningTier-basedStrictly2-LocalLanguagesAdamJardineandJeffreyHeinzUniversityofDelaware{ajardine,heinz}@udel.eduAbstractTheTier-basedStrictly2-Local(TSL2)lan-guagesareaclassofformallanguageswhichhavebeenshowntomodellong-distancephonotacticgeneralizationsinnaturallan-guage(Heinzetal.,2011).Thispaperin-troducestheTier-basedStrictly2-LocalIn-ferenceAlgorithm(2TSLIA),thefirstnon-enumerativelearnerfortheTSL2languages.Weprovethe2TSLIAisguaranteedtocon-vergeinpolynomialtimeonadatasamplewhosesizeisboundedbyaconstant.1IntroductionThisworkpresentstheTier-basedStrictly2-LocalInferenceAlgorithm(2TSLIA),anefficientlearn-ingalgorithmforaclassofTier-basedStrictlyLo-cal(TSL)formallanguages(Heinzetal.,2011).ATSLclassisdeterminedbytwoparameters:thetier,orsubsetofthealphabet,andthepermissibletierk-factors,whicharethelegalsequencesoflengthkallowedinthestring,onceallnon-tiersymbolshavebeenremoved.TheTier-basedStrictly2-Local(TSL2)languagesarethoseinwhichk=2.Aswillbediscussedbelow,theTSLlanguagesareofinteresttophonologybecausetheycanmodelawidevarietyoflong-distancephonotacticpatternsfoundinnaturallanguage(Heinzetal.,2011;Mc-MullinandHansson,forthcoming).OneexampleisderivedfromLatinliquiddissimilation,inwhichtwolscannotappearinawordunlessthereisanrintervening,regardlessofdistance.Forexam-ple,floralis‘floral’iswell-formedbutnot*mili-talis(cf.militaris‘military’).Asexplainedinsec-tions2and4,thiscanbemodeledwithpermissible2-factorsoveratierconsistingoftheliquids{je,r}.Forlong-distancephonotactics,kcanbefixedto2,butitisdoesnotappearthatthetiercanbefixedsincelanguagesemployavarietyofdifferenttiers.Thispresentsaninterestinglearningproblem:Givenafixedk,howcananalgorithminducebothatierandasetofpermissibletierk-factorsfrompositivedata?Thereissomerelatedworkwhichaddressesthisquestion.GoldsmithandRiggle(2012),buildingonworkbyGoldsmithandXanthos(2009),presentamethodbasedonmutualinformationforlearn-ingtiersandsubsequentlylearningharmonypat-terns.Thispaperdiffersinthatitsmethodsarerootedfirmlyingrammaticalinferenceandformallanguagetheory(delaHiguera,2010).Forinstance,incontrasttotheresultspresentedthere,weprovethekindsofpatterns2TSLIAsucceedsonandthekindofdatasufficientforittodoso.Nonetheless,thereisrelevantworkincomputa-tionallearningtheory:Gold(1967)provedthatanyfiniteclassoflanguagesisidentifiableinthelimitviaanenumerationmethod.Givenafixedalphabetandafixedk,thenumberofpossibletiersandper-missibletierk-factorsisfinite,andthuslearnableinthisway.However,suchlearnersaregrosslyinef-ficient.Noprovably-correct,non-enumerative,ef-ficientlearnerforboththetierandpermissibletierk-factorparametershaspreviouslybeenproposed.Thisworkfillsthisgapwithanalgorithmwhichlearnstheseparameterswhenk=2frompositivedataintimepolynomialinthesizeofthedata.Finally,Jardine(2016)presentsasimplifiedver-
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
t
je
un
c
_
un
_
0
0
0
8
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
88
sionof2TSLIAandreportstheresultsofsomesim-ulations.Unlikethatpaper,thepresentworkpro-videsthefullmathematicaldetailsandproofs.Thesimulationsarediscussedinthediscussionsection.Thispaperisstructuredasfollows.§2moti-vatestheworkwithexamplesfromnaturallanguagephonology.§3outlinesthebasicconceptsanddef-initionstobeusedthroughoutthepaper.§4definestheTSLlanguagesanddiscussestheirproperties.§5detailsthe2TSLIAandprovesitlearnstheTSL2classinpolynomialtimeanddata.§6discussesfu-turework,and§7concludes.2LinguisticmotivationTheprimarymotivationforstudyingtheTSLlan-guagesandtheirlearnabilitycomesfromtheirrel-evancetophonotactic(wordwell-formedness)pat-ternsinnaturallanguagephonology.Manyphono-tacticpatternsbelongtoeithertheStrictlyLocal(SL)languages(McNaughtonandPapert,1971)ortheStrictlyPiecewise(SP)languages(Rogersetal.,2010).1AnexampleofphonotacticknowledgewhichisSLisChomskyandHalle’s(ChomskyandHalle,1965)observationthatEnglishspeakerswillclassifyblickasapossiblewordofEnglishwhilerejectingbnickasimpossible.ThisisSLbecauseitcanbedescribedas*bnbeingunacceptableasaword-initialsequence.SPlanguagescandescribelong-distancedependenciesbasedonprecedencere-lationshipsbetweensounds(suchasconsonanthar-monyinSarcee,inwhichsmaynotfollowaSany-whereinaword,butmayprecedeone(Cook,1984))(Heinz,2010un).Aussi,theSLandSPlanguagesareefficientlylearnable(Garc´ıaetal.,1990;Heinz,2010un;Heinz,2010b;HeinzandRogers,2013).Cependant,therearesomelong-distancepatternswhichcannotbedescribedpurelywithprecedencerelationships.OneexampleisapatternfromLatin,inwhichincertaincasesanlcannotfollowanotherlunlessanrintervenes,nomatterthedistancebe-tweenthem(Jensen,1974;Odden,1994;Heinz,2010un).Thiscanbeseeninthe-alisadjectivalsuf-fix(Example1),whichappearsas-arisiftheworditattachestoalreadycontainsanl((d)through(F)1Therelationshipbetweentheseformallanguageclassesandhumancognition(linguisticandotherwise)isdiscussedinmoredetailinRogersandPullum(2011),RogersandHauser(2010),Rogersetal.(2013),andLai(2013;2015).inExample1),exceptincaseswherethereisanin-terveningr,inwhichitappearsagainas-alis((g)through(je)inExample1).Intheexamples,thesoundsinquestionareunderlinedforemphasis(datafromHeinz(2010un)),andfor(d)through(F),illicitformsaregiven,markedwithastar(*).Example1.a.navalis‘naval’b.episcopalis‘episcopal’c.infinitalis‘negative’d.solaris‘solar’(*solalis)e.lunaris‘lunar’(*lunalis)f.militaris‘military’(*militalis)g.floralis‘floral’h.sepulkralis‘funereal’i.litoralis‘oftheshore’Thisnon-localalternatingpatternisnotSPbe-causeSPlanguagescannotexpressblockingeffects(Heinz,2010un).Cependant,itcanbedescribedwithaTSLgrammarinwhichthetieris{r,je}andthepermissibletier2-factorsdonotinclude*lland*rr.Thisyieldsexactlythesetofstringsinwhichanlisalwaysimmediately(disregardingallsoundsbesides{r,je})followedbyanr,andviceversa.FormallearningalgorithmsforSLandSPlan-guagescanprovideamodelforhumanlearningofSLandSPsoundpatterns(Heinz,2010un).TSLlan-guagesarealsosimilarlylearnable,giventhestip-ulationthatboththetierandkarefixed.Fornat-urallanguage,thevalueforkneverseemstogoabove2(Heinzetal.,2011).Cependant,tiersvaryinhumanlanguage—TSLpatternsoccurbothwithdifferentkindsofvowels(Nevins,2010)andcon-sonants(Suzuki,1998;Bennett,2013).Forexam-ple,inTurkishthetieristheentirevowelinven-tory(ClementsandSezer,1982),whileinFinnishitisvowelsexcept/i,e/(Ringen,1975).InSamalaconsonantharmony,thetierissibilants(RoseandWalker,2004),whereasinKoorete,thetierissibi-lantsand{b,r,g,d}(McMullinandHansson,forth-coming).Ainsi,itisofinteresttounderstandhowboththetierandpermissibletier2-factorsforTSL2grammarsmightbelearnedefficiently.3PreliminariesBasicknowledgeofsettheoryisassumed.LetS1−S2denotethesettheoreticdifferenceofsetsS1and
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
t
je
un
c
_
un
_
0
0
0
8
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
89
S2.LetP(S)denotethepowersetofSandPfin(S)bethesetofallfinitesubsetsofS.LetΣdenoteasetofsymbols,referredtoasthealphabet,andletastringoverΣbeafinitese-quenceofsymbolsfromthatalphabet.Thelengthofastringwwillbedenoted|w|.Letλdenotetheemptystring;|λ|=0.LetΣ∗(Kleenestar)repre-sentallstringsoverthisalphabet,andΣkrepresentallstringsoflengthk.Concatenationoftwostringswandv(orsymbolsσ1andσ2)willbewrittenwv(σ1σ2).Specialbeginning(o)andend(n)symbols(o,n6∈Σ)willoftenbeusedtomarkthebegin-ningandendofwords;thealphabetΣaugmentedwiththese,Σ∪{o,n},willbedenotedΣon.Astringu∈Σ∗issaidtobeafactororsub-stringofanotherstringwiftherearetwootherstringsv,v0∈Σ∗suchthatw=vuv0.Wecalluak-factorofwifitisafactorofwand|toi|=k.Letfack:Σ∗→P(Σ≤k)beafunctionmappingstringstotheirk-factors,wherefack(w)equals{toi|uisak-factorofw}si|w|>kandequals{w}otherwise.Forexample,fac2(aab)={aa,ab}andfac8(aab)={aab}.Weextendthek-factorfunc-tiontolanguages;fack(L)=Sw∈Lfack(w).3.1Grammars,languages,andlearningAlanguage(orstringset)LisasubsetofΣ∗.IfLisfinite,let|L|denotethecardinalityofL,andlet||L||denotethesizeofL,whichisdefinedtobePw∈L|w|.LetL1·L2denotetheconcatenationofthelanguagesL1andL2,i.e.,thepairwiseconcate-nationofeachwordinL1toeachwordinL2.Fornotationalsimplicity,theconcatenationofasingle-tonlanguage{w}toanotherlanguageL2(orL1to{w})willbewrittenwL2(L1w).Agrammarisafiniterepresentationofapos-siblyinfinitelanguage.AclassLoflanguagesisrepresentedbyaclassRofrepresentationsifeveryr∈RisoffinitesizeandthereisanamingfunctionL:R→Lwhichisbothtotalandsurjective.Thelearningparadigmusedinthispaperisidenti-ficationinthelimitlearningparadigm(Gold,1967),withpolynomialboundsontimeanddata(delaHiguera,1997).Thisparadigmhastwocomplemen-taryaspects.First,itrequiresthattheinformationwhichdistinguishesalearningtargetfromotherpo-tentialtargetsbepresentintheinputforalgorithmstosuccessfullylearn.Second,itrequiressuccessfulalgorithmstoreturnahypothesisintimepolynomialofthesizeofthesample,andthatthesizeofthesam-pleitselfmustbepolynomialinthesizeofgrammar.Thedefinitionofthelearningparadigm(Defini-tion3)dependsonsomepreliminarynotions.Definition1.LetLbeaclassoflanguagesrepre-sentedbysomeclassRofrepresentations.1.AninputsampleIforalanguageL∈LisafinitesetofdataconsistentwithL,thatistosayI⊆L.2.A(L,R.)-learningalgorithmAisaprogramthattakesasinputasampleforalanguageL∈LandoutputsarepresentationfromR.Thenotionofcharacteristicsampleisintegral.Definition2(Characteristicsample).Fora(L,R.)-learningalgorithmA,asampleCSisacharacteris-ticsampleofalanguageL∈LifforallsamplesIforLsuchthatCS⊆I,AreturnsarepresentationrsuchthatL(r)=L.Nowthelearningparadigmcanbedefined.Definition3(Identificationinpolynomialtimeanddata).AclassLoflanguagesisidentifiableinpolynomialtimeanddataifthereexistsa(L,R.)-learningalgorithmAandtwopolynomialsp()andq()suchthat:1.ForanysampleIofsizemforL∈L,Are-turnsahypothesisr∈RinO(p(m))time.2.Foreachrepresentationr∈Rofsizen,thereexistsacharacteristicsampleofrforAofsizeatmostO(q(n)).4Tier-basedStrictlyLocalLanguagesThissectionintroducestheTier-basedStrictlyLocal(TSL)languages(Heinzetal.,2011).TheTSLlan-guagesgeneralizetheSLlanguages(McNaughtonandPapert,1971;Garc´ıaetal.,1990;Caron,2000),andassuchthesewillbrieflybediscussedfirst.4.1TheStrictlyLocalLanguagesTheSLlanguagescanbedefinedasfollows(Heinzetal.,2011):Definition4(SLlanguages).AlanguageLisStrictlyk-Local(SLk)iffthereexistsafi-nitesetS⊆fack(oΣ∗n)suchthatL={w∈Σ∗|fack(propre)⊆S}
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
t
je
un
c
_
un
_
0
0
0
8
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
90
SuchasetSissometimesreferredtoastheper-missiblek-factorsorpermissiblesubstringsofL.Forexample,letL={λ,ab,abab,ababab,…}.ThisLcanbedescribedwithasetofpermissible2-factorsS={sur,oa,ab,ba,bn}becauseevery2-factorofeverywordinLisinS;thus,LisStrictly2-Local(abbreviatedSL2).AsasetSofpermissiblek-factorsisfiniteitcanalsobeviewedasaSLgrammarwhereL(S)={w∈Σ∗|fack(propre)⊆S}.AnelementsofanSLgrammarSforalanguageLisuseful(resp.use-less)iffs∈fack(L)(s6∈fack(L)).AcanonicalSLgrammarcontainsnouselesselements.Intheexampleabove,aa6∈Sandaa6∈fac2(w)foranyw∈L.Suchastringisreferredtoasaforbiddenk-factororarestrictiononL.Thesetofforbiddenk-factorsRisfack(oΣ∗n)−S.Think-ingaboutthegrammarintermsofSorintermsofRisequivalent,butinsomecasesitissimplertorefertooneratherthantheother,soweshalluseboth.AnySLkclassoflanguagesislearnablewithpolynomialboundsontimeanddataifkisknowninadvance(Garc´ıaetal.,1990;Heinz,2010b).TheclassofSLklanguages(foreachk)belongstoacollectionoflanguageclassescalledstringex-tensionlanguageclasses(Heinz,2010b).Thedis-cussionabovepresentsSLklanguagesfromthisper-spective.Theselanguageclasseshavemanydesir-ablelearningproperties,duetotheirunderlyinglat-ticestructure(Heinzetal.,2012).4.2TheTSLlanguagesTheTSLlanguagescanbethoughtofasafurtherpa-rameterizationonthek-factorfunctionwhereacer-tainsubsetofthealphabettakespartinthegrammarandallothersymbolsinthealphabetareignored.ThisspecialsubsetisreferredtoasatierT⊆Σ.Symbolsnotonthetierareremovedfromconsider-ationofthegrammarbyanerasingfunctionET:Σ∗→T∗definedasET(σ0σ1…σn)=u0u1…unwhereui=σiifσi∈T;elseui=λ.Forexample,ifΣ={un,b,c}andT={un,c}thenET(bbabbcbba)=aca.WecanthendefineatierversionfacT-koffackasfacT-k(w)=fack(oET(w)n).Ici,oandnarebuiltintothefunctionastheyarealwaystreatedaspartofthetier.Continu-ingtheexamplefromabove,facT-2(bbabbcbba)={oa,ac,ca,un}.facT-kcanbeextendedtolan-guagesaswithfackabove.TheTSLlanguagescannowbedefinedparalleltotheSLlanguages(thefollowingdefinitionisequiv-alenttotheoneinHeinzetal.(2011)):Definition5(TSLlanguages).AlanguageLisTier-basedStrictlyk-LocaliffthereexistsasubsetT⊆ΣofthealphabetandafinitesetS⊆fack(oT∗n)suchthatL={w∈Σ∗|facT-k(w)⊆S}ParalleltoSLgrammarsabove,hT,SicanbethoughtofasaTSLgrammarofL.Likewise,theforbiddentiersubstrings(ortierrestrictions)RissimplythesetfacT-k(Σ∗)−S.Finally,hT,SiiscanonicalifScontainsnouselesselements(i.e.,s∈S⇔s∈facT-k(L(hT,Si)))andRisnonempty(thissecondrestrictionisexplainedbelow).Forexample,letΣ={un,b,c}andT={un,c}asaboveandletS={sur,oa,oc,ac,un,ca,cc,cn}.PluggingtheseintoDefinition5,weobtainalanguageLwhichonlycontainsthosestringswithouttheforbidden2-factoraaontierT.Thesearewordswhichmaycontainbsin-terspersedwithasandcsprovidedthatnoapre-cedesanotherwithoutaninterveningc.Forexam-ple,bbabbcbba∈Lbutbbabbbbabb6∈L,becauseET(bbabbbbabb)=aaandaa∈fac2(oaan)butaa6∈S.LiketheclassofSLklanguages,theclassofTSLlanguages(forfixedTandk)isastringextensionlanguageclass.TherelationshipoftheTSLlan-guagestoothersub-Regularlanguageclasses(Mc-NaughtonandPapert,1971;Rogersetal.,2013)isstudiedinHeinzetal.(2011).GivenafixedkandT,SiseasilylearnableinthesamewayastheSLlanguages(Heinzetal.,2011).Cependant,asdiscussedabove,inthecaseofnaturallanguagephonology,itisnotclearthatinformationaboutTisavailableapriori.LearningbothTandSsimultaneouslyisthusaninterestingproblem.Thisproblemadmitsatechnical,butunsatisfying,solution.ThenumberofsubsetsTsuchthatT⊆Σisfinite,sothenumberofTSLklanguagesgivenafixedkisfinite.Itisalreadyknownthatanyfiniteclassoflanguageswhichcanbeenumeratedcanbeidentifiedinthelimitbyanalgorithmwhichchecksthroughtheenumeration(Gold,1967).Cependant,giventhecognitiverelevanceofTSLlanguages,it
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
0
8
5
1
5
6
7
3
7
6
/
/
t
je
un
c
_
un
_
0
0
0
8
5
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
91
isofinteresttopursueasmarter,computationallyefficientmethodforlearningthem.WhataretheconsequencesofvaryingT?WhenT=Σ,theresultisanSLklanguage,becausetheerasingfunctionoperatesvacuously(Heinzetal.,2011).Inversement,whenT=∅,byDefinition5,Siseither∅or{sur}.TheformerobtainstheemptylanguagewhilethelatterobtainsΣ∗.ByDefinition5,Σ∗canbedescribedwithanyT⊆ΣaslongasS=fac2(oT∗n).Insuchagrammar,noneofthemembersofTserveanypurpose;thuswestipulatethatforacanonicalgrammarRisnonempty.Importantly,amemberofTmayfailtobelongtoastringinRandstillserveapurpose.Forex-ample,letΣ={un,b,c},T={un,c},andS=fac2(oT∗n)−{aa}.Becausecappearsinnofor-biddentiersubstringsinR,itisfreelydistributedinL=L(hT,Si).Cependant,itmakesadifferenceinthelanguage,becauseaca∈Lbutaba6∈L.Ifc(andtherelevanttiersubstringsinS)weremiss-ingfromthetier,neitherabanoracawouldbeinL.Thiscanbethoughtofa‘blocking’functionofc,be-causeitallowssequencesa…aeventhoughaa∈R.WemaynowreturntotheLatinexamplefrom§2inalittlemoredetail.TheLatinpattern,inwhichrsandlsmustalternate,regardlessofotherinterveningsounds.ThiscanbecapturedbyaTSLgrammarinwhichT={je,r}andS={ol,ou,lr,rl,rr,rn,ln}.Thiscapturesthegen-eralizationthat,ignoringallsoundsbesideslandr,landlareneverallowedtobeadjacent.There-mainderofthepaperdiscusseshowsuchagrammar,includingT,maybeinducedfrompositivedata.5AlgorithmThissectionintroducestheTier-basedStrictly2-LocalInferenceAlgorithm(2TSLIA),whichin-ducesbothatierandasetofpermissibletier2-factorsfrompositivedata.First,in§5.1,thecon-ceptofapath,whichiscrucialtothelearner,isdefined.§5.3introducesanddescribesthealgo-rithm,and§5.4definesadistinguishingexampleandprovesthatitisacharacteristicsampleforwhichthealgorithmisguaranteedtoconverge.Timeanddatacomplexityforthealgorithmarediscussedthere.5.1PathsFirst,wedefinetheconceptof2-paths.Howthisconceptmightbegeneralizedtokisdiscussedin§6.Pathsdenoteprecedencerelationsbetweensymbolsinastring,buttheyarefurtheraugmentedwithsetsofinterveningsymbols.Formally,a2-pathisa3-tuplehx,Z,yiwherexandyareasymbolinΣonandZisasubsetofΣ.The2-pathsofastringw=σ0σ1…σn,denotedpaths2(w),arepaths2(w)=(cid:8)hσi,Z,σji(cid:12)(cid:12)je
Télécharger le PDF