Transactions of the Association for Computational Linguistics, vol. 5, pp. 73–86, 2017. Action Editor: Eric Fosler-Lussier.
Submission batch: 8/2016; Revision batch: 11/2016; Published 2/2017.
© 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
A Generative Model of Phonotactics

Richard Futrell, Brain and Cognitive Sciences, Massachusetts Institute of Technology, futrell@mit.edu
Adam Albright, Department of Linguistics, Massachusetts Institute of Technology, albright@mit.edu
Peter Graff, Intel Corporation, graffmail@gmail.com
Timothy J. O'Donnell, Department of Linguistics, McGill University, timothy.odonnell@mcgill.ca

Abstract

We present a probabilistic model of phonotactics, the set of well-formed phoneme sequences in a language. Unlike most computational models of phonotactics (Hayes and Wilson, 2008; Goldsmith and Riggle, 2012), we take a fully generative approach, modeling a process where forms are built up out of subparts by phonologically-informed structure building operations. We learn an inventory of subparts by applying stochastic memoization (Johnson et al., 2007; Goodman et al., 2008) to a generative process for phonemes structured as an and-or graph, based on concepts of feature hierarchy from generative phonology (Clements, 1985; Dresher, 2009). Subparts are combined in a way that allows tier-based feature interactions. We evaluate our models' ability to capture phonotactic distributions in the lexicons of 14 languages drawn from the WOLEX corpus (Graff, 2012). Our full model robustly assigns higher probabilities to held-out forms than a sophisticated N-gram model for all languages. We also present novel analyses that probe model behavior in more detail.

1 Introduction

People have systematic intuitions about which sequences of sounds would constitute likely or unlikely words in their language: although blick is not an English word, it sounds like it could be, while bnick does not (Chomsky and Halle, 1965). Such intuitions reveal that speakers are aware of the restrictions on sound sequences which can make up possible morphemes in their language, the phonotactics of the language. Phonotactic restrictions mean that each language uses only a subset of the logically, or even articulatorily, possible strings of phonemes. Admissible phoneme combinations, on the other hand, typically recur in multiple morphemes, leading to redundancy.

It is widely accepted that phonotactic judgments may be gradient: the nonsense word blick is better as a hypothetical English word than bwick, which is better than bnick (Hayes and Wilson, 2008; Albright, 2009; Daland et al., 2011). To account for such graded judgements, there have been a variety of probabilistic (or, more generally, weighted) models proposed to handle phonotactic learning and generalization over the last two decades (see Daland et al. (2011) and below for review). However, inspired by optimality-theoretic approaches to phonology, the most linguistically informed and successful such models have been constraint-based, formulating the problem of phonotactic generalization in terms of restrictions that penalize illicit combinations of sounds (e.g., ruling out *bn-).

In this paper, by contrast, we adopt a generative approach to modeling phonotactic structure. Our approach harkens back to early work on the sound structure of lexical items which made use of morpheme structure rules or conditions (Halle, 1959; Stanley, 1967; Booij, 2011; Rasin and Katzir, 2014).
Such approaches explicitly attempted to model the redundancy within the set of allowable lexical forms in a language. We adopt a probabilistic version of this idea, conceiving of the phonotactic system as the component of the linguistic system which generates the phonological form of lexical items such as words and morphemes. [Footnote 1] Our system learns inventories of reusable phonotactically licit structures from existing lexical items, and assembles new lexical items by combining these learned phonotactic patterns using phonologically plausible structure-building operations. Thus, instead of modeling phonotactic generalizations in terms of constraints, we treat the problem as a problem of learning language-specific inventories of phonological units and language-specific biases on how these phones are likely to be combined.

[Footnote 1: Ultimately, we conceive of phonotactics as the module of phonology which generates the underlying forms of lexical items, which are then subject to phonological transformations (i.e., transductions). In this work, however, we do not attempt to model transformations from underlying to surface forms.]

Although there have been a number of earlier generative models of phonotactic structure (see Section 4), these models have mostly used relatively simplistic or phonologically implausible representations of phones and phonological structure-building. By contrast, our model is built around three representational assumptions inspired by the generative phonology literature. First, we capture sparsity in the space of feature specifications of phonemes by using feature dependency graphs, an idea inspired by work on feature geometries and the contrastive hierarchy (Clements, 1985; Dresher, 2009). Second, our system can represent phonotactic generalizations not only at the level of fully specified segments, but also allows the storage and reuse of subsegments, inspired by the autosegments and class nodes of autosegmental phonology. Finally, also inspired by autosegmental phonology, we make use of a structure-building operation which is sensitive to tier-based contextual structure.

To model phonotactic learning, we make use of tools from Bayesian nonparametric statistics. In particular, we make use of the notion of lexical memoization (Goodman et al., 2008; Wood et al., 2009; O'Donnell, 2015): the idea that language-specific generalizations can be captured by the storage and reuse of frequent patterns from a linguistically universal inventory. In our case, this amounts to the idea that an inventory of segments and subsegments can be acquired by a learner that stores and reuses commonly occurring segments in particular, phonologically relevant contexts. In short, we view the problem of learning the phoneme inventory as one of concentrating probability mass on the segments which have been observed before, and the problem of phonotactic generalization as learning which (sub-)segments are likely in particular tier-based phonological contexts.

2 Model Motivations

In this section, we give an overview of how our model works and discuss the phenomena and theoretical ideas that motivate it.

2.1 Feature Dependency Graphs

Most formal models of phonology posit that segments are grouped into sets, known as natural classes, that are characterized by shared articulatory and acoustic properties, or phonological features (Trubetzkoy, 1939; Jakobson et al., 1952; Chomsky and Halle, 1968). For example, the segments /n/ and /m/ are classified with a positive value of a nasality feature (i.e., NASALITY:+). Similarly, /m/ and /p/ can be classified using the labial value of a PLACE feature, PLACE:labial. These features allow compact description of many phonotactic generalizations. [Footnote 2]

[Footnote 2: For compatibility with the data sources used in evaluation (Section 5.2), the feature system we use here departs in several ways from standard feature sets: (1) We use multivalent rather than binary-valued features. (2) We represent manner with a single feature, which has values such as vocalic, stop, and fricative. This approach allows us to refer to manners more compactly than in systems that employ combinations of features such as sonorant, continuant, and consonantal. For example, rather than referring to vowels as "non-syllabic", we refer to them using feature value vocalic for the feature MANNER.]

From a probabilistic structure-building perspective, we need to specify a generative procedure which assembles segments out of parts defined in terms of these features. In this section, we will build up such a procedure starting from the simplest possible procedure and progressing towards one which is more phonologically informed.
We will clarify the generative process here using an analogy to PCFGs, but this analogy will break down in later sections.

The simplest procedure for generating a segment from features is to specify each feature independently. For example, consider the set of feature-value pairs for /t/: {NASALITY:-, PLACE:alveolar, ...}. In a naive generative procedure, one could generate an instance of /t/ by independently choosing values for each feature in the set {NASALITY, PLACE, ...}. We express this process using the and-or graph notation below. Box-shaped nodes, called or-nodes, represent features such as NASALITY, while circular nodes represent groups of features whose values are chosen independently and are called and-nodes.

[Diagram: an and-node dominating the or-nodes NASALITY, ..., PLACE.]

This generative procedure is equivalent (ignoring order) to a PCFG with rules:

SEGMENT → NASALITY ... PLACE
NASALITY → +
NASALITY → -
PLACE → bilabial
PLACE → alveolar
...

Not all combinations of feature-value pairs correspond to possible phonemes. For example, while /l/ is distinguished from other consonants by the feature LATERAL, it is incoherent to specify vowels as LATERAL. In order to concentrate probability mass on real segments, our process should optimally assign zero probability mass to these incoherent phonemes. We can avoid specifying a LATERAL feature for vowels by structuring the generative process as below, so that the LATERAL or-node is only reached for consonants:

[Diagram: a VOCALIC or-node whose consonant branch leads to an and-node A dominating LATERAL, ..., and whose vowel branch leads to an and-node B dominating HEIGHT, ....]

Beyond generating well-formed phonemes, a basic requirement of a model of phonotactics is that it concentrate mass only on the segments in a particular language's segment inventory. For example, the model of English phonotactics should put zero or nominal mass on any sequence containing the segment /x/, although this is a logically possible phoneme. So our generative procedure for a phoneme must be able to learn to generate only the licit segments of a language, given some probability distributions at the and- and or-nodes. For this task, independently sampling values at and-nodes does not give us a way to rule out particular combinations of features such as those forming /x/.

Our approach to this problem uses the idea of stochastic memoization (or adaptation), in which the results of certain computations are stored and may be probabilistically reused "as wholes," rather than recomputed from scratch (Michie, 1968; Goodman et al., 2008). This technique has been applied to the problem of learning lexical items at various levels of linguistic structure (de Marcken, 1996; Johnson et al., 2007; Goldwater, 2006; O'Donnell, 2015). Given our model so far, applying stochastic memoization is equivalent to specifying an adaptor grammar over the PCFGs described so far.

Let f be a stochastic function which samples feature values using the and-or graph representation described above. We apply stochastic memoization to each node. Following Johnson et al. (2007) and Goodman et al. (2008), we use a distribution for probabilistic memoization known as the Dirichlet Process (DP) (Ferguson, 1973; Sethuraman, 1994). Let mem{f} be a DP-memoized version of f. The behavior of a DP-memoized function can be described as follows. The first time we invoke mem{f}, the feature specification of a new segment will be sampled using f. On subsequent invocations, we either choose a value from among the set of previously sampled values (a memo draw), or we draw a new value from f (a base draw). The probability of sampling the $i$-th old value in a memo draw is $\frac{n_i}{N+\theta}$, where $N$ is the number of tokens sampled so far, $n_i$ is the number of times that value $i$ has been used in the past, and $\theta > 0$ is a parameter of the model. A base draw happens with probability $\frac{\theta}{N+\theta}$. This process induces a bias to reuse items from f which have been frequently generated in the past.
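To make the memo-draw/base-draw dynamics concrete, here is a minimal Python sketch of a DP-memoized sampler. This is an illustration, not the authors' implementation; the toy base sampler and the value of θ are assumptions for the example.

```python
import random

class DPMem:
    """DP (Chinese Restaurant Process) memoizer for a base sampler f:
    reuse a previously sampled value i with probability n_i / (N + theta),
    or draw fresh from f with probability theta / (N + theta)."""

    def __init__(self, f, theta=1.0):
        self.f = f            # base sampler: () -> hashable value
        self.theta = theta    # concentration parameter, theta > 0
        self.counts = {}      # value -> n_i, times value has been drawn
        self.total = 0        # N, total tokens sampled so far

    def __call__(self):
        if random.random() < self.theta / (self.total + self.theta):
            value = self.f()  # base draw (always taken on the first call)
        else:                 # memo draw: pick old value i w.p. n_i / N
            values = list(self.counts)
            weights = [self.counts[v] for v in values]
            value = random.choices(values, weights=weights)[0]
        self.counts[value] = self.counts.get(value, 0) + 1
        self.total += 1
        return value

# A memoized uniform sampler over six toy segments quickly concentrates
# probability mass on whichever segments happen to be drawn early.
mem_f = DPMem(lambda: random.choice("ptkaiu"), theta=0.5)
print([mem_f() for _ in range(20)])
```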
We apply mem recursively to the sampling procedure for each node in the feature dependency graph. The more times that we use some particular set of features under a node to generate words in a language, the more likely we are to reuse that set of features in the future in a memo draw. This dynamic leads our model to rapidly concentrate probability mass on the subset of segments which occur in the inventory of a language.

2.2 Class Node Structure

Our use of and-or graphs and lexical memoization to model inter-feature dependencies is inspired by work in phonology on distinctiveness and markedness hierarchies (Kean, 1975; Berwick, 1985; Dresher, 2009). In addition to using feature hierarchies to delineate possible segments, the literature has used these structures to designate bundles of features that have privileged status in phonological description, i.e., feature geometries (Clements, 1985; Halle, 1995; McCarthy, 1988). For example, many analyses group features concerning laryngeal states (e.g., VOICE, ASPIRATION, etc.) under a laryngeal node, which is distinct from the node containing oral place-of-articulation features (Clements and Hume, 1995). These nodes are known as class nodes. In these analyses, features grouped together under the laryngeal class node may covary while being independent of features grouped under the oral class node.

The lexical memoization technique discussed above captures this notion of class node directly, because the model learns an inventory of subsegments under each node. Consider the feature dependency graph below.

[Diagram: an and-node A dominating a class node B (an and-node over NASALITY, VOICE, ...) and a VOCALIC or-node whose vowel branch leads to a class node C (an and-node over BACKNESS, HEIGHT, ...), with a separate consonant branch.]

In this graph, the and-node A generates fully specified segments. And-node B can be thought of as generating the non-oral properties of a segment, including voicing and nasality. And-node C is a class node bundling together the oral features of vowel segments. The features under B are outside of the VOCALIC node, so these features are specified for both consonant and vowel segments. This allows combinations such as voiced nasal consonants, and also rarer combinations such as unvoiced nasal vowels. Because all and-nodes are recursively memoized, our model is able to bind together particular non-oral choices (node B), learning for instance that the combination {NASALITY:+, VOICED:+} commonly recurs for both vowels and consonants in a language. That is, {NASALITY:+, VOICED:+} becomes a high-probability memo draw.

Since the model learns an inventory of fully specified segments at node A, the model could learn one-off exceptions to this generalization as well. For example, it could store at a high level a segment with {NASALITY:+, VOICED:-} along with some other features, while maintaining the generalization that {NASALITY:+, VOICED:+} is highly frequent in base draws. Language-specific phoneme inventories abound with such combinations of class-node-based generalizations and idiosyncrasies. By using lexical memoization at multiple different levels, our model can capture both the broader generalizations described in class node terminology and the exceptions to those generalizations.

2.3 Sequential Structure as Memoization in Context

In Section 2.2, we focused on the role that features play in defining a language's segment inventory. We gave a phonologically-motivated generative process, equivalent to an adaptor grammar, for phonemes in isolation. However, features also play an important role in characterizing licit sequences. We model sequential restrictions as context-dependent segment inventories. Our model learns a distribution over segments and subsegments conditional on each preceding sequence of (sub)segments, using lexical memoization. Introducing context-dependence means that the model can no longer be formulated as an adaptor grammar.
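Before turning to sequential structure in detail, here is a hedged Python sketch of the per-node memoization idea from Sections 2.1-2.2. The graph below is a simplified stand-in for the paper's feature dependency graphs (compare Figure 2), and the uniform or-node choice stands in for the DP-distributed branch weights ψ introduced in Section 3.1.

```python
import random

def crp_draw(table, theta, base):
    """One DP-memoized draw: `table` maps stored values to reuse counts."""
    total = sum(table.values())
    if random.random() < theta / (total + theta):
        value = base()  # base draw: assemble a novel subsegment
    else:               # memo draw: reuse a stored subsegment
        vals = list(table)
        value = random.choices(vals, weights=[table[v] for v in vals])[0]
    table[value] = table.get(value, 0) + 1
    return value

# Simplified feature dependency graph: B is a non-oral class node shared by
# consonants and vowels; C bundles vowel place features (cf. nodes A, B, C).
GRAPH = {
    "A":         ("and", ["B", "VOCALIC"]),
    "B":         ("and", ["NASALITY", "VOICE"]),
    "NASALITY":  ("or",  ["+", "-"]),
    "VOICE":     ("or",  ["+", "-"]),
    "VOCALIC":   ("or",  ["CONSONANT", "C"]),
    "CONSONANT": ("and", ["PLACE"]),
    "PLACE":     ("or",  ["labial", "alveolar", "velar"]),
    "C":         ("and", ["BACKNESS", "HEIGHT"]),
    "BACKNESS":  ("or",  ["front", "back"]),
    "HEIGHT":    ("or",  ["high", "mid", "low"]),
}
tables = {node: {} for node in GRAPH}  # one CRP "restaurant" per node

def sample(node):
    """Sample a subsegment at `node`, memoized per node: frequently built
    bundles (e.g. a {NASALITY:+, VOICE:+} draw at B) are stored and reused."""
    if node not in GRAPH:
        return node  # leaf feature value
    def base():
        kind, children = GRAPH[node]
        if kind == "and":  # and-node: combine subsegments of all children
            return tuple(sample(c) for c in children)
        return sample(random.choice(children))  # or-node: pick one branch
    return crp_draw(tables[node], theta=1.0, base=base)

print([sample("A") for _ in range(5)])  # whole segments stored at node A
```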
2.4 Tier-based Interaction

One salient property of sequential restrictions in phonotactics is that segments are often required to bear the same feature values as nearby segments. For example, a sequence of a nasal and a following stop must agree in place features at the end of a morpheme in English. Such restrictions may even be non-local. For example, many languages prefer combinations of vowels that agree in features such as HEIGHT, BACKNESS, or ROUNDING, even across arbitrary numbers of intervening consonants (i.e., vowel harmony).

One way to describe these sequential feature interactions is to assume that feature values of one segment in a word depend on values for the same or closely related features in other segments. This is accomplished by dividing segments into subsets (such as consonants and vowels), called tiers, and then making a segment's feature values preferentially dependent on the values of other segments on the same tier. Such phonological tiers are often identified with class nodes in a feature dependency graph. For example, a requirement that one vowel identically match the vowel in the preceding syllable would be stated as a requirement that the vowel's HEIGHT, BACKNESS, and ROUNDING features match the values of the preceding vowel's features. In this case, the vowels themselves need not be adjacent: by assuming that vowel quality features are not present in consonants, it is possible to say that two vowels are adjacent on a tier defined by the nodes HEIGHT, BACKNESS, and ROUNDING.

Our full generative process for a segment following other segments is the following. We follow the example of the generation of a phoneme conditional on a preceding context of /ak/, shown with simplified featural specifications and tiers in Figure 1.

[Figure 1: Tiers defined by class nodes A and B for context sequence /ak/. See text.]

At each node in the feature dependency graph, we can either generate a fully-specified subsegment for that node (memo draw), or assemble a novel subsegment for that node out of parts defined by the feature dependency graph (base draw). Starting at the root node of the feature dependency graph, we decide whether to do a memo draw or base draw conditional on the previous n subsegments at that node. So in order to generate the next segment following /ak/ in the example, we start at node A in the next draw from the feature geometry; with some probability we do a memo draw conditioned on /ak/, defined by the red tier. If we decide to do a base draw instead, we then repeat the procedure conditional on the previous n−1 segments, recursively until we are conditioning on the empty context. That is, we do a memo draw conditional on /k/, or conditional on the empty context. This process of conditioning on successively smaller contexts is a standard technique in Bayesian nonparametric language modeling (Teh, 2006; Goldwater et al., 2006).

At the empty context, if we decide to do a base draw, then we generate a novel segment by repeating the whole process at each child node, to generate several subsegments. In the example, we would assemble a phoneme by independently sampling subsegments at the nasal/laryngeal node B and the MANNER node, and then combining them. Crucially, the conditioning context consists only of the values at the current node in the previous phonemes. So when we sample a subsegment from node B, it is conditional on the previous two values at node B, {VOICE:+, NASAL:-} and {VOICE:-, NASAL:-}, defined by the blue tier in the figure. The process continues down the feature dependency graph recursively. At the point where the model decides on vowel place features such as height and backness, these will be conditioned only on the vowel place features of the preceding /a/, with /k/ skipped entirely as it does not have values at vowel place nodes.
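The tier-skipping behavior just described can be sketched as a small context-extraction helper in Python, a hedged illustration of what Section 3.3 formalizes as the context-truncating function f^s_n(u); the dictionary representation of segments is an assumption made for the example.

```python
def tier_context(context, node, n):
    """Return the rightmost n-1 subsegments of `context` that are specified
    at `node`, skipping segments with no value on that tier. `context` is a
    list of segments, each a dict mapping node names to subsegment values."""
    on_tier = [seg[node] for seg in context if node in seg]
    return tuple(on_tier[-(n - 1):]) if n > 1 else ()

# The /ak/ example from Figure 1 (simplified feature specifications):
a = {"SEGMENT": "a", "B": ("VOICE:+", "NASAL:-"),
     "HEIGHT": "low", "BACKNESS": "back"}
k = {"SEGMENT": "k", "B": ("VOICE:-", "NASAL:-")}  # no vowel place features
context = [a, k]

print(tier_context(context, "SEGMENT", 3))  # ('a', 'k'): whole-segment tier
print(tier_context(context, "B", 3))        # both non-oral bundles (blue tier)
print(tier_context(context, "HEIGHT", 3))   # ('low',): /k/ skipped entirely
```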
This section has provided motivations and a walkthrough of our proposed generative procedure for sequences of segments. In the next section, we give the formalization of the model.

3 Formalization of the Models

Here we give a full formal description of our proposed model in three steps. First, in Section 3.1, we formalize the generative process for a segment in isolation. Second, in Section 3.2, we give a formulation of Bayesian nonparametric N-gram models with backoff. Third, in Section 3.3, we show how to drop the generative process for a phoneme into the N-gram model such that tier-based interactions emerge naturally.

3.1 Generative Process for a Segment

A feature dependency graph G is a fully connected, singly rooted, directed, acyclic graph given by the tuple $\langle V, A, t, r \rangle$, where $V$ is a set of vertices or nodes, $A$ is a set of directed arcs, $t$ is a total function $t(n): V \mapsto \{\text{and}, \text{or}\}$, and $r$ is a distinguished root node in $V$. A directed arc is a pair $\langle p, c \rangle$ where the parent $p$ and child $c$ are both elements of $V$. The function $t(n)$ identifies whether $n$ is an and- or or-node. Define $ch(n)$ to be the function that returns all children of node $n$, that is, all $n' \in V$ such that $\langle n, n' \rangle \in A$.

A subgraph $G_s$ of feature dependency graph G is the graph obtained by starting from node $s$ and retaining only nodes and arcs reachable by traversing arcs starting from $s$. A subsegment $p_s$ is a subgraph rooted in node $s$ for which each or-node contains exactly one outgoing arc. Subsegments represent sampled phone constituents. A segment is a subsegment rooted in $r$, that is, a fully specified phoneme.

The distribution associated with a subgraph $G_s$ is given by $G_s$ below. $G_s$ is a distribution over subsegments; the distribution for the full graph $G_r$ is a distribution over fully specified segments. We occasionally overload the notation such that $G_s(p_s)$ will refer to the probability mass function associated with distribution $G_s$ evaluated at the subsegment $p_s$.

$$H_s \sim \mathrm{DP}(\theta_s, G_s) \quad (1)$$

$$G_s(p_s) = \begin{cases} \prod_{s' \in ch(s)} H_{s'}(p_{s'}) & t(s) = \text{AND} \\ \sum_{s' \in ch(s)} \psi_{ss'} H_{s'}(p_{s'}) & t(s) = \text{OR} \end{cases}$$

The first case of the definition covers and-nodes. We assume that the leaves of our feature dependency graph (which represent atomic feature values such as the laryngeal value of a PLACE feature) are childless and-nodes. The second case of the definition covers or-nodes in the graph, where $\psi_{ss'}$ is the probability associated with choosing outgoing arc $\langle s, s' \rangle$ from parent or-node $s$ to child node $s'$. Thus, or-nodes define mixture distributions over outgoing arcs. The mixture weights are drawn from a Dirichlet process. In particular, for or-node $s$ in the underlying graph G, the vector of probabilities over outgoing edges is distributed as follows.

$$\vec{\psi}_s \sim \mathrm{DP}(\theta_s, \mathrm{UNIFORM}(|ch(s)|))$$

Note that in both cases the distribution over child subgraphs is drawn from a Dirichlet process, as in Equation 1, capturing the notion of subsegmental storage discussed above.

3.2 N-Gram Models with DP-Backoff

Let $T$ be a set of discrete objects (e.g., atomic symbols or structured segments as defined in the preceding sections). Let $T^*$ be the set of all finite-length strings which can be generated by combining elements of $T$ under concatenation, $\cdot$, including the empty string $\epsilon$. A context $u$ is any finite string beginning with a special distinguished start symbol and ending with some sequence in $T^*$, that is, $u \in \{\text{start} \cdot T^*\}$. For any string $\alpha$, define $hd(\alpha)$ to be the function that returns the first symbol in the string, $tl(\alpha)$ to be the function that returns the suffix of $\alpha$ minus the first symbol, and $|\alpha|$ to be the length of $\alpha$, with $hd(\epsilon) = tl(\epsilon) = \epsilon$ and $|\epsilon| = 0$. Write the concatenation of two strings $\alpha$ and $\alpha'$ as $\alpha \cdot \alpha'$.

Let $H_u$ be a distribution on next symbols (that is, objects in $T \cup \{\text{stop}\}$) conditioned on a given context $u$. For an N-gram model of order $N$, the probability of a string $\beta$ in $T^*$ is given by $K^N_{\text{start}}(\beta \cdot \text{stop})$, where $K^N_u(\alpha)$ is defined as:

$$K^N_u(\alpha) = \begin{cases} 1 & \alpha = \epsilon \\ H_{f_N(u)}(hd(\alpha)) \times K^N_{u \cdot hd(\alpha)}(tl(\alpha)) & \text{otherwise,} \end{cases} \quad (2)$$

where $f_n(\cdot)$ is a context-management function which determines which parts of the left-context should be used to determine the probability of the current symbol. In the case of the N-gram models used in this paper, $f_n(\cdot)$ takes a sequence $u$ and returns only the rightmost $n-1$ elements from the sequence, or the entire sequence if it has length less than $n$.

Note two aspects of this formulation of N-gram models. First, $H_u$ is a family of distributions over next symbols or more general objects. Later, we will drop in phonological-feature-based generative processes for these distributions. Second, the function $f_n$ is a parameter of the above definitions. In what follows, we will use a variant of this function which is sensitive to tier-based structure, returning the previous $n-1$ elements only on the appropriate tier.

MacKay and Peto (1994) introduced a hierarchical Dirichlet process-based backoff scheme for N-gram models, with generalizations in Teh (2006) and Goldwater et al. (2006). In this setup, the distribution over next symbols given a context $u$ is drawn hierarchically from a Dirichlet process whose base measure is another Dirichlet process associated with context $tl(u)$, and so on, with all draws ultimately backing off into some unconditioned distribution over all possible next symbols. That is, in a hierarchical Dirichlet process N-gram model, $H_{f_n(u)}$ is given as follows.

$$H_{f_n(u)} \sim \begin{cases} \mathrm{DP}(\theta_{f_n(u)}, H_{f_{n-1}(u)}) & n \geq 1 \\ \mathrm{DP}(\theta_{f_n(u)}, \mathrm{UNIFORM}(T \cup \{\text{stop}\})) & n = 0 \end{cases}$$
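As a concrete, deliberately simplified illustration of Equation 2 with backoff, the sketch below scores strings with the predictive rule P(w | u) = (c(u, w) + θ · P(w | tl(u))) / (c(u) + θ). Counting every observation in every suffix context, as done here, is a crude stand-in for the Chinese Restaurant Process seating bookkeeping of the actual hierarchical DP; the alphabet and θ are placeholders.

```python
import math
from collections import defaultdict

class BackoffNGram:
    """Sketch of an order-N backoff N-gram over atomic symbols (Section 3.2),
    with contexts recursively backed off to the uniform base distribution."""

    def __init__(self, order=3, theta=1.0, alphabet=()):
        self.order, self.theta = order, theta
        self.symbols = list(alphabet) + ["stop"]
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> w -> c

    def observe(self, form):
        context = ("start",)
        for w in list(form) + ["stop"]:
            for k in range(len(context) + 1):     # simplification: count w in
                self.counts[context[k:]][w] += 1  # every suffix of the context
            context = (context + (w,))[-(self.order - 1):]

    def prob(self, w, context):
        if len(context) == 0:
            base = 1.0 / len(self.symbols)      # UNIFORM(T + {stop}) at n = 0
        else:
            base = self.prob(w, context[1:])    # back off to tl(context)
        c = self.counts[context]
        return (c[w] + self.theta * base) / (sum(c.values()) + self.theta)

    def logprob(self, form):
        """log K^N_start(form . stop): one predictive term per symbol."""
        lp, context = 0.0, ("start",)
        for w in list(form) + ["stop"]:
            lp += math.log(self.prob(w, context))
            context = (context + (w,))[-(self.order - 1):]
        return lp

m = BackoffNGram(order=3, theta=1.0, alphabet="pbtdkgaiu")
for form in ["pat", "bat", "tap", "pit"]:
    m.observe(form)
print(m.logprob("pat"), m.logprob("gup"))  # attested beats unattested
```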
3.3 Tier-Based Interactions

To make the N-gram model defined in the last section capture tier-based interactions, we make two changes. First, we generalize the generative process $H_s$ from Equation 1 to $H^s_u$, which generates subsegments conditional on a sequence $u$. Second, we define a context-truncating function $f^s_n(u)$ which takes a context of segments $u$ and returns the rightmost $n-1$ non-empty subsegments whose root node is $s$. Then we substitute the generative process $H^s_{f^s_n(u)}$ (which applies the context-management function $f^s_n(\cdot)$ to the context $u$) for $H_{f_n(u)}$ in Equation 2. The resulting probability distribution is:

$$K^N_u(\alpha) = \begin{cases} 1 & \alpha = \epsilon \\ H^r_{f^r_N(u)}(hd(\alpha)) \times K^N_{u \cdot hd(\alpha)}(tl(\alpha)) & \text{otherwise.} \end{cases}$$

$K^N_u(\alpha)$ is the distribution over continuations given a context of segments. Its definition depends on $H^s_{f^s_n(u)}$, which is the generalization of the generative process for segments $H_s$ to be conditional on some tier-based N-gram context $f^s_n(u)$. $H^s_{f^s_n(u)}$ is:

$$H^s_{f^s_n(u)} \sim \begin{cases} \mathrm{DP}(\theta^s_{f^s_n(u)}, H^s_{f^s_{n-1}(u)}) & n \geq 1 \\ \mathrm{DP}(\theta^s_{f^s_n(u)}, G^s_{f^s_N(u)}) & n = 0 \end{cases}$$

$$G^s_{f^s_n(u)}(p_s) = \begin{cases} \prod_{s' \in ch(s)} H^{s'}_{f^{s'}_n(u)}(p_{s'}) & t(s) = \text{AND} \\ \sum_{s' \in ch(s)} \psi_{ss'} H^{s'}_{f^{s'}_n(u)}(p_{s'}) & t(s) = \text{OR.} \end{cases}$$

$H^s_{f^s_n(u)}$ and $G^s_{f^s_n(u)}$ above are mutually recursive functions. $H^s_{f^s_n(u)}$ implements backoff in the tier-based context of previous subsegments; $G^s_{f^s_n(u)}$ implements backoff by going down into the probability distributions defined by the feature dependency graph. Note that the function $H^s_{f^s_n(u)}$ recursively backs off to the empty context, but its ultimate base distribution is indexed by $f^s_N(u)$, using the global maximum N-gram order $N$. So when samples are drawn from the feature dependency graph, they are conditioned on non-empty tier-based contexts. In this way, subsegments are generated based on tier-based context and based on featural backoff in an interleaved fashion.

3.4 Inference

We use the Chinese Restaurant Process representation for sampling. Inference in the model is over seating arrangements for observations of subsegments and over the hyperparameters $\theta$ for each restaurant. We perform Gibbs sampling on seating arrangements in the Dirichlet N-gram models by removing and re-adding observations in each restaurant. These Gibbs sweeps had negligible impact on model behavior. For the concentration parameter $\theta$, we set a prior $\mathrm{Gamma}(10, .1)$. We draw posterior samples using the slice sampler described in Johnson and Goldwater (2009). We draw one posterior sample of the hyperparameters for each Gibbs sweep. In contrast to the Gibbs sweeps, we found resampling hyperparameters to be crucial for achieving the performance described below (Section 5.3).
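For illustration, here is a sketch of the hyperparameter step for a single restaurant, using a generic univariate slice sampler with stepping out (Neal's construction, in the spirit of the sampler cited above). This is not the authors' code: the shape-rate reading of Gamma(10, .1) is an assumption, and in the full model the likelihood term would be summed over all restaurants sharing a given θ.

```python
import math, random

def log_post(theta, num_tables, num_customers, a=10.0, b=0.1):
    """Unnormalized log posterior of a CRP concentration parameter theta, for
    a seating with num_tables tables and num_customers customers, under an
    assumed Gamma(shape=a, rate=b) prior. The seating likelihood depends on
    theta as theta^K * Gamma(theta) / Gamma(theta + N)."""
    if theta <= 0:
        return -math.inf
    log_prior = (a - 1) * math.log(theta) - b * theta
    log_lik = (num_tables * math.log(theta)
               + math.lgamma(theta) - math.lgamma(theta + num_customers))
    return log_prior + log_lik

def slice_sample(x, logp, width=1.0, max_steps=50):
    """One update of a univariate slice sampler with stepping out."""
    logy = logp(x) + math.log(1 - random.random())  # auxiliary slice height
    left = x - width * random.random()
    right = left + width
    for _ in range(max_steps):                      # step out to cover slice
        if logp(left) < logy:
            break
        left -= width
    for _ in range(max_steps):
        if logp(right) < logy:
            break
        right += width
    while True:                                     # sample, shrinking bracket
        x_new = random.uniform(left, right)
        if logp(x_new) >= logy:
            return x_new
        if x_new < x:
            left = x_new
        else:
            right = x_new

theta = 1.0
for _ in range(100):  # e.g., one theta update per Gibbs sweep
    theta = slice_sample(theta,
                         lambda t: log_post(t, num_tables=20, num_customers=500))
print(theta)
```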
4 Related Work

Phonotactics has proven a fruitful problem domain for computational models. Most such work has adopted a constraint-based approach, attempting to design a scoring function based on phonological features to separate acceptable forms from unacceptable ones, typically by formulating restrictions or constraints to rule out less-good structures.

This concept has led naturally to the use of undirected (maximum-entropy, log-linear) models. In this class of models, a form is scored by evaluation against a number of predicates, called factors [Footnote 3], for example, whether two adjacent segments have the phonological features VOICE:+ VOICE:-. Each factor is associated with a weight, and the score for a form is the sum of the weights of the factors which are true for the form. The well-known model of Hayes and Wilson (2008) adopts this framework, pairing it with a heuristic procedure for finding explanatory factors while preventing overfitting. Similarly, Albright (2009) assigns a score to forms based on factors defined over natural classes of adjacent segments. Constraint-based models have the advantage of flexibility: it is possible to score forms using arbitrarily complex and overlapping sets of factors. For example, one can state a constraint against adjacent phonemes having features VOICE:+ and LATERAL:+, or any combination of feature values.

[Footnote 3: Factors are also commonly called "features", a term we avoid to prevent confusion with phonological features.]

In contrast, we have presented a model where forms are built out of parts by structure-building operations. From this perspective, the goal of a model is not to rule out bad forms, but rather to discover repeating structures in good forms, such that new forms with those structures can be generated. In this setting there is less flexibility in how phonological features can affect well-formedness. For a structure-building model to assign "scores" to arbitrary pairs of co-occurring features, there must be a point in the generative process where those features are considered in isolation. Coming up with such a process has been challenging. As a result of this limitation, structure-building models of phonotactics have not generally included rich featural interactions. For example, Coleman and Pierrehumbert (1997) give a probabilistic model for phonotactics where words are generated using a grammar over units such as syllables, onsets, and rhymes. This model does not incorporate fine-grained phonological features such as voicing and place.

In fact, it has been argued that a constraint-based approach is required in order to capture rich feature-based interactions. For example, Goldsmith and Riggle (2012) develop a tier-based structure-building model of Finnish phonotactics which captures nonlocal vowel harmony interactions, but argue that this model is inadequate because it does not assign higher probabilities to forms than an N-gram model, a common baseline model for phonotactics (Daland et al., 2011). They argue that this deficiency is because the model cannot simultaneously model nonlocal vowel-vowel interactions and local consonant-vowel interactions. Because of our tier-based conditioning mechanism (Sections 2.4 and 3.3), our model can simultaneously produce local and nonlocal interactions between features using structure-building operations, and does assign higher probabilities to held-out forms than an N-gram model (Section 5.3). From this perspective, our model can be seen as a proof of concept that it is possible to have rich feature-based conditioning without adopting a constraint-based approach.

While our model can capture featural interactions, it is less flexible than a constraint-based model in that the allowable interactions are specified by the feature dependency graph. For example, there is no way to encode a direct constraint against adjacent phonemes having features VOICE:+ and LATERAL:+. We consider this a strength of the approach: a particular feature dependency graph is a parameter of our model, and a specific scientific hypothesis about the space of likely featural interactions between phonemes, similar to feature geometries from classical generative phonology (Clements, 1985; McCarthy, 1988; Halle, 1995). [Footnote 4]

[Footnote 4: We do however note that it may be possible to learn feature hierarchies on a language-by-language basis from universal articulatory and acoustic biases, as suggested by Dresher (2009).]

While probabilistic approaches have mostly taken a constraint-based approach, recent formal language theoretic approaches to phonology have investigated what basic parts and structure building operations are needed to capture realistic feature-based interactions (Heinz et al., 2011; Jardine and Heinz, 2015). We see probabilistic structure-building approaches such as this work as a way to unify the recent formal language theoretic advances in computational phonology with computational phonotactic modeling. Our model joins other NLP work attempting to do sequence generation where each symbol is generated based on a rich featural representation of previous symbols (Bilmes and Kirchhoff, 2003; Duh and Kirchhoff, 2004), though we focus more on phonology-specific representations. Our and-or graphs are similar to those used in computer vision to represent possible objects (Jin and Geman, 2006).

5 Model Evaluation and Experiments

Here we evaluate some of the design decisions of our model and compare it to a baseline N-gram model and to a widely-used constraint-based model, BLICK.
In order to probe model behavior, we also present evaluations on artificial data, and a sampling of "representative forms" preferred by one model as compared to another.

Our model consists of structure-building operations over a learned inventory of subsegments. If our model can exploit more repeated structure in phonological forms than the N-gram model or constraint-based models, then it should assign higher probabilities to forms. The log probability of a form under a model corresponds to the description length of that form under the model; if a model assigns a higher log probability to a form, that means the model is capable of compressing the form more than other models. Therefore, we compare models on their ability to assign high probabilities to phonological forms, as in Goldsmith and Riggle (2012).

5.1 Evaluation of Model Components

We are interested in discovering the extent to which each model component described above (feature dependency graphs (Section 2.1), class node structure (Section 2.2), and tier-based conditioning (Section 2.4)) contributes to the ability of the model to explain word forms.

To evaluate the contribution of feature dependency graphs, we compare our models with a baseline N-gram model, which represents phonemes as atomic units. For this N-gram model, we use a Hierarchical Dirichlet Process with n = 3.

To evaluate feature dependency graphs with and without articulated class node structure, we compare models using the graph shown in Figure 3 (the minimal structure required to produce well-formed phonemes) to models with the graph shown in Figure 2, which includes phonologically motivated "class nodes". [Footnote 5]

[Footnote 5: These feature dependency graphs differ from those in the exposition in Section 2 in that they do not include a VOCALIC feature, but rather treat vowel as a possible value of MANNER.]

To evaluate tier-based conditioning, we compare models with the conditioning described in Sections 2.4 and 3.3 to models where all decisions are conditioned on the full featural specification of the previous n−1 phonemes. This allows us to isolate improvements due to tier-based conditioning beyond improvements from the feature hierarchy.

[Figure 2: Feature dependency graph with class node structure used in our experiments, over the features duration, laryngeal, nasal, manner, suprasegmental, backness, height, rounding, C place, 2nd art., and lateral. Plain text nodes are OR-nodes with no child distributions. The arc marked otherwise represents several arcs, each labelled with a consonant manner such as stop, fricative, etc.]

[Figure 3: "Flat" feature dependency graph over the same features.]

5.2 Lexicon Data

The WOLEX corpus provides transcriptions for words in dictionaries of 60 diverse languages, represented in terms of phonological features (Graff, 2012). In addition to words, the dictionaries include some short set phrases, such as of course. We use the featural representation of WOLEX, and design our feature dependency graphs to generate only well-formed phonemes according to this feature system. For space reasons, we present the evaluation of our model on 14 of these languages, chosen based on the quality of their transcribed lexicons, and the authors' knowledge of their phonological systems.

5.3 Held-Out Evaluation

Here we test whether the different model configurations described above assign high probability to held-out forms. This tests the models' ability to generalize beyond their training data. We train each model on 2500 randomly selected word forms from a WOLEX dictionary, and compute posterior predictive probabilities for the remaining word forms from the final state of the model.
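The comparison reported in Table 1 below can be sketched as follows; the paired t-test over the same held-out forms is our reading of the caption's significance test, and the per-form score lists would come from trained models.

```python
from scipy.stats import ttest_rel

def compare_heldout(scores_model, scores_ngram):
    """Given per-form held-out log probabilities under a model variant and
    under the N-gram baseline, return the two averages reported in Table 1
    and the p-value behind its significance asterisks."""
    mean_a = sum(scores_model) / len(scores_model)
    mean_b = sum(scores_ngram) / len(scores_ngram)
    stat, pvalue = ttest_rel(scores_model, scores_ngram)  # paired over forms
    return mean_a, mean_b, pvalue

# Toy example with made-up scores for three held-out forms:
print(compare_heldout([-21.5, -22.0, -21.1], [-22.3, -22.1, -21.9]))
```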
Language         ngram    flat     cl.node    flat/no tiers   cl.node/no tiers
English          -22.20   -22.15   -21.73**   -22.15          -22.14
French           -18.30   -18.28   -17.93**   -18.29          -18.28
Georgian         -20.21   -20.17   -19.64*    -20.18          -20.18
German           -24.77   -24.72   -24.07**   -24.73          -24.74
Greek            -22.48   -22.45   -21.65**   -22.45          -22.45
Haitian Creole   -16.09   -16.04   -15.82**   -16.05          -16.04
Lithuanian       -19.03   -18.99   -18.58*    -18.99          -18.99
Mandarin         -13.95   -13.83*  -13.78**   -13.82*         -13.82*
Mor. Arabic      -16.15   -16.10   -16.00*    -16.13          -16.12
Polish           -20.12   -20.08   -19.76**   -20.08          -20.07
Quechua          -14.35   -14.30   -13.87*    -14.30          -14.31
Romanian         -18.71   -18.68   -18.32**   -18.69          -18.68
Tatar            -16.21   -16.18   -15.65**   -16.19          -16.19
Turkish          -18.88   -18.85   -18.55**   -18.85          -18.84

Table 1: Average log posterior predictive probability of a held-out form. "ngram" is the DP Backoff 3-gram model. "flat" models use the feature dependency graph in Figure 3. "cl.node" models use the graph in Figure 2. See text for motivations of these graphs. "no tiers" models condition each decision on the previous phoneme, rather than on tiers of previous features. Asterisks indicate statistical significance according to a t-test comparing with the scores under the N-gram model. * = p < .05; ** = p < .001.

Table 1 shows the average probability of a held-out word under our models and under the N-gram model for one model run. [Footnote 6] For all languages, we get a statistically significant increase in probabilities by adopting the autosegmental model with class nodes and tier-based conditioning. Model variants without either component do not significantly outperform the N-gram model except in Mandarin. The combination of class nodes and tier-based conditioning results in model improvements beyond the contributions of the individual features.

[Footnote 6: The mean standard deviation per form of log probabilities over 50 runs of the full model ranged from .09 for Amharic to .23 for Dutch.]

5.4 Evaluation on Artificial Data

Our model outperforms the N-gram model in predicting held-out forms, but it remains to be shown that this performance is due to capturing the kinds of linguistic intuitions discussed in Section 2. An alternative possibility is that the Autosegmental N-gram model, which has many more parameters than a plain N-gram model, can simply learn a more accurate model of any sequence, even if that sequence has none of the structure discussed above. To evaluate this possibility, we compare the performance of our model in predicting held-out linguistic forms to its performance in predicting held-out forms from artificial lexicons which expressly do not have the linguistic structure we are interested in.

If the autosegmental model outperforms the N-gram model even on artificial data with no phonological structure, then its performance on the real linguistic data in Section 5.3 might be overfitting. On the other hand, if the autosegmental model does better on real data but not artificial data, then we can conclude that it is picking up on some real distinctive structure of that data.

For each real lexicon L_r, we generate an artificial lexicon L_a by training a DP 3-gram model on L_r and forward-sampling |L_r| forms. Additionally, the forms in L_a are constrained to have the same distribution over lengths as the forms in L_r. The resulting lexicons have no tier-based or featural interactions except as they appear by chance from the N-gram model trained on these lexica. For each L_a we then train our models on the first 2500 forms and score the probabilities of the held-out forms, the same procedure as in Section 5.3.

We ran this procedure for all the lexicons shown in Table 1. For all but one lexicon, we find that the autosegmental models do not significantly outperform the N-gram models on artificial data. The exception is Mandarin Chinese, where the average log probability of an artificial form is −13.81 under the N-gram model and −13.71 under the full autosegmental model. The result suggests that the anomalous behavior of Mandarin Chinese in Section 5.3 may be due to overfitting.

When exposed to data that explicitly does not have autosegmental structure, the model is not more accurate than a plain sequence model for almost all languages. But when exposed to real linguistic data, the model is more accurate. This result provides evidence that the generative model developed in Section 2 captures true distributional properties of lexicons that are absent in N-gram distributions, such as featural and tier-based interactions.
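One way to implement the artificial-lexicon control just described is sketched below; the rejection-based length matching and the toy sampler are assumptions, since the paper does not specify how the length constraint was enforced.

```python
import random
from collections import Counter

def artificial_lexicon(real_lexicon, sample_form, max_tries=1000000):
    """Build an artificial lexicon with |L_r| forms by forward-sampling from
    a model trained on L_r (here `sample_form`, e.g. a trained DP 3-gram),
    keeping only samples needed to match L_r's length distribution."""
    needed = Counter(len(f) for f in real_lexicon)  # target length histogram
    out = []
    for _ in range(max_tries):
        if sum(needed.values()) == 0:
            break
        form = sample_form()
        if needed[len(form)] > 0:   # accept only lengths still needed
            needed[len(form)] -= 1
            out.append(form)
    return out

# Toy stand-in sampler over three symbols:
toy = lambda: "".join(random.choice("pak") for _ in range(random.randint(2, 6)))
print(artificial_lexicon(["pa", "kap", "paka"], toy))
```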
5.5 Comparison with a Constraint-Based Model

Here we provide a comparison with Hayes and Wilson (2008)'s Phonotactic Learner, which outputs a phonotactic grammar in the form of a set of weighted constraints on feature co-occurrences. This grammar is optimized to match the constraint violation profile in a training lexicon, and so can be seen as a probabilistic model of that lexicon. The authors have distributed one such grammar, BLICK, as a "reference point for phonotactic probability in experimentation" (Hayes, 2012). Here we compare our model against BLICK on its ability to assign probabilities to forms, as in Section 5.3.

Ideally, we would simply compute the probability of forms like we did in our earlier model comparisons. BLICK returns scores for each form. However, since the probabilistic model underlying BLICK is undirected, these scores are in fact unnormalized log probabilities, so they cannot be compared directly to the normalized probabilities assigned by the other models. Furthermore, because the probabilistic model underlying BLICK does not penalize forms for length, the normalizing constant over all forms is in fact infinite, making straightforward comparison of predictive probabilities impossible. Nevertheless, we can turn BLICK scores into probabilities by conditioning on further constraints, such as the length k of the form. We enumerate all possible forms of length k to compute the normalizing constant for the distribution over forms of that length. The same procedure can also be used to compute the probabilities of each form, conditioned on the length of the form k, under the N-gram and Autosegmental models.

To compare our models against BLICK, we calculate conditional probabilities for forms of length 2 through 5 from the English lexicon. [Footnote 7] The forms are those in the WOLEX corpus; we include them for this evaluation if they are k symbols long in the WOLEX representation. For our N-gram and Autosegmental models, we use the same models as in Section 5.3. The average probabilities of forms under the three models are shown in Table 2. For lengths 3-5, the autosegmental model assigns the highest probabilities, followed by the N-gram model and BLICK. For length 2, BLICK outperforms the DP N-gram model but not the autosegmental model.

[Footnote 7: Enumerating and scoring the 22,164,361,129 possible forms of length 6 was computationally impractical.]

Length   BLICK   ngram   cl.node
2        -6.50   -6.81   -5.18
3        -9.38   -8.76   -7.95
4        -14.1   -11.7   -11.4
5        -18.1   -14.2   -13.9

Table 2: Average log posterior predictive probability of an English form of fixed length under BLICK and our models.

Our model assigns higher probabilities to short forms than BLICK. That is, our models have identified more redundant structure in the forms than BLICK, allowing them to compress the data more. However, the comparison is imperfect in several ways. First, BLICK and our models were trained on different data; it is possible that our training data are more representative of our test data than BLICK's training data were. Second, BLICK uses a different underlying featural decomposition than our models; it is possible that our feature system is more accurate. Nevertheless, these results show that our model concentrates more probability mass on (short) forms attested in a language, whereas BLICK likely spreads its probability mass more evenly over the space of all possible (short) strings.
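The length-conditional normalization used above can be sketched as follows; the alphabet and scorer here are toy assumptions, and with real segment inventories the |alphabet|^k enumeration is exactly the cost that footnote 7 describes.

```python
import itertools, math

def length_conditional_logprob(logscore, form, alphabet):
    """Convert an unnormalized log score (e.g., a BLICK score) into a log
    probability conditioned on the form's length k, by enumerating all
    strings of length k over `alphabet` to get the normalizing constant."""
    k = len(form)
    scores = [logscore("".join(s))
              for s in itertools.product(alphabet, repeat=k)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(x - m) for x in scores))  # log-sum-exp
    return logscore(form) - log_z

# Toy example: a made-up scorer that penalizes /k/, over a 3-symbol alphabet.
score = lambda s: -1.0 * s.count("k")
print(length_conditional_logprob(score, "pak", "pak"))
```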
5.6 Representative Forms

In order to get a sense of the differences between models, we investigate what phonological forms are preferred by different kinds of models. These forms might be informative about the phonotactic patterns that our model is capturing which are not well-represented in simpler models. We calculate the representativeness of a form f with respect to model m1 as opposed to m2 as p(f|m1)/p(f|m2) (Good, 1965; Tenenbaum and Griffiths, 2001). The forms that are most "representative" of model m1 are not the forms that m1 assigns the highest probability, but rather the forms that m1 ranks highest relative to m2.

Tables 3 and 4 show forms from the lexicon that are most representative of our full model and of the N-gram model for English and Turkish.

English N-gram     English Full Model
collaborationist   mistrustful
a posteriori       inharmoniousness
sacristy           absentmindedness
matter of course   blamelessness
earnest money      phlegmatically

Table 3: Most representative forms for the N-gram model and for our full model ("cl.node" in Table 1) in English. Forms are presented in native orthography, but were scored based on their phonetic form.

Turkish N-gram   Turkish Full Model
üstfamilya       büyükkarapınar
dekstrin         kızılcapınar
mnemotekni       altınpınar
ekskavatör       sarımehmetler
foksterye        karaelliler

Table 4: Most representative forms for N-gram and Autosegmental models in Turkish.

The most uniquely representative forms for our full model are morphologically complex forms consisting of many productive, frequently reused morphemes such as ness. On the other hand, the representative forms for the N-gram model include foreign forms such as a posteriori (for English) and ekskavatör (for Turkish), which are not built out of parts that frequently repeat in those languages. The representative forms suggest that the full model places more probability mass on words which are built out of highly productive, phonotactically well-formed parts.
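The ranking behind Tables 3 and 4 reduces to sorting by a log ratio, sketched here with hypothetical scoring functions standing in for the two trained models.

```python
def most_representative(forms, logprob_m1, logprob_m2, k=5):
    """Rank forms by representativeness log p(f|m1) - log p(f|m2): the forms
    m1 prefers most relative to m2, not m1's highest-probability forms."""
    return sorted(forms, key=lambda f: logprob_m1(f) - logprob_m2(f),
                  reverse=True)[:k]

# Toy example: m1 penalizes length less steeply than m2, so it ranks longer
# forms as more representative.
forms = ["ness", "mistrustful", "a posteriori", "blamelessness"]
print(most_representative(forms, lambda f: -0.5 * len(f), lambda f: -len(f)))
```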
6 Discussion

We find that our models succeed in assigning high probabilities to unseen forms, that they do so specifically for linguistic forms and not random sequences, that they tend to favor forms with many productive parts, and that they perform comparably to a state-of-the-art constraint-based model in assigning probabilities to short forms.

The improvement for our models over the N-gram baseline is consistent but not large. We attribute this to the way in which phonological generalizations are used in the present model: in particular, phonological generalizations function primarily as a form of backoff for a sequence model. Our models have lexical memoization at each node in a feature dependency graph; as such, the top node in the graph ends up representing transition probabilities for whole phonemes conditioned on previous phonemes, and the rest of the feature dependency graph functions as a backoff distribution. When a model has been exposed to many training forms, its behavior will be largely dominated by the N-gram-like behavior of the top node. In future work it might be effective to learn an optimal backoff procedure which gives more influence to the base distribution (Duh and Kirchhoff, 2004; Wood and Teh, 2009).

While the tier-based conditioning in our model would seem to be capable of modeling nonlocal interactions such as vowel harmony, we have not found that the models do well at reproducing these nonlocal interactions. We believe this is because the model's behavior is dominated by nodes high in the feature dependency graph. In any case, a simple Markov model defined over tiers, as we have presented here, might not be enough to fully model vowel harmony. Rather, a model of phonological processes, transducing underlying forms to surface forms, seems like a more natural way to capture these phenomena.

We stress that this model is not tied to a particular feature dependency graph. In fact, we believe our model provides a novel way of testing different hypotheses about feature structures, and could form the basis for learning the optimal feature hierarchy for a given dataset. The choice of feature dependency graph has a large effect on what featural interactions the model can represent directly. For example, neither feature dependency graph has shared place features for consonants and vowels, so the model has limited ability to represent place-based restrictions on consonant-vowel sequences such as requirements for labialized or palatalized consonants in the context of /u/ or /i/. These interactions can be treated in our framework if vowels and consonants share place features, as in Padgett (2011).

7 Conclusion

We have presented a probabilistic generative model for sequences of phonemes defined in terms of phonological features, based on representational ideas from generative phonology and tools from Bayesian nonparametric modeling. We consider our model as a proof of concept that probabilistic structure-building models can include rich featural interactions. Our model robustly outperforms an N-gram model on simple metrics, and learns to generate forms consisting of highly productive parts. We also view this work as a test of the scientific hypotheses that phonological features can be organized in a hierarchy and that they interact along tiers: in our model evaluation, we found that both concepts were necessary to get an improvement over a baseline N-gram model.

Acknowledgments

We would like to thank Tal Linzen, Leon Bergen, Edward Flemming, Edward Gibson, Bob Berwick, Jim Glass, and the audiences at MIT's Phonology Circle, SIGMORPHON, and the LSA 2016 Annual Meeting for helpful comments. This work was supported in part by NSF DDRIG Grant #1551543 to R.F.

References

Adam Albright. 2009. Feature-based generalization as a source of gradient acceptability. Phonology, 26:9–41.

Robert C. Berwick. 1985. The Acquisition of Syntactic Knowledge. MIT Press, Cambridge, MA.

Jeff A. Bilmes and Katrin Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003 – Short Papers – Volume 2, pages 4–6. Association for Computational Linguistics.

Geert Booij. 2011. Morpheme structure constraints. In The Blackwell Companion to Phonology. Blackwell.

Noam Chomsky and Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics, 1:97–138.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York, NY.

George N. Clements and Elizabeth V. Hume. 1995. The internal organization of speech sounds. In The Handbook of Phonological Theory, pages 24–306. Blackwell, Oxford.

George N. Clements. 1985. The geometry of phonological features. Phonology Yearbook, 2:225–252.

John Coleman and Janet B. Pierrehumbert. 1997. Stochastic phonological grammars and acceptability. In John Coleman, editor, Proceedings of the 3rd Meeting of the ACL Special Interest Group in Computational Phonology, pages 49–56, Somerset, NJ. Association for Computational Linguistics.

Robert Daland, Bruce Hayes, James White, Marc Garellek, Andreas Davis, and Ingrid Normann. 2011. Explaining sonority projection effects. Phonology, 28:197–234.

Carl de Marcken. 1996. Linguistic structure as composition and perturbation. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pages 335–341. Association for Computational Linguistics.

B. Elan Dresher. 2009. The Contrastive Hierarchy in Phonology. Cambridge University Press.

Kevin Duh and Katrin Kirchhoff. 2004. Automatic learning of language model structure. In Proceedings of COLING 2004, Geneva, Switzerland.

Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.

John Goldsmith and Jason Riggle. 2012. Information theoretic approaches to phonology: the case of Finnish vowel harmony. Natural Language and Linguistic Theory, 30(3):859–896.
Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, volume 18, pages 459–466, Cambridge, MA. MIT Press.

Sharon Goldwater. 2006. Nonparametric Bayesian Models of Lexical Acquisition. Ph.D. thesis, Brown University.

Irving John Good. 1965. The Estimation of Probabilities. MIT Press, Cambridge, MA.

Noah D. Goodman, Vikash K. Mansinghka, Daniel Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. Church: A language for generative models. In Uncertainty in Artificial Intelligence, Helsinki, Finland. AUAI Press.

Peter Graff. 2012. Communicative Efficiency in the Lexicon. Ph.D. thesis, Massachusetts Institute of Technology.

Morris Halle. 1959. The Sound Pattern of Russian: A Linguistic and Acoustical Investigation. Mouton, The Hague, The Netherlands.

Morris Halle. 1995. Feature geometry and feature spreading. Linguistic Inquiry, 26(1):1–46.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Bruce Hayes. 2012. BLICK – a phonotactic probability calculator.

Jeffrey Heinz, Chetan Rawal, and Herbert G. Tanner. 2011. Tier-based strictly local constraints for phonology. In The 49th Annual Meeting of the Association for Computational Linguistics.

Roman Jakobson, C. Gunnar M. Fant, and Morris Halle. 1952. Preliminaries to Speech Analysis: The Distinctive Features and their Correlates. The MIT Press, Cambridge, Massachusetts and London, England.

Adam Jardine and Jeffrey Heinz. 2015. A concatenation operation to derive autosegmental graphs. In Proceedings of the 14th Meeting on the Mathematics of Language (MoL 2015), pages 139–151, Chicago, USA, July.
Ya Jin and Stuart Geman. 2006. Context and hierarchy in a probabilistic image model. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), pages 2145–2152.

Mark Johnson and Sharon Goldwater. 2009. Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In NAACL-HLT 2009, pages 317–325. Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, pages 641–648.

Mary-Louise Kean. 1975. The Theory of Markedness in Generative Grammar. Ph.D. thesis, Massachusetts Institute of Technology.

David J. C. MacKay and Linda C. Bauman Peto. 1994. A hierarchical Dirichlet language model. Natural Language Engineering, 1:1–19.

John McCarthy. 1988. Feature geometry and dependency: A review. Phonetica, 43:84–108.

Donald Michie. 1968. "Memo" functions and machine learning. Nature, 218:19–22.

Timothy J. O'Donnell. 2015. Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. The MIT Press, Cambridge, Massachusetts and London, England.

Jaye Padgett. 2011. Consonant-vowel place feature interactions. In The Blackwell Companion to Phonology, pages 1761–1786. Blackwell Publishing, Malden, MA.

Ezer Rasin and Roni Katzir. 2014. A learnability argument for constraints on underlying representations. In Proceedings of the 45th Annual Meeting of the North East Linguistic Society (NELS 45), Cambridge, Massachusetts.

Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650.

Richard Stanley. 1967. Redundancy rules in phonology. Language, 43(2):393–436.

Yee Whye Teh. 2006. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore.

Joshua B. Tenenbaum and Thomas L. Griffiths. 2001. The rational basis of representativeness. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society, pages 1036–1041.

Nikolai S. Trubetzkoy. 1939. Grundzüge der Phonologie. Number 7 in Travaux du Cercle Linguistique de Prague. Vandenhoeck & Ruprecht, Göttingen.

Frank Wood and Yee Whye Teh. 2009. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, pages 607–614.

Frank Wood, Cédric Archambeau, Jan Gasthaus, Lancelot James, and Yee Whye Teh. 2009. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada.