Transactions of the Association for Computational Linguistics, vol. 3, pp. 419–432, 2015. Action Editor: Philipp Koehn.
Submission batch: 4/2015; Revision batch: 7/2015; Published 7/2015.
© 2015 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.

Unsupervised Identification of Translationese

Ella Rabinovich, Department of Computer Science, University of Haifa (ellarabi@csweb.haifa.ac.il)
Shuly Wintner, Department of Computer Science, University of Haifa (shuly@cs.haifa.ac.il)

Abstract

Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences were proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifier is evaluated outside the domain it was trained on. We show that this is indeed the case, in a variety of evaluation scenarios. We then show that unsupervised classification is highly accurate on this task. We suggest a method for determining the correct labels of the clustering outcomes, and then use the labels for voting, improving the accuracy even further. Moreover, we suggest a simple method for clustering in the challenging case of mixed-domain datasets, in spite of the dominance of domain-related features over translation-related ones. The result is an effective, fully-unsupervised method for distinguishing between original and translated texts that can be applied to new domains with reasonable accuracy.

1 Introduction

Human-translated texts (in any language) have distinct features that distinguish them from original, non-translated texts. These differences stem either from the effect of the translation process on the translated outcomes, or from "fingerprints" of the source language on the target language product. The term translationese was coined to indicate the unique properties of translations.

Awareness of translationese can improve statistical machine translation (SMT). First, for training translation models, parallel texts that were translated in the direction of the SMT task are preferable to texts translated in the opposite direction; second, for training language models, monolingual corpora of translated texts are better than original texts.

It is possible to automatically distinguish between original (O) and translated (T) texts, with very high accuracy, by employing text classification methods. Existing approaches, however, only employ supervised machine learning; they therefore suffer from two main drawbacks: (i) they inherently depend on data annotated with the translation direction, and (ii) they may not generalize to unseen (related or unrelated) domains.[1] These shortcomings undermine the usability of supervised methods for translationese identification in a typical real-life scenario, where no labelled in-domain data are available.

[1] We use "domain" rather freely henceforth to indicate not only the topic of a corpus but also its modality (written vs. spoken), register, genre, date, etc.

In this work we explore unsupervised techniques for reliable discrimination of original and translated texts. More precisely, we apply dimension reduction and centroid-based clustering methods (enhanced by internal clustering evaluation) for telling O from T in an unsupervised scenario. Furthermore, we introduce a robust methodology for labelling the obtained clusters, i.e., annotating them as "original" or "translated", by inspecting similarities between the clustering outcomes and prototypical O and T examples. Rigorous experiments with four diverse corpora demonstrate that clustering of in-domain texts using lexical, content-independent features systematically yields very high accuracy, only 10 percent points lower than the performance of supervised classification on the same data (in most cases).

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
4
8
1
5
6
6
7
9
8

/

/
t

je

un
c
_
un
_
0
0
1
4
8
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

420

Accuracy can be improved even further by clustering consensus techniques.

We further scrutinize the tension between domain-related and translationese-based text properties. Using a series of experiments in a mixed-domain setup, we show that clustering (in particular, relying on content-independent features) perfectly groups the data into domains, rather than into the (desirable) cross-domain O and T; that is, domain-related properties clearly dominate and overshadow the translationese-based characteristics of the underlying texts. We address the challenge of discriminating O from T in a mixed-domain setup by proposing two simple methodologies (flat and two-phase) and empirically demonstrate their soundness.

The clustering experiments throughout the paper were conducted in a setup similar to that of supervised classification, determining the status (O vs. T) of logical units (chunks) of 2,000 tokens. We also show that clustering accuracy remains stable even when the number of available chunks decreases dramatically, and remains satisfactory when the chunk size is reduced.

The main contribution of this work is therefore two-fold: (i) we establish a robust approach for reliable unsupervised identification of translated texts, thereby eliminating the need for in-domain labeled data; and (ii) we provide an extensive empirical foundation for the dominance of domain-based properties over translationese-related characteristics of a text, and propose a methodology for identification of translationese in a mixed-domain scenario.

The remainder of the paper is structured as follows: after reviewing related work in Section 2, we detail our datasets, features and tools in Section 3. In Section 4 we reproduce and extend supervised classification results, and demonstrate the poor cross-domain classification accuracy of supervised methods. Our clustering methodology and experiments are described in Section 5; mixed-domain classification is discussed in Section 6. We conclude with a discussion and suggestions for future research.

2 Related Work

Much research in Translation Studies indicates that translated texts have unique characteristics. Translated texts (in any language) constitute a sub-language (sometimes referred to as a genre, or a dialect) of the target language, presumably reflecting both the artifacts of the translation process and traces of the original language from which the texts were translated (the source language). Gellerstam (1986) called this sub-language translationese, and suggested that the differences between O and T do not indicate poor translation but rather a statistical phenomenon, caused by a systematic influence of the source language on the target language.

These differences have ramifications for SMT. Kurokawa et al. (2009) were the first to note it: they showed that translation models trained on English-translated-to-French bitexts were much better than ones trained on French-translated-to-English, when the SMT task is translating English to French. Lembersky et al. (2012a, 2013) corroborated these results for more language pairs, and suggested a way to adapt translation models to the properties of translationese. Furthermore, Lembersky et al. (2011, 2012b) showed that language models compiled from translated texts are better for SMT than ones compiled from original texts. These results all highlight the practical importance of being able to reliably distinguish between translated and original texts.

Indeed, translated texts are so markedly different from original ones that automatic classification can identify them with very high accuracy (Baroni and Bernardini, 2006; Ilisei et al., 2010; Ilisei and Inkpen, 2011; Popescu, 2011). Recently, Volansky et al. (2015) investigated several translation studies hypotheses by performing an extensive exploration of the ability of various feature sets to distinguish between O and T. Using SVM classifiers and ten-fold cross-validation evaluation, they listed several features that yield near-perfect accuracy.

Most works mentioned above train and evaluate classifiers on texts drawn from the same corpus. When these classifiers are tested on texts from different domains, or in a different genre, or translated from a different language, classification accuracy dramatically deteriorates. Koppel and Ordan (2011) train classifiers on the Europarl corpus (Koehn, 2005), with English translated from five different languages. When the classifiers are evaluated on English translated from the same language they were trained on, accuracy is near 100%; but when evaluated on translations from a different language, accuracy drops significantly, in some cases below 60%.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
4
8
1
5
6
6
7
9
8

/

/
t

je

un
c
_
un
_
0
0
1
4
8
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

421

This pattern recurs when the test corpus is different from the training corpus (newspaper articles vs. parliament proceedings). Similarly, Avner et al. (Forthcoming) report excellent (near 100%) results identifying Hebrew translationese on a corpus of literary texts, using very simple word-level features. Evaluation on different domains (popular science) and on Hebrew translated from French, rather than English, however, shows much poorer results, with accuracies around 60% in many cases.

We hypothesize that the main reason for the deterioration in the accuracy of (supervised) translationese classifiers when evaluated out-of-domain stems from the fact that domain differences overshadow the differences between O and T. Diwersy et al. (2014) studied various sorts of linguistic variation by applying semi-supervised multivariate techniques. They investigated, among other factors, register variation in English and German originals and translations. By applying a series of supervised and unsupervised statistical analyses, they demonstrated that register-related properties are much better exhibited by the underlying texts than properties related to the documents' translation status. We address the challenge of mixed-domain classification in Section 6.

One way to overcome the dependence on labeled data and the domain-overfitting of supervised classifiers is to use unsupervised methods, in particular clustering. The only application of clustering to translationese that we are aware of is the work of Nisioi and Dinu (2013), who investigated translationese- and authorship-related characteristics by applying hierarchical clustering to books written by a Russian-English bilingual author. While they mainly focused on authorship attribution, Nisioi and Dinu (2013) also demonstrated that it is possible to discriminate O from T by applying clustering with lexical features (function words) extracted from complete books (25,000–180,000 tokens). We address the challenge of unsupervised identification of translationese using a different methodology and much smaller logical units (2,000 tokens), and further broaden the scope of our work by proposing a methodology for telling O from T in mixed-domain scenarios.

Unsupervised classification is a well-established discipline; in this work we use KMeans (Lloyd, 1982) for clustering and KMeans++ (Arthur and Vassilvitskii, 2007) as a KMeans initialization method.

3 Experimental Setup

3.1 Datasets

Our main dataset[2] consists of texts originally written in English and texts translated to English from French. We use various corpora: (i) Europarl, the proceedings of the European Parliament (Koehn, 2005), between the years 2001–2006; (ii) the Canadian Hansard, transcripts of the Canadian Parliament, spanning the years 2001–2009; (iii) literary classics written (or translated) mainly in the 19th century; and (iv) transcripts of TED and TEDx talks. This collection suggests diversity in genre, register, modality (written vs. spoken) and era. Table 1 details some statistical data on the corpora (after tokenization).[3] We now briefly describe each dataset.

[2] The dataset is available at http://cl.haifa.ac.il/projects/translationese.
[3] We use "EUR", "HAN", "LIT" and "TED" to denote the four corpora in the discussion below.

Europarl is probably the most popular parallel corpus in natural language processing, and it was indeed used for many of the translationese tasks surveyed in Section 2. This corpus has been used extensively in SMT (Koehn et al., 2009), and was even adapted specifically for research in translation studies: Islam and Mehler (2012) compiled a customized version of Europarl, where the direction of translation is indicated. We use a version of Europarl (Rabinovich and Wintner, Forthcoming) that aims to further increase the confidence in the direction of translation, through a comprehensive cross-lingual validation of the original language of the speakers.

The Hansard is a parallel corpus consisting of transcriptions of the Canadian parliament in English and French between 2001 and 2009. This is the largest available source of English–French sentence pairs. We use a version that is annotated with the original language of each parallel sentence. Relying on metadata available in the corpus, we filtered out all segments not referring to speech, i.e., retaining only sentences annotated as ContentParaText.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
4
8
1
5
6
6
7
9
8

/

/
t

je

un
c
_
un
_
0
0
1
4
8
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

422

Table 1: Corpus statistics

                  Number of sentences                  Number of tokens              Number of types
Corpus     Original E       F→E        Total      Original E          F→E       Original E       F→E
EUR           134,725     71,816     206,541       3,406,513     2,112,085         37,203      28,119
HAN         3,441,984    757,573   4,199,557      65,491,960    13,457,613        158,645      63,192
LIT            36,123     85,210     121,333         858,297     1,750,525         25,113      38,842
TED             7,551      4,827      12,378         129,334        87,214          9,667       7,441

The Literature corpus consists of literary classics written (and translated) in the 18th–20th centuries by English and French authors; the raw material is available from the Gutenberg project. We use subsets that were manually or automatically paragraph-aligned. Note that classifying literary texts is considered a more challenging task than classifying more "technical" translations, such as parliament proceedings, since translators of literature typically enjoy more literary freedom, thereby rendering the translation product more similar to original writing (Lynch and Vogel, 2012; Avner et al., Forthcoming).

Our TED talks corpus consists of talks originally given in English and talks translated to English from French. The quality of translations in this corpus is very high: not only are translators assumed to be competent, but the common practice is that each translation passes through a review before being published. This corpus consists of talks delivered orally, but we assume that they were meticulously prepared, so the language is not spontaneous but rather planned. Compared to the other sub-corpora, the TED dataset has some unique characteristics that stem from the following reasons: (i) its size is relatively small; (ii) it exhibits stylistic disparity between the original and translated texts (the former contains more "oral" markers of a spoken language, while the latter is a written translation); and finally (iii) TED talks are not transcribed but are rather subtitled, so they undergo some editing and rephrasing.[4] The vast majority of TED talks are publicly available online, which makes this corpus easily extendable for future research.

[4] http://translations.ted.org/wiki/How_to_Compress_Subtitles

3.2 Processing and Tools

All datasets are first tokenized using the Stanford tools (Manning et al., 2014) and then partitioned into chunks of approximately 2,000 tokens (ending on a sentence boundary). We assume that translationese-related features are present in the texts across author or speaker, thus we allow some chunks to contain linguistic information from two or more different texts simultaneously. For the main (single-corpus) classification experiments we use 2,000 text chunks each from Europarl and Hansard, 800 from Literature and 88 chunks from TED; each sub-corpus consists of an equal number of original and translated chunks. For every classification experiment we use the maximal equal number of chunks from each class, thus we always (randomly) down-sample the datasets in order to have a comparable number of training/testing examples for supervised classification, and comparable cluster sizes for clustering.
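For concreteness, the following is a minimal sketch of this chunking step. It is not the authors' code: it only assumes a corpus that has already been tokenized and sentence-split (as the Stanford tools provide), and the function name and the 2,000-token target are illustrative parameters taken from the description above.

```python
from typing import Iterable, List

def chunk_by_sentences(sentences: Iterable[List[str]], chunk_size: int = 2000) -> List[List[str]]:
    """Greedily group tokenized sentences into chunks of roughly `chunk_size`
    tokens, always closing a chunk on a sentence boundary."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        current.extend(sentence)
        if len(current) >= chunk_size:   # close the chunk only after a complete sentence
            chunks.append(current)
            current = []
    if current:                          # a final, smaller chunk may remain
        chunks.append(current)
    return chunks
```

Chunk sizes therefore vary slightly around 2,000 tokens, which is why the feature values described below are normalized by the actual chunk length.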

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
4
8
1
5
6
6
7
9
8

/

/
t

je

un
c
_
un
_
0
0
1
4
8
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

423

We use Weka (Hall et al., 2009) as the main tool for classification, clustering, and dimension reduction. In all the classification experiments, we use SVM (SMO) as the classification algorithm with the default linear kernel. For clustering we use Weka's KMeans implementation (SimpleKMeans) with the KMeans++ initialization strategy. We use Euclidean distance as the similarity measure for KMeans, and apply a custom clustering-evaluation-based wrapper (see Section 5) to further enhance Weka's basic clustering implementation.

We use Principal Component Analysis (PCA; Jolliffe, 2002) for dimension reduction. PCA is a statistical procedure that discovers variables with the largest possible variance, i.e., features that account for most variability in the data (principal components). It performs a linear mapping of the data to a lower-dimensional space in a way that maximizes the variance of the data in the low-dimensional representation, by removing highly correlated or superfluous variables. The outcome of PCA is a new set of features, each of which is a linear combination of the discovered components. The number of the newly generated variables varies from one to the number of variables originally used to describe the data, and is typically controlled by a parameter.

Apart from the enhanced efficiency (due to the reduced computational costs), dimensionality reduction often carries a positive effect on the accuracy of the underlying classification task, especially when the data are meager or feature vectors are sparse. The (accuracy-wise) optimization gains of PCA, when followed by the KMeans clustering algorithm, were reported by Ng et al. (2001). We perform dimension reduction using the Weka implementation of PCA, with the "variance covered" parameter set to 0.1 across all feature types and datasets, prior to applying a clustering procedure.

3.3 Features

We focus on a set of features that reflect lexical and structural properties of the text, and have been shown to be effective for supervised classification of translationese (Volansky et al., 2015). Specifically, we use function words (FW), more precisely, the same list that was used in previous works on classification of translationese (Koppel and Ordan, 2011; Volansky et al., 2015). Feature values are raw counts (further denoted by term frequency, tf), normalized by the number of tokens in the chunk; the chunk size may slightly vary, since the chunks respect sentence boundaries. For the clustering experiments we further scale the normalized tf by the inverse document frequency (idf), which offsets the importance of a term by a factor proportional to its frequency in the corpus. The tf-idf statistic has been shown to be effective with lexical features, and is often used as a weighting factor in information retrieval and text mining. While function words are assumed to be very frequent, their counts within a text vary greatly (e.g., "the" vs. "whereas"). We therefore opt for tf-idf weighting of FW across all sub-corpora.

In addition to function words, we experiment with several other feature sets, including character trigrams, part-of-speech (POS) trigrams, contextual function words and cohesive markers. Contextual function words are a variation of POS trigrams where a trigram can be anchored by specific function words: these are consecutive triplets ⟨w1, w2, w3⟩ where at least two of the elements are function words, and at most one is a POS tag. Cohesive markers are words or phrases that signal the underlying flow of thought: they organize a composition of phrases by specifying the type, purpose or direction of upcoming ideas, and can therefore serve as evidence of the translation process. We use the list of 40 cohesive markers defined in Volansky et al. (2015).

Character, POS, and contextual FW trigrams are calculated as detailed in Volansky et al. (2015), but we only consider the 1,000 most frequent feature values extracted from each dataset (or combination of datasets) being classified. This subset yields the same classification quality as the full set, reducing computational complexity.
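A minimal sketch of the function-word weighting described above follows. It is illustrative only: the actual function-word list of Koppel and Ordan (2011) and Volansky et al. (2015) is not reproduced here (FUNCTION_WORDS is a tiny stand-in), and the exact idf formulation is an assumption rather than the authors' Weka setup.

```python
import numpy as np
from collections import Counter

# Stand-in for the function-word list used in prior work; not the actual list.
FUNCTION_WORDS = ["the", "of", "and", "to", "whereas"]

def fw_tfidf(chunks):
    """Build a (chunks x function words) matrix of chunk-length-normalized
    term frequencies, scaled by inverse document frequency."""
    vocab = {w: j for j, w in enumerate(FUNCTION_WORDS)}
    tf = np.zeros((len(chunks), len(vocab)))
    for i, chunk in enumerate(chunks):                 # chunk: list of tokens
        counts = Counter(token.lower() for token in chunk)
        for w, j in vocab.items():
            tf[i, j] = counts[w] / len(chunk)          # normalize by chunk size
    df = np.maximum((tf > 0).sum(axis=0), 1)           # document frequency per word
    idf = np.log(len(chunks) / df)                     # one common idf variant
    return tf * idf
```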

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
4
8
1
5
6
6
7
9
8

/

/
t

je

un
c
_
un
_
0
0
1
4
8
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

424

4 Supervised Classification

We begin with supervised classification, re-establishing the high accuracy of in-domain (supervised) classification of translationese, but highlighting the deterioration in accuracy when cross-domain classification is considered. We first reproduce the Europarl classification results with the best performing feature sets, as reported by Volansky et al. (2015), and present results for three additional sub-corpora: Hansard, Literature and TED. Table 2 lists the ten-fold cross-validation classification accuracy with various features. All features (except perhaps cohesive markers) yield excellent accuracy.

Table 2: In-domain (cross-validation) classification accuracy using various feature sets

feature / corpus      EUR     HAN     LIT     TED
FW                   96.3    98.1    97.3    97.7
char-trigrams        98.8    97.1    99.5   100.0
POS-trigrams         98.5    97.2    98.7    92.0
contextual FW        95.2    96.8    94.1    86.3
cohesive markers     83.6    86.9    78.6    81.8

A few previous works suggested that cross-domain classification of translationese results in low accuracy (Koppel and Ordan, 2011; Avner et al., Forthcoming). Our experiments corroborate this observation; Table 3 depicts the cross-domain classification accuracy on the Europarl, Hansard and Literature corpora, when training on one corpus and testing on another (using function words).[5] A balanced setup for this experiment was generated by randomly selecting 800 chunks from each corpus, divided equally between O and T. The results only slightly outperform chance level, even for the Europarl–Hansard seemingly domain-related pair: we obtain 59.7% to 60.8% accuracy in the two directions.

[5] We focus mainly on function words, because they are known to reflect stylistic differences rather than contents or specific corpus features, and are therefore less susceptible to domain overfitting. Other feature sets yielded similar results.

Table 3: Pairwise cross-domain classification using function words

train / test    EUR     HAN     LIT     10-fold x-validation
EUR               –    60.8    56.2     94.7
HAN            59.7      –     58.7     98.1
LIT            64.3    61.5      –      97.3

Attempting to enrich the classifier's training "experience" we conducted additional experiments, where we train on two sub-corpora out of Europarl, Hansard and Literature, and test on the remaining one. The results are depicted in Table 4. Here, too, accuracy is very low, implying that training on diverse data does not necessarily provide a solution for cross-domain classification of translationese. The right-hand column of the table reports ten-fold cross-validation results of the two sub-corpora that are subject for training. Excellent in-domain classification results on the one hand and poor cross-domain predictive performance on the other imply that the model describing the relation in a certain domain is inapplicable to a different (even seemingly similar) domain, due to significant differences in the distribution of the underlying data.

Table 4: Leave-one-out cross-domain classification using function words

train / test    EUR     HAN     LIT     10-fold x-validation
EUR+HAN           –       –    63.8     94.0
EUR+LIT           –    64.1      –      92.9
HAN+LIT        59.8      –       –      96.0

Reflecting the poor generalization capability of translationese features, these results call for developing other methodologies for reliably discriminating O from T, specifically, methodologies that are independent of in-domain labeled data.

5 Clustering

5.1 Initial results

To overcome the domain-dependence of supervised classification, we experiment in this section with unsupervised methods. We begin with the KMeans clustering algorithm, using the KMeans++ initialization policy and dimension reduction. To evaluate the accuracy of the algorithms, each cluster is labeled by the majority of (O or T) instances it includes (using ground truth annotations), and the overall precision is the percentage of instances correctly assigned to their respective clusters (we discuss unsupervised cluster labeling in Section 5.2).

The KMeans clustering algorithm (with any initialization policy) is sensitive to the initial settings of its parameters, in particular the initial choice of centroids. A cluster centroid is the geometrical center of all observations within the cluster. The result of the KMeans algorithm may significantly vary according to its first step: the initial assignment of (random) points to cluster centroids. We address this potential pitfall by performing N clustering iterations, randomly varying the initial parameter settings, and outputting the outcome that exhibits the highest similarity of points within a cluster.

Formally, let C_i^j denote cluster i in iteration j, and let m_i^j denote this cluster's centroid, so that i ∈ [1,2] and j ∈ [1..N]. Sum-of-Square-Error (SSE) is an intrinsic clustering evaluation metric that measures the similarity of elements in a cluster. The SSE of C_i^j is defined by

    SSE_i^j = \sum_{x \in C_i^j} (x - m_i^j)^2

We aim to optimize the clustering result by choosing an outcome that minimizes the accumulative SSE:

    \arg\min_j SSE^j = \arg\min_{j \in [1..N]} \sum_{i \in [1,2]} SSE_i^j

The selected clustering outcome represents the result of a single clustering experiment.
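This selection procedure can be sketched as follows; scikit-learn's KMeans is used here as a stand-in for Weka's SimpleKMeans, and its inertia_ attribute is exactly the accumulated SSE above. Function and parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def best_of_n_runs(X, n_runs: int = 5, k: int = 2, seed: int = 0):
    """Run KMeans (k-means++ initialization) n_runs times from different
    random starts and keep the partition with the smallest accumulated SSE."""
    best_labels, best_sse = None, np.inf
    for j in range(n_runs):
        km = KMeans(n_clusters=k, init="k-means++", n_init=1, random_state=seed + j)
        labels = km.fit_predict(X)        # X: chunks x features matrix (e.g., FW tf-idf)
        if km.inertia_ < best_sse:        # inertia_ = sum of squared distances to centroids
            best_sse, best_labels = km.inertia_, labels
    return best_labels, best_sse
```

Note that scikit-learn's own n_init parameter performs the same best-of-N selection internally; the explicit loop is kept only to mirror the description in the text.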

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
4
8
1
5
6
6
7
9
8

/

/
t

je

un
c
_
un
_
0
0
1
4
8
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

425

The described method for selecting a clustering outcome can be viewed as a binary version of the Bisecting KMeans algorithm; it is applied in all experiments throughout the paper, with the number of iterations (N) fixed to 5, following the recommendation of Steinbach et al. (2000, p. 13).

We conducted a series of experiments with various feature sets; the main results are depicted in Table 5. The reported numbers reflect the average accuracy over 30 experiments (the only difference being a random choice of the initial conditions).[6]

[6] Standard deviation in most experiments was close to 0.

Table 5: Clustering results using various feature sets

feature / corpus      EUR     HAN     LIT     TED
FW                   88.6    88.9    78.8    87.5
char-trigrams        72.1    63.8    70.3    78.6
POS-trigrams         96.9    76.0    70.7    76.1
contextual FW        92.9    93.2    68.2    67.0
cohesive markers     63.1    81.2    67.1    63.0

First and foremost, the results are very good, ranging from a few percent points lower than supervised classification (Table 2, Europarl and Hansard) to approximately 25 percent points lower in a few cases (e.g., Literature). Function words systematically yield very high accuracy; the quality of clustering with other features varies across the sub-corpora. Cohesive markers perform poorly (with a single exception, Hansard), which mirrors the moderate supervised classification precision achieved with the same feature set.

The exceptionally high result of Europarl with POS-trigrams can be attributed to the excessive frequency of specific phrases in the translated Europarl texts (in contrast to their original counterparts).[7] We explain the lower precision achieved on the Literature corpus by its diverse character: it comprises works attributed to a variety of authors, periods and genres, which is challenging for the unsupervised algorithm (see Section 6). A notably high accuracy is obtained on the small TED corpus, which implies the applicability of our clustering methodology to data-meager scenarios.

[7] As an example (and in line with van Halteren (2008)), in the 2,000 Europarl chunks, the phrase "ladies and gentlemen" appears 1,258 times in T, but only 12 times in O.

We conducted an additional set of experiments with unequal proportions of original and translated texts, considering twice the number of O chunks compared to T and vice versa. The average clustering accuracy using FW is similar to that obtained in the balanced setup (Table 5): 87.5% on Europarl, 88.9% on Hansard, 73.2% on Literature, and 88.6% on the TED sub-corpus.

5.2 Cluster labeling

As is always the case with unsupervised methods, clustering can divide observations into classes but cannot label those classes. A cluster labeling algorithm examines the contents of each cluster in order to find labels that best summarize its members, and distinguish the clusters from each other.

In the context of translationese identification, the task of cluster labeling is to determine which of the produced clusters represents O, and which T. We address this challenge by exploring similarities between the language models of the obtained clusters, and language models of (presumably) prototypical O and T samples. A simple unigram language model assigns each word a probability proportional to its frequency in the underlying text; we use smoothed term frequencies scaled by the inverse total term frequencies. We then compare language models to reveal similarities between the prototypical O and T samples and the chunk sets produced by clustering.

The construction method of the prototypical LMs is motivated by (i) abstracting from content, by utilizing only function words for this purpose; and (ii) attempting to avoid the interference of domain-related properties, by considering only (presumably) universal markers: words that share similar frequency patterns in several datasets with respect to O vs. T.

Let Om (O-markers) denote a set of function words that tend to be associated with O. We select this set by picking words whose frequency in O is excessive, compared to T; more precisely, the ratio of their frequency in O and T is above (1+δ), where δ = 0.05. Similarly, Tm (T-markers) is a set of words with O-to-T frequency ratio below (1−δ). We create a prototypical O example by the concatenation of Om, and a prototypical T example by the concatenation of Tm. The language model of these examples is then constructed by the ε-smoothed likelihood of each term in the markers vocabulary V = Om ∪ Tm, where ε = 0.001. Formally, for w ∈ V,

    p(w | Om) = (tf(w) + ε) / (|Om| + ε × |V|)
    p(w | Tm) = (tf(w) + ε) / (|Tm| + ε × |V|)

We denote the resulting language models by P_O and P_T, respectively. Given two clusters, C1 and C2, we similarly compute their language models, denoted by P_{C1} and P_{C2}, respectively, over the vocabulary V. We measure the similarity between a class X (either O or T) and a cluster C_i using the Jensen-Shannon divergence (JSD) (Lin, 1991) on the respective probability distributions. Specifically, we define the distance between the language models as the square root of the divergence value, which is a metric, often referred to as the Jensen-Shannon distance (Endres and Schindelin, 2003):

    D_JS(X, C_i) = \sqrt{JSD(P_X \| P_{C_i})}

The assignment of the label X to the cluster C1 is then supported by both C1's proximity to the class X and C2's proximity to the other class:

    label(C1) = "O"  if D_JS(O, C1) × D_JS(T, C2) < α × D_JS(O, C2) × D_JS(T, C1)
                "T"  otherwise

C2 is assigned the complementary label. The value of α is fixed to 1 in this equation, but we note that it can be varied for further investigation of the relatedness of the underlying language models.

We apply the cluster labeling technique described above to determine the labels of generated clusters. We construct prototypical O- and T-texts by selecting O- and T-markers from a random sample of Europarl and Hansard texts, using 600 chunks from each corpus.[8] We then compare the language models induced by these samples to those of the generated clusters (tested on different chunks, of course) to determine the cluster labels; the predicted labels are then verified against the majority-driven labeling, based on ground truth annotations. We apply this procedure to the outcome of all clustering experiments (per domain, using various features), achieving an overall precision of 100%. In other words, the labeling procedure yields perfect accuracy not only on Europarl and Hansard texts that were not used for generation of the O and T prototypical examples, but also on the unseen Literature and TED datasets. We conclude that it is possible, in general, to determine the labels of clusters produced by our clustering algorithm with perfect accuracy.

[8] This subset of the Europarl and Hansard corpora was used for one-time generation of the prototypical O and T language models, and excluded from further use.
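A minimal sketch of this labeling step is given below. scipy's jensenshannon (which already returns the square root of the divergence, i.e., the Jensen-Shannon distance) stands in for an explicit JSD implementation; how the marker sets are materialized and how cluster texts are normalized are assumptions rather than the authors' exact procedure.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def smoothed_lm(tokens, vocab, eps=0.001):
    """epsilon-smoothed unigram distribution over the marker vocabulary V."""
    counts = np.array([tokens.count(w) for w in vocab], dtype=float)
    return (counts + eps) / (len(tokens) + eps * len(vocab))

def label_clusters(cluster1_tokens, cluster2_tokens, o_markers, t_markers, alpha=1.0):
    """Return ('O', 'T') or ('T', 'O') for (cluster1, cluster2) by comparing
    their language models with the prototypical O/T language models."""
    vocab = sorted(set(o_markers) | set(t_markers))     # V = Om U Tm
    p_o  = smoothed_lm(list(o_markers), vocab)
    p_t  = smoothed_lm(list(t_markers), vocab)
    p_c1 = smoothed_lm(cluster1_tokens, vocab)
    p_c2 = smoothed_lm(cluster2_tokens, vocab)
    d = jensenshannon                                   # Jensen-Shannon distance
    if d(p_o, p_c1) * d(p_t, p_c2) < alpha * d(p_o, p_c2) * d(p_t, p_c1):
        return "O", "T"
    return "T", "O"
```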

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
1
4
8
1
5
6
6
7
9
8

/

/
t

je

un
c
_
un
_
0
0
1
4
8
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

426

5.3 Clustering consensus among feature sets

Since different feature sets have different predictions on our data, we hypothesize that consensus voting can improve the accuracy of clustering. We treat each individual clustering result (based on a certain feature set) as a judge, voting whether a single text chunk belongs to O or to T. We use the cluster labeling method of Section 5.2 to determine labels. The final assignment of a label to a cluster is determined by the majority vote of the various judges.

Table 6 presents the results of these experiments. We compare consensus results to the accuracy achieved by function words, the best-performing single feature set (on average); see Table 5. Both three judges and five judges yield a consistent increase in accuracy. Five judges systematically (and, on Europarl and Hansard, significantly) outperform the result of clustering with function words only. This indicates that various features tend to capture different aspects of translationese, which are eventually leveraged by the "fusion" of different clustering results into a single, higher-quality outcome.

Table 6: Clustering consensus by voting; statistically significant improvements, compared to using FW only, are marked with '*'

method / corpus                                                       EUR      HAN      LIT     TED
FW                                                                   88.6     88.9     78.8    87.5
FW, char-trigrams, POS-trigrams                                      91.1*    86.2     78.2    90.9*
FW, POS-trigrams, contextual FW                                      95.8*    89.8     72.3    86.3
FW, char-trigrams, POS-trigrams, contextual FW, cohesive markers     94.1*    91.0*    79.2    88.6

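A sketch of the voting scheme follows, assuming each "judge" has already been reduced to a per-chunk O/T prediction via the labeling method of Section 5.2; all names are illustrative.

```python
from collections import Counter
from typing import List, Sequence

def consensus_vote(judges: Sequence[List[str]]) -> List[str]:
    """Majority vote over several clustering 'judges'.
    judges: one list of per-chunk 'O'/'T' predictions per feature set,
    all aligned on the same chunks; an odd number of judges avoids ties."""
    n_chunks = len(judges[0])
    return [Counter(judge[i] for judge in judges).most_common(1)[0][0]
            for i in range(n_chunks)]

# e.g., the three-judge setting of Table 6:
# voted = consensus_vote([labels_fw, labels_char_trigrams, labels_pos_trigrams])
```
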
5.4 Sensitivity analysis

In supervised classification, the amount of labeled data has a critical effect on the classification accuracy. This does not seem to be the case with clustering: accuracy remains stable when the number of chunks used for classification decreases (Figure 1a). Evidently, as few as 300 chunks are sufficient for excellent classification.[9] We attribute the (slight) fluctuations in the graph to the random choice of the subset of chunks that are subject for clustering. Naturally, clustering accuracy stabilizes when the number of chunks increases, since the effect of random noise diminishes with more data. This result is of clear practical importance, as in real-life situations only a limited amount of data may be available.

[9] The results on the Literature corpus are limited by the amount of available data in this dataset.

The accuracy of supervised classification deteriorates when the size of the underlying logical units (here, chunks) decreases (Kurokawa et al., 2009). We corroborate this observation in the context of clustering, but note that reasonable accuracy (over 70%) can be obtained even with 1,000-token chunks (Figure 1b). This further supports the applicability of unsupervised classification of translationese to real-world scenarios.

[Figure 1: The effect of varying (a) the number of chunks and (b) the chunk size (in tokens) on clustering accuracy (%), for EUR, HAN and LIT.]

6 Mixed-domain classification

Poor cross-domain classification results, as described in Section 4, demonstrate that the in-domain discriminative features of translated texts cannot be easily generalized to other, even related, domains. In this section we explore the tension between the discriminative power of domain- and translationese-related properties in the unsupervised scenario. Our underlying hypothesis is that domain-specific features overshadow the features of translationese. The next series of experiments involves a (balanced) combination of various datasets; we excluded the small TED corpus from these experiments to prevent downsampling of the other sub-corpora.

6.1 Domain-related vs. translationese-based characteristics

We begin with an investigation of the mutual effect of the domain- and translationese-specific characteristics on the accuracy of clustering. We first merged equal numbers of O and T chunks from two corpora: 800 chunks each from Europarl and Hansard, yielding 1,600 chunks, half of them O and half T. We applied the clustering algorithm of Section 5 to this dataset; the result was a perfect domain-driven separation of all Europarl and Hansard chunks, yielding poor (chance-level) translationese accuracy. In other words, we obtained two clusters, one consisting of Europarl chunks and the other of Hansard chunks, independently of their O-vs.-T status. We repeated the experiment with additional corpus pairs, and further extended it by adding equal numbers of Literature chunks (400 O and 400 T), this time fixing the number of clusters to three. Again, the result was separation by domain: Europarl, Hansard and Literature chunks were grouped into distinct clusters (Table 7, top).

As an additional experiment, we attempted to leave the decision on the "best" number of clusters to the algorithm. To that end, we employed the XMeans clustering procedure (Pelleg and Moore, 2000), which uses KMeans but applies additional statistical cues to decide on the number of clusters that best explain the data. We also applied PCA for dimension reduction prior to the XMeans invocation. We repeated both experiments (two- and three-domain mixes) with XMeans, expecting to obtain two and three clusters, respectively. The result is a replication of the more constrained KMeans in three out of four cases (Table 7, bottom).

Table 7: Clustering a chunk-level mix of Europarl, Hansard and Literature using function words; accuracy by translation status (O vs. T) is reported where applicable (i.e., the outcome constitutes two clusters)

method / corpus                      EUR+HAN   EUR+LIT   HAN+LIT   EUR+HAN+LIT
KMeans
  accuracy by domain                    93.7      99.5      99.8      92.2
  accuracy by translation status        50.3      50.0      50.0        –
XMeans
  generated # of clusters                  2         2         3         3
  accuracy by domain                    93.6      99.5      99.9      92.2
  accuracy by translation status        50.3      50.0        –         –

These observations have a crucial effect on understanding the tension between the domain- and translationese-based characteristics of the underlying texts. Not only are domains accurately separated given a fixed number of clusters, but even when the decision on the number of clusters is left to the clustering procedure, classification into domains explains the data best (as shown by XMeans). Recall that these experiments all rely on the set of function words: topic-independent features that have been proven effective for telling O from T in both supervised (Section 2) and unsupervised scenarios (Section 5). The fact that this translationese-oriented feature set yields the results presented in Table 7 clearly demonstrates the dominance of domain-specific properties over the characteristics of translationese.[10]

[10] Other feature sets yielded similar outcomes.

6.2 Clustering in a mixed-domain setup

Driven by the results of Section 6.1, we turn to explore a methodology for identification of translationese in a mixed-domain setup. We assume that we are given a set of text chunks that come from multiple domains, such that some chunks are O and some are T; the task is to classify the texts into O vs. T, independently of their domain. For that purpose, we investigate two approaches: two-phase and flat. Both methods assume that the number of domains, k, is known (it can be discovered by XMeans, as in Section 6.1, or fixed to a somewhat higher value than estimated in order to capture unsuspected differences within domains). The two-phase method first clusters a mixture of texts into domains (e.g., using KMeans), and then separates each of the resulting (presumably, domain-coherent) clusters into two sub-clusters, presumably O and T. The flat approach applies KMeans, attempting to divide the dataset into 2×k clusters; that is, we expect classification by domains and by translationese status simultaneously.

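The two-phase procedure can be sketched as follows, with scikit-learn again standing in for Weka; the PCA variance threshold and all names are illustrative assumptions, and PCA is re-applied to the raw instances of each intermediate cluster, as described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def two_phase(X: np.ndarray, k_domains: int, seed: int = 0):
    """Phase 1: cluster chunks into k presumed domains.
    Phase 2: split each domain cluster into two sub-clusters (presumably O and T).
    Returns per-chunk (domain id, sub-cluster id)."""
    reduce_dim = lambda M: PCA(n_components=0.95, random_state=seed).fit_transform(M)
    domains = KMeans(n_clusters=k_domains, random_state=seed).fit_predict(reduce_dim(X))
    sub = np.zeros(len(X), dtype=int)
    for d in range(k_domains):
        idx = np.where(domains == d)[0]
        # PCA is applied anew to the raw instances of this domain cluster
        sub[idx] = KMeans(n_clusters=2, random_state=seed).fit_predict(reduce_dim(X[idx]))
    return domains, sub
```
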
We experimented with two setups: (i) a mixture of two datasets out of Europarl, Hansard and Literature (1,600 chunks in total); and (ii) a mixture of all three of them (2,400 chunks in total). We applied both methods to each of the two setups. We invoked PCA prior to clustering in the flat approach; in the two-phase approach, we applied PCA on the raw data instances that are subject to clustering at each hierarchy level.[11] As our goal is identification of translationese, we define the accuracy of the classification as the ratio of O and T instances classified correctly (i.e., we ignore the accuracy of identifying the correct domain). Table 8 reports the results.

[11] Note that our two-phase approach differs from traditional hierarchical clustering in this sense.

Table 8: Flat and two-phase clustering of a domain mix using function words

method / corpus    EUR+HAN   EUR+LIT   HAN+LIT   EUR+HAN+LIT
Flat                  92.5      60.7      77.5      66.8
Two-phase             91.3      79.4      85.3      67.5

Both methods yield similarly high accuracy in the Europarl+Hansard setup, and much lower accuracy in the setup of all three datasets (with a single exception of EUR+LIT). This implies that the difficulty of telling O from T increases as the number of domains in the mixed-domain setup grows. The two-phase approach outperforms the flat one in most cases: the latter attempts to cluster data instances by domain and translation status simultaneously, and is therefore potentially more error-prone. As a concrete example, in the Europarl+Literature setup, attempting to produce four clusters, we obtained a single cluster of Europarl chunks and three clusters of Literature chunks. The two-phase approach avoids such pitfalls by explicitly separating the steps of domain- and translationese-based clustering.

Table 8 clearly demonstrates that in a real-world scenario, where a dataset can be assumed to include texts from multiple domains, it is possible to overcome the dominance of domain-related features over translationese-related ones by splitting the task into two. The result is highly accurate identification of translated texts, even in an extremely challenging setup. Compare the results of Table 8 to the supervised case (Tables 3, 4): while clustering cannot compete with ten-fold cross-validation results of heterogeneous datasets (93–96%), it is far superior to training a classifier on one or more datasets and then using it on data from a new source (60–64%).

7 Discussion

Distinguishing between original and translated texts has been proven useful for SMT, as awareness of translationese can improve the quality of SMT systems. So far, classifying texts into original vs. translated has been done almost exclusively by supervised methods. In this work we advocate the use of unsupervised classification as an effective way to address this task. We demonstrate that simple feature sets, coupled with standard clustering algorithms, a novel cluster labeling technique, and voting among several features, can yield very high accuracy, over 90% in several cases. Using diverse datasets we robustly demonstrate that the approach we advocate is effective for identification of translationese, even when only little data are available and text chunks are small. We further highlight the dominance of domain-based characteristics of the texts over their translationese-related properties and propose a simple methodology for identification of translationese in a mixed-domain setup. We conclude that the proposed (two-phase) clustering approach is a robust method for distinguishing O from T in heterogeneous datasets.

By conducting a series of experiments with unbalanced proportions of O and T texts, we demonstrate that the proposed methodology is also applicable to scenarios where the original and translated data are unevenly distributed.

We applied PCA for dimension reduction and the tf-idf weighting scheme with FW throughout all experiments in this work. The latter had a slight positive effect on clustering accuracy in most scenarios, and no impact in some cases. Dimension reduction improved computational efficiency, especially with large feature sets (e.g., character and POS trigrams). However, its effect on clustering accuracy was not uniform: the most prominent improvement (over 15 percent points) was obtained on the TED dataset, while a slight accuracy deterioration was observed in a few cases (e.g., 5 percent points on Europarl with FW). We conclude that while carrying an overall positive value, the application of dimension reduction in similar scenarios calls for further investigation.

8 Conclusion

To the best of our knowledge, this is the first work to extensively explore unsupervised classification of translationese. We have only scratched the surface of this research direction. In the future, we intend to explore the robustness of our approach even further, with more datasets in various language pairs. We will first attempt to identify translationese in French, using the current dataset (in the reverse direction). We will also experiment with English-German, in both directions, and hopefully also with English-Hebrew, a more challenging setup.

The potential value of unsupervised identification of translationese leaves much room for further exploratory activities. Our future plans include using various datasets and reduced amounts of data for the LMs compiled for cluster labeling; in particular, we plan to explore the correlation between these two parameters and the scaling factor α used for association of a label with a clustering outcome.

Furthermore, to highlight the contribution of these results to SMT, we plan to replicate the results of Lembersky et al. (2012b, 2013), using predicted rather than ground-truth indication of the translationese status of the texts that are used to train SMT systems. We believe that we will be able to show an improvement in the quality of SMT with extremely little supervision.

Acknowledgements

This research was supported by a grant from the Israeli Ministry of Science and Technology. We are grateful to Cyril Goutte, George Foster and Pierre Isabelle for providing us with an annotated version of the Hansard corpus; and to András Farkas[12] and François Yvon for providing us with the Literary corpus. Finally, we are indebted to Noam Ordan, Tamir Hazan, Haggai Roitman and Ekaterina Lapshinova-Koltunski for commenting on earlier versions of this article. All remaining errors and misconceptions are, of course, our own.

[12] http://farkastranslations.com

References

David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

Ehud Alexander Avner, Noam Ordan, and Shuly Wintner. Identifying translationese at the word and sub-word level. Digital Scholarship in the Humanities, Forthcoming.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274, September 2006.

Sascha Diwersy, Stefan Evert, and Stella Neumann. A weakly supervised multivariate approach to the study of language variation. In Benedikt Szmrecsanyi and Bernhard Wälchli, editors, Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech, pages 174–204. De Gruyter, Berlin, Boston, 2014.

Dominik Maria Endres and Johannes E. Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858–1860, 2003.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup, Lund, 1986.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.

Iustina Ilisei and Diana Inkpen. Translationese traits in Romanian newspapers: A machine learning approach. International Journal of Computational Linguistics and Applications, 2(1-2), 2011.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer, 2010.

Zahurul Islam and Alexander Mehler. Customization of the Europarl corpus for translation studies. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), May 2012.

Ian T. Jolliffe. Principal Component Analysis. Springer Verlag, 2nd edition, 2002.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86. AAMT, 2005.

Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 462 machine translation systems for Europe. In Proceedings of the Twelfth Machine Translation Summit, pages 65–72, 2009.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318–1326, Portland, Oregon, USA, June 2011.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88, 2009.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363–374, Edinburgh, Scotland, UK, July 2011.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 2012a.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. Computational Linguistics, 38(4):799–825, December 2012b.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Improving statistical machine translation by adapting translation models to translationese. Computational Linguistics, 39(4):999–1023, December 2013.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991.

Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982.

Gerard Lynch and Carl Vogel. Towards the automatic detection of the source language of a literary translation. In Proceedings of COLING 2012, the 24th International Conference on Computational Linguistics: Posters, pages 775–784, 2012.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland, June 2014.

Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 849–856. MIT Press, 2001.

Sergiu Nisioi and Liviu P. Dinu. A clustering approach for translationese identification. In Galia Angelova, Kalina Bontcheva, and Ruslan Mitkov, editors, Recent Advances in Natural Language Processing, RANLP 2013, pages 532–538. RANLP 2011 Organising Committee / ACL, September 2013.

Dan Pelleg and Andrew W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 727–734, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

Marius Popescu. Studying translationese at the character level. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, Proceedings of RANLP-2011, pages 634–639, 2011.

Ella Rabinovich and Shuly Wintner. The Haifa corpus of translationese. Unpublished manuscript, Forthcoming.

Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. In KDD-2000 Workshop on Text Mining, August 2000.

Hans van Halteren. Source language markers in EUROPARL translations. In Donia Scott and Hans Uszkoreit, editors, COLING 2008, 22nd International Conference on Computational Linguistics, Proceedings of the Conference, 18-22 August 2008, Manchester, UK, pages 937–944, 2008.

Vered Volansky, Noam Ordan, and Shuly Wintner. On the features of translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015.