Transactions of the Association for Computational Linguistics, 1 (2013) 165–178. Action Editor: David Chiang.
Submitted 11/2012; Revised 3/2013; Published 5/2013. c
(cid:13)
2013 Association for Computational Linguistics.
Learningtotranslatewithproductsofnovices:asuiteofopen-endedchallengeproblemsforteachingMTAdamLopez1,MattPost1,ChrisCallison-Burch1,2,JonathanWeese,JuriGanitkevitch,NargesAhmidi,OliviaBuzek,LeahHanson,BeenishJamil,MatthiasLee,Ya-TingLin,HenryPao,FatimaRivera,LeiliShahriyari,DebuSinha,AdamTeichert,StephenWampler,MichaelWeinberger,DaguangXu,LinYang,andShangZhao∗DepartmentofComputerScience,JohnsHopkinsUniversity1HumanLanguageTechnologyCenterofExcellence,JohnsHopkinsUniversity2ComputerandInformationScienceDepartment,UniversityofPennsylvaniaAbstractMachinetranslation(MT)drawsfromseveraldifferentdisciplines,makingitacomplexsub-jecttoteach.Thereareexcellentpedagogicaltexts,butproblemsinMTandcurrentalgo-rithmsforsolvingthemarebestlearnedbydoing.AsacenterpieceofourMTcourse,wedevisedaseriesofopen-endedchallengesforstudentsinwhichthegoalwastoim-proveperformanceoncarefullyconstrainedinstancesoffourkeyMTtasks:alignment,decoding,evaluation,andreranking.Studentsbroughtadiversesetoftechniquestotheprob-lems,includingsomenovelsolutionswhichperformedremarkablywell.Asurprisingandexcitingoutcomewasthatstudentsolutionsortheircombinationsfaredcompetitivelyonsometasks,demonstratingthatevennewcom-erstothefieldcanhelpimprovethestate-of-the-artonhardNLPproblemswhilesimulta-neouslylearningagreatdeal.Theproblems,baselinecode,andresultsarefreelyavailable.1IntroductionAdecadeago,studentsinterestedinnaturallan-guageprocessingarrivedatuniversitieshavingbeenexposedtotheideaofmachinetranslation(MT)primarilythroughsciencefiction.Today,incomingstudentshavebeenexposedtoserviceslikeGoogleTranslatesincetheywereinsecondaryschoolorear-lier.Forthem,MTissciencefact.SoitmakessensetoteachstatisticalMT,eitheronitsownorasaunit∗Thefirstfiveauthorswereinstructorsandtheremainingau-thorswerestudentsintheworkeddescribedhere.ThisresearchwasconductedwhileChrisCallison-BurchwasatJohnsHop-kinsUniversity.inaclassonnaturallanguageprocessing(NLP),ma-chinelearning(ML),orartificialintelligence(AI).AcoursethatpromisestoshowstudentshowGoogleTranslateworksandteachthemhowtobuildsome-thinglikeitisespeciallyappealing,andseveraluni-versitiesandsummerschoolsnowoffersuchclasses.Thereareexcellentintroductorytexts—dependingonthelevelofdetailrequired,instructorscanchoosefromacomprehensiveMTtextbook(Koehn,2010),achapterofapopularNLPtextbook(JurafskyandMartin,2009),atutorialsurvey(Lopez,2008),oranintuitivetutorialontheIBMModels(Knight,1999b),amongmanyothers.ButMTisnotjustanobjectofacademicstudy.It’sarealapplicationthatisn’tfullyperfected,andthebestwaytolearnaboutitistobuildanMTsys-tem.Thiscanbedonewithopen-sourcetoolkitssuchasMoses(Koehnetal.,2007),cdec(Dyeretal.,2010),orJoshua(Ganitkevitchetal.,2012),butthesesystemsarenotdesignedforpedagogy.Theyarematurecodebasesfeaturingtensofthousandsofsourcecodelines,makingitdifficulttofocusontheircorealgorithms.Mosttutorialspresentthemasblackboxes.ButourgoalisforstudentstolearnthekeytechniquesinMT,andideallytolearnbydoing.Blackboxesareincompatiblewiththisgoal.Wesolvethisdilemmabypresentingstudentswithconcise,fully-functioning,self-containedcom-ponentsofastatisticalMTsystem:wordalignment,decoding,evaluation,andreranking.Eachimple-mentationconsistsofana¨ıvebaselinealgorithminlessthan150linesofPythoncode.Weassignthemtostudentsasopen-endedchallengesinwhichthegoalistoimproveperformanceonobjectiveeval-uationmetricsasmuchaspossible.ThissettingmirrorsevaluationsconductedbytheNLPresearch
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
2
1
8
1
5
6
6
6
4
5
/
/
t
je
un
c
_
un
_
0
0
2
1
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
166
communityandbytheengineeringteamsbehindhigh-profileNLPprojectssuchasGoogleTranslateandIBM’sWatson.Whilewedesignatespecifical-gorithmsasbenchmarksforeachtask,weencour-agecreativitybyawardingmorepointsforthebestsystems.Asadditionalincentive,weprovideaweb-basedleaderboardtodisplaystandingsinrealtime.InourgraduateclassonMT,studentstookava-rietyofdifferentapproachestothetasks,insomecasesdevisingnovelalgorithms.Amoreexcitingre-sultisthatsomestudentsystemsorcombinationsofsystemsrivaledthestateoftheartonsomedatasets.2DesigningMTChallengeProblemsOurgoalwasforstudentstofreelyexperimentwithdifferentwaysofsolvingMTproblemsonrealdata,andourapproachconsistedoftwoseparablecom-ponents.First,weprovidedaframeworkthatstripskeyMTproblemsdowntotheiressencesostudentscouldfocusonunderstandingclassicalgorithmsorinventnewones.Second,wedesignedincentivesthatmotivatedthemtoimprovetheirsolutionsasmuchaspossible,encouragingexperimentationwithapproachesbeyondwhatwetaughtinclass.2.1Decoding,Reranking,Evaluation,andAlignmentforMT(DREAMT)Wedesignedfourassignments,eachcorrespondingtoarealsubprobleminMT:alignment,decoding,evaluation,andreranking.1FromthemoregeneralperspectiveofAI,theyemphasizethekeyproblemsofunsupervisedlearning,recherche,evaluationdesign,andsupervisedlearning,respectively.InrealMTsystems,theseproblemsarehighlyinterdependent,apointweemphasizedinclassandattheendofeachassignment—forexample,thatalignmentisanexer-ciseinparameterestimationfortranslationmodels,thatmodelchoiceisatradeoffbetweenexpressivityandefficientinference,andthatoptimalsearchdoesnotguaranteeoptimalaccuracy.However,present-ingeachproblemindependentlyandholdingallelseconstantenablesmorefocusedexploration.Foreachproblemweprovideddata,ana¨ıvesolu-tion,andanevaluationprogram.FollowingBirdetal.(2008)andMadnaniandDorr(2008),weimple-mentedthechallengesinPython,ahigh-levelpro-1http://alopez.github.io/dreamtgramminglanguagethatcanbeusedtowriteveryconciseprogramsresemblingpseudocode.2,3Byde-fault,eachbaselinesystemreadsthetestdataandgeneratesoutputintheevaluationformat,sosetuprequiredzeroconfiguration,andstudentscouldbe-ginexperimentingimmediately.Forexample,onre-ceiptofthealignmentcode,aligningdataandeval-uatingresultsrequiredonlytyping:>align|gradeStudentscouldthenrunexperimentswithinminutesofbeginningtheassignment.Threeofthefourchallengesalsoincludedunla-beledtestdata(exceptthedecodingassignment,asexplainedin§4).Weevaluatedtestresultsagainstahiddenkeywhenassignmentsweresubmitted.2.2IncentiveDesignWewantedtobalanceseveralpedagogicalgoals:un-derstandingofclassicalgorithms,freeexplorationofalternatives,experiencewithtypicalexperimentaldesign,andunhinderedcollaboration.Machinetranslationisfarfromsolved,soweex-pectedmorethanreimplementationofprescribedal-gorithms;wewantedstudentstoreallyexploretheproblems.Tomotivateexploration,wemadetheas-signmentscompetitive.Competitionisapowerfulforce,butmustbeappliedwithcareinaneduca-tionalsetting.4Wedidnotwanttheconsequencesofambitiousbutfailedexperimentstobetoodire,andwedidnotwanttodiscouragecollaboration.Foreachassignment,weguaranteedapassinggradeformatchingtheperformanceofaspecifictar-getalgorithm.Typically,thetargetwasimportantbutnotstate-of-the-art:weleftsubstantialroomforimprovement,andthuscompetition.Wetoldstu-dentstheexactalgorithmthatproducedthetargetac-curacy(thoughweexpectedthemtoderiveitthem-selvesbasedonlectures,notes,orliterature).Wedidnotspecificallyrequirethemtoimplementit,buttheguaranteeofapassinggradeprovidedapower-fulincentiveforthistobethefirststepofeachas-signment.Submissionsthatbeatthistargetreceivedadditionalcredit.Thetopfivesubmissionsreceivedfullcredit,whilethetopthreereceivedextracredit.2http://python.org3Somewell-knownMTsystemshavebeenimplementedinPython(Chiang,2007;HuangandChiang,2007).4Thankstoananonymousreviewerforthisturnofphrase.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
2
1
8
1
5
6
6
6
4
5
/
/
t
je
un
c
_
un
_
0
0
2
1
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
167
Thisschemeprovidedstrongincentivetocontinueexperimentationbeyondthetargetalgorithm.5Foreachassignment,studentscouldformteamsofanysize,underthreerules:eachteamhadtopub-licizeitsformationtotheclass,allteammembersagreedtoreceivethesamegrade,andteamscouldnotdropmembers.Ourhopewasthattheserequire-mentswouldbalancetheperceivedcompetitivead-vantageofcollaborationagainstareluctancetotake(andthussupport)teammateswhodidnotcontributetothecompetitiveeffort.6Thisstrategyworked:outofsixteenstudents,tenoptedtoworkcollaborativelyonatleastoneassignment,alwaysinpairs.Weprovidedaweb-basedleaderboardthatdis-playedstandingsonthetestdatainrealtime,iden-tifyingeachsubmissionbyapseudonymoushan-dleknownonlytotheteamandinstructors.Teamscoulduploadsolutionsasoftenastheylikedbeforetheassignmentdeadline.Theleaderboarddisplayedscoresofthedefaultandtargetalgorithms.Thisin-centivizedanearlystart,sinceteamscouldverifyforthemselveswhentheymetthethresholdforapassinggrade.Thougheffective,italsodetractedfromrealisminoneimportantway:itenabledhill-climbingontheevaluationmetric.Inearlyassign-ments,weobservedafewcasesofthisbehavior,sofortheremainingassignments,wemodifiedtheleaderboardsothatchangesinscorewouldonlybereflectedonceeverytwelvehours.Thisstrategytradessomeamountofscientificrealismforsomemeasureofincentive,astrategythathasproveneffectiveinotherpedagogicaltoolswithreal-timefeedback(Spaccoetal.,2006).Toobtainagrade,teamswererequiredtosub-mittheirresults,sharetheircodeprivatelywiththeinstructors,andpubliclydescribetheirexperimen-talprocesstotheclasssothateveryonecouldlearnfromtheircollectiveeffort.Teamswerefree(butnotrequired)tosharetheircodepubliclyatanytime.5Gradesdependoninstitutionalnorms.Inourcase,highgradesintherestofclasscombinedwithmatchingallassignmenttar-getalgorithmswouldearnaB+;beatingtwotargetalgorithmswouldearnanA-;topfiveplacementonanyassignmentwouldearnanA;andtopthreeplacementcompensatedforweakergradesinothercoursecriteria.Everyonewhocompletedallfourassignmentsplacedinthetopfiveatleastonce.6Theequilibriumpointisasingleteam,thoughthisteamwouldstillneedtodecideonadivisionoflabor.Onestudentcontem-platedorganizingthisteam,butdecidedagainstit.Somedidsoaftertheassignmentdeadline.3TheAlignmentChallengeThefirstchallengewaswordalignment:givenapar-alleltext,studentswerechallengedtoproduceword-to-wordalignmentswithlowalignmenterrorrate(AER;OchandNey,2000).ThisisavariantofaclassicassignmentnotjustinMT,butinNLPgen-erally.Klein(2005)describesaversionofit,andweknowseveralotherinstructorswhouseit.7Inmostofthese,theobjectistoimplementIBMModel1or2,orahiddenMarkovmodel.Ourversionmakesitopen-endedbyaskingstudentstomatchorbeatanIBMModel1baseline.3.1DataWeprovided100,000sentencesofparalleldatafromtheCanadianHansards,totalingaroundtwomillionwords.8Thisdatasetissmallenoughtoaligninafewminuteswithourimplementation—enablingrapidexperimentation—yetlargeenoughtoobtainreasonableresults.Infact,Liangetal.(2006)reportalignmentaccuracyondataofthissizethatiswithinafractionofapointoftheiraccuracyonthecom-pleteHansardsdata.Toevaluate,weusedmanualalignmentsofasmallfractionofsentences,devel-opedbyOchandNey(2000),whichweobtainedfromthesharedtaskresourcesorganizedbyMihal-ceaandPedersen(2003).Thefirst37sentencesofthecorpusweredevelopmentdata,withmanualalignmentsprovidedinaseparatefile.Testdatacon-sistedofanadditional447sentences,forwhichwedidnotprovidealignments.93.2ImplementationWedistributedthreePythonprogramswiththedata.Thefirst,align,computesDice’scoefficient(1945)foreverypairofFrenchandEnglishwords,thenalignseverypairforwhichitsvalueisaboveanadjustablethreshold.Ourimplementation(mostof7Amongthem,JordanBoyd-Graber,JohnDeNero,PhilippKoehn,andSlavPetrov(personalcommunication).8http://www.isi.edu/natural-language/download/hansard/9Thisinvitedthepossibilityofcheating,sincealignmentsofthetestdataarepubliclyavailableontheweb.Wedidnotadver-tisethis,butasanaddedsafeguardweobfuscatedthedatabydistributingthetestsentencesrandomlythroughoutthefile.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
2
1
8
1
5
6
6
6
4
5
/
/
t
je
un
c
_
un
_
0
0
2
1
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
168
Listing1ThedefaultalignerinDREAMT:thresh-oldingDice’scoefficient.for(F,e)inbitext:forf_iinset(F):f_count[f_i]+=1fore_jinset(e):fe_count[(f_i,e_j)]+=1fore_jinset(e):e_count[e_j]+=1for(f_i,e_j)infe_count.keys():dice[(f_i,e_j)]=\2.0*fe_count[(f_i,e_j)]/\(f_count[f_i]+e_count[e_j])pour(F,e)inbitext:pour(je,f_i)inenumerate(F):pour(j,e_j)inenumerate(e):ifdice[(f_i,e_j)]>=cutoff:print”%i-%i”%(je,j)whichisshowninListing1)isquiteclosetopseu-docode,makingiteasytofocusonthealgorithm,oneofourpedagogicalgoals.ThegradeprogramcomputesAERandoptionallyprintsanalignmentgridforsentencesinthedevelopmentdata,showingbothhumanandautomaticalignments.Finallythecheckprogramverifiesthattheresultsrepresentavalidsolution,reportinganerrorifnot—enablingstudentstodiagnosebugsintheirsubmissions.Thedefaultimplementationenabledimmediateexperimentation.Onreceiptofthecode,studentswereinstructedtoalignthefirst1,000sentencesandcomputeAERusingasimplecommand.>align-n1000|gradeByvaryingthenumberofinputsentencesandthethresholdforanalignment,studentscouldimmediatelyseetheeffectofvariousparametersonalignmentquality.WeprivatelyimplementedIBMModel1(Brownetal.,1993)asthetargetalgorithmforapassinggrade.WeranitforfiveiterationswithEnglishasthetargetlanguageandFrenchasthesource.Ourimplementationdidnotusenullalignmentorsymmetrization—leavingoutthesecommonim-provementsofferedstudentsthepossibilityofdis-coveringthemindependently,andtherebyrewarded.AER×1002030405060-16days-14days-12days-10days-8days-6days-4days-2daysdueFigure1:Submissionhistoryforthealignmentchallenge.Dashedlinesrepresentthedefaultandbaselinesystemperformance.Eachcoloredlinerepresentsastudent,andeachdotrepresentsasubmission.Forclarity,weshowonlysubmissionsthatimprovedthestudent’sAER.3.3ChallengeResultsWereceived209submissionsfrom11teamsoveraperiodoftwoweeks(Figure1).EveryoneeventuallymatchedorexceededIBMModel1AERof31.26.MoststudentsimplementedIBMModel1,butwesawmanyothersolutions,indicatingthatmanytrulyexperimentedwiththeproblem:•Implementingheuristicconstraintstorequirealignmentofpropernamesandpunctuation.•Runningthealgorithmonstemsratherthansur-facewords.•InitializingthefirstiterationofModel1withparametersestimatedontheobservedalign-mentsinthedevelopmentdata.•RunningModel1formanyiterations.Mostre-searcherstypicallyrunModel1forfiveitera-tionsorfewer,andtherearefewexperimentsintheliteratureonitsbehaviorovermanyiter-ations,asthereareforhiddenMarkovmodeltaggers(Johnson,2007).Ourstudentscarriedouttheseexperiments,reportingrunsof5,20,100,andeven2000iterations.Noimprove-mentwasobservedafter20iterations.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
2
1
8
1
5
6
6
6
4
5
/
/
t
je
un
c
_
un
_
0
0
2
1
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
169
•Implementingvariousalternativeapproachesfromtheliterature,includingIBMModel2(Brownetal.,1993),competitivelinking(Melamed,2000),andsmoothing(Moore,2004).OneofthebestsolutionswascompetitivelinkingwithDice’scoefficient,modifiedtoincorporatetheobservationthatalignmentstendtobemonotonicbyrestrictingpossiblealignmentpointstoawindowofeightwordsaroundthediagonal.Althoughsimple,itacheivedanAERof18.41,anerrorreductionoverModel1ofmorethan40%.Thebestscorecomparesunfavorablyagainstastate-of-the-artAERof3.6(Liuetal.,2010).Butunderadifferentview,itstillrepresentsasignificantamountofprogressforanefforttakingjustovertwoweeks:ontheoriginalchallengefromwhichweob-tainedthedata(MihalceaandPedersen,2003)thebeststudentsystemwouldhaveplacedfifthoutoffifteensystems.Consideralsothecombinedeffortofallthestudents:whenwetrainedaperceptronclas-sifieronthedevelopmentdata,takingeachstudent’spredictionasafeature,weobtainedanAERof15.4,whichwouldhaveplacedfourthontheoriginalchal-lenge.Thisisnotablesincenoneofthesystemsincorporatedfirst-orderdependenciesonthealign-mentsofadjacentwords,longnotedasanimpor-tantfeatureofthebestalignmentmodels(OchandNey,2003).Yetasimplesystemcombinationofstu-dentassignmentsisaseffectiveasahiddenMarkovModeltrainedonacomparableamountofdata(OchandNey,2003).ItisimportanttonotethatAERdoesnotneces-sarilycorrelatewithdownstreamperformance,par-ticularlyontheHansardsdataset(FraserandMarcu,2007).Weusedtheconclusionoftheassignmentasanopportunitytoemphasizethispoint.4TheDecodingChallengeThesecondchallengewasdecoding:givenafixedtranslationmodelandasetofinputsentences,stu-dentswerechallengedtoproducetranslationswiththehighestmodelscore.Thischallengeintroducedthedifficultiesofcombinatorialoptimizationunderadeceptivelysimplesetup:themodelweprovidedwasasimplephrase-basedtranslationmodel(Koehnetal.,2003)consistingonlyofaphrasetableandtri-gramlanguagemodel.Underthissimplemodel,foraFrenchsentencefoflengthI,EnglishsentenceeoflengthJ,andalignmentawhereeachelementconsistsofaspaninbotheandfsuchthateverywordinbotheandfisalignedexactlyonce,theconditionalprobabilityofeandagivenfisasfol-lows.10p(e,un|F)=Yhi,i0,j,j0i∈ap(fi0i|ej0j)J+1Yj=1p(ej|ej−1,ej−2)(1)Toevaluateoutput,wecomputetheconditionalprobabilityofeasfollows.p(e|F)=Xap(e,un|F)(2)Notethatthisformulationisdifferentfromthetyp-icalViterbiobjectiveofstandardbeamsearchde-coders,whichdonotsumoverallalignments,butapproximatep(e|F)bymaxap(e,un|F).ThoughthecomputationinEquation2isintractable(DeNeroandKlein,2008),itcanbecomputedinafewmin-utesviadynamicprogrammingonreasonablyshortsentences.Weensuredthatourdatametthiscrite-rion.Thecorpus-levelprobabilityisthentheprod-uctofallsentence-levelprobabilitiesinthedata.Themodelincludesnodistortionlimitordistor-tionmodel,fortworeasons.First,leavingoutthedistortionmodelslightlysimplifiestheimplementa-tion,sinceitisnotnecessarytokeeptrackofthelastwordtranslatedinabeamdecoder;wefeltthatthisdetailwassecondarytounderstandingthedifficultyofsearchoverphrasepermutations.Second,itactu-allymakestheproblemmoredifficult,sinceasimpledistance-baseddistortionmodelpreferstranslationswithfewerpermutations;withoutit,themodelmayeasilypreferanypermutationofthetargetphrases,makingeventheViterbisearchproblemexhibititstrueNP-hardness(Knight,1999un;Zaslavskiyetal.,2009).Sincethegoalwastofindthetranslationwiththehighestprobability,wedidnotprovideaheld-outtestset;withaccesstoboththeinputsentencesand10Forsimplicity,thisformulaassumesthateispaddedwithtwosentence-initialsymbolsandonesentence-finalsymbol,andignorestheprobabilityofsentencesegmentation,whichwetaketobeuniform.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
2
1
8
1
5
6
6
6
4
5
/
/
t
je
un
c
_
un
_
0
0
2
1
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
170
themodel,studentshadenoughinformationtocom-putetheevaluationscoreonanydatasetthemselves.Thedifficultyofthechallengeliessimplyinfindingthetranslationthatmaximizestheevaluation.In-deed,sincetheproblemisintractable,eventhein-structorsdidnotknowthetruesolution.114.1DataWechose48Frenchsentencestotaling716wordsfromtheCanadianHansardstoserveastestdata.Tocreateasimpletranslationmodel,weusedtheBerkeleyalignertoaligntheparalleltextfromthefirstassignment,andextractedaphrasetableusingthemethodofLopez(2007),asimplementedincdec(Dyeretal.,2010).Tocreateasimplelanguagemodel,weusedSRILM(Stolcke,2002).4.2ImplementationWedistributedtwoPythonprograms.Thefirst,decode,decodesthetestdatamonotonically—usingboththelanguagemodelandtranslationmodel,butwithoutpermutingphrases.Theimple-mentationiscompletelyself-containedwithnoex-ternaldependencies:itimplementsbothmodelsandasimplestackdecodingalgorithmformonotonictranslation.Itcontainsonly122linesofPython—ordersofmagnitudefewerthanmostfull-featureddecoders.Toseeitssimilaritytopseudocode,com-parethedecodingalgorithm(Listing2)withthepseudocodeinKoehn’s(2010)populartextbook(re-producedhereasAlgorithm1).Thesecondpro-gram,grade,computesthelog-probabilityofasetoftranslations,asoutlineabove.Weprivatelyimplementedasimplestackdecoderthatsearchedoverpermutationsofphrases,similartoKoehn(2004).Ourimplementationincreasedthecodebaseby44linesofcodeandincludedparam-etersforbeamsize,distortionlimit,andthemaxi-mumnumberoftranslationsconsideredforeachin-putphrase.Wepostedabaselinetotheleaderboardusingvaluesof50,3,and20forthese,respectively.11WeimplementedaversionoftheLagrangianrelaxationalgo-rithmofChangandCollins(2011),butfounditdifficulttoobtaintight(optimal)solutionswithoutiterativelyreintroduc-ingalloftheoriginalconstraints.Wesuspectthisisduetothelackofadistortionpenalty,whichenforcesastrongpref-erencetowardstranslationswithlittlereordering.However,thesolutionfoundbythisalgorithmisonlyapproximatestheobjectiveimpliedbyEquation2,whichsumsoveralignments.Wealsopostedanoraclecontainingthemostprob-ableoutputforeachsentence,selectedfromamongallsubmissionsreceivedsofar.Theintentofthisoraclewastoprovidealowerboundonthebestpos-sibleoutput,givingstudentsadditionalincentivetocontinueimprovingtheirsystems.4.3ChallengeResultsWereceived71submissionsfrom10teams(Fig-ure2),againexhibitingvarietyofsolutions.•Implementationofgreedydecoderwhichateachstepchoosesthemostprobabletranslationfromamongthosereachablebyasingleswaporretranslation(Germannetal.,2001;Langlaisetal.,2007).•Inclusionofheuristicestimatesoffuturecost.•Implementationofaprivateoracle.Somestu-dentsobservedthattheidealbeamsettingwasnotuniformacrossthecorpus.Theyrantheirdecoderunderdifferentsettings,andthense-lectedthemostprobabletranslationofeachsentence.Manyteamswhoimplementedthestandardstackdecodingalgorithmexperimentedheavilywithitspruningparameters.Thebestsubmissionusedex-tremelywidebeamsettingsinconjunctionwithareimplementationofthefuturecostestimateusedinMoses(Koehnetal.,2007).FiveofthesubmissionsbeatMosesusingitsstandardbeamsettingsafterithadbeenconfiguredtodecodewithourmodel.Weusedthisassignmenttoemphasizetheim-portanceofgoodmodels:themodelscoreofthesubmissionswasgenerallyinverselycorrelatedwithBLEU,possiblybecauseoursimplemodelhadnodistortionlimits.Weusedthistoillustratethediffer-encebetweenmodelerrorandsearcherror,includ-ingfortuitoussearcherror(Germannetal.,2001)madebydecoderswithlessaccuratesearch.5TheEvaluationChallengeThethirdchallengewasevaluation:givenatestcor-puswithreferencetranslationsandtheoutputofsev-eralMTsystems,studentswerechallengedtopro-ducearankingofthesystemsthatcloselycorrelatedwithahumanranking.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
2
1
8
1
5
6
6
6
4
5
/
/
t
je
un
c
_
un
_
0
0
2
1
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
171
Listing2ThedefaultdecoderinDREAMT:astackdecoderformonotonictranslation.stacks=[{}for_inf]+[{}]stacks[0][lm.begin()]=initial_hypothesisfori,stackinenumerate(stacks[:-1]):forhinsorted(stack.itervalues(),key=lambdah:-h.logprob)[:alpha]:forjinxrange(i+1,len(F)+1):iff[je:j]intm:forphraseintm[F[je:j]]:logprob=h.logprob+phrase.logproblm_state=h.lm_stateforwordinphrase.english.split():(lm_state,word_logprob)=lm.score(lm_state,word)logprob+=word_logproblogprob+=lm.end(lm_state)ifj==len(F)else0.0new_hypothesis=hypothesis(logprob,lm_state,h,phrase)iflm_statenotinstacks[j]or\stacks[j][lm_state].logprob