C ´arthach ´O Nuan ´ain, Perfecto

C ´arthach ´O Nuan ´ain, Perfecto
Herrera, and Sergi Jord `a
Music Technology Group
Communications Campus–Poblenou
Universität Pompeu Fabra
Carrer Roc Boronat, 138, 08018
Barcelona, Spanien
{carthach.onuanain, perfecto.herrera,
sergi.jorda}@upf.edu

Rhythmic Concatenative
Synthesis for Electronic
Musik: Techniques,
Implementation, Und
Evaluation

Abstrakt: In diesem Artikel, we summarize recent research examining concatenative synthesis and its application and
relevance in the composition and production of styles of electronic dance music. We introduce the conceptual
underpinnings of concatenative synthesis and describe key works and systematic approaches in the literature. Unser
System, RhythmCAT, is proposed as a user-friendly system for generating rhythmic loops that model the timbre and
rhythm of an initial target loop. The architecture of the system is explained, and an extensive evaluation of the system’s
performance and user response is discussed based on our results.

Historically, reusing existing material for the pur-
poses of creating new works has been a widely
practiced technique in all branches of creative arts.
The manifestations of these expressions can be
wholly original and compelling, or they may be
derivative, uninspiring, and potentially infringe on
copyright (depending on myriad factors including
the domain of the work, the scale of the reuse, Und
cultural context).

In the visual arts, reusing or adapting existing ma-
terial is most immediately understood in the use of
collage, where existing works or parts thereof are as-
sembled to create new artworks. Cubist artists such
as Georges Braque and Pablo Picasso extensively ref-
erenced, appropriated, and reinterpreted their own
works and the works of others, as well as common
found objects from their surroundings (Greenberg
1971). Collage would later serve as direct inspiration
for bricolage, reflecting wider postmodernist trends
towards deconstructionism, self-referentiality, Und
revisionism that include the practice of parody and
pastiche (Lochhead and Auner 2002).

In music and the sonic arts, the natural corollary

of collage came in the form of musique concr `ete
(Holmes 2008), a movement of composition stem-
ming from the experiments of Pierre Schaeffer and,
später, Pierre Henry at the studios of Radiodiffusion-
T ´el ´evision Franc¸ aise in Paris during the 1940s and
1950S (Battier 2007). In contrast to the artificially

Computermusikjournal, 41:2, S. 21–37, Sommer 2017
doi:10.1162/COMJ a 00412
C(cid:2) 2017 Massachusetts Institute of Technology.

and electronically generated elektronische Musik
spearheaded by Karlheinz Stockhausen at the West
German Radio studios in Cologne, the French com-
posers sought to conceive their works from existing
recorded sound, including environmental sources
like trains and speech. Seemingly unrelated and
nonmusical sounds are organized in such a way that
the listener discovered the latent musical qualities
and structure they inherently carry.

It is important to note that in music composi-
tion general appropriation of work predates these
electronic advancements of technology. In West-
ern art music, Zum Beispiel, composers like B ´ela
Bart ´ok—himself a musicologist—have often turned
to folk music for its melodies and dance music styles
(Bart ´ok 1993), and others (z.B., Claude Debussy, vgl.
Tamagawa 1988) became enchanted by music from
other cultures, such as Javanese gamelan, studying
its form and incorporating the ideas into new pieces.
Quotations, or direct lifting of melodies from other
composers’ works, are commonplace rudiments in
jazz music. Charlie Parker, Zum Beispiel, was known
to pepper his solos with reference to Stravinsky’s
Rite of Spring (Mangani, Baldizzone, and Nobile
2006). David Metzer has compiled a good reference
on appropriation and quotation music (Metzer 2003).
The modern notion of sampling stems from the
advent of the digital sampler and its eventual explo-
sion of adaptation in hip-hop and electronic music.
Artists such as Public Enemy and the Beastie Boys
painstakingly assembled bewildering permutations
of musical samples, sound bites, and other miscel-
laneous recorded materials that sought to supplant

´O Nuan ´ain et al.

21

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

the many cultural references that permeated their
lyrics (Sewell 2014). Later, the influence of hip-hop
production would inform the sample-heavy arrange-
ments of jungle and drum and bass, insbesondere
with its exhaustive rerendering of the infamous
“Amen Break.” John Oswald, an artist who directly
challenged copyright for artistic gain, dubbed his ap-
proach “plunderphonics” and set out his intentions
in a suitably subtitled essay “Plunderphonics, or Au-
dio Piracy as a Compositional Prerogative” (Oswald
1985). Using tape-splicing techniques, he created
deliberately recognizable montages of pop music,
such as that by Michael Jackson, in a style that be-
came later known as “mashups.” Nowadays, artists
such as Girltalk create extremely complex and mul-
tireferential mashups of popular music, harnessing
the powerful beat-matching and synchronization
capabilities of the modern digital audio workstation
(Humphrey, Turnbull, and Collins 2013).

Although the question of originality and author-

ship is not in the realm of this discussion, Das
interesting and pertinent topic is under the scrutiny
of researchers in musicology and critical studies.
We encourage the reader to consult work by Tara
Rodgers (2003), Paul Miller (2008), and Kembrew
McLeod (2009) for a more focused discourse.

Associated research efforts in computer music,
signal processing, and music information retrieval
(MIR) afford us the opportunity to develop au-
tomated and intelligent systems that apply the
aesthetic of sampling and artistic reuse. The term
concatenative synthesis has been extensively used
to describe musical systems that create new sound
by automatically recycling existing sounds ac-
cording to some well-defined set of criteria and
algorithmic procedures. Concatenative synthesis
can be considered the natural heir of granular syn-
These (Roads 2004), a widely examined approach
to sound synthesis using tiny snippets (“grains”)
of around 20–200 msec of sound, which traces its
history back to Iannis Xenakis’s theories in For-
malized Music (Xenakis 1971). With concatenative
synthesis, the grains become “units” and are more
related to musical scales of length, such as notes
and phrases. Most importantly, information is at-
tached to these units of sound: crucial descriptors
that allow spectral and temporal characteristics

of the sound to determine the sequencing of final
output.

In the following sections, we will present a
thorough, critical overview of many of the key
works in the area of concatenative synthesis, based
on our observation that there has not been such
a broad survey of the state of the art in other
publications in recent years. We will compare
and contrast characteristics, Techniken, und das
challenges of algorithmic design that repeatedly
arise. For the past three years, we have been working
on the European-led initiative GiantSteps (Knees et
al. 2016). The broad goal of the project is the research
and development of expert agents for supporting and
assisting music makers, with a particular focus
on producers of electronic dance music (EDM).
Folglich, one of the focuses of the project has
been on user analysis: thinking about their needs,
desires, and skills; investigating their processes
and mental representations of tasks and tools; Und
evaluating their responses to prototypes.

Modern EDM production is characterized by
densely layered and complex arrangements of tracks
making liberal use of synthesis and sampling,
exploiting potentially unlimited capacity and pro-
cessing in modern computer audio systems. Eins
of our main lines of research in this context has
been the investigation of concatenative synthesis
for the purposes of assisting music producers to
generate rhythmic patterns by means of automatic
and intelligent sampling.

In diesem Artikel, we present the RhythmCAT
System, a digital instrument that creates new
loops emulating the rhythmic pattern and timbral
qualities of a target loop using a separate corpus of
sound material. We first proposed the architecture
of the system in a paper for the conference on New
Interfaces for Musical Expression ( ´O Nuan ´ain, Jord `a,
and Herrera 2016a), followed by papers evaluating
it in terms of its algorithmic performance ( ´O
Nuan ´ain, Herrera, and Jord `a 2016) and a thematic
analysis of users’ experience ( ´O Nuan ´ain, Jord `a,
and Herrera 2016b). This article thus represents
an expanded synthesis of the existing literature,
our developments motivated by some detected
shortcomings, and the illustration of an evaluation
strategy.

22

Computermusikjournal

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

State of the Art in Concatenative Synthesis

Other authors have previously provided insightful
summaries of research trends in concatenative
synthesis (z.B., Schwarz 2005; Sturm 2006). Diese
surveys are over ten years old, Jedoch (but see
Schwarz 2017 for a continuously updated online
survey), so we offer here a more recent compendium
of state-of-the-art systems as we see them, based on
our investigations of previous publications up until
Jetzt.

Before music, concatenative synthesis enjoyed
successful application in the area of speech syn-
These; Hunt and Black (1996) first reported a unit
selection scheme using hidden Markov models
(HMMs) to automatically select speech phonemes
from a corpus and combine them into meaningful
and realistic sounding sentences. Hidden Markov
models extend Markov chains by assuming that
“hidden” states output visible symbols, und das
Viterbi algorithm (Rabiner 1989) can return the
most probable sequence of states given a particular
sequence of symbols. In concatenative synthesis,
the maximum probabilistic model is inverted to
facilitate minimal cost computations.

The target cost of finding the closest unit in
the corpus to the current target unit becomes the
emission probability, with the concatenation cost
representing the transition probability between
Staaten. The Viterbi algorithm thus outputs indices
of database units corresponding to the optimal
state sequence for the target, based on a linear
combination of the aforementioned costs. Diemo
Schwarz (2003) directly applied this approach for
musical purposes in his Caterpillar system.

Schwarz notes, Jedoch, that the HMM approach

can be quite rigid for musical purposes because it
produces one single optimized sequence without
the ability to manipulate the individual units.
To address these limitations, he reformulates
the task into a constraint-satisfaction problem,
which offers more flexibility for interaction. A
constraint-satisfaction problem models a problem
as a set of variables, Werte, and a set of constraints
that allows us to identify which combinations
of variables and values are violations of those
constraints, thus allowing us to quickly reduce large

portions of the search space (Russell and Norvig
2009).

Zils and Pachet (2001) first introduced constraint

satisfaction for concatenative synthesis in what
they describe as musical mosaicking—or, to use
their portmanteau, musaicing. They define two
categories of constraints: segment and sequence
constraints. Segment constraints control aspects of
individual units (much like the target cost in an
HMM-like system) based on their descriptor values.
Sequence constraints apply globally and affect
aspects of time, continuity, and overall distributions
of units. The constraints can be applied manually
by the user or learned by modeling a target. Der
musically tailored “adaptive search” algorithm
performs a heuristic search to minimize the total
global cost generated by the constraint problem.
One immediate advantage of this approach over the
HMM is the ability to run the algorithm several
times to generate alternative sequences, wohingegen
the Viterbi process always outputs the most optimal
solution.

A simpler approach is presented in MatConcat
(Sturm 2004), using feature vectors comprising six
descriptors and computing similarity metrics be-
tween target units and corpus units. Built for the
MATLAB environment for scientific computing,
the interface is quite involved, and the user has
control over minute features such as descriptor
tolerance ranges, relative descriptor weightings,
as well as window types and hop sizes of output
transformations. On Sturm’s Web site are short
compositions generated by the author using ex-
cerpts from a Mahler symphony as a target, Und
resynthesized using various unrelated sound sets,
zum Beispiel, pop vocals, found sounds, and solo in-
strumental recordings from saxophone and trumpet
(www.mat.ucsb.edu/∼b.sturm/music/CVM.htm).
As concatenative synthesis methods matured,
user modalities of interaction and control became
more elaborate and real-time operations were
introduced. One of the most compelling features
of many concatenative systems is the concept of
the interactive timbre space. With the release of
CataRT (Schwarz et al. 2006), these authors provided
an interface that arranges the units in an interactive
two-dimensional timbre space. The arrangement

´O Nuan ´ain et al.

23

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

of these units is according to a user-selectable
descriptor on each axis. Instead of using a target
sound file to inform the concatenation procedure,
the user’s mouse cursor becomes the target. Sounds
that are within a certain range of the mouse cursor
are sequenced according to some triggering options
(one-shot, loop, and—most crucially—with real-
time output).

Bernardes takes inspiration from CataRT and

from Tristan Jehan’s Skeleton (Jehan 2005) Zu
build his EarGram system for the Pure Data (Pd)
Umfeld (Bernardes, Guedes, and Pennycook
2013). Built on top of William Brent’s excellent
feature-extraction library timbreID (Brent 2010), Es
adds a host of interesting features for visualization
and classification. Zum Beispiel, as well as the
familiar waveform representation and previously
described 2-D timbre representation (with various
clustering modes and dimensionality-reduction
implementations), there are similarity matrices
that show the temporal relations in the corpus
im Laufe der Zeit. Some unique playback and sequencing
modes also exist, such as the infiniteMode, welche
generates endless playback of sequences, und das
soundscapeMap, which features an additional 2-D
control of parameters pertaining to sound scene
Design. Another system that adapts a 2-D timbre
space is AudioGarden by Frisson, Picard, Und
Tardieu (2010), which offers two unique mapping
Verfahren. The first of these, “disc” mode, places
units by assigning the length of the audio file to the
radius of the unit from the center, with the angle
of rotation corresponding to a principal component
der Klangfarbe, mel-frequency cepstrum coefficients
(MFCCs). In the other mode, called “flower” mode,
a point of the sound is positioned in the space
according to the average MFCCs of the entire sound
file. Segments of the particular sound are arranged
in chronological fashion around this center point.

There have been some concatenative systems tai-
lored specifically with rhythmic purposes in mind.
Pei Xiang proposed Granuloop for automatically re-
arranging segments of four different drum loops into
a 32-step sequence (Xiang 2002). Segmentation is
done manually, without the aid of an onset detector,
using the Recycle sample editor from Propellerhead
Software. Segmented sounds are compared using the

inner product of the normalized frequency spectrum,
supplemented with the weighted energy. These val-
ues become weights for a Markov-style probability
transition matrix. Implemented in Pd, the user
interacts by moving a joystick in a 2-D space, welche
affects the overall probability weightings determin-
ing which loop segments are chosen. The system
presents an interesting approach but is let down by
its lack of online analysis. Ringomatic (Aucouturier
and Pachet 2005) is a real-time agent specifically
tailored for combining drum tracks, expanding on
many of the constraint-based ideas from their prior
musaicing experiments. They applied the system to
real-time performance following symbolic feature
data extracted from a human MIDI keyboard player.
They cite, as an example, that a predominance of
lower-register notes in the keyboard performance
applies an inverse constraint that creates comple-
mentary contrast by specifying that high-frequency
heavy cymbal sounds should be concatenated.

As demonstrated in EarGram, concatenative
synthesis has been considered useful in sound
design tasks, allowing the sound designer to build
rich and complex textures and environments that
can be transformed in many ways, both temporally
and timbrally. Cardle, Brooks, and Robinson (2003)
describe their Directed Sound Synthesis software as a
means of providing sound designers and multimedia
producers a method of automatically reusing and
synthesizing sound scenes in video. Users select
one or more regions of an existing audio track and
can draw probability curves on the timeline to
influence resynthesis of these regions elsewhere
(one curve per region). Hoskinson and Pai (2001), In
a nod to granular synthesis, refer to the segments
used in their Soundscapes software as ”natural
grains,” and they seek to synthesize endless streams
of soundscapes. The selection scheme by which
segments are chosen is based on a representation
of each segment as a transition state in a Markov
Kette. Its interface features knobs and sliders for
interactively controlling gain and parameters of
multiple samples. To evaluate the platform they
conducted an additional study (Hoskinson and Pai
2007) to reveal whether listening subjects found the
concatenated sequences convincing compared with
genuinely recorded soundscapes.

24

Computermusikjournal

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Figur 1. Block diagram of
functionality in the
RhythmCAT system.

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Figur 1 gives a diagrammatic overview of these
important stages, which can be briefly summarized
als:

More-specific and applied-use cases of concate-
native synthesis include work by Ben Hackbarth,
who explores the possibilities of concatenative syn-
thesis in large-scale music composition (Hackbarth,
Schnell, and Schwarz 2011). Hackbarth has worked
with Schwarz to provide an alternative interface
for exploring variations based on a force-directed
graph. John O’Connell describes a graphical system
for Pd that demonstrates the use of higher-level
perceptual concepts like mood (happy versus sad)
for informing selection in audio mosaics (O’Connell
2011).

Commercial implementations also exist for con-
catenative synthesis. Of particular note is Steinberg’s
Loopmash, a software plug-in and mobile application
for automatically creating mashups from existing
looped content (www.steinberg.net/loopmash). Der
interface consists of a number of tracks in a timeline
arrangement. One track is set as a master, and slices
in the master are replaced with matching slices from
the other slave tracks. Users interact by manipulat-
ing “similarity gain” sliders that control the influ-
ence of each track in the slice selection algorithm.
Other applications exist more as MIDI sampler sys-
tems attempting to model the performance qualities
of natural sources such as orchestral ensembles (z.B.,
SynfulOrchestra, www.synful.com) or the human
voice (z.B., Vocaloid, www.vocaloid.com).

There are many other concatenative systems that
are too numerous to discuss in detail here. Wir haben,
Jedoch, compiled a table in a previous publication
summarizing all the systems we have come across
in our research, with remarks on interaction and
visualization features, support for rhythm, Und
whether any user evaluation was carried out ( ´O
Nuan ´ain, Jord `a, and Herrera 2016b).

Design and Implementation

In diesem Abschnitt, we will describe our implementa-
tion of the RhythmCAT system, beginning with
an explanation of the musical analysis stages of
onset detection, segmentation, and feature extrac-
tion. This is followed by an examination of the
interactive user interface and the pattern-generation
Verfahren.

1. Sound Input
2. Onset Detection and Segmentation
3. Audio Feature Extraction
4. Storage and Data Representation
5. Pattern Synthesis
6. Real-Time Audio Output

´O Nuan ´ain et al.

25

The system is developed in C++ using the
JUCE framework (www.juce.com), the Essentia
musical analysis library (Bogdanov et al. 2013), Und
the OpenCV computer vision library for matrix
Operationen (Bradski 2000).

Sound Input

The first stage in building a concatenative music
system generally involves gathering a database of
sounds from which selections can be made during
the synthesis procedure. This database can be
manually assembled, but in many musical cases the
starting point is some user-provided audio that may
range in length from individual notes to phrases to
complete audio tracks.

The two inputs to the system are the sound
palette and the seed sound. The sound palette refers
to the pool of sound files we want to use as the
sample library for generating our new sounds. Der
seed sound refers to the short loop that we wish
to use as the similarity target for generating those
Geräusche. The final output sound is a short (one to two
bars) loop of concatenated audio that is rendered in
real time to the audio host.

Onset Detection and Segmentation

In cases where the sounds destined for the sound
palette exceed note or unit length, the audio needs
to be split into its constituent units using onset
detection and segmentation.

Onset detection is a large topic of continuous
Studie, and we would encourage the reader to exam-
ine the excellent review of methods summarized
by Simon Dixon (2006). Currently, with some tun-
ing of the parameters, Sebastien Bock’s Superflux
algorithm represents one of the best-performing
state-of-the-art detection methods (B ¨ock and Wid-
mer 2013). For our purposes, we have experienced
good results with the standard onset detector avail-
able in Essentia, which uses two methods based on
analyzing signal spectra from frame to frame (at a
rate of around 11 ms). The first method involves
estimating the high-frequency content in each frame

(Masri and Bateman 1996) and the second method
involves estimating the differences of phase and
magnitude between each frame (Bello and Daudet
2005).

The onset detection process produces a list of
onset times for each audio file, which we use to
segment into new audio files corresponding to unit
sounds for our concatenative database.

Audio Feature Extraction

In MIR systems, the task of deciding which features
are used to represent musical and acoustic properties
is a crucial one. It is a trade-off between choosing
the richest set of features capable of succinctly
describing the signal, on the one hand, und das
expense of storage and computational complexity,
auf dem anderen. When dealing specifically with musical
Signale, there are a number of standard features cor-
responding roughly to certain perceptual sensations.
We briefly describe the features we chose here (for a
more thorough treatment of feature selection with
relation to percussion, see Herrera, Dehamel, Und
Gouyon 2003; Tindale et al. 2004; and Roy, Pachet,
and Krakowski 2007).

Our first feature is the loudness of the signal,
which is implemented in Essentia according to
Steven’s Power Law, nämlich, the energy of the
signal raised to the power of 0.67 (Bogdanov et al.
2013). This is purported to be a more perceptually
effective measure for human ears. Nächste, we extract
der spektrale Schwerpunkt, which is defined as the
weighted mean of the spectral bins extracted using
the Fourier transform. Each bin is then weighted by
its magnitude.

Perceptually speaking, der spektrale Schwerpunkt
relates mostly to the impression of the brightness
of a signal. In terms of percussive sounds, eins
would expect the energy of a kick drum to be more
concentrated in the lower end of the spectrum and
hence have a lower centroid than that from a snare
or crash cymbal.

Another useful single-valued spectral feature is
the spectral flatness. It is defined as the geomet-
ric mean of the spectrum divided by the arithmetic
mean of the spectrum. A spectral flatness value of 1.0

26

Computermusikjournal

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

means the energy spectrum is flat, whereas a value of
0.0 would suggest spikes in the spectrum indicating
harmonic tones (with a specific frequency). Der
value intuitively implies a discrimination between
noisy or inharmonic signals and signals that are
harmonic or more tonal. Kick-drum sounds (espe-
cially those generated electronically) often comprise
quite a discernible center frequency, whereas snares
and cymbals are increasingly broadband in spectral
Energie.

Our final feature is MFCCs. These can be con-
sidered as a compact approximation of the spectral
envelope and is a useful aid in computationally
describing and classifying the timbre of a signal. Es
has been applied extensively in speech processing,
genre detection (Tzanetakis, Essl, and Cook 2001),
and instrument identification (Loughran et al. 2004).
The computation of MFCCs, as outlined by Beth
Logan (2000), is basically achieved by computing
the spectrum, mapping the result into the more
perceptually relevant mel scale, taking the log, Und
then applying the discrete cosine transform.

It is difficult to interpret exactly what each of the
MFCC components mean, but the first component
is generally regarded as encapsulating the energy.
Because we are already extracting the loudness
using another measure, we have discarded this
component in our system. For detailed explanations
and formulae pertaining to the features introduced
Hier, as well as others, we direct the reader to
Geoffroy Peeters’s compendium (Peeters 2004).

Storage and Data Representation

Further on in this article we will describe in greater
detail how the seed or target audio signal is actually
received from the Virtual Studio Technology host,
but in terms of analysis on that seed signal, Die
process is the same as before: onset detection and
segmentation followed by feature extraction.

The resulting feature vectors are stored in two
matrices: the palette matrix and the target matrix.
The palette matrix stores the feature vectors of each
unit of sound extracted from the sound palette, Und
the target matrix similarly stores feature vectors of
units of sound extracted from the seed loop.

Pattern Synthesis and Real-Time Audio Output

This section details the visible, aural, and interactive
elements of the system as they pertain to the user.
Figur 2 provides a glimpse of the user interface in a
typical pattern generation scenario.

Workflow

The layout of the interface was the result of a
number of iterations of testing with users who,
while praising the novelty and sonic value of
the instrument, sometimes expressed difficulty
understanding the operation of the system. One of
the main challenges faced was how best to present
the general workflow to the user in a simple and
concise manner. We decided to represent the flow of
the various operations of the software emphatically
by using a simple set of icons and arrows, as seen in
Figure 2a.

The icons indicate the four main logical opera-
tions that the user is likely to implement, and opens
up related dialog screens:

Palette Dialog – indicated by the folder icon
Seed Dialog – indicated by the jack cable icon
Sonic Dialog – indicated by the square feature

space icon

Output Dialog – indicated by the speaker icon

Sound Palette

The user loads a selection of audio files or folders
containing audio files that are analyzed to create
the sound palette, as has previously been discussed.
Nächste, dimensionality reduction is performed on
each feature vector of the units in the sound
palette using principal component analysis (PCA).
Two PCA components are retained and scaled
to the visible area of the interface to serve as
coordinates for placing a circular representation
of the sound in two-dimensional space. We call
these visual representations, along with their
associated audio content, sound objects. Sie sind
clearly visible in the main Timbre Space window,
Figure 2d.

´O Nuan ´ain et al.

27

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Figur 2. The main user
interface for RhythmCat
consists of panels for
workflow (A), slider
Kontrollen (B), master

Kontrollen (C), the main
timbre space interface (D),
and waveform
representation (e).

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

Seed Input

Seed audio is captured and analyzed by directly
recording the input audio of the track on which
the instrument resides in the audio host. Using
the real-time tempo and information about bar and
beat position provided by the host, the recorder
will wait until the next measure starts to begin
capture and will only capture complete measures
of audio. This audio is analyzed as before, mit einer
exception. Because the goal of the instrument is
to integrate with an existing session and generate
looped material, we assume that the incoming audio
is quantized and matches the tempo of the session.
Daher, onset detection is not performed on the seed
Eingang; stattdessen, segmentation takes place at the
points in time determined by the grid size (lower
left of the screen).

An important aspect to note: Because the instru-
ment fundamentally operates in real time, we need

to be careful about performing potentially time-
consuming operations, such as feature extraction,
when the audio system is running. Daher, we perform
the audio-recording stage and feature-extraction pro-
cess on separate threads, so the main audio-playback
thread is uninterrupted. This is separate to yet
another thread that handles elements of the user
interface.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Sonic Parameters

Clicking on the square sonic icon in the center
of the workflow component opens up the set of
sliders shown in Figure 2b, which allows us to
adjust the weights of the features in the system.
Adjusting these weights has effects in terms of
the pattern-generation process but also in the
visualization. Presenting their technical names
(centroid, flatness, and MFCCs) would be confusing

28

Computermusikjournal

Figur 3. Algorithm for
generating a list of sound
connections.

for the general user, so we relabeled them with
what we considered the most descriptive subjective
Bedingungen. With the pattern-generation process, diese
weights directly affect the features when performing
similarity computation and unit selection, as we
will see in the next section. Depending on the
source and target material, different combinations
of feature weightings produce noticeably different
results. Informally, we have experienced good
results using MFCCs alone, Zum Beispiel, sowie
combinations of the flatness and centroid. In terms
of visualization, when the weights are changed,
dimensionality reduction is reinitiated and, somit,
positioning of the sound objects in the timbre space
changes. Manipulating these parameters can help
disperse and rearrange the sound objects for clearer
interaction and exploration by the user in addition
to affecting the pattern generation process.

Once the palette and seed matrices have been
populated, a similarity matrix between the palette
and seed matrix is created. Using the feature
weightings from the parameter sliders, a sorted
matrix of weighted Euclidean distances between
each onset in the target matrix and each unit sound
in the palette matrix is computed.

Unit Selection and Pattern Generation

The algorithm for unit selection is quite straight-
forward. For each unit i in the segmented target
sequence (z.B., a 16-step sequence) and each corpus
unit j (typically many more), the target unit cost
Ci, j is calculated by the weighted Euclidean distance
of each feature k.

These unit costs are stored in similarity matrix

M. Next we create a matrix M(cid:3) of the indices
of the elements of M sorted in ascending order.
Endlich, a concatenated sequence can be generated
by returning a vector of indices I from this sorted
matrix and playing back the associated sound file.
To retrieve the closest sequence V0 one would only
need to return the first row.

Returning sequence vectors as rows of a sorted
matrix limits the number of possible sequences to
the matrix size. This can be extended if we define
a similarity threshold T and return a random index

Procedure GET-ONSET-LIST

for n in GridSize do

R = Random number 0 < Variance I = Index from Row R of Similarity Matrix S = New SoundConnection S->SoundUnit = SoundUnit(ICH)
Add S to LinkedList

end for
return LinkedList
End Procedure

zwischen 0 and j − T for each step i in the new
sequence.

When the user presses the New Pattern button
(Figure 2c), a new linked list of objects, called sound
connections, is formed. This represents a traversal
through connected sound objects in the timbre space.
The length of the linked list is determined by the grid
size specified by the user, so if the user specifies, für
Beispiel, a grid size of 1/16, a one-measure sequence
of 16th notes will be generated. The algorithm in
Figur 3 details the exact procedure whereby we
generate the list. The variance parameter affects the
threshold of similarity by which onsets are chosen.
Mit 0 variance, the most similar sequence is always
returned. This variance parameter is adjustable from
the Accuracy/Variety slider in the lower-left corner
of the instrument (Figure 2c).

In the main timbre space interface (Figure 2d),
a visual graph is generated in the timbre space by
traversing the linked list and drawing line edges
connecting each sound object pointed to by the
sound connection in the linked list. In this case, A
loop of 16 onsets has been generated, with the onset
numbers indicated beside the associated sound
object for each onset in the sequence. The user
is free to manipulate these sound connections to
mutate these patterns by touching or clicking on
the sound connection and dragging to another sound
Objekt. Multiple sound connections assigned to an
individual sound object can be selected as a group
by slowly double-tapping and then dragging.

On the audio side, every time there is a new beat,
the linked list is traversed. If a sound connection’s
onset number matches the current beat, the corre-
sponding sound unit is played back. One addition

´O Nuan ´ain et al.

29

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

that occurred after some user experiments with the
prototype is the linear waveform representation of
the newly generated sequence (Figure 2e). Users felt
the combination of the 2-D interface with the tradi-
tional waveform representation made the sequences
easier to navigate and they also welcomed being able
to manipulate the internal arrangement of sequence
itself once generated.

Evaluation

In the course of our literature review of the state of
the art, we were particularly interested in examining
the procedures and frameworks used in performing
evaluations of the implemented systems. Our most
immediate observation was that evaluation is an
understudied aspect of research into concatenative
Systeme. With creative and generative systems,
this is often the case; many such systems are
designed solely with the author as composer in
Geist.

Some authors provide examples of use cases
(Cardle, Brooks, and Robinson 2003). Authors, solch
as Sturm, have made multimedia examples available
on the Web (see Zils and Pachet 2001; Xiang 2002;
Sturm 2004). Frequently, researchers have made
allusions to some concept of the “user,” but only
one paper has presented details of a user experiment
(Aucouturier and Pachet 2005). One researcher,
Graham Coleman, also highlighted this lack of
evaluation strategies in concatenative synthesis
in his doctoral dissertation (Coleman 2015). Für
the evaluation of his own system, he undertook a
listening experiment with human participants in
tandem with a thorough analysis of algorithmic
performance and complexity.

We conducted extensive evaluation of our own
System, both quantitatively and qualitatively. In
the quantitative portion, we set out to investigate
two key aspects. Erste, if we consider the system as
a retrieval task that aims to return similar items,
how accurate and predictable is the algorithm and
its associated distance metric? Zweite, how does
this objective retrieval accuracy correspond to the
perceptual response of the human listener to the
retrieved items?

The qualitative evaluation consisted of inter-
active, informal interviews with intended users—
mostly active music producers but also music
researchers and students—as they used the soft-
ware. We gathered their responses and impressions
and grouped them according to thematic analysis
Techniken. As alluded to in the introduction, beide
the quantitative evaluation and the qualitative eval-
uation have been previously reported in separate
publications, but we include summaries of each here
for reference.

System Evaluation

We describe here the qualitative portion of the
evaluation, first by introducing the experimental
setup, then presenting and comparing the results of
the algorithm’s retrieval accuracy with the listener
survey.

Experimental Setup

Because the goal of the system is the generation
of rhythmic loops, we decided to formulate an
experiment using breakbeats (short drum solos taken
from commercial funk and soul stereo recordings).
Ten breakbeats were chosen in the range 75–
142 bpm, and we truncated each of them to a
single bar in length. Repeating ten times for each
loop, we selected a single loop as the target seed
and resynthesized it using the other nine loops
(similar to holdout validation in machine learning)
at four different distances from target to create 40
Variationen.

Each of the loops was manually labeled with
the constituent drum sounds as we hear them.
The labeling used was “K” for kick drum, “S” for
snare, “HH” for hi-hat, “C” for cymbal, and “X”
when the content was not clear (such as artifacts
from the onset-detection process or some spillage
from other sources in the recording). Figur 4 zeigt an
the distribution labels in the entire data set and
the distribution according to step sequence. Wir
can notice immediately the heavy predominance
of hi-hat sounds, which is typical in kit-based
drumming patterns. Zusätzlich, the natural trends

30

Computermusikjournal

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Figur 4. Distribution of
sound labels in the source
corpus (A). Distribution of
sound labels by step
number in the 16-step
sequence (B).

Figur 5. Scatter plot and
linear regression of
accuracy versus distance
for all sound labels (A) Und
for the same sequence
with kick drum and snare
isolated (B).

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u
/
C
Ö
M

J
/

l

A
R
T
ich
C
e

P
D

F
/

/

/

/

4
1
2
2
1
1
8
5
6
3
9
3
/
C
Ö
M
_
A
_
0
0
4
1
2
P
D

.

J

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

of kit drumming are evident, nämlich, kick-drum
placement on the first beat and offbeat peaks for the
snares.

Retrieval Evaluation

We compared each of the labels in each 16-step
position of the quantized target loop with the labels
in each step of the newly generated sequences. Der
accuracy A of the algorithm is then given by the
number of correctly retrieved labels divided by the
total number of labels in the target loop, inspired
by a similar approach adopted by Thompson, Dixon,
and Mauch (2014).

Based on Pearson’s correlation of the retrieval
ratings and the distances of the generated patterns,
we were able to confirm the tendency of smaller
distances to produce more similar patterns in

terms of the labeling accuracy. A moderate
negative correlation of r = −0.516 (significance
level p < 0.001) is visible by the regression line in Figure 5a. If we isolate the kick and snare (often considered the more salient events in drum performances, see Gouyon, Pachet, and Delerue 2000) the negative correlation value decreases sharply to r = −0.826, as shown in Figure 5b. ´O Nuan ´ain et al. 31 Listener Evaluation Observing that the algorithm tends to reproduce labels in a predictable fashion, we sought to establish whether this conforms in reality to what a human listener perceives. An online listener survey was conducted using the same generated loops and targets from the retrieval evaluation. Twenty-one participants completed the survey, drawn mostly from music researchers and students from the institutions of Universitat Pompeu Fabra and the Escola Superior de M ´usica de Catalunya in Barcelona, as well as friends with an interest in music. Twenty out of those indicated that they played an instrument, with nine specifying an instrument from the percussion family. The participants were requested to audition a target loop and each of the generated loops in succession. They were then asked to rate, on a Likert scale of 1 to 5, the similarity of the generated loop to the target in terms of their timbre (i.e., do the kick drums, snares, and hi-hats sound alike?) as well as the rhythmic structure of the pattern (i.e., is the arrangement and placement of the sounds similar?). We also asked them to rate their aesthetic preference for the generated patterns, to determine any possible correlation with similarity. Survey results were collated and analyzed using Spearman’s rank correlation, comparing the mode of the participants’ responses with the distance value of each loop. A moderate-to-strong negative correlation pattern emerged for all of the variables under consideration, namely, r = −0.66 for pattern similarity, r = −0.59 for timbral similarity, and r = −0.63 for their personal preference according to similarity (with significance levels of p < 0.01 in all instances). It should be evident that the listeners’ judgments reflect what the results unearthed in the retrieval evaluation. User Evaluation ing of evaluative scrutiny is the users’ experience of working with the software: gauging their responses to the interface, its modes of interactions, and its rel- evance and suitability for their own compositional styles and processes. To this effect, a qualitative evaluation phase was arranged to gather rich descriptive impressions from related groups of users in Barcelona and Berlin during February 2016. In Barcelona, as with user profiles of the listener survey, most of the participants were researchers or students in the broad area of Sound and Music Computing. In Berlin we were able to gain access to artists involved in the Red Bull Music Academy as well as with employees of the music software company Native Instruments. In broad terms, the overall sense of people’s impressions was positive. Many participants were initially attracted to the visual nature of the soft- ware and were curious to discover its function and purpose. After some familiarization with its operation, people also remarked positively on its sonic output and ability to replicate the target loop: “It’s an excellent tool for making small changes in real time. The interface for me is excellent. This two-dimensional arrangement of the different sounds and its situation by familiarity, it’s also really good for making these changes.” “I’m really interested in more-visual, more- graphical interfaces. Also, the fact that you can come up with new patterns just by the push of a button is always great.” “It’s inspiring because this mix makes some- thing interesting still, but also I have the feeling I can steal it.” “The unbelievable thing is that it can create something that is so accurate. I wouldn’t believe that it’s capable of doing such a thing.” The quantitative evaluation demonstrated the pre- dictive performance of the algorithm based on retrieval accuracy and the corresponding listeners’ judgments of similarity and likeness. Equally deserv- Some of the negative criticism came from the prototypical nature of the instrument, and some users were not comfortable with its perceived indeterminacy: 32 Computer Music Journal l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o m j / l a r t i c e - p d f / / / / 4 1 2 2 1 1 8 5 6 3 9 3 / c o m _ a _ 0 0 4 1 2 p d . j f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 “It was too intense and also [had] a prototype feeling. So I was like, ‘Well, it’s cool and very interesting but not usable yet.’ ” “Right now it’s still hard to find your way around, but that’s something you can refine pretty easily.” Usage Scenarios Participants were asked to consider how they would envisage themselves using the software. Most of them concurred that its strength would be in supporting musicians in their production and compositional workflows. Some users were curious about using it in live contexts, such as continuous analysis of instrumental performance or beat-boxing assistance: “This is great! Ah, but wait . . . Does it mean I could, like, beat box really badly some idea that I have . . . and then bring my samples, my favorite kits and then it will just work?” Continuous recording and analysis is within the realm of possibility, but can potentially be an operation that is prohibitively computationally expensive, depending on the granularity of the beat grid and the size of the corpus. Further benchmarking and tests are required to establish the upper bounds. Another interesting observation was that many users did not want to start with a target, preferring to use the instrument as a new, systematic method of exploring their existing sounds: “I’ve got this fully on wet straight away, which tells you the direction I’d be going with it.” “You just want to drag in a hundred different songs and you just want to explore without having this connection to the original group. Just want to explore and create sound with it.” adapt the new visual paradigm, they still felt the need for a linear waveform to aid their comprehension. Because of this feedback, the waveform view was implemented early on in our development, as is evident in its inclusion in Figure 2. “It’s a bit hard to figure out which sixteenth you are looking for, because you are so used to seeing this as a step grid.” ‘You have a waveform or something. . . Then I know, okay, this is the position that I’m at.” “Is there also a waveform place to put the visualization? People are so used to having that kind of thing.” Shaping Sounds A recurring issue, which cropped up mainly with producers and DJs, was the desire to shape, process, and refine the sounds once a desirable sequence was generated by the system. This way of composing seems emblematic of electronic music producers across the board; they start with a small loop or idea then vary and develop it exploiting the many effects processing and editing features provided by their tools. Most crucially, they desired the option to be able to control the envelopes of the individual units via drawable attack and decay parameters, which is currently being implemented. “. . . an attack and decay just to sort of tighten it up a little bit . . . get rid of some of the rough edges of the onsets and offsets.” “It would be great if you could increase the decay of the snare, for example. Which, if it’s prototype, you can’t expect to have all those functions there immediately, but in an end product, I think it would be a necessity.” Traditional Forms of Navigation Our original intention was for users to solely be able to arrange their patterns through the 2-D timbre space. Through the course of our discussions with users we learned that, although they were eager to Parameterization and Visualization The most overarching source of negative criti- cism from all users was in how we presented the parameters of the system. Users are freely able to manipulate the individual weightings of the features, affecting their relative influence in ´O Nuan ´ain et al. 33 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o m j / l a r t i c e - p d f / / / / 4 1 2 2 1 1 8 5 6 3 9 3 / c o m _ a _ 0 0 4 1 2 p d . j f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 the similarity computation, but also in the PCA dimensional-reduction stage. In an effort to make this more “user friendly,” we relabeled the fea- ture names with more generally comprehensible terms like “timbre,” “brightness,” “harmonicity,” and “loudness.” Despite this, participants reported being confused and overwhelmed by this level of control, stating that they were a “a bit lost already,” that there are “four parameters, and you don’t know which thing is what,” and that they “would prefer not to have too many controls.” Most users were quite content with the overall sonic output from the system without delving into the manipulation of feature parameters. For the visualization, however, there are certain configura- tions of the features that produce the best separation and clustering of the units (although MFCCs alone appear to be the most robust in our experience). One option we are actively investigating would be to remove these parameter sliders and replace them with an optional “advanced” mode, giving users the ability to select specific parameters for the axes (as in CataRT) in addition to “automatic” arrangement configurations made possible by using dimensionality-reduction techniques. These con- figurations could be derived by analyzing different sound sets to find weighting combinations that give the best visual separation, depending on the corpus provided. Finally, we are currently using PCA for dimensionality reduction. There are also other approaches, including multidimensional scaling (Donaldson, Knopke, and Raphael 2007) and the recent t-distributed stochastic neighbor embedding algorithm (Frisson 2015; Turquois et al. 2016), which have been used in musically related tasks and that we are implementing and evaluating as alternatives. Discussion Evaluating systems for music creation and manip- ulation is a difficult, ill-defined, and insufficiently reported task. As we have stressed in the course of this article, this is also the case with systems for concatenative synthesis. After conducting our own evaluation, we considered what key points could be made to help inform future evaluations by interested researchers in the community. Our observations led us to indicate three distinct layers that should be addressed for a significant, full-fledged appraisal. The most high-level and general “system” layer calls for user evaluations that go beyond “quality of experience” and “satisfaction” surveys. Such evalu- ations should strive to address creative productivity and workflow efficiency aspects particular to the needs of computer-music practitioners. At the mid-level “algorithmic” layer, we examine the mechanics of developing solutions strategies for concatenative synthesis. We have identified three main trends in algorithmic techniques used for tackling tasks in concatenative synthesis, namely, similarity-matrix and clustering approaches (like ours), Markov models, and constraint-satisfaction problems. Each of these techniques exhibits its own strengths and weaknesses in terms of accuracy, flexibility, efficiency, and complexity. Comparing these algorithms within a single system and, indeed, across multiple systems, using a well-defined data set, a clear set of goals, and specific success criteria would represent a valuable asset in the evaluation methodology of concatenative synthesis. Additionally, we should pay attention to the distance and similarity metrics used, as there are other possibilities that are explored and compared in other retrieval problems (e.g., Charulatha, Rodrigues, and Chitralekha 2013). At the lowest level, the focus is on the broader implications related to MIR of choosing appropriate features for the task at hand. In the course of our evaluation, we chose the features indicated in the implementation and did not manipulate them in the experiment. There are, of course, many other features relevant to the problem that can be studied and estimated in a systematic way, as is par for the course in classification experiments in MIR. Furthermore, tuning the weights was not explored and is an important consideration that depends greatly on different corpora and output-sequence requirements. In addition to this three-tiered evaluation method- ology, an ideal component would be the availability of a baseline or comparison system that ensures new prototypes improve over some clearly identifiable aspect. Self-referential evaluations run the risk of 34 Computer Music Journal l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o m j / l a r t i c e - p d f / / / / 4 1 2 2 1 1 8 5 6 3 9 3 / c o m _ a _ 0 0 4 1 2 p d . j f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 confirming experimenter bias without establishing comprehensive criticism. References Conclusion In this article, we explored concatenative synthesis as a compositional tool for generating rhythmic patterns for electronic music, with a strong empha- sis on its role in EDM musical styles. One of our first contributions was to present a thorough and up-to-date review of the state of the art, beginning with its fundamental algorithmic underpinnings and proceeding to modern systems that exploit new and experimental visual and interactive modalities. Although there are a number of commercial applica- tions that encapsulate techniques of concatenative synthesis for the user, the vast majority of systems are frequently custom-built for the designer or are highly prototypical in nature. Consequently, there is a marked lack of evaluation strategies or reports of user experiences in the accompanying literature. Based on these investigations, we set out to design a system that applied and extended many of the pervasive techniques in concatenative synthesis with a clear idea of its application and its target user. We built an instrument that was easily integrated with modern digital audio workstations and presented an interface that intended to be attractive and easy to familiarize oneself with. How to evaluate the system, not only in terms of its objective performance but also in its subjective aural and experiential implications for our users, was our final substantial contribution to this area. The results of our evaluations showed that our system performed as expected, and users were positive about its potential for assisting in their creative tasks, while also proposing interesting avenues for future work and contributions. Resources A demonstration version of the software is available online at http://github.com/carthach/rhythmCAT. A video example can be viewed at http://youtu.be /hByhgF fzto. Aucouturier, J.-J., and F. Pachet. 2005. “Ringomatic: A Real-Time Interactive Drummer Using Constraint- Satisfaction and Drum Sound Descriptors.” Proceedings of the International Conference on Music Information Retrieval, pp. 412–419. Bart ´ok, B. 1993. “Hungarian Folk Music.” In B. Suchoff, ed. B ´ela Bart ´ok Essays. Lincoln: University of Nebraska Press, pp. 3–4. Battier, M. 2007. “What the GRM Brought to Music: From Musique Concr `ete to Acousmatic Music.” Organized Sound 12(3):189–202. Bello, J., and L. Daudet. 2005. “A Tutorial on Onset Detection in Music Signals.” IEEE Transactions on Audio, Speech, and Language Processing 13(5):1035– 1047. Bernardes, G., C. Guedes, and B. Pennycook. 2013. “EarGram?: An Application for Interactive Exploration of Concatenative Sound Synthesis in Pure Data.” In M. Aramaki et al., eds. From Sounds to Music and Emotions. Berlin: Springer, pp. 110–129. B ¨ock, S., and G. Widmer. 2013. “Maximum Fil- ter Vibrato Suppression for Onset Detection.” In Proceedings of the International Conference on Digital Audio Effects. Available online at dafx13.nuim.ie/papers/09.dafx2013 submission 12.pdf. Accessed January 2017. Bogdanov, D., et al. 2013. “ESSENTIA: An Audio Analysis Library for Music Information Retrieval.” In Pro- ceedings of the International Conference on Music Information Retrieval, pp. 493–498. Bradski, G. 2000. “The OpenCV Library.” Dr. Dobb’s Journal 25(11):120–125. Brent, W. 2010. “A Timbre Analysis and Classifica- tion Toolkit for Pure Data.” In Proceedings of the International Computer Music Conference, pp. 224– 229. Cardle, M., S. Brooks, and P. Robinson. 2003. “Audio and User Directed Sound Synthesis.” Proceedings of the International Computer Music Conference, pp. 243–246. Charulatha, B., P. Rodrigues, and T. Chitralekha. 2013. “A Comparative Study of Different Distance Metrics That Can Be Used in Fuzzy Clustering Algorithms.” International Journal of Emerging Trends and Tech- nology in Computer Science. Available online at www.ijettcs.org/NCASG-2013/NCASG 38.pdf. Ac- cessed January 2017. Coleman, G. 2015. “Descriptor Control of Sound Transfor- mations and Mosaicing Synthesis.” PhD dissertation, ´O Nuan ´ain et al. 35 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o m j / l a r t i c e - p d f / / / / 4 1 2 2 1 1 8 5 6 3 9 3 / c o m _ a _ 0 0 4 1 2 p d . j f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Universitat Pompeu Fabra, Department of Information and Communication Technologies, Barcelona. Dixon, S. 2006. “Onset Detection Revisited.” In Proceed- ings of the International Conference on Digital Audio Effects, pp. 133–137. Hunt, A. J., and A. W. Black. 1996. “Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database.” In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 373–376. Donaldson, J., I. Knopke, and C. Raphael. 2007. “Chroma Jehan, T. 2005. “Creating Music by Listening.” PhD Palette?: Chromatic Maps of Sound as Granular Synthesis Interface.” In Proceedings of the Conference on New Interfaces for Musical Expression, pp. 213–219. Frisson, C. 2015. “Designing Interaction for Browsing Media Collections (by Similarity).” PhD dissertation, Universit ´e de Mons, Faculty of Engineering. Frisson, C., C. Picard, and D. Tardieu. 2010. “Audiog- arden?: Towards a Usable Tool for Composite Audio Creation.” QPSR of the Numediart Research Program 3(2):33–36. Gouyon, F., F. Pachet, and O. Delerue. 2000. ”On the Use of Zero-Crossing Rate for an Application of Classification of Percussive Sounds.” In Proceedings of the International Conference on Digital Audio Effects, pp. 3–8. Greenberg, C. 1971. “Collage.” In Art and Culture: Critical Essays. Boston, Massachusetts: Beacon Press, pp. 70–83. Hackbarth, B., N. Schnell, and D. Schwarz. 2011. “Au- dioGuide?: A Framework for Creative Exploration of Concatenative Sound Synthesis.” IRCAM Re- search Report. Available online at articles.ircam.fr /textes/Hackbarth10a/index.pdf. Accessed January 2017. Herrera, P., A. Dehamel, and F. Gouyon. 2003. “Au- tomatic Labeling of Unpitched Percussion Sounds.” In Proceedings of the 114th Audio Engineering So- ciety Convention. Available online at www.aes.org /e-lib/browse.cfm?elib=12599 (subscription required). Accessed January 2017. Holmes, T. 2008. Electronic and Experimental Music. Abingdon-on-Thames, UK: Routledge. Hoskinson, R., and D. Pai. 2001. “Manipulation and Resynthesis with Natural Grains.” In Proceedings of the International Computer Music Conference, pp. 338–341. Hoskinson, R., and D. K. Pai. 2007. “Synthetic Sound- scapes with Natural Grains.” Presence: Teleoperators and Virtual Environments 16(1):84–99. Humphrey, E. J., D. Turnbull, and T. Collins. 2013. “A Brief Review of Creative MIR.” In Proceedings of the International Conference on Music Information Retrieval. Available online at ismir2013.ismir.net /wp-content/uploads/2014/02/lbd1.pdf. Accessed Jan- uary 2017. dissertation, Massachusetts Institute of Technology, Media Arts and Sciences. Knees, P., et al. 2016. “The GiantSteps Project: A Second- Year Intermediate Report.” In Proceedings of the International Computer Music Conference, pp. 363– 368. Lochhead, J. I., and J. H. Auner. 2002. Postmodern Music/- Postmodern Thought: Studies in Contemporary Music and Culture. Abingdon-on-Thames, UK: Routledge. Logan, B. 2000. “Mel Frequency Cepstral Coefficients for Music Modeling.” In Proceedings of the International Symposium on Music Information Retrieval. Available online at ismir2000.ismir.net/papers/logan paper.pdf. Accessed January 2017. Loughran, R., et al. 2004. “The Use of Mel-Frequency Cepstral Coefficients in Musical Instrument Identifi- cation.” In Proceedings of the International Computer Music Conference, pp. 42–43. Mangani, M., R. Baldizzone, and G. Nobile. 2006. “Quo- tation in Jazz Improvisation: A Database and Some Examples.” Paper presented at the International Con- ference on Music Perception and Cognition, August 22–26, Bologna, Italy. Masri, P., and A. Bateman. 1996. “Improved Modeling of Attack Transients in Music Analysis-Resynthesis.” In Proceedings of the International Computer Music Conference, pp. 100–103. McLeod, K. 2009. “Crashing the Spectacle: A Forgotten History of Digital Sampling, Infringement, Copyright Liberation and the End of Recorded Music.” Culture Machine 10:114–130. Metzer, D. 2003. Quotation and Cultural Meaning in Twentieth-Century Music. Cambridge: Cambridge University Press. Miller, P. D. 2008. Sound Unbound: Sampling Digital Music and Culture. Cambridge, Massachusetts: MIT Press. O’Connell, J. 2011. “Musical Mosaicing with High Level Descriptors.” Master’s thesis, Universitat Pompeu Fabra, Sound and Music Computing, Barcelona. ´O Nuan ´ain, C., P. Herrera, and S. Jord `a. 2016. “An Evaluation Framework and Case Study for Rhythmic Concatenative Synthesis.” In Proceedings of the International Society for Music Information Retrieval Conference, pp. 67–72. 36 Computer Music Journal l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o m j / l a r t i c e - p d f / / / / 4 1 2 2 1 1 8 5 6 3 9 3 / c o m _ a _ 0 0 4 1 2 p d . j f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 ´O Nuan ´ain, C., S. Jord `a, and P. Herrera. 2016a. “An In- teractive Software Instrument for Real-Time Rhythmic Concatenative Synthesis.” In Proceedings of the In- ternational Conference on New Interfaces for Musical Expression, pp. 383–387. ´O Nuan ´ain, C., S. Jord `a, and P. Herrera. 2016b. “Towards User-Tailored Creative Applications of Concatena- tive Synthesis in Electronic Dance Music.” In Pro- ceedings of the International Workshop on Musical Metacreation. Available online at musicalmetacre- ation.org/buddydrive/file/nuanain towards. Accessed January 2017. Oswald, J. 1985. “Plunderphonics, or Audio Piracy as a Compositional Prerogative.” Paper presented at the Wired Society Electro-Acoustic Conference, Toronto, Canada. Reprinted in Musicworks, Winter 1986, 34:5–8. Peeters, G. 2004. “A Large Set of Audio Features for Sound Description (Similarity and Classification) in the CUIDADO Project.” IRCAM Project Report. Available online at recherche.ircam.fr/anasyn/peeters /ARTICLES/Peeters 2003 cuidadoaudiofeatures.pdf. Accessed January 2017. Rabiner, L. R. 1989. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” Proceedings of the IEEE 77(2):257–286. Roads, C. 2004. Microsound. Cambridge, Massachusetts: MIT Press. Rodgers, T. 2003. “On the Process and Aesthetics of Sampling in Electronic Music Production.” Organized Sound 8(3):313–320. Roy, P., F. Pachet, and S. Krakowski. 2007. “Analytical Features for the Classification of Percussive Sounds: The Case of the Pandeiro.” In Proceedings of the International Conference on Digital Audio Effects, pp. 213–220. Russell, S., and P. Norvig. 2009. Artificial Intelligence: A Modern Approach. Upper Saddle River, New Jersey: Prentice Hall. Schwarz, D., et al. 2006. “Real-Time Corpus-Based Concatenative Synthesis with CataRT.” In Proceedings of the International Conference on Digital Audio Effects, pp. 18–21. Sewell, A. 2014. “Paul’s Boutique and Fear of a Black Planet: Digital Sampling and Musical Style in Hip Hop.” Journal of the Society for American Music 8(1):28–48. Sturm, B. L. 2004. “Matconcat: An Application for Explor- ing Concatenative Sound Synthesis Using Matlab.” In Proceedings of the International Conference on Digital Audio Effects, pp. 323–326. Sturm, B. L. 2006. “Adaptive Concatenative Sound Synthe- sis and Its Application to Micromontage Composition.” Computer Music Journal 30(4):46–66. Tamagawa, K. 1988. “Echoes from the East: The Javanese Gamelan and Its Influence on the Music of Claude Debussy.” DMA dissertation, University of Texas at Austin. Thompson, L., S. Dixon, and M. Mauch. 2014. “Drum Transcription via Classification of Bar-Level Rhythmic Patterns.” In Proceedings of the International Society for Music Information Retrieval Conference, pp. 187– 192. Tindale, A., et al. 2004. “Retrieval of Percussion Gestures Using Timbre Classification Techniques.” Proceedings of the International Conference on Music Information Retrieval, pp. 541–545. Turquois, C., et al. 2016. “Exploring the Benefits of 2D Visualizations for Drum Samples Retrieval.” In Proceedings of the ACM SIGIR Conference on Human Information Interaction and Retrieval, pp. 329– 332. Tzanetakis, G., G. Essl, and P. Cook. 2001. “Automatic Musical Genre Classification of Audio Signals.” In Proceedings of the International Symposium on Music Information Retrieval, pp. 293–302. Xenakis, I. 1971. Formalized Music. Bloomington: Indiana Schwarz, D. 2003. “The Caterpillar System for Data- University Press. Driven Concatenative Sound Synthesis.” In Proceedings of the International Conference on Digital Audio Effects, pp. 135–140. Schwarz, D. 2005. “Current Research in Concatenative Sound Synthesis.” In Proceedings of the International Computer Music Conference, pp. 9–12. Schwarz, D. 2017. “Corpus-Based Sound Synthesis Survey.” Available online at imtr.ircam.fr/imtr/Corpus -Based Sound Synthesis Survey. Accessed February 2017. Xiang, P. 2002. “A New Scheme for Real-Time Loop Music Production Based on Granular Similarity and Probability Control.” In Proceedings of the Interna- tional Conference on Digital Audio Effects, pp. 89– 92. Zils, A., and F. Pachet. 2001. “Musical Mosaicing.” In Proceedings of the International Conference on Digital Audio Effects, 1–6. Available online at www.csis.ul.ie/dafx01/proceedings/papers/zils.pdf. Accessed January 2017. ´O Nuan ´ain et al. 37 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o m j / l a r t i c e - p d f / / / / 4 1 2 2 1 1 8 5 6 3 9 3 / c o m _ a _ 0 0 4 1 2 p d . j f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3C ´arthach ´O Nuan ´ain, Perfecto image
C ´arthach ´O Nuan ´ain, Perfecto image
C ´arthach ´O Nuan ´ain, Perfecto image
C ´arthach ´O Nuan ´ain, Perfecto image

PDF Herunterladen