Akira Maezawa, Katsutoshi Itoyama,
Kazunori Komatani, Tetsuya Ogata,
and Hiroshi G. Okuno
Department of Intelligence Science
and Technology
Kyoto University Graduate School of Informatics
Yoshida Honmachi
Sakyo, Kyoto 606-8501, Japan
akira maezawa@gmx.yamaha.com
itoyama@kuis.kyoto-u.ac.jp
komatani@nuee.nagoya-u.ac.jp
{ogata, okuno}@i.kyoto-u.ac.jp
Automated Violin Fingering
Transcription Through
Analysis of an Audio
Recording
Abstract: We present a method to recover fingerings for a given piece of violin music in order to recreate the timbre of
a given audio recording of the piece. This is achieved by first analyzing an audio signal to determine the most likely se-
quence of two-dimensional fingerboard locations (string number and location along the string), which recovers elements
of violin fingering relevant to timbre. This sequence is then used as a constraint for finding an ergonomic sequence
of finger placements that satisfies both the sequence of notated pitch and the given fingerboard-location sequence.
Fingerboard-location-sequence estimation is based on estimation of a hidden Markov model, each state of which
represents a particular fingerboard location and emits a Gaussian mixture model of the relative strengths of harmonics.
The relative strengths of harmonics are estimated from a polyphonic mixture using score-informed source segregation,
and discrepancies between observed data and training data are compensated for through mean normalization.
Fingering estimation is based on the modeling of a cost function for a sequence of finger placements. We tailor our
model to incorporate the playing practices of the violin.
We evaluate the performance of the fingerboard-location estimator with a polyphonic mixture, and with recordings
of a violin whose timbral characteristics differ significantly from those of the training data. We subjectively evaluate
the fingering estimator and validate the effectiveness of tailoring the fingering model towards the violin.
In musical instrument performance, deciding the
sequence of finger placements needed to produce a
given sequence of pitches, known as the fingering,
is an important and sometimes difficult problem
that musicians need to solve. Fingering decisions
can be difficult because the fingering must be
both musical and ergonomic, two often-conflicting
ideals. For example, the “easiest” fingering on a
violin often involves unmusically abrupt changes of
timbre. This is because most pitches can be found
in more than one location on the instrument (i.e.,
on more than one string). Each string, however, has
a different timbre (for reasons we explain later), and
often the easiest fingering involves changes between
strings. An experienced musician would choose a
balanced fingering that not only satisfies ergonomic
finger placements but also expresses the musician’s
artistic values.
Computer Music Journal, 36:3, pp. 57–72, Fall 2012
© 2012 Massachusetts Institute of Technology.
The essence of violin fingering resides in finding
both an ergonomic sequence of finger placements
and a musically appropriate fingerboard-location
sequence, the sequence of locations on the fingerboard
on which the finger presses the string. Each
fingerboard location specifies both the longitudinal
location along the string and the latitudinal position,
i.e., which of the four strings is played. Each
string is tuned differently and has a distinct timbre.
Therefore, it is essential for a violinist to choose
a fingerboard-location sequence that sounds musically
well-motivated. For example, in Air on the G
String, August Wilhelmj’s well-known arrangement
of the second movement of J. S. Bach’s Orchestral
Suite No. 3, the arranger specifies the entire solo
violin part to be played on one string (the G string),
most likely to maintain the consistent, warm timbre
that is characteristic of the G string.
This study aims to develop a method for analyzing
an audio recording of a violin and estimating the
fingering required to recreate the “sound” of a
particular artist’s performance. Such a method
would allow a beginner, for example, to analyze the
recordings of past masters and gain insights on how
to imitate them. It could also help violin students
to appreciate different musical values by comparing
fingerings played by different musicians. Finding
similarities of fingering among musicians would
allow one to find a stylistically suitable way to play
a particular piece, a difficult task for a violinist
studying alone without a teacher’s help.
Our goal requires (1) analyzing an audio signal to
estimate the fingerboard location sequence, and (2)
choosing an ergonomic finger placement sequence
that satisfies the sequence of notated notes and the
estimated fingerboard location. Existing methods
in fingering estimation do not focus on timbral
differences caused by different fingerboard locations,
however, and existing methods for estimating the
fingerboard-location sequence need robustness in a
polyphonic mixture.
Most studies in fingering-estimation research
intend not to retrieve a fingering from a particular
recording, but rather to determine an ergonomic
fingering. In these studies, the general framework
involves designing a cost function for moving the
hand from one shape to another, and finding a
sequence of hand movements that minimizes the
accrued cost. In the seminal work by Sayegh (1989),
the cost uses the distance that the hand traverses
horizontally and vertically across the instrument.
Other work involves similar kinds of costs based
on the actual distance that the hand needs to
traverse based on different constraints posed by
different instruments, such as the guitar (Radicioni,
Anselma, and Lombardo 2004) or the piano (Kasimi,
Nichols, and Raphael 2007; Yonebayashi, Kameoka,
and Sagayama 2007). The costs may be given a
priori or learned using training data (Radisavljevic
and Driessen 2004). The formulation of fingering
estimation as a cost-minimization problem has been
given a probabilistic interpretation using hidden
Markov models (HMM; Yonebayashi, Kameoka, and
Sagayama 2007). These studies all seem to have the
perspective that musicality resides in the sequence
of notated notes, which overlooks the expression
added by the performer. Such an orientation implies
that given a symbolic representation of music, a
fingering may be determined on a one-to-one basis.
Each person, however, given a piece of music,
might play it using a different fingering based
on his or her musical values and physiological
constraints. Clearly, performance analysis based on
an audio signal or video is essential for realizing
our goal of reconstructing fingerings from recorded
performances.
Studies in detailed performance analysis fail when
the music in the audio recording is polyphonic, as
well as when the sound of the instrument that
needs to be analyzed is significantly different
from the training data. In practical situations,
however, it is essential to be able to analyze
a musical phrase within a polyphonic mixture.
Methods for analyzing audio spectra to determine
the control input for the violin (Krishnaswamy
and Smith 2003; Barbancho 2009) or the guitar
(Traube and Smith 2000) do not work well on
instruments with acoustic characteristics other than
those used in training, or make highly restrictive
assumptions on how the instrument is played.
Greater accuracy is reported using audiovisual
fusion (Zhang, Zhu, and Leow 2007; Lu et al. 2008).
Some studies use only visual information to analyze
the fingering of a guitar (Burns and Wanderley
2006). In either case, the necessity of video severely
limits the kind of musical recordings that can be
analyzed.
In this article, we develop a method to recover
the fingering from an audio recording of a violin in a
polyphonic mixture by analyzing the audio recording
and finding an ergonomic fingering that captures
the particular recording’s timbre as expressed in a
specific fingerboard-location sequence. Our method
is a two-step procedure. First, it analyzes the audio
signal to estimate the fingerboard location sequence.
Second, it determines an ergonomic fingering that
satisfies the fingerboard-location sequence. When
analyzing the fingerboard-location sequence, we use
features that are robust in a polyphonic mixture.
Moreover, the method uses a feature-adaptation
mechanism to improve the robustness when the
instrument sounds are different from the ones used
for training. For the ergonomic fingering decision,
we incorporate a cost function that reflects practices
of violin playing (which is not necessarily applicable
to other stringed instruments such as the guitar
or the cello). We evaluate the performance of the
fingerboard-location estimator and the fingering-
decision method.
Figure 1. Violin fingering.
Violin Fingering: A Primer
We shall briefly review the fundamentals of violin
fingering and its terminology, as these play impor-
tant roles in understanding our method. The left
hand defines the pitch and the string on which a
note is played. Although there are two defining
factors in fingering (i.e., finger and string), violinists
often talk about fingering also in terms of the left
hand position, i.e., the general placement of the
left hand required to play a given note on a certain
string using a certain finger. As shown in Figure 1,
in this article we associate the string on which a
note is played with the variable s, where s = 1 is the
lowest-tuned string and s = 4 is the highest. These
strings are conventionally tuned seven semitones
apart, where s = 3 is typically tuned to A4 = 440 Hz.
The strings are typically notated in a music score
using a Roman numeral, where the lowest string
(s = 1) is associated with IV, and the highest (s =
4) with I. Placed fingers are labeled by numbers,
where 1 refers to the index finger, 2 the middle, 3
the ring, and 4 the little finger. An open string (i.e.,
no finger is pressed and the string vibrates at the
tuned pitch) is referred to by 0; only four pitches on a
violin can be played as an open string. In this article,
the position is defined by the number of semitones
that the first finger must traverse from the nut (the
ridge over which the string passes on the end of the
fingerboard near the tuning pegs) in order to play a
particular fingering.
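To make the tuning convention concrete, the following minimal Python sketch lists the strings on which a given pitch could be played. The MIDI numbers follow from the text (s = 3 tuned to A4, adjacent strings seven semitones apart). The function name and the `max_reach` limit on how far up a string a player can go are illustrative assumptions, not values from this article.

```python
# Open-string pitches as MIDI note numbers: s = 3 is A4 (MIDI 69),
# and adjacent strings are seven semitones apart.
OPEN_STRING_MIDI = {s: 69 + 7 * (s - 3) for s in (1, 2, 3, 4)}  # G3, D4, A4, E5


def candidate_locations(midi_pitch, max_reach=24):
    """Return (string, semitones above the open string) pairs for a pitch.

    max_reach is a hypothetical upper limit on the usable string length,
    expressed in semitones; it is an assumption made for this sketch.
    """
    locations = []
    for s, open_pitch in OPEN_STRING_MIDI.items():
        offset = midi_pitch - open_pitch
        if 0 <= offset <= max_reach:
            locations.append((s, offset))
    return locations


if __name__ == "__main__":
    # A4 is the open third string, but it can also be stopped on s = 2 and s = 1.
    print(candidate_locations(69))
```

Each returned pair is one fingerboard location in the sense used here: a string number and a distance along that string.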
Finger placement is determined by considering
technical ease and musical effect. Using a certain
finger (e.g., the index finger instead of the little
finger) facilitates execution of some musical effects
related to pitch, such as a smooth transition between
two notes (glissando), or a low-frequency modulation
of a note (vibrato). At the same time, some
sequences of finger placement are easier to execute
than others. For example, rapid movement of the
little finger is considerably more tiresome than that
of the index finger. The choice of the string on which
to play a given note is determined by considering the
consistency of sonority. For example, because each
string has a distinct sonority, violinists often play
on one string to prevent abrupt changes of the sound
quality. The difference in sonority when a given
pitch is played on one string versus another results
from (1) the differences in the physical attributes of
each string itself, such as the diameter, tension, and
material; (2) differences in the position along the
string at which the finger must be placed (because
each string is tuned differently); and (3) in cases
when the pitch is available as an open string, the
difference between the rigid termination provided
by the nut and the soft termination provided by the
finger.
The choice of string is the one factor in fingering
that is most likely to produce audible differences. By
contrast, the position is sometimes chosen for visual
effect and to demonstrate the violinist’s technical
skill. For example, the violinist may present a
“flashy” playing style through a wide change of
position.
The consistency of sonority offered by playing
on one string often forms a trade-off with the
consistency of position that facilitates playing. In
a fast piece, abrupt changes of sonority caused by a
certain fingering may be overlooked if it simplifies
playing. On the other hand, in a slow piece, a violinist
may choose a difficult fingering that produces
a certain sonority. A violinist may, for example,
value consistency of sonority in a slow, “singing”
(cantilena) passage by playing on one string.
Method
Our method is based on determining the fingerboard-
location sequence from a polyphonic audio mixture
that contains a musical phrase for solo violin, and
finding a sensible fingering that satisfies the esti-
mated fingerboard-location sequence. We assume
the violin plays a monophonic melody using normal
bowing technique (i.e., we do not consider extended
playing technique such as plucking [pizzicato],
striking the string with the wooden side of the bow
[col legno], or playing very close to the bridge to
produce a shimmering sound [sul ponticello]).
Figure 2. System block diagram.
Our task involves three aspects:
1. Designing a feature that performs well in a
polyphonic mixture.
2. Designing a scheme such that the perfor-
mance does not degrade from an acoustical
mismatch with the training data, whether
because of characteristics of the violins, the
room, or the recording process.
3. Designing a fingering model that reflects the
practices of violin performance.
We attack the first problem by fitting a sum-of-
sinusoids plus noise model to the observed spectrum
and using the estimated harmonic parameters as
the feature. The second problem is attacked by
normalizing the average features of the training
set and by doing the same for the recording whose
fingerboard-location sequence we would like to
identify. We solve the third problem by introducing a
new model of violin fingering, the violin pedagogical
model, which incorporates features that are inspired
by practices of violin playing. The system-level
block diagram is shown in Figure 2.
Fingerboard-Location Sequence Estimation
through Viterbi Alignment
In order to estimate a sensible fingering from the
audio, the fingerboard-location sequence must be
estimated from a polyphonic mixture of sound
that contains a violin melody. Because the string
on which a note is played (the bowed string) is
the primary factor that influences the timbre for
different fingerboard locations, we would like
to estimate the sequence of bowed strings. Our
estimation method is based on a bowed-string
classifier using features that are robust to polyphonic
accompaniment, and a classifier that takes into
account the playability of a particular fingerboard-
location sequence.
Feature Extraction by Harmonic-Model Fitting
We extract, from a polyphonic audio mixture that
contains a violin melody, the relative strengths of
the first N violin harmonics (partials), where N =
10, the first partial is the fundamental frequency,
and the others are overtones. The relative strengths
depend on the material property of the string,
the body resonance characteristics, and on how
the instrument is played (e.g., bow force, bow
velocity, and the contact point of the bow and
string [Cremer and Allen 1984; Fletcher and
Rossing 1998]). The feature extraction involves
fitting a sum of harmonically spaced sinusoids
onto the observed short-time Fourier transform
representation of the input audio signal, X(f, t),
where f is the frequency bin and t is the time index.
Because the input contains not only the harmonic
sound of the violin but also transients of the violin
and accompaniments, the harmonic sound of the
violin part must be segregated. This is achieved by
generating a time-frequency mask that passes the
harmonic component of the violin part, based on the
observed spectrogram, the music score given as a
standard MIDI file (SMF), and the mapping between
positions in the SMF and the audio signal (audio-score
alignment). As a signal-processing front end,
we attenuate the DC component and emphasize the
high-frequency component by applying a two-tap,
high-pass filter with filter coefficients [1, −1]. Then
our method iterates the following steps for a fixed
number of iterations:
1. Update the time-frequency mask to segregate
the harmonic components of the violin part,
using the audio-score alignment and the
estimated fundamental frequency as the cue.
2. Estimate the fundamental frequency of
the violin melody, taking into account the
pitch notated in the score, the observed
spectrogram, and the audio-score alignment.
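The front-end filter that precedes these two steps is simple enough to write out. The following minimal sketch applies the two-tap filter with coefficients [1, −1] to a list of samples. Treating the sample before the start of the signal as zero is an assumption of this sketch, and the function name is illustrative.

```python
def pre_emphasis(signal):
    """Apply the two-tap high-pass filter [1, -1]: y[n] = x[n] - x[n-1].

    The sample before the start of the signal is taken to be zero,
    which is an assumption of this sketch.
    """
    output = []
    previous = 0.0
    for sample in signal:
        output.append(sample - previous)
        previous = sample
    return output


if __name__ == "__main__":
    # A constant (DC) input is removed after the first sample,
    # whereas an alternating input is doubled in amplitude.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
    print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))
```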
In the first step, we generate a time-frequency
mask, which approaches 1 for time-frequency bins
that contain the violin sound and 0 otherwise. We
incorporate the musical score information (from
SMF) and the audio-score alignment to find which
bins are expected to contain the sound of the violin.
Audio-score alignment is determined by extracting
a time-sequence of chroma values from both the
audio and the SMF, and performing dynamic time-
warping between the chroma representation of the
audio and SMF to find the optimal path, using the
cosine distance as the metric between the audio
and the score, similar to the existing work of Hu,
Dannenberg, and Tzanetakis (2003).
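The alignment step can be sketched as a textbook dynamic time-warping routine over two chroma sequences with the cosine distance as the local cost. This is a generic illustration, not the authors' implementation. The step pattern (diagonal, vertical, horizontal with equal weights) and the handling of all-zero chroma vectors are assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dtw_path(audio_chroma, score_chroma):
    """Return the minimum-cost alignment as (audio_frame, score_frame) pairs."""
    n, m = len(audio_chroma), len(score_chroma)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(audio_chroma[i - 1], score_chroma[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the final cell to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda step: step[0])
    path.reverse()
    return path
```

The returned path maps each audio frame to a score frame, from which the notated pitch at every time point can be read off.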
Audio-score alignment gives the time-sequence of
the notated pitch of the violin part, ˆf0(t), for all time
points t. Given this, we simultaneously segregate
the violin part and estimate the fundamental
frequency of the violin part, f0(t). Our idea, similar
to Kameoka’s method (Kameoka, Nishimoto, and
Sagayama 2007), is to apply a mask that resembles
a comb filter to the spectrum centered about ˆf0(t)
to segregate the violin part. Using the segregated
signal, we re-estimate the fundamental frequency.
We then re-apply to the original spectrum the mask
with the updated fundamental frequency. We iterate
these two steps of fundamental frequency update
and violin-part segregation until the fundamental
frequency converges.
We first segregate the violin part by separating
the signal into two sub-signals—the violin part
and the residual (i.e., the rest of the signal). We
assume that the likelihood of observing the violin
part at frequency f is $h_v(f \mid f_0(t))$, and is of the form
$\sum_{n=1}^{N} \pi_n \mathcal{N}(f \mid n f_0(t), \sigma_f^2)$, and we assume that the
likelihood of observing the residual is $h_R(f)$ and is of the
form $1/F$, where F is the number of frequency bins to
consider. $\pi_n$ is a multinomial variable that indicates
the relative strength of the nth partial. $\mathcal{N}(x \mid \mu, \sigma^2)$
is the likelihood of x for a normal distribution
with mean μ and variance σ². We assume that the
likelihood of observing the violin part is α, and the
likelihood of observing the residual is 1 − α. For each
frequency bin, we associate a latent variable $Z_V$,
which indicates whether the violin or the residual
contributed to the observed power. $Z_V(f, t) = 1$
indicates that at time t, the power contained in
frequency f originated from the violin part, and
$Z_O(f, t, n) = 1$ indicates that, of frames generated
from the violin, it originated from the nth partial of
the violin part. The joint likelihood of the observed
signal and the latent variable is given as follows:
$$
p(X(f,t), Z(f,t) \mid f_0(t), \alpha, \pi) = \prod_{f,t} \prod_{n=1}^{N} \bigl((1-\alpha)\, h_R(f)\bigr)^{X(f,t)\,(1 - Z_V(f,t))} \bigl(\alpha\, \pi_n\, \mathcal{N}(f \mid n f_0(t), \sigma_f^2)\bigr)^{X(f,t)\, Z_V(f,t)\, Z_O(f,t,n)}
$$
We optimize this model using the expectation-
maximization (EM) algorithm, in a manner similar
to the work of Kameoka, Nishimoto, and Sagayama
(2007). In the E-step, we use the parameter from the
previous step to find the distribution of Z given X
and the parameters. We assign the following:
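$$
Q_O(f,t,n) = p(Z_n \mid X, f_0, \alpha, \pi) = \frac{\pi_n\, \mathcal{N}(f \mid n f_0(t), \sigma_f^2)}{\sum_{\tilde{n}} \pi_{\tilde{n}}\, \mathcal{N}(f \mid \tilde{n} f_0(t), \sigma_f^2)}
$$
$$
Q_V(f,t) = \frac{\alpha(t) \sum_{n} \pi_n\, \mathcal{N}(f \mid n f_0(t), \sigma_f^2)}{(1-\alpha(t))\, h_R(f) + \alpha(t) \sum_{n} \pi_n\, \mathcal{N}(f \mid n f_0(t), \sigma_f^2)}
$$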
Figure 3. Block diagram of feature extraction step.
Figure 4. Block diagram of the GMM.
In the M-step, we update the fundamental
frequency. We incorporate the knowledge that
the fundamental frequency does not deviate too
much from the notated pitch, by incorporating a
prior distribution of the fundamental frequency
as $p(f_0(t) \mid \hat{f}_0(t)) = \mathcal{N}(f_0(t) \mid \hat{f}_0(t), \sigma_{\hat{f}}^2)$. Then, the M-step
yields the following updates:
$$
f_0 := \frac{\sigma_f^2\, \hat{f}_0 + \sigma_{\hat{f}}^2 \sum_{n=1}^{N} \int X(f)\, Q_V(f,t)\, Q_O(f,t,n)\, n f \, df}{\sigma_f^2 + \sigma_{\hat{f}}^2 \sum_{\tilde{n}=1}^{N} \int X(f)\, Q_V(f,t)\, Q_O(f,t,\tilde{n})\, \tilde{n}^2 \, df}
$$
$$
\pi_n := \frac{\int X(f)\, Q_V(f,t)\, Q_O(f,t,n)\, df}{\sum_{\tilde{n}} \int X(f)\, Q_V(f,t)\, Q_O(f,t,\tilde{n})\, df}
$$
$$
\alpha(t) = \frac{\int X(f)\, Q_V(f,t)\, df}{\int X(f)\, df}
$$
A block diagram of this method is shown in
Figure 3.
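One iteration of the E-step and M-step for a single frame can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: integrals over frequency are replaced by sums over bins, the residual likelihood is the uniform 1/F, and the names `em_step`, `var_f` (the variance of each harmonic Gaussian), and `var_prior` (the variance of the prior on the fundamental frequency) are chosen for this sketch.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_step(freqs, power, f0, pi, alpha, f0_hat, var_f, var_prior):
    """One EM iteration for a single frame; returns updated (f0, pi, alpha)."""
    n_partials = len(pi)
    h_r = 1.0 / len(freqs)  # uniform residual likelihood, 1/F
    sum_nf = 0.0  # sum over f, n of X * Q_V * Q_O * n * f
    sum_nn = 0.0  # sum over f, n of X * Q_V * Q_O * n^2
    partial_mass = [0.0] * n_partials  # sum over f of X * Q_V * Q_O, per partial
    violin_mass = 0.0  # sum over f of X * Q_V
    for f, x in zip(freqs, power):
        # E-step: responsibilities of the violin (Q_V) and of each partial (Q_O).
        g = [pi[n] * normal_pdf(f, (n + 1) * f0, var_f) for n in range(n_partials)]
        total = sum(g)
        q_v = alpha * total / ((1.0 - alpha) * h_r + alpha * total)
        violin_mass += x * q_v
        if total == 0.0:
            continue  # no partial explains this bin; it contributes nothing below
        for n in range(n_partials):
            w = x * q_v * g[n] / total
            partial_mass[n] += w
            sum_nf += w * (n + 1) * f
            sum_nn += w * (n + 1) ** 2
    # M-step: closed-form updates, with the prior pulling f0 toward f0_hat.
    new_f0 = (var_prior * sum_nf + var_f * f0_hat) / (var_prior * sum_nn + var_f)
    mass = sum(partial_mass)
    new_pi = [m / mass for m in partial_mass]
    new_alpha = violin_mass / sum(power)
    return new_f0, new_pi, new_alpha
```

Calling `em_step` repeatedly with the score pitch as both the initial `f0` and `f0_hat` moves the estimate toward the harmonic comb that best explains the observed power, while the prior limits how far it can drift from the notated pitch.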
Finally, we take the logarithm of π(t) to enable
feature adaptation, as will be discussed. Furthermore,
we decorrelate the log-relative strength
by taking the discrete cosine transform. That is,
π(t) := DCT(log π(t)).
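The final feature transform can be sketched in a few lines. The article does not state which DCT variant or normalization is used, so the unnormalized DCT-II below is an assumption, as are the function names.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II: c[k] = sum_n v[n] * cos(pi * (n + 0.5) * k / N)."""
    size = len(values)
    return [
        sum(v * math.cos(math.pi * (n + 0.5) * k / size) for n, v in enumerate(values))
        for k in range(size)
    ]


def harmonic_feature(relative_strengths):
    """Return DCT(log pi) for strictly positive relative partial strengths."""
    return dct_ii([math.log(p) for p in relative_strengths])
```

Because the logarithm turns a multiplicative gain into an additive offset, scaling all partial strengths by a constant changes only the first coefficient, which is what makes a mean-based adaptation of these features possible.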
Bowed String Identification
Once we extract the parameters $\Theta(t) = \{f_0(t), \pi(t)\}$,
we find the likelihood of the observed data for each
of the four bowed strings, i.e., the probability at each
point in time that the violinist is bowing a given
string.
The likelihood of bowed string $s_i$ is modeled
using a Gaussian mixture model (GMM). The
GMM models the density of π for each pitch
p(t) = pi(f0(t)), where pi(f) converts frequency f into the
closest MIDI note number. Let $\phi_j^{(i)}(p)$, $\mu_j^{(i)}(p)$, and
$\Sigma_j^{(i)}(p)$ indicate respectively the weight, mean,
and the covariance of the jth component of the
GMM for the ith bowed string at pitch p. Then,
the likelihood of observing bowed string i given
the feature {f0(t), π(t)} and the GMM parameters
$\theta_{GMM} = \{\phi_j^{(i)}(p), \mu_j^{(i)}(p), \Sigma_j^{(i)}(p) \mid \forall i, j, p\}$ is given as
follows:
$$
p(s = i \mid \pi(t), p(t), \theta_{GMM}) \propto \sum_{j} \phi_j^{(i)}(p(t))\, \mathcal{N}\bigl(\pi(t) \mid \mu_j^{(i)}(p(t)), \Sigma_j^{(i)}(p(t))\bigr)
$$
$\theta_{GMM}$ is trained using violin audio examples with
various playing styles, each of which has a known
sequence of pitch and fingerboard position. Using
examples with a variety of playing styles has the
effect of averaging out statistical discrepancies of
the feature arising from playing the example in a
particular playing style. Thus, different parameters
in GMM correspond to timbral difference arising
from the variety of fingerboard position and pitch
combinations, and not playing style, e.g., bow
pressure or bow velocity. Figure 4 shows the block
diagram of the GMM.
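Evaluating such a mixture for one frame can be sketched as follows. For brevity the sketch assumes diagonal covariances, which the article does not state, and it works in the log domain to avoid underflow. The data layout of `models` (one parameter triple per string, for a single pitch) is hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_j weights[j] * N(x | means[j], diag(variances[j])), via log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def most_likely_string(x, models):
    """models maps a string number to (weights, means, variances) for one pitch."""
    return max(models, key=lambda s: gmm_log_likelihood(x, *models[s]))
```

Summing `gmm_log_likelihood` over the frames of one note gives a per-note score for each candidate string.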
The log-likelihood of the bowed string for the kth
note, then, is a sum of bowed-string likelihoods for
all audio frames that play the kth note. Let $T_k(t)$ be
a binary variable that is 1 if frame t plays the kth
note notated on the score. Then, the bowed string
likelihood at note k, $\rho_k(s)$, is given as follows:
$$
\rho_k(s) = \sum_{t} T_k(t) \log p(s \mid \pi(t), p(t), \theta_{GMM})
$$
Figure 5. Block diagram of the HMM.
Figure 6. Simplified horizontal–vertical model used in the Viterbi algorithm.
Sequence Estimation using the Viterbi Algorithm
We determine the fingerboard-location sequence by
finding the most likely bowed-string sequence, taking
into account the observed acoustic signal and the
difficulty of traversing from one fingerboard location
to another. To incorporate both the likelihood of a
given bowed string sequence and the likelihood of
observing the features given a particular fingerboard
position in which a note is played, we model the
bowed-string sequence as an HMM. Figure 5 shows
a graphical depiction of our model.
We treat the string the note is played on as the
hidden state that needs to be estimated, given the
observed features π(t), the fundamental frequency
f0(t), and the inherent difficulty from traversing
from one fingerboard position to another. Let v
be a parameter that governs the state transition
probability of the HMM. Then, we find the most
likely bowed string as follows:
$$
\hat{S} = \arg\max_{S=\{s_0,\cdots,s_N\}} p(S \mid \Theta; \theta_{GMM}, v) = \arg\max_{S=\{s_0,\cdots,s_N\}} \Bigl[ \log \rho_0(s_0) + \log p(s_0; v) + \sum_{i=1}^{N} \bigl( \log p(s_i \mid s_{i-1}; v) + \log \rho_i(s_i) \bigr) \Bigr]
$$
$\rho_i(s_i)$ is the likelihood of the GMM obtained in the
previous section, for the ith note. The maximum-
likelihood sequence is determined inductively,
using the Viterbi algorithm. Let $S_{opt}(m, s \mid \pi, \Theta)$ be
the optimal bowed-string sequence for the first m
notes that ends in bowed string s. Then, we define
$S_{opt}(m+1, s \mid \pi, \Theta)$ as follows:
$$
S_{opt}(m+1, s \mid \pi, \Theta) = \log p(\pi(m+1) \mid s, \Theta) + \arg\max_{\hat{s}} \bigl[ \log p(S_{opt}(m, \hat{s} \mid \pi, \Theta)) + \log p(s \mid \hat{s}, \Theta) \bigr]
$$
We design a suitable bowed-string transition
probability $p(s_i \mid s_j; v)$ based on a simplified violin
fingering model, as shown in Figure 6:
$$
p(s_i \mid s_j; v) \propto \exp\Bigl\{ -v \Bigl( \Bigl( \frac{\Delta p_{i,j}}{7} \Bigr)^{2} + \Delta s_{i,j}^{2} \Bigr) \Bigr\}
$$
Here, $\Delta p_{i,j}$ is the amount of change between
fingerboard positions and $\Delta s_{i,j}$ is the amount of
change between string numbers. The constant
7 models the violinists’ tendencies to finger an
interval of up to 7 semitones by either playing on the
same string or crossing a string, but to finger a larger
interval by crossing a string.
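The decoding described above can be sketched as a standard Viterbi pass over the four strings. The sketch makes several assumptions that are not stated in the article: the fingerboard position is measured in semitones above the open string, the transition term is left unnormalized, the initial prior is folded into the first note's score, and a string is considered unavailable only when the pitch lies below its open pitch. All function and variable names are illustrative.

```python
OPEN_STRING_MIDI = {1: 55, 2: 62, 3: 69, 4: 76}  # G3, D4, A4, E5


def log_transition(pitch_prev, s_prev, pitch, s, v):
    """Unnormalized log p(s | s_prev; v) = -v * ((dp / 7)^2 + ds^2)."""
    dp = (pitch - OPEN_STRING_MIDI[s]) - (pitch_prev - OPEN_STRING_MIDI[s_prev])
    ds = s - s_prev
    return -v * ((dp / 7.0) ** 2 + ds ** 2)


def viterbi_strings(pitches, note_loglik, v):
    """Most likely bowed-string sequence.

    pitches[k] is the MIDI pitch of note k; note_loglik[k][s] is the
    acoustic log-likelihood of string s for note k.
    """
    def playable(k):
        return [s for s in OPEN_STRING_MIDI if pitches[k] >= OPEN_STRING_MIDI[s]]

    best = {s: note_loglik[0][s] for s in playable(0)}
    back = []
    for k in range(1, len(pitches)):
        new_best, pointers = {}, {}
        for s in playable(k):
            prev = max(
                best,
                key=lambda q: best[q] + log_transition(pitches[k - 1], q, pitches[k], s, v),
            )
            new_best[s] = (
                best[prev]
                + log_transition(pitches[k - 1], prev, pitches[k], s, v)
                + note_loglik[k][s]
            )
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Trace the best final state back to the first note.
    path = [max(best, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

A larger v makes the decoder more reluctant to change string or position, so weak acoustic evidence is overridden by playability; a smaller v lets the per-note likelihoods dominate.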
Feature Adaptation
Two violins typically sound different, mainly
because each violin has a unique body-resonance
characteristic. Therefore, it is essential for us to
adapt our model to violins with different body-
resonance characteristics.
Let Y1 and Y2 be the observed log-magnitude
spectrum of two violins. Let B1 and B2 be the
log-magnitude frequency response of the bodies
of the two violins, and S be the log-magnitude
spectrum of the bow-string interaction. Assuming
that the acoustic property of the bow–string
interaction does not change significantly across
different instruments, we obtain the following
relation:
$$
Y_1 = B_1 + S, \qquad Y_2 = B_2 + S
$$
Because B1 and B2 are fixed for each instrument, subtracting
the expectation from each equation leaves the same
quantity, $S - \langle S \rangle_S$, on both right-hand sides.
Taking the expectation, we therefore obtain
$Y_2 = Y_1 - \langle Y_1 \rangle_S + \langle Y_2 \rangle_S$, where $\langle Y \rangle_X$ denotes the expectation
of Y under the probability distribution of X. In
our method, we let Y1 be the observed features,
and generate Y2, which is the feature when Y1 is
played on an instrument with body resonance characteristic
B2. $\langle Y_1 \rangle_S$ is obtained readily by taking
the average of the observed signal, and $\langle Y_2 \rangle_S$ is
obtained by summing the average feature of each
bowed string, weighted by how often a particular
string is played at a given pitch for the given
music score. Because our features are linear transformations
of the logarithm of the relative powers
of harmonic peaks, the same rationale holds. Initially,
we set the probability distribution of S as
follows:
$$
p(S \mid \mathrm{pitch}, \beta) \propto
\begin{cases}
\exp\Bigl( -\dfrac{\mathrm{pitch} - \mathrm{pitch}_0(S)}{\beta} \Bigr) & \text{if } \mathrm{pitch} > \mathrm{pitch}_0(S) \\
0 & \text{otherwise}
\end{cases}
$$
Here, pitch is the played pitch, pitch0 is the
pitch of the open string played on string S, and
β is a positive parameter that assigns a greater
probability to lower fingerboard positions, which
are more commonly used and somewhat easier for
the violinist.
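Both ingredients of the adaptation, the initial string prior and the mean shift, can be sketched briefly. The sketch follows the formulas of this section literally, including the strict inequality in the prior. The normalization over the four strings, the list-of-lists feature layout, and all names are assumptions of this illustration.

```python
import math

OPEN_STRING_MIDI = {1: 55, 2: 62, 3: 69, 4: 76}  # G3, D4, A4, E5


def string_prior(pitch, beta):
    """p(S | pitch, beta), proportional to exp(-(pitch - pitch0(S)) / beta)
    when pitch > pitch0(S) and 0 otherwise, normalized over the strings."""
    raw = {
        s: math.exp(-(pitch - p0) / beta) if pitch > p0 else 0.0
        for s, p0 in OPEN_STRING_MIDI.items()
    }
    total = sum(raw.values())
    return {s: (w / total if total > 0.0 else 0.0) for s, w in raw.items()}


def mean_vector(frames):
    """Per-dimension average of a list of equal-length feature vectors."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


def adapt_features(frames, target_mean):
    """Y2 = Y1 - <Y1> + <Y2>: shift the observed features to the target mean."""
    observed_mean = mean_vector(frames)
    return [
        [y - m1 + m2 for y, m1, m2 in zip(frame, observed_mean, target_mean)]
        for frame in frames
    ]
```

In use, `target_mean` would be the prior-weighted sum of the per-string average features from the training data, so that the adapted recording has the mean the trained models expect.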
Fingering Estimation
At the core of our fingering estimation is the violin-
fingering model, which models the difficulty of a
particular fingering. The fingering is determined by
designing a cost function between multiple states
of hand position, which reflects how difficult it
is for the hand to traverse from one position to
another.
Violin Fingering Model
In this study, we extend the existing horizontal–
vertical cost model presented in the previous section
to include fingering practices that are unique to the
violin, which we call the violin pedagogical model.
It is mainly inspired by practices of violin fingering
as suggested by Yampolsky (1967) and Flesch
(2000). The violin, in particular, is a small, unfretted
bowed stringed instrument, making it susceptible to
intonation errors. Thus, violin fingerings are often
set such that the weak finger (the little finger) is
used sporadically, and other fingers in such a way
that a natural hand position is maintained.
We define a fingering as a 4-tuple n = (n, s, f, p),
where n ∈ N is the pitch in MIDI note number,
s ∈ (1, 2, 3, 4) is the bowed string, f ∈ (0, 1, 2, 3, 4) is
the pressed finger (0 = open string, i.e., no finger
placement), and p ∈ N is the position. Let F be a set of all
possible fingerings. Also, let n(n) be a function that
retrieves the pitch of n ∈ F, s(n) the bowed string,
f(n) the finger, and p(n) the position.
We define an unnotated score, Su ∈ N^M, to be
an M-tuple of notes, where M is the number
of notes contained in the music score, and the
ith element, Su(i), is the ith note of the music
score. We define a bowed-string constraint,
cbow ∈ (1, 2, 3, 4)^M, associated with an unnotated
score Su as an M-tuple, where the ith element
indicates the bowed string for the ith note. We
finally define a notated score, Sn ∈ F^M, where
the ith element contains the fingering for the ith
note.
We then formulate our problem as finding the
optimal notated score Sopt that satisfies both the
note sequence of the unnotated score obtained
using the SMF and the bowed-string constraint
determined using the Viterbi algorithm. Namely,
we define cost functions for traversing a sequence of
two notes or three notes, and find the fingering with
the smallest net cost that satisfies the constraints.
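For costs defined over pairs of consecutive notes, this search can be carried out with dynamic programming. The following generic sketch assumes that the candidate fingerings for each note have already been filtered by notated pitch and bowed string, and it takes the two-note cost as a caller-supplied function. It does not implement the article's feature-based cost, and the names are illustrative.

```python
def best_fingering(candidates, pair_cost):
    """Minimum-total-cost sequence with one fingering per note.

    candidates[i] lists the fingerings allowed for note i (already filtered
    by pitch and bowed string); pair_cost(prev, cur) is the two-note cost.
    """
    totals = [0.0] * len(candidates[0])  # nothing precedes the first note
    back = []
    for i in range(1, len(candidates)):
        new_totals, pointers = [], []
        for cur in candidates[i]:
            costs = [
                totals[j] + pair_cost(prev, cur)
                for j, prev in enumerate(candidates[i - 1])
            ]
            j_best = min(range(len(costs)), key=costs.__getitem__)
            new_totals.append(costs[j_best])
            pointers.append(j_best)
        totals = new_totals
        back.append(pointers)
    # Trace the cheapest final choice back to the first note.
    path = [min(range(len(totals)), key=totals.__getitem__)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

The running time grows linearly with the number of notes and quadratically with the number of candidate fingerings per note, which stays small once the bowed string is fixed.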
Let Cb(si, si−1) : F 2 → R be a cost function defined
over a sequence of two notes. We define a symbol
Z that satisfies the following:
∀A ∈ F.Cb(Z, A) = Cb(A, Z) = 0
然后, the optimal fingering is a notated score Fopt
that satisfies the following:
Fopt = arg min
s1···sM∈F
中号(西德:7)
我=1
Cb(si, si−1)
such that

\[ \forall j < 1:\ s_j = 0, \qquad \forall i \in [1, M]:\ n(s_i) = S_u(i) \ \text{and}\ s(s_i) = c_{st}(i). \]

Let $\tau_p = 3$, $\delta_{i,j}$ be Kronecker's delta, and $\mathbf{1}(c)$ be a function that is 1 if condition $c$ is true and 0 otherwise. We define the following quantities for convenience:

\[
\begin{aligned}
\Delta p(i, j) &= p(s_i) - p(s_j) \\
\Delta s(i, j) &= s(s_i) - s(s_j) \\
\Delta f(i, j) &= f(s_i) - f(s_j) \\
\Delta p_r(i, j) &= |n(s_i) - p_0(st(s_i))| - |n(s_j) - p_0(st(s_j))| \\
R(i, j) &= [P_{\min}(f(s_i), f(s_j)),\ P_{\max}(f(s_i), f(s_j))]
\end{aligned}
\]

\[
P_{\max} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & 4 & 6 \\ 0 & 2 & 0 & 2 & 5 \\ 0 & 4 & 2 & 0 & 3 \\ 0 & 6 & 5 & 3 & 0 \end{pmatrix}, \quad
P_{nat} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & 3 & 5 \\ 0 & 2 & 0 & 1 & 2 \\ 0 & 3 & 1 & 0 & 2 \\ 0 & 5 & 2 & 2 & 0 \end{pmatrix}, \quad
P_{\min} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 2.1 & 3.5 \\ 0 & 1 & 0 & 1.1 & 2.5 \\ 0 & 2.1 & 1.1 & 0 & 1.5 \\ 0 & 3.5 & 2.5 & 1.5 & 0 \end{pmatrix}
\]

Then, we define a two-note generic fingering model (denoted HVM-2) as the following set of features $x_{BN}$:

\[
\begin{aligned}
b_0 &= |\Delta p(i, i-1)| \cdot \mathbf{1}(f(s_i) \neq 0 \wedge f(s_{i-1}) \neq 0) \\
b_1 &= \mathbf{1}(fg(s_i) \neq 0 \wedge \Delta f(i, i-1) = 0 \wedge \Delta p(i, i-1) \neq 0) \\
b_2 &= \mathbf{1}(\Delta f(i, i-1) \cdot \Delta p_r(i, i-1) < 0 \wedge |\Delta p(i, i-1)| < \tau_p) \\
b_3 &= |\Delta p_r(i, i-1)| - P_{nat}(f(s_i), f(s_{i-1})) \cdot \mathbf{1}(|\Delta p_r(i, i-1)| \in R(i, i-1)) \\
x_{BN}(1) &= \infty \cdot \mathbf{1}(n(s_i) - p_0(st(s_i)) - pos(s_i) \notin R(1, i)) \\
x_{BN}(2, 3) &= (b_0,\ b_0^2) \\
x_{BN}(4) &= b_1 \\
x_{BN}(5, 6) &= (|\Delta s(i, i-1)|,\ |\Delta s(i, i-1)|^2) \cdot \mathbf{1}(fg(s_i) \cdot fg(s_{i-1}) \neq 0) \\
x_{BN}(7) &= \mathbf{1}(|\Delta p_r(i, i-1)| - P_{nat}(fg(s_i), fg(s_{i-1})) \neq 0) \\
x_{BN}(8) &= b_2 \cdot \mathbf{1}(fg(s_i) \neq 0 \wedge fg(s_{i-1}) \neq 0) \\
x_{BN}(9) &= b_3 \cdot \mathbf{1}(fg(s_i) = 0 \vee fg(s_{i-1}) = 0)
\end{aligned}
\]

$x_{BN}(1)$ determines whether a note is physically playable on a particular string, where $\infty \times 0 = 0$; i.e., this feature discards any fingering that is physically unplayable, as shown in Figure 7. $x_{BN}(2, 3)$ applies a penalty for a change of position. $x_{BN}(4)$ penalizes a change of position using the same finger (i.e., a glissando). $x_{BN}(5, 6)$ adds a penalty for a change of fingerboard location. $x_{BN}(7)$ penalizes playing in an unnatural hand position. $x_{BN}(8)$ penalizes playing a sequence in which the second note is placed higher on the fingerboard, but the finger traverses from a high finger to a low one (e.g., little finger to index finger), and vice versa. Finally, $x_{BN}(9)$ prevents a change of position (wrist movement) when it is completely natural not to make one.

Figure 7. Graphical description of $x_{BN}(1)$. A Roman numeral indicates the string (IV = lowest, I = highest), and a number indicates the pressed finger (1 = index finger, 4 = little finger).

We define a two-note fingering model inspired by the literature of violin pedagogy (denoted VPM-2) as the following set of features $x_{BV}$:

\[
\begin{aligned}
cond_0 &= fg(s_i) \cdot fg(s_{i-1}) \neq 0 \wedge |\Delta p(i, i-1)| < \tau_p \\
x_{BV}(1) &= \mathbf{1}(cond_0 \wedge \Delta p_r(i, i-1) \neq 0 \wedge \Delta f(i, i-1) = 0) \\
x_{BV}(2) &= \mathbf{1}(cond_0 \wedge |\Delta p_r(i, i-1)| = 1 \wedge |\Delta f(i, i-1)| > 1) \\
x_{BV}(3) &= \mathbf{1}(fg(s_i) \cdot fg(s_{i-1}) \neq 0 \wedge \Delta p(i, i-1) \neq 0 \wedge \Delta p_r(i, i-1) \cdot \Delta f(i, i-1) = -1) \\
x_{BV}(4) &= \mathbf{1}(fg(s_i) = fg(s_{i-1}) = 4 \wedge |\mathrm{note}(i) - \mathrm{note}(i-1)| > 1) \\
x_{BV}(5) &= pos(s_i) \\
x_{BV}(6, 7, 8) &= (\delta_{2, pos(s_i)},\ \delta_{5, pos(s_i)},\ \delta_{1, fg(s_i)})
\end{aligned}
\]

$x_{BV}(1)$ penalizes using the same finger to move to a different position, as shown in Figure 8. $x_{BV}(2)$ penalizes a chromatic change of the relative position using a nonadjacent finger, as shown in Figure 9; such movement involves wrist motion, which is considerably harder to tune than placing a finger with a slight stretch. $x_{BV}(3)$ lessens the penalty for shifting up from a higher-numbered finger to a lower-numbered finger, or vice versa, when the shift is small, as shown in Figure 10. $x_{BV}(4)$ penalizes an adjacent use of the little finger, which is weak and prone to intonation errors. $x_{BV}(5)$ adds a penalty for an extremely high position, which is harder to play in tune. $x_{BV}(6, 7)$ adds a preference for playing in the first, second, or third positions, which are typically the three easiest positions to play in. $x_{BV}(8)$ penalizes the half-position, the lowest possible position but an unconventional one; hence, it is unnatural from the perspective of violin pedagogy but perhaps the easiest position to play in tune from a physical perspective.

Figure 8. Graphical description of $x_{BV}(1)$.
Figure 9. Graphical description of $x_{BV}(2)$.
Figure 10. Graphical description of $x_{BV}(3)$.

Using these features, we define the two-note cost function as follows:

\[ C_b = \mathrm{diag}(W_{BN}\ W_{BV})\,[x_{BN}\ x_{BV}]^T \]

$W_{BN}$ and $W_{BV}$ define the relative weight of each feature. Given ample training data, statistical machine learning of these weights might be possible. In this article, we choose to manually adjust the weights, as the violin fingering corpus is too small to permit statistical machine learning.

Figure 11. Recognition accuracy (%) for different values of v (all values are for data with data adaptation). SS = same violin and same strings as training data; SD = same violin with different strings; DD = different violin and different strings.

Experiments
We perform three experiments. The first experiment assesses the performance of feature-adaptation and error-correction algorithms on bowed-string estimation, the second experiment assesses the performance of the bowed-string classifier in a
polyphonic mixture, and the third experiment
evaluates the playability of our new fingering model
through subjective experiments. In the first two
experiments, we prepared recordings of three pieces
of classical music, using two significantly different
fingerings (207 notes for three pieces, yielding a total
of 414 notes). For each fingering, we recorded the
music with the following three conditions:
1. Using the same violin and strings as were
used to record the training data (denoted as
setup SS).
2. Using the same violin but with a different
brand of strings (setup SD).
3. Using a different violin with a different brand
of strings (setup DD).
This results in about 18 minutes of audio for
validation. The training data, which lasts about 24
minutes, consists of two-octave chromatic scales
played on each string of an electric violin, using
various dynamics.
Setup SS and SD were played on the same electric
violin, and DD on an acoustic violin. We chose to
record the training data using an electric violin and
to use an acoustic violin for DD for two reasons.
First, it is easy to record noise-free training data
using an electric violin. Second, because electric
and acoustic violins sound extremely different, the
evaluation of the feature adaptation mechanism
can be regarded as the worst-case performance.
Therefore, in a real-life application, we expect the
system to perform somewhere between setup SD
and DD, as long as the training data use an acoustic
violin. In the EM algorithm, we set $\sigma_F^2 = 0.1$ and $\sigma_{\hat{f}}^2$
to start at 50 and narrow down inverse-linearly with
each iteration of the algorithm. The value $\beta$ used in
model adaptation is chosen to be 0.1.
Experiment 1: Evaluation of Bowed-String
Identifier
We evaluated the accuracy of the bowed-string
estimator with and without feature adaptation, each
time evaluating the accuracy (1) of a baseline, i.e.,
without considering any sequential information; (2)
considering sequential information using a previous
study (Maezawa et al. 2009, 2010); and (3) considering
sequential information using the Viterbi algorithm
as proposed in this article. The value of v used for the
Viterbi algorithm in the bowed sequence estimation
was set to 30, chosen by evaluating the accuracy
for various values of v, as shown in Figure 11.
Figure 12 shows the result. We find that our method
consistently outperforms our previous study. In
all cases, adaptation decreases the performance
when the training data and the validation data are
from the same instrument and the same brand of
strings. This is because adaptation itself depends
on the estimated sequence of fingerboard positions,
which contains errors; the discrepancy between
the actual sequence of fingerboard positions and
the estimated ones is small enough for adaptation
to remain effective in absorbing the differences in
body resonance characteristics between two different
violins or strings, but significant when the same
strings and violin are used.
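
The role of sequential information can be illustrated with a generic Viterbi decoder over bowed strings. The following is a minimal sketch, not the estimator evaluated above: the function name `viterbi_strings`, the per-note log-likelihood input, and the constant `switch_penalty` are hypothetical, and the sketch makes no claim about how the parameter v enters the actual model.

```python
def viterbi_strings(log_emission, switch_penalty):
    """Most likely string sequence under a toy 'staying is free' transition model.

    log_emission[t][s] is an assumed log-likelihood of note t on string s.
    switch_penalty (hypothetical) is subtracted whenever the string changes.
    """
    n_strings = len(log_emission[0])
    score = list(log_emission[0])
    backptrs = []
    for frame in log_emission[1:]:
        new_score, backptr = [], []
        for cur in range(n_strings):
            # Best predecessor for ending on string `cur` at this note.
            best_prev = max(
                range(n_strings),
                key=lambda prev: score[prev]
                - (switch_penalty if prev != cur else 0.0),
            )
            penalty = switch_penalty if best_prev != cur else 0.0
            new_score.append(score[best_prev] - penalty + frame[cur])
            backptr.append(best_prev)
        score = new_score
        backptrs.append(backptr)
    # Trace back from the best final string.
    state = max(range(n_strings), key=lambda s: score[s])
    path = [state]
    for backptr in reversed(backptrs):
        state = backptr[state]
        path.append(state)
    return path[::-1]


# The middle note weakly favors string 1; a penalty keeps the path on string 0.
toy = [[0.0, -5.0], [-1.0, 0.0], [0.0, -5.0]]
independent = viterbi_strings(toy, switch_penalty=0.0)
sequential = viterbi_strings(toy, switch_penalty=3.0)
```

With no penalty the decoder reduces to the per-note baseline, whereas a positive penalty overrides a weak per-note preference in favor of a more playable sequence.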
Figure 12. Comparison of recognition accuracy (%) for different sequence estimation methods. SS = same violin and same strings as training data; SD = same violin with different strings; DD = different violin and different strings.

Experiment 2: Effect of Accompaniment
on Feature Extraction
Piano accompaniments were generated for two of the three pieces, using a synthesizer with uniform note velocity, and the amplitude was adjusted such that the peak values of the solo violin and the accompaniment were identical. Next, the recognition accuracy was evaluated for each violin/string type, by changing the level of the accompaniment. We scaled the amplitude of the violin part such that the root-mean-square power of the violin part relative to that of the accompaniment is set to a given value of signal-to-accompaniment ratio. Figure 13 shows the result. We find that the bowed-string estimator performs similarly for different ratios, as long as the ratio is positive, that is, the signal level of the violin is greater than the "noise" level (i.e., the accompaniment).

Figure 13. Recognition accuracy with different levels of accompaniment. SS = same violin and same strings as training data; SD = same violin with different strings; DD = different violin and different strings.

Experiment 3: Subjective Evaluation
of the Fingering Estimation
First, ten excerpts from three classical pieces were prepared (approximately 200 notes total). For each excerpt, we generated two fingerings: one using HVM-2, and the other using HVM-2+VPM-2. We assume that bowed-string estimation is perfectly accurate, by not incorporating constraints based on the estimated bowed string. The feature weight W was manually tuned, by first adjusting the parameters for example (1) until it generated satisfactory fingerings for the repertoires considered. Then, parameters pertaining to example (2) were adjusted. They were set to W_BN = [1, 0.5, 1, 1, 5, 1, 0.5, 20, 10] and W_BV = [1, 10, -0.5, 10, 1.1, -0.3, -0.2, 0.2]. Excerpts that generated different fingerings for each of the setups were then extracted.

Figure 14. A screen capture of the questionnaire.

Table 1. The Number of Answers Indicating a Preference for the Baseline Model (HVM-2) or for the Violin Pedagogical Model (HVM-2 + VPM-2)

Setup             Total Count
HVM-2             5
HVM-2 + VPM-2     54
No preference     7

Next, seven violinists of various skills (amateur and professional, i.e., ten or more years of experience) evaluated the generated fingerings using a form as shown in Figure 14. Each violinist was presented with fingerings generated using HVM-2 and VPM-2+HVM-2, and was asked to choose the better of the two, if any. Seven violinists were each given ten questions (70 in total), but only 66 answers were given. Table 1 shows the result of the survey. The sign test shows that HVM-2 and VPM-2+HVM-2 are not equally favored (p = 0.01), which suggests that our proposed model (HVM-2+VPM-2) is favored over the baseline (HVM-2 only).
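
For readers unfamiliar with the sign test, the following sketch shows one common form of it applied to the counts in Table 1. It assumes a two-sided exact binomial test with the "no preference" answers discarded and all answers treated as independent; the function name is hypothetical, and the exact procedure used in the survey analysis may differ.

```python
from math import comb


def sign_test_two_sided(wins_a, wins_b):
    """Two-sided exact sign test; ties are assumed to be dropped beforehand."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Counts from Table 1: 5 answers for HVM-2, 54 for HVM-2 + VPM-2.
p_value = sign_test_two_sided(5, 54)
```

Under these assumptions the resulting value falls below the 0.01 level, which is consistent with rejecting the hypothesis that the two models are equally favored.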
Discussion
From the results in Experiment 1 we observe that in
setup DD (the most realistic situation), adaptation
improves the recognition accuracy, whereas in
SS and SD, the accuracy decreases. We believe
this occurs because of the mismatch between the
distribution of the actual bowed-string sequence
and that assumed in our model. Moreover, we find
that our study offers major improvements compared
with our previous study. In all cases, our method
improves the recognition accuracy over the baseline,
which suggests that considering the playability of
a particular sequence of notes over a particular
sequence of bowed strings is effective.
Experiment 2 suggests that our features are
robust in polyphonic audio mixtures, as long as the
signal level of the violin solo part is at least as loud
as the accompaniment. This condition seems to hold in
pieces where the violin has an important melody
(hence, the choice of fingerboard location
becomes an even greater musical issue). Therefore,
we believe our method performs without significant
degradation of accuracy in practical applications
involving works for violin and piano.
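
The signal-to-accompaniment ratio referred to here can be made concrete with a short sketch. It assumes that the ratio is expressed in decibels of root-mean-square amplitude (so that 0 dB means equal levels and positive values mean a louder violin); the function names and the plain-list sample representation are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def scale_to_ratio(violin, accompaniment, ratio_db):
    """Scale the violin so its RMS level sits ratio_db above the accompaniment.

    Assumes the ratio is 20 * log10(rms(violin) / rms(accompaniment)).
    """
    target_rms = rms(accompaniment) * 10.0 ** (ratio_db / 20.0)
    gain = target_rms / rms(violin)
    return [gain * x for x in violin]


violin = [1.0, -1.0, 1.0, -1.0]
piano = [0.5, -0.5, 0.5, -0.5]
equal_level = scale_to_ratio(violin, piano, 0.0)
violin_louder = scale_to_ratio(violin, piano, 6.0)
```

Only the violin is rescaled, as in Experiment 2, so the accompaniment level stays fixed across conditions.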
Finally, we found that the fingering generated is
more natural when the proposed fingering model
is used than when existing ones are. We believe
incorporating the preference for the first and third
positions (index finger placed 2 semitones and 5
semitones above the nut of the violin, respectively)
is the chief reason: they are the most frequently
used positions on the violin, and fingerings generated
on these positions are more natural for violinists
to play. We find that incorporating these kinds of
heuristics could drastically improve playability.
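
The effect of such a position preference on a minimum-cost fingering search can be sketched with a toy dynamic program. This is an illustration under strong assumptions, not the HVM-2+VPM-2 model: the only two-note cost is a hypothetical `shift_weight` times the size of the position change, and the position preference reuses the weights -0.3 and -0.2 reported for first (2 semitones) and third (5 semitones) position.

```python
def position_preference(pos, w_first=-0.3, w_third=-0.2):
    """Negative cost (a reward) for first (pos 2) and third (pos 5) position."""
    return w_first * (pos == 2) + w_third * (pos == 5)


def best_positions(candidates, shift_weight=1.0):
    """Minimum-cost hand-position sequence by dynamic programming.

    candidates[i] lists the positions (semitones above the nut) assumed to be
    playable for note i; shift_weight is a hypothetical stand-in for the
    position-change penalties of the full model.
    """
    cost = {p: position_preference(p) for p in candidates[0]}
    back = []
    for options in candidates[1:]:
        new_cost, pointers = {}, {}
        for p in options:
            # Cheapest predecessor once the shift to p is paid for.
            prev = min(cost, key=lambda q: cost[q] + shift_weight * abs(p - q))
            new_cost[p] = (
                cost[prev] + shift_weight * abs(p - prev) + position_preference(p)
            )
            pointers[p] = prev
        cost = new_cost
        back.append(pointers)
    last = min(cost, key=cost.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


stay_in_third = best_positions([[2, 5], [2, 5], [5, 7]])
stay_in_first = best_positions([[2, 5], [2, 5]])
```

In the first example the last note is unavailable in first position, so the search settles on third position throughout rather than shifting mid-phrase.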
Figure 15. Estimated
fingering from first 50 bars
of Romanze in C played by
Joachim (1903).
Next, we shall discuss an application of our system for recovering the fingering of a recording from more than a century ago.

Application: Demystifying Historical Recordings
Our method may be used to analyze a recording from the past to recover the fingering of a legendary violinist. In this example, we attempt to recover the fingering of Romanze in C composed by Joachim, a legendary violinist of the late 19th century, using a gramophone recording of Joachim in 1903. The recording was pitch-shifted such that the notated A4 was set to 440 Hz. The VPM-2+HVM-2 fingering model was used to estimate the fingering, an excerpt of which is shown in Figure 15.
We speculate, however, that the actual fingering may have been very different from that estimated by our method. For example, there are notes that are playable using open strings (i.e., no finger pressed) whose estimated fingerings on the notated score show open strings; however, on the recording, we clearly hear the note played with a vibrato, meaning that it had to have been played by stopping a string with a finger. These kinds of errors might have occurred for three reasons:
A) In Joachim's time, the material used for the
violin string was significantly different. In
particular, the E-string (the highest-pitched
string) used gut as its core, which has a much
mellower timbre than the kind of string used
today. Such discrepancies in the material
property of the string itself would cause the
performance of the bowed-string estimator
to deteriorate.
B) A high-quality recording of Joachim is not
available. The remastered recording we used
had a frequency range that extended up to
only approximately 5 kHz; the rest was cut
off, perhaps by applying an equalizer. Thus,
there were only a few partials that conveyed
meaningful information.
C) Joachim's use of pitch-based playing techniques,
such as vibrato and glissando, gives
many clues to violinists who want to infer
Joachim's playing. For example, if a transition
from one note to another is completely
smooth, it strongly suggests that the two
notes were played on the same string. Vibrato
can be a giveaway for distinguishing whether
a note is played using an open string or not.
Our method, however, does not incorporate
the pitch trajectory for fingerboard-location
estimation and thus misses such clues.
This output suggests a future direction for
research: incorporating pitch-based cues for fingerboard
inference, and perhaps robustness to different
materials used for a given string. The latter,
however, can be ameliorated with our method to a
certain extent; we can record the training data using
the string that is thought to be used in the recording
whose fingering we would like to infer.
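
As an illustration of what a pitch-based cue might look like, the sketch below flags a note as stopped (not an open string) when its fundamental-frequency track shows vibrato. This is not part of the proposed method: the function names, the assumption that a per-note f0 track in Hz is already available, and the 10-cent threshold are all hypothetical.

```python
import math


def cents(f_hz, ref_hz):
    """Pitch distance from ref_hz to f_hz in cents."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def vibrato_extent_cents(f0_track):
    """Half the peak-to-peak pitch deviation around the track's median, in cents."""
    ordered = sorted(f0_track)
    median = ordered[len(ordered) // 2]
    deviations = [cents(f, median) for f in f0_track]
    return (max(deviations) - min(deviations)) / 2.0


def likely_stopped(f0_track, threshold_cents=10.0):
    """True if the pitch wobbles more than the (hypothetical) threshold."""
    return vibrato_extent_cents(f0_track) > threshold_cents


# Synthetic tracks: a steady A4 and an A4 with roughly +/-30 cents of vibrato.
steady = [440.0] * 100
wobbling = [
    440.0 * 2.0 ** (0.3 * math.sin(2.0 * math.pi * 6.0 * t / 100.0) / 12.0)
    for t in range(100)
]
```

A cue of this kind could serve as an additional constraint that rules out open-string states for notes played with vibrato.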
On Inharmonicity
Stringed instruments such as the violin do not
produce overtones that are exact integer multiples
of the fundamental frequency. Such deviation from
integer harmonics is caused by the inharmonicity
of the violin string, which in turn is created
by torsion of the string. Because inharmonicity
is dependent on the material of the string, our
model initially incorporated inharmonicity, using
a beta distribution with a small shape parameter
as its prior distribution. We found, however, that
such a model produced lower accuracy than that
without inharmonicity. Because of the nature of
the EM algorithm, we found that the model tends
to “explain” partials of the accompaniment as
arising from the violin sound with extremely high
inharmonicity.
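
To make the deviation from integer harmonics concrete, the sketch below uses the textbook stiff-string formula f_n = n f_0 sqrt(1 + B n^2). This parameterization is an assumption made for illustration and is not necessarily the one used in the model described above; the function name and the coefficient values are hypothetical.

```python
import math


def partial_frequencies(f0_hz, n_partials, inharmonicity=0.0):
    """Partial frequencies under the stiff-string model n * f0 * sqrt(1 + B * n^2).

    inharmonicity is the coefficient B; B = 0 gives exact integer harmonics.
    """
    return [
        n * f0_hz * math.sqrt(1.0 + inharmonicity * n * n)
        for n in range(1, n_partials + 1)
    ]


harmonic = partial_frequencies(440.0, 3)
stretched = partial_frequencies(440.0, 3, inharmonicity=1e-4)
```

Because a large coefficient can move the upper partials almost anywhere, a model that leaves B free has room to attribute partials of the accompaniment to the violin, which matches the failure mode described above.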
Conclusion
This article presented a method for recovering the violin fingering from an input audio signal and a music score by analyzing the bowed-string sequence, and using it as a constraint to determine the optimal fingering. Bowed-string sequence classification was based on features that are robust in a polyphonic mixture. We also incorporated a sequential model based on violin playing. We found that such sequential modeling drastically improved the accuracy. The fingering estimation incorporated features that are specific to violin playing practices (the pedagogical model), in addition to some of the more fundamental features applicable to other instruments as well. Incorporating such heuristics generated a drastically easier fingering.
Future research directions may involve improved recognition accuracy, refining the fingering model, and more applications. For example, recognition accuracy may be improved by exploiting the smoothness of pitch trajectory (glissando, vibrato, etc.), as we observed that violinists listen to the smoothness of pitch transition to infer the fingerboard location and the fingering. The fingering model may further be improved by incorporating more features that are inspired by violin pedagogy or through machine learning of feature weights by preparing a large violin fingering corpus. Another possibility is to perform joint estimation of fingerboard location and fingering: the problems of fingerboard location estimation and fingering estimation are dependent on each other, suggesting that joint estimation may improve the performance of both. From a musicological perspective, artist classification, skill assessment, and analysis of historical recordings may be interesting applications of our approach to fingering estimation.

Acknowledgments
This work is supported by Grant-in-aid for Scientific Research (S) and CREST-MUSE of JST.
We would like to thank violinists P. Klinger, Dr. P. Sunwoo, and Dr. J. Choi for stimulating and inspiring discussions on violin fingering.
References
Barbancho, I. 2009. "Transcription and Expressiveness Detection System for Violin Music." In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 189–192.
Burns, A. M., and M. M. Wanderley. 2006. "Visual Methods for Retrieval of Guitarist Fingering." In Proceedings of the International Conference on New Interface for Musical Expression, pp. 196–199.
Cremer, L., and J. S. Allen. 1984. The Physics of the Violin. Cambridge, Massachusetts: MIT Press.
Flesch, C. 2000. The Art of Violin Playing. 2nd ed., Vol. I. New York: Carl Fischer.
Fletcher, N. H., and T. D. Rossing. 1998. The Physics of Musical Instruments. 2nd ed. New York: Springer.
Hu, N., R. B. Dannenberg, and G. Tzanetakis. 2003. "Polyphonic Audio Matching and Alignment for Music Retrieval." In Proceedings of the Institute of Electrical and Electronics Engineers Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 185–188.
Kameoka, H., T. Nishimoto, and S. Sagayama. 2007. "A Multipitch Analyzer Based on Harmonic Temporal Structured Clustering." Institute of Electrical and Electronics Engineers Transactions on Audio, Speech and Language Processing 15(3):982–994.
Kasimi, A. A., E. Nichols, and C. Raphael. 2007. "A Simple Algorithm for Automatic Generation of Polyphonic Piano Fingerings." In Proceedings of the International Society for Music Information Retrieval Conference, pp. 355–356.
Krishnaswamy, A., and J. Smith. 2003. "Inferring Control Inputs to an Acoustic Violin from Audio Spectra." In Proceedings of the Institute of Electrical and Electronics Engineers International Conference on Multimedia and Expo, pp. 733–736.
Lu, H., et al. 2008. "iDVT: An Interactive Digital Violin Tutoring System Based on Audio-Visual Fusion." In Association for Computing Machinery International Conference on Multimedia, pp. 300–301.
Maezawa, A., et al. 2009. "Bowed String Sequence Estimation of a Violin Based on Adaptive Audio Signal Classification and Context-Dependent Error Correction." In Proceedings of the International Symposium on Multimedia, pp. 9–16.
Maezawa, A., et al. 2010. "Violin Fingering Estimation Based on Violin Pedagogical Fingering Model Constrained by Bowed Sequence Estimation from Audio Input." Trends in Applied Intelligent Systems, 3:249–259.
Radicioni, D. P., L. Anselma, and V. Lombardo. 2004. "A Segmentation-Based Prototype to Compute String Instruments Fingering." In Proceedings of the Conference on Interdisciplinary Musicology. Available online at www.uni-graz.at/richard.parncutt/cim04. Accessed May 2012.
Radisavljevic, A., and P. F. Driessen. 2004. "Path Difference Learning for Guitar Fingering Problems." In Proceedings of the International Computer Music Conference, pp. 730–733.
Sayegh, S. I. 1989. "Fingering for String Instruments with the Optimal Path Paradigm." Computer Music Journal 13(3):76–83.
Traube, C., and J. Smith. 2000. "Estimating the Plucking Point on a Guitar String." In Proceedings of the International Conference on Digital Audio Effects, pp. 153–158.
Yampolsky, I. M. 1967. Principles of Violin Fingering. New York: Oxford University Press.
Yonebayashi, Y., H. Kameoka, and S. Sagayama. 2007. "Automatic Decision of Piano Fingering Based on a Hidden Markov Model." In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2915–2921.
Zhang, B., J. Zhu, and W. Leow. 2007. "Visual Analysis of Fingering for Pedagogical Violin Transcription." In Proceedings of the ACM Conference on Multimedia, pp. 521–524.