13:47

And therefore, it has to be clearly distinguished from what we would consider a normal spectrogram; it is quite different.

So here, for example, we see the first 12 coefficients.

We can choose the number of coefficients; 12 coefficients is quite a standard number to use.

And in fact, the zero coefficient is not shown.

Normally, we do not display the zeroth coefficient because it relates to the loudness, or the energy, of the sound, and we have other measures for that.

So we normally show starting from the first coefficient up to, as I said, 12 coefficients.

The first coefficient is the one that describes the bigger picture of the spectrum, the overall shape, and as we go higher up, the coefficients describe more detail, smaller changes in the spectrum.

And so this is normally used as a vector that includes all these coefficients at every frame.

So we have a very compact representation, just 11 or 12 values, that can capture different aspects of the spectral shape.
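
As a rough illustration of how such a per-frame vector could be computed, here is a minimal numpy sketch, assuming we already have the log energies of the mel bands for one frame. The DCT-II is written out directly, the function name and the band count are hypothetical, and Essentia's actual MFCC algorithm does considerably more; this only shows the idea of decorrelating the bands and discarding the zeroth (energy) coefficient:

```python
import numpy as np

def mfcc_vector(mel_log_energies, n_coeffs=12):
    """Sketch: MFCCs from mel-band log energies via a DCT-II,
    dropping the 0th coefficient (overall energy) and keeping
    coefficients 1..n_coeffs, as described above."""
    n_bands = len(mel_log_energies)
    n = np.arange(n_bands)
    # DCT-II basis: coefficient k sums cos(pi*k*(n+0.5)/N) over bands
    coeffs = np.array([
        np.sum(mel_log_energies * np.cos(np.pi * k * (n + 0.5) / n_bands))
        for k in range(n_coeffs + 1)
    ])
    return coeffs[1:]  # discard the 0th (energy) coefficient

# Example: 40 mel bands of log energies for one frame
frame = np.log(np.abs(np.random.randn(40)) + 1e-6)
vec = mfcc_vector(frame)
print(vec.shape)  # (12,) -> a compact 12-value spectral-shape descriptor
```

Note that a perfectly flat spectrum gives all-zero coefficients here, which matches the intuition that the higher coefficients describe deviations from the overall shape.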

Let's now talk about some features, some descriptors that relate to the pitch information of a sound.

And the first one is the idea of pitch salience.

15:27

Pitch salience is a measure of the presence of pitched sounds in a signal.

This particular implementation of pitch salience, which is available in Essentia, starts from the spectral peaks, which we already know about. And from there it computes the salience of all possible pitches present.

It does this by summing the weighted energies found at multiples of every particular peak. So it tries to find the possible harmonics that are present for a particular peak, and then it sums all that and computes this pitch salience for every candidate pitch value.

Here we see the magnitude spectrum and how we find the peaks.

We get the amplitudes and frequencies of every peak, and then we have this pitch salience function, which is quite a complex equation.

And here we just see a very general picture of the overall equation for this computation.

But basically, at every peak and for every amplitude of every peak, we apply a weighting function that measures the energy at all the multiples of the fundamental frequency, with the peak being considered as a fundamental frequency.

And then it sums everything together into S[b], which is the salience at every bin frequency that we started with.

So we are basically computing the salience of all possible

frequencies being considered as a fundamental frequency.
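
The idea described above can be sketched very simply in numpy. This is a deliberately simplified version, not Essentia's actual equation: for each candidate fundamental we look for peaks near its integer multiples and sum their magnitudes with a decaying harmonic weight. The function name, tolerance, and decay values are all assumptions for illustration:

```python
import numpy as np

def pitch_salience(peak_freqs, peak_mags, candidates,
                   n_harmonics=5, tol=0.03, decay=0.8):
    """Simplified pitch salience: for each candidate fundamental,
    sum the peak magnitudes found near its integer multiples,
    weighted so that higher harmonics contribute less."""
    salience = np.zeros(len(candidates))
    for i, f0 in enumerate(candidates):
        for h in range(1, n_harmonics + 1):
            target = h * f0
            # peaks within a relative tolerance of this harmonic
            close = np.abs(peak_freqs - target) / target < tol
            if np.any(close):
                salience[i] += (decay ** (h - 1)) * peak_mags[close].max()
    return salience

# Toy example: peaks at 220 Hz and its harmonics
freqs = np.array([220.0, 440.0, 660.0, 880.0])
mags = np.array([1.0, 0.8, 0.6, 0.4])
cands = np.array([110.0, 220.0, 330.0])
s = pitch_salience(freqs, mags, cands)
print(cands[np.argmax(s)])  # 220.0 is the most salient candidate
```

Taking the maximum of the salience values at each frame then gives a single curve over time, which is what the next plot shows.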

And this is the result that we obtain if we only take

the maximum salience at every particular frame.

So at every particular frame we have many salience values, but this idea of pitch salience normally relates to how much of a pitch is present at a particular frame.

So taking the maximum of it is a good measure of how probable it is, let's say, that there is a clear pitched sound at every particular frame.

So this is an orchestral sound.

It is this Chinese orchestra that we have heard before.

And there are many instruments playing together.

Some are pitched sounds, some are percussive sounds.

So by looking at this function, this pitch salience, we can sort of visualize and estimate the presence of the pitched sounds in every frame. And that can be quite useful for characterizing quite a number of sounds.

And then let me talk about another type of feature that is also related to pitch information, and this is the chroma feature.

And in particular, we'll talk about the harmonic pitch class profile.

But chroma, which is a concept used in music perception and music theory, represents the inherent circularity of pitch organization.

The same pitch notes in different octaves have the same chroma.

So when we talk about pitch classes,

we refer to all the pitches that have the same chroma.

And the HPCP, the harmonic pitch class profile,

is a particular implementation of this idea of chroma features.

And it is a distribution of the signal energy across

a predefined set of pitch classes.

So the idea, as this equation shows, again starts from the spectral peaks, with amplitudes A sub p.

And then, by applying a function to those and summing over all possible peaks, we can get a measure of the different pitches that are present within a particular octave.

So the idea of chroma is that we fold everything into one octave, and we can divide the octave into 12 semitones or any other type of frequency quantization.

And this equation and this implementation basically

finds the pitches that have that particular chroma,

that have that particular, let's say, note name.
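
The folding just described can be sketched in a few lines of numpy. This is a bare-bones chroma computation, not the actual HPCP algorithm (which also spreads energy across neighboring bins and considers harmonics); the reference frequency of 440 Hz, which maps the pitch class A to bin 0, is an assumption for this sketch:

```python
import numpy as np

def chroma_from_peaks(peak_freqs, peak_mags, ref_freq=440.0, n_classes=12):
    """Sketch: fold spectral peaks into pitch classes by taking each
    peak's semitone distance from the reference (A = 440 Hz assumed)
    modulo 12, and accumulating the peak magnitude in that class."""
    chroma = np.zeros(n_classes)
    for f, m in zip(peak_freqs, peak_mags):
        if f <= 0:
            continue
        semitone = 12 * np.log2(f / ref_freq)   # distance from A in semitones
        pc = int(round(semitone)) % n_classes   # fold into one octave
        chroma[pc] += m
    return chroma

# Peaks at A (220 and 440 Hz) and D (293.66 Hz)
freqs = np.array([220.0, 440.0, 293.66])
mags = np.array([1.0, 0.5, 0.8])
c = chroma_from_peaks(freqs, mags)
print(np.argmax(c))  # 0 -> the A class accumulates the most energy
```

Note how the two A peaks, an octave apart, land in the same pitch class, which is exactly the octave folding that chroma describes.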

So this is an example of analyzing a sound with the HPCP implementation available in Essentia.

This is a cello sound in which I played two notes; in fact, let's listen to that.

[MUSIC]

So in here, what we see is basically the pitches,

the pitch classes that are present in this fragment.

This is a fragment in which I play basically two strings, a double stop, in which one, the low note, is very stable.

And in fact the values at bin zero that we see here, the redder horizontal line, relate to one of these very stable pitches, which basically is the A sound that is always present.

And then what we see are the other pitches; there is a very strong D sound that is also present throughout. So we see it all throughout, and we also see the other notes a little bit, though it's not very clear.

But it gives us an idea that there are some clear pitches. And by listening to it a little bit, we could get quite a good view of the pitch classes, not the absolute frequencies of the pitches, but the pitch classes, or the notes, that are present in this recording.

Okay, now let's go to multiple frames, so features that require multiple frames to be analyzed.

And let me give you just three examples of things that we could do with

multiple frames.

One is the idea of segmenting an audio recording and

identifying onsets, for example.

Another is to find the prominent pitch, and for the prominent pitch,

we need to see the continuation of the pitch.

And finally, the idea is that we can compute the statistics

of the single frame features but on a larger scale, on a fragment of a sound.

So the segmentation of a recording, for example the identification of the onsets, can be obtained by calculating some spectral features that measure the change in frequency content.

For example, the spectral flux, which is a very common feature used in segmentation, compares two consecutive spectra and then sums over all these differences. This is basically the L1 norm of these differences.

And this can give a measure of the spectral variation, and this can be an indication of where things are changing in the sound.

There are many implementations of this idea of a spectral flux.

And we can develop variations that can focus on a particular aspect.
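
As a minimal sketch of the L1 variant described above, here is a numpy version. Real implementations often differ, for example by half-wave rectifying the differences or using an L2 norm, so take this only as an illustration of the basic idea:

```python
import numpy as np

def spectral_flux(spectra):
    """L1 spectral flux: for each frame, sum the absolute
    differences between the current and previous magnitude spectrum."""
    diffs = np.abs(np.diff(spectra, axis=0))  # frame-to-frame differences
    return diffs.sum(axis=1)                  # L1 norm per frame

# Two identical frames followed by a sudden change
spectra = np.array([[1.0, 2.0, 1.0],
                    [1.0, 2.0, 1.0],
                    [3.0, 0.0, 2.0]])
print(spectral_flux(spectra))  # [0. 5.] -> the change shows up as a peak
```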

For the particular case of identifying the onsets, segmenting the sound by finding where an event or a note starts, there are a number of features that we can use; in fact, the spectral flux could be used for that.

But here, I have put another feature which is the high frequency content.

So what this descriptor does is Find the content,

the high-frequency content.

So how much of the high frequencies is present, and then we compare with the previous frame.

So in the case of identifying the onsets, clearly an onset is a part of the sound in which there is an increase in high frequencies. Most attacks show a higher presence of high frequencies.

So if we identify where we have an increasing presence

of high frequencies, we can detect where the onsets are.
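
This can be sketched with one common definition of high-frequency content, the squared magnitudes weighted by bin index, so energy in the high bins counts more. The function names and the zero threshold are assumptions for this sketch, not a reference implementation:

```python
import numpy as np

def hfc(magnitudes):
    """High-frequency content: squared magnitudes weighted by bin
    index, so energy in the high bins counts more."""
    k = np.arange(len(magnitudes))
    return np.sum(k * magnitudes ** 2)

def onset_candidates(spectra, threshold=0.0):
    """Mark frames where the HFC increases over the previous frame
    by more than the threshold, a simple onset indicator."""
    values = np.array([hfc(s) for s in spectra])
    increase = np.diff(values)
    return np.where(increase > threshold)[0] + 1  # onset frame indices

# A quiet frame, then a broadband attack with strong high bins
spectra = np.array([[1.0, 0.1, 0.1, 0.1],
                    [1.0, 0.8, 0.9, 1.0],
                    [1.0, 0.7, 0.8, 0.9]])
print(onset_candidates(spectra))  # [1] -> the attack frame
```

In practice the increase would be thresholded and smoothed rather than compared against zero, but the principle is the one described above: an onset shows up as a jump in high-frequency energy.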