Sinusoidal modelling aims at representing sounds by means of sinusoids-plus-noise signal models, in order to analyse and synthesise sounds.
Algorithms that represent sounds directly by their Fourier transform spectrum, used for instance for time stretching or fundamental frequency estimation, are not based on an explicit signal model.
Representations based on sinusoidal modelling are a higher level of sound representation, in which perceptually different components, sinusoids and noise, are distinguished.
Signals produced with a source-filter model – typically vocal sounds or strings – are often periodic: they can be represented by a superposition of sinusoids whose frequencies are in harmonic relation. In the case of a forced oscillation[1], which means that the vibrating body
In the case of free oscillation[2] systems – typically percussion instruments – sounds can be described as a superposition of modes. The oscillation is not forced, unlike that of a bowed string. Each mode[3] produces a sinusoid with its own characteristic frequency. If the modes are not too dense, the sound will be perceived as rather clean.
Each sinusoid has an index k, a time-varying amplitude a_k(n) and a time-varying phase φ_k(n).
For a continuous-time sinusoid, the frequency is the time derivative of the phase.
For a discrete-time sinusoid, the frequency is defined as the phase difference between consecutive samples.
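The relation between phase and frequency for a discrete-time sinusoid can be illustrated with a minimal sketch. The function names, the cosine convention and the sample rate are assumptions made for the example, not part of the model itself.

```python
import math

def synthesize_partial(amplitudes, frequencies_hz, sample_rate, initial_phase=0.0):
    """Synthesize one sinusoid from per-sample amplitude and frequency values.

    The phase is accumulated sample by sample, so the difference between
    consecutive phase values is the frequency in radians per sample.
    """
    phase = initial_phase
    phases, samples = [], []
    for a, f in zip(amplitudes, frequencies_hz):
        phases.append(phase)
        samples.append(a * math.cos(phase))  # cosine convention (assumption)
        phase += 2.0 * math.pi * f / sample_rate
    return samples, phases

def frequency_from_phase(phases, sample_rate):
    """Recover the frequency in Hz as the phase difference of consecutive samples."""
    return [(p1 - p0) * sample_rate / (2.0 * math.pi)
            for p0, p1 in zip(phases, phases[1:])]

sample_rate = 8000  # hypothetical value
n = 16
samples, phases = synthesize_partial([0.5] * n, [440.0] * n, sample_rate)
recovered = frequency_from_phase(phases, sample_rate)
```

Here the phase is kept unwrapped; a phase measured modulo 2π would need unwrapping before the difference is taken.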
In this context, any signal can be interpreted as a sinusoid, but a sinusoidal component is perceived as an individual entity only if its perceived quality is close to that of a stationary sinusoid.
The sinusoidal components are detected by means of their amplitude and phase. The resulting set of sinusoidal components is represented as a superposition, which is subtracted from the signal. The result is the residual, that is, in short, the noise.
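With the notation introduced for amplitude and phase, the decomposition can be written compactly. The symbols x(n) for the signal, r(n) for the residual, K for the number of sinusoids and ω_k(n) for the frequency in radians per sample are assumptions introduced here, as is the cosine convention.

```latex
x(n) = \sum_{k=1}^{K} a_k(n)\,\cos\bigl(\varphi_k(n)\bigr) + r(n),
\qquad
\omega_k(n) = \varphi_k(n) - \varphi_k(n-1)
```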
Few algorithms exist that can distinguish between spectral peaks representing sinusoids and peaks representing noise. Common techniques are based on features derived from the form of the phase, the frequency trajectory and the smoothness of the amplitude spectrum. Sinusoids forming amplitude or frequency trajectories that are not sufficiently smooth are removed from the set of sinusoids. This selection is simplified in the case of harmonic sounds, since the sinusoids are expected to lie at integer multiples of the fundamental frequency.
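For harmonic sounds, the selection can be sketched as keeping only the peaks that lie close to an integer multiple of the fundamental frequency. The function name and the tolerance value are hypothetical, and the fundamental frequency is assumed to be known.

```python
def select_harmonic_peaks(peak_freqs_hz, f0_hz, tolerance=0.03):
    """Keep peaks lying close to an integer multiple of the fundamental.

    tolerance is the allowed deviation as a fraction of f0 (hypothetical value).
    Returns (harmonic_number, frequency) pairs.
    """
    selected = []
    for f in peak_freqs_hz:
        harmonic = round(f / f0_hz)  # nearest harmonic number
        if harmonic >= 1 and abs(f - harmonic * f0_hz) <= tolerance * f0_hz:
            selected.append((harmonic, f))
    return selected

peaks = [220.5, 331.0, 439.0, 662.0, 905.0]
harmonics = select_harmonic_peaks(peaks, f0_hz=220.0)
# harmonics == [(1, 220.5), (2, 439.0), (3, 662.0)]
```

The peaks at 331 Hz and 905 Hz are rejected because they are too far from any multiple of 220 Hz.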
The residual combines signal noise and modeling error.
The sinusoidal analysis is performed on the short-time Fourier transform (STFT) of the signal.
A sinusoid creates a prominent local peak, that is, a local maximum of the magnitude spectrum.
In each STFT frame, the spectral peaks are detected by searching for these local maxima.
Amplitude thresholds, among other criteria, may be used to avoid processing a large number of peaks that would be classified as noise.
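Peak detection with an amplitude threshold can be sketched as follows for a single frame. The function name, the threshold value and the example magnitudes are hypothetical.

```python
def pick_peaks(magnitudes, threshold):
    """Return the bin indices that are local maxima above an amplitude threshold."""
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        m = magnitudes[k]
        # Strictly greater than the left neighbour, at least equal to the right
        # one, so that a flat two-bin top is reported once.
        if m > threshold and m > magnitudes[k - 1] and m >= magnitudes[k + 1]:
            peaks.append(k)
    return peaks

# Magnitude spectrum of one frame (made-up values).
frame = [0.0, 0.1, 0.9, 0.2, 0.05, 0.08, 0.03, 0.6, 0.7, 0.1]
peak_bins = pick_peaks(frame, threshold=0.5)
# peak_bins == [2, 8]; the small maximum at bin 5 is discarded by the threshold
```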
For each selected peak, the sinusoidal parameters, amplitude and frequency, are estimated. In many cases, the phase needs to be estimated too, in order to subtract the sinusoid from the sound.
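The text does not prescribe a particular estimator. One widely used option, shown here only as an illustration, is quadratic interpolation of the log-magnitude spectrum around the peak bin, which refines frequency and amplitude beyond the bin resolution. The function name and parameters are hypothetical, and phase estimation is not covered.

```python
def refine_peak(log_mags, k, sample_rate, fft_size):
    """Refine a peak at bin k by fitting a parabola through three bins.

    log_mags holds log-magnitudes (for instance in dB); k must not be the
    first or last bin. Returns (frequency in Hz, interpolated log-magnitude).
    """
    alpha, beta, gamma = log_mags[k - 1], log_mags[k], log_mags[k + 1]
    denom = alpha - 2.0 * beta + gamma
    # Offset of the parabola vertex from bin k, in bins (between -0.5 and 0.5).
    offset = 0.0 if denom == 0.0 else 0.5 * (alpha - gamma) / denom
    peak_log_mag = beta - 0.25 * (alpha - gamma) * offset
    freq_hz = (k + offset) * sample_rate / fft_size
    return freq_hz, peak_log_mag

# Synthetic check: an exact parabola with its vertex at bin 5.3 and height 10.
log_mags = [10.0 - (x - 5.3) ** 2 for x in range(10)]
freq, amp = refine_peak(log_mags, 5, sample_rate=1000, fft_size=100)
```

On real spectra the peak is only approximately parabolic, so the estimate depends on the analysis window and the zero-padding used.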
Once the parameters of the spectral peaks have been estimated in each frame, the peaks have to be connected to form the amplitude, frequency and phase trajectories that are to be synthesized.
As the STFT only yields values for each analysis frame, the values between frames have to be interpolated. Therefore, peaks in consecutive frames are connected to form coherent trajectories.
Although no single algorithm suits all situations, a simple approach connects each peak in the previous frame to the peak closest in frequency in the next frame. To avoid unreasonable jumps, the amplitude and frequency differences between the peaks are compared.
Only peaks that do not exceed a maximum variation for both parameters are connected.
Peaks in the previous frame that remain unconnected belong to dying partials.
Peaks in the next frame without any connection may represent a newborn sinusoid.
The variation thresholds can be adapted to favor smoothness of amplitude and frequency trajectories.
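The nearest-frequency connection rule can be sketched as a greedy matching between two frames. This is a simplified illustration, not a complete tracker. The function name, the peak representation as (frequency in Hz, amplitude in dB) pairs and the threshold values are assumptions.

```python
def connect_peaks(prev_peaks, next_peaks, max_freq_jump, max_amp_jump):
    """Greedy nearest-frequency matching between two frames.

    Each peak is a (frequency_hz, amplitude_db) tuple.
    Returns (links, dying, born): links are (prev_index, next_index) pairs,
    dying lists unconnected peaks of the previous frame, born lists
    unconnected peaks of the next frame.
    """
    links = []
    taken = set()
    for i, (f_prev, a_prev) in enumerate(prev_peaks):
        candidates = [j for j in range(len(next_peaks)) if j not in taken]
        if not candidates:
            break
        # Closest peak in frequency among those still available.
        j = min(candidates, key=lambda c: abs(next_peaks[c][0] - f_prev))
        f_next, a_next = next_peaks[j]
        if abs(f_next - f_prev) <= max_freq_jump and abs(a_next - a_prev) <= max_amp_jump:
            links.append((i, j))
            taken.add(j)
    linked_prev = {i for i, _ in links}
    dying = [i for i in range(len(prev_peaks)) if i not in linked_prev]
    born = [j for j in range(len(next_peaks)) if j not in taken]
    return links, dying, born

prev_frame = [(440.0, -6.0), (880.0, -12.0), (1320.0, -20.0)]
next_frame = [(443.0, -7.0), (875.0, -13.0), (2000.0, -30.0)]
links, dying, born = connect_peaks(prev_frame, next_frame,
                                   max_freq_jump=20.0, max_amp_jump=6.0)
# links == [(0, 0), (1, 1)], dying == [2], born == [2]
```

Lowering the two thresholds makes the resulting trajectories smoother at the cost of more dying and newborn partials.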
Recent algorithms try to incorporate a trajectory model into the peak continuation algorithm.
The sinusoidal parameters, which are estimated at the frame rate, are interpolated to the sample rate. The resulting sinusoids are synthesized and subtracted from the signal.
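A minimal sketch of this step for one partial between two frames is given below. It assumes linear interpolation of amplitude and frequency and a phase obtained by accumulating the interpolated frequency; other interpolation schemes exist, and the text does not specify one. All names and values are hypothetical.

```python
import math

def synthesize_segment(amp0, amp1, freq0_hz, freq1_hz, phase0, hop, sample_rate):
    """Synthesize `hop` samples of one partial between two frames.

    Amplitude and frequency are linearly interpolated from the first frame
    to the next one; the phase accumulates the interpolated frequency.
    Returns (samples, phase at the start of the next frame).
    """
    samples = []
    phase = phase0
    for n in range(hop):
        t = n / hop
        amp = (1.0 - t) * amp0 + t * amp1
        freq = (1.0 - t) * freq0_hz + t * freq1_hz
        samples.append(amp * math.cos(phase))
        phase += 2.0 * math.pi * freq / sample_rate
    return samples, phase

def subtract(signal, sinusoids):
    """Residual = signal minus the synthesized sinusoidal part."""
    return [x - s for x, s in zip(signal, sinusoids)]

sample_rate, hop = 8000, 64  # hypothetical values
partial, end_phase = synthesize_segment(0.5, 0.5, 440.0, 440.0, 0.0, hop, sample_rate)
signal = [s + 0.01 for s in partial]  # the sinusoid plus a small constant offset
residual = subtract(signal, partial)  # only the offset remains
```

The subtraction only leaves a clean residual if the estimated phase matches the signal, which is why the phase has to be estimated as well.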
A dedicated noise model can be fitted to the residual spectrum. A common choice is based on a source-filter model, using a spectral envelope of the residual as the filter and white noise as the excitation.
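The source-filter idea can be sketched for a single frame as follows. Two simplifications are assumed here and are not taken from the text: the spectral envelope is a moving average of the residual magnitude spectrum, and the white-noise excitation is approximated by a flat-magnitude spectrum with random phases. The function names and example values are hypothetical, and the naive inverse DFT is only suitable for very short frames.

```python
import cmath
import math
import random

def spectral_envelope(magnitudes, width=3):
    """Smooth a magnitude spectrum with a moving average (one simple envelope estimate)."""
    n = len(magnitudes)
    env = []
    for k in range(n):
        lo, hi = max(0, k - width), min(n, k + width + 1)
        env.append(sum(magnitudes[lo:hi]) / (hi - lo))
    return env

def synthesize_noise_frame(envelope, rng):
    """Shape a random-phase excitation by the envelope.

    `envelope` holds bins 0..N/2 of a real frame of even length N.
    Returns N real samples.
    """
    half = len(envelope) - 1
    size = 2 * half
    spectrum = [0j] * size
    for k in range(1, half):
        value = envelope[k] * cmath.exp(1j * rng.uniform(-math.pi, math.pi))
        spectrum[k] = value
        spectrum[size - k] = value.conjugate()  # Hermitian symmetry gives a real output
    # The DC and Nyquist bins must be real; only their sign is random.
    spectrum[0] = complex(envelope[0] * rng.choice((-1.0, 1.0)))
    spectrum[half] = complex(envelope[half] * rng.choice((-1.0, 1.0)))
    # Naive inverse DFT, adequate for a short illustrative frame.
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / size)
                for k in range(size)).real / size
            for n in range(size)]

# Residual magnitude spectrum, bins 0..8 of a 16-sample frame (made-up values).
residual_mags = [0.0, 0.2, 0.8, 0.4, 0.3, 0.1, 0.05, 0.02, 0.0]
envelope = spectral_envelope(residual_mags, width=1)
noise = synthesize_noise_frame(envelope, random.Random(0))
```

The synthesized frame has the magnitude spectrum of the envelope but a random phase, so repeated calls give different noise with the same spectral shape.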