ComParser
index
   ComParser  1.37
   audio recognition
ComParser

The software consists of three layers:
  1. Preprocessing layer:
    Spectral analysis and amplitude-measurements, delivers normalised feature vectors to the second layer.
    Used during recognition as well as during training.
  2. Spatiotemporal layer:
    Avalanche structures to recognize sequences of feature vectors.
  3. Pseudo-scorefollowing logic:
    Parsing the network structure, hopping from node to node, sending out MIDI.
Data flows from the first layer to the second, there is no feedback from the second layer back to the first.
Scorefollowing logic turns on and off neural networks in the spatiotemporal layer so separation between second and third layer is not as nice.

Preprocessing layer

Before incoming audio can be processed, it must be translated into a more suitable form. A stream of overlapping spectral snapshots seems the most obvious form for it allows polyphonic and 'disharmonic' sounds to be recognised. ComParser's preprocessing layer extracts feature vectors from a re-ordered FFT-output and some amplitude measurements, and delivers them, at a constant rate, to the second layer.

  fig.1: preprocessing diagram fig.1: ComParser's preprocessing layer.  Incoming audio is transformed into a constant stream of normalized input feature vectors, delivered to the second layer.

Regrouping and pitch invariance

Spectral analysis is done very efficiently by the FFT algorithm, but FFT magnitude-outputs need to be recombined to form a more logarithmic-like frequency scale: Ideally, we would like to have equidistantial (whole-tone, semitone, or quartertone) frequency-bands with some overlap between the bands. ComParser, however, now uses an algorithm called 'varidist' which just groups together (i.e. adds up) the upper FFT bins and may not be ideal yet: distances (i.e. bandwidths) still vary. This 'varidist'-regrouping gives the software some pitch tolerance. Little more pitch invariance can be gained though, by increasing the bandwidth of the 'varidist'-bands. Extreme pitch invariance may be accomplished by running many transposed copies of the avalanche structure in parallel.

Feature vector composition

Besides spectral magnitudes, the RMS difference within one audio-block is included in the feature vector. It chains input feature vectors by encoding attacks, decays and steady states. A peak measurement may further complete our feature vector.
So at any discrete time t, an input feature vector
  Rt = { rt,0, rt,1,... rt,B-1 }
is composed from all regrouped FFT outputs, a differential RMS value and a peak value, for example
  rt,1 =  D RMSt
  rt,2 =  peakt - RMSt
  rt,3 ... B-1 =  varidistt,0 ... V-1
  V  =  B-3
where B is the total number of scalars contained by the raw input vector, and V is the number of frequency bands after FFT regrouping with the varidist algorithm. Notice that, unfortunately, the vector is almost completely unipolar, the RMS difference is the only bipolar dimension.

Normalization and amplitude invariance

In the end of the preprocessing stage, feature vectors are normalized. The second layer requires normalized input, but normalization also gives the network some amplitude-invariance. The degree of amplitude-invariance depends on the way silence is handled by the network and on the datatype (floating- or fixed-point) and accuracy (noise level) of the spectral analysis.
It may be clear that absolute amplitude-invariance (having no silence classification at all) is not desired. (This is practically possible when we let the normalization process boost up the 1-bit-noise-signal from our 'silent' DAC, and classify that broadband noise in the same category as the spectrum of an overflying F16, which may be quite alike).
The current ComParser implementation yields null-vectors at low sound pressure levels, due to integer calculations and down-rounding (which function as noise-gate tresholds) in the analysis routines. The current ComParser version does not explicitly check for null-vectors, but smoothly encodes the transition between silence and non-silence (gradually, up to maximum SPL), by including an extra, never-zero component in the vector, just before normalization. For example, at dimension 0:
  rt,0   =  a
The constant a should be choosen as small as possible (preferably 1 LSB). It should be just enough to prevent division by zero during vector normalization, when the rest of the vector is null. Only when the lowest bits of the analysis data contain too much noise, a may be choosen a little higher (to surpress noise in the other dimensions of the feature vector).
After calculating the absolute value (or magnitude) of the raw feature vector
  Rt | = sqrt b<B
SIGMA
b=0
rt,b2  ,
the normalized vector
  It  =  { it,0it,1, ... it,B-1 } ,    It |  =  1

is calculated by
  0 < = b < B ,   it,b =  rt,b
-
Rt |

Spatiotemporal layer

A variation on Stephen Grossberg's [1] avalanche structure is used for the recognition of spatiotemporal patterns. It provides a tolerant audio recogniser that can be easily implemented, optimised and tuned for a specific purpose.
Exact time-alignment is avoided, things like Dynamic Time Warping, Hidden Markov Models and Viterbi algorithms, quite common to speech recognition, are not used here. ComParser is designed to recognise many different sounds in a very specific order, not in any order.

Avalanche structures

The neural networks in ComParser's second layer are inspired on Stephen Grossberg's formal avalanche [1] (1982), but there are some differences:
Avalanche structure

Simplified diagram of the avalanche structure used in ComParser. p-units perform (mostly spectral) data matching, while q-units take care of sequence matching. When the rightmost processing element's output exceeds a certain treshold, a sequence has been recognised.

An avalanche structure in ComParser consists of a row of processing elements (PEs).
Each processing element is trained to recognise a specific feature vector. During training, the first feature vector of a sound is stored (as weight vector) in the leftmost PE, and so on, up to the rightmost PE, which stores the last feature vector of a sound. During forward propagation, incoming feature vectors are distributed to all PEs. A PE can get activated when its' weight vector matches the incoming feature vector to a certain degree, but this only happens if it was already (a little) active, or when preceding PEs (neighbouring PEs on the left side) are, or were, active. Each PE has memory and reacts a bit slowly. In general, becoming activated takes less time than getting deactivated: PEs exhibit a fast attack- and a slow decay-rate.
When a row of PEs, previously trained by a series of feature vectors, is now presented that same (or very alike) series again, a wave of activity may be observed, passing through the row, from left to right. As soon as the rightmost PE's activity exceeds a certain treshold, the row is said to have recognized the sequence.

Whereas both 'vector matching' and 'sequence checking' tasks are done by single processing elements in the classic avalanche, these tasks are split up in ComParser. In the figure above, three different types of processing units (plus a simple multiplier) can be distinguished:
  1. p-units  calculate the (mainly) spectral match. Each p-unit is trained for a specific input vector (spectral snapshot). The p-unit furthermore subtracts a treshold (minimum required match) and applies a non-linearity (bottom clip).
  2. q-units  just gather (a weighted sum of) output activities from preceding (in place) processing elements. They govern the sequential behaviour of the network.
  3. o-units  or 'output units' apply a non-linearity (top clip) on the output, give it a unit-sample-delay, and take care of a recurrent connection. The rightmost output unit is the layer's output.

Training

Training of an avalanche structure is straightforward.  When we want it to learn sequence

  T = {  IttrainI1+ttrainI2+ttrain,...  IN-1+ttrain }

that started at discrete time ttrain and contains N normalized input vectors, we simply assign these vectors to the weights of successive P-units:

  0 < = n < N ,   Wn = In + ttrain
where
  Wn  =  { wn,0wn,1, ... wn,B-1 } ,    Wn |  =  1
is the normalized weight vector at processing element n.  Weights are regarded time-invariant after training.
During training, we thus create N artificial neurons. Here time is translated to space. After training, however, concepts of time (t) and space (n) are kept strictly separated to avoid any confusion.

Vector clustering

The network may be optimized, during training, by clustering successive input vectors that are very much alike. In practice, this greatly reduces the number of necessary processing elements, and thus saves CPU consumption.
Instead of calculating the euclidean distance between two successive vectors,
  Wn - Wn-1 |  = sqrt b<B
SIGMA
b=0
( wn,b - wn-1,b ) 2  ,
the vector dot product is calculated to decide whether two successive vectors fall in the same category (which saves the CPU B subtractions per vector):
  Wn . Wn-1  =  b<B
SIGMA
b=0
wn,b wn-1,b
If, for example,
  Wn . Wn-1  >  0.92 ,    1 < = n < N ,
neuron n may be cut out of the network, or in other words,
vector In+ttrain is removed from sequence T, and N is decreased by 1.  This step may be iterated.
This very crude form of vector quantisation works fine, and this is how the current version of the software performs it. A more precise method, where average vectors are calculated instead of just skipping look-alikes, would require re-normalization of the average vector. When this is done in the current integer-DSP implementation of ComParser, this doesn't even yield that much more precision, due to quantisation errors during normalization. Anyway, for integer-DSP, the crude clustering algorithm seems sufficient.

Forward propagation

Each time a normalised input feature vector arrives, thus at any time t, it is distributed to all N processing elements.

p-unit processing:
The p-unit determines how good the feature vector matches its' internal weight vector by calculating dot product
  It . Wn = b<B
SIGMA
b=0
it,b wn,b , 0 < = n < N .
Since both input and weight vectors are normalized, a perfect match will result +1.0. The worst possible match will result -1.0, but only if vectors are fully bipolar. Since our vectors are almost completely unipolar, our minimum dot product is just slightly less than 0.0. The p-units furthermore subtract a treshold and apply a non-linearity, their output is given by
  pn,t  =  [ It . Wn - G ] +
where the non-linear function [u ] +  is defined by
  [ u ] + = { u   if   u > 0
0   if   u <= 0
and treshold G is a constant between 0 and 1 that may for example be set to 0.75, to require a 'good' (spectral) match.
So far, this was identical to the formal avalanche, but now come the differences: q-units and gates.
Processing p-, q- and o-units is discussed here in that order, in practice, however,
qt,n is evaluated before pt,n and if qt,n = 0,  the processing of p-unit n is skipped.

q-unit processing:
The q-units control the sequential behaviour of the network, their output is described by
  qt,n =  b ot,n-1 + m<M
SIGMA
m=0
gm ot-1,n-m , 0 < = n < N .
(eq.1) 
where M is the number of forward connections (in figure 2, M = 3),
b is a small positive constant,
gm is a weighting function which decays over m, something like
  gm = -m ,
and ot,n is the output of o-unit n at time t.
Output units with negative position-index do not actually exist but can be defined to produce positive values to enable (re)triggering of the layer, for example:
  s < 0 ,   ot,s = 0.25
(eq.2) 
The exact balancing of b and g is not important here, nor is it critical in practice, the general idea is that activity is gathered from processing elements placed to the left, thus preceding in place, and the farther to the left, the weaker the connection.
Most activity is coming from the previous moment, t-1, some activity, though, is coming from the current time t. If you do this slightly different, when you, for example, leave out b or g, it will probably work as well.

Now equation 1 can be rewritten as a pure recursive function plus a simple addition:
  ct,n  =  (1-k) ct,n-1 +  k ot-1,n
(eq.3a) 
  qt,n =  ct,n + b ot,n-1
(eq.3b) 
where b is the same positive weighting constant as in equation 1,
and constant k, determining the attack and decay distance, is restricted to
  0 < k < 1 ,
for instance (as in ComParser version 0.58),  k = 0.5.

Thus, in an implementation where place variable n is iterated from 0 to N-1, or in other words, when neurons are evaluated from left to right, q-units can much more efficiently be processed with equation 3 than with equation 1. Let's thus forget about equation 1, it was only convenient for modelling our connections in figure 2. This optimization (using eq.3 instead of eq.1) can also be seen as replacing an higher order FIR filter by a 1 pole IIR filter.
The t-1 in equation 3a is just confusing, it is, in essence, a difference equation over place (n), not over time (t).
We can also forget equation 2 and replace it by
  k < 0 ,   ot,k = 0
o-unit processing:
Output units are governed by a difference equation over time.  It is a first order, nonlinear difference equation, given by
  ot,n  =  [ ot-1,n +  A pt,n qt,n - ot-1,n ) ] 1-
(eq.4) 
where A (u) is the attack/decay function defined by
  Au ) = { d u   if  u > 0
a u   if  u <= 0
(eq.5) 
where constants a and d control respectively the attack- and decay-rate:
  0 < a < 1
0 < d < 1
Finally, the clipping function [u ]1-  is defined by
  [ u ]1- = { u   if   u <= 1
1   if   u > 1


Pseudo-scorefollowing

ComParser uses one or more real audio recordings to follow. No actual score (in the form of an abstract, symbolical representation of the audio) is used! The user has to point out several cue-points in her audio recordings (manually).

ComParser can read-in audiofiles and fragments of audiofiles. ComParser can be programmed by the user to send out certain MIDI messages after recognition of an audio fragment.
The software is not very good at locating in realtime exactly where we are in the (pseudo)score. So it cannot be used for very rithmical (MIDI-)accompanyments. This version is especially built to trigger on certain (not too many and not too dense) cue points in some electroacoustical pieces. Just opening a delay line, fade out a reverb by MIDI-commands after recognising the last note of a phrase, things like that, while the computer is following the instrumentalist(s). ComParser essentially recognises ends of phrases. Due to the parallelism in the Avalanche algorithm (which makes it so tolerant regarding variations in timing and dynamics), ComParser may sometimes trigger too early. ComParser is very bad in realising a cue point in the middle of long, sustained note. The software is much better in triggering on spectral fluxes, that is where notes begin, end, or change. No beat-tracking or time-alignment is done by this version.

Conclusions

Avalanche structures appear to be very tolerant regarding time stretching and time compression: tempo-tolerance. They also show a remarkable metre-tolerance (repeated tempo fluctuations).

Recognising (mostly spectral) sequences works just fine with music. However, timbre may be heavily affected by using different microphones, musical instruments, room-acoustics, etc. On string instruments, we even observed failure of recognition when a melody was played on a different string of the same instrument (bass guitar).

When a certain avalanche refuses to recognise new utterances, we may run separate avalanches in parallel, which effectly solves the abovementioned problem of relying too much on the spectrum solely (spectral dependance), but it also solves any other generalisation-problem. It must be mentioned however, that all this manual network-building may involve a lot of time (supervised learning). Furthermore, running more avalance structures simultaneously will take up more CPU.

Premature recognition (triggering too soon) may be a much greater a problem than failing to recognise (you can always put an alternative 'in parallel').

References

[1] Grossberg, Stephen. Learning by Neural Networks. In Stephen Grossberg, editor, Studies of Mind and Brain. D. Reidel Publishing, Boston, MA, pages 65-156, 1982.