To set up ComParser to recognise a certain sound, one needs at least one audio recording
of that sound, and preferably more. One furthermore needs
a so-called network file that refers to these audio files and specifies which
(MIDI) responses should be sent after recognition.
To follow the performance of a musical composition (which is basically a
concatenation of sounds) one needs the same: one or more audio recordings and a network file.
The network structure allows pieces of audio to be ordered in series (sequentially), as well as
in parallel (simultaneously).
Writing, editing and testing a network file amounts to a form of
supervised learning and may take a lot of time.
The software can only roughly follow a musical performance: it does not always know precisely
where in the score "we are". How accurately ComParser
can point out "where we are" largely depends on the number of cue-points (nodes) that the user
specifies in her network file. No tempo-tracking is done, so it is impossible to
cue in the middle of a long sustained note (unless we know beforehand how long that note is
going to be played, in absolute seconds or milliseconds).
ComParser may be used to follow some electroacoustic pieces in which something needs to be
switched only once in a while. Polyphonic input and 'weird' sounds are no problem, but the software
cannot be used for automated MIDI accompaniment of, let's say, BWV 1027
(where you would be playing viola da gamba and ComParser the harpsichord, or the other way around).
That would involve far too many nodes in the network, and even then,
the software would still need some awareness of tempo to accompany a musician on that level.
ComParser is not aware of tempo at all.
The software can neither read nor write MIDI files.
Figure 1: Screenshot of ComParser, just started up on a Silicon Graphics machine.
ComParser's primary user interface is the console.
This section summarizes all available commands. Command lines may be typed in the console
and must be terminated by hitting ENTER or RETURN.
Commands may also be given through a UNIX pipe, or put in a startup file.
The example file startup demonstrates the most important user commands.
In the following summary, square brackets []
should not be typed literally on the console; they just denote that the enclosed
argument is optional. Single quotes, however, should be typed literally to allow names to
contain whitespace.
More precise and up-to-date documentation can be found in the source files
CMP_commands.c and
CMP.c, which define ComParser's console
and MIDI commands. Patching and naming of the user commands may easily be changed there.
All of the above-mentioned commands may also be put in a text file called
startup, which must reside next to the
CMP application. All lines in this file are executed as command lines when
ComParser is started up. Empty lines, however, and lines starting with a #
character are ignored. One may edit this file or write one's own startup file.
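The filtering rule for the startup file can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not ComParser's own code; the function name startup_commands is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
def startup_commands(text):
    """Return the lines of a startup file that would be run as commands."""
    commands = []
    for line in text.splitlines():
        # Assumption: a line holding only whitespace counts as empty.
        if not line.strip() or line.startswith("#"):
            continue  # empty lines and '#' lines are ignored
        commands.append(line)
    return commands


if __name__ == "__main__":
    demo = "# demo startup file\n\nmidi on\nnet load doc/sax2.network\n"
    for command in startup_commands(demo):
        print(command)
```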
Network files, in principle, refer to audio files. However, after a network has been loaded
and avalanche files have been generated, it is possible to throw away the original audio files.
It is, however, better to keep them: avalanche files can then always be regenerated,
even if the analysis methods and/or the syntax of the avalanche files change.
For now, ComParser can only read audio files in the AIFF format.
ComParser also writes only in AIFF format. (With the a option
of the net follow command, one may start a simultaneous audio recording
during following.)
Although filename suffixes like .aif, .AIF,
.aiff, .AIFF, etcetera, are recognised by
the software, audio files do not necessarily have to have a filename suffix.
Since avalanche files are generated automatically when ComParser loads a network for
the first time, one usually does not need to know about them in great detail. First-time
users may skip this section and focus on their audio recordings and the writing of a network file.
Names of avalanche files:
An avalanche filename is derived from the name of the audio file it refers to.
When a complete audio file is converted (when both starttime and endtime are omitted in the
network file), the name of the avalanche file simply becomes the name of the audio file,
with any .aiff-like extension substituted by .avl.
When only a part of an audio file is converted (when
starttime and endtime were specified in the network file), these times are included
in the avalanche filename. Filenames may grow by at most 24 characters!
See for example sax1.network. It refers to 2.8 seconds of audio in recording
sax.aiff, from 0.5 to 3.3 seconds. The name of the avalanche file becomes
sax-00m00s500-00m03s300.avl, and it will be created in
the same directory as the audio file.
Derivation of avalanche filenames in ISO-EBNF:
audio filename = name, ['.',('A'|'a'),('I'|'i'),('F'|'f'),['F'|'f']];
avalanche filename = name, [starttime, endtime], '.avl';
starttime = time;
endtime = time;
time = '-', minutes, 'm', seconds, 's', milliseconds;
minutes = 2 * decimal digit (* And also < 60. *);
seconds = 2 * decimal digit (* And also < 60. *);
milliseconds = 3 * decimal digit (* Implies < 1000. *);
decimal digit = '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
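The derivation above can be sketched in Python as follows. This is a minimal illustration, not ComParser's own code: the function names are hypothetical, and rounding to the nearest millisecond is an assumption.

```python
import re

# Optional .aif/.aiff suffix in any letter case, as in the grammar above.
_AIFF_SUFFIX = re.compile(r"\.[Aa][Ii][Ff][Ff]?$")


def format_time(seconds):
    """Format a time in seconds as '-MMmSSsmmm'."""
    total_ms = round(seconds * 1000)  # assumption: round to whole milliseconds
    minutes, rest_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(rest_ms, 1000)
    if minutes >= 60:
        raise ValueError("the grammar only allows times below 60 minutes")
    return f"-{minutes:02d}m{secs:02d}s{millis:03d}"


def avalanche_filename(audio_filename, starttime=None, endtime=None):
    """Derive the avalanche filename for a whole audio file or a fragment."""
    name = _AIFF_SUFFIX.sub("", audio_filename)
    if starttime is None and endtime is None:
        return name + ".avl"  # whole file: only the suffix changes
    if starttime is None or endtime is None:
        raise ValueError("give both starttime and endtime, or neither")
    return name + format_time(starttime) + format_time(endtime) + ".avl"


if __name__ == "__main__":
    print(avalanche_filename("sax.aiff", 0.5, 3.3))
```

For the sax1.network example, avalanche_filename("sax.aiff", 0.5, 3.3) yields sax-00m00s500-00m03s300.avl. The two time stamps take 10 characters each and the .avl suffix takes 4, which accounts for the maximum growth of 24 characters.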
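Users of Mac OS 7, 8 and 9 should take care not to use more than 7 characters in their
original audio filenames (any .aiff-like suffixes excluded): filenames on these systems are limited
to 31 characters, and the avalanche filename may add up to 24.
Under IRIX, long filenames are less of a problem.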
Content of avalanche files:
Avalanche files are regular text files that may be edited manually. One may insert, delete and alter
data lines, and insert or remove comments. Most of the time, however, it won't be necessary to alter
avalanche files manually: it is easier to make changes to the network file instead.
Empty lines and lines that start with a # character are ignored.
Before the data lines, two header lines must be present in the file. Their order
does not matter, but one line should say fps=<fps>
and the other should say bfq=<bfq>, where
<fps> is a fractional number specifying the number of
feature vectors used per second during audio analysis,
and <bfq> is a fractional number specifying the
base frequency of the audio analysis.
After these two header lines, data lines follow. A data line consists of an integer greater
than zero, followed by a colon, followed by a series of integers separated by space characters.
Values range from -2560 to +2560. The number of columns is the same for all lines.
The number before the colon is the number of clustered feature vectors. The numbers after the
colon represent the feature vector.
The first number after the colon is a measure of silence.
The second number represents the RMS difference (within one single frame, not between frames).
From the third number on, all numbers are linear spectral magnitudes.
After the colon, at least one space character must be present. Finally, the file should end with
one final newline and/or carriage return character.
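As an illustration of the format just described, the following Python sketch reads an avalanche file into memory. It is not ComParser's parser (that one is documented in CMP_avalanche.h). The function name is hypothetical, the sample values in the demo, including bfq=55.0, are invented, and stripping leading whitespace before testing for # is an assumption.

```python
def parse_avalanche(text):
    """Parse avalanche text into (fps, bfq, rows); a row is (count, vector)."""
    fps = bfq = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()  # assumption: surrounding whitespace is harmless
        if not line or line.startswith("#"):
            continue  # empty lines and comment lines are ignored
        if line.startswith("fps="):
            fps = float(line[4:])
        elif line.startswith("bfq="):
            bfq = float(line[4:])
        else:
            if fps is None or bfq is None:
                raise ValueError("both header lines must precede the data")
            count_text, _, vector_text = line.partition(":")
            count = int(count_text)
            vector = [int(value) for value in vector_text.split()]
            if count < 1:
                raise ValueError("the cluster count must be greater than zero")
            if any(not -2560 <= value <= 2560 for value in vector):
                raise ValueError("values must lie between -2560 and +2560")
            if rows and len(vector) != len(rows[0][1]):
                raise ValueError("all data lines need the same column count")
            rows.append((count, vector))
    if fps is None or bfq is None:
        raise ValueError("missing fps= or bfq= header line")
    return fps, bfq, rows


if __name__ == "__main__":
    # Invented sample data, for demonstration only.
    demo = "fps=31.25\nbfq=55.0\n3: 120 -14 300 250\n1: 80 9 310 260\n"
    print(parse_avalanche(demo))
```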
Avalanche files may also be generated manually with the
deprecated audio avalanche command. This is discouraged, however, because the original audio is likely to get lost this way.
It is often better (and much easier) to leave avalanche file generation up to the software
and work with audio files directly.
As long as the user keeps her original audio recordings, she may throw away avalanche text files at
any time. UNIX users may use the rm *.avl command to do so.
The format of the avalanche text files is also described in header file
CMP_avalanche.h.
To facilitate visual interpretation of avalanche files, they may be converted to JPEG images
with the net info jpg command.
JPEG files are generated in the same directory as the originating avalanche textfiles.
Their filenames are identical to the avalanche filenames, except that the .avl extensions
are substituted by .jpg.
JPEG files show the same data as the avalanche files, only rotated 90 degrees counterclockwise.
Figure 2: Example of an avalanche text file converted
to a JPEG file. Data originates from avalanche file
sax-00m00s500-00m03s300.avl,
which in its turn was extracted from audio file sax.aiff.
The bottom row represents silence: the darker the blue, the quieter the sound.
The second row (from bottom) represents the RMS difference
and is the only bipolar scalar in the vector.
The third row (from bottom) represents the energy in the lowest frequency band;
the top row shows the energy in the highest frequency band.
Time progresses from left to right.
Red is used for negative values, blue for positive values.
A network file tells ComParser which pieces of audio are expected, in what order
they should appear, and which MIDI messages should be sent out after recognition of each of
these pieces. The network structure allows new pieces of audio to be added in parallel,
which may be necessary as soon as new utterances appear not to be recognised.
Only a single network can be loaded at a time. The filename suffix
.network may be used to distinguish network files, but this is
not mandatory.
A network consists of a linear sequence of nodes and interconnecting avalanches.
Nodes coincide with cue-points in the score and each may contain some MIDI messages. During following, ComParser acts like a
finite state machine: hopping from node to node, sending out MIDI after arriving at
each node.
At least one node must be present in a network, and a network must both start and end with
a node. Between two successive nodes, at least one avalanche must be present.
An avalanche is (a fragment of) an audio file converted to a series of clustered feature vectors
for audio recognition.
Figure 3: Network structure: a single avalanche between the first
two nodes; three avalanches in parallel between node1 and node2; and two avalanches in parallel between
node2 and node3.
When ComParser loads a network file into memory, it first looks for avalanche text files
that may have been generated earlier. When all
necessary avalanche files are found, this is all ComParser needs (one may throw away one's audio files).
When, however, not all avalanche files are present, the missing ones will be generated
in the same directory as the originating audio file(s), provided of course that these audio files can be found.
Network files are regular text files that have to be written and edited manually.
A network file consists of lines. Empty lines and lines starting with a # character
are ignored.
The first line that is not ignored should specify the feature vector rate
of the preprocessing layer, for the whole network, in cycles per second: a fractional number
between 5 and 200. Because ComParser doesn't provide samplerate conversion in general,
only certain combinations of samplerate and feature vector rate are allowed. When, for example,
the first line of a network file says fps=31.25, only 16 and
32 kHz audio files can be analysed, and realtime input is also limited to 16 or 32 kHz:
fps=15.625: sr=8000 | sr=16000 | sr=32000
fps=31.250: sr=16000 | sr=32000
fps=62.500: sr=32000
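The table above can be turned into a small lookup, sketched here in Python. The names are hypothetical and only the three listed feature vector rates are covered; whether other rates between 5 and 200 are accepted is not stated here. The helper samples_per_vector is plain arithmetic on the table, not a statement about ComParser's internals.

```python
# Allowed sample rates (Hz) per feature vector rate, copied from the table.
ALLOWED_SAMPLE_RATES = {
    15.625: (8000, 16000, 32000),
    31.25: (16000, 32000),
    62.5: (32000,),
}


def is_allowed(fps, sample_rate):
    """Return True if this fps/samplerate combination is listed in the table."""
    return sample_rate in ALLOWED_SAMPLE_RATES.get(fps, ())


def samples_per_vector(fps, sample_rate):
    """Number of audio samples that one feature vector spans."""
    return sample_rate / fps


if __name__ == "__main__":
    for fps, rates in ALLOWED_SAMPLE_RATES.items():
        for sample_rate in rates:
            print(fps, sample_rate, samples_per_vector(fps, sample_rate))
```

Running the arithmetic shows that every listed combination corresponds to 512, 1024 or 2048 samples per feature vector.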
Then, at least one node line should follow: starting with the node number,
followed by a colon, optionally followed by some MIDI messages and/or pauses.
MIDI messages should be written as raw decimal bytes, separated by one or more
space or tab characters. Pauses, in milliseconds, may be put in between MIDI messages
(not between individual MIDI-bytes inside a message!) by enclosing up to 3 decimal
digits by parentheses.
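To make the node line syntax concrete, here is a minimal Python sketch that splits a node line into MIDI byte runs and pauses. It is not ComParser's parser (see CMP_network.h for the precise syntax). The function name is hypothetical, the example line is made up rather than taken from the provided network files, and the sketch assumes that pauses are separated from neighbouring bytes by whitespace.

```python
import re

_PAUSE = re.compile(r"\(([0-9]{1,3})\)")  # up to 3 decimal digits in parentheses
_BYTE = re.compile(r"[0-9]+")


def parse_node_line(line):
    """Split a node line into (node_name, events).

    Events are ('midi', [bytes]) or ('pause', milliseconds) tuples; bytes that
    follow each other without a pause in between are collected into one run.
    """
    name, colon, rest = line.partition(":")
    if not colon or not name.strip():
        raise ValueError("a node line starts with a node number and a colon")
    events = []
    for token in rest.split():  # splits on spaces and tabs
        pause = _PAUSE.fullmatch(token)
        if pause:
            events.append(("pause", int(pause.group(1))))
        elif _BYTE.fullmatch(token) and int(token) <= 255:
            if events and events[-1][0] == "midi":
                events[-1][1].append(int(token))
            else:
                events.append(("midi", [int(token)]))
        else:
            raise ValueError(f"unexpected token {token!r}")
    return name.strip(), events


if __name__ == "__main__":
    # Hypothetical line: note-on for middle C (144 60 100), a 500 ms pause,
    # then the matching note-off (128 60 0).
    print(parse_node_line("1: 144 60 100 (500) 128 60 0"))
```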
A network file should also conclude with a node line. Between node lines, always
one or more avalanche lines should be present.
An avalanche line starts with one or more horizontal whitespace characters,
followed by the pathname of an audio file or avalanche file, optionally followed by a starttime
and an endtime. Pathnames must be given relative to the application, or absolute.
Times can be notated in only one way: 2 decimal digits specifying the
number of minutes, followed by a colon, then again 2 decimal digits specifying the
number of seconds, followed by a point, concluded with 3 decimal digits specifying the
number of milliseconds.
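The time notation can be checked and converted with a short regular expression, sketched below in Python. The function name is hypothetical, and rejecting seconds of 60 or more is an assumption carried over from the avalanche filename grammar.

```python
import re

# MM:SS.mmm with exactly 2, 2 and 3 decimal digits.
_TIME = re.compile(r"([0-9]{2}):([0-9]{2})\.([0-9]{3})")


def parse_time(text):
    """Convert 'MM:SS.mmm' to a whole number of milliseconds."""
    match = _TIME.fullmatch(text)
    if match is None:
        raise ValueError(f"not in MM:SS.mmm notation: {text!r}")
    minutes, seconds, millis = (int(group) for group in match.groups())
    if seconds >= 60:  # assumption: seconds stay below 60
        raise ValueError("seconds must be below 60")
    return minutes * 60_000 + seconds * 1000 + millis


if __name__ == "__main__":
    print(parse_time("00:00.500"), parse_time("00:03.300"))
```

In this notation, the fragment from 0.5 to 3.3 seconds would be written as 00:00.500 and 00:03.300.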
The syntax of a network file is described more precisely in header file
CMP_network.h.
The next section discusses some example network files.
The following examples demonstrate essential cases of behaviour:
- Automated classification and the problem of inappropriate recognition.
- Manual classification as the solution to the problem of failure of recognition.
- Pseudo score-following.
- Using Pure Data instead of the commandline ComParser.
The documentation directory (doc/) contains all the necessary audio
recordings and network files for the reader/user to repeat the following 3 experiments.
When you are using the PD-version of ComParser, you don't need any special setup
and you can skip the following section.
Setting up equipment to test the commandline-version of ComParser:
To test the software's realtime behaviour, you need to play back the provided audio files
to ComParser's realtime input. When you want to use an audio file player on the same
computer that runs ComParser, you can connect your computer's audio-out directly
to its audio-in. Or, perhaps, you can create some feedback route via the mixing desk attached
to your computer. Otherwise, you may need to first copy the provided AIFF files to audio CD
or to tape, or you may even use a second computer for playing back audio. ComParser only
listens to the first input channel, so connecting left-out to left-in should be sufficient.
Figure 4: Wire audio-outputs to audio-inputs.
Take care to switch off any 'software-monitoring' or 'audio-thru' on your computer
before creating any external feedback loop!
In the examples, UNIX-style pathnames are used (/). Mac and MSDOS users may have to
replace forward slashes (/) with colons (:) or backslashes (\), respectively.
Example 1: Generalisation (automated classification):
Audio recording sax.aiff is used as source material
for this example. Hein Pijnenburg plays a phrase of 7 notes (on the dominant) on alto saxophone.
He repeats it 4 times, varying it only slightly, and then concludes with 3 variations,
ending on the tonic.
Network file sax1.network trains only the first utterance of
the phrase. The file declares just two nodes, named 0 and 1,
and a single avalanche in between the nodes, referring to the audio from 0.5 seconds to 3.3 seconds
in the audio file. Furthermore, a MIDI response is attached to node 1:
it will cause a middle C to be sent after the recognition of the phrase.
The cue point for triggering is set immediately after the (notated) f-sharp,
at the beginning of the rest.
When one loads network sax1.network into ComParser and
plays back the complete recording sax.aiff, it recognises the trained 'theme'
5 times, as one would expect. But, as the file plays on, one may also observe ComParser
triggering a sixth time, just after the last note (notated g). That is wrong!
Or, apparently, the 3 final phrases together share some similarity with the 'theme'?!
Figure 5: Successful and inappropriate recognition. Only the first
phrase (red) from audio recording sax.aiff is trained in network
sax1.network. All repetitions and slight variations are recognised (blue),
as expected. An unexpected recognition occurs at the last note of the recording (broken blue).
Transcriptions of the audiofile are also available as sax.pdf,
sax.ps and sax.abc.
You can repeat the above experiment by starting up ComParser, loading the network,
and then starting the audio preprocessor and the score follower, by typing:
./CMP
net load doc/sax1.network
audio on
net follow r=10 d
This instructs ComParser to follow the loaded 'pseudo-score' 10 times, and the
d option causes neural activities to be displayed on the console.
The midi on command may be omitted because responses will be visible on the console.
When you now play back the complete audio file sax.aiff to ComParser's
input, it should respond 5 times, outputting lines like:
NODE '0' 111111111111111111
NODE '1' Last node reached, 10 time(s) again.
NODE '0' 1111111111111111111111
NODE '1' Last node reached, 9 time(s) again.
NODE '0' 1111111111111111111
NODE '1' Last node reached, 8 time(s) again.
NODE '0' 1111111111111111111111
NODE '1' Last node reached, 7 time(s) again.
NODE '0' 111111111111111111
NODE '1' Last node reached, 6 time(s) again.
And a sixth time, near the end of the audiofile:
NODE '0' 1111111111111111111
NODE '1' Last node reached, 5 time(s) again.
Example 1 demonstrates ComParser's tolerance, especially with respect to timing. All 5 utterances
of the theme are recognised successfully: legato or staccato, fast or slow, with a grace note or even a
different rhythm, for ComParser it is all the same. The system thus shows generalising capabilities.
The fact that the last 3 phrases (together) are incorrectly recognised as being the 'theme' also shows
the downside of this tolerance: neurons may fire unexpectedly, and it can go very wrong!
The inappropriate sixth firing can be explained by (inner) melodic similarities and the fact that, for the software,
the last (notated) note g is not so different from the (notated) f-sharp that it learned.
Example 3 shows that excessive tolerance poses no problem in
pure score-following of the above music (as a whole): after having heard the 'theme' 5 times,
ComParser will no longer expect it to sound.
Nevertheless, undesired triggering may become a problem in the end, and it may become necessary
to decrease tolerances. That can only be done in the source code! In this case, for example,
one might want to decrease the decay time of the individual neurons (to make the avalanche
structure more time-critical). Or, one might want to decrease bandwidths within the feature vector
(to make it more precise with regard to pitch).
Decreasing tolerances may (eventually) lead to the next problem: failure of recognition.
Example 2: Supervised learning (manual classification):
Whereas example 1 demonstrated excessive generalisation, example 2 demonstrates the opposite:
failure of recognition. In this example, 2 separate audio recordings of an electric bass guitar are used:
- bass_hi.aiff A major scale 2 octaves up, starting on the A-string (bright).
- bass_lo.aiff A major scale 2 octaves up, starting on the E-string (darker).
Network file bass1.network trains only the brighter sounding scale.
When you load this network file by typing
./CMP
net load doc/bass1.network
audio on
net follow r=100 d
and play back the same recording bass_hi.aiff to it, it will (of course) recognise it.
But when you present recording bass_lo.aiff, it will (almost certainly)
fail to recognise it.
Although failure of recognition may at first sight seem more problematic than exaggerated recognition,
it is less of a problem, and ComParser offers a very pragmatic solution:
manual labeling, supervised learning, whatever you may want to call it.
You may put avalanche layers in parallel, for simultaneous recognition. This idea of side-tracks
is visualised in figure 3 above.
To actually test this solution, you can now release bass1.network and load
network file bass2.network (which trains both bass guitar recordings)
by typing:
net free
net load doc/bass2.network
net follow r=100 d
Then, ComParser should recognise both the bright bass_hi.aiff
and the dark bass_lo.aiff recordings.
Notice that no start- and end-times are specified in these network files. Because the author of
this documentation took the time to split up the audio recording into 2 separate audio files,
the network file(s) can simply refer to whole files, which makes writing and maintaining
network files a lot easier.
When generalisation fails, one can generally do one of 2 things:
- Increase tolerance by altering internal thresholds, time constants and the like. We have seen,
however, that this can lead to unexpected (i.e. too tolerant) behaviour.
- Apply the manual labeling technique: learn the new audio utterance by putting it in parallel
to an utterance already in the network. There are 2 downsides to this:
  - Supervised learning is laborious: looking up start- and end-times, audio-editing, etc.
  - Putting avalanche structures in parallel in a ComParser network file increases CPU-usage
  (linearly).
Example 3: Score-following
The third example is again based on audio recording sax.aiff.
Network file sax2.network trains the whole audio recording
(not just the first phrase as was done in the first example).
Although example 1 already showed that it is not strictly necessary, all timing-variations
are now explicitly trained. Eight cue points are positioned at the ends of the phrases.
You can test whether ComParser is able to score-follow the above saxophone part.
Start up ComParser, load network file sax2.network
and start the audio and recognition layers by typing:
./CMP
midi on
net load doc/sax2.network
audio on
net follow d
Since responses can also be observed from the command line, it is not really necessary
to switch on MIDI.
When you now play back audio file sax.aiff,
ComParser should respond 8 times. Each cue point is located immediately after
the last note of a phrase, so ComParser is expected to trigger at the beginning
of each rest.
Example 4: Pure Data external
ComParser can also be run as a PD external.
The commands for the PD object are similar to the commands for the standalone ComParser.
Figure 8 below shows a PD patch to pseudo-score-follow sax.aiff.
It is exactly the same as the previous example (3), but there is no need for realtime audio input:
having only audio-out is enough to test realtime behaviour.
Figure 8: Running the comparser~ object in Miller Puckette's Pure Data environment.
All PD-related stuff can be found in subdirectory src/pd/.