The Listening Machine

One aspect of the Good Recording project is to develop algorithms which will be able to ‘listen’ to audio and make judgements on its quality.  I thought it would be interesting to look into the history of machines which can listen and act upon audio.  This application area is known as machine audition.  The best known modern algorithm is Apple’s speech recognition assistant Siri, but machine audition appears in other aspects of our lives too.  Think of the song identification applications Shazam and SoundHound, which are great for identifying a song you just heard on the radio.  These devices and algorithms are sound identifiers or classifiers: a sound is recorded and then classified, perhaps identified as a particular piece of music or a particular word, or classified as being in a particular style of music or language.

The information content even within a very small section of audio is huge: one second of digital audio can contain 44,100 individual samples of the acoustic wave.  This data is therefore compressed to reduce its size so that it is much easier to represent and classify.  Music identification methods work by producing an audio fingerprint, a sparse representation of the frequency and location of the loudest sounds.  A recorded sound can then be significantly compressed while retaining sufficient information that, when the fingerprint is compared with a database, a match can be found if the song has been stored.  These algorithms are so well designed that even the presence of additional sounds or distortion will not prevent a match.  However, the pattern matching is very precise, as it is based on individual music recordings, and so it will be unable to identify a band playing a song live, where the timing will be subtly different.  That is of course assuming the band is not miming!
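
To make the fingerprint idea concrete, here is a minimal sketch of constellation-style peak extraction, assuming NumPy and SciPy are available; the function name and parameters are illustrative, not the code used by Shazam or SoundHound:

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import stft

def fingerprint(samples, fs=44100, neighborhood=15):
    """Return (frequency bin, time frame) pairs of local spectral peaks.

    A toy version of the 'constellation' idea: keep only the loudest
    points in the spectrogram and discard everything else.
    """
    _, _, spectrum = stft(samples, fs=fs, nperseg=1024)
    magnitude = np.abs(spectrum)
    # A point is a peak if it equals the maximum over its local neighborhood
    peaks = magnitude == maximum_filter(magnitude, size=neighborhood)
    # Discard quiet peaks (e.g. in silence) to keep the representation sparse
    peaks &= magnitude > magnitude.mean()
    freq_bins, time_frames = np.nonzero(peaks)
    return list(zip(freq_bins, time_frames))
```

In a full system these peak coordinates would then be hashed (for instance as pairs of nearby peaks) and looked up in a database of fingerprints computed from known recordings.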

Modern applications use advanced processing techniques to identify and classify audio.  One of the earliest machines which could classify audio was Bell Laboratories’ machine ‘Audrey’.  This machine was developed in the 1950s and was able to recognize individual digits.  This was achieved by detecting the first and second formants of speech; based on how the formants changed relative to one another, a pattern was extracted and stored in an array of vacuum tubes.  The voltages from the vacuum tubes then represented a pattern which was compared with a database (a series of circuits) and the digit was identified.
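
As a rough illustration of the kind of template matching Audrey performed (in analogue circuitry, not software), the sketch below compares a measured formant trajectory against stored reference patterns; the digit templates and formant values are made-up placeholders:

```python
import numpy as np

# Made-up reference patterns: coarse first/second formant trajectories
# (in Hz) for two digits, standing in for the voltage patterns Audrey
# held in its vacuum tubes. These numbers are illustrative only.
TEMPLATES = {
    "one":  np.array([[310, 440, 340], [2020, 1020, 2200]]),
    "nine": np.array([[300, 700, 300], [2200, 1200, 2200]]),
}

def classify_digit(formant_track):
    """Return the digit whose stored formant pattern is closest."""
    distances = {digit: np.linalg.norm(formant_track - template)
                 for digit, template in TEMPLATES.items()}
    return min(distances, key=distances.get)

print(classify_digit(np.array([[320, 450, 330], [2000, 1050, 2150]])))  # "one"
```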

Pattern matching is still used today for speech recognition, but modern algorithms have improved considerably on these initial experiments at Bell Laboratories.  To improve accuracy and allow much larger word databases, modern algorithms use a technique known as statistical pattern recognition.  This makes the matching more flexible: for example, it allows words spoken in different accents or at different speeds to produce the same match.  This is where speech recognition differs from the recorded music identification task.
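
The statistical models themselves (such as hidden Markov models) are too involved for a short example, but the basic idea of time-flexible matching can be seen in dynamic time warping, a related classical technique that lets the same word spoken at different speeds still align; this is an illustrative sketch, not the method of any particular recognizer:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    The cost matrix allows one sequence to be locally stretched or
    compressed in time, so the same word spoken slowly and quickly
    can still produce a small distance.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j],      # skip ahead in a
                                 cost[i, j - 1],      # skip ahead in b
                                 cost[i - 1, j - 1])  # step both together
    return cost[n, m]
```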

A machine that feels?

Identification and classification of sounds is one thing, but we are interested in developing a machine which can not only identify and classify sounds but also interpret and react to them in the same way a human would.  This implies that the machine must have a level of intelligence above a simple pattern matching algorithm.  For Siri and other speech recognition systems the machine intelligence is a separate processing block: once the words have been identified, the intelligence is added to the system by means of language interpretation and internet search.

Our goal is to build into the machine signal processing which mimics the operation of the human auditory system.  We understand a lot about how acoustic signals are converted from a pressure vibration into nerve firings within our auditory system, and modern audio compression techniques rely on this knowledge to remove information from the audio signal while retaining a perceptually similar or identical experience.  However, we know much less about the processes within the brain by which the auditory information is converted into an opinion and a feeling for the listener.  To build a machine that can simulate this we must better understand what affects the perception of audio quality.  This is what we hope to achieve by carrying out perceptual tests, asking people what they think about various sounds, and using this information to build an intelligent algorithm which can predict how a human would react.
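
As a sketch of how perceptual test data might feed such an algorithm, the example below fits a simple least-squares model mapping measured audio features to listener scores; the features, score values and linear model are all placeholder assumptions, not our actual method:

```python
import numpy as np

# Hypothetical data: rows of measured audio features (noise level,
# bandwidth in Hz, a distortion measure) paired with mean listener
# quality scores (1-5) from perceptual tests. All values are placeholders.
features = np.array([[0.02, 15000, 0.10],
                     [0.30,  8000, 0.40],
                     [0.10, 12000, 0.20],
                     [0.05, 14000, 0.15]])
scores = np.array([4.6, 2.1, 3.5, 4.2])

# Least-squares linear model: the simplest possible 'listener' mapping
# measured features to a predicted quality score.
X = np.column_stack([features, np.ones(len(features))])  # append bias term
weights, *_ = np.linalg.lstsq(X, scores, rcond=None)

def predict_quality(feature_vector):
    return np.append(feature_vector, 1.0) @ weights

print(predict_quality([0.08, 13000, 0.18]))
```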

But audio quality is highly contextual: the acceptable audio quality of a telephone call and of a classical music reproduction are very different.  In this respect we must therefore build a level of intelligence into the machine, perhaps where identification of sounds alters its mode of operation from (for example) telephone analysis to classical music analysis.
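
A toy sketch of this context-switching idea is shown below; the bandwidth threshold, feature names and per-context quality models are all hypothetical:

```python
def classify_context(features):
    """Hypothetical content classifier: narrow-band speech vs. wide-band music.

    A real system would use far richer cues; here a single made-up
    bandwidth threshold stands in for the identification step.
    """
    return "telephone" if features["bandwidth_hz"] < 4000 else "classical_music"

# Each context gets its own (placeholder) quality model, reflecting the
# very different expectations for a phone call and a concert recording.
QUALITY_MODELS = {
    "telephone": lambda f: 5.0 - 10.0 * f["noise_level"],
    "classical_music": lambda f: 5.0 - 30.0 * f["noise_level"],
}

def judge_quality(features):
    context = classify_context(features)
    return context, QUALITY_MODELS[context](features)

print(judge_quality({"bandwidth_hz": 3400, "noise_level": 0.1}))
```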