Our work on the perception and automated detection of microphone wind noise has been published in the Journal of the Acoustical Society of America. The paper discusses how listeners perceive wind noise and uses this to form the basis of a wind noise detector/meter for analyzing audio files. You can access the journal here:
Or, if you don’t have access, the paper will also be available here within the next couple of days.
‘The Pips’ are a series of six short tone bursts transmitted on Radio 4. Known as the Greenwich Time Signal, they are intended to mark the start of the hour accurately. They have been transmitted since 1924 and originate from an atomic clock.
On 21st July 2014 a listener wrote to the Radio 4 programme ‘PM’ to ask why the pips had been changed. The programme played the offending pips alongside the originals (here is a link to the programme; the item is at 28m 31s: http://www.bbc.co.uk/programmes/b049y9pn).
Here is an ‘old’ pip:
and a ‘new’ pip:
You may think that the ‘new’ pip sounds harsher. By looking at the waveforms and spectra we can begin to understand what has happened. Here are the waveforms of the two pips,
Waveforms of the two pips
and the two spectra.
Frequency Spectra of the two pips
We can see from the spectra that the ‘new’ pip contains additional lines, known as harmonics. Comparing the two waveforms, the ‘new’ pips look similar to the old ones except that the peaks of the waveform have been flattened, or ‘clipped’, a little.
This clipping is a form of distortion. It occurs when the gain applied to the signal is too great, or when a fault in a preamp means the amplifier can no longer faithfully reproduce the input signal. We can clearly hear the difference between the two signals, and according to the concerned listener (and his cat) it has a very negative impact on the sound quality. Denis Nolan, the network manager for Radio 4, identified the fault as being due to a particular desk the signal was passing through.
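To see why flattening the peaks adds harmonics, here is a minimal sketch in Python (standard library only). It hard-clips a pure tone standing in for a pip and measures total harmonic distortion (THD) with a single-frequency DFT. The 1 kHz pitch, 48 kHz sample rate and clip levels are assumptions chosen for illustration, not measurements of the broadcast chain. Because this clipping is symmetric, only odd harmonics appear.

```python
import math

def tone(freq_hz, sample_rate, n_samples, amplitude=1.0):
    """A pure sine tone: an unclipped pip is essentially this."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

def hard_clip(signal, limit):
    """Flatten every peak beyond +/- limit, as an overloaded stage would."""
    return [max(-limit, min(limit, x)) for x in signal]

def component_amplitude(signal, freq_hz, sample_rate):
    """Amplitude of one sinusoidal component, via a single-frequency DFT.
    Exact when the signal holds a whole number of cycles of freq_hz."""
    re = im = 0.0
    for i, x in enumerate(signal):
        phase = 2 * math.pi * freq_hz * i / sample_rate
        re += x * math.cos(phase)
        im -= x * math.sin(phase)
    return 2.0 * math.hypot(re, im) / len(signal)

def thd(signal, fundamental_hz, sample_rate, n_harmonics=10):
    """Total harmonic distortion: RMS of harmonics 2..n relative to the fundamental."""
    fund = component_amplitude(signal, fundamental_hz, sample_rate)
    harmonics = [component_amplitude(signal, k * fundamental_hz, sample_rate)
                 for k in range(2, n_harmonics + 1)]
    return math.sqrt(sum(h * h for h in harmonics)) / fund

if __name__ == "__main__":
    fs, f0 = 48000, 1000          # 0.1 s of a 1 kHz tone = exactly 100 cycles
    pip = tone(f0, fs, 4800)
    for limit in (1.0, 0.9, 0.7, 0.5):
        print(f"clip at {limit:.1f}: THD = {100 * thd(hard_clip(pip, limit), f0, fs):.2f} %")
```

Lowering `limit` (clipping harder) raises the THD, but even mild flattening of the peaks produces clearly measurable harmonic lines.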
In our project we are writing an algorithm to perform a similar function to the upset listener. We don’t mean that our algorithm will write pithy letters to Eddie Mair; we want an algorithm that automatically detects when something like this has gone wrong and the sound is being distorted. Our approach is to simulate many types of fault on many different types of sound, and then look for ‘features’ of the audio that depend strongly on these faults. We can then build automated systems that search for these features to locate faults, and try to estimate how severe each error is from the features themselves.
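The simulate-then-measure approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project’s detector: the fault is simulated as excess gain followed by saturation at full scale, the two ‘sound types’ are a tone and Gaussian noise, and `peak_occupancy` is a hypothetical feature (the fraction of samples sitting at the waveform’s peak level) chosen because it obviously depends on clipping.

```python
import math
import random

def apply_gain_fault(signal, gain, full_scale=1.0):
    """Simulated fault: too much gain ahead of a stage that saturates at +/- full_scale."""
    return [max(-full_scale, min(full_scale, gain * x)) for x in signal]

def peak_occupancy(signal, tolerance=0.001):
    """Hypothetical feature: fraction of samples within `tolerance` (relative) of the
    largest absolute sample value. Clipped audio piles up at its flattened peaks."""
    peak = max(abs(x) for x in signal)
    if peak == 0.0:
        return 0.0
    return sum(1 for x in signal if abs(x) >= peak * (1.0 - tolerance)) / len(signal)

def example_sounds(n_samples=4800, sample_rate=48000, seed=0):
    """Two stand-in 'sound types': a steady tone and broadband noise."""
    rng = random.Random(seed)
    tone = [0.5 * math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(n_samples)]
    noise = [rng.gauss(0.0, 0.15) for _ in range(n_samples)]
    return {"tone": tone, "noise": noise}

if __name__ == "__main__":
    # Simulate the fault at several severities on each sound and record the feature.
    for name, sound in example_sounds().items():
        for gain in (1, 2, 4, 8):
            feature = peak_occupancy(apply_gain_fault(sound, gain))
            print(f"{name:5s} gain x{gain}: peak occupancy {feature:.3f}")
```

The feature rises with the severity of the simulated fault on both sounds, which is the kind of relationship that lets a detector estimate how bad an error is from the feature alone.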
We have developed an algorithm that measures the level of wind noise in your recordings. It is the result of research for our project in which we carried out perceptual studies of the effect of microphone wind noise on the sound quality of recordings. We then developed an algorithm that analyses audio files, detects wind noise and predicts the resulting degradation in audio quality.
This program is useful to people who have a lot of audio files to sort through quickly to find recordings without wind noise, or who want to locate the regions of a recording that are free of problems. One possible application is to gather many recordings of an outdoor concert and piece together the best-quality files without having to listen to every recording.
The program has been uploaded to GitHub. It is a command-line program written in C/C++ and needs to be compiled first.
Good news! Today sees the launch of the project’s first ever app – The Good Recorder. Absolutely free and available now via the iTunes store, or click here.
What is The Good Recorder?
The Good Recorder is a sound recording app (currently only for iOS 7 devices) designed to help users achieve high quality audio recordings by monitoring for common recording errors and providing feedback about them. Currently the app incorporates findings and algorithms from our previous work with wind noise. The plan is to further develop the app with auto-detection of handling noise and distortion as our research in these areas progresses.
Audio quality research often involves manipulating a known facet of a recording (such as distortion level, bit rate and so on) and seeing what effect it has on people’s ratings of quality. Unfortunately, however, the simple act of requesting a quality rating can change the way people would normally listen to the recording. Recently we’ve been considering alternative ways of approaching this problem.
If, for instance, we could find another measure that predicted quality reasonably well, we might not need to ask people for ratings directly. And if this implicit measure of quality could be found quickly and freely, in data that already exists, we might have any number of new and exciting avenues to pursue.
One of the major issues raised in our survey is device overload when presented with excessive sound levels. A common example is recording audio at a rock concert, where the device simply cannot cope with the sound pressure levels it is exposed to. To understand how devices respond in this situation, we designed an experiment to capture the kinds of non-linear behaviour that may occur.
We quantified the performance of a series of common devices: the Canon 550D, Edirol R-44, Neumann U87 Ai via a Focusrite 2i4, Shure SM57 via a Focusrite 2i4, Zoom H2, Zoom H4, Google Nexus 4, Apple iPhone and a Sony camcorder (VX2000).
Most devices have some form of dynamic gain control to prevent signal clipping, but the implementations differ considerably. Some devices have many settings for different situations, indicating that no single method suits all cases. The attack and release times of the measured systems range from 5 to 17 ms and from 30 to 400 ms respectively. Some devices instead show a nonlinear gain curve with no attack or release, limiting audible distortion with a compression ratio of between 1.4 and 10. Other systems have no protection at all and exhibit hard clipping when presented with excessive sound levels.
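The gain-control behaviour described above can be sketched as a simple feed-forward compressor: a static gain curve set by a threshold and ratio, with the gain smoothed by separate attack and release time constants. This is a generic textbook design in Python, not the implementation of any measured device. The defaults (10 ms attack, 200 ms release, ratio 4) are simply picked from within the measured ranges, and the -20 dBFS threshold is an assumption.

```python
import math

def static_gain_db(level_db, threshold_db, ratio):
    """Static compression curve: above the threshold the output rises only
    1 dB for every `ratio` dB of input. Returns the gain change in dB (<= 0)."""
    if level_db <= threshold_db:
        return 0.0
    return (threshold_db + (level_db - threshold_db) / ratio) - level_db

def smoothing_coeff(time_ms, sample_rate):
    """One-pole smoothing coefficient for a time constant of time_ms."""
    return math.exp(-1000.0 / (time_ms * sample_rate))

def compress(signal, sample_rate, threshold_db=-20.0, ratio=4.0,
             attack_ms=10.0, release_ms=200.0):
    """Feed-forward compressor with separate attack and release smoothing of the gain."""
    a_att = smoothing_coeff(attack_ms, sample_rate)
    a_rel = smoothing_coeff(release_ms, sample_rate)
    gain_db = 0.0
    out = []
    for x in signal:
        level_db = 20.0 * math.log10(max(abs(x), 1e-9))   # instantaneous level, dBFS
        target = static_gain_db(level_db, threshold_db, ratio)
        # Gain must fall (signal getting louder): attack. Gain may recover: release.
        coeff = a_att if target < gain_db else a_rel
        gain_db = coeff * gain_db + (1.0 - coeff) * target
        out.append(x * 10.0 ** (gain_db / 20.0))
    return out

if __name__ == "__main__":
    fs = 48000
    loud = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(fs // 2)]  # 0 dBFS tone
    squeezed = compress(loud, fs)
    print("input peak:", max(abs(v) for v in loud))
    print("settled output peak:", max(abs(v) for v in squeezed[-480:]))
```

Making both `attack_ms` and `release_ms` very small turns this into the static, instantaneous gain curve some devices showed, while replacing it with `max(-1.0, min(1.0, x))` models the unprotected devices that hard-clip.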
We are interested in how people perceive the quality of user-generated content, and to help us understand this better we are currently running an experiment comparing YouTube clips of Glastonbury. If you would like to take part, please click here; it is quite interesting how much difference the device and position in the audience can make to the sound.
From a sound engineering perspective, providing good-quality sound to the whole audience is also very difficult: you need to be part engineer, part meteorologist, as the weather can have a huge effect on the sound. Read Prof. Cox’s blog for more information.
Our project focuses on improving the quality of recordings made on mobile consumer devices. This article by the BBC News team suggests that many artists object to concerts being recorded on smartphones because of the low quality of the recordings.
I think it would be interesting to see whether artists’ opinions change as the quality of recordings improves. It seems likely to me that the real issue for artists is a loss of control over their art form. I would be interested to hear what other people think.
So we have been working on a number of things recently. We have finished our web experiments looking at the influence of wind noise on the perceived quality of speech. In this experiment, people listened to recordings with added wind noise, rated the quality, attempted to repeat what was said and rated the difficulty of the task. We varied the wind noise samples in terms of level and ‘gustiness’. We are currently analyzing the data to understand how level and gustiness relate to sound quality in this particular case.
Wind noise detector
In addition to these subjective tests, we have developed a ‘wind noise detector’. This algorithm listens to an audio stream and detects the presence of wind noise. The detector compresses the information in the audio stream by extracting ‘audio features’, which are efficient representations of sounds. The amount of data needed to represent an uncompressed digital audio stream is very large, and building a detector that works directly on the raw audio stream is impractical. Features must therefore be extracted that represent the information in the stream much more efficiently. Fortunately, an understanding of how the human auditory system processes sound gives us a way of compressing the information stream, throwing away the perceptually unimportant parts while keeping the salient features. This is how MP3 and other perceptual compression methods achieve their high compression ratios. See the section on audio features below for more information on the feature extraction.
Teaching a machine to detect wind noise
The wind noise was simulated using a number of realistic models, which allowed us to generate a huge range of examples. We adopted a supervised learning scheme: a set of audio features is extracted and a target value (the wind noise level) is associated with each feature vector. A large number of examples are generated and labelled according to wind noise level. A support vector machine is then trained to classify between two groups, one containing features from wind noise above a certain level and the other below it. A support vector machine (SVM) is a binary classifier whose objective is to find a line (or a plane or hyperplane, depending on the number of feature dimensions) that can be drawn in the feature space to separate the two groups, choosing the one with the widest margin between them. Several SVMs are trained, each using a different wind noise level as its threshold. Three thresholds are chosen so that four classes are defined: high, medium, low and undetectable. The outputs of the three SVMs are combined using a decision tree. The results are very promising, with simulated data showing detection rates of 87% and real-world tests also showing good promise.
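A minimal sketch of this scheme in plain Python follows, under loud assumptions: the 2-D feature vectors are synthetic (`fake_features` is hypothetical, not the project’s audio features), the thresholds 0.25/0.5/0.75 on a 0 to 1 wind level are illustrative, and each linear SVM is trained by subgradient descent on the primal objective with the Pegasos step size, with the bias folded into the weight vector (a common simplification that also regularises the bias). The decision tree here asks the middle-threshold SVM first; the project’s actual tree may be arranged differently.

```python
import math
import random

CLASSES = ("undetectable", "low", "medium", "high")

def train_linear_svm(features, labels, lam=0.01, iterations=300):
    """Linear SVM: full-batch subgradient descent on lam/2*|w|^2 + mean hinge loss,
    step size 1/(lam*t) as in Pegasos. A constant 1.0 is appended to each vector
    so the last weight acts as a bias."""
    data = [list(x) + [1.0] for x in features]
    n, dim = len(data), len(data[0])
    w = [0.0] * dim
    radius = 1.0 / math.sqrt(lam)
    for t in range(1, iterations + 1):
        step = 1.0 / (lam * t)
        grad = [0.0] * dim
        for x, y in zip(data, labels):
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1.0:   # margin violated
                for j in range(dim):
                    grad[j] += y * x[j] / n
        w = [(1.0 - step * lam) * wi + step * gi for wi, gi in zip(w, grad)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > radius:                                      # Pegasos projection step
            w = [wi * radius / norm for wi in w]
    return w

def above_threshold(w, x):
    """Which side of the separating line (hyperplane) the feature vector falls on."""
    return sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) > 0.0

def classify(models, x):
    """Decision tree over the three threshold SVMs: ask the middle one first."""
    low, mid, high = models
    if above_threshold(mid, x):
        return "high" if above_threshold(high, x) else "medium"
    return "low" if above_threshold(low, x) else "undetectable"

def fake_features(level, rng):
    """Hypothetical 2-D feature vector that grows with wind noise level (0..1)."""
    return [2.0 * (level - 0.5) + rng.gauss(0.0, 0.06),
            2.0 * (level - 0.5) + rng.gauss(0.0, 0.06)]

def true_class(level, thresholds):
    return CLASSES[sum(level > t for t in thresholds)]

def train_detector(thresholds=(0.25, 0.5, 0.75), n=300, seed=1):
    """One SVM per threshold, each trained on 'above' (+1) vs 'below' (-1) labels."""
    rng = random.Random(seed)
    levels = [rng.random() for _ in range(n)]
    feats = [fake_features(lv, rng) for lv in levels]
    return [train_linear_svm(feats, [1 if lv > thr else -1 for lv in levels])
            for thr in thresholds]

if __name__ == "__main__":
    models = train_detector()
    rng = random.Random(7)
    for level in (0.1, 0.4, 0.6, 0.9):
        print(level, classify(models, fake_features(level, rng)))
```

With real audio features the same structure applies; the feature extraction would change, and a library SVM implementation would normally replace the hand-written trainer.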
Audio Features – Mel-Frequency Cepstrum coefficients
The audio feature representation called Mel-Frequency Cepstrum Coefficients (MFCCs) is commonly used in speech recognition to compress the information stream before the recognition stage. MFCCs are a spectral representation of a signal over a short time period (e.g. 20 ms). A spectral representation means that, rather than representing the signal in the time domain (i.e. how the pressure fluctuates over time), the representation shows the levels of the different frequency components within the analysis period (this period is often referred to as a window). The ‘Mel’ part refers to the frequencies at which the spectrum is evaluated. A Fourier transform has linearly spaced frequency components, but this is not how the human auditory system works. Human hearing resolves frequency on a roughly logarithmic scale: a given change in frequency is much more noticeable for a low-pitched sound than the same change at a higher pitch. The Mel scale attempts to represent how the human auditory system perceives pitch.
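One widely used formula for the mel scale is m = 2595 log10(1 + f/700), which is roughly linear below about 1 kHz and logarithmic above. The short Python sketch below uses that formula (other variants exist) to show that a fixed 100 Hz step covers many more mels at low frequencies than at high ones, and to place band centres equally spaced in mels, as an MFCC filterbank does.

```python
import math

def hz_to_mel(f_hz):
    """A common mel formula: roughly linear below ~1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_step(f_hz, delta_hz=100.0):
    """How many mels a fixed step of delta_hz spans when starting from f_hz."""
    return hz_to_mel(f_hz + delta_hz) - hz_to_mel(f_hz)

def mel_band_centres(n_bands, f_min, f_max):
    """Band centres equally spaced on the mel scale, returned in Hz."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * (i + 1) / (n_bands + 1)) for i in range(n_bands)]

if __name__ == "__main__":
    print("100 Hz step at 200 Hz :", round(mel_step(200.0), 1), "mel")
    print("100 Hz step at 8000 Hz:", round(mel_step(8000.0), 1), "mel")
    print([round(f) for f in mel_band_centres(10, 0.0, 8000.0)])
```

The printed centres crowd together at low frequencies and spread out at high ones, mirroring the ear’s finer resolution for low pitches.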
Cepstrum – The cepstrum is a representation of a signal obtained by computing the inverse Fourier transform of the log (magnitude) spectrum. A property of the logarithm is that processes that were previously multiplicative become additive, which makes it easier to separate the component parts of a signal. For example, a speech spectrum can be thought of as the product of the spectrum of the speech source and that of the vocal tract. The vocal tract produces resonances, or ‘formants’; by computing the cepstrum, the formant and speech source components can be separated, with low ‘quefrency’ components representing the spectral envelope of the formants and higher components representing the speech source.
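The source/filter separation can be demonstrated on a synthetic signal. The sketch below, in plain Python with a naive DFT, builds a ‘voiced’ frame from a pulse train (pitch period of 70 samples) driving a single damped resonance that stands in for a formant; all of these parameters are illustrative assumptions. The real cepstrum then shows a peak at a quefrency equal to the pitch period, and keeping only the low-quefrency coefficients (‘liftering’) recovers a smooth spectral envelope peaking near the resonance.

```python
import cmath
import math

def dft(x, inverse=False):
    """Plain O(N^2) discrete Fourier transform (no external libraries)."""
    n = len(x)
    sign = 1 if inverse else -1
    twiddle = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = [sum(x[m] * twiddle[(k * m) % n] for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum (real-valued for a real frame)."""
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]

def lifter_envelope(cepstrum, n_keep):
    """Zero all but the lowest quefrencies (kept symmetrically) and transform back,
    giving a smoothed log spectrum: the spectral envelope."""
    n = len(cepstrum)
    kept = [c if (q < n_keep or q > n - n_keep) else 0.0 for q, c in enumerate(cepstrum)]
    return [v.real for v in dft(kept)]

def synthetic_voiced_frame(n=512, period=70, pole_radius=0.9, centre=0.2 * math.pi):
    """Pulse train ('source', one pulse every `period` samples) through one damped
    resonance ('formant' near `centre` rad/sample), then Hann-windowed."""
    h = [pole_radius ** k * math.cos(centre * k) for k in range(n)]
    voiced = [sum(h[i - p] for p in range(0, i + 1, period)) for i in range(n)]
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    return [v * w for v, w in zip(voiced, hann)]

if __name__ == "__main__":
    cep = real_cepstrum(synthetic_voiced_frame())
    print("pitch peak at quefrency", max(range(20, 256), key=lambda q: cep[q]))
    env = lifter_envelope(cep, 30)
    print("envelope peaks at bin", max(range(256), key=lambda k: env[k]))
```

The resonance at 0.2π rad/sample corresponds to DFT bin 51.2 of 512, so the envelope peak should land near there, while the pitch shows up separately at high quefrency.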
The Mel-frequency cepstrum is therefore a representation of the spectral envelope of a signal in which the frequency scale is warped to reflect the human auditory system. Typically this reduces the data in a 20 ms window sampled at 44.1 kHz from 882 samples (0.020 s × 44 100 Hz) to 12 MFCCs. This is a very efficient representation that preserves much of the salient information.
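Putting the pieces together, here is a Python sketch of computing 12 MFCCs from one 20 ms frame (0.020 s × 44 100 Hz = 882 samples). The Hamming window, 26 triangular mel filters spanning 0 Hz to Nyquist, natural-log filter energies and an unnormalised DCT-II with the 0th coefficient dropped are common conventions assumed here; they are not necessarily what the project’s detector uses, and speech toolkits differ in these details.

```python
import cmath
import math
import random

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """|DFT|^2 for bins 0..N/2, via a plain O(N^2) DFT (no external libraries)."""
    n = len(frame)
    twiddle = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    return [abs(sum(x * twiddle[(k * m) % n] for m, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale, 0 Hz to Nyquist."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for lo, mid, hi in zip(edges, edges[1:], edges[2:]):
        bank.append([(f - lo) / (mid - lo) if lo < f <= mid
                     else (hi - f) / (hi - mid) if mid < f < hi
                     else 0.0 for f in bin_hz])
    return bank

def mfcc(frame, sample_rate, n_filters=26, n_coeffs=12):
    n = len(frame)
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    spectrum = power_spectrum([x * w for x, w in zip(frame, hamming)])
    # Log energy in each mel band (floored so silence does not give log(0)).
    log_e = [math.log(max(sum(w * p for w, p in zip(band, spectrum)), 1e-10))
             for band in mel_filterbank(n_filters, n, sample_rate)]
    # DCT-II of the log energies; coefficient 0 (overall level) is dropped.
    return [sum(e * math.cos(math.pi * k * (j + 0.5) / n_filters) for j, e in enumerate(log_e))
            for k in range(1, n_coeffs + 1)]

if __name__ == "__main__":
    fs = 44100
    frame_len = int(round(0.020 * fs))          # 882 samples
    rng = random.Random(0)
    tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(frame_len)]
    noise = [rng.gauss(0.0, 0.3) for _ in range(frame_len)]
    for name, frame in (("tone", tone), ("noise", noise)):
        print(name, [round(c, 2) for c in mfcc(frame, fs)])
```

Each 882-sample frame comes out as 12 numbers, a reduction of roughly 70 times before any further modelling.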
In an earlier blog post we presented some findings from our web survey on the differences between iPhones and other brands of mobile phone. In this post we look beyond mobiles and give a brief overview of some of the other findings from the survey.