Five Audio Processing Tasks that are a Lot Harder than you Think

I regularly get emails along these lines…

“Hi, I am a student and I need to make program to take MP3 speech recording and output what the words are. I tried Microsoft speech recognition but it kept getting wrong. I need 100% accuracy. And if there is more than one person speaking it needs to say who is speaking. It has to use NAudio, but I am new to audio programming, so please send me the codes to do this. Please hurry because I need it by Friday.”

It seems to be a common problem that people new to audio processing greatly underestimate how difficult some tasks are. So here I present the top five audio processing problems I get asked about, all of which you might find frustratingly difficult to solve…

1. Speech Recognition

Speech recognition is slowly becoming more and more mainstream. Apple has Siri, Windows comes with built-in speech recognition, and Google has its own speech recognition technology. The trouble is, despite their huge R&D budgets, none of these three leading software companies has actually produced a speech recognition engine that doesn’t get it hilariously wrong on a regular basis.

Speech recognition is such a difficult problem that most existing products need to spend a certain amount of time learning the way you speak (regional accents can be a huge problem), and often they are simply looking for matches in a limited set of keywords (e.g. “open”, “play”, “search”) which can increase their reliability.

One of the biggest issues is that in human speech two completely different sentences can sound exactly the same. For example, how does a computer know if you said “I would like an ice-cream” or “Eye wood lie can I scream”? Humans know because the second sentence is complete nonsense. For a computer to know that, it also needs to learn about the rules of grammar, and understand the wider context of the audio it is transcribing.

In short, if anyone asks me if they can implement a speech recognition algorithm from scratch using NAudio, I tell them to give up, unless they’ve got a lot of time, and have a large team of signal processing experts they can call on.

2. Speaker Recognition

This problem is related to speech recognition: you attempt to determine who is speaking in a recording of a conversation. Again, this is fraught with difficulties. First, how does the computer know how many speakers there are? Ideally, it should be given a voice sample of each speaker individually, to build up some kind of profile. You might have some success using an FFT to get the pitch information, but this would likely only be successful if the speakers had very different voices (e.g. a male and a female voice, or an adult and a child).
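To make the pitch idea concrete, here is a minimal Python sketch (standard library only) that picks the strongest frequency bin in a typical voice range. The function names, the 60 to 400 Hz search range and the synthetic test signal are my own assumptions for illustration. It evaluates the DFT bins directly rather than using a real FFT, and on a real voice the strongest bin is often a harmonic rather than the true fundamental, so treat it as a starting point only.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate, low_hz=60.0, high_hz=400.0):
    """Return the frequency (Hz) of the strongest DFT bin between low_hz and high_hz."""
    n = len(samples)
    bin_width = sample_rate / n
    first_bin = max(1, math.ceil(low_hz / bin_width))
    last_bin = min(n // 2, math.floor(high_hz / bin_width))
    best_bin, best_magnitude = first_bin, 0.0
    for k in range(first_bin, last_bin + 1):
        # Direct evaluation of a single DFT bin (an FFT computes all bins at once).
        total = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        if abs(total) > best_magnitude:
            best_bin, best_magnitude = k, abs(total)
    return best_bin * bin_width


def synthetic_voice(fundamental_hz, sample_rate=8000, n=800):
    """Hypothetical test signal: a fundamental plus two weaker harmonics."""
    return [math.sin(2 * math.pi * fundamental_hz * i / sample_rate)
            + 0.5 * math.sin(2 * math.pi * 2 * fundamental_hz * i / sample_rate)
            + 0.25 * math.sin(2 * math.pi * 3 * fundamental_hz * i / sample_rate)
            for i in range(n)]


low_voice = dominant_frequency(synthetic_voice(120.0), 8000)   # 120.0
high_voice = dominant_frequency(synthetic_voice(220.0), 8000)  # 220.0
```

Two clean synthetic tones 100 Hz apart are easy to tell apart this way. Two real speakers with similar pitch ranges are not, which is exactly the limitation described above.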

Doubtless there are some state-of-the-art algorithms being developed somewhere to do this, but I know of none in the public domain, and anyone who solves this problem is likely to keep their technique a closely guarded secret.

3. Transcribing Music

Is it possible to take a piece of music and turn it into sheet music, indicating what notes were played when? Well it depends very much on what exactly is being played. If you have a recording of a single monophonic instrument being played, then pitch detection may well give you a decent transcription.
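For the monophonic case, one common approach to pitch detection is autocorrelation: find the delay at which the signal best matches a shifted copy of itself. The Python sketch below is a minimal illustration under my own assumptions (the function names, the 80 to 1000 Hz search range and the pure sine test tone are all hypothetical). A real transcriber would also need to split the recording into short frames and decide where each note starts and stops.

```python
import math


def detect_pitch(samples, sample_rate, min_hz=80.0, max_hz=1000.0):
    """Estimate the fundamental (Hz) as the lag with the highest autocorrelation."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Multiply the signal by a copy of itself delayed by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def frequency_to_midi(frequency_hz):
    """Nearest MIDI note number in equal temperament (69 is the A above middle C, 440 Hz)."""
    return round(69 + 12 * math.log2(frequency_hz / 440.0))


def sine_tone(frequency_hz, sample_rate=8000, n=1024):
    """Hypothetical test input: a pure sine wave."""
    return [math.sin(2 * math.pi * frequency_hz * i / sample_rate)
            for i in range(n)]


pitch = detect_pitch(sine_tone(250.0), 8000)  # 250.0
note = frequency_to_midi(pitch)               # 59, the B below middle C
```

The resolution is limited to whole-sample lags, so at realistic sample rates the estimate lands near, not exactly on, the true pitch. Rounding to the nearest note hides that error for most of the range.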

But if you have recorded a polyphonic instrument, such as a piano or a guitar (or worse yet, a whole band), then things get a whole lot more difficult. It becomes a lot less clear when a note starts and stops, and which note exactly is being played. One of the big issues is harmonics. When you play a middle C on a piano, you don’t just hear a single frequency (roughly 261.6 Hz), but rather a whole host of other frequencies as well, including harmonics at whole-number multiples of that fundamental. It’s what gives each instrument a rich and distinctive sound. This means there is inevitably some amount of guesswork involved in determining which note(s) exactly were being played in order to produce the complex set of frequencies that have been detected.
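A quick calculation shows why harmonics cause so much ambiguity. The Python sketch below (function names are my own, and it assumes ideal harmonics at exact whole-number multiples, which a real piano string only approximates) lists the nearest equal-tempered note for the first six harmonics of middle C.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_note(frequency_hz):
    """Name of the nearest equal-tempered note, e.g. 'A4' for 440 Hz."""
    midi = round(69 + 12 * math.log2(frequency_hz / 440.0))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)


def harmonic_notes(fundamental_hz, count=6):
    """Nearest note names for the first `count` ideal harmonics of a fundamental."""
    return [nearest_note(fundamental_hz * k) for k in range(1, count + 1)]


print(harmonic_notes(261.63))
# ['C4', 'C5', 'G5', 'C6', 'E6', 'G6']
```

In other words, a single middle C already contains energy at C, G and E, the notes of a C major chord. An algorithm looking only at the detected frequencies has to guess whether it heard one note or several.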

4. BPM Detection

This request I think comes from DJs who want to group tracks together by their BPM. In theory this should be easy – detect each beat, count how many beats there are in a given time interval, and then calculate the BPM.

The trouble is that there is no rule in music that the kick drum must play on every beat, or the snare must only hit on beats 2 and 4. Some music has no percussion, and if there are drums, they can be very busy or very sparse. So even if you did create an algorithm that detected the “transient” for each kick or snare hit (or strum on the guitar, or strike of a bongo), you would need to come up with a strategy for ignoring the ones that weren’t on the beat. For example, if the music is in 12/8, you could count every subdivision as a beat and end up detecting a BPM that is far too high.

Depending on the type of music you are analysing, you may actually get reasonable success with a primitive BPM detection algorithm. For example, if it is all four-to-the-floor dance music then you might be able to get consistently good results. It would probably be best to measure the BPM in several places in the song, and select the most common one.
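The "measure in several places and pick the most common" idea can be sketched in a few lines of Python. This assumes the hard part is already done: `onset_times` is a hypothetical list of detected beat times in seconds, and the function name and window size are my own choices for illustration.

```python
from collections import Counter


def estimate_bpm(onset_times, window=8):
    """Estimate BPM over successive windows of onsets and return the most common value."""
    estimates = []
    for start in range(0, len(onset_times) - window, window):
        chunk = onset_times[start:start + window + 1]
        elapsed = chunk[-1] - chunk[0]  # seconds spanned by `window` intervals
        if elapsed > 0:
            estimates.append(round(60.0 * window / elapsed))
    if not estimates:
        raise ValueError("need at least window + 1 onsets")
    return Counter(estimates).most_common(1)[0][0]


# Hypothetical track: steady 120 BPM with a short double-time drum fill in the middle.
intervals = [0.5] * 24 + [0.25] * 8 + [0.5] * 16
onsets = [0.0]
for gap in intervals:
    onsets.append(onsets[-1] + gap)

print(estimate_bpm(onsets))  # 120
```

The double-time fill produces one window that reads 240 BPM, but it is outvoted by the other five. This only helps when the misleading sections are a minority of the track.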

5. Song Matching

This problem is where you have a snippet of audio, maybe someone humming a tune, or a recording made with your phone, and want to match it to a database of songs. What song is my snippet from?

This turns out to be extremely difficult. You could try to solve it by matching the melody, looking for similar patterns of notes, but even that is fraught with difficulty. How do you extract just the melody for each song and store it in such a way that it can easily be matched against?

One additional complicating factor is that the same song can be performed in different keys and at different speeds. You can easily recognise a song you know even if it is sung by someone with a completely different sounding voice, or played on a completely different instrument to the original performance. But a computer will find that a lot harder to do.
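One way to cope with different keys, at least in principle, is to compare the intervals between successive notes rather than the notes themselves, and to ignore note durations so that speed does not matter. The Python sketch below is a toy illustration with my own function names, and it assumes the melody has already been extracted as a clean list of MIDI note numbers, which is the genuinely hard part. It also demands an exact match, so a slightly out-of-tune hum would defeat it.

```python
def intervals(midi_notes):
    """Semitone steps between successive notes; unchanged by transposing the melody."""
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]


def find_melody(snippet, song):
    """Return the index in `song` where the snippet's interval pattern first occurs, or -1."""
    pattern = intervals(snippet)
    song_intervals = intervals(song)
    for start in range(len(song_intervals) - len(pattern) + 1):
        if song_intervals[start:start + len(pattern)] == pattern:
            return start
    return -1


# Opening of "Twinkle Twinkle Little Star" in C major, as MIDI note numbers.
song = [60, 60, 67, 67, 69, 69, 67, 65, 65, 64, 64, 62, 62, 60]
# The same fragment ("little star" onwards) sung two semitones higher.
snippet = [69, 69, 71, 71, 69]

print(find_melody(snippet, song))       # 2
print(find_melody([60, 61, 62], song))  # -1
```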

This is a problem that some big music companies are attempting to tackle, as it could be used to help them identify illegal distribution of songs. But it’s very unlikely that they will reveal the secrets of any algorithms they do come up with. So if you do want to tackle this problem, don’t expect to find much by way of helpful information.

Conclusion

I don’t write this to suggest that these five tasks are impossible. With enough effort and ingenuity I am sure great solutions to each one can be found. What I am trying to say is that there are many ways in which the human brain is still vastly superior to state-of-the-art software, particularly when it comes to the types of recognition tasks discussed here. So if you have dreams of creating the next killer audio application using NAudio, by all means try, but make sure you set realistic expectations of what can be achieved. And if you do want to tackle one of the five problems listed above, prepare to spend lots of time learning advanced DSP techniques, and learning to live with much less than 100% accuracy.