Ten years ago I wrote an article entitled "Five Audio Processing Tasks that are a Lot Harder than you Think". It was born out of my frustration that many users of my NAudio library had unrealistic expectations about what was achievable.

But things have moved on a lot in a decade, and in particular the recent stunning advances in AI and machine learning have essentially solved many of these problems, and unlocked new possibilities I couldn't have imagined when I wrote that original post.

Let's briefly revisit the five problems I originally referenced, and then look at some more recent exciting developments...

1. Speech Recognition

Real-time speech recognition has made tremendous strides forward, to the point where I'd say it's more or less as good as a human. We see this technology in action daily, giving us real-time transcription of Teams calls or YouTube videos, and it's very easy to access using services such as Azure Speech Services, which includes access to the state-of-the-art Whisper model.

The main way in which I see speech recognition fail is when it lacks the context to know that a particular conversation or talk will be using certain specialized vocabulary. I can imagine that this will improve as we give AIs the ability to access wider context than simply the audio they are being asked to transcribe.
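Azure actually provides a partial answer here already in the form of phrase lists, which let you hint at specialised vocabulary up front. Here's a minimal sketch using the Speech SDK for C# (the key, region and phrases are placeholders, so treat it as illustrative rather than production code):

```csharp
using Microsoft.CognitiveServices.Speech;

// Requires the Microsoft.CognitiveServices.Speech NuGet package.
// The key and region are placeholders.
var config = SpeechConfig.FromSubscription("your-key", "your-region");
using var recognizer = new SpeechRecognizer(config);

// Hint at domain-specific vocabulary that might otherwise be misheard
var phraseList = PhraseListGrammar.FromRecognizer(recognizer);
phraseList.AddPhrase("NAudio");
phraseList.AddPhrase("diarization");

// Recognize a single utterance from the default microphone
var result = await recognizer.RecognizeOnceAsync();
if (result.Reason == ResultReason.RecognizedSpeech)
{
    Console.WriteLine(result.Text);
}
```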

2. Speaker Recognition

The speaker-recognition problem is also pretty well solved by recent AI transcription models. In Azure this is called "speaker diarization", and it can even operate in real time.
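To give a flavour of how little code this needs, here's roughly what real-time diarization looks like with the Speech SDK's ConversationTranscriber. Again this is just a sketch: the key and region are placeholders, and I'm assuming the default microphone as input.

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

var config = SpeechConfig.FromSubscription("your-key", "your-region");
using var audioInput = AudioConfig.FromDefaultMicrophoneInput();
using var transcriber = new ConversationTranscriber(config, audioInput);

// Each transcribed phrase comes back attributed to a speaker (e.g. "Guest-1")
transcriber.Transcribed += (s, e) =>
    Console.WriteLine($"{e.Result.SpeakerId}: {e.Result.Text}");

await transcriber.StartTranscribingAsync();
Console.ReadLine(); // keep transcribing until Enter is pressed
await transcriber.StopTranscribingAsync();
```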

3. Transcribing Music

This is actually one of the few tasks where I don't think huge progress has been made (although there are some products claiming to do this, so I could be very wrong here). I suspect it's to do with the difficulty of finding sufficient volumes of high-quality training material (although that would arguably not be very hard to generate, given the ease with which you can take MIDI data and render it as an instrumental performance using virtual instruments).
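To illustrate just how easy that training data would be to generate: with NAudio it only takes a few lines to pull ground-truth note data out of a MIDI file, which you could then pair with audio rendered from the same file by a virtual instrument. A quick sketch (the filename is a placeholder):

```csharp
using NAudio.Midi;

// Read a MIDI file (false = no strict checking) and dump every note.
// Pair this note list with audio rendered from the same MIDI by a virtual
// instrument and you have one (audio, transcription) training example.
var midi = new MidiFile("performance.mid", false);
for (int track = 0; track < midi.Tracks; track++)
{
    foreach (var midiEvent in midi.Events[track])
    {
        // A NoteOn with zero velocity is really a note-off, so skip those
        if (midiEvent is NoteOnEvent noteOn && noteOn.Velocity > 0)
        {
            Console.WriteLine(
                $"tick {noteOn.AbsoluteTime}: {noteOn.NoteName} (velocity {noteOn.Velocity})");
        }
    }
}
```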

As a musician, I'm really looking forward to simply being able to ask an LLM to listen to a piece of music and produce sheet music for the piano part or guitar tab for a solo. Of course, there is benefit in doing it manually yourself, but it can be a very time-consuming process.

4. BPM Detection

This is definitely a solved problem, with plenty of BPM detection algorithms that work well and are used by many DAWs and audio tools. It's often used when managing sample libraries, and machine learning is taking that to the next level by automatically categorising and describing samples, allowing you to easily search for just the right sound, or find similar sounds to one you already have.
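For the curious, the core idea behind basic BPM detection is surprisingly simple: measure how the energy of the signal rises over time, then look for periodicity in those rises. Here's a deliberately naive sketch using NAudio to load the audio. Real detectors are far more sophisticated, and I'm assuming a short mono file with a steady beat (the filename is a placeholder).

```csharp
using NAudio.Wave;

// Naive BPM estimator: frame-energy onset envelope + autocorrelation
using var reader = new AudioFileReader("drum-loop.wav");
var samples = new float[reader.Length / sizeof(float)];
int read = reader.Read(samples, 0, samples.Length);

const int frameSize = 1024;
int frames = read / frameSize;
var onset = new double[frames];
double prevEnergy = 0;
for (int i = 0; i < frames; i++)
{
    double energy = 0;
    for (int j = 0; j < frameSize; j++)
    {
        double s = samples[i * frameSize + j];
        energy += s * s;
    }
    onset[i] = Math.Max(0, energy - prevEnergy); // rises in energy suggest beats
    prevEnergy = energy;
}

// Find the lag (in frames) with the strongest self-similarity, within 60-200 BPM
double framesPerSecond = (double)reader.WaveFormat.SampleRate / frameSize;
int minLag = (int)(framesPerSecond * 60 / 200);
int maxLag = (int)(framesPerSecond * 60 / 60);
int bestLag = minLag;
double bestScore = 0;
for (int lag = minLag; lag <= maxLag; lag++)
{
    double score = 0;
    for (int i = 0; i + lag < frames; i++) score += onset[i] * onset[i + lag];
    if (score > bestScore) { bestScore = score; bestLag = lag; }
}
Console.WriteLine($"Estimated tempo: {60.0 * framesPerSecond / bestLag:F1} BPM");
```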

5. Song Recognition

YouTube's Content ID system probably represents the state of the art here. It can detect not only whether you've used someone else's song in your video, but even whether you've made your own completely new cover of that song. Of course, it's not without its controversy: when this system messes up (which it does in strange ways from time to time), it can result in legitimate artists being denied the revenue from their own compositions. There is clearly some progress still to be made on this front.

There are also various "hum to search" apps that can take your attempt to replicate a small fragment of a song's melody, which is likely not in the same key or tempo as the original, and still identify the song.
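The trick that makes this possible is matching melodies in a representation that throws away absolute pitch and absolute timing. Here's a toy sketch of that idea: compare the sequences of intervals between successive notes (which takes care of the key) using dynamic time warping (which takes care of the tempo). The note numbers are made-up MIDI pitches rather than anything extracted from real audio.

```csharp
// Intervals between successive notes are transposition-invariant
static int[] ToIntervals(int[] notes)
{
    var intervals = new int[notes.Length - 1];
    for (int i = 1; i < notes.Length; i++) intervals[i - 1] = notes[i] - notes[i - 1];
    return intervals;
}

// Classic dynamic time warping: tolerant of a query sung too fast or too slow
static double DtwDistance(int[] a, int[] b)
{
    var d = new double[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++)
        for (int j = 0; j <= b.Length; j++)
            d[i, j] = double.PositiveInfinity;
    d[0, 0] = 0;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            double cost = Math.Abs(a[i - 1] - b[j - 1]);
            d[i, j] = cost + Math.Min(d[i - 1, j - 1], Math.Min(d[i - 1, j], d[i, j - 1]));
        }
    return d[a.Length, b.Length];
}

int[] original = { 60, 62, 64, 65, 67 }; // C D E F G
int[] hummed = { 57, 59, 61, 62, 64 };   // same tune, three semitones lower
Console.WriteLine(DtwDistance(ToIntervals(original), ToIntervals(hummed))); // prints 0
```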

Wait - there's more!

So not only have these five "hard" problems of a decade ago become more or less solved, but in the general area of audio processing and AI/machine learning there have been several more astonishing developments. Here are a few of my favourites...

1. Polyphonic pitch correction

I genuinely thought this was impossible a few years back, but Celemony Melodyne can do it, allowing you to pitch-correct or even change a single note in a chord.

2. Stem splitting

Stem splitting is where you take a piece of music and separate it out into the various instruments that made it up - for example, vocals, drums, bass, guitars, etc. The models that can do this are not perfect, but they are still highly impressive, and often available for free. It's a great tool for musicians wanting to learn a part in more detail, or to make backing tracks to practise along to.

3. Advanced audio cleanup

Having produced over 20 Pluralsight courses over the years, I've spent astonishing amounts of time painstakingly editing audio: getting rid of stray background noises such as birds singing in the back garden, and all the weird clicking sounds my mouth tends to make when I speak into a microphone. You might also want to get rid of excessive room reverb, or constant background noise from air conditioning units. I wish that tools like Steinberg SpectraLayers or iZotope RX had been available when I started making course content, as they have hugely impressive capabilities to clean up recordings.

4. Speech generation

You could even take it further. Instead of me reading my scripts for a course and then editing the audio (which typically takes many days), what if an AI model could learn my voice and say it for me?

It would mean that there would be no background noises to edit out, I wouldn't rush, and I'd easily be able to regenerate the narration if I wanted to update or fix it. That's essentially what the Azure personal voice service claims to do, and I'm hoping to try it out at some point.
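For context, ordinary neural text-to-speech is already only a few lines of code with the Speech SDK, and as I understand it, personal voice works along the same lines but substitutes a voice enrolled from recordings of your own speech. A sketch with a stock voice (the key, region and voice name are placeholders):

```csharp
using Microsoft.CognitiveServices.Speech;

var config = SpeechConfig.FromSubscription("your-key", "your-region");
config.SpeechSynthesisVoiceName = "en-GB-RyanNeural"; // a stock neural voice

// With no AudioConfig supplied, output goes to the default speaker
using var synthesizer = new SpeechSynthesizer(config);
var result = await synthesizer.SpeakTextAsync(
    "In this module, we'll look at how resampling works in NAudio.");
if (result.Reason == ResultReason.SynthesizingAudioCompleted)
{
    Console.WriteLine("Narration rendered successfully.");
}
```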

I suspect that there will be some "tells", with certain words pronounced not quite how I would, and inflections and tone of voice not quite right. So I don't feel like I'd be comfortable actually making a course using this, but at some point if the quality is good enough, it might start to make a lot of sense.

Speech generation is actually just one example of how machine learning could be used for good. Knowing someone who has suffered from the very unpleasant condition of vocal cord paresis, I can see that a "personal voice" has the potential to make a dramatic improvement to their quality of life.

5. AI Singing

It's not just spoken word. You can even create AI singers to perform your music, or take what you've sung and have it sung by a completely different voice (and in tune). There's a whole host of services offering this capability (e.g. kits.ai). I've not yet tried this out, but it feels like it could be a lot of fun.

6. Guitar Tone

Guitarists are always after the best possible tone, and that often means accurately replicating the sounds of classic guitar amplifiers. IK Multimedia made some waves last year by releasing a remarkably effective machine-learning-based amp tone capture with their TONEX, which I've absolutely loved using.

Amazingly, you don't even need to pay to get access to this technology. The fully open source Neural Amp Modeler does essentially the same thing for free.

7. Audio mastering (and all kinds of effects)

In fact, we're seeing something of a gold rush to apply machine learning to just about every type of audio processing.

One of the most obvious examples is that of "mastering", where a mix of a song is prepared for release. This is usually performed by an expensive mastering engineer who specializes in knowing just how to tweak the audio frequency spectrum and dynamics to ensure the song sounds as good as possible in a wide range of listening environments.

But now there are plenty of online AI mastering services such as LANDR, or mastering processors you can run locally powered by machine learning such as Master Plan. This is particularly good news for hobbyists who can't really justify paying for professional mastering of their songs.

One of my most frequently-used effects plugins is Baby Audio TAIP, an AI-powered tape saturator. And we are increasingly seeing audio developers turning to machine learning to power newer plugins - I expect many more of these to be released in 2025 and beyond.

8. LLM all the things!

Of course, large language models have their place in the world of audio too. I've already made use of them for all kinds of suggestions and advice on how best to use various audio effects, or for producing variations on chord progression ideas.

I would not be surprised to see LLM interfaces baked into DAWs and effects processors so we can use natural language to describe what we want to achieve and it will not only give us a helpful response, but actually automate the implementation of the suggestions.

I was also astonished to discover that LLMs were surprisingly good at writing code in the relatively obscure EEL2 scripting language that can be used to write JSFX plugins for the REAPER DAW.

9. Generative AI and Music

Let's end with perhaps the most impressive but also most controversial (and maybe even depressing) application of AI to music: the fact that, just as we've already seen with images, videos and code, AI is more than capable of generating entire songs. Suno AI can produce music on demand in almost any genre, with lyrics on whatever topic you choose. You can even supply your own lyrics.

Not only do I expect a flood of people simply releasing entirely AI-generated music, but I can also see it being used extensively as a cheat code for rapidly generating compositional ideas that you then add your own take on by recording your own version.

At the moment, these AI music generators simply output an entirely finished and mixed song, which makes the task of tweaking it to taste quite difficult. But I would not be at all surprised to see them expanded in the near future to support generating separate stems for all the instruments, allowing you to more easily customise and refine smaller details.

Whilst the quality is by no means perfect, I can safely say that it's already a lot better than the music I make in my spare time. Generative AI sadly does seem like it will make it harder than ever for artists to generate meaningful income from producing their own music, but I don't see why it should stop people recording and sharing their own compositions just for the fun of it.

Conclusion

It's astonishing to me not only how far things have come since I wrote that original post, but how fast they are moving. I think it's safe to say that we can expect a whole new generation of machine-learning-based audio and music tools in the coming years.

I'd love to hear in the comments your thoughts on this - particularly if like me, you have an interest in combining tech with music production. What are the most impactful breakthroughs you've taken advantage of so far, and what do you want to see next?

BTW - if you are wondering "did ChatGPT write this blog post for you?", the answer is no! I did briefly try to see if I could use ChatGPT's Canvas feature to collaboratively improve the article, but it kept messing up the markdown formatting. And it certainly didn't know how to authentically match my "tone of voice". I feel like it's inevitable that over the coming years LLMs will gradually inject themselves into all my creative endeavours, including making music and writing blog posts. But for now at least, I'm still doing a lot of things the slow, old-fashioned way!

Want to get up to speed with the fundamental principles of digital audio and how to go about writing audio applications with NAudio? Be sure to check out my Pluralsight courses, Digital Audio Fundamentals and Audio Programming with NAudio.