Audio DSPs and the Rise of Speech AI

By Bob O’Donnell
This post is sponsored by Cadence.

One of the biggest impacts of the rush of interest in generative AI (GenAI) has been a growing awareness of the critical need to make key technology-based tools easier to access. Allowing simple language input to drive the discovery and analysis of data, or the operation of applications and devices, is arguably the most important part of the revolution that foundation models and applications like ChatGPT are enabling.

While the initial efforts have focused on text-based input, the next obvious step is the move to speech, a process that’s already begun. Tools like OpenAI’s Whisper and Google’s Chirp, for example, are going to allow us to simply talk to our applications and devices to request the information we need or have them perform the actions we’d like.

Of course, some will rightly point out that we’ve had voice-based input to our devices in the form of digital assistants for many years now. From Apple’s Siri to Amazon’s Alexa and Google’s Assistant, the notion of using speech to interact with information and applications is something many people have grown accustomed to.

But just as large language models (LLMs) and other GenAI tools have completely transformed how “traditional” AI-based analytics and other applications operate, the next generation of GenAI-powered voice assistants is bound to completely reset our expectations of what voice-based interactions can do.

To put it bluntly, they should finally start to work in the reliable and consistent manner we always hoped they would (but rarely experienced). Having truly capable speech-based interactions is poised to dramatically change how we think about and work with computing devices, applications and data.

In order to make this dream a reality, several key technology developments need to come together. First, the voice-based interaction tools need to be modernized and trained with the latest generation of GenAI foundation models. Equally important, but much less understood, is that the semiconductor chips optimized for the unique requirements of audio-based interfaces need to evolve as well.

Early on in the development of the silicon and the intellectual property (IP) that powers these voice assistants, audio-focused semiconductors faced the challenging task of being always on, always listening, and always prepared to respond when the appropriate trigger word (or sound) came in. They also had to be able to distinguish words, understand their meaning and context, and deal with different accents, different languages, and the background noise that often clouds our acoustic environments.
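To make that always-on structure concrete, the sketch below shows the generic two-stage pattern such pipelines typically follow: a very cheap per-frame check (here, simple signal energy) keeps the device in its low-power path most of the time, and only promising audio gets handed to a more expensive keyword scorer. This is purely illustrative; the thresholds and the placeholder score_keyword() function are invented for the example and are not Cadence’s implementation.

```python
# Minimal sketch of a two-stage "always-on" wake-word loop, assuming
# 16 kHz mono audio delivered in 20 ms frames. Real firmware runs a
# trained keyword model on a DSP; score_keyword() below is a stand-in.
import numpy as np

SAMPLE_RATE = 16_000
FRAME_SAMPLES = SAMPLE_RATE // 50        # 20 ms frames
ENERGY_GATE = 1e-3                       # cheap stage-1 threshold (placeholder)
WAKE_THRESHOLD = 0.8                     # stage-2 keyword confidence (placeholder)

def frame_energy(frame: np.ndarray) -> float:
    """Mean squared amplitude: the low-power stage-1 check."""
    return float(np.mean(frame ** 2))

def score_keyword(window: np.ndarray) -> float:
    """Stand-in for a small neural keyword spotter running on the DSP."""
    return 0.0  # a real system would return model confidence in [0, 1]

def always_on_loop(frames, window_frames: int = 50):
    """Gate expensive keyword scoring behind a cheap energy check."""
    buffer = []
    for frame in frames:
        buffer.append(frame)
        buffer = buffer[-window_frames:]          # keep roughly 1 s of audio
        if frame_energy(frame) < ENERGY_GATE:
            continue                              # stay in the low-power path
        if score_keyword(np.concatenate(buffer)) > WAKE_THRESHOLD:
            yield "wake"                          # hand off to full ASR

# Example: feed silence, so the loop never leaves the low-power path.
silence = (np.zeros(FRAME_SAMPLES, dtype=np.float32) for _ in range(100))
print(list(always_on_loop(silence)))              # -> []
```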

Companies like Cadence with their Tensilica IP have been working to address these challenges and much more for many years now. Some of the company’s early work enabled things like voice trigger/wake on word, automatic speech recognition (ASR), and voice ID. Now Cadence is working on more advanced solutions that meet the more demanding requirements of newer AI-based voice applications, all while maintaining the extremely low power draw that the category has always had.

One of the best ways to meet these demands is through a chip architecture known as a DSP, or digital signal processor. DSPs are optimized to process audio for tasks like noise cancellation, equalization, and voice recognition, and they can do so in a power-efficient manner.
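As a small illustration of the kind of work a DSP is built for, the sketch below applies a single biquad peaking-EQ filter to a test tone. The coefficients follow the widely used “Audio EQ Cookbook” peaking-filter formulas; the sample rate, center frequency, and gain are arbitrary example values, and a real audio DSP would run this same tight multiply-accumulate loop in optimized, often fixed-point, code.

```python
# One classic DSP building block: a biquad peaking-EQ filter applied
# sample by sample (direct-form I difference equation).
import numpy as np

def peaking_eq_coeffs(fs, f0, gain_db, q=1.0):
    """Biquad coefficients for a peaking EQ at f0 Hz with the given gain."""
    a = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * a, -2 * np.cos(w0), 1 - alpha * a])
    a_coeffs = np.array([1 + alpha / a, -2 * np.cos(w0), 1 - alpha / a])
    return b / a_coeffs[0], a_coeffs / a_coeffs[0]

def biquad(x, b, a):
    """y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]"""
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = (b[0] * x[n]
                + (b[1] * x[n - 1] if n >= 1 else 0)
                + (b[2] * x[n - 2] if n >= 2 else 0)
                - (a[1] * y[n - 1] if n >= 1 else 0)
                - (a[2] * y[n - 2] if n >= 2 else 0))
    return y

fs = 16_000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1_000 * t)                        # 1 kHz test tone
b, a = peaking_eq_coeffs(fs, f0=1_000, gain_db=6.0)      # boost 1 kHz by 6 dB
y = biquad(x, b, a)
print(round(float(np.max(np.abs(y)) / np.max(np.abs(x))), 2))  # ~2.0, i.e. ~+6 dB
```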

For many years now, Cadence has been enhancing the instruction set architecture (ISA) of its audio DSP IP and developing the software libraries and AI tool flows needed to map neural networks efficiently onto those designs. The company’s current Tensilica HiFi DSPs, for example, are a key part of smart speaker systems, modern automotive infotainment systems, and much more. The company’s NNE100 IP takes these capabilities even further and can be used for advanced computer vision, driver assistance, and other applications.

Most of the designs that include Tensilica IP are part of larger SoC (system on chip) architectures that incorporate multiple components, including CPUs. The Tensilica blocks function as audio accelerators that can offload certain tasks and workloads from the CPU so that devices run more efficiently and extend their battery life. As audio-based applications in devices become more important and more demanding, the need to improve that efficiency and increase performance becomes critical.

This is why both device vendors and silicon providers are often so obsessed with a metric known as PPA: power, performance, and area. The more TOPS (tera operations per second) a design can deliver within a given amount of silicon area, the better. But beyond raw TOPS, it’s also critical to consider the power efficiency of the design, often expressed as TOPS per watt, particularly for battery-powered devices.
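As a back-of-the-envelope illustration of how those metrics get compared, the snippet below computes TOPS per watt and TOPS per square millimeter for two hypothetical accelerator blocks. All of the numbers are invented for the example and don’t describe any real product.

```python
# Hypothetical PPA-style comparison of two accelerator blocks; every
# figure here is made up purely for illustration.
designs = {
    "design_a": {"tops": 2.0, "watts": 0.5, "area_mm2": 1.6},
    "design_b": {"tops": 3.0, "watts": 1.2, "area_mm2": 2.0},
}

for name, d in designs.items():
    tops_per_watt = d["tops"] / d["watts"]       # efficiency for battery-powered use
    tops_per_mm2 = d["tops"] / d["area_mm2"]     # performance per unit silicon area
    print(f"{name}: {tops_per_watt:.1f} TOPS/W, {tops_per_mm2:.1f} TOPS/mm^2")

# design_a: 4.0 TOPS/W, 1.2 TOPS/mm^2  -> more efficient despite lower peak TOPS
# design_b: 2.5 TOPS/W, 1.5 TOPS/mm^2
```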

As mentioned above, the key to achieving breakthrough speech-based applications is a combination of advanced software and silicon. A critical part of that is providing tools that let software developers who may not understand the intricacies of DSP and other audio-focused chip architectures still take advantage of their capabilities. These bridging tools let developers work in today’s popular AI software frameworks, such as PyTorch and TensorFlow, and have the applications they build within those frameworks run seamlessly on audio-focused hardware. The Tensilica group within Cadence is doing this as well, offering software tools that provide the critical translation layers necessary to make this process work.
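For a rough sense of what that hand-off looks like from the developer’s side, the sketch below defines a deliberately tiny keyword-spotting-style model in PyTorch and exports it to ONNX, a portable graph format that many vendor compilers can consume. The model, its input shapes, and the ONNX route are illustrative assumptions; Cadence’s actual tool flow and supported formats may differ.

```python
# A generic illustration of the framework-to-hardware hand-off: build a
# small model in PyTorch, then export a portable graph for a vendor
# toolchain to compile for its DSP. Model and shapes are invented here.
import torch
import torch.nn as nn

class TinyKeywordNet(nn.Module):
    """A deliberately small conv net over 40 mel bins x 100 frames."""
    def __init__(self, num_keywords: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(8 * 4 * 4, num_keywords)

    def forward(self, x):                 # x: (batch, 1, 40, 100) log-mel features
        return self.classifier(self.features(x).flatten(1))

model = TinyKeywordNet().eval()
dummy = torch.randn(1, 1, 40, 100)        # one log-mel spectrogram
torch.onnx.export(model, dummy, "tiny_keyword_net.onnx",
                  input_names=["log_mel"], output_names=["keyword_logits"])
```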

Even with these capabilities, the march of technological evolution continues onward, so it’s reasonable to expect advances in all these areas as well. Tensilica’s current offerings were all built before the GenAI explosion, for example, and while they can run many audio applications quite well, new architectures specifically optimized for GenAI-based audio models seem like a logical next step. Next-generation architectures that can support an interactive, voice-based user interface for query and response, something that isn’t possible or practical on existing designs, are going to be essential for moving applications like robotic assistants for retail, healthcare and service forward.

The overall opportunity for speech-based interactions with our devices and applications is enormous. The notion of truly smart machines and software that regular people can interact with in intuitive ways was, until very recently, the stuff of science fiction. With the types of advances we’re starting to see, however, it’s clear that audio-driven actions and requests are going to be an important part of our near-term future.

Bob O’Donnell is the president and chief analyst of TECHnalysis Research, LLC, a market research firm that provides strategic consulting and market research services to the technology industry and professional financial community. You can follow him on Twitter @bobodtech.