Is it Necessary to Make Great DSPs Greater?

The HiFi family comprises a rich portfolio of audio, voice, and speech processing DSPs, flanked at the ultra-low-power end by the HiFi 1 DSP and at the high-performance end by the versatile HiFi 5, which melds AI and DSP processing. Well received in the market, these DSPs have made their way into various applications, from the always-on island of an earbud SoC to higher-order ambisonics processing in premium cars and home theaters, enhancing user interaction and the listening experience. So, is it necessary to improve these DSPs? If yes, in what aspects?

The post-pandemic market has a growing appetite for richer features, better user experiences, and greater convenience. How can the DSP platform support this? Does it suffice to offer a commensurate increase in raw performance? Customers would certainly welcome quantum jumps in general performance that improve their applications across the board. In many cases, though, scaling raw performance does not solve customers’ pain points. Product managers ask how well the DSP plays with the rest of the system: is it too specialized to be used outside its core strengths, or can it be stretched to help the other compute elements in the SoC? Software architects ask how difficult it is to harness the increased performance. System architects are concerned with how easy it is to connect the DSP to the SoC fabric efficiently.

The latest HiFi DSP upgrades address these concerns from customers. The multi-fold improvements are discussed below.

Platform Upgrade

The HiFi DSPs are built on the LX controller platform, which has evolved through seven generations to the current LX7. The latest HiFi DSPs move to the upgraded LX8 controller platform, which brings the following improvements.

  • It adds a branch predictor, previously found only in higher-end application CPUs. The HiFi DSPs use the branch predictor to reduce the overhead of the control code that is prevalent in kernel outer loops and DSP code. The SoC designer can configure the number of entries in the branch table to suit their applications, and the cycle-accurate simulator facilitates exploring performance with different branch table sizes.
  • LX8 provides an optional L2 cache controller. As audio workloads grow large and diverse, their concurrent operation causes L1 cache thrashing, and the latency of refilling from system memory stalls the DSP, greatly reducing the instructions-per-cycle metric. System memory typically cannot be located close to the DSP because it is shared with other compute elements in the SoC. With an L2 cache closer to the DSP than system memory, the L1 miss penalty is greatly reduced; performance gains of up to 50% are seen over systems that lack an L2 cache.
  • The LX8-based HiFi DSPs achieve efficient data transfers between main memory and local tightly coupled data memory (DTCM). For this, the LX8 provides a 3D DMA with compression and decompression capabilities, which reduces memory footprint as well as the cycles and energy needed to move data across the buses. HiFi benefits from this when running AI inferencing workloads. The LX8 DMA also supports a wider address range, reducing the need for applications to use windowing.
  • System interfacing of HiFi DSPs is easier with the LX8. It significantly expands interrupt support, to up to 128 interrupts, so interrupt sources no longer need to share them. In addition to the AMBA 4 bus, the LX8 provides a low-latency APB interface to reduce pressure on the primary bus.
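
The benefit of the L2 cache point above can be sketched with a simple average memory access time (AMAT) model. All latencies and hit rates in this sketch are illustrative assumptions chosen for the arithmetic, not Tensilica specifications:

```c
/* Average memory access time: hit latency plus miss rate times
   miss penalty. Every number below is an assumed example value. */
double amat(double hit_cycles, double miss_rate, double miss_penalty) {
    return hit_cycles + miss_rate * miss_penalty;
}

/* Without L2: an assumed 90%-hit, 1-cycle L1 backed directly by
   system memory at an assumed 100 cycles:
     amat(1.0, 0.10, 100.0) -> 11.0 cycles on average.

   With L2: the same L1 backed by an assumed 10-cycle L2 that
   catches 95% of L1 misses:
     amat(1.0, 0.10, amat(10.0, 0.05, 100.0)) -> 2.5 cycles. */
```

Under these assumed numbers, the average access cost drops from 11 cycles to 2.5, illustrating why an L2 close to the DSP recovers so much of the stall time lost to L1 misses.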

The LX8 platform, therefore, enables HiFi DSPs to achieve higher performance through controller and system-level enhancements. Customers can choose to enable the features they need to achieve various performance/area tradeoff targets.

Auto-vectorization

The Cadence compiler toolchain, based on LLVM, has excellent instruction-level parallelization built in to fill the two to five slots of the HiFi DSPs’ VLIW architecture. Achieving the data-level parallelism that exploits the HiFi’s SIMD capability is harder, however. Software engineers typically spend large amounts of time hand-optimizing code, retrofitting it with HiFi-specific intrinsics (code that expresses data parallelism) to reach the required performance. Not only does this process affect time to market, but it also ties up premium resources, namely the DSP programmers, who could have been developing new algorithms instead.

The HiFi 1s and HiFi 5s DSPs address this problem with significantly enhanced auto-vectorization. They incorporate innovations that allow the compiler to generate SIMD code automatically, without programmer intervention. For well-written, parallelizable code, the compiler generates SIMD code whose performance rivals that of hand-written or hand-optimized code.

These innovations required careful hardware-software co-design spanning multiple engineering teams. Close collaboration between the DSP hardware, software, and compiler teams led to special ISA and data types being embedded in the HiFi DSPs and the compiler. The compiler can now auto-vectorize data arrays of all standard “C” data types for the HiFi 1s and HiFi 5s DSPs, saving programmers significant time and effort.
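
The kind of source the auto-vectorizer targets can be illustrated with plain C. The function below is our generic example, not Cadence code: it uses only standard types, independent loop iterations, and restrict-qualified pointers, leaving a vectorizing compiler free to emit SIMD code without any intrinsics.

```c
#include <stddef.h>

/* A well-written, parallelizable loop: standard "C" types, no
   intrinsics, iterations independent of one another, and restrict
   promising the compiler that the buffers do not overlap. A
   vectorizing compiler can turn this into SIMD code on its own. */
void apply_gain(float *restrict out, const float *restrict in,
                float gain, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * gain;
}
```

Because the source stays generic, the same function can compile to 4-wide SIMD on one DSP and wider SIMD on another, with no per-target edits.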

Well-written versus non-parallelizable code

Not all code, however, is well written or structured to be parallelizable, and gains on such code will be limited out-of-box (OOB). Programmers typically hand-optimize such code directly rather than invest the effort to refactor it into parallelizable form, since they see little benefit in refactoring first and optimizing afterward. Yet hand-optimization is painstaking, time-intensive work, and the resulting code looks very different from the original, making performance and functional issues harder to trace back. Moreover, the optimizations target a single DSP and do not scale to others: code hand-optimized for HiFi 1 will run on HiFi 5 but will not utilize HiFi 5’s greater SIMD capacity.

Simple edits to handle non-parallelizable code

Code can be non-parallelizable for many reasons. For one, array pointers not marked as non-overlapping force the compiler to treat the code as serial. If the pointers are in fact non-overlapping, a simple pragma directive can qualify them, and the compiler is then free to auto-vectorize the code. Auto-vectorization can also stumble when data types and values are not crisply defined. For example, assigning a variable the value 1.0 causes the compiler to treat it (unnecessarily) as a double, preventing auto-vectorization, whereas the value 1.0f defines it as a single-precision float and allows the compiler to vectorize. Such issues often stem from code developed on PCs, which support double-precision vector operations and run them efficiently. Programmers may be blissfully unaware that the type is overkill until the code is ported to an embedded DSP, where it runs correctly but can face severe performance bottlenecks.
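
Both issues can be shown in a few lines of C. The exact pragma spelling is toolchain-specific, so this sketch uses the standard C99 restrict qualifier to declare the pointers non-overlapping; the function names are ours, for illustration only.

```c
#include <stddef.h>

/* Non-vectorizable as written: the compiler must assume dst and
   src may overlap, and the unsuffixed constant 1.0 promotes the
   arithmetic to double precision. */
void normalize_serial(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] / (1.0 + src[i]);
}

/* Vectorizable: restrict promises no aliasing, and 1.0f keeps the
   computation in single precision. */
void normalize_simd(float *restrict dst, const float *restrict src,
                    size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] / (1.0f + src[i]);
}
```

The two versions compute the same values; only the second gives the compiler the guarantees it needs to emit SIMD code.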

A list of these and other conditions that get in the way of auto-vectorization, along with recommendations for handling them, is available; please check with your Cadence sales representative. As the examples above show, simple edits can convert code from non-parallelizable to compiler-vectorizable. The resulting source code is portable across HiFi DSPs and compiles and runs optimally on each of them without per-DSP code optimization; the compiler performs the DSP-specific optimization, freeing the software engineer from that burden.

ITU-T STL2019/STL2023 and Dolby Intrinsics

Codecs from standards organizations and ecosystem leaders define various intrinsics for their mathematical operations. Auto-vectorization can parallelize those as well, providing optimal to near-optimal out-of-box performance on HiFi 1s and HiFi 5s.

Double-Precision Floating Point Unit (DPFPU)

Audio, vision, and other algorithms sometimes cannot live with single-precision floating-point operations. The tradeoff between precision and range forced by a 32-bit representation causes problems, so these algorithms often rely on double-precision computation; examples include functions such as log, exponential, and tan. At other times, audio algorithms start out as double-precision code, typically in the PC domain, so as not to make any compromises that could affect quality, with the intention of moving to single precision or fixed point when optimizing for embedded systems. But who has the time? Requalifying the code against all test inputs and conditions after converting to single precision, adjusting the algorithm to compensate for any quality issues that crop up, and then re-testing is a lengthy process. By then, time-to-market pressures set in, and teams are forced to leave double-precision operations sprinkled all over the code to ship the product on time. Performance suffers if the DSP platform does not accelerate double precision.

The double-precision floating-point unit is optional and can be selected or deselected in the Cadence Xplorer tool while configuring the DSP. Speedups of up to 30X have been observed on double-precision operations with the scalar double-precision floating-point unit.

Audio DSP adding imaging features?

The market is changing quickly. Domain-specific DSPs were once the order of the day. Now, SoC architects ask, “So it’s a great audio DSP. What else can it do?” The question arises because SoC architects are under pressure to maximize compute capability across the disparate compute elements they design into the SoC. Depending on the use case, one compute element may be overloaded while another has many cycles to spare. Architects would like to rebalance the workload so that no compute element is overloaded, which requires temporarily moving a function to one or more of the other compute elements in the SoC.

Another is the always-on use case. Yesterday, one DSP listened for spoken keywords while another recognized the person speaking to the device. Today, customers want more efficiency: why can’t the same DSP perform both functions? After all, in the always-on domain, keeping leakage low is important, which in turn means a smaller area and fewer compute elements.

Considering such use cases, HiFi 1s and HiFi 5s include an imaging ISA and 8-bit MACs at different performance levels to process imaging and vision workloads efficiently.

AI Performance Enhancement

HiFi 1s and HiFi 5s inherit the AI performance of their predecessor DSPs, including acceleration for non-linear functions such as sigmoid and tanh, key layers in many neural networks. While HiFi 5 already had 32 8-bit MACs (and as many 4x16 and 8x16 MACs), HiFi 1s adds eight 8x8 and 8x16 MACs, significantly enhancing its ability to handle neural networks. With HiFi 1s, the always-on domain can efficiently run more intensive AI inferencing workloads than its predecessor could.
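
The arithmetic those 8-bit MAC arrays accelerate is, at its core, the quantized dot product at the heart of neural-network layers. The scalar C reference below is our illustrative baseline, not HiFi ISA code; the DSP’s parallel MACs perform many of these multiply-accumulates per cycle instead of one at a time.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference for an 8x8-bit multiply-accumulate dot product,
   the inner loop of quantized neural-network inference. A 32-bit
   accumulator holds the running sum without overflow for typical
   vector lengths. */
int32_t dot_q8(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

With 32 parallel 8-bit MACs, a DSP can retire 32 of these products per cycle where the scalar loop retires one.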

Conclusion

The LX8 platform and the new HiFi DSPs, HiFi 1s and HiFi 5s, represent significant advances in usability, ease of software development, time to market, performance, and functionality, leading the charge for convergent DSPs that solve a variety of problems at the edge across the audio, voice, AI, and imaging worlds.

Cadence is a pivotal leader in electronic systems design and computational expertise, using its Intelligent System Design strategy to turn design concepts into reality. Cadence.com