MP3 decoding on an ARM Cortex embedded microcontroller

I’m working on an STM32H7 project that needs mp3 decoding. For esp32 I’ve created a component, esp-audio-player, that uses the libhelix-mp3 fixed-point MP3 decoder. This particular libhelix-mp3 fork also has RISC-V support added (some of the esp32 versions are RISC-V and not Xtensa). Because libhelix-mp3 appears tailored for the esp32, and to see what’s out there, I decided to try out another mp3 decoder, minimp3, for the STM32H7 project.

minimp3

minimp3 looked like it could be a good fit. While it is using a floating point implementation, it also has SSE and NEON (ARM’s SIMD implementation) support. I had some hope that the performance on STM32H7 would be sufficient.

Once it was integrated into the codebase I was seeing some system exceptions and faults. As this is a Zephyr based embedded system I suspected that it could be stack related.

Increasing the stack from 4k to 8k to 10k didn’t resolve the issue.

Looking at the library function being called, mp3dec_decode_frame(), did turn up something interesting.

int mp3dec_decode_frame(mp3dec_t *dec, const uint8_t *mp3, int mp3_bytes, mp3d_sample_t *pcm, mp3dec_frame_info_t *info)
{
    int i = 0, igr, frame_size = 0, success = 1;
    const uint8_t *hdr;
    bs_t bs_frame[1];
    mp3dec_scratch_t scratch; /* <------ what is this and how large is it? */

    if (mp3_bytes > 4 && dec->header[0] == 0xff && hdr_compare(dec->header, mp3))
    {
        /* ... rest of the function omitted ... */

The scratch type is declared as:

typedef struct
{
    bs_t bs;
    uint8_t maindata[MAX_BITRESERVOIR_BYTES + MAX_L3_FRAME_PAYLOAD_BYTES];
    L3_gr_info_t gr_info[4];
    float grbuf[2][576], scf[40], syn[18 + 15][2*32];
    uint8_t ist_pos[2][39];
} mp3dec_scratch_t;

Right off it looks like this is a huge structure for an embedded system. grbuf is 2 x 576 == 1152 floats (4 bytes each), so about 4.5k, and syn is (18+15) x (2 * 32) == 2112 floats, so over 8k.

A test app on OSX that prints the size of this structure reported some 16k.

This explains why increasing the stack size to 10k didn’t resolve the issue: the mp3dec_scratch_t structure alone was larger than the largest stack I had tried for the mp3 decoding task.

PC based systems, Linux, Windows, OSX etc, wouldn’t even flinch at 16k of data on the stack. The OS would detect the stack memory page fault, allocate additional memory for it, and the application would continue. For embedded systems a stack of 8k is pretty big and 16k+ is significant.

There was a pretty straightforward fix. The mp3dec_t structure represents the data required to track the state of the minimp3 decoder. It’s what you pass into each minimp3 public API call. We can move mp3dec_scratch_t from being a stack variable in mp3dec_decode_frame() and into the mp3dec_t structure.

You might be asking: why does this matter? Memory is memory, right? There are a few reasons for the shift:

It’s more friendly for embedded systems where stacks tend to be small.

Why? Well, moving it to mp3dec_t means you can make an instance of mp3dec_t at file scope and the memory usage shifts from run-time (stack based) to compile time, as the linker will reserve RAM for the structure. Compile time / static memory allocation means this memory can’t run out at run-time: if the image is able to compile and link, the RAM is reserved. For many things it may make sense to have some dynamic memory allocation, but in general good embedded software design will push to make as much memory statically allocated as possible.

There is little to no impact for non-embedded systems.

Big-OS systems can allocate mp3dec_t on the stack, or malloc it, or new it etc. Increasing the size of mp3dec_t by ~16k is a small amount, small enough that it won’t be noticed on these systems. The programmer already has to allocate mp3dec_t somewhere, now it’s just a little bit larger.

After this change mp3dec_decode_frame() is able to run without exceeding a stack of only 4k. I haven’t measured stack usage to see if it’s possible to reduce the stack size further.

On to testing….

With the mp3 decoder running I was seeing another issue: interrupted text on the terminal. The system would boot up and print some startup text that would get cut off, and the serial terminal was unresponsive.

After adding a few debug prints it was clear the system was decoding mp3 data but the behavior was odd, almost like the system was hung but the decoding prints were continuing.

Profiling

So I added some profiling to measure the time it took to call mp3dec_decode_frame(). The results were surprising…

[00:10:51.529,000] <inf> audio_player: sample_count 1152
[00:10:51.529,000] <inf> audio_player: comparing
[00:10:51.529,000] <inf> audio_player: frame_offset 0, frame_bytes 261, bitrate_kbps 80
[00:10:51.575,000] <inf> audio_player: playing audio, ready underruns: 0, underflow_count: 0, decoded_buffer_count: 565
[00:10:52.075,000] <inf> audio_player: playing audio, ready underruns: 0, underflow_count: 0, decoded_buffer_count: 565
[00:10:52.575,000] <inf> audio_player: playing audio, ready underruns: 0, underflow_count: 0, decoded_buffer_count: 565
[00:10:52.629,000] <inf> audio_player: buffer 0x24005d64
[00:10:52.629,000] <inf> audio_player: decoding
[00:10:52.680,000] <inf> audio_player: delta NS: 51515687, MS: 51.515687 <------------------- 51ms whaaaattt????

It’s taking 51ms to decode a single mp3 frame! Why is this an issue?

At a typical mp3 sample rate of 44100Hz, stereo playback consumes 44100Hz * 2 channels = 88200 16-bit samples/sec, so the mp3 decoding rate has to be at least that high in order to provide real-time playback.

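At 51.5ms per mp3 frame, and a mp3 frame being 1152 samples, the present decode rate was:

1152 samples / 51.5ms per frame ≈ 22369 samples/sec

22369 samples/sec is roughly one quarter of the 88200 samples/sec required for real-time decoding, and this was with the processor spending all of its available time decoding mp3 data. If the 1152 count is per channel, which I believe is how minimp3 reports it, a stereo frame carries 2304 samples and the shortfall is closer to one half. Either way the decoder can’t keep up.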

Conclusion

The STM32H7 processor is a Cortex-M7 core. It has 8 and 16-bit fixed point SIMD but no NEON, the floating point SIMD support other ARM cores have, so minimp3’s NEON path doesn’t apply and the Cortex-M7 was decoding the mp3 with scalar floating point only. The single FPU is simply not performant enough at ~400MHz to decode an MP3 in real-time when using a floating-point based mp3 decoder with no SIMD acceleration :-(

I had assumed incorrectly, based on the performance of libhelix-mp3 on a 240MHz esp32 processor, that minimp3 would have a similar level of performance. Thankfully it was only a handful of hours to implement and get to this point, and much of the decode code in my project can be reused with the next mp3 decoder library.

It’s possible to improve the performance of minimp3, but getting there on Cortex-M7 is likely to require switching from a floating point to a fixed point implementation. That would be a fun adventure but not something I’m up for at the moment.

At this point I’m going to be looking into other mp3 decoding libraries.

A PR against the upstream repository has been opened to improve the stack usage on embedded platforms. Maybe it will help those using minimp3 on embedded systems with NEON support. You can check out the PR at chmorgan/minimp3.
