July 2025
In a previous post,
I introduced the Apple ProRes video codec as part of my enrollment in
the Google Summer of Code initiative, 2025 edition. This week I finished
most of the initial implementation of this project, i.e. a
GPU-accelerated ProRes decoder using Vulkan compute shaders. It supports
nearly all codec features, including all profiles, 4:2:2/4:4:4
subsampling, 10/12-bit depth, and interlacing.
In this post, I will go over some of the implementation details, and
outline my next steps for this project. Everything mentioned in this
post can be found in code here: https://github.com/averne/FFmpeg/tree/vk-proresdec.
This work is integrated within FFmpeg, which already includes a software ProRes decoder. Optionally, that decoder can offload a good part of the
decoding process to a hardware accelerator. So far, only Apple platforms
were supported via VideoToolbox. However, the VT accelerator ingests the
entire bitstream, including frame and picture headers. This is not
compatible with the approach I wanted to take for the Vulkan code, which
would operate at the slice level.
Therefore, I had to adapt the ProRes decoder to send slice data to the
hardware accelerator.
I also added a ProRes “parser”. This is a small routine, used during
the initial probing part of the decoding process, which extracts
essential data about the file from the frame header. Previously,
decoding a ProRes video would start by processing and discarding an
entire frame, in order to determine its dimensions and other
parameters.
This was pretty wasteful, especially in the context of a hardware decoder, where you expect all heavy processing to be offloaded.
The Vulkan decoder is divided into two main shaders: one performing entropy decoding (VLD), and one performing the inverse transform (IDCT).
From the start, I wanted the decoder to avoid an intermediate surface for decoded DCT coefficients, and instead write them to the final framebuffer and perform the IDCT in-place. (This is possible because ProRes only supports 10- and 12-bit depths, stored in 16-bit memory atoms, which can accommodate the DCT coefficients.) The idea was to save memory (ProRes is often used to store high-resolution footage), and to let more frame data fit in the GPU cache.
However, this requires initializing the framebuffer to 0 before the VLD step, because DCT coefficients are sparsely encoded. (AC coefficients are encoded using an RLE-like scheme, quite similar to JPEG, where only non-zero values are signaled.) I was very disappointed to find that Vulkan does not support using copy engines to clear YUV images. Instead, a third shader is inserted before VLD, which just writes zeroes to the color planes.
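As a rough sketch (not the actual shader, and with an illustrative binding, image format and workgroup size), this clearing pass can be as simple as:

```glsl
#version 450
// Zero-fill pass, dispatched once per color plane before VLD.
layout(local_size_x = 16, local_size_y = 16) in;
layout(binding = 0, r16ui) uniform writeonly uimage2D plane;

void main() {
    ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(pos, imageSize(plane))))
        return;
    imageStore(plane, pos, uvec4(0u));
}
```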
Two shader variants are compiled during decoder initialization, for progressive and interlaced frames. Indeed, interlaced decoding writes to different locations, and I didn’t want a runtime branch on every texel load/store; instead, the choice happens at compile time. (I’ve also considered compiling two variants for luma/chroma components in the VLD shader, since the block scanning pattern is different and requires a branch to calculate. For now I’ve decided against it, but I might revisit this if it turns out to make a significant difference.)
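One way to express that compile-time choice is through a specialization constant (whether the real shaders use this or preprocessor defines, the effect is the same); the helper below and the assumed field layout are hypothetical:

```glsl
// The progressive/interlaced decision is baked in when the pipeline is built,
// so texel address calculations carry no runtime branch.
layout(constant_id = 0) const bool INTERLACED = false;

// Map a position within the current field to its line in the output frame;
// for interlaced content the two fields are assumed to sit on alternating lines.
ivec2 output_pos(ivec2 pos, int field) {
    return INTERLACED ? ivec2(pos.x, pos.y * 2 + field) : pos;
}
```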
Relevant frame metadata is sent to the shaders using uniform push
constants. This includes the frame dimensions, subsampling, depth,
quantization matrix, etc. The whole structure is just 160 bytes, and could probably be reduced by a few bytes.
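For illustration, such a push-constant block might look like the following in GLSL; the field names and packing are made up, not the decoder’s actual 160-byte layout:

```glsl
// Hypothetical push-constant block; names and packing are illustrative only.
layout(push_constant) uniform FrameParams {
    ivec2 frame_size;       // width, height in pixels
    int   bit_depth;        // 10 or 12
    int   chroma_format;    // 4:2:2 or 4:4:4
    int   field_order;      // progressive, top- or bottom-field-first
    uint  qmat_luma[16];    // 8x8 quantization matrix, four 8-bit entries per uint
    uint  qmat_chroma[16];
} params;
```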
The decoder also sends 3 buffers:
The entropy decoding (VLD) shader is the most complex, but also the most boring part of the project. Indeed, the entropy decoding process is inherently not parallelizable (since codeword boundaries cannot be determined without decoding the preceding data), and therefore leaves little room to optimize for GPU architectures.
It also relies on integer math, which runs at half rate on modern NVidia cards. (On Ampere and later GPUs, instructions can be dispatched to two pipelines, only one of which supports INT32 operations. Blackwell “unified” these into mixed cores, yet some operations, including the IMAD and LOP3 instructions used for multiplies and bitwise logic, still run at reduced throughput.) I’ve tried to remove as many multiplies as possible, which was straightforward since most of them are by powers of 2 and can be converted to shifts.
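For example, since block and macroblock dimensions are powers of two, address calculations can trade the multiply for a shift (a generic illustration, not code lifted from the decoder):

```glsl
// Horizontal pixel coordinate of a macroblock: the multiply by the macroblock
// width (16) becomes a shift, avoiding a 32-bit IMAD on the integer pipe.
int mb_to_pixel_x(int mb_x) {
    // return mb_x * 16;
    return mb_x << 4;
}
```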
The inverse transform is where the parallel execution model of GPUs
really shines, because it is an arithmetic-intensive operation and
independent for each 8x8 block. Moreover, since the DCT is separable,
each row and column can be processed separately. Currently, each GPU invocation processes 8 texels, first as a row, then as a column, before the final output.
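A simplified sketch of that separable structure, using a naive 8-point 1-D IDCT for clarity (the actual shader uses a faster factorization, works on whole macroblocks, and handles scaling and clamping, all omitted here):

```glsl
#version 450
layout(local_size_x = 8) in; // one invocation per line of an 8x8 block

// Naive 8-point 1-D IDCT; the k = 0 basis vector is scaled by 1/sqrt(2).
void idct8(inout float v[8]) {
    float tmp[8];
    for (int n = 0; n < 8; n++) {
        float acc = 0.0;
        for (int k = 0; k < 8; k++) {
            float ck = (k == 0) ? inversesqrt(2.0) : 1.0;
            acc += ck * v[k] * cos(float((2 * n + 1) * k) * 3.14159265358979 / 16.0);
        }
        tmp[n] = 0.5 * acc;
    }
    for (int n = 0; n < 8; n++)
        v[n] = tmp[n];
}

// One 8x8 block of dequantized coefficients; loading it from and storing the
// result back to the frame image is omitted in this sketch.
shared float blk[8][8];

void main() {
    uint line = gl_LocalInvocationID.x;
    float v[8];

    // Row pass.
    for (int i = 0; i < 8; i++) v[i] = blk[line][i];
    idct8(v);
    for (int i = 0; i < 8; i++) blk[line][i] = v[i];

    barrier(); // all rows must be done before columns are read

    // Column pass.
    for (int i = 0; i < 8; i++) v[i] = blk[i][line];
    idct8(v);
    for (int i = 0; i < 8; i++) blk[i][line] = v[i];
}
```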
Since the IDCT also moves a fair bit of memory (it ingests, then produces, the entire framebuffer), careful access patterns are important. The shader stores two 16x16 macroblocks in single-precision floating-point format within shared memory (essentially programmer-managed L1 cache, partitioned out for our use) for intermediate calculations. This block is padded to avoid bank conflicts, and read from and written to using coalesced operations.
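For instance, padding each row of the shared tile by one element keeps column accesses from repeatedly hitting the same banks (the dimensions below are illustrative):

```glsl
// With 16x17 floats instead of 16x16, the column stride (17 words) is coprime
// with the 32-bank shared memory layout, so column reads spread across banks.
shared float tile[16][16 + 1];

float load_column_element(uint col, uint row) {
    return tile[row][col]; // consecutive invocations read consecutive rows
}
```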
The actual IDCT implementation was taken from a CUDA sample. It exploits the symmetry of the DCT basis functions to reduce the number of calculations; however, it could probably be optimized further.
I believe the luma/chroma branch on texel writes could be eliminated using a trick similar to the one described here. Aside from this, I don’t see many ways to significantly improve the entropy decode shader.
There are a few avenues I’d like to explore for the inverse transform step.
First, the IDCT described by Arai, Agui & Nakajima (Arai, Y., Agui, T., Nakajima, M., “A Fast DCT-SQ Scheme for Images,” 1988; reproduced in Pennebaker, W., Mitchell, J., “JPEG Still Image Data Compression Standard,” Van Nostrand Reinhold, 1992) is, as far as I know, the most efficient algorithm, with just 5 multiplies (the rest being moved to the rescaling step).
In contrast, the IDCT I’m currently using has 28 multiplies, though on NVidia chips floating-point adds, multiplies and FMAs have the same latency, so I’m not sure multiply count is the right metric to optimize for, since generating FMAs effectively doubles FLOP throughput over separate FMUL/FADD instructions. Down the road, I will need to look in detail at the generated instruction stream.
I also want to think about improving performance using subgroup operations, though at the moment I’m not sure what it would look like.
Finally, cooperative matrices (i.e. tensor/matrix cores on NVidia/AMD) seem like an obvious candidate for improving IDCT performance. This could reduce the inverse quantization and transform to just 3 matrix instructions. However, these operations do not support single-precision floating-point types, only up to half-precision. The precision loss should therefore be validated against the ProRes accuracy test (Annex A of the SMPTE document).
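Concretely, writing $C$ for the 8x8 DCT basis matrix, $B$ for a block of decoded coefficients and $Q$ for the quantization matrix (with $\circ$ the element-wise product), dequantization plus inverse transform amount to

$$ \hat{B} = C^{\mathsf{T}} \, (Q \circ B) \, C, $$

i.e. one element-wise multiply and two matrix multiplies, which map naturally onto matrix-multiply-accumulate instructions.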
Using tools such as the Radeon GPU Profiler or Nsight Graphics, I’d like to investigate the memory access patterns, cache behavior, etc. of the shaders.
Very unfortunately, Apple provides no reference decoder or conformance test suite to validate third-party decoders. The baseline is therefore the pre-existing, reverse-engineered software decoder within FFmpeg. Moreover, ProRes is not specified for bit-exact decoding; instead, it specifies several precision constraints that the IDCT should respect. (The accuracy requirements are actually more stringent than those of IEEE 1180-1990, “Specifications for the Implementations of 8x8 Inverse Discrete Cosine Transform”.)
This makes comparison with another decoder complicated, as both could be producing correct, but different, results. I will probably have to calculate the deviation between the software decoder and my implementation.
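As a starting point (my own suggestion, not something mandated by the specification), that deviation could be summarized by a peak error and a mean squared error over all $N$ decoded samples, with $r_i$ the software decoder’s output and $t_i$ the Vulkan decoder’s:

$$ e_{\mathrm{peak}} = \max_i \, \lvert r_i - t_i \rvert, \qquad \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( r_i - t_i \right)^2 . $$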
So far, I’ve tested my decoder on my personal NVidia and AMD cards, on Linux. This should be extended to Windows, Intel iGPUs, perhaps Android devices (unfortunately I do not own a Qualcomm Adreno device, which would be the most common mobile GPU series), or even the Raspberry Pi.
Finally, the lack of publicly available encoders makes it difficult to test for robustness against a variety of sources. I’m planning to test edge cases (e.g. on frame dimensions), but a wider range of encoders would have been welcome.