July 2025
In a previous post,
I introduced the Apple ProRes video codec as part of my enrollment in
the Google Summer of Code initiative, 2025 edition. This week I finished
most of the initial implementation of this project, i.e. a
GPU-accelerated ProRes decoder using Vulkan compute shaders. It supports
nearly all codec features, including all profiles, 4:2:2/4:4:4
subsampling, 10/12-bit depth, and interlacing.
In this post, I will go over some of the implementation details, and
outline my next steps for this project. Everything mentioned in this
post can be found in code here: https://github.com/averne/FFmpeg/tree/vk-proresdec.
This work is integrated within FFmpeg, which already includes a software ProRes decoder. Optionally, that decoder can offload a good part of the
decoding process to a hardware accelerator. So far, only Apple platforms
were supported via VideoToolbox. However, the VT accelerator ingests the
entire bitstream, including frame and picture headers. This is not
compatible with the approach I wanted to take for the Vulkan code, which
would operate at the slice level.
Therefore, I had to adapt the ProRes decoder to send slice data to the
hardware accelerator.
I also added a ProRes “parser”. This is a small routine, used during
the initial probing part of the decoding process, which extracts
essential data about the file from the frame header. Previously,
decoding a ProRes video would start by processing and discarding an
entire frame, in order to determine its dimensions and other
parameters.
This was pretty wasteful, especially in the context of a hardware decoder, where you expect all heavy processing to be offloaded.
The Vulkan decoder is divided into two main shaders: one performing entropy decoding (VLD), and one performing the inverse transform (IDCT).
From the start, I wanted the decoder to avoid an intermediate surface for decoded DCT coefficients, and instead write them to the final framebuffer and perform the IDCT in-place. (This is possible because ProRes only supports 10- and 12-bit depths, stored in 16-bit memory atoms, which can accommodate the DCT coefficients.) The idea was to save memory (ProRes is often used to store high-resolution footage), and to let more frame data fit in the GPU cache.
However, this requires initializing the framebuffer to 0 before the VLD step, because DCT coefficients are sparsely encoded. (AC coefficients are encoded using an RLE-like scheme, quite similar to JPEG, where only non-zero values are signaled.) I was very disappointed to find that Vulkan does not support using copy engines to clear YUV images. Instead, a third shader is inserted before VLD, which just writes zeroes to the color planes.
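As a rough sketch (not the actual shader, and with an illustrative binding, image format and workgroup size), this clearing pass can be as simple as:

```glsl
#version 450
// Zero-fill pass, dispatched once per color plane before VLD.
layout(local_size_x = 16, local_size_y = 16) in;
layout(binding = 0, r16ui) uniform writeonly uimage2D plane;

void main() {
    ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(pos, imageSize(plane))))
        return;
    imageStore(plane, pos, uvec4(0u));
}
```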
Two shader variants are compiled during decoder initialization, for progressive and interlaced frames. Indeed, interlaced decoding writes to different locations, and I didn’t want a runtime branch on every texel load/store; instead, the choice happens at compile time. (I’ve also considered compiling two variants for luma/chroma components in the VLD shader, since the block scanning pattern is different and requires a branch to calculate. For now I’ve decided against it, but I might revisit this if it turns out to make a significant difference.)
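One way to express that compile-time choice is through a specialization constant (whether the real shaders use this or preprocessor defines, the effect is the same); the helper below and the assumed field layout are hypothetical:

```glsl
// The progressive/interlaced decision is baked in when the pipeline is built,
// so texel address calculations carry no runtime branch.
layout(constant_id = 0) const bool INTERLACED = false;

// Map a position within the current field to its line in the output frame;
// for interlaced content the two fields are assumed to sit on alternating lines.
ivec2 output_pos(ivec2 pos, int field) {
    return INTERLACED ? ivec2(pos.x, pos.y * 2 + field) : pos;
}
```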
Relevant frame metadata is sent to the shaders using uniform push
constants. This includes the frame dimensions, subsampling, depth,
quantization matrix, etc. The whole structure is just 160 bytes, and could probably be reduced by a few bytes.
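For illustration, such a push-constant block might look like the following in GLSL; the field names and packing are made up, not the decoder’s actual 160-byte layout:

```glsl
// Hypothetical push-constant block; names and packing are illustrative only.
layout(push_constant) uniform FrameParams {
    ivec2 frame_size;       // width, height in pixels
    int   bit_depth;        // 10 or 12
    int   chroma_format;    // 4:2:2 or 4:4:4
    int   field_order;      // progressive, top- or bottom-field-first
    uint  qmat_luma[16];    // 8x8 quantization matrix, four 8-bit entries per uint
    uint  qmat_chroma[16];
} params;
```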
The decoder also sends 3 buffers:
The entropy decoding (VLD) shader is the most complex, but also the most boring part of the project. Indeed, the entropy decoding process is inherently not parallelizable (since codeword boundaries cannot be determined without decoding the preceding data), and therefore leaves little room to optimize for GPU architectures.
It also relies on integer math, which runs at half rate on modern NVidia cards. (On Ampere and later GPUs, instructions can be dispatched to two pipelines, only one of which supports INT32 operations. Blackwell “unified” these into mixed cores, yet some operations, including the IMAD and LOP3 instructions used for multiplies and bitwise logic, still run at reduced throughput.) I’ve tried to remove as many multiplies as possible, which was straightforward since most of them are by powers of 2 and can be converted to shifts.
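For example, since block and macroblock dimensions are powers of two, address calculations can trade the multiply for a shift (a generic illustration, not code lifted from the decoder):

```glsl
// Horizontal pixel coordinate of a macroblock: the multiply by the macroblock
// width (16) becomes a shift, avoiding a 32-bit IMAD on the integer pipe.
int mb_to_pixel_x(int mb_x) {
    // return mb_x * 16;
    return mb_x << 4;
}
```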
The inverse transform is where the parallel execution model of GPUs
really shines, because it is an arithmetic-intensive operation and
independent for each 8x8 block. Moreover, since the DCT is separable,
each row and column can be processed separately. Currently, each GPU invocation processes 8 texels, first as a row, then as a column, before the final output.
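A simplified sketch of that separable structure, using a naive 8-point 1-D IDCT for clarity (the actual shader uses a faster factorization, works on whole macroblocks, and handles scaling and clamping, all omitted here):

```glsl
#version 450
layout(local_size_x = 8) in; // one invocation per line of an 8x8 block

// Naive 8-point 1-D IDCT; the k = 0 basis vector is scaled by 1/sqrt(2).
void idct8(inout float v[8]) {
    float tmp[8];
    for (int n = 0; n < 8; n++) {
        float acc = 0.0;
        for (int k = 0; k < 8; k++) {
            float ck = (k == 0) ? inversesqrt(2.0) : 1.0;
            acc += ck * v[k] * cos(float((2 * n + 1) * k) * 3.14159265358979 / 16.0);
        }
        tmp[n] = 0.5 * acc;
    }
    for (int n = 0; n < 8; n++)
        v[n] = tmp[n];
}

// One 8x8 block of dequantized coefficients; loading it from and storing the
// result back to the frame image is omitted in this sketch.
shared float blk[8][8];

void main() {
    uint line = gl_LocalInvocationID.x;
    float v[8];

    // Row pass.
    for (int i = 0; i < 8; i++) v[i] = blk[line][i];
    idct8(v);
    for (int i = 0; i < 8; i++) blk[line][i] = v[i];

    barrier(); // all rows must be done before columns are read

    // Column pass.
    for (int i = 0; i < 8; i++) v[i] = blk[i][line];
    idct8(v);
    for (int i = 0; i < 8; i++) blk[i][line] = v[i];
}
```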
Since the IDCT also moves a fair bit of memory (it ingests, then produces, the entire framebuffer), careful access patterns are important. The shader stores two 16x16 macroblocks in single-precision floating-point format within shared memory (essentially programmer-managed L1 cache, partitioned out for our use) for intermediate calculations. This block is padded to avoid bank conflicts, and read from and written to using coalesced operations.
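For instance, padding each row of the shared tile by one element keeps column accesses from repeatedly hitting the same banks (the dimensions below are illustrative):

```glsl
// With 16x17 floats instead of 16x16, the column stride (17 words) is coprime
// with the 32-bank shared memory layout, so column reads spread across banks.
shared float tile[16][16 + 1];

float load_column_element(uint col, uint row) {
    return tile[row][col]; // consecutive invocations read consecutive rows
}
```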
The actual IDCT implementation was taken from a CUDA sample. It exploits the symmetry of the DCT basis functions to reduce the number of calculations; however, it could probably be optimized further.
I believe the luma/chroma branch on texel writes could be eliminated using a trick similar to the one described here. Aside from this, I don’t see many ways to significantly improve the entropy decode shader.
There are a few avenues I’d like to explore for the inverse transform step.
First, the IDCT described by Arai, Agui & Nakajima (Arai, Y., Agui, T., Nakajima, M., “A Fast DCT-SQ Scheme for Images,” 1988; reproduced in Pennebaker, W., Mitchell, J., “JPEG Still Image Data Compression Standard,” Van Nostrand Reinhold, 1992) is, as far as I know, the most efficient algorithm, with just 5 multiplies (the rest being moved to the rescaling step).
In contrast, the IDCT I’m currently using has 28 multiplies, though on NVidia chips floating-point adds, multiplies and FMAs have the same latency, so I’m not sure multiply count is the right metric to optimize for, since generating FMAs effectively doubles FLOP throughput over separate FMUL/FADD instructions. Down the road, I will need to look in detail at the generated instruction stream.
I also want to think about improving performance using subgroup operations, though at the moment I’m not sure what it would look like.
Finally, cooperative matrices (i.e. tensor/matrix cores on NVidia/AMD) seem like an obvious candidate for improving IDCT performance. This could reduce the inverse quantization and transform to just 3 matrix instructions. However, these operations do not support single-precision floating-point types, only up to half-precision. The precision loss should therefore be validated against the ProRes accuracy test (Annex A of the SMPTE document).
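Concretely, writing $C$ for the 8x8 DCT basis matrix, $B$ for a block of decoded coefficients and $Q$ for the quantization matrix (with $\circ$ the element-wise product), dequantization plus inverse transform amount to

$$ \hat{B} = C^{\mathsf{T}} \, (Q \circ B) \, C, $$

i.e. one element-wise multiply and two matrix multiplies, which map naturally onto matrix-multiply-accumulate instructions.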
Using tools such as the Radeon GPU Profiler or Nsight Graphics, I’d like to investigate the memory access patterns, cache behavior, etc. of the shaders.
Very unfortunately, Apple provides no reference decoder or conformance test suite to validate third-party decoders. The baseline is therefore the pre-existing, reverse-engineered software decoder within FFmpeg. Moreover, ProRes is not specified for bit-exact decoding; instead, it specifies several precision constraints that the IDCT should respect. (The accuracy requirements are actually more stringent than those of IEEE 1180-1990, “Specifications for the Implementations of 8x8 Inverse Discrete Cosine Transform”.)
This makes comparison with another decoder complicated, as both could be producing correct, but different, results. I will probably have to calculate the deviation between the software decoder and my implementation.
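As a starting point (my own suggestion, not something mandated by the specification), that deviation could be summarized by a peak error and a mean squared error over all $N$ decoded samples, with $r_i$ the software decoder’s output and $t_i$ the Vulkan decoder’s:

$$ e_{\mathrm{peak}} = \max_i \, \lvert r_i - t_i \rvert, \qquad \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( r_i - t_i \right)^2 . $$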
So far, I’ve tested my decoder on my personal NVidia and AMD cards, on Linux. This should be extended to Windows, Intel iGPUs, perhaps Android devices (unfortunately I do not own a Qualcomm Adreno device, which would be the most common mobile GPU series), or even the Raspberry Pi.
Finally, the lack of publicly available encoders makes it difficult to test for robustness against a variety of sources. I’m planning to test edge cases (e.g. on frame dimensions), but a wider range of encoders would have been welcome.