Vulkan ProRes decoder: prologue

averne

May 2025

Introduction

This year I enrolled in Google Summer of Code for the FFmpeg project, mentored by Lynne (with hassn as backup), working on a Vulkan shader-based decoder for the Apple ProRes codec (profiles 4444 and 4444 XQ).
What does this mean? Let’s go over each of these terms:

Google Summer of Code (GSoC) is an annual program in which contributors work on an open-source project over the summer, guided by mentors from that project.
FFmpeg is the de-facto standard open-source multimedia framework, providing the libraries and tools used to decode, encode, filter and mux most media formats in existence.
Vulkan is a modern, low-level, cross-platform graphics and compute API; shaders are small programs executed on the GPU, which will carry out the actual decoding work here.
Apple ProRes is a family of intermediate codecs ubiquitous in professional video production; the 4444 and 4444 XQ profiles support 4:4:4 chroma sampling, 12-bit depth, and an optional alpha plane.

Now that the context is clearer, let’s move on with the rest of the introduction.

Motivation

Nowadays, consumer media resolutions and framerates have ballooned, reaching 8K (8192x4320 px) and several hundred hertz. This becomes problematic even for simple codecs such as ProRes, because of the huge computational and bandwidth requirements associated with processing this much data.
Meanwhile, graphics processing devices (GPUs) have quickly become one of the most important computing peripherals (especially since the AI craze), and are especially well suited to media processing, due to their parallel programming model. While CPUs are optimised for latency-critical logic tasks, GPUs excel at bandwidth-heavy arithmetic work.
A ProRes GPU-based decoder is therefore desirable, particularly to speed up video editing software. (This is different from traditional GPU-accelerated, often called hardware-accelerated, video decoding, exposed for instance by DXVA or VAAPI, where the device relies on dedicated silicon, an ASIC, designed for a specific codec and separate from the main shader cores. Here, we use these programmable shading units to implement our decoder.) In order to take advantage of modern GPU features and maximise cross-platform compatibility, we will use the low-level Vulkan API.

Apple ProRes overview

In this section, I will outline the principle of operation of the ProRes codec, focusing mostly on the relevant profiles (4444 and 4444 XQ). Most of the information here can be found in the SMPTE RDD 36 specification, which documents the ProRes bitstream syntax and decoding process.

ProRes is an intra-codec, meaning each frame is encoded independently from the others. This is great for video editing: when jumping around the video, only a frame’s worth of data is required, without needing to seek back to a keyframe and decode intermediate frames first. However, this means it doesn’t take advantage of the temporal redundancies within the video stream, from which a large part of the compression originates in inter-codecs.
It supports up to 8k resolution, 4:4:4 subsampling (meaning the chroma planes are not downscaled compared to the luma data), up to 12 bits of depth for luma and chroma, and an optional alpha plane with up to 16 bits of depth.

Stream structure

A ProRes stream is constituted of several elements, or syntax structures, organised in a hierarchy:

Figure 1: ProRes structure hierarchy

Picture structure

A ProRes picture is broken down into slices, which can be decoded in parallel and contain a variable number of 16x16 px macroblocks (MBs), up to 8. To this end, the stream specifies a parameter in the picture header, log2_desired_slice_size_in_mb. The picture data is then tiled into slices of this size, starting from the left. If the picture dimensions don’t exactly line up with the slice size, the remaining data is broken down into slices of the largest power of two that still fits, and this process is repeated until the entire picture is covered. If a picture dimension is not a multiple of 16, the encode will slightly overshoot the picture size, and the stream consumer will have to discard or ignore the data past the specified frame boundary.

Figure 2: ProRes frame structure.
The frame width is 296 px, which is not a multiple of 16 (296 \bmod 16 = 8). Since the frame header signals log2_desired_slice_size_in_mb = 3, the frame is first tiled with 2 slices of 128 px (\lfloor 296 / 128 \rfloor). The largest power of two fitting in the 40 px left over (296 \bmod 128) is 32, therefore the next slice contains 2 MBs. The final remainder is 8 px, so the last slice contains a single MB and the total picture width overshoots by 8 px.
Similarly, the picture is tiled vertically into 14 slice rows (\lceil 216 / 16 \rceil), and contains 8 extraneous pixels.
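The tiling rule described above can be sketched in a few lines of Python. This is purely an illustration of the algorithm, not actual decoder code, and the function name and interface are made up:

```python
def slice_sizes(width_px, log2_desired_slice_size_in_mb):
    """Return the number of MBs in each slice of one 16 px tall slice row,
    following the tiling rule described in the text (illustrative sketch)."""
    mb_count = -(-width_px // 16)  # ceil(width / 16): MBs per row
    slice_mb = 1 << log2_desired_slice_size_in_mb
    sizes = []
    while mb_count > 0:
        # Shrink to the largest power of two that still fits
        while slice_mb > mb_count:
            slice_mb >>= 1
        sizes.append(slice_mb)
        mb_count -= slice_mb
    return sizes

# The 296 px wide frame from the figure: 19 MBs -> slices of 8, 8, 2 and 1 MBs
print(slice_sizes(296, 3))  # → [8, 8, 2, 1]
```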

Slices can contain 1, 2, 4 or 8 macroblocks, arranged horizontally (ie. 16x16, 32x16, 64x16, or 128x16 px). MBs are made up of 8x8 px blocks, whose number depends on the subsampling scheme: there are always 4 luma blocks, but the number of chroma blocks can be 2 (for 4:2:2) or 4 (for 4:4:4). The chroma blocks are “stretched” in the final composite image, to cover the same area as the luma. (Note that this is not the job of the decoder, whose task is simply to reconstruct the raw data contained within a compressed bitstream.) In our case, the profiles we care about use 4:4:4 subsampling, ie. no chroma downscaling.

Figure 3: Luma and chroma sample locations for common subsampling schemes.
In 4:4:4 mode, each luma sample is associated with a single chroma sample.
In 4:2:2 mode, each chroma sample is shared by 2x1 luma samples (horizontal stretching).
In 4:2:0 mode, each chroma sample is shared by 2x2 luma samples (horizontal and vertical stretching). Note that ProRes does not support 4:2:0.

Entropy coding

Entropy coding turns a sequence of fixed-width non-negative integers (the DCT coefficients) into a bitstream of variable-width codes.
ProRes uses a mix of the Golomb-Rice and exp-Golomb coding schemes, where small values are encoded with the former, and larger ones with the latter. (This favors decode speed, as the encoded DCT coefficients are mostly small in magnitude, and Golomb-Rice is computationally simpler than exp-Golomb.) The scheme and the Rice/exp-Golomb threshold being used are determined by the type of coefficient being encoded (DC or AC), its position within the bitstream (eg. whether it is the first DC coefficient), and the magnitude of the coefficient that came before it. (The very first DCT coefficient, in the top-left corner of the transformed matrix, is often called “DC”, because it corresponds to a null frequency and therefore represents a flat color over the whole spatial domain. The other coefficients are termed “AC”.)
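To give a feel for the two schemes, here is a generic bit-level sketch. The exact codebook parameters, switching rules and sign handling used by ProRes are more involved and omitted here; the reader class is an assumption for illustration:

```python
class BitReader:
    """Minimal MSB-first bit reader over a bytes object."""
    def __init__(self, data):
        self.data, self.pos = data, 0
    def bit(self):
        b = (self.data[self.pos >> 3] >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return b
    def bits(self, n):
        v = 0
        for _ in range(n):
            v = (v << 1) | self.bit()
        return v

def read_rice(r, k):
    """Golomb-Rice: unary quotient (zeros terminated by a one), then k remainder bits."""
    q = 0
    while r.bit() == 0:
        q += 1
    return (q << k) | r.bits(k)

def read_exp_golomb(r, k):
    """Order-k exponential-Golomb: count leading zeros, then read the payload."""
    n = 0
    while r.bit() == 0:
        n += 1
    return ((1 << (n + k)) | r.bits(n + k)) - (1 << k)

# "0 1 01" decodes to 5 with Rice(k=2): quotient 1, remainder 1 -> (1 << 2) | 1
print(read_rice(BitReader(bytes([0b01010000])), 2))  # → 5
```

Note how a small value costs only a couple of bits under the Rice scheme, while the exp-Golomb prefix length grows logarithmically for large outliers.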

Inverse scanning

The decoded coefficients have been packed in a way that increases entropy compression efficiency. The decoder must therefore rearrange them in the intended spatial order before proceeding.
The coefficients are placed in the stream essentially in a bottom-up direction with regard to the frame structure: each coefficient is grouped with its peers from other blocks, then other macroblocks. In addition, coefficients within blocks are scanned using a Morton curve pattern (illustrated in the 5th block in the figure below).

Figure 4: Scanning order
Red numbers indicate the first coefficients in the scanned stream, and their spatial position. Green numbers represent the second coefficients, and so on.

These reorderings result in greatly increased locality and reduced entropy, by grouping together coefficients whose values are likely to be close. Indeed, spatial variations over the image tend to be small, and DCT coefficients typically decrease in magnitude when moving away from the top-left corner (the DC coefficient). (The departure from zigzag scanning, used for instance in JPEG, is interesting, as zigzag more accurately matches the usual layout of DCT coefficients. This might be a tradeoff for decode speed, as a Morton order can be efficiently calculated using bitwise arithmetic, while zigzag ordering commonly requires a lookup table.)
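That bitwise computation can be sketched as follows, as an illustration of the de-interleaving idea for an 8x8 block (the orientation and exact scan used by the codec may differ from this toy version):

```python
def morton_8x8(idx):
    """Map a scan index (0..63) to an (x, y) position within an 8x8 block
    by de-interleaving its bits: x takes the even bits, y the odd ones."""
    x = (idx & 1) | ((idx >> 1) & 2) | ((idx >> 2) & 4)
    y = ((idx >> 1) & 1) | ((idx >> 2) & 2) | ((idx >> 3) & 4)
    return x, y

# The first four indices trace the characteristic Z shape of the Morton curve
print([morton_8x8(i) for i in range(4)])  # → [(0, 0), (1, 0), (0, 1), (1, 1)]
```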

Scaling and transform

Reordering yields quantised coefficients \mathbf{QF} organised in 8x8 frequency-domain blocks. Quantisation is the main source of compression in ProRes, and in effect is performed by a simple rounding division of the original matrix. To retrieve the de-quantised block, the inverse operation is performed, by multiplying with a weight matrix \mathbf{W}. While \mathbf{W} is global for the whole frame (though it can differ between the luma and chroma components, and can be set to the default matrix, in which case all 64 of its entries are 4), it is paired with a scalar, the rescaling factor \mathit{qScale}, which is signaled per-slice.
The rescaling operation is therefore: \mathbf{F} = \mathbf{QF} \ast \mathbf{W} \ast \mathit{qScale}.
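In code, this rescaling is a straightforward element-wise product (a minimal sketch of the formula above; the function name is made up):

```python
def dequantise(qf, w, q_scale):
    """Element-wise rescaling of an 8x8 block: F = QF * W * qScale."""
    return [[qf[y][x] * w[y][x] * q_scale for x in range(8)] for y in range(8)]

# With the default weight matrix (all entries 4) and qScale = 2,
# every coefficient is simply multiplied by 8
default_w = [[4] * 8 for _ in range(8)]
qf = [[1] * 8 for _ in range(8)]
print(dequantise(qf, default_w, 2)[0][0])  # → 8
```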

The coefficients are then transformed from the frequency domain \mathbf{F} to the spatial domain \mathbf{S}. This is achieved using the inverse discrete cosine transform (iDCT), with the expression given below. (Note the similarity with the 2D Fourier transform. The sums range from 0 to 7 since the blocks are 8 px wide, and produce a matrix of the same dimensions. While this operation, and in particular the nested sum, may seem intensive, there are clever ways of computing it rather efficiently.)

\mathbf{S}_{y, x} = \frac{1}{4} \sum_{u = 0}^{7} \sum_{v = 0}^{7} C(u) C(v) \mathbf{F}_{v, u} \cos \left( \frac{(2x+1)u\pi}{16} \right) \cos \left( \frac{(2y+1)v\pi}{16} \right)

with C(n) = \begin{cases} 1/\sqrt{2} & \text{if } n = 0 \\ 1 & \text{otherwise} \end{cases}
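The expression translates directly into code. The naive version below evaluates the nested sum exactly as written, for clarity; practical decoders use factored, separable algorithms instead:

```python
import math

def idct_8x8(F):
    """Direct evaluation of the 2D 8x8 iDCT formula above (O(8^4) per block)."""
    C = lambda n: 1 / math.sqrt(2) if n == 0 else 1.0
    S = [[0.0] * 8 for _ in range(8)]
    for y in range(8):
        for x in range(8):
            S[y][x] = 0.25 * sum(
                C(u) * C(v) * F[v][u]
                * math.cos((2 * x + 1) * u * math.pi / 16)
                * math.cos((2 * y + 1) * v * math.pi / 16)
                for u in range(8) for v in range(8))
    return S

# A lone DC coefficient yields a flat block: F[0][0] = 8 -> S[y][x] = 1 everywhere
F = [[0.0] * 8 for _ in range(8)]
F[0][0] = 8.0
print(round(idct_8x8(F)[5][2], 6))  # → 1.0
```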

After this operation, the decoding process for luma and chroma components is essentially complete. (There are a few steps remaining in order to calculate the final integer pixel data, and to output it at the correct location within the frame, but those are not particularly noteworthy.)

Alpha plane

If present, the alpha data is encoded losslessly and without subsampling, in raster-scanned order.
The data is differentially encoded using RLE: the difference of the current alpha value against the previous one is encoded, and identical consecutive values are sent only once, along with the number of times they are repeated. For both the run lengths and the alpha deltas, pre-defined tables are used to represent common small values using fewer bits.
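The principle can be sketched as follows. This is purely illustrative: the real bitstream packs the deltas and run lengths with variable-length codes from the pre-defined tables mentioned above, which are not reproduced here:

```python
def decode_diff_rle(pairs, first=0):
    """Differential run-length decoding: each (delta, run) pair means
    'the value changed by delta, then repeated run times'."""
    out, value = [], first
    for delta, run in pairs:
        value += delta
        out.extend([value] * run)
    return out

# Three samples at 10, then two at 15
print(decode_diff_rle([(0, 3), (5, 2)], first=10))  # → [10, 10, 10, 15, 15]
```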

State of the art

There is little software accelerating video decoding using GPU shaders, mostly due to the fact that most platforms already offer support for mainstream codecs such as H.264, H.265 or AV1, with dedicated hardware blocks.
A few proprietary solutions exist, such as the codec packs sold by MainConcept, or the decoders bundled with DaVinci Resolve.
In the open-source world, the most notable example is probably the FFmpeg project, which has been developing Vulkan-based acceleration for several codecs. There are also a few efforts targeting JPEG, along with some relatively old academic explorations of that same codec.

FFmpeg Vulkan decoders

At the time of writing, FFmpeg includes a shader-based FFv1 decoder and encoder, and there has been some effort towards the VC-2 codec (as part of the 2024 GSoC event).
These codecs are however rather dissimilar to ProRes (FFv1 being lossless, and VC-2 based on the wavelet transform instead of the DCT). Nevertheless, the techniques used in entropy decoding could still be relevant for our work, and the shaders include some optimised variants making use of advanced hardware features (eg. subgroups).

JPEG decoders

The astute reader may have noted similarities between ProRes and JPEG; these have led some to refer to ProRes as a “JPEG clone” (dixit #ffmpeg-devel). In particular, block sizes are the same, and the transform operation is identical (though JPEG quantisation lacks the \mathit{qScale} parameter). Techniques employed by JPEG decoders can therefore be applied, to some extent, to our situation.

GPUJPEG

GPUJPEG is an open-source project developed by the Czech Educational and Research Network, and implements a JPEG codec in CUDA. It has some interesting optimisations especially regarding entropy decoding, though its implementation of the iDCT seems (from a cursory look) inferior to the one in the next project.

NVIDIA JPEG decoder

NVIDIA offers a JPEG decoder as part of their cuvid library. While cuvid is primarily meant to expose hardware codec functionality, NVIDIA chips lack, for the most part, a dedicated JPEG block. (Recent NVIDIA GPUs, such as the A100 professional cards and Blackwell-generation consumer products, include such a purpose-built ASIC, called NVJPG. Tegra chips have also been bundling an NVJPG core for a long time.) The library therefore uses a few CUDA programs to speed up decoding operations.
NVIDIA also provides a CUDA sample showcasing the application of parallel programming to implement the (i)DCT, along with a whitepaper explaining the techniques used.

I’ve decided to reverse-engineer the CUDA binaries present in the cuvid library implementing this functionality. My goal here is two-fold: explore techniques used by NVIDIA developers themselves (who presumably have deep knowledge of parallel programming, along with access to internal documentation and tooling), and learn more about the low-level operation of NVIDIA shader cores, which should come in handy when optimising my own shaders.
However, this work is mostly unrelated to the present topic and this post is already running long, so I will be covering it in another blog post.

Conclusion

In this post, I presented the motivation behind building a Vulkan-based ProRes accelerator within FFmpeg. I also gave a brief overview of the technical features and decoding process of this codec, which allowed me to solidify my knowledge.
In future posts, I will be referring to this text while exploring implementation details.