May 2025
This year I enrolled in Google Summer of Code for the FFmpeg project,
mentored by Lynne (with hassn as backup), and working on a
Vulkan shader-based decoder for Apple ProRes codec (profiles 4444 and
4444 XQ).
What does this mean? Let’s go over each of these terms.
FFmpeg is an open-source project composed of libraries for decoding, encoding and processing multimedia; the ffmpeg, ffprobe, etc. binaries then use these libraries to expose a command-line interface for them.
Vulkan is a low-level, cross-platform graphics and compute API, whose shaders run on the GPU’s programmable cores. (One relevant extension here is VK_EXT_shader_object, which maps almost perfectly to NVIDIA hardware, but is pretty hard to implement for other vendors.)
Hardware decoders are fixed-function silicon blocks dedicated to specific codecs; they only cover mainstream formats, sometimes with omissions or delay with respect to the chip release.
Now that the context is clearer, let’s move on with the rest of the introduction.
Nowadays, consumer media resolutions and framerates have ballooned,
reaching 8K (8192x8192 px) and several hundred Hz. This becomes
problematic for even simple codecs such as ProRes, because of the huge
computational and bandwidth requirements associated with processing this
much data.
Meanwhile, graphics processing devices (GPUs) have quickly become one of
the most important computing peripherals (especially since the AI
craze), and are especially well suited to media processing, due to their
parallel programming model. While CPUs are optimised for
latency-critical logic tasks, GPUs excel at bandwidth-heavy arithmetic
work.
A ProRes GPU-based decoder is therefore desirable, particularly to
speed up video editing software. (This is different from traditional
GPU-accelerated, often called hardware-accelerated, video decoding,
exposed for instance by DXVA or VAAPI, where the device relies on
dedicated silicon (an ASIC) designed for a specific codec, and separate
from the main shader cores. Here, we instead use the programmable
shading units to implement our decoder.) In order to take advantage of
modern GPU features and maximise cross-platform compatibility, we will
use the low-level Vulkan API.
In this section, I will outline the principle of operation of the ProRes codec, focusing mostly on the relevant profiles (4444 and 4444 XQ). Most of the information here can be found in the SMPTE RDD 36 specification document.
ProRes is an intra-codec, meaning each frame is encoded independently
from the others. This is great for video editing, because when jumping
around the video only a frame’s worth of data will be required, without
needing to process keyframes first. However, this means it doesn’t take
advantage of the temporal redundancies within the video stream, from
which a large part of the compression originates in inter-codecs.
It supports up to 8K resolution, 4:4:4 subsampling
(meaning the chroma planes are not downscaled compared to the luma
data), up to 12 bits of depth for luma and chroma, and an optional alpha
plane with up to 16 bits of depth.
A ProRes stream is constituted of several elements, or syntax structures, organised in a hierarchy:
Figure
1: ProRes structure hierarchy
A ProRes picture is broken down into slices which can be decoded in
parallel, and contain a variable number of 16x16 px macroblocks (MBs),
up to 8. To this end, the stream specifies a parameter in the picture
header, log2_desired_slice_size_in_mb. The picture data is then
tiled into slices of this size, starting from the left. If the picture
dimensions don’t exactly line up with the slice size, the remaining data
is broken down in slices of the largest power of two that still fits,
and this process is repeated recursively until the entire picture is
covered. If a picture dimension is not a multiple of 16, the encode will
slightly overshoot the picture size, and the stream consumer will have
to discard or ignore the data past the specified frame boundary.
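To make this tiling rule concrete, here is a minimal sketch in C (with illustrative names, not taken from FFmpeg or the SMPTE document) of how one row of macroblocks could be split into slices:

```c
#include <stdio.h>

/* Split one row of macroblocks into slices: start at the desired slice
 * width (in MBs) and halve it whenever the remaining width no longer
 * fits a full slice, so every slice is a power-of-two number of MBs. */
static void tile_slice_row(int pic_width_px, int log2_desired_slice_size_in_mb)
{
    int mb_width  = (pic_width_px + 15) / 16;  /* round up to a whole MB */
    int slice_mbs = 1 << log2_desired_slice_size_in_mb;
    int x = 0;

    while (x < mb_width) {
        /* shrink to the largest power of two that still fits */
        while (slice_mbs > mb_width - x)
            slice_mbs >>= 1;
        printf("slice at MB %2d, %d MB(s) wide\n", x, slice_mbs);
        x += slice_mbs;
    }
}

int main(void)
{
    tile_slice_row(296, 3);  /* the Figure 2 example below */
    return 0;
}
```

Running it on a 296 px wide frame with log2_desired_slice_size_in_mb = 3 yields slices of 8, 8, 2 and 1 MBs, ie. the 128 + 128 + 32 + 16 px layout described in the figure below.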
Figure 2: ProRes frame structure.
The frame width is 296 px, which is not a multiple of 16 (296 = 18 × 16 + 8). Since the frame header signals log2_desired_slice_size_in_mb = 3, the frame is first tiled by 2 slices of 128 px (8 MBs × 16 px each, covering 256 px). The closest power of two to the 40 px left (296 − 256 = 40 px, ie. 2.5 MBs) is 32, therefore the next slice contains 2 MBs. The final remainder is 8 px, so the last slice contains a single MB and the total picture size overshoots by 8 px.
Similarly, the picture is tiled vertically into 14 slices (14 × 16 = 224 px covering the 216 px frame height), and contains 8 extraneous pixels.
Slices can contain 1, 2, 4 or 8 macroblocks, arranged horizontally
(ie. 16x16, 32x16, 64x16, or 128x16 px). MBs are made up of 8x8 px
blocks, whose number depends on the subsampling scheme: there are always
4 luma blocks, but the number of chroma blocks can be 2 (for 4:2:2), or
4 (for 4:4:4). The chroma blocks are “stretched” in the final composite
image, to cover the same area as the luma. (Note that this stretching is
not the job of the decoder, whose task is simply to reconstruct the raw
data contained within the compressed bitstream.) In our case, the
profiles we care about use 4:4:4 subsampling, ie. no chroma downscaling.
Figure 3: Luma and chroma sample locations for common subsampling
schemes.
In 4:4:4 mode, each luma sample is associated with a single chroma
sample.
In 4:2:2 mode, each chroma sample is associated with 2x1 luma samples
(horizontal stretching).
In 4:2:0 mode, each chroma sample is associated with 2x2 luma samples
(horizontal and vertical stretching). Note that ProRes does not support
4:2:0.
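As a small illustration, the chroma plane dimensions follow directly from the luma dimensions and the subsampling factors (a toy C helper; 4:2:0 is included for completeness even though ProRes does not support it):

```c
/* Chroma plane size for a given subsampling scheme.
 * ss_h/ss_v are the horizontal/vertical subsampling factors:
 * 4:4:4 -> (1, 1), 4:2:2 -> (2, 1), 4:2:0 -> (2, 2). */
static void chroma_plane_size(int luma_w, int luma_h, int ss_h, int ss_v,
                              int *chroma_w, int *chroma_h)
{
    *chroma_w = (luma_w + ss_h - 1) / ss_h;  /* round up for odd sizes */
    *chroma_h = (luma_h + ss_v - 1) / ss_v;
}
```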
Entropy coding turns a sequence of fixed-width non-negative integers
(the DCT coefficients) into a bitstream of variable-width codes.
ProRes uses a mix of the Golomb-Rice and exp-Golomb coding schemes,
where small values are encoded with the former, and larger ones with the
latter. (This favors decode speed, as the encoded DCT coefficients are
mostly small in magnitude, and Golomb-Rice is computationally simpler
than exp-Golomb.) The scheme and the Rice/exp-Golomb threshold being
used are determined according to the type of coefficient being encoded
(DC or AC), its position within the bitstream (eg. whether it is the
first DC coefficient), and the magnitude of the coefficient that came
before it. (The very first DCT coefficient, in the top-left corner of
the transformed matrix, is often called “DC”, because it corresponds to
a null frequency, and therefore represents a flat color over the whole
spatial domain. Other coefficients are termed “AC”.)
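For reference, here is what decoding these two code families typically looks like: a minimal C sketch, assuming a bit reader (get_bit/get_bits) defined elsewhere. The exact codeword parameterisation and switching rule used by ProRes are given by the SMPTE document and are not reproduced here.

```c
#include <stdint.h>

/* Assumed bit reader interface (hypothetical, defined elsewhere). */
typedef struct BitReader BitReader;
unsigned get_bit(BitReader *br);          /* read a single bit */
unsigned get_bits(BitReader *br, int n);  /* read n bits, MSB first */

/* Golomb-Rice with parameter k: unary quotient, then k remainder bits. */
static unsigned decode_rice(BitReader *br, int k)
{
    unsigned q = 0;
    while (get_bit(br) == 0)  /* count zeros up to the terminating 1 */
        q++;
    return (q << k) | get_bits(br, k);
}

/* Order-k exp-Golomb: count leading zeros, then read that many extra bits. */
static unsigned decode_exp_golomb(BitReader *br, int k)
{
    unsigned q = 0;
    while (get_bit(br) == 0)
        q++;
    return get_bits(br, q + k) + (1u << (q + k)) - (1u << k);
}
```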
The decoded coefficients have been packed in a way that increases
entropy compression efficiency. The decoder must therefore rearrange
them in the intended spatial order before proceeding.
The coefficients are placed in the stream essentially in a bottom-up
direction with regards to the frame structure: each coefficient is
grouped with its peers from other blocks, then other macroblocks. In
addition, coefficients within blocks are scanned using a Morton
curve pattern (illustrated in the 5th block in the figure below).
Figure
4: Scanning order
Red numbers indicate the first coefficients in the scanned stream, and
their spatial position. Green numbers represent the second coefficients,
and so on.
These reorderings result in greatly increased locality and reduced entropy, by grouping together coefficients whose values are likely to be close. Indeed, spatial variations over the image tend to be small, and DCT coefficients typically have a decreasing magnitude when moving away from the top-left corner (the DC coefficient). (The departure from zigzag scanning, used for instance in JPEG, is interesting, as zigzag more accurately represents the usual layout of DCT coefficients. This might be a tradeoff for decode speed, as a Morton order can be efficiently calculated using bitwise arithmetic, while zigzag ordering commonly requires a lookup table.)
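As an example of that bitwise arithmetic, the (x, y) position of the i-th coefficient under a Morton scan of an 8x8 block is obtained by de-interleaving the bits of i (a sketch; whether x or y takes the even-numbered bits depends on the exact scan definition, so this is not the normative table):

```c
/* De-interleave the 6 bits of i = y2 x2 y1 x1 y0 x0 into (x, y). */
static void morton_8x8(unsigned i, unsigned *x, unsigned *y)
{
    *x = ((i >> 0) & 1) | ((i >> 1) & 2) | ((i >> 2) & 4);  /* even bits */
    *y = ((i >> 1) & 1) | ((i >> 2) & 2) | ((i >> 3) & 4);  /* odd bits  */
}
```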
Reordering yields quantised coefficients organised in 8x8
frequency-domain blocks. Quantisation is the main source of compression
in ProRes, and in effect is performed by a simple rounding division of
the original matrix. To retrieve the de-quantised block, the inverse
operation is performed, by multiplying with a global weight matrix
\(W\). While the \(W\) matrix is global for the whole frame, it is
paired with a scalar, the rescaling factor \(q\), which is signaled
per-slice. (Though global, the weight matrix can be different for the
luma and chroma components. It can also be set to the default one, in
which case all its 64 components are 4.)
The rescaling operation is therefore:
\[ \tilde{F}(u, v) = F(u, v) \cdot W(u, v) \cdot q \]
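In code, the rescaling boils down to a per-coefficient multiply (a minimal sketch with illustrative names):

```c
#include <stdint.h>

/* De-quantise one 8x8 block: multiply each coefficient by its
 * weight matrix entry and by the per-slice rescaling factor. */
static void dequant_block(int32_t out[64], const int16_t in[64],
                          const uint8_t weights[64], int qscale)
{
    for (int i = 0; i < 64; i++)
        out[i] = in[i] * weights[i] * qscale;
}
```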
The coefficients are then transformed from the frequency domain \(F(u, v)\) to the spatial domain \(f(x, y)\). This is achieved using the inverse discrete cosine transform (iDCT), with the expression given below. (Note the similarity with the 2D Fourier transform. The sums range from 0 to 7 since the blocks are 8 px large, and produce a matrix of the same dimensions. While this operation, and in particular the nested sum, may seem intensive, there are clever ways of computing it rather efficiently.)
\[ f(x, y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\, C(v)\, F(u, v) \cos\!\left[\frac{(2x+1)u\pi}{16}\right] \cos\!\left[\frac{(2y+1)v\pi}{16}\right] \]
with
\[ C(u) = \begin{cases} \frac{1}{\sqrt{2}} & \text{if } u = 0 \\ 1 & \text{otherwise} \end{cases} \]
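A direct transcription of this formula into C looks as follows; it is naive (64 multiply-accumulates per output sample) and only meant to make the maths concrete, real decoders use factored fast algorithms instead:

```c
#include <math.h>

/* Naive 8x8 iDCT, straight from the formula above. */
static void idct_8x8(double out[8][8], const double in[8][8])
{
    const double pi = acos(-1.0);
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            double sum = 0.0;
            for (int v = 0; v < 8; v++)
                for (int u = 0; u < 8; u++) {
                    double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
                    double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
                    sum += cu * cv * in[v][u]
                         * cos((2 * x + 1) * u * pi / 16.0)
                         * cos((2 * y + 1) * v * pi / 16.0);
                }
            out[y][x] = sum / 4.0;  /* the 1/4 normalisation factor */
        }
}
```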
After this operation, the decoding process for the luma and chroma components is essentially complete. (There are a few steps remaining in order to calculate the final integer pixel data, and to output it at the correct location within the frame, but those are not particularly noteworthy.)
If present, the alpha data is encoded losslessly and without
subsampling, in raster-scanned order.
The data is differentially encoded using RLE,
meaning the difference of the current alpha value against the previous
one gets encoded, and identical consecutive values are sent only once, along
with the number of times they are repeated. For both the run lengths and
the alpha delta, pre-defined tables are used to represent common small
values using fewer bits.
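Schematically, decoding one row of alpha then looks like this (a hypothetical sketch: read_delta and read_run stand in for the pre-defined variable-length tables, whose exact layout is specified by the SMPTE document):

```c
#include <stdint.h>

/* Assumed readers for the delta and run-length VLC tables (hypothetical). */
typedef struct BitReader BitReader;
int      read_delta(BitReader *br);  /* signed difference vs. previous value */
unsigned read_run(BitReader *br);    /* number of repetitions */

static void decode_alpha_row(BitReader *br, uint16_t *dst, int width)
{
    int value = 0, x = 0;
    while (x < width) {
        value += read_delta(br);      /* differential decoding */
        unsigned run = read_run(br);  /* run-length expansion */
        for (unsigned i = 0; i < run && x < width; i++)
            dst[x++] = (uint16_t)value;
    }
}
```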
There is little software accelerating video decoding using GPU
shaders, mostly because platforms already offer support for mainstream
codecs such as H.264, H.265 or AV1 through dedicated hardware blocks.
A few proprietary solutions exist, such as the codec packs sold by MainConcept, or the decoders
bundled with DaVinci
Resolve.
In the open-source world, the most notable example is probably the
FFmpeg project, which has been developing Vulkan-based acceleration for
several codecs. There are also a few projects targeting JPEG, along with
some relatively old academic exploration of that codec.
At the time of writing, FFmpeg includes a shader-based FFv1
decoder and encoder, and there has been some effort towards the VC-2
codec (as part of the 2024 GSoC event).
These codecs are however rather dissimilar to ProRes (FFv1 being
lossless, and VC-2 based on the wavelet transform instead of DCT).
However, the techniques used in entropy decoding could still be relevant
for our work, and the shaders include some optimised variants making use
of advanced hardware features (eg. subgroups).
The astute reader may have noted similarities between ProRes and JPEG. (These similarities have led some to refer to ProRes as a “JPEG clone”, dixit #ffmpeg-devel.) In particular, block sizes are the same, and the transform operation is identical (though JPEG quantisation lacks the \(q\) rescaling factor). Techniques employed by JPEG decoders can therefore be applied to some extent to our situation.
GPUJPEG is an open-source project developed by the Czech Educational and Research Network, and implements a JPEG codec in CUDA. It has some interesting optimisations especially regarding entropy decoding, though its implementation of the iDCT seems (from a cursory look) inferior to the one in the next project.
NVIDIA offers a JPEG decoder as part of their cuvid library.
While cuvid is primarily meant to expose hardware codec functionality,
NVIDIA chips lack, for the most part, a dedicated JPEG block. (Recent
NVIDIA GPUs, such as the A100 professional cards and
Blackwell-generation consumer products, include such a purpose-built
ASIC, called NVJPG. Tegra chips have also been bundling an NVJPG core
for a long time.) The library therefore uses a few CUDA programs
to speed up decoding operations.
NVIDIA also provides a CUDA sample showcasing the application of
parallel programming to implement the (i)DCT, along with a whitepaper
explaining the techniques used.
I’ve decided to reverse-engineer the CUDA binaries present in the
cuvid library implementing this functionality. My goal here is two-fold:
explore techniques used by NVIDIA developers themselves (who presumably
have deep knowledge of parallel programming, along with access to
internal documentation and tooling), and learn more about the low-level
operation of NVIDIA shader cores, which should come in handy when
optimising my own shaders.
However, this work is mostly unrelated to the present topic and this
post is already running long, so I will be covering it in another blog post.
In this post, I laid out the motivation behind building a Vulkan-based
ProRes accelerator within FFmpeg. I gave a brief overview of the
technical features and decoding process of this codec, which allowed me
to solidify my knowledge.
In future posts, I will be referring to this text while exploring
implementation details.