June 2025
As part of the preliminary work for my Google Summer of Code ’25 project, I wanted to
reverse-engineer a CUDA binary implementing a JPEG decoder.
My goal here was two-fold:
This CUDA program is bundled in the cuvid library,
which I had been reverse-engineering for my aforementioned hardware
acceleration project. Due to this prior work, I had already found the
the location of the CUDA JPEG initialisation routine.
This function, after performing some work, ends up loading a CUDA
program from a table found in the executable’s .data.rel.ro
section.In
fact, this section appears to contain only CUDA binaries. Many other
program are bundled for other tasks such as color conversion,
deinterlacing, etc. The table associates a CUDA compute
capability with a given
program.In
fact, there are two tables, and the library picks one based on some
parameter which I haven’t bothered to figure out (though I suspect it
might be related to the kernel driver). The two programs for CC 8.6 are
bit-identical, so this doesn’t matter. In my case, I have a GA107
chip with compute capability 8.6, so that’s what I dumped from
Ghidra.For
reproduction, the binary was extracted from libnvcuvid.so
version 565.77, and has an xxHash64 checksum of
baef3c633ea7451e
.
While static decompilation of the library worked in this case, a more
general approach would’ve been to hook cuModuleLoadData
or similar, with tooling such as gdb or an LD_PRELOAD
stub.
cuobjdump
With the binary in hand, we can gather some important information
before diving into the assembly. Firstly, the file is just a 64-bit ELF,
with e_machine
set to 190 to indicate CUDA.
NVIDIA provides a set of utilities
for working with compiled CUDA binaries. cuobjdump
can be
used to parse and print metadata from our ELF
file,The
command I used is:
cuobjdump -arch sm_86 -sass -elf [path]
. and yields
a listing of binary sections, some metadata, and disassembly for code
sections.
From this, we find that:
jpegparse_pass1
,
jpegparse_pass2
, jpegdec_vld
, and
jpegdec_idct
.jpeg_natural_order
(which is a conversion table for zigzag
ordering)..nv.info.[function]
metadata section, which gives (among other) the number of kernel
arguments and their sizes, the number of used general-purpose registers
(GPRs), and the launch
bounds.This information is summarised below:
Function | Arguments | GPR usage | Shared memory usage | Launch bounds |
---|---|---|---|---|
jpegparse_pass1 |
3 (sizes 8, 8, 4) | 18 | 1024 b | 0x100, 1, 1 |
jpegparse_pass2 |
6 (sizes 8, 8, 8, 4, 4, 4) | 15 | 2048 b | 0x100, 1, 1 |
jpegdec_vld |
12 (sizes 8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 8, 8) | 25 | 0 b | 0x80, 1, 1 |
jpegdec_idct |
6 (sizes 8, 4, 4, 4, 4, 8) | 22 | 2336 b | 0x40, 1, 1 |
NVIDIA Nsight Systems is a system-level profiler that is able to trace calls to the CUDA API, and kernel execution metrics. Using it, we can watch the CUDA kernels being called during JPEG decoding.
We observe that for most files, the only function called is
jpegdec_idct
. Reverse-engineering the CPU-side code, I
found that the other kernels are only enabled when the JPEG bitstream
contains restart
markersRestart
markers is a JPEG feature, which inserts special codes in the compressed
bitstream. Upon encountering one of these codes, the entropy context is
reset. In effect, this allows for parallel processing of the Huffman
decode step, and error recovery in case of corruption. The frequency of
these markers is specified in the file header. at every single
minimal coding unit (MCU)
boundary.MCUs
are 16x16 blocks, divided in 8x8 luma/chroma sample blocks. They are
equivalent to macroblocks in video codecs. This was a bit
surprising, as JPEGs containing restart markers are quite rare to find
in the wild, and even more so at this rate of restart
occurence.In
fact, I had to dig pretty far for an encoder capable of producing files
with such a restart frequency. For the record, I used
cjpeg -restart 1B
.
For each kernel, Nsight reports its grid size, execution time, and some other metadata. It also confirms some information we found previously (launch bounds/block size, shared memory usage). I report below the numbers for a 1920x1080 px, 8-bit 4:2:0, 402 kiB JPEG file:
Function | Grid size | Execution time |
---|---|---|
jpegparse_pass1 |
0x100, 1, 1 | 6.784 μs |
jpegparse_pass2 |
0x100, 1, 1 | 38.304 μs |
jpegdec_vld |
1, 68, 1 | 116.864 μs |
jpegdec_idct (1) |
30, 135, 1 | 44.001 μs |
jpegdec_idct (2) |
15, 68, 1 | 10.656 μs |
jpegdec_idct (3) |
15, 68, 1 | 10.624 μs |
Nsight
compute is a tool aimed at in-depth performance analysis of CUDA
programs. It includes much many more metrics and indicators than Nsight
systems, but no system-wide trace.
Occupancy is one of the most critical metrics when optimising compute
workloads. In the table below, I report theoretical and achieved
occupancies for all kernels. While most approach the material limit,
jpegdec_vld
only achieves 25% of occupancy. Indeed, we see
that this kernel is very memory-intense compared to the other ones,
which could lead to stalling.
Function | Theoretical occupancy | Achieved occupancy | Bytes read | Bytes written |
---|---|---|---|---|
jpegparse_pass1 |
100% | 87.50% | 0.42 MB | 0.52 MB |
jpegparse_pass2 |
100% | 79.21% | 1.09 MB | 3.32 MB |
jpegdec_vld |
100% | 25.01% | 14.15 MB | 17.81 MB |
jpegdec_idct (1) |
66.67% | 62.19% | 2.55 MB | 2.07 MB |
jpegdec_idct (2) |
66.67% | 58.97% | 0.31 MB | 0.52 MB |
jpegdec_idct (3) |
66.67% | 59.50% | 0.30 MB | 0.52 MB |
Memory usage also gives a hint as to what each kernel is performing.
We see that jpegparse_pass1
reads only about 0.42 MB, which
is pretty much the size of the compressed JPEG bitstream. So this kernel
likely performs some pre-processing work using a quick inspection of the
stream.
On the other hand, jpegparse_pass2
reads over 6x as much
data, so it is likely doing much more complex work.
Finally, the amount of data written by the various
jpegdec_idct
invocations lines up with the uncompressed
image size (luma is
MB, and chroma
MB), meaning that the kernel processes each component (Y, Cb, Cr)
separately.
NVIDIA provides a variety of tools capable of extracting the SASS
(Streaming ASSembler) instruction stream from a compiled CUDA kernel.
While cuobjdump
and even Nsight compute can do it, I used
nvdisasm
,I
used options -plr
when disassabling to a text dump, and
-cfg
for generating control flow graphs. in
particular for its ability to generate control flow graphs (CFG) of the
programs, which can then be rendered using dot
. These CFGs
are reproduced below, along with the assembly listing:
Figure
1: SASS control flow graph for each kernel
Complete listings:
jpegparse_pass1
//--------------------- .text.jpegparse_pass1 --------------------------
.section .text.jpegparse_pass1,"ax",@progbits
.sectionflags @"SHF_BARRIERS=1"
.sectioninfo @"SHI_REGISTERS=18"
.align 128
.global jpegparse_pass1
.type jpegparse_pass1,@function
.size jpegparse_pass1,(.L_x_59 - jpegparse_pass1)
.other jpegparse_pass1,@"STO_CUDA_ENTRY STV_DEFAULT"
jpegparse_pass1:
.text.jpegparse_pass1:
/*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
/*0010*/ S2R R6, SR_CTAID.X ;
/*0020*/ ULDC UR5, c[0x0][0x170] ;
/*0030*/ BSSY B0, `(.L_x_17) ;
/*0040*/ UIADD3 UR4, UR5, 0xffff, URZ ;
/*0050*/ S2R R5, SR_TID.X ;
/*0060*/ UIADD3 UR5, UR5, -0x1, URZ ;
/*0070*/ IMAD.MOV.U32 R3, RZ, RZ, RZ ;
/*0080*/ USHF.R.U32.HI UR4, URZ, 0x10, UR4 ;
/*0090*/ ULDC.64 UR6, c[0x0][0x118] ;
/*00a0*/ UISETP.LT.U32.AND UP0, UPT, UR4, 0x4, UPT ;
/*00b0*/ USEL UR4, UR4, 0x4, !UP0 ;
/*00c0*/ IMAD R0, R6.reuse, 0x100, R5 ;
/*00d0*/ LOP3.LUT P0, RZ, R6, R5, RZ, 0xfc, !PT ;
/*00e0*/ IMAD R4, R0, UR4, RZ ;
/*00f0*/ SEL R10, RZ, 0x1, P0 ;
/*0100*/ IADD3 R2, R4, UR4, RZ ;
/*0110*/ IMNMX R7, R2, UR5, PT ;
/*0120*/ ISETP.GE.AND P1, PT, R7, R4, PT ;
/*0130*/ @!P1 BRA `(.L_x_18) ;
/*0140*/ IADD3 R2, -R4, 0x1, R7 ;
/*0150*/ IMAD.IADD R7, R7, 0x1, -R4 ;
/*0160*/ BSSY B1, `(.L_x_19) ;
/*0170*/ IMAD.MOV.U32 R8, RZ, RZ, RZ ;
/*0180*/ LOP3.LUT R2, R2, 0x3, RZ, 0xc0, !PT ;
/*0190*/ IMAD.MOV.U32 R3, RZ, RZ, RZ ;
/*01a0*/ ISETP.GE.U32.AND P0, PT, R7, 0x3, PT ;
/*01b0*/ ISETP.NE.AND P2, PT, R2, RZ, PT ;
/*01c0*/ @!P0 BRA `(.L_x_20) ;
/*01d0*/ IADD3 R3, R0, 0x1, RZ ;
/*01e0*/ ULDC UR5, c[0x0][0x170] ;
/*01f0*/ IADD3.X R6, P0, R4.reuse, c[0x0][0x160], RZ, PT, !PT ;
/*0200*/ UIADD3 UR5, -UR5, URZ, URZ ;
/*0210*/ IMAD.MOV.U32 R8, RZ, RZ, RZ ;
/*0220*/ IMAD R3, R3, UR4, RZ ;
/*0230*/ LEA.HI.X.SX32 R9, R4, c[0x0][0x164], 0x1, P0 ;
/*0240*/ LOP3.LUT R3, RZ, R3, RZ, 0x33, !PT ;
/*0250*/ IMNMX R7, R3, UR5, !PT ;
/*0260*/ IMAD.MOV.U32 R3, RZ, RZ, RZ ;
/*0270*/ IMAD.IADD R15, R2, 0x1, R7 ;
.L_x_21:
/*0280*/ IMAD.MOV.U32 R7, RZ, RZ, R9 ;
/*0290*/ LDG.E.U8 R14, [R6.64+-0x1] ;
/*02a0*/ LDG.E.U8 R11, [R6.64] ;
/*02b0*/ LDG.E.U8 R12, [R6.64+0x1] ;
/*02c0*/ LOP3.LUT R9, R14, 0xf8, RZ, 0xc0, !PT ;
/*02d0*/ PRMT R9, R9, 0x9910, RZ ;
/*02e0*/ ISETP.NE.AND P0, PT, R9, 0xd0, PT ;
/*02f0*/ ISETP.EQ.AND P0, PT, R8, 0xff, !P0 ;
/*0300*/ LDG.E.U8 R8, [R6.64+0x2] ;
/*0310*/ LOP3.LUT R9, R11, 0xf8, RZ, 0xc0, !PT ;
/*0320*/ LOP3.LUT R13, R12, 0xf8, RZ, 0xc0, !PT ;
/*0330*/ PRMT R9, R9, 0x9910, RZ ;
/*0340*/ PRMT R13, R13, 0x9910, RZ ;
/*0350*/ ISETP.NE.AND P1, PT, R9, 0xd0, PT ;
/*0360*/ IADD3 R9, R10, 0x1, RZ ;
/*0370*/ @!P0 IMAD.MOV R9, RZ, RZ, R10 ;
/*0380*/ ISETP.EQ.AND P1, PT, R14, 0xff, !P1 ;
/*0390*/ ISETP.NE.AND P3, PT, R13, 0xd0, PT ;
/*03a0*/ ISETP.EQ.AND P0, PT, R10, RZ, P0 ;
/*03b0*/ IADD3 R10, R9.reuse, 0x1, RZ ;
/*03c0*/ ISETP.EQ.AND P4, PT, R9, RZ, P1 ;
/*03d0*/ @!P1 IMAD.MOV R10, RZ, RZ, R9 ;
/*03e0*/ ISETP.EQ.AND P1, PT, R11, 0xff, !P3 ;
/*03f0*/ @P0 IADD3 R3, R4, 0x1, RZ ;
/*0400*/ ISETP.EQ.AND P3, PT, R10, RZ, P1 ;
/*0410*/ @P4 IADD3 R3, R4, 0x2, RZ ;
/*0420*/ IADD3 R11, R10, 0x1, RZ ;
/*0430*/ IADD3 R6, P4, R6, 0x4, RZ ;
/*0440*/ @!P1 IMAD.MOV R11, RZ, RZ, R10 ;
/*0450*/ @P3 IADD3 R3, R4.reuse, 0x3, RZ ;
/*0460*/ IADD3 R4, R4, 0x4, RZ ;
/*0470*/ IADD3 R10, R11, 0x1, RZ ;
/*0480*/ LOP3.LUT R9, R8, 0xf8, RZ, 0xc0, !PT ;
/*0490*/ ISETP.NE.AND P0, PT, R9, 0xd0, PT ;
/*04a0*/ IMAD.IADD R9, R15, 0x1, R4 ;
/*04b0*/ ISETP.EQ.AND P0, PT, R12, 0xff, !P0 ;
/*04c0*/ ISETP.NE.AND P3, PT, R9, RZ, PT ;
/*04d0*/ IMAD.X R9, RZ, RZ, R7, P4 ;
/*04e0*/ ISETP.EQ.AND P1, PT, R11, RZ, P0 ;
/*04f0*/ SEL R3, R4, R3, P1 ;
/*0500*/ @!P0 IMAD.MOV R10, RZ, RZ, R11 ;
/*0510*/ @P3 BRA `(.L_x_21) ;
.L_x_20:
/*0520*/ BSYNC B1 ;
.L_x_19:
/*0530*/ @!P2 BRA `(.L_x_18) ;
/*0540*/ IADD3 R6, P0, R4.reuse, c[0x0][0x160], RZ ;
/*0550*/ BSSY B1, `(.L_x_22) ;
/*0560*/ IMAD.MOV.U32 R9, RZ, RZ, R10 ;
/*0570*/ IADD3 R10, R4.reuse, 0x1, RZ ;
/*0580*/ LEA.HI.X.SX32 R7, R4, c[0x0][0x164], 0x1, P0 ;
.L_x_23:
/*0590*/ LDG.E.U8 R4, [R6.64] ;
/*05a0*/ IADD3 R2, R2, -0x1, RZ ;
/*05b0*/ ISETP.NE.AND P1, PT, R2, RZ, PT ;
/*05c0*/ IADD3 R6, P2, R6, 0x1, RZ ;
/*05d0*/ IMAD.X R7, RZ, RZ, R7, P2 ;
/*05e0*/ LOP3.LUT R11, R4, 0xf8, RZ, 0xc0, !PT ;
/*05f0*/ ISETP.NE.AND P0, PT, R11, 0xd0, PT ;
/*0600*/ IADD3 R11, R9, 0x1, RZ ;
/*0610*/ ISETP.EQ.AND P0, PT, R8, 0xff, !P0 ;
/*0620*/ IMAD.MOV.U32 R8, RZ, RZ, R4 ;
/*0630*/ @!P0 IMAD.MOV R11, RZ, RZ, R9 ;
/*0640*/ ISETP.EQ.AND P0, PT, R9, RZ, P0 ;
/*0650*/ IMAD.MOV.U32 R9, RZ, RZ, R11 ;
/*0660*/ SEL R3, R10.reuse, R3, P0 ;
/*0670*/ IADD3 R10, R10, 0x1, RZ ;
/*0680*/ @P1 BRA `(.L_x_23) ;
/*0690*/ BSYNC B1 ;
.L_x_22:
/*06a0*/ IMAD.MOV.U32 R10, RZ, RZ, R11 ;
.L_x_18:
/*06b0*/ BSYNC B0 ;
.L_x_17:
/*06c0*/ LOP3.LUT R2, R5.reuse, 0x1, RZ, 0xc0, !PT ;
/*06d0*/ STS [R5.X4], R10 ;
/*06e0*/ BAR.SYNC 0x0 ;
/*06f0*/ ISETP.NE.U32.AND P0, PT, R2, 0x1, PT ;
/*0700*/ IMAD.SHL.U32 R2, R5, 0x4, RZ ;
/*0710*/ R2P PR, R5, 0x7e ;
/*0720*/ @!P0 LOP3.LUT R4, R2.reuse, 0xfffffff8, RZ, 0xc0, !PT ;
/*0730*/ @!P0 LDS R6, [R5.X4] ;
/*0740*/ @P1 LOP3.LUT R8, R2.reuse, 0xfffffff0, RZ, 0xc0, !PT ;
/*0750*/ @!P0 LDS R7, [R4] ;
/*0760*/ @P2 LOP3.LUT R9, R2, 0xffffffe0, RZ, 0xc0, !PT ;
/*0770*/ @P1 LOP3.LUT R8, R8, 0x4, RZ, 0xfc, !PT ;
/*0780*/ @P2 LOP3.LUT R9, R9, 0xc, RZ, 0xfc, !PT ;
/*0790*/ @!P0 IMAD.IADD R6, R6, 0x1, R7 ;
/*07a0*/ @!P0 STS [R5.X4], R6 ;
/*07b0*/ BAR.SYNC 0x0 ;
/*07c0*/ @P1 LDS R8, [R8] ;
/*07d0*/ @P3 LOP3.LUT R6, R2, 0xffffffc0, RZ, 0xc0, !PT ;
/*07e0*/ LOP3.LUT P0, RZ, R5, 0x80, RZ, 0xc0, !PT ;
/*07f0*/ @P1 LDS R7, [R5.X4] ;
/*0800*/ @P3 LOP3.LUT R6, R6, 0x1c, RZ, 0xfc, !PT ;
/*0810*/ @P1 IMAD.IADD R10, R7, 0x1, R8 ;
/*0820*/ @P1 STS [R5.X4], R10 ;
/*0830*/ BAR.SYNC 0x0 ;
/*0840*/ @P2 LDS R9, [R9] ;
/*0850*/ @P4 LOP3.LUT R10, R2, 0xffffff80, RZ, 0xc0, !PT ;
/*0860*/ @P2 LDS R4, [R5.X4] ;
/*0870*/ @P4 LOP3.LUT R10, R10, 0x3c, RZ, 0xfc, !PT ;
/*0880*/ @P2 IMAD.IADD R4, R4, 0x1, R9 ;
/*0890*/ @P5 LOP3.LUT R9, R2, 0xffffff00, RZ, 0xc0, !PT ;
/*08a0*/ @P2 STS [R5.X4], R4 ;
/*08b0*/ @P5 LOP3.LUT R9, R9, 0x7c, RZ, 0xfc, !PT ;
/*08c0*/ BAR.SYNC 0x0 ;
/*08d0*/ @P3 LDS R6, [R6] ;
/*08e0*/ @P3 LDS R7, [R5.X4] ;
/*08f0*/ @P3 IMAD.IADD R8, R7, 0x1, R6 ;
/*0900*/ @P3 STS [R5.X4], R8 ;
/*0910*/ BAR.SYNC 0x0 ;
/*0920*/ @P4 LDS R10, [R10] ;
/*0930*/ @P4 LDS R7, [R5.X4] ;
/*0940*/ @P4 IMAD.IADD R4, R7, 0x1, R10 ;
/*0950*/ @P6 LOP3.LUT R7, R2, 0xfffffe00, RZ, 0xc0, !PT ;
/*0960*/ @P0 LOP3.LUT R2, R2, 0xfffffc00, RZ, 0xc0, !PT ;
/*0970*/ @P4 STS [R5.X4], R4 ;
/*0980*/ @P6 LOP3.LUT R7, R7, 0xfc, RZ, 0xfc, !PT ;
/*0990*/ BAR.SYNC 0x0 ;
/*09a0*/ @P5 LDS R9, [R9] ;
/*09b0*/ @P0 LOP3.LUT R4, R2, 0x1fc, RZ, 0xfc, !PT ;
/*09c0*/ @P5 LDS R6, [R5.X4] ;
/*09d0*/ @P5 IMAD.IADD R6, R6, 0x1, R9 ;
/*09e0*/ @P5 STS [R5.X4], R6 ;
/*09f0*/ BAR.SYNC 0x0 ;
/*0a00*/ @P6 LDS R7, [R7] ;
/*0a10*/ @P6 LDS R8, [R5.X4] ;
/*0a20*/ @P6 IMAD.IADD R8, R8, 0x1, R7 ;
/*0a30*/ IMAD.MOV.U32 R7, RZ, RZ, 0x8 ;
/*0a40*/ @P6 STS [R5.X4], R8 ;
/*0a50*/ IMAD.WIDE.U32 R6, R0, R7, c[0x0][0x168] ;
/*0a60*/ BAR.SYNC 0x0 ;
/*0a70*/ @P0 LDS R9, [R4] ;
/*0a80*/ @P0 LDS R2, [R5.X4] ;
/*0a90*/ @P0 IMAD.IADD R2, R2, 0x1, R9 ;
/*0aa0*/ @P0 STS [R5.X4], R2 ;
/*0ab0*/ @!P0 LDS R2, [R5.X4] ;
/*0ac0*/ STG.E.64 [R6.64], R2 ;
/*0ad0*/ EXIT ;
.L_x_24:
/*0ae0*/ BRA `(.L_x_24);
/*0af0*/ NOP;
/*0b00*/ NOP;
/*0b10*/ NOP;
/*0b20*/ NOP;
/*0b30*/ NOP;
/*0b40*/ NOP;
/*0b50*/ NOP;
/*0b60*/ NOP;
/*0b70*/ NOP;
.L_x_59:
jpegparse_pass2
//--------------------- .text.jpegparse_pass2 --------------------------
.section .text.jpegparse_pass2,"ax",@progbits
.sectionflags @"SHF_BARRIERS=1"
.sectioninfo @"SHI_REGISTERS=15"
.align 128
.global jpegparse_pass2
.type jpegparse_pass2,@function
.size jpegparse_pass2,(.L_x_58 - jpegparse_pass2)
.other jpegparse_pass2,@"STO_CUDA_ENTRY STV_DEFAULT"
jpegparse_pass2:
.text.jpegparse_pass2:
/*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
/*0010*/ S2R R9, SR_CTAID.X ;
/*0020*/ IMAD.MOV.U32 R7, RZ, RZ, 0x8 ;
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ;
/*0040*/ IMAD.MOV.U32 R0, RZ, RZ, RZ ;
/*0050*/ S2R R11, SR_TID.X ;
/*0060*/ ISETP.GE.U32.AND P0, PT, R11, R9, PT ;
/*0070*/ @!P0 IMAD.SHL.U32 R2, R11, 0x100, RZ ;
/*0080*/ @!P0 IMAD.WIDE R2, R2, 0x8, RZ ;
/*0090*/ @!P0 LOP3.LUT R4, R2, 0x7f8, RZ, 0xfc, !PT ;
/*00a0*/ IMAD R2, R9, 0x100, R11 ;
/*00b0*/ @!P0 IADD3 R4, P1, R4, c[0x0][0x168], RZ ;
/*00c0*/ @!P0 IADD3.X R5, R3, c[0x0][0x16c], RZ, P1, !PT ;
/*00d0*/ IMAD.WIDE.U32 R2, R2, R7, c[0x0][0x168] ;
/*00e0*/ @!P0 LDG.E R0, [R4.64] ;
/*00f0*/ LDG.E.64 R2, [R2.64] ;
/*0100*/ ISETP.NE.AND P0, PT, R9, RZ, PT ;
/*0110*/ CS2R R6, SRZ ;
/*0120*/ ISETP.NE.AND P1, PT, R11, RZ, PT ;
/*0130*/ STS [R11.X4+0x400], R0 ;
/*0140*/ STS [R11.X4], R2 ;
/*0150*/ @!P0 BRA `(.L_x_0) ;
/*0160*/ IMAD.MOV.U32 R0, RZ, RZ, 0x1 ;
.L_x_1:
/*0170*/ BAR.SYNC 0x0 ;
/*0180*/ LOP3.LUT P2, RZ, R0, R11, RZ, 0xc0, !PT ;
/*0190*/ @P2 IADD3 R5, R0.reuse, -0x1, RZ ;
/*01a0*/ @P2 LDS R2, [R11.X4+0x400] ;
/*01b0*/ @P2 LOP3.LUT R5, R5, R11, R0, 0xf4, !PT ;
/*01c0*/ IMAD.SHL.U32 R0, R0, 0x2, RZ ;
/*01d0*/ @P2 LDS R5, [R5.X4+0x400] ;
/*01e0*/ @P2 IMAD.IADD R2, R2, 0x1, R5 ;
/*01f0*/ @P2 STS [R11.X4+0x400], R2 ;
/*0200*/ ISETP.GT.U32.AND P2, PT, R0, R9, PT ;
/*0210*/ @!P2 BRA `(.L_x_1) ;
.L_x_0:
/*0220*/ @P0 IMAD.SHL.U32 R9, R9, 0x4, RZ ;
/*0230*/ BAR.SYNC 0x0 ;
/*0240*/ @P0 LDS R6, [R9+0x3fc] ;
/*0250*/ @P1 LDS R7, [R11.X4+-0x4] ;
/*0260*/ LDS R0, [R11.X4] ;
/*0270*/ IMAD.IADD R6, R7, 0x1, R6 ;
/*0280*/ IMAD.IADD R0, R0, 0x1, -R7 ;
/*0290*/ ISETP.GE.U32.AND P0, PT, R6, c[0x0][0x178], PT ;
/*02a0*/ ISETP.LT.OR P0, PT, R0, 0x1, P0 ;
/*02b0*/ @P0 EXIT ;
/*02c0*/ IMAD R2, R0, 0x1800, R3 ;
/*02d0*/ LEA.HI.SX32 R5, R3, c[0x0][0x178], 0x1e ;
/*02e0*/ IMNMX.U32 R12, R2, c[0x0][0x17c], PT ;
/*02f0*/ IMAD.IADD R7, R6, 0x1, R5 ;
.L_x_15:
/*0300*/ IMAD.MOV.U32 R5, RZ, RZ, 0x4 ;
/*0310*/ ISETP.GE.AND P0, PT, R3, R12, PT ;
/*0320*/ BSSY B0, `(.L_x_2) ;
/*0330*/ CS2R R8, SRZ ;
/*0340*/ IMAD.WIDE R4, R6, R5, c[0x0][0x170] ;
/*0350*/ STG.E [R4.64], R7 ;
/*0360*/ @P0 BRA `(.L_x_3) ;
/*0370*/ IMAD.MOV.U32 R10, RZ, RZ, R3 ;
/*0380*/ CS2R R8, SRZ ;
.L_x_8:
/*0390*/ IADD3 R4, P0, R10, c[0x0][0x160], RZ ;
/*03a0*/ LEA.HI.X.SX32 R5, R10, c[0x0][0x164], 0x1, P0 ;
/*03b0*/ LDG.E.U8 R11, [R4.64] ;
/*03c0*/ IADD3 R3, R10, 0x1, RZ ;
/*03d0*/ BSSY B1, `(.L_x_4) ;
/*03e0*/ ISETP.NE.AND P0, PT, R11, 0xff, PT ;
/*03f0*/ ISETP.GE.OR P0, PT, R3, R12, P0 ;
/*0400*/ @P0 BRA `(.L_x_5) ;
/*0410*/ LDG.E.U8 R5, [R4.64+0x1] ;
/*0420*/ BSSY B2, `(.L_x_6) ;
/*0430*/ ISETP.NE.AND P0, PT, R5.reuse, RZ, PT ;
/*0440*/ LOP3.LUT R2, R5, 0xf8, RZ, 0xc0, !PT ;
/*0450*/ ISETP.NE.AND P1, PT, R2, 0xd0, PT ;
/*0460*/ @!P0 BRA `(.L_x_7) ;
/*0470*/ IADD3 R8, R8, 0x1, RZ ;
/*0480*/ IMAD.SHL.U32 R2, R9, 0x100, RZ ;
/*0490*/ ISETP.NE.AND P0, PT, R8, 0x4, PT ;
/*04a0*/ LOP3.LUT R9, R2, R11, RZ, 0xfc, !PT ;
/*04b0*/ IMAD.MOV.U32 R11, RZ, RZ, R5 ;
/*04c0*/ ISETP.GE.U32.OR P0, PT, R7, c[0x0][0x180], P0 ;
/*04d0*/ @!P0 IMAD.MOV.U32 R4, RZ, RZ, 0x4 ;
/*04e0*/ @!P0 IMAD.MOV.U32 R11, RZ, RZ, R5 ;
/*04f0*/ @!P0 IMAD.WIDE R2, R7.reuse, R4, c[0x0][0x170] ;
/*0500*/ @!P0 IADD3 R4, R7, 0x1, RZ ;
/*0510*/ @!P0 IMAD.MOV.U32 R8, RZ, RZ, RZ ;
/*0520*/ @!P0 STG.E [R2.64], R9 ;
/*0530*/ @!P0 IMAD.MOV.U32 R7, RZ, RZ, R4 ;
.L_x_7:
/*0540*/ BSYNC B2 ;
.L_x_6:
/*0550*/ @!P1 BREAK B1 ;
/*0560*/ IADD3 R3, R10, 0x2, RZ ;
/*0570*/ @!P1 BRA `(.L_x_3) ;
.L_x_5:
/*0580*/ BSYNC B1 ;
.L_x_4:
/*0590*/ IADD3 R8, R8, 0x1, RZ ;
/*05a0*/ IMAD.SHL.U32 R2, R9, 0x100, RZ ;
/*05b0*/ ISETP.GE.AND P1, PT, R3, R12, PT ;
/*05c0*/ IMAD.MOV.U32 R10, RZ, RZ, R3 ;
/*05d0*/ ISETP.NE.AND P0, PT, R8, 0x4, PT ;
/*05e0*/ LOP3.LUT R9, R2, R11, RZ, 0xfc, !PT ;
/*05f0*/ ISETP.GE.U32.OR P0, PT, R7, c[0x0][0x180], P0 ;
/*0600*/ @!P0 IMAD.MOV.U32 R4, RZ, RZ, 0x4 ;
/*0610*/ @!P0 IADD3 R2, R7.reuse, 0x1, RZ ;
/*0620*/ @!P0 IMAD.MOV.U32 R8, RZ, RZ, RZ ;
/*0630*/ @!P0 IMAD.WIDE R4, R7, R4, c[0x0][0x170] ;
/*0640*/ @!P0 IMAD.MOV.U32 R7, RZ, RZ, R2 ;
/*0650*/ @!P0 STG.E [R4.64], R9 ;
/*0660*/ @!P1 BRA `(.L_x_8) ;
.L_x_3:
/*0670*/ BSYNC B0 ;
.L_x_2:
/*0680*/ ISETP.NE.AND P0, PT, R8, RZ, PT ;
/*0690*/ BSSY B0, `(.L_x_9) ;
/*06a0*/ ISETP.GE.U32.OR P0, PT, R7, c[0x0][0x180], !P0 ;
/*06b0*/ @P0 BRA `(.L_x_10) ;
/*06c0*/ ISETP.GT.U32.AND P0, PT, R8, 0x3, PT ;
/*06d0*/ BSSY B1, `(.L_x_11) ;
/*06e0*/ @P0 BRA `(.L_x_12) ;
/*06f0*/ IMAD.MOV R2, RZ, RZ, -R8 ;
/*0700*/ IADD3 R8, -R8, 0x3, RZ ;
/*0710*/ BSSY B2, `(.L_x_13) ;
/*0720*/ LOP3.LUT R2, R2, 0x3, RZ, 0xc0, !PT ;
/*0730*/ ISETP.GE.U32.AND P0, PT, R8, 0x3, PT ;
/*0740*/ ISETP.NE.AND P1, PT, R2, RZ, PT ;
/*0750*/ @!P1 BRA `(.L_x_14) ;
/*0760*/ ISETP.NE.AND P1, PT, R2, 0x1, PT ;
/*0770*/ LEA R9, R9, 0xff, 0x8 ;
/*0780*/ @!P1 BRA `(.L_x_14) ;
/*0790*/ ISETP.NE.AND P1, PT, R2, 0x2, PT ;
/*07a0*/ LEA R9, R9, 0xff, 0x8 ;
/*07b0*/ @P1 LEA R9, R9, 0xff, 0x8 ;
.L_x_14:
/*07c0*/ BSYNC B2 ;
.L_x_13:
/*07d0*/ @!P0 BRA `(.L_x_12) ;
/*07e0*/ IMAD.MOV.U32 R9, RZ, RZ, -0x1 ;
.L_x_12:
/*07f0*/ BSYNC B1 ;
.L_x_11:
/*0800*/ IMAD.MOV.U32 R4, RZ, RZ, 0x4 ;
/*0810*/ IMAD.WIDE R4, R7.reuse, R4, c[0x0][0x170] ;
/*0820*/ IADD3 R7, R7, 0x1, RZ ;
/*0830*/ STG.E [R4.64], R9 ;
.L_x_10:
/*0840*/ BSYNC B0 ;
.L_x_9:
/*0850*/ IADD3 R6, R6, 0x1, RZ ;
/*0860*/ ISETP.GT.AND P1, PT, R0, 0x1, PT ;
/*0870*/ ISETP.GE.U32.AND P0, PT, R6, c[0x0][0x178], PT ;
/*0880*/ IADD3 R0, R0, -0x1, RZ ;
/*0890*/ @!P0 BRA P1, `(.L_x_15) ;
/*08a0*/ EXIT ;
.L_x_16:
/*08b0*/ BRA `(.L_x_16);
/*08c0*/ NOP;
/*08d0*/ NOP;
/*08e0*/ NOP;
/*08f0*/ NOP;
/*0900*/ NOP;
/*0910*/ NOP;
/*0920*/ NOP;
/*0930*/ NOP;
/*0940*/ NOP;
/*0950*/ NOP;
/*0960*/ NOP;
/*0970*/ NOP;
.L_x_58:
jpegdec_vld
//--------------------- .text.jpegdec_vld --------------------------
.section .text.jpegdec_vld,"ax",@progbits
.sectioninfo @"SHI_REGISTERS=25"
.align 128
.global jpegdec_vld
.type jpegdec_vld,@function
.size jpegdec_vld,(.L_x_60 - jpegdec_vld)
.other jpegdec_vld,@"STO_CUDA_ENTRY STV_DEFAULT"
jpegdec_vld:
.text.jpegdec_vld:
/*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
/*0010*/ S2R R7, SR_CTAID.X ;
/*0020*/ S2R R0, SR_TID.X ;
/*0030*/ IMAD R7, R7, 0x80, R0 ;
/*0040*/ ISETP.GE.AND P0, PT, R7, c[0x0][0x184], PT ;
/*0050*/ @P0 EXIT ;
/*0060*/ S2R R9, SR_CTAID.Y ;
/*0070*/ IADD3 R0, R7, c[0x0][0x180], RZ ;
/*0080*/ IMAD R0, R9, c[0x0][0x184], R0 ;
/*0090*/ TLD.SCR.LZ RZ, R0, R0, 0x0, 0x68, 1D, 0x1 ;
/*00a0*/ IADD3 R3, R0, 0x1, RZ ;
/*00b0*/ TLD.SCR.LZ RZ, R5, R0, 0x0, 0x68, 1D, 0x1 ;
/*00c0*/ TLD.SCR.LZ RZ, R3, R3, 0x0, 0x68, 1D, 0x1 ;
/*00d0*/ UMOV UR4, 0xf ;
/*00e0*/ IMAD.MOV.U32 R6, RZ, RZ, RZ ;
/*00f0*/ ULDC.64 UR8, c[0x0][0x190] ;
/*0100*/ ULOP3.LUT UR5, UR4, UR8, URZ, 0xc0, !UPT ;
/*0110*/ ULDC.64 UR6, c[0x0][0x188] ;
/*0120*/ ULOP3.LUT UR4, UR4, UR7, URZ, 0xc0, !UPT ;
/*0130*/ UIADD3 UR7, UR5, 0x1, URZ ;
/*0140*/ UIADD3 UR5, UR4, 0x1, URZ ;
/*0150*/ UIMAD UR6, UR7, UR6, URZ ;
/*0160*/ IMAD R4, R9, UR7, RZ ;
/*0170*/ IMAD.U32 R14, RZ, RZ, UR5 ;
/*0180*/ USHF.L.U32 UR5, UR9, 0xc, URZ ;
/*0190*/ IMAD R9, R4, c[0x0][0x184], R7 ;
/*01a0*/ IADD3 R0, R0, 0x2, RZ ;
/*01b0*/ IMAD R7, R7, R14, RZ ;
/*01c0*/ ULOP3.LUT UR5, UR5, 0x70000, URZ, 0xc0, !UPT ;
/*01d0*/ IMAD R2, R14, R9, RZ ;
/*01e0*/ ULDC.64 UR8, c[0x0][0x118] ;
/*01f0*/ IMAD.SHL.U32 R9, R2, 0x40, RZ ;
/*0200*/ IMAD R2, R14, c[0x0][0x184], RZ ;
/*0210*/ IMAD R9, R2, UR6, R9 ;
/*0220*/ IMAD.MOV.U32 R2, RZ, RZ, RZ ;
/*0230*/ IADD3 R8, R9, 0x40, RZ ;
.L_x_32:
/*0240*/ UIADD3 UR7, UR7, -0x1, URZ ;
/*0250*/ IMAD.MOV.U32 R9, RZ, RZ, R14 ;
/*0260*/ IMAD.MOV.U32 R11, RZ, RZ, R7 ;
/*0270*/ ISETP.NE.AND P0, PT, RZ, UR7, PT ;
.L_x_31:
/*0280*/ IADD3 R10, -R2.reuse, 0x20, RZ ;
/*0290*/ ULDC UR6, c[0x0][0x194] ;
/*02a0*/ SHF.L.U32 R13, R5, R2, RZ ;
/*02b0*/ USHF.L.U32 UR6, UR6, 0x10, URZ ;
/*02c0*/ SHF.R.U32.HI R10, RZ, R10, R3 ;
/*02d0*/ ISETP.NE.AND P1, PT, R2, RZ, PT ;
/*02e0*/ ULOP3.LUT UR6, UR6, 0x70000, URZ, 0xc0, !UPT ;
/*02f0*/ LOP3.LUT R10, R10, R13, RZ, 0xfc, !PT ;
/*0300*/ SEL R14, R5, R10, !P1 ;
/*0310*/ LEA.HI R10, R14, UR6, RZ, 0x10 ;
/*0320*/ TLD.SCR.LZ RZ, R10, R10, 0x0, 0x66, 1D, 0x1 ;
/*0330*/ IMAD.MOV.U32 R12, RZ, RZ, R3 ;
/*0340*/ SHF.R.U32.HI R15, RZ, 0x8, R10 ;
/*0350*/ LOP3.LUT R13, R10, 0xff, RZ, 0xc0, !PT ;
/*0360*/ SGXT.U32 R15, R15, 0x8 ;
/*0370*/ IADD3 R2, R15, R13, R2 ;
/*0380*/ ISETP.GE.U32.AND P1, PT, R2, 0x20, PT ;
/*0390*/ @P1 TLD.SCR.LZ RZ, R12, R0, 0x0, 0x68, 1D, 0x1 ;
/*03a0*/ LOP3.LUT P2, RZ, R10, 0xff, RZ, 0xc0, !PT ;
/*03b0*/ BSSY B0, `(.L_x_25) ;
/*03c0*/ SHF.L.U32 R16, R14, R15, RZ ;
/*03d0*/ IMAD.MOV.U32 R15, RZ, RZ, 0x4 ;
/*03e0*/ @P1 LOP3.LUT R2, R2, 0x1f, RZ, 0xc0, !PT ;
/*03f0*/ @P1 IMAD.MOV.U32 R5, RZ, RZ, R3 ;
/*0400*/ @P2 LOP3.LUT R14, RZ, R16, RZ, 0x33, !PT ;
/*0410*/ @P2 IADD3 R13, -R13, 0x20, RZ ;
/*0420*/ @P2 SHF.R.S32.HI R14, RZ, 0x1f, R14 ;
/*0430*/ @P2 SHF.R.U32.HI R10, RZ, R13.reuse, R16 ;
/*0440*/ IMAD.MOV.U32 R16, RZ, RZ, 0x1 ;
/*0450*/ @P2 SHF.R.U32.HI R13, RZ, R13, R14 ;
/*0460*/ IMAD.WIDE R14, R8, R15, c[0x0][0x160] ;
/*0470*/ @P2 IADD3 R6, -R13, R10, R6 ;
/*0480*/ IMAD.MOV.U32 R10, RZ, RZ, R8 ;
/*0490*/ IADD3 R13, R8, 0x1, RZ ;
/*04a0*/ IMAD.SHL.U32 R17, R6, 0x40, RZ ;
/*04b0*/ @P1 IADD3 R0, R0, 0x1, RZ ;
/*04c0*/ IMAD.MOV.U32 R8, RZ, RZ, R13 ;
/*04d0*/ STG.E [R14.64], R17 ;
/*04e0*/ IMAD.MOV.U32 R3, RZ, RZ, R12 ;
.L_x_30:
/*04f0*/ IADD3 R13, -R2.reuse, 0x20, RZ ;
/*0500*/ SHF.L.U32 R14, R5, R2, RZ ;
/*0510*/ SHF.R.U32.HI R13, RZ, R13, R12 ;
/*0520*/ ISETP.NE.AND P1, PT, R2, RZ, PT ;
/*0530*/ LOP3.LUT R14, R13, R14, RZ, 0xfc, !PT ;
/*0540*/ SEL R14, R5, R14, !P1 ;
/*0550*/ LEA.HI R13, R14, UR5, RZ, 0x10 ;
/*0560*/ TLD.SCR.LZ RZ, R13, R13, 0x0, 0x66, 1D, 0x1 ;
/*0570*/ SHF.R.U32.HI R15, RZ, 0x8, R13 ;
/*0580*/ LOP3.LUT R17, R13, 0xf, RZ, 0xc0, !PT ;
/*0590*/ SGXT.U32 R15, R15, 0x8 ;
/*05a0*/ IADD3 R2, R15, R17, R2 ;
/*05b0*/ ISETP.GE.U32.AND P1, PT, R2, 0x20, PT ;
/*05c0*/ @P1 TLD.SCR.LZ RZ, R3, R0, 0x0, 0x68, 1D, 0x1 ;
/*05d0*/ LOP3.LUT R19, R13, 0xff, RZ, 0xc0, !PT ;
/*05e0*/ @P1 IMAD.MOV.U32 R5, RZ, RZ, R12 ;
/*05f0*/ SHF.L.U32 R14, R14, R15, RZ ;
/*0600*/ ISETP.NE.AND P2, PT, R19.reuse, RZ, PT ;
/*0610*/ LEA.HI R19, R19, R16, RZ, 0x1c ;
/*0620*/ @P1 LOP3.LUT R2, R2, 0x1f, RZ, 0xc0, !PT ;
/*0630*/ @P1 IADD3 R0, R0, 0x1, RZ ;
/*0640*/ @!P2 BRA `(.L_x_26) ;
/*0650*/ ISETP.GT.U32.AND P1, PT, R19, 0x3f, PT ;
/*0660*/ @P1 BRA `(.L_x_27) ;
/*0670*/ LOP3.LUT P1, RZ, R13, 0xf, RZ, 0xc0, !PT ;
/*0680*/ BSSY B1, `(.L_x_28) ;
/*0690*/ @!P1 BRA `(.L_x_29) ;
/*06a0*/ LOP3.LUT R12, RZ, R14, RZ, 0x33, !PT ;
/*06b0*/ IMAD.MOV.U32 R15, RZ, RZ, 0x4 ;
/*06c0*/ IADD3 R17, -R17, 0x20, RZ ;
/*06d0*/ SHF.R.S32.HI R12, RZ, 0x1f, R12 ;
/*06e0*/ SHF.R.U32.HI R13, RZ, R17.reuse, R14 ;
/*06f0*/ SHF.R.U32.HI R12, RZ, R17, R12 ;
/*0700*/ IMAD.IADD R14, R13, 0x1, -R12 ;
/*0710*/ IMAD.WIDE R12, R8.reuse, R15, c[0x0][0x160] ;
/*0720*/ IADD3 R8, R8, 0x1, RZ ;
/*0730*/ IMAD R15, R14, 0x40, R19 ;
/*0740*/ STG.E [R12.64], R15 ;
.L_x_29:
/*0750*/ BSYNC B1 ;
.L_x_28:
/*0760*/ IADD3 R16, R19, 0x1, RZ ;
/*0770*/ IMAD.MOV.U32 R12, RZ, RZ, R3 ;
/*0780*/ ISETP.GE.U32.AND P1, PT, R16, 0x40, PT ;
/*0790*/ @!P1 BRA `(.L_x_30) ;
/*07a0*/ BRA `(.L_x_26) ;
.L_x_27:
/*07b0*/ S2R R12, SR_LANEID ;
/*07c0*/ VOTEU.ALL UR6, UPT, PT ;
/*07d0*/ IMAD.MOV.U32 R13, RZ, RZ, c[0x0][0x17c] ;
/*07e0*/ UFLO.U32 UR10, UR6 ;
/*07f0*/ POPC R15, UR6 ;
/*0800*/ ISETP.EQ.U32.AND P1, PT, R12, UR10, PT ;
/*0810*/ IMAD.MOV.U32 R12, RZ, RZ, c[0x0][0x178] ;
/*0820*/ @P1 RED.E.ADD.STRONG.GPU [R12.64], R15 ;
.L_x_26:
/*0830*/ BSYNC B0 ;
.L_x_25:
/*0840*/ UIADD3 UR6, UR4, 0x1, URZ ;
/*0850*/ LOP3.LUT R13, RZ, R10, RZ, 0x33, !PT ;
/*0860*/ IMAD.MOV.U32 R17, RZ, RZ, 0x4 ;
/*0870*/ IADD3 R9, R9, -0x1, RZ ;
/*0880*/ IMAD.U32 R14, RZ, RZ, UR6 ;
/*0890*/ ISETP.NE.AND P1, PT, R9, RZ, PT ;
/*08a0*/ IMAD R13, R10, 0x40, R13 ;
/*08b0*/ IMAD R12, R14, c[0x0][0x184], RZ ;
/*08c0*/ IMAD.IADD R15, R8, 0x1, R13 ;
/*08d0*/ IMAD R12, R12, R4, R11 ;
/*08e0*/ IADD3 R11, R11, 0x1, RZ ;
/*08f0*/ IADD3 R12, R12, 0x40, RZ ;
/*0900*/ IMAD.WIDE R12, R12, R17, c[0x0][0x160] ;
/*0910*/ STG.E [R12.64], R15 ;
/*0920*/ @P1 BRA `(.L_x_31) ;
/*0930*/ IADD3 R4, R4, 0x1, RZ ;
/*0940*/ @P0 BRA `(.L_x_32) ;
/*0950*/ S2R R7, SR_CTAID.X ;
/*0960*/ S2UR UR4, SR_CTAID.Y ;
/*0970*/ IMAD.MOV.U32 R4, RZ, RZ, c[0x0][0x190] ;
/*0980*/ ULDC UR5, c[0x0][0x194] ;
/*0990*/ S2R R8, SR_TID.X ;
/*09a0*/ IMAD.MOV.U32 R6, RZ, RZ, c[0x0][0x18c] ;
/*09b0*/ SHF.R.U32.HI R4, RZ, 0x8, R4 ;
/*09c0*/ SHF.R.U32.HI R6, RZ, 0x8, R6 ;
/*09d0*/ SGXT.U32 R4, R4, 0x4 ;
/*09e0*/ IADD3 R4, R4, 0x1, RZ ;
/*09f0*/ IMAD R12, R7, 0x80, R8 ;
/*0a00*/ SGXT.U32 R7, R6, 0x4 ;
/*0a10*/ IMAD R6, R4, UR4, RZ ;
/*0a20*/ USHF.L.U32 UR4, UR5, 0x8, URZ ;
/*0a30*/ IADD3 R7, R7, 0x1, RZ ;
/*0a40*/ IMAD R8, R6, c[0x0][0x184], R12 ;
/*0a50*/ USHF.L.U32 UR5, UR5, 0x4, URZ ;
/*0a60*/ ULOP3.LUT UR4, UR4, 0x70000, URZ, 0xc0, !UPT ;
/*0a70*/ IMAD R8, R7.reuse, R8, RZ ;
/*0a80*/ ULOP3.LUT UR5, UR5, 0x70000, URZ, 0xc0, !UPT ;
/*0a90*/ IMAD R9, R7, c[0x0][0x184], RZ ;
/*0aa0*/ IMAD.SHL.U32 R10, R8, 0x40, RZ ;
/*0ab0*/ IMAD R8, R4, c[0x0][0x188], RZ ;
/*0ac0*/ IMAD R11, R9, R8, R10 ;
/*0ad0*/ IMAD.MOV.U32 R8, RZ, RZ, RZ ;
/*0ae0*/ IMAD R10, R12, R7, RZ ;
/*0af0*/ IADD3 R11, R11, 0x40, RZ ;
.L_x_40:
/*0b00*/ IADD3 R4, R4, -0x1, RZ ;
/*0b10*/ IMAD R12, R9, R6, 0x40 ;
/*0b20*/ IMAD.MOV.U32 R13, RZ, RZ, R7 ;
/*0b30*/ ISETP.NE.AND P0, PT, R4, RZ, PT ;
/*0b40*/ IMAD.MOV.U32 R15, RZ, RZ, R10 ;
.L_x_39:
/*0b50*/ IADD3 R14, -R2, 0x20, RZ ;
/*0b60*/ SHF.L.U32 R17, R5, R2, RZ ;
/*0b70*/ SHF.R.U32.HI R14, RZ, R14, R3 ;
/*0b80*/ ISETP.NE.AND P1, PT, R2, RZ, PT ;
/*0b90*/ LOP3.LUT R14, R14, R17, RZ, 0xfc, !PT ;
/*0ba0*/ SEL R17, R5, R14, !P1 ;
/*0bb0*/ LEA.HI R14, R17, UR4, RZ, 0x10 ;
/*0bc0*/ TLD.SCR.LZ RZ, R14, R14, 0x0, 0x66, 1D, 0x1 ;
/*0bd0*/ IMAD.MOV.U32 R16, RZ, RZ, R3 ;
/*0be0*/ SHF.R.U32.HI R18, RZ, 0x8, R14 ;
/*0bf0*/ LOP3.LUT R19, R14, 0xff, RZ, 0xc0, !PT ;
/*0c00*/ SGXT.U32 R18, R18, 0x8 ;
/*0c10*/ IADD3 R2, R18, R19, R2 ;
/*0c20*/ ISETP.GE.U32.AND P1, PT, R2, 0x20, PT ;
/*0c30*/ @P1 TLD.SCR.LZ RZ, R16, R0, 0x0, 0x68, 1D, 0x1 ;
/*0c40*/ ISETP.NE.AND P2, PT, R19, RZ, PT ;
/*0c50*/ BSSY B0, `(.L_x_33) ;
/*0c60*/ SHF.L.U32 R21, R17, R18, RZ ;
/*0c70*/ @P1 IMAD.MOV.U32 R5, RZ, RZ, R3 ;
/*0c80*/ @P1 LOP3.LUT R2, R2, 0x1f, RZ, 0xc0, !PT ;
/*0c90*/ IMAD.MOV.U32 R20, RZ, RZ, 0x1 ;
/*0ca0*/ @P2 LOP3.LUT R17, RZ, R21, RZ, 0x33, !PT ;
/*0cb0*/ @P2 IADD3 R14, -R19, 0x20, RZ ;
/*0cc0*/ @P2 SHF.R.S32.HI R17, RZ, 0x1f, R17 ;
/*0cd0*/ @P2 SHF.R.U32.HI R18, RZ, R14.reuse, R21 ;
/*0ce0*/ @P2 SHF.R.U32.HI R17, RZ, R14, R17 ;
/*0cf0*/ IMAD.MOV.U32 R14, RZ, RZ, 0x4 ;
/*0d00*/ @P2 IADD3 R8, -R17, R18, R8 ;
/*0d10*/ IMAD.WIDE R18, R11.reuse, R14, c[0x0][0x168] ;
/*0d20*/ IADD3 R17, R11, 0x1, RZ ;
/*0d30*/ IMAD.SHL.U32 R21, R8, 0x40, RZ ;
/*0d40*/ @P1 IADD3 R0, R0, 0x1, RZ ;
/*0d50*/ IMAD.MOV.U32 R14, RZ, RZ, R11 ;
/*0d60*/ IMAD.MOV.U32 R11, RZ, RZ, R17 ;
/*0d70*/ STG.E [R18.64], R21 ;
/*0d80*/ IMAD.MOV.U32 R3, RZ, RZ, R16 ;
.L_x_38:
/*0d90*/ IADD3 R17, -R2, 0x20, RZ ;
/*0da0*/ SHF.L.U32 R18, R5, R2, RZ ;
/*0db0*/ SHF.R.U32.HI R17, RZ, R17, R16 ;
/*0dc0*/ ISETP.NE.AND P1, PT, R2, RZ, PT ;
/*0dd0*/ LOP3.LUT R18, R17, R18, RZ, 0xfc, !PT ;
/*0de0*/ SEL R18, R5, R18, !P1 ;
/*0df0*/ LEA.HI R17, R18, UR5, RZ, 0x10 ;
/*0e00*/ TLD.SCR.LZ RZ, R17, R17, 0x0, 0x66, 1D, 0x1 ;
/*0e10*/ SHF.R.U32.HI R19, RZ, 0x8, R17 ;
/*0e20*/ LOP3.LUT R22, R17, 0xf, RZ, 0xc0, !PT ;
/*0e30*/ SGXT.U32 R19, R19, 0x8 ;
/*0e40*/ IADD3 R2, R19, R22, R2 ;
/*0e50*/ ISETP.GE.U32.AND P1, PT, R2, 0x20, PT ;
/*0e60*/ @P1 TLD.SCR.LZ RZ, R3, R0, 0x0, 0x68, 1D, 0x1 ;
/*0e70*/ LOP3.LUT R21, R17, 0xff, RZ, 0xc0, !PT ;
/*0e80*/ @P1 IMAD.MOV.U32 R5, RZ, RZ, R16 ;
/*0e90*/ SHF.L.U32 R18, R18, R19, RZ ;
/*0ea0*/ ISETP.NE.AND P2, PT, R21.reuse, RZ, PT ;
/*0eb0*/ LEA.HI R20, R21, R20, RZ, 0x1c ;
/*0ec0*/ @P1 LOP3.LUT R2, R2, 0x1f, RZ, 0xc0, !PT ;
/*0ed0*/ @P1 IADD3 R0, R0, 0x1, RZ ;
/*0ee0*/ @!P2 BRA `(.L_x_34) ;
/*0ef0*/ ISETP.GT.U32.AND P1, PT, R20, 0x3f, PT ;
/*0f00*/ @P1 BRA `(.L_x_35) ;
/*0f10*/ ISETP.NE.AND P1, PT, R22, RZ, PT ;
/*0f20*/ BSSY B1, `(.L_x_36) ;
/*0f30*/ @!P1 BRA `(.L_x_37) ;
/*0f40*/ LOP3.LUT R16, RZ, R18, RZ, 0x33, !PT ;
/*0f50*/ IADD3 R17, -R22, 0x20, RZ ;
/*0f60*/ SHF.R.S32.HI R16, RZ, 0x1f, R16 ;
/*0f70*/ SHF.R.U32.HI R19, RZ, R17.reuse, R18 ;
/*0f80*/ IMAD.MOV.U32 R18, RZ, RZ, 0x4 ;
/*0f90*/ SHF.R.U32.HI R16, RZ, R17, R16 ;
/*0fa0*/ IMAD.IADD R19, R19, 0x1, -R16 ;
/*0fb0*/ IMAD.WIDE R16, R11.reuse, R18, c[0x0][0x168] ;
/*0fc0*/ IADD3 R11, R11, 0x1, RZ ;
/*0fd0*/ IMAD R19, R19, 0x40, R20 ;
/*0fe0*/ STG.E [R16.64], R19 ;
.L_x_37:
/*0ff0*/ BSYNC B1 ;
.L_x_36:
/*1000*/ IADD3 R20, R20, 0x1, RZ ;
/*1010*/ IMAD.MOV.U32 R16, RZ, RZ, R3 ;
/*1020*/ ISETP.GE.U32.AND P1, PT, R20, 0x40, PT ;
/*1030*/ @!P1 BRA `(.L_x_38) ;
/*1040*/ BRA `(.L_x_34) ;
.L_x_35:
/*1050*/ S2R R16, SR_LANEID ;
/*1060*/ VOTEU.ALL UR6, UPT, PT ;
/*1070*/ IMAD.MOV.U32 R17, RZ, RZ, c[0x0][0x17c] ;
/*1080*/ UFLO.U32 UR7, UR6 ;
/*1090*/ POPC R19, UR6 ;
/*10a0*/ ISETP.EQ.U32.AND P1, PT, R16, UR7, PT ;
/*10b0*/ IMAD.MOV.U32 R16, RZ, RZ, c[0x0][0x178] ;
/*10c0*/ @P1 RED.E.ADD.STRONG.GPU [R16.64], R19 ;
.L_x_34:
/*10d0*/ BSYNC B0 ;
.L_x_33:
/*10e0*/ LOP3.LUT R17, RZ, R14, RZ, 0x33, !PT ;
/*10f0*/ IMAD.IADD R16, R12, 0x1, R15 ;
/*1100*/ IADD3 R13, R13, -0x1, RZ ;
/*1110*/ IMAD.MOV.U32 R19, RZ, RZ, 0x4 ;
/*1120*/ IADD3 R15, R15, 0x1, RZ ;
/*1130*/ IMAD R14, R14, 0x40, R17 ;
/*1140*/ ISETP.NE.AND P1, PT, R13, RZ, PT ;
/*1150*/ IMAD.WIDE R16, R16, R19, c[0x0][0x168] ;
/*1160*/ IMAD.IADD R19, R11, 0x1, R14 ;
/*1170*/ STG.E [R16.64], R19 ;
/*1180*/ @P1 BRA `(.L_x_39) ;
/*1190*/ IADD3 R6, R6, 0x1, RZ ;
/*11a0*/ @P0 BRA `(.L_x_40) ;
/*11b0*/ S2R R6, SR_CTAID.X ;
/*11c0*/ S2UR UR4, SR_CTAID.Y ;
/*11d0*/ IMAD.MOV.U32 R4, RZ, RZ, c[0x0][0x190] ;
/*11e0*/ S2R R7, SR_TID.X ;
/*11f0*/ IMAD.MOV.U32 R8, RZ, RZ, c[0x0][0x18c] ;
/*1200*/ SHF.R.U32.HI R4, RZ, 0x10, R4 ;
/*1210*/ SHF.R.U32.HI R8, RZ, 0x10, R8 ;
/*1220*/ SGXT.U32 R4, R4, 0x4 ;
/*1230*/ IADD3 R4, R4, 0x1, RZ ;
/*1240*/ IMAD R12, R6, 0x80, R7 ;
/*1250*/ SGXT.U32 R7, R8, 0x4 ;
/*1260*/ IMAD R6, R4, UR4, RZ ;
/*1270*/ ULDC UR4, c[0x0][0x194] ;
/*1280*/ IADD3 R7, R7, 0x1, RZ ;
/*1290*/ IMAD R8, R6, c[0x0][0x184], R12 ;
/*12a0*/ USHF.R.U32.HI UR5, URZ, 0x4, UR4 ;
/*12b0*/ ULOP3.LUT UR4, UR4, 0x70000, URZ, 0xc0, !UPT ;
/*12c0*/ IMAD R8, R7.reuse, R8, RZ ;
/*12d0*/ ULOP3.LUT UR5, UR5, 0x70000, URZ, 0xc0, !UPT ;
/*12e0*/ IMAD R9, R7, c[0x0][0x184], RZ ;
/*12f0*/ IMAD.SHL.U32 R10, R8, 0x40, RZ ;
/*1300*/ IMAD R8, R4, c[0x0][0x188], RZ ;
/*1310*/ IMAD R11, R9, R8, R10 ;
/*1320*/ IMAD.MOV.U32 R8, RZ, RZ, RZ ;
/*1330*/ IMAD R10, R12, R7, RZ ;
/*1340*/ IADD3 R11, R11, 0x40, RZ ;
.L_x_48:
/*1350*/ IADD3 R4, R4, -0x1, RZ ;
/*1360*/ IMAD R12, R9, R6, 0x40 ;
/*1370*/ IMAD.MOV.U32 R13, RZ, RZ, R7 ;
/*1380*/ ISETP.NE.AND P0, PT, R4, RZ, PT ;
/*1390*/ IMAD.MOV.U32 R15, RZ, RZ, R10 ;
.L_x_47:
/*13a0*/ IADD3 R14, -R2, 0x20, RZ ;
/*13b0*/ SHF.L.U32 R17, R5, R2, RZ ;
/*13c0*/ SHF.R.U32.HI R14, RZ, R14, R3 ;
/*13d0*/ ISETP.NE.AND P1, PT, R2, RZ, PT ;
/*13e0*/ LOP3.LUT R14, R14, R17, RZ, 0xfc, !PT ;
/*13f0*/ SEL R17, R5, R14, !P1 ;
/*1400*/ LEA.HI R14, R17, UR4, RZ, 0x10 ;
/*1410*/ TLD.SCR.LZ RZ, R14, R14, 0x0, 0x66, 1D, 0x1 ;
/*1420*/ IMAD.MOV.U32 R16, RZ, RZ, R3 ;
/*1430*/ SHF.R.U32.HI R18, RZ, 0x8, R14 ;
/*1440*/ LOP3.LUT R19, R14, 0xff, RZ, 0xc0, !PT ;
/*1450*/ SGXT.U32 R18, R18, 0x8 ;
/*1460*/ IADD3 R2, R18, R19, R2 ;
/*1470*/ ISETP.GE.U32.AND P1, PT, R2, 0x20, PT ;
/*1480*/ @P1 TLD.SCR.LZ RZ, R16, R0, 0x0, 0x68, 1D, 0x1 ;
/*1490*/ ISETP.NE.AND P2, PT, R19, RZ, PT ;
/*14a0*/ BSSY B0, `(.L_x_41) ;
/*14b0*/ SHF.L.U32 R21, R17, R18, RZ ;
/*14c0*/ @P1 IMAD.MOV.U32 R5, RZ, RZ, R3 ;
/*14d0*/ @P1 LOP3.LUT R2, R2, 0x1f, RZ, 0xc0, !PT ;
/*14e0*/ IMAD.MOV.U32 R20, RZ, RZ, 0x1 ;
/*14f0*/ @P2 LOP3.LUT R17, RZ, R21, RZ, 0x33, !PT ;
/*1500*/ @P2 IADD3 R14, -R19, 0x20, RZ ;
/*1510*/ @P2 SHF.R.S32.HI R17, RZ, 0x1f, R17 ;
/*1520*/ @P2 SHF.R.U32.HI R18, RZ, R14.reuse, R21 ;
/*1530*/ @P2 SHF.R.U32.HI R17, RZ, R14, R17 ;
/*1540*/ IMAD.MOV.U32 R14, RZ, RZ, 0x4 ;
/*1550*/ @P2 IADD3 R8, -R17, R18, R8 ;
/*1560*/ IMAD.WIDE R18, R11.reuse, R14, c[0x0][0x170] ;
/*1570*/ IADD3 R17, R11, 0x1, RZ ;
/*1580*/ IMAD.SHL.U32 R21, R8, 0x40, RZ ;
/*1590*/ @P1 IADD3 R0, R0, 0x1, RZ ;
/*15a0*/ IMAD.MOV.U32 R14, RZ, RZ, R11 ;
/*15b0*/ IMAD.MOV.U32 R11, RZ, RZ, R17 ;
/*15c0*/ STG.E [R18.64], R21 ;
/*15d0*/ IMAD.MOV.U32 R3, RZ, RZ, R16 ;
.L_x_46:
/*15e0*/ IADD3 R17, -R2, 0x20, RZ ;
/*15f0*/ SHF.L.U32 R18, R5, R2, RZ ;
/*1600*/ SHF.R.U32.HI R17, RZ, R17, R16 ;
/*1610*/ ISETP.NE.AND P1, PT, R2, RZ, PT ;
/*1620*/ LOP3.LUT R18, R17, R18, RZ, 0xfc, !PT ;
/*1630*/ SEL R18, R5, R18, !P1 ;
/*1640*/ LEA.HI R17, R18, UR5, RZ, 0x10 ;
/*1650*/ TLD.SCR.LZ RZ, R17, R17, 0x0, 0x66, 1D, 0x1 ;
/*1660*/ SHF.R.U32.HI R19, RZ, 0x8, R17 ;
/*1670*/ LOP3.LUT R22, R17, 0xf, RZ, 0xc0, !PT ;
/*1680*/ SGXT.U32 R19, R19, 0x8 ;
/*1690*/ IADD3 R2, R19, R22, R2 ;
/*16a0*/ ISETP.GE.U32.AND P1, PT, R2, 0x20, PT ;
/*16b0*/ @P1 TLD.SCR.LZ RZ, R3, R0, 0x0, 0x68, 1D, 0x1 ;
/*16c0*/ LOP3.LUT R21, R17, 0xff, RZ, 0xc0, !PT ;
/*16d0*/ @P1 IMAD.MOV.U32 R5, RZ, RZ, R16 ;
/*16e0*/ SHF.L.U32 R18, R18, R19, RZ ;
/*16f0*/ ISETP.NE.AND P2, PT, R21.reuse, RZ, PT ;
/*1700*/ LEA.HI R20, R21, R20, RZ, 0x1c ;
/*1710*/ @P1 LOP3.LUT R2, R2, 0x1f, RZ, 0xc0, !PT ;
/*1720*/ @P1 IADD3 R0, R0, 0x1, RZ ;
/*1730*/ @!P2 BRA `(.L_x_42) ;
/*1740*/ ISETP.GT.U32.AND P1, PT, R20, 0x3f, PT ;
/*1750*/ @P1 BRA `(.L_x_43) ;
/*1760*/ ISETP.NE.AND P1, PT, R22, RZ, PT ;
/*1770*/ BSSY B1, `(.L_x_44) ;
/*1780*/ @!P1 BRA `(.L_x_45) ;
/*1790*/ LOP3.LUT R16, RZ, R18, RZ, 0x33, !PT ;
/*17a0*/ IADD3 R17, -R22, 0x20, RZ ;
/*17b0*/ SHF.R.S32.HI R16, RZ, 0x1f, R16 ;
/*17c0*/ SHF.R.U32.HI R19, RZ, R17.reuse, R18 ;
/*17d0*/ IMAD.MOV.U32 R18, RZ, RZ, 0x4 ;
/*17e0*/ SHF.R.U32.HI R16, RZ, R17, R16 ;
/*17f0*/ IMAD.IADD R19, R19, 0x1, -R16 ;
/*1800*/ IMAD.WIDE R16, R11.reuse, R18, c[0x0][0x170] ;
/*1810*/ IADD3 R11, R11, 0x1, RZ ;
/*1820*/ IMAD R19, R19, 0x40, R20 ;
/*1830*/ STG.E [R16.64], R19 ;
.L_x_45:
/*1840*/ BSYNC B1 ;
.L_x_44:
/*1850*/ IADD3 R20, R20, 0x1, RZ ;
/*1860*/ IMAD.MOV.U32 R16, RZ, RZ, R3 ;
/*1870*/ ISETP.GE.U32.AND P1, PT, R20, 0x40, PT ;
/*1880*/ @!P1 BRA `(.L_x_46) ;
/*1890*/ BRA `(.L_x_42) ;
.L_x_43:
/*18a0*/ S2R R16, SR_LANEID ;
/*18b0*/ VOTEU.ALL UR6, UPT, PT ;
/*18c0*/ IMAD.MOV.U32 R17, RZ, RZ, c[0x0][0x17c] ;
/*18d0*/ UFLO.U32 UR7, UR6 ;
/*18e0*/ POPC R19, UR6 ;
/*18f0*/ ISETP.EQ.U32.AND P1, PT, R16, UR7, PT ;
/*1900*/ IMAD.MOV.U32 R16, RZ, RZ, c[0x0][0x178] ;
/*1910*/ @P1 RED.E.ADD.STRONG.GPU [R16.64], R19 ;
.L_x_42:
/*1920*/ BSYNC B0 ;
.L_x_41:
/*1930*/ LOP3.LUT R17, RZ, R14, RZ, 0x33, !PT ;
/*1940*/ IMAD.IADD R16, R12, 0x1, R15 ;
/*1950*/ IADD3 R13, R13, -0x1, RZ ;
/*1960*/ IMAD.MOV.U32 R19, RZ, RZ, 0x4 ;
/*1970*/ IADD3 R15, R15, 0x1, RZ ;
/*1980*/ IMAD R14, R14, 0x40, R17 ;
/*1990*/ ISETP.NE.AND P1, PT, R13, RZ, PT ;
/*19a0*/ IMAD.WIDE R16, R16, R19, c[0x0][0x170] ;
/*19b0*/ IMAD.IADD R19, R11, 0x1, R14 ;
/*19c0*/ STG.E [R16.64], R19 ;
/*19d0*/ @P1 BRA `(.L_x_47) ;
/*19e0*/ IADD3 R6, R6, 0x1, RZ ;
/*19f0*/ @P0 BRA `(.L_x_48) ;
/*1a00*/ EXIT ;
.L_x_49:
/*1a10*/ BRA `(.L_x_49);
/*1a20*/ NOP;
/*1a30*/ NOP;
/*1a40*/ NOP;
/*1a50*/ NOP;
/*1a60*/ NOP;
/*1a70*/ NOP;
/*1a80*/ NOP;
/*1a90*/ NOP;
/*1aa0*/ NOP;
/*1ab0*/ NOP;
/*1ac0*/ NOP;
/*1ad0*/ NOP;
/*1ae0*/ NOP;
/*1af0*/ NOP;
.L_x_60:
jpegdec_idct
//--------------------- .text.jpegdec_idct --------------------------
.section .text.jpegdec_idct,"ax",@progbits
.sectionflags @"SHF_BARRIERS=1"
.sectioninfo @"SHI_REGISTERS=22"
.align 128
.global jpegdec_idct
.type jpegdec_idct,@function
.size jpegdec_idct,(.L_x_61 - jpegdec_idct)
.other jpegdec_idct,@"STO_CUDA_ENTRY STV_DEFAULT"
jpegdec_idct:
.text.jpegdec_idct:
/*0000*/ MOV R1, c[0x0][0x28] ;
/*0010*/ S2R R6, SR_TID.X ;
/*0020*/ MOV R4, 0x24 ;
/*0030*/ BSSY B0, `(.L_x_50) ;
/*0040*/ LOP3.LUT R0, R6.reuse, 0x7, RZ, 0xc0, !PT ;
/*0050*/ LOP3.LUT R5, R6, 0xfffffff8, RZ, 0xc0, !PT ;
/*0060*/ ISETP.NE.AND P0, PT, R0, RZ, PT ;
/*0070*/ IMAD R3, R5, R4, 0x20 ;
/*0080*/ LEA R2, R0, R3, 0x2 ;
/*0090*/ SHF.R.U32.HI R3, RZ, 0x3, R6 ;
/*00a0*/ @P0 BRA `(.L_x_51) ;
/*00b0*/ S2R R6, SR_CTAID.X ;
/*00c0*/ S2UR UR4, SR_CTAID.Y ;
/*00d0*/ MOV R8, c[0x0][0x174] ;
/*00e0*/ LEA R7, R6, R3, 0x3 ;
/*00f0*/ USGXT UR4, UR4, 0x18 ;
/*0100*/ SGXT R6, R8, 0x18 ;
/*0110*/ IADD3 R7, R7, 0x40, RZ ;
/*0120*/ IMAD R6, R6, UR4, R7 ;
/*0130*/ TLD.SCR.LZ RZ, R6, R6, 0x0, 0x5e, 1D, 0x1 ;
/*0140*/ STS [R3.X4], R6 ;
.L_x_51:
/*0150*/ BSYNC B0 ;
.L_x_50:
/*0160*/ STS [R2], RZ ;
/*0170*/ BSSY B0, `(.L_x_52) ;
/*0180*/ STS [R2+0x24], RZ ;
/*0190*/ STS [R2+0x48], RZ ;
/*01a0*/ STS [R2+0x6c], RZ ;
/*01b0*/ STS [R2+0x90], RZ ;
/*01c0*/ STS [R2+0xb4], RZ ;
/*01d0*/ STS [R2+0xd8], RZ ;
/*01e0*/ STS [R2+0xfc], RZ ;
/*01f0*/ BAR.SYNC 0x0 ;
/*0200*/ LDS R6, [R3.X4] ;
/*0210*/ LOP3.LUT R7, R6, 0x3f, RZ, 0xc0, !PT ;
/*0220*/ SHF.R.U32.HI R9, RZ, 0x6, R6 ;
/*0230*/ ISETP.GT.U32.AND P0, PT, R0, R7, PT ;
/*0240*/ ISETP.EQ.OR P0, PT, R9, RZ, P0 ;
/*0250*/ @P0 BRA `(.L_x_53) ;
/*0260*/ ISETP.GE.U32.AND P2, PT, R0.reuse, R7, PT ;
/*0270*/ IADD3 R9, R0, R9, RZ ;
/*0280*/ @P2 TLD.SCR.LZ RZ, R6, R9, 0x0, 0x5e, 1D, 0x1 ;
/*0290*/ @P2 LOP3.LUT R10, R6, 0x3f, RZ, 0xc0, !PT ;
/*02a0*/ @P2 TLD.SCR.LZ RZ, R8, R10, 0x0, 0x5e, 1D, 0x1 ;
/*02b0*/ @P2 LDC.U8 R12, c[0x3][R10] ;
/*02c0*/ @P2 SHF.R.S32.HI R11, RZ, 0x6, R6 ;
/*02d0*/ BSSY B1, `(.L_x_54) ;
/*02e0*/ MOV R6, R0 ;
/*02f0*/ PLOP3.LUT P0, PT, P2, PT, PT, 0x8, 0x0 ;
/*0300*/ @P2 IADD3 R6, R6, 0x8, RZ ;
/*0310*/ @P2 IADD3 R9, R9, 0x8, RZ ;
/*0320*/ ISETP.GT.U32.AND P1, PT, R7, R6, PT ;
/*0330*/ @P2 LEA.HI R13, R12.reuse, R5, RZ, 0x1d ;
/*0340*/ @P2 SHF.L.U32 R12, R12, 0x2, RZ ;
/*0350*/ @P2 IMAD R13, R13, R4, 0x20 ;
/*0360*/ @P2 LOP3.LUT R12, R12, 0x1c, RZ, 0xc0, !PT ;
/*0370*/ @P2 IADD3 R12, R13, R12, RZ ;
/*0380*/ @P2 IMAD R8, R8, R11, RZ ;
/*0390*/ @P2 I2FP.F32.S32 R11, R8 ;
/*03a0*/ IADD3 R8, R7, -R6, RZ ;
/*03b0*/ @P2 FMUL R11, R11, 0.00390625 ;
/*03c0*/ ISETP.LE.U32.OR P1, PT, R8, 0x7, !P1 ;
/*03d0*/ @P2 STS [R12], R11 ;
/*03e0*/ @P1 BRA `(.L_x_55) ;
/*03f0*/ PLOP3.LUT P0, PT, PT, PT, PT, 0x8, 0x0 ;
/*0400*/ IADD3 R19, R7, -0x7, RZ ;
.L_x_56:
/*0410*/ IADD3 R14, R9, 0x8, RZ ;
/*0420*/ TLD.SCR.LZ RZ, R8, R9, 0x0, 0x5e, 1D, 0x1 ;
/*0430*/ TLD.SCR.LZ RZ, R14, R14, 0x0, 0x5e, 1D, 0x1 ;
/*0440*/ LOP3.LUT R10, R8, 0x3f, RZ, 0xc0, !PT ;
/*0450*/ LOP3.LUT R15, R14, 0x3f, RZ, 0xc0, !PT ;
/*0460*/ TLD.SCR.LZ RZ, R11, R10, 0x0, 0x5e, 1D, 0x1 ;
/*0470*/ TLD.SCR.LZ RZ, R16, R15, 0x0, 0x5e, 1D, 0x1 ;
/*0480*/ LDC.U8 R12, c[0x3][R10] ;
/*0490*/ SHF.R.S32.HI R8, RZ, 0x6, R8 ;
/*04a0*/ SHF.R.S32.HI R17, RZ, 0x6, R14 ;
/*04b0*/ IADD3 R6, R6, 0x10, RZ ;
/*04c0*/ IADD3 R9, R9, 0x10, RZ ;
/*04d0*/ ISETP.GE.U32.AND P1, PT, R6, R19, PT ;
/*04e0*/ LEA.HI R13, R12, R5, RZ, 0x1d ;
/*04f0*/ SHF.L.U32 R12, R12, 0x2, RZ ;
/*0500*/ IMAD R13, R13, R4, 0x20 ;
/*0510*/ LOP3.LUT R12, R12, 0x1c, RZ, 0xc0, !PT ;
/*0520*/ IADD3 R13, R13, R12, RZ ;
/*0530*/ LDC.U8 R12, c[0x3][R15] ;
/*0540*/ IMAD R8, R11, R8, RZ ;
/*0550*/ LEA.HI R11, R12.reuse, R5, RZ, 0x1d ;
/*0560*/ IMAD.SHL.U32 R12, R12, 0x4, RZ ;
/*0570*/ IMAD R16, R16, R17, RZ ;
/*0580*/ I2FP.F32.S32 R8, R8 ;
/*0590*/ IMAD R11, R11, R4, 0x20 ;
/*05a0*/ LOP3.LUT R12, R12, 0x1c, RZ, 0xc0, !PT ;
/*05b0*/ I2FP.F32.S32 R16, R16 ;
/*05c0*/ FMUL R8, R8, 0.00390625 ;
/*05d0*/ IADD3 R11, R11, R12, RZ ;
/*05e0*/ FMUL R16, R16, 0.00390625 ;
/*05f0*/ STS [R13], R8 ;
/*0600*/ STS [R11], R16 ;
/*0610*/ @!P1 BRA `(.L_x_56) ;
.L_x_55:
/*0620*/ BSYNC B1 ;
.L_x_54:
/*0630*/ ISETP.LE.U32.OR P0, PT, R6, R7, P0 ;
/*0640*/ @!P0 BRA `(.L_x_53) ;
/*0650*/ TLD.SCR.LZ RZ, R9, R9, 0x0, 0x5e, 1D, 0x1 ;
/*0660*/ LOP3.LUT R6, R9, 0x3f, RZ, 0xc0, !PT ;
/*0670*/ TLD.SCR.LZ RZ, R7, R6, 0x0, 0x5e, 1D, 0x1 ;
/*0680*/ LDC.U8 R8, c[0x3][R6] ;
/*0690*/ SHF.R.S32.HI R10, RZ, 0x6, R9 ;
/*06a0*/ LEA.HI R11, R8.reuse, R5, RZ, 0x1d ;
/*06b0*/ SHF.L.U32 R8, R8, 0x2, RZ ;
/*06c0*/ IMAD R11, R11, R4, 0x20 ;
/*06d0*/ LOP3.LUT R8, R8, 0x1c, RZ, 0xc0, !PT ;
/*06e0*/ IADD3 R8, R11, R8, RZ ;
/*06f0*/ IMAD R7, R7, R10, RZ ;
/*0700*/ I2FP.F32.S32 R7, R7 ;
/*0710*/ FMUL R7, R7, 0.00390625 ;
/*0720*/ STS [R8], R7 ;
.L_x_53:
/*0730*/ BSYNC B0 ;
.L_x_52:
/*0740*/ IADD3 R5, R0.reuse, R5, RZ ;
/*0750*/ BAR.SYNC 0x0 ;
/*0760*/ IMAD R15, R5, 0x24, RZ ;
/*0770*/ LEA R18, R0, R3, 0x3 ;
/*0780*/ LDS R4, [R15+0x28] ;
/*0790*/ LDS R5, [R15+0x38] ;
/*07a0*/ LDS R10, [R15+0x34] ;
/*07b0*/ LDS R11, [R15+0x3c] ;
/*07c0*/ LDS R6, [R15+0x20] ;
/*07d0*/ LDS R7, [R15+0x30] ;
/*07e0*/ LDS R13, [R15+0x2c] ;
/*07f0*/ LDS R14, [R15+0x24] ;
/*0800*/ FADD R4, R4, -R5 ;
/*0810*/ FFMA R9, R5.reuse, 1.4142134189605712891, R4.reuse ;
/*0820*/ FFMA R4, R5, 1.4142134189605712891, -R4 ;
/*0830*/ FADD R12, R10, -R11 ;
/*0840*/ FMUL R9, R9, 0.46193981170654296875 ;
/*0850*/ FADD R8, R6.reuse, -R7.reuse ;
/*0860*/ FADD R6, R6, R7 ;
/*0870*/ FADD R5, R12, -R13.reuse ;
/*0880*/ FADD R10, -R10, R13 ;
/*0890*/ FFMA R7, R6.reuse, 0.35355341434478759766, -R9.reuse ;
/*08a0*/ FFMA R6, R6, 0.35355341434478759766, R9 ;
/*08b0*/ FADD R9, R5, R14 ;
/*08c0*/ FFMA R13, R10.reuse, 0.70710682868957519531, -R11.reuse ;
/*08d0*/ FFMA R10, R10, 0.70710682868957519531, R11 ;
/*08e0*/ FMUL R5, R4, 0.19134169816970825195 ;
/*08f0*/ FFMA R11, R12.reuse, 1.4142134189605712891, R9.reuse ;
/*0900*/ FFMA R12, R12, -1.4142134189605712891, R9 ;
/*0910*/ FFMA R4, R8, 0.35355341434478759766, -R5 ;
/*0920*/ FFMA R14, R10, -2.6131260395050048828, R11 ;
/*0930*/ FFMA R5, R8, 0.35355341434478759766, R5 ;
/*0940*/ FFMA R8, R13.reuse, -1.0823919773101806641, R12.reuse ;
/*0950*/ FFMA R11, R10, 2.6131260395050048828, R11 ;
/*0960*/ FFMA R9, R14.reuse, 0.097545139491558074951, R7.reuse ;
/*0970*/ FFMA R14, R14, -0.097545139491558074951, R7 ;
/*0980*/ FFMA R13, R13, 1.0823919773101806641, R12 ;
/*0990*/ FFMA R7, R8.reuse, 0.27778509259223937988, R5.reuse ;
/*09a0*/ FFMA R8, R8, -0.27778509259223937988, R5 ;
/*09b0*/ FFMA R5, R11.reuse, 0.49039259552955627441, R6.reuse ;
/*09c0*/ FFMA R11, R11, -0.49039259552955627441, R6 ;
/*09d0*/ FFMA R6, R13.reuse, 0.41573479771614074707, R4.reuse ;
/*09e0*/ FFMA R4, R13, -0.41573479771614074707, R4 ;
/*09f0*/ STS [R15+0x2c], R9 ;
/*0a00*/ STS [R15+0x30], R14 ;
/*0a10*/ STS [R15+0x28], R7 ;
/*0a20*/ STS [R15+0x34], R8 ;
/*0a30*/ STS [R15+0x20], R5 ;
/*0a40*/ STS [R15+0x24], R6 ;
/*0a50*/ STS [R15+0x38], R4 ;
/*0a60*/ STS [R15+0x3c], R11 ;
/*0a70*/ BAR.SYNC 0x0 ;
/*0a80*/ LDS R7, [R2+0xb4] ;
/*0a90*/ LDS R8, [R2+0xfc] ;
/*0aa0*/ LDS R10, [R2+0x48] ;
/*0ab0*/ LDS R13, [R2+0xd8] ;
/*0ac0*/ LDS R5, [R2+0x6c] ;
/*0ad0*/ LDS R14, [R2+0x24] ;
/*0ae0*/ LDS R9, [R2] ;
/*0af0*/ LDS R12, [R2+0x90] ;
/*0b00*/ S2R R11, SR_CTAID.X ;
/*0b10*/ S2R R4, SR_CTAID.Y ;
/*0b20*/ FADD R6, R7, -R8 ;
/*0b30*/ FADD R10, R10, -R13 ;
/*0b40*/ IMAD R16, R11, 0x8, R0 ;
/*0b50*/ FADD R11, R6, -R5.reuse ;
/*0b60*/ FADD R7, -R7, R5 ;
/*0b70*/ LEA R4, R4, R3, 0x3 ;
/*0b80*/ FFMA R3, R13, 1.4142134189605712891, -R10 ;
/*0b90*/ FADD R11, R11, R14 ;
/*0ba0*/ FFMA R10, R13, 1.4142134189605712891, R10 ;
/*0bb0*/ ISETP.GE.AND P0, PT, R4, c[0x0][0x170], PT ;
/*0bc0*/ FMUL R3, R3, 0.19134169816970825195 ;
/*0bd0*/ FFMA R13, R6, -1.4142134189605712891, R11.reuse ;
/*0be0*/ FMUL R10, R10, 0.46193981170654296875 ;
/*0bf0*/ FADD R0, R9.reuse, -R12.reuse ;
/*0c00*/ FADD R9, R9, R12 ;
/*0c10*/ FFMA R12, R7.reuse, 0.70710682868957519531, -R8.reuse ;
/*0c20*/ FFMA R7, R7, 0.70710682868957519531, R8 ;
/*0c30*/ FFMA R6, R6, 1.4142134189605712891, R11 ;
/*0c40*/ SHF.L.U32 R16, R16, 0x3, RZ ;
/*0c50*/ FFMA R5, R0.reuse, 0.35355341434478759766, -R3.reuse ;
/*0c60*/ FFMA R0, R0, 0.35355341434478759766, R3 ;
/*0c70*/ FFMA R11, R12, -1.0823919773101806641, R13.reuse ;
/*0c80*/ FFMA R3, R9, 0.35355341434478759766, -R10 ;
/*0c90*/ FFMA R8, R7, -2.6131260395050048828, R6 ;
/*0ca0*/ ISETP.GE.OR P0, PT, R16, c[0x0][0x16c], P0 ;
/*0cb0*/ FFMA R9, R9, 0.35355341434478759766, R10 ;
/*0cc0*/ FFMA R12, R12, 1.0823919773101806641, R13 ;
/*0cd0*/ FFMA R6, R7, 2.6131260395050048828, R6 ;
/*0ce0*/ FFMA R13, R11.reuse, 0.27778509259223937988, R0.reuse ;
/*0cf0*/ FFMA R17, R11, -0.27778509259223937988, R0 ;
/*0d00*/ FFMA R11, R8.reuse, 0.097545139491558074951, R3.reuse ;
/*0d10*/ FFMA R15, R8, -0.097545139491558074951, R3 ;
/*0d20*/ FFMA R7, R12, 0.41573479771614074707, R5 ;
/*0d30*/ FFMA R3, R6, 0.49039259552955627441, R9 ;
/*0d40*/ FFMA R5, R12, -0.41573479771614074707, R5 ;
/*0d50*/ FFMA R9, R6, -0.49039259552955627441, R9 ;
/*0d60*/ STS [R2+0x48], R13 ;
/*0d70*/ STS [R2+0xb4], R17 ;
/*0d80*/ STS [R2+0x6c], R11 ;
/*0d90*/ STS [R2+0x90], R15 ;
/*0da0*/ STS [R2+0x24], R7 ;
/*0db0*/ STS [R2+0xd8], R5 ;
/*0dc0*/ STS [R2], R3 ;
/*0dd0*/ STS [R2+0xfc], R9 ;
/*0de0*/ BAR.SYNC 0x0 ;
/*0df0*/ @P0 EXIT ;
/*0e00*/ IMAD R18, R18, 0x24, RZ ;
/*0e10*/ ULDC.64 UR4, c[0x0][0x118] ;
/*0e20*/ LDS R2, [R18+0x24] ;
/*0e30*/ LDS R0, [R18+0x20] ;
/*0e40*/ LDS R7, [R18+0x30] ;
/*0e50*/ LDS R9, [R18+0x34] ;
/*0e60*/ LDS R10, [R18+0x38] ;
/*0e70*/ LDS R3, [R18+0x28] ;
/*0e80*/ LDS R6, [R18+0x2c] ;
/*0e90*/ LDS R12, [R18+0x3c] ;
/*0ea0*/ FADD R2, R2, 128 ;
/*0eb0*/ FADD R0, R0, 128 ;
/*0ec0*/ FMNMX R2, RZ, R2, !PT ;
/*0ed0*/ FADD R7, R7, 128 ;
/*0ee0*/ FMNMX R0, RZ, R0, !PT ;
/*0ef0*/ FMNMX R2, R2, 255, PT ;
/*0f00*/ FADD R9, R9, 128 ;
/*0f10*/ FMNMX R8, RZ, R7, !PT ;
/*0f20*/ FMNMX R0, R0, 255, PT ;
/*0f30*/ FADD R11, R10, 128 ;
/*0f40*/ FMNMX R9, RZ, R9, !PT ;
/*0f50*/ F2I.NTZ R5, R2 ;
/*0f60*/ FMNMX R8, R8, 255, PT ;
/*0f70*/ FADD R3, R3, 128 ;
/*0f80*/ FMNMX R10, R9, 255, PT ;
/*0f90*/ FADD R6, R6, 128 ;
/*0fa0*/ FMNMX R3, RZ, R3, !PT ;
/*0fb0*/ F2I.NTZ R0, R0 ;
/*0fc0*/ FMNMX R2, RZ, R11, !PT ;
/*0fd0*/ FADD R9, R12, 128 ;
/*0fe0*/ FMNMX R3, R3, 255, PT ;
/*0ff0*/ FMNMX R12, R2, 255, PT ;
/*1000*/ FMNMX R6, RZ, R6, !PT ;
/*1010*/ F2I.NTZ R8, R8 ;
/*1020*/ FMNMX R2, RZ, R9, !PT ;
/*1030*/ FMNMX R6, R6, 255, PT ;
/*1040*/ FMNMX R14, R2, 255, PT ;
/*1050*/ F2I.NTZ R11, R10 ;
/*1060*/ LEA R0, R5, R0, 0x8 ;
/*1070*/ F2I.NTZ R7, R3 ;
/*1080*/ F2I.NTZ R13, R12 ;
/*1090*/ IMAD R3, R4, c[0x0][0x168], RZ ;
/*10a0*/ LEA R8, R11, R8, 0x8 ;
/*10b0*/ SHF.R.S32.HI R4, RZ, 0x1f, R16.reuse ;
/*10c0*/ IADD3 R2, P0, P1, R3, c[0x0][0x160], R16 ;
/*10d0*/ F2I.NTZ R9, R6 ;
/*10e0*/ SHF.R.S32.HI R3, RZ, 0x1f, R3 ;
/*10f0*/ LEA R0, R7, R0, 0x10 ;
/*1100*/ IADD3.X R3, R3, c[0x0][0x164], R4, P0, P1 ;
/*1110*/ F2I.NTZ R15, R14 ;
/*1120*/ LEA R8, R13, R8, 0x10 ;
/*1130*/ LEA R4, R9, R0, 0x18 ;
/*1140*/ LEA R5, R15, R8, 0x18 ;
/*1150*/ STG.E.64 [R2.64], R4 ;
/*1160*/ EXIT ;
.L_x_57:
/*1170*/ BRA `(.L_x_57);
/*1180*/ NOP;
/*1190*/ NOP;
/*11a0*/ NOP;
/*11b0*/ NOP;
/*11c0*/ NOP;
/*11d0*/ NOP;
/*11e0*/ NOP;
/*11f0*/ NOP;
.L_x_61:
At a glance, jpegparse_pass1
seems to be the simplest
kernel, with jpegparse_pass2
containing a similar amount of
instructions but a more complicated logical flow.
jpegdec_idct
looks very arithmetic-intensive, containing
very few nodes, while jpegdec_vld
is by far the most
logically complex kernel.
After completing this initial, high-level analysis, I moved on to
examining the raw SASS code. While the process was tedious and lengthy,
there are in fact few different instructions being used in a normal CUDA
kernel. Once I became familiar with the assembler’s general structure,
my pace improved, and I began to recognize patterns more
intuitively.
My method was simply to translate each instruction into its GLSL
equivalent, line by line. When I could notice a pattern, such as a loop
or a branch, I would try to reproduce the logic in “natural” code. After
a while, the flow of the kernel starts appearing, until a complete
understanding is achieved.
Throughout this work, I found Compiler
Explorer to be extremely useful for testing decompiled patterns.
Since NVIDIA provides documentation for their PTX
intermediate assembler language, you can explore the mapping to SASS
instructions by writing a small PTX snippet using inline assembly, and
validate your assumptions.
jpegparse_pass1
The first kernel I took a detailed look at is
jpegparse_pass1
, for two reasons. First, its control flow
appeared relatively simple. Second, I was reasonably confident that it
operated directly on the raw JPEG bitstream without additional metadata,
reducing the need for speculation during reverse engineering.
Below is my manual decompilation of this kernel. Note that I haven’t
made sure all edge case handling was reproduced correctly, the goal is
to illustrate the core algorithm.
The assembly is also translated to GLSL instead of CUDA because that’s
what I’m ultimately more interested in.
layout(local_size_x = 0x100) in;
shared int smem[0x100];
void main(uint8_t *data_in, uint64_t *data_out, int size) {
// Initialise state
int thread_sz = max((size + 0xffff) >> 0x10, 4);
int bitstream_off = (gl_WorkGroupID.x * 0x100 + gl_LocalInvocationID.x) * thread_sz;
int bitstream_end = min(bitstream_off + thread_sz, size - 1);
int first_mcu_off = 0, num_mcus = ((gl_WorkGroupID.x | gl_LocalInvocationID.x) != 0) ? 0 : 1;
// Scan stream for restart markers
int prev;
for (int i = bitstream_off; i < bitstream_end & ~3; i += 4) {
int a = data_in[i+0], b = data_in[i+1], c = data_in[i+2], d = data_in[i+3];
if (prev == 0xff && (a & 0xf8) == 0xd0) {
if (!num_mcus) first_mcu_off = i + 1;
++num_mcus;
}
if (a == 0xff && (b & 0xf8) == 0xd0) {
if (!num_mcus) first_mcu_off = i + 2;
++num_mcus;
}
if (b == 0xff && (c & 0xf8) == 0xd0) {
if (!num_mcus) first_mcu_off = i + 3;
++num_mcus;
}
if (c == 0xff && (d & 0xf8) == 0xd0) {
if (!num_mcus) first_mcu_off = i + 4;
++num_mcus;
}
= d;
prev }
for (int i = bitstream_end & ~3; i < bitstream_end; ++i) {
int a = data_in[i];
if (prev == 0xff && (a & 0xf8) == 0xd0) {
if (!num_mcus) first_mcu_off = i + 1;
++num_mcus;
}
= a;
prev }
[gl_LocalInvocationID.x] = num_mcus;
smem
// Inclusive prefix sum
for (int i = 0; i < 8; ++i) {
barrier();
int n = 1 << i;
if (gl_LocalInvocationID.x & n)
[gl_LocalInvocationID.x] += smem[(gl_LocalInvocationID.x & ~n) | (n - 1)];
smem}
// Store to output
[gl_WorkGroupID.x * 0x100 + gl_LocalInvocationID.x] =
data_out[gl_LocalInvocationID.x] | (first_mcu_off << 32);
smem}
What this kernel does is first divide the input JPEG bitstream in
equal parts, between all 65536 threads. Each thread looks through its
slice of data for restart markers, which are denoted in the stream by a
FFD0-7
sequence. Upon finding a marker, the thread
increments a counter representing the number of MCU sequences within its
slice of the bitstream, and also remembers the location of the first
encountered
marker.Special
casing is given for the first thread, since the initial MCU doesn’t
start with a marker. The thread then writes the MCU counter to
shared memory.
Interestingly, this search has been “vectorised” to read 4 bytes of the
stream at a time (with a cleanup loop to handle the remaining unaligned
bytes). This probably improves memory access patterns.
After this, the threads perform an inclusive prefix sum over the MCU
counters. This is done using the Sklansky
algorithm,Sklansky,
J. “Conditional-Sum Addition Logic.” IEEE Transactions on Electronic
Computers EC-9, no. 2 (June 1960): 226–31. (link) in a
loop with 8 iterations (summing over 256 values). This strategy will
enable half the threads in each iteration, forming a “block” (patterned
in blue in the figure below). Each thread within the block adds the
value of the thread immediately left of the block to its own.
Figure
2: Inclusive prefix sum for 16 values using the Sklansky algorithm, in 4
iterations.
Enabled threads are patterned in blue at each
iteration.
Finally, the thread writes its summed MCU count, and the bitstream position of its first MCU in the output stream.
jpegdec_idct
The second kernel I reverse engineered was jpegdec_idct
.
This time, I was actually interested in the GPU programming techniques
used to improve the speed of the iDCT operation. Indeed, since the iDCT
is inherently parallelisable, there is more room for advanced
optimisation work.
From the grid size observed in Nsight
systems and the size of the picture, we know the kernel outputs
samples per block. From the block size
(),
we can guess each thread outputs a line of
samples, though we will need to verify this from the decompilation.
TODO: Finish writeup :)
TODO: Write this.
Below are listed a number of random bits of knowledge that I found interesting, or that took me a while to figure out:
Rn
: GPR, per-thread. RZ
always 0.URn
: Uniform register, per-warp.Pn
/UPn
: Predicate registers,
per-thread/warp. PT
always true. PR
is
P0
to P6
, aggregated together (used eg. in
R2P
).Bn
: Barrier control register, per-warp. Used with
BSSY
/BSYNC
to perform control flow
reconvergence.CC
: Condition code register, per-thread. Present in
older architectures, newer ones seem to use normal predicate registers
to hold overflow/carry/etc information.c[bank][offset]
. Bank 0 is reserved for internal driver
state (see below for some examples). The bank mapping for constant
blocks is specified in the ELF section (eg.
.nv.constant3
).blockIdx.{x,y,z}
: SR_CTAID.{X,Y,Z}
.threadIdx.{x,y,z}
: SR_TID.{X,Y,Z}
.blockDim.{x,y,z}
:
c[0x0][{0x0,0x4,0x8}]
.gridDim.{x,y,z}
:
c[0x0][{0xc,0x10,0x14}]
.c[0x0][0x160]
and following. The
offset is architecture-specific and probably part of some internal ABI
(cuobjdump
lists EIATTR_CUDA_API_VERSION
as
0x73).TLD.SCR.LZ RZ, R5, R0, 0x0, 0x58, 1D, 0x1
for a texture
object at c[0x0][0x160]
, see here). Color components can
be selected using a mask as the last argument.LOP3.LUT
(see
PTX documentation).
This instruction can represent all bitwise operations on 3 operands,
using a lookup table (LUT).IADD3
/IMAD.IADD
for scheduling: IADD3
executes on the ALU pipeline,
IMAD
on the FMA pipeline (see here).
IMAD.MOV
seems to be another instance of this optimisation
strategy.LEA.HI
to produce a right-shifted scaled offset
()
(link)PRMT
(link)IADD3
/IADD3.X
(link)SELP
/SELP
and predicate (link)SGXT
(maps to CUDA
__mul24
) (link). Later versions of NVCC
(11.6.0+) have started emitting shifts instead.