CUDA kernels are GPU-executed functions that operate in parallel on massive data sets, such as video frames. Video processing tasks, including filtering, color conversion, scaling, and edge detection, can be efficiently implemented as custom CUDA kernels to take advantage of GPU parallelism.
Frame Representation on GPU
Video frames are stored in GPU device memory, typically as 2D arrays in formats such as RGB or YUV420. Memory allocation must account for the number of channels and alignment, with each kernel thread accessing a specific pixel or macroblock. Proper memory layout ensures coalesced access and optimal performance during kernel execution.
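As a sketch of the size arithmetic involved (the 1920x1080 dimensions and 256-byte alignment in the helpers below are illustrative choices; a real pitched allocation takes its pitch from cudaMallocPitch):

```cpp
#include <cassert>
#include <cstddef>

// Bytes for a packed RGB frame: 3 channels, 1 byte each.
size_t rgb_frame_bytes(int width, int height) {
    return (size_t)width * height * 3;
}

// Bytes for a planar YUV420 frame: a full-resolution Y plane plus
// quarter-resolution U and V planes, i.e. 1.5 bytes per pixel.
size_t yuv420_frame_bytes(int width, int height) {
    return (size_t)width * height * 3 / 2;
}

// Round a row width up to an alignment boundary, as a pitched
// allocation would (the real pitch is whatever the allocator returns).
size_t aligned_pitch(size_t row_bytes, size_t alignment) {
    return (row_bytes + alignment - 1) / alignment * alignment;
}
```

Pitched rows mean the index of pixel (x, y) is y * pitch + x * channels rather than (y * width + x) * channels, which is why several kernels below take a pitch parameter.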
Example: RGB Frame Allocation
uint8_t* d_rgb_frame;
cudaMalloc(&d_rgb_frame, width * height * 3); // 3 bytes per pixel
Each thread in the CUDA kernel maps to a pixel (or macroblock), enabling parallel per-pixel operations.
Color Space Conversion Kernel
Color space conversion, such as RGB to grayscale, is a frequent preprocessing step in video analytics. The CUDA kernel applies the luma extraction formula to each pixel in parallel, allowing all pixels in a frame to be converted simultaneously. This greatly accelerates tasks like object detection or segmentation that require grayscale input.
RGB to Grayscale (Luma extraction)
This operation applies the formula:
Y = 0.299*R + 0.587*G + 0.114*B
CUDA Kernel Implementation
__global__ void rgb_to_gray(const uint8_t* rgb, uint8_t* gray, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = (y * width + x) * 3;
        uint8_t r = rgb[idx];
        uint8_t g = rgb[idx + 1];
        uint8_t b = rgb[idx + 2];
        // Round rather than truncate when narrowing back to 8 bits
        gray[y * width + x] = (uint8_t)(0.299f * r + 0.587f * g + 0.114f * b + 0.5f);
    }
}
Kernel Launch
The launch configuration defines how many threads and blocks run the kernel. Threads are organized in 2D blocks and grids to map directly to frame dimensions; a common choice is 16×16 threads per block.
dim3 threads(16, 16);
dim3 blocks((width + 15) / 16, (height + 15) / 16);
rgb_to_gray<<<blocks, threads>>>(d_rgb_frame, d_gray_frame, width, height);
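The ceiling division in the block counts above can be sanity-checked on the host. This small sketch (plain C++; the 1920x1080 dimensions in the checks are illustrative) confirms that every pixel is covered:

```cpp
#include <cassert>

// Ceiling division used to size the grid so every pixel gets a thread,
// even when the frame dimensions are not multiples of the block size.
int blocks_needed(int extent, int block_size) {
    return (extent + block_size - 1) / block_size;
}
```

Edge blocks may contain threads that fall outside the frame, which is why the kernel's `if (x < width && y < height)` bounds check is required.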
Bilinear Scaling Kernel
Bilinear interpolation kernels resize images by computing each output pixel value from four neighboring input pixels. This provides smoother results than nearest-neighbor scaling and is suitable for both upscaling and downscaling. Implementing bilinear scaling as a CUDA kernel allows high-resolution frames to be resized efficiently, maintaining real-time throughput.
CUDA Kernel Structure (Pseudocode)
__global__ void scale_bilinear(const uint8_t* input, uint8_t* output,
                               int in_w, int in_h, int out_w, int out_h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x; // target pixel coordinates
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= out_w || y >= out_h) return;
    // Map back to source space and apply bilinear interpolation
    float fx = (x + 0.5f) * in_w / out_w - 0.5f;
    float fy = (y + 0.5f) * in_h / out_h - 0.5f;
    // ... fetch the 4 neighbors of (fx, fy), blend by fractional weights ...
}
This approach is more accurate than nearest-neighbor and suitable for quality-preserving downscaling or upscaling.
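To make the interpolation step concrete, here is a CPU-side, single-channel reference sketch of the sampling math; the clamp-to-border policy is an illustrative choice, and the CUDA version would run this once per output pixel:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sample a grayscale image at fractional coordinates (fx, fy) using
// bilinear weights; coordinates are clamped to the image border.
float bilinear_sample(const uint8_t* input, int in_w, int in_h, float fx, float fy) {
    fx = std::max(0.0f, std::min(fx, (float)(in_w - 1)));
    fy = std::max(0.0f, std::min(fy, (float)(in_h - 1)));
    int x0 = (int)fx, y0 = (int)fy;                       // top-left neighbor
    int x1 = std::min(x0 + 1, in_w - 1);                  // right neighbor
    int y1 = std::min(y0 + 1, in_h - 1);                  // bottom neighbor
    float ax = fx - x0, ay = fy - y0;                     // fractional weights
    float top = input[y0 * in_w + x0] * (1 - ax) + input[y0 * in_w + x1] * ax;
    float bot = input[y1 * in_w + x0] * (1 - ax) + input[y1 * in_w + x1] * ax;
    return top * (1 - ay) + bot * ay;
}
```

At integer coordinates this returns the pixel itself; halfway between two pixels it returns their average, which is the smoothing behavior that distinguishes bilinear from nearest-neighbor scaling.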
Brightness Adjustment Kernel
Brightness adjustment modifies each pixel's intensity by adding or subtracting a scalar value from all color channels. The CUDA kernel processes all pixels in parallel, making it ideal for color correction or normalization in video pipelines.
CUDA Kernel
The kernel below adds a signed offset to every channel of every pixel in place, saturating the result to the valid 0–255 range. The pitch parameter is the row stride in bytes, which allows rows padded for alignment (as with cudaMallocPitch) to be indexed correctly.
__global__ void adjust_brightness(uint8_t* frame, int width, int height, int pitch, int value) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int idx = y * pitch + x * 3; // pitch is the row stride in bytes
    for (int c = 0; c < 3; ++c) {
        int pixel = frame[idx + c] + value;
        frame[idx + c] = min(255, max(0, pixel)); // saturate to 0..255
    }
}
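The saturating arithmetic is the heart of this kernel. A scalar CPU version of the same per-channel update can be used to verify it:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Add `value` to one channel and clamp to the valid 0..255 range,
// mirroring the min/max pair in the CUDA kernel.
uint8_t adjust_channel(uint8_t channel, int value) {
    return (uint8_t)std::min(255, std::max(0, channel + value));
}
```

Doing the addition in int before clamping matters: adding directly in uint8_t would wrap around instead of saturating.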
Frame Differencing (Motion Detection)
Frame differencing detects motion by subtracting pixel values of consecutive frames and applying a threshold. The CUDA kernel compares corresponding pixels in parallel, quickly generating a binary mask indicating regions of motion. This is useful for surveillance, tracking, and event detection in video streams.
Kernel Implementation
__global__ void frame_difference(const uint8_t* frame1, const uint8_t* frame2, uint8_t* output, int width, int height) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < width * height) {
        int diff = abs(frame1[idx] - frame2[idx]);
        output[idx] = diff > 25 ? 255 : 0; // fixed threshold; tune per scene
    }
}
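The per-pixel comparison can likewise be validated with a CPU reference (the threshold of 25 matches the kernel above; in practice it is scene-dependent):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Per-pixel motion test: absolute difference against a threshold,
// producing a binary mask value (255 = motion, 0 = static).
uint8_t diff_mask(uint8_t a, uint8_t b, int threshold) {
    return std::abs(a - b) > threshold ? 255 : 0;
}
```

Because each uint8_t operand is promoted to int before the subtraction, the difference cannot wrap around.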
Gaussian Blur Using Shared Memory
Gaussian blur smooths images by averaging neighboring pixels with a weighted kernel, reducing noise and detail. Leveraging shared memory in the CUDA kernel minimizes global memory access, significantly improving performance. This method is widely used for preprocessing, edge detection, and noise reduction in video analytics.
Kernel Overview
__global__ void gaussian_blur(const uint8_t* input, uint8_t* output, int width, int height, const float* kernel) {
    // Requires a 32x32 block; each block computes a 28x28 tile of output,
    // loading a 2-pixel halo on every side for the 5x5 kernel.
    __shared__ uint8_t tile[32][32];
    // Blocks advance by blockDim - 4 so the computed tiles are contiguous;
    // launch with grid ((width + 27) / 28, (height + 27) / 28).
    int x = blockIdx.x * (blockDim.x - 4) + threadIdx.x - 2;
    int y = blockIdx.y * (blockDim.y - 4) + threadIdx.y - 2;
    // Clamp halo loads to the image border so no tile entry is uninitialized.
    int cx = min(max(x, 0), width - 1);
    int cy = min(max(y, 0), height - 1);
    tile[threadIdx.y][threadIdx.x] = input[cy * width + cx];
    __syncthreads();
    // Only threads with a full 5x5 neighborhood inside the tile produce output.
    if (threadIdx.x >= 2 && threadIdx.x < blockDim.x - 2 &&
        threadIdx.y >= 2 && threadIdx.y < blockDim.y - 2 &&
        x < width && y < height) {
        float val = 0.0f;
        for (int ky = 0; ky < 5; ky++) {
            for (int kx = 0; kx < 5; kx++) {
                val += tile[threadIdx.y - 2 + ky][threadIdx.x - 2 + kx] * kernel[ky * 5 + kx];
            }
        }
        output[y * width + x] = (uint8_t)(val + 0.5f);
    }
}
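The kernel argument is assumed to hold normalized 5x5 Gaussian weights. One way they might be generated on the host, sketched in plain C++ (sigma = 1.0 in the checks is an illustrative choice):

```cpp
#include <cassert>
#include <cmath>

// Fill a 5x5 Gaussian kernel with the given sigma and normalize it so
// the weights sum to 1, which keeps overall image brightness unchanged.
void make_gaussian_5x5(float* kernel, float sigma) {
    float sum = 0.0f;
    for (int ky = 0; ky < 5; ky++) {
        for (int kx = 0; kx < 5; kx++) {
            float dx = (float)(kx - 2), dy = (float)(ky - 2);
            float w = std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
            kernel[ky * 5 + kx] = w;
            sum += w;
        }
    }
    for (int i = 0; i < 25; i++) kernel[i] /= sum;
}
```

Because the normalized weights sum to 1, blurring a constant image leaves it unchanged, which makes a convenient correctness check for the kernel above.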

