Video super-resolution (VSR) enhances the resolution of low-resolution video frames using machine learning models. When accelerated with CUDA, the compute-intensive operations involved in model inference and frame processing can be executed efficiently on NVIDIA GPUs, allowing for real-time or batch super-resolution pipelines.

Super-Resolution Model Architecture

Common model types used for super-resolution include:

ESPCN (Efficient Sub-Pixel Convolutional Network)

The ESPCN architecture employs a three-layer convolutional structure optimized for real-time processing. The first layer utilizes 64 filters with 5×5 kernels for coarse feature extraction, followed by a second layer with 32 filters using 3×3 kernels for finer detail refinement.

The final layer applies sub-pixel convolution: a 3×3 convolution expands the channel count by r², and the resulting feature-map channels are rearranged into spatial dimensions, achieving ×4 upscaling with a 3.7 ms inference time on an RTX 4090 for 1080p inputs.
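
The sub-pixel rearrangement (pixel shuffle) maps input channel c*r*r + a*r + b at position (h, w) to output channel c at position (h*r + a, w*r + b). A minimal CUDA sketch of that mapping follows; the kernel name and planar layout are assumptions, not ESPCN's actual implementation.

code
// Illustrative pixel-shuffle kernel: rearranges a planar (C*r*r, H, W)
// tensor into (C, H*r, W*r). Layout and names are assumptions.
__global__ void pixelShuffle(const float* in, float* out,
                             int C, int H, int W, int r) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= C * H * r * W * r) return;

    int ow = idx % (W * r);               // output x
    int oh = (idx / (W * r)) % (H * r);   // output y
    int c  = idx / (W * r * H * r);       // output channel

    int a = oh % r, b = ow % r;           // sub-pixel offsets
    int ic = c * r * r + a * r + b;       // source channel
    out[idx] = in[(ic * H + oh / r) * W + ow / r];
}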

EDSR (Enhanced Deep Residual Networks)

EDSR removes batch normalization layers to preserve feature magnitude consistency across its 32 residual blocks, each containing two 3×3 convolutional layers with 256 channels. The architecture incorporates a residual scaling factor of 0.1 to stabilize gradient flow in deep networks.
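
The scaled residual connection itself is a simple elementwise operation. A minimal CUDA sketch, with the 0.1 factor taken from the description above and everything else illustrative:

code
// Scaled residual add: out = x + scale * branch(x), with scale = 0.1
// per EDSR block. A sketch; buffer layout and names are assumptions.
__global__ void scaledResidualAdd(const float* x, const float* branch,
                                  float* out, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i] + scale * branch[i];
}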

A multi-scale variant shares initial convolutional weights across ×2, ×3, and ×4 upscaling factors, using pixel shuffle operations with different repetition counts for each scale. This weight sharing reduces model size by 43% while maintaining 31.4 dB PSNR on DIV2K validation.

Real-ESRGAN

Real-ESRGAN combines RRDB blocks with a U-Net discriminator architecture for adversarial training. Each RRDB contains three dense blocks with leaky ReLU (α=0.2) and a residual scaling factor of 0.2. The generator uses 23 RRDB blocks with 64 base channels, while the discriminator employs 7 convolutional layers with spectral normalization.

Training incorporates a combination of L1 loss, perceptual VGG-19 loss (conv3_4 features), and RaGAN adversarial loss with 0.05, 0.6, and 0.35 weighting, respectively. The model demonstrates a 0.87 MOS (Mean Opinion Score) improvement over bicubic upsampling in subjective quality assessments.

Frame Preprocessing on GPU

Decoded frames arrive in YUV pixel formats (such as NV12 or YUV420) and are converted to RGB using CUDA-accelerated libraries (e.g., NPP). The output is normalized to floating-point format for input into neural networks. If models require a fixed resolution, frames are resized using either CUDA kernels or cuDNN. These preprocessing steps run entirely on the GPU to avoid unnecessary host-device transfers.

code
nppiYCbCr420ToRGB_8u_P2C3R(...); // NPP color conversion
cudaMemcpyAsync(...); // Transfer to model input buffer
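
The normalization step mentioned above can be a small custom kernel. A hypothetical sketch that converts interleaved 8-bit RGB into planar floats in [0, 1] (not an NPP call):

code
// Hypothetical normalization kernel: interleaved 8-bit RGB -> planar
// float planes in [0, 1] for network input.
__global__ void normalizeRGB(const unsigned char* rgb, float* out,
                             int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;
    out[i]                 = rgb[3 * i + 0] / 255.f;  // R plane
    out[i + numPixels]     = rgb[3 * i + 1] / 255.f;  // G plane
    out[i + 2 * numPixels] = rgb[3 * i + 2] / 255.f;  // B plane
}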

Model Inference Using TensorRT

TensorRT accelerates inference by optimizing and running deep learning models on NVIDIA hardware. It executes serialized .engine files built from ONNX, TensorFlow, or PyTorch exports. Input and output buffers are pre-allocated in GPU memory and bound to an execution context. Inference is launched with enqueueV2() on a CUDA stream, so the call is non-blocking.

Example:

code
context->enqueueV2(buffers, stream, nullptr); // Asynchronous inference
cudaMemcpyAsync(...); // Retrieve output
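
For context, a fuller sketch of the setup around enqueueV2(): deserializing an engine, creating an execution context, and binding pre-allocated buffers. The engine filename, buffer sizes, and binding order are assumptions; error handling is omitted.

code
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

struct Logger : nvinfer1::ILogger {
    void log(Severity s, const char* msg) noexcept override {
        if (s <= Severity::kWARNING) fprintf(stderr, "%s\n", msg);
    }
} gLogger;

int main() {
    // Load a serialized engine from disk (hypothetical filename).
    std::ifstream f("sr_model.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());

    auto* runtime = nvinfer1::createInferRuntime(gLogger);
    auto* engine  = runtime->deserializeCudaEngine(blob.data(), blob.size());
    auto* context = engine->createExecutionContext();

    // Pre-allocate device buffers; binding 0 = input, 1 = output (assumed).
    void* buffers[2];
    cudaMalloc(&buffers[0], 3 * 1920 * 1080 * sizeof(float));  // 1080p RGB in
    cudaMalloc(&buffers[1], 3 * 3840 * 2160 * sizeof(float));  // ×2 RGB out

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    context->enqueueV2(buffers, stream, nullptr);  // non-blocking inference
    cudaStreamSynchronize(stream);
    return 0;
}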

Postprocessing and Frame Output

The floating-point output from the model must be clamped to [0, 255], cast to uint8_t, and converted to NV12 or YUV420 format for encoding. Postprocessing is done via NPP or CUDA kernels to minimize latency. The result is then passed to an NVENC encoder or stored in memory for streaming.

Example:

code
nppiRGBToYCbCr420_8u_C3P2R(...); // Convert back to YUV
cudaMemcpy2D(...); // Prepare for NVENC
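
The clamp-and-cast step can also be a tiny custom kernel. A sketch, assuming the model emits values already scaled to [0, 255]:

code
// Hypothetical clamp-and-cast kernel: float model output -> uint8_t,
// run before the NPP color conversion. Assumes output in [0, 255].
__global__ void floatToU8(const float* in, unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = fminf(fmaxf(in[i], 0.f), 255.f);      // clamp to byte range
    out[i] = static_cast<unsigned char>(v + 0.5f);  // round to nearest
}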

Batch Inference and Streamed Super-Resolution

CUDA streams allow concurrent processing of multiple frames. Each inference task runs in its own stream, which enables decoding, super-resolution, and encoding to overlap. Synchronization ensures order without stalling the GPU. Efficient use of streams is crucial for maximizing throughput in video pipelines.

code
cudaStream_t stream;
cudaStreamCreate(&stream);
context->enqueueV2(..., stream, nullptr);
cudaStreamSynchronize(stream);

Each frame is processed asynchronously and pipelined for better GPU utilization.
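
A sketch of that pattern with a small stream pool follows. Note that TensorRT requires one execution context per concurrent stream; the buffer arrays and counts here are illustrative.

code
// Round-robin frames across a stream pool so the H2D copy, inference,
// and D2H copy of different frames overlap. Names are assumptions.
const int kStreams = 4;
cudaStream_t streams[kStreams];
for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&streams[i]);

for (int f = 0; f < numFrames; ++f) {
    int s = f % kStreams;  // per-stream buffers are reused safely,
                           // since each stream serializes its own work
    cudaMemcpyAsync(dIn[s], hFrames[f], inBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    contexts[s]->enqueueV2(bindings[s], streams[s], nullptr);
    cudaMemcpyAsync(hOut[f], dOut[s], outBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}
for (int i = 0; i < kStreams; ++i) cudaStreamSynchronize(streams[i]);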

Benchmarking and Performance Profiling

Use NVIDIA profiling tools to measure per-frame latency and GPU resource usage. nsys provides detailed trace-level insights, while nvidia-smi dmon tracks real-time usage (SM, memory, NVENC). These tools help detect performance bottlenecks, such as memory transfer delays or underutilized compute cores.

code
nsys profile ./vsr_app
nvidia-smi dmon

Track:

  • GPU utilization (SM, MEM).
  • Inference latency per frame.
  • NVENC throughput if output is encoded.
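
Per-frame latency can also be measured in-process with CUDA events, which time GPU work more accurately than host-side clocks. A minimal sketch:

code
// Timing one inference call with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
context->enqueueV2(buffers, stream, nullptr);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
printf("Inference latency: %.2f ms\n", ms);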

Memory Optimization Techniques

Pinned (page-locked) memory improves transfer bandwidth between host and device, and reused buffers reduce malloc/free overhead. Use FP16 if the model supports it to reduce memory footprint and improve performance on Ampere+ GPUs. Persistent memory allocation avoids stalls in real-time applications.

Example:

code
cudaHostAlloc(..., cudaHostAllocMapped); // Pinned host memory
cudaMalloc(&d_frame, ...); // Persistent device buffer
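
A sketch of that allocation pattern: pinned host and device buffers created once at startup and reused for every frame (readNextFrame is a hypothetical decoder callback):

code
// Allocate once, reuse per frame. Pinned host memory enables truly
// asynchronous copies; readNextFrame is a hypothetical decoder call.
unsigned char *hFrame, *dFrame;
cudaHostAlloc((void**)&hFrame, frameBytes, cudaHostAllocDefault);  // pinned
cudaMalloc(&dFrame, frameBytes);                                   // persistent

while (readNextFrame(hFrame)) {
    cudaMemcpyAsync(dFrame, hFrame, frameBytes,
                    cudaMemcpyHostToDevice, stream);
    // ... preprocess, infer, postprocess on the same stream ...
}
cudaFreeHost(hFrame);
cudaFree(dFrame);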

Integration with FFmpeg Pipelines

FFmpeg handles decoding and encoding, while super-resolution runs in an intermediate CUDA application. Raw frames are piped between processes using standard input/output to avoid disk I/O.

code
ffmpeg -hwaccel cuda -i input.mp4 -f rawvideo -pix_fmt rgb24 - | ./vsr_infer | ffmpeg -f rawvideo -pix_fmt yuv420p -s 1920x1080 -i - -c:v h264_nvenc output.mp4

Explanation:

  • -hwaccel cuda: Uses GPU for video decoding.
  • -f rawvideo: Specifies uncompressed output for piping.
  • -pix_fmt rgb24: Matches input format expected by the inference engine.
  • -s 1920x1080: Frame size of the raw stream coming from the inference step; it must match the upscaled output resolution.
  • -i -: FFmpeg reads input from stdin (via pipe).
  • -c:v h264_nvenc: Encodes the output using NVENC.
  • output.mp4: Output file containing the upscaled video.
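
The intermediate ./vsr_infer process simply loops over raw frames on stdin and stdout. A sketch, with frame dimensions chosen as assumptions that match a ×4 upscale to 1920x1080:

code
// Sketch of ./vsr_infer: read raw rgb24 frames from stdin, upscale,
// write raw yuv420p frames to stdout. Sizes are assumptions.
#include <cstdio>
#include <vector>

int main() {
    const int inW = 480, inH = 270, scale = 4;            // hypothetical source size
    const size_t inBytes  = size_t(inW) * inH * 3;        // rgb24
    const int outW = inW * scale, outH = inH * scale;     // 1920x1080
    const size_t outBytes = size_t(outW) * outH * 3 / 2;  // yuv420p

    std::vector<unsigned char> in(inBytes), out(outBytes);
    while (fread(in.data(), 1, inBytes, stdin) == inBytes) {
        // Upload, preprocess, infer, and postprocess on the GPU here,
        // using the steps from the sections above.
        fwrite(out.data(), 1, outBytes, stdout);
    }
    return 0;
}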