Scaling FFmpeg workflows with GPU acceleration improves processing throughput, lowers CPU usage, and maintains low-latency execution for video-intensive tasks. By leveraging NVDEC for decoding, CUDA filters for transformations, and NVENC for encoding, end-to-end video pipelines can remain entirely on the GPU with minimal host-device interaction.

Prerequisites

  • FFmpeg compiled with support for CUDA, cuvid, and NVENC
  • NVIDIA GPU with NVENC and NVDEC support (Turing generation or newer recommended)
  • NVIDIA drivers and CUDA toolkit installed
  • Raw or compressed video input sources

Verify GPU encoding/decoding support:

code
ffmpeg -hwaccels
ffmpeg -encoders | grep nvenc
ffmpeg -decoders | grep cuvid
  • -hwaccels: Lists available hardware acceleration backends
  • nvenc: Confirms GPU encoding capability
  • cuvid: Confirms availability of GPU-based decoders

GPU-Based Decoding with NVDEC

NVDEC enables hardware-accelerated video decoding, keeping frames in GPU memory and avoiding costly transfers to the host. Using FFmpeg with -hwaccel cuda and -c:v h264_cuvid allows efficient decoding of H.264 streams directly on the GPU. This approach is essential for high-throughput pipelines, as it minimizes CPU involvement and data movement.

code
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v h264_cuvid -i input.mp4
  • -hwaccel cuda: Enables GPU-accelerated decoding
  • -hwaccel_output_format cuda: Keeps frames in GPU memory
  • -c:v h264_cuvid: Uses cuvid-based decoder for H.264 input

This avoids host-device memory transfers during decode.
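To verify this in practice, a decode-only run can be timed by discarding the decoded frames; the following is a minimal sketch (the input filename is an assumption):

```shell
# Decode-only throughput test: frames are decoded by NVDEC and then
# discarded (-f null -), so no encode step or host copy is involved.
# -benchmark prints elapsed time and CPU usage when the run finishes.
ffmpeg -benchmark -hwaccel cuda -hwaccel_output_format cuda \
       -c:v h264_cuvid -i input.mp4 -f null -
```

A low CPU time here, relative to a software-decode run of the same file, confirms that decoding is happening on the GPU.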

GPU Scaling with CUDA Filters

The scale_npp filter leverages NVIDIA Performance Primitives to perform resizing operations entirely on the GPU. By chaining this filter in FFmpeg, you can efficiently scale video frames after decoding and before encoding, maintaining the entire processing path on the GPU. This reduces latency and maximizes throughput, especially for high-resolution or batch workloads.

code
-vf scale_npp=1280:720

Example: Pipeline

code
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v h264_cuvid -i input.mp4 \
  -vf scale_npp=1280:720:format=yuv420p \
  -c:v h264_nvenc -preset p1 -b:v 4M output.mp4
  • scale_npp: GPU-based scaler using NPP
  • format=yuv420p: scale_npp output-format option; converts the frame format on the GPU for NVENC compatibility
  • -c:v h264_nvenc: Encodes using NVIDIA NVENC hardware encoder

Batch Transcoding with Parallel GPU Streams

Scaling to multiple files or streams is achieved by running several FFmpeg processes in parallel, each using NVDEC and NVENC. Tools like GNU Parallel can help manage these jobs, and monitoring with nvidia-smi dmon ensures you don't oversubscribe GPU resources. Assigning jobs to specific GPUs can further optimize resource allocation and prevent bottlenecks.

code
parallel -j 4 'ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i {} -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 5M output_{#}.mp4' ::: *.mp4

Explanation:

  • -j 4: Runs 4 jobs in parallel
  • -i {}: Placeholder for each input file
  • output_{#}.mp4: Output named by job index
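Pinning jobs to specific GPUs, as mentioned above, can be sketched with GNU Parallel's slot number ({%}) driving a round-robin device choice. This sketch assumes a two-GPU host and H.264 inputs; -hwaccel_device selects the decode GPU and the h264_nvenc -gpu option selects the encode GPU.

```shell
# Round-robin transcode jobs across two GPUs with GNU Parallel.
# {%} is the job slot (1..4 with -j 4); map it to GPU index 0 or 1
# so decode and encode for each job stay on the same device.
parallel -j 4 '
  gpu=$(( ({%} - 1) % 2 ))
  ffmpeg -hwaccel cuda -hwaccel_device $gpu -hwaccel_output_format cuda \
         -c:v h264_cuvid -i {} \
         -vf scale_npp=1280:720 \
         -c:v h264_nvenc -gpu $gpu -b:v 5M output_{#}.mp4
' ::: *.mp4
```

Keeping decode and encode on the same device avoids cross-GPU frame copies.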

Monitor GPU saturation:

code
nvidia-smi dmon
  • Tracks the utilization of NVENC, NVDEC, and memory bandwidth
  • Prevents oversubscription across multiple streams

Encoding with NVENC and Preset Tuning

NVENC provides multiple presets and rate control options to balance encoding speed and output quality. Selecting the right preset (e.g., p1 for speed, p7 for quality) and rate control mode (CBR, VBR, ConstQP) allows you to tailor the workflow to your needs. Advanced options like lookahead and B-frames can further improve quality or reduce latency, depending on your application.

code
-c:v h264_nvenc -preset p1 -rc cbr -b:v 5000k

Explanation:

  • preset: p1 (fastest) to p7 (best quality)
  • rc: cbr, vbr, constqp
  • b:v: Target bitrate

To enable lookahead and B-frames:

code
-rc-lookahead 32 -bf 3 -b_ref_mode each

Explanation:

  • -rc-lookahead 32: Looks ahead 32 frames for bitrate optimization
  • -bf 3: Enables 3 B-frames
  • -b_ref_mode each: Allows B-frames to be used as references
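Putting these tuning options together, a quality-leaning encode might look like the following sketch (preset and bitrate values are illustrative, not recommendations; -b_ref_mode each requires a Turing or newer GPU):

```shell
# Full NVENC encode combining preset, VBR rate control, lookahead,
# and B-frame options on a fully GPU-resident pipeline.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v h264_cuvid -i input.mp4 \
       -c:v h264_nvenc -preset p5 -rc vbr -b:v 6M -maxrate 8M \
       -rc-lookahead 32 -bf 3 -b_ref_mode each \
       output.mp4
```

For latency-sensitive workloads, drop the lookahead and B-frames instead and move toward p1/cbr.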

GPU-Only Pipeline Summary

A fully GPU-accelerated pipeline in FFmpeg decodes, scales, and encodes video without transferring frames back to the CPU. This setup ensures that the CPU is only responsible for orchestration and muxing, while all intensive frame processing remains on the GPU. The result is a significant reduction in processing time and system resource usage.

code
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -c:v h264_cuvid -i input.mp4 \
  -vf scale_npp=1920:1080:format=yuv420p \
  -c:v h264_nvenc -preset p3 -rc vbr -b:v 6M output.mp4

Explanation:

  • h264_cuvid: Hardware decode on GPU
  • scale_npp: Resolution adjustment using GPU
  • h264_nvenc: GPU-accelerated encoding
  • No intermediate CPU-bound operations
  • Output is muxed by CPU; all frame ops stay on device memory

Performance Profiling

To ensure optimal scaling and resource usage, monitor GPU utilization and encoding throughput with tools like nvidia-smi and dmon. FFmpeg's benchmarking options can provide per-frame timing and overall performance metrics. Keeping an eye on NVENC, NVDEC, and CUDA kernel activity helps identify bottlenecks and guides further optimization.

code
nvidia-smi dmon

Explanation:

  • enc: NVENC encoder usage
  • dec: NVDEC decoder usage
  • sm: Streaming multiprocessor usage (CUDA filters)
  • mem: Global memory bandwidth usage

For per-frame timing, enable FFmpeg benchmarking:

code
ffmpeg -benchmark -i input.mp4 ...

Explanation:

  • -benchmark: Prints per-frame processing time and total elapsed time
  • Helps analyze overhead introduced by individual stages (decode, filter, encode)