Scaling FFmpeg workflows with GPU acceleration improves processing throughput, lowers CPU usage, and maintains low-latency execution for video-intensive tasks. By leveraging NVDEC for decoding, CUDA filters for transformations, and NVENC for encoding, end-to-end video pipelines can remain entirely on the GPU with minimal host-device interaction.
Prerequisites
- FFmpeg compiled with support for CUDA, cuvid, and NVENC
- NVIDIA GPU with NVENC and NVDEC support (Turing generation or newer recommended)
- NVIDIA drivers and CUDA toolkit installed
- Raw or compressed video input sources
Verify GPU encoding/decoding support:
ffmpeg -hwaccels
ffmpeg -encoders | grep nvenc
ffmpeg -decoders | grep cuvid

- -hwaccels: Lists available hardware acceleration backends
- nvenc: Confirms GPU encoding capability
- cuvid: Confirms availability of GPU-based decoders
GPU-Based Decoding with NVDEC
NVDEC enables hardware-accelerated video decoding, keeping frames in GPU memory and avoiding costly transfers to the host. Using FFmpeg with -hwaccel cuda and -c:v h264_cuvid allows efficient decoding of H.264 streams directly on the GPU. This approach is essential for high-throughput pipelines, as it minimizes CPU involvement and data movement.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v h264_cuvid -i input.mp4

- -hwaccel cuda: Enables GPU-accelerated decoding
- -hwaccel_output_format cuda: Keeps frames in GPU memory
- -c:v h264_cuvid: Uses cuvid-based decoder for H.264 input
This avoids host-device memory transfers during decode.
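To gauge raw decode throughput before building a full pipeline, a decode-only run can discard the frames with FFmpeg's null muxer. This is a minimal sketch; `input.mp4` is a placeholder for your own H.264 source.

```shell
# Decode-only benchmark: frames stay on the GPU and are discarded,
# so the reported speed reflects NVDEC throughput rather than encode or I/O cost.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -c:v h264_cuvid -i input.mp4 \
  -f null -
```

The speed= figure in FFmpeg's progress output then indicates how many times faster than real time the GPU decodes the stream.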
GPU Scaling with CUDA Filters
The scale_npp filter leverages NVIDIA Performance Primitives to perform resizing operations entirely on the GPU. By chaining this filter in FFmpeg, you can efficiently scale video frames after decoding and before encoding, maintaining the entire processing path on the GPU. This reduces latency and maximizes throughput, especially for high-resolution or batch workloads.
-vf scale_npp=1280:720

Example pipeline:

ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v h264_cuvid -i input.mp4 \
  -vf scale_npp=1280:720:format=yuv420p \
  -c:v h264_nvenc -preset p1 -b:v 4M output.mp4

- scale_npp: GPU-based scaler using NPP
- format=yuv420p: scale_npp option that converts the pixel format on the GPU for NVENC compatibility (the software format filter cannot operate on CUDA frames)
- -c:v h264_nvenc: Encodes using NVIDIA NVENC hardware encoder
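If your FFmpeg build lacks the NPP library, the scale_cuda filter is a common alternative for GPU-side resizing. This sketch assumes an FFmpeg build with CUDA filter support and uses the same placeholder input name as above.

```shell
# Same GPU-resident pipeline, but with scale_cuda instead of scale_npp.
# scale_cuda ships with FFmpeg's CUDA support and does not require libnpp.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -c:v h264_cuvid -i input.mp4 \
  -vf scale_cuda=1280:720 \
  -c:v h264_nvenc -preset p1 -b:v 4M output.mp4
```

Check `ffmpeg -filters | grep cuda` to confirm which GPU scalers your build provides.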
Batch Transcoding with Parallel GPU Streams
Scaling to multiple files or streams is achieved by running several FFmpeg processes in parallel, each using NVDEC and NVENC. Tools like GNU Parallel can help manage these jobs, and monitoring with nvidia-smi dmon ensures you don't oversubscribe GPU resources. Assigning jobs to specific GPUs can further optimize resource allocation and prevent bottlenecks.
parallel -j 4 'ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i {} -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 5M output_{#}.mp4' ::: *.mp4

Explanation:
- -j 4: Runs 4 jobs in parallel
- -i {}: Placeholder for each input file
- output_{#}.mp4: Output named by job index
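To pin parallel jobs to specific GPUs on a multi-GPU host, one approach is to derive a device index from GNU Parallel's 1-based job slot number {%}. The two-GPU count below is an assumption; adjust the modulus for your system.

```shell
# Round-robin 4 concurrent jobs across 2 GPUs.
# GNU Parallel substitutes {%} (job slot) and {#} (job number) before the
# shell evaluates the arithmetic, and -hwaccel_device selects the GPU.
parallel -j 4 \
  'ffmpeg -hwaccel cuda -hwaccel_device $(( ({%} - 1) % 2 )) \
     -hwaccel_output_format cuda -i {} \
     -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 5M output_{#}.mp4' \
  ::: *.mp4
```

With -j 4 and two GPUs, each device then carries two concurrent NVDEC/NVENC sessions.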
Monitor GPU saturation:
nvidia-smi dmon

- Tracks the utilization of NVENC, NVDEC, and memory bandwidth
- Prevents oversubscription across multiple streams
Encoding with NVENC and Preset Tuning
NVENC provides multiple presets and rate control options to balance encoding speed and output quality. Selecting the right preset (e.g., p1 for speed, p7 for quality) and rate control mode (CBR, VBR, ConstQP) allows you to tailor the workflow to your needs. Advanced options like lookahead and B-frames can further improve quality or reduce latency, depending on your application.
-c:v h264_nvenc -preset p1 -rc cbr -b:v 5000k

Explanation:
- preset: p1 (fastest) to p7 (best quality)
- rc: cbr, vbr, constqp
- b:v: Target bitrate
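As a contrast to the speed-oriented CBR line above, a quality-oriented configuration might combine the p7 preset with VBR and a constant-quality target. The cq value of 23 here is an illustrative starting point, not a recommendation from the source.

```shell
# Quality-leaning NVENC settings: slowest preset, VBR rate control,
# and a constant-quality target (-cq 0..51, lower = better quality).
# -b:v 0 lets the CQ target drive the bitrate rather than a fixed cap.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v h264_cuvid -i input.mp4 \
  -c:v h264_nvenc -preset p7 -rc vbr -cq 23 -b:v 0 output.mp4
```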
To enable lookahead and B-frames:
-rc-lookahead 32 -bf 3 -b_ref_mode each

Explanation:
- -rc-lookahead 32: Looks ahead 32 frames for bitrate optimization
- -bf 3: Enables 3 B-frames
- -b_ref_mode each: Allows B-frames to be used as references
GPU-Only Pipeline Summary
A fully GPU-accelerated pipeline in FFmpeg decodes, scales, and encodes video without transferring frames back to the CPU. This setup ensures that the CPU is only responsible for orchestration and muxing, while all intensive frame processing remains on the GPU. The result is a significant reduction in processing time and system resource usage.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -c:v h264_cuvid -i input.mp4 \
  -vf scale_npp=1920:1080:format=yuv420p \
  -c:v h264_nvenc -preset p3 -rc vbr -b:v 6M output.mp4

Explanation:
- h264_cuvid: Hardware decode on GPU
- scale_npp: Resolution adjustment using GPU
- h264_nvenc: GPU-accelerated encoding
- No intermediate CPU-bound operations
- Output is muxed by CPU; all frame ops stay on device memory
Performance Profiling
To ensure optimal scaling and resource usage, monitor GPU utilization and encoding throughput with tools like nvidia-smi and dmon. FFmpeg's benchmarking options can provide per-frame timing and overall performance metrics. Keeping an eye on NVENC, NVDEC, and CUDA kernel activity helps identify bottlenecks and guides further optimization.
nvidia-smi dmon

Explanation:
- enc: NVENC encoder usage
- dec: NVDEC decoder usage
- sm: Streaming multiprocessor usage (CUDA filters)
- mem: Global memory bandwidth usage
For per-frame timing, enable FFmpeg benchmarking:
ffmpeg -benchmark -i input.mp4 ...

Explanation:
- -benchmark: Prints per-frame processing time and total elapsed time
- Helps analyze overhead introduced by individual stages (decode, filter, encode)
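To measure pipeline throughput without paying for disk writes, -benchmark can be combined with the null muxer. The command below mirrors the GPU-only pipeline from earlier and assumes the same placeholder input file.

```shell
# End-to-end GPU pipeline benchmark: decode, scale, and encode run normally,
# but the encoded output is discarded, isolating processing speed from I/O.
ffmpeg -benchmark -hwaccel cuda -hwaccel_output_format cuda \
  -c:v h264_cuvid -i input.mp4 \
  -vf scale_npp=1280:720:format=yuv420p \
  -c:v h264_nvenc -preset p3 -f null -
```

Comparing the bench utime/rtime figures against a decode-only run helps attribute time to the filter and encode stages.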