Benchmarking video encoding on NVIDIA GPUs involves measuring the performance of the NVENC hardware engine across different conditions. This includes evaluating throughput in frames per second, GPU encoder utilization, and encoding latency.

Results vary depending on resolution, bitrate, rate control modes, and selected NVENC presets. The process uses standardized input, consistent encoding commands, and system-level profiling tools to ensure repeatable and comparable results.

Benchmark Goals and Scope

The primary focus is to measure:

  • Average encoded frames per second (FPS)
  • Peak GPU NVENC engine utilization
  • Encoding latency per frame
  • Impact of resolution and bitrate on throughput
  • Behavior across different presets and rate control modes

Only NVENC hardware encoding is considered; software encoders and decoding steps are excluded.

Benchmark Setup and Tools

Hardware Requirements:

  • NVIDIA GPU with NVENC support (Turing and newer preferred)
  • Driver version 450.xx or newer
  • Sufficient VRAM for 4K workloads
  • Dedicated system with no GUI rendering load

Software Stack:

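  • FFmpeg built with NVENC support (configured with --enable-nvenc)
  • nvidia-smi (installed with the NVIDIA driver) for utilization monitoring; the CUDA Toolkit is optional, and its legacy nvprof profiler does not support GPUs with compute capability 8.0 and higher (Ampere and newer)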
  • Raw YUV test videos (e.g., NV12 format)

Test Video Preparation

Consistent benchmarking requires standardized raw input files, typically in NV12 or yuv420p format. Synthetic test videos can be generated with FFmpeg, which guarantees identical content and frame counts across runs. Raw, uncompressed input also keeps decoding out of the measurement, and converting it to NV12 matches the input layout NVENC uses natively.

code
ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 -frames:v 300 -pix_fmt yuv420p test.yuv

Explanation:

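  • -f lavfi -i testsrc: Uses FFmpeg's lavfi virtual input with the testsrc source, which generates a synthetic test pattern.
  • size=1920x1080: Target resolution (an option of testsrc, not a separate flag).
  • rate=30: Frame rate (also a testsrc option).
  • -frames:v 300: Total number of frames.
  • -pix_fmt yuv420p: Writes 8-bit planar 4:2:0 frames; with the .yuv extension the output is a headerless raw file.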

Convert to NV12 if required:

code
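ffmpeg -f rawvideo -pix_fmt yuv420p -s:v 1920x1080 -r 30 -i test.yuv -pix_fmt nv12 -f rawvideo test_nv12.yuv

Explanation:

  • -f rawvideo -pix_fmt yuv420p -s:v 1920x1080 -r 30 (before -i): Describe the input. A raw .yuv file stores no resolution or pixel-format metadata, so FFmpeg must be told how to interpret it.
  • -pix_fmt nv12 (after -i): Converts the frames to NV12, NVENC's native 8-bit 4:2:0 layout (a full-resolution Y plane followed by a single plane of interleaved U/V samples).
  • FFmpeg's h264_nvenc also accepts yuv420p, so this step is optional, but feeding NV12 matches the encoder's native input format.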
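Because a raw .yuv file has no header, FFmpeg cannot verify -s or -pix_fmt against the data; a mismatch typically just yields scrambled frames. One quick sanity check is to compare the file size with the size implied by the resolution: 8-bit 4:2:0 formats such as yuv420p and NV12 both use 1.5 bytes per pixel. The C++ sketch below is illustrative, and its function names are hypothetical helpers, not part of FFmpeg.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one 8-bit 4:2:0 frame (yuv420p or NV12): a full-resolution Y plane
// plus U and V subsampled 2x horizontally and vertically. NV12 interleaves U/V
// in one plane and yuv420p uses two planes, but the total size is the same.
std::uint64_t frame_bytes_420(std::uint64_t width, std::uint64_t height) {
    assert(width % 2 == 0 && height % 2 == 0);  // 4:2:0 needs even dimensions
    const std::uint64_t luma = width * height;
    const std::uint64_t chroma = 2 * (width / 2) * (height / 2);  // U + V
    return luma + chroma;
}

// Expected size of a headerless raw file holding `frames` frames.
std::uint64_t raw_file_bytes(std::uint64_t width, std::uint64_t height,
                             std::uint64_t frames) {
    return frame_bytes_420(width, height) * frames;
}

// Whole frames in a file of `file_bytes`, or -1 if the size is not a multiple
// of the frame size (a sign of a wrong -s or -pix_fmt).
std::int64_t frames_in_file(std::uint64_t file_bytes, std::uint64_t width,
                            std::uint64_t height) {
    const std::uint64_t fb = frame_bytes_420(width, height);
    if (file_bytes % fb != 0) return -1;
    return static_cast<std::int64_t>(file_bytes / fb);
}
```

For the 300-frame 1080p clip, frame_bytes_420(1920, 1080) is 3,110,400 bytes and the expected file size is 933,120,000 bytes (about 933 MB), which can be compared against the size reported by ls -l.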

FFmpeg Encoding Benchmark Command

The following command encodes the raw NV12 input with FFmpeg's h264_nvenc encoder, specifying the pixel format, resolution, preset, rate control mode, and bitrate. Output goes to FFmpeg's null muxer so that disk writes cannot become a bottleneck and only encoding performance is measured. Repeat the test for different resolutions and presets to characterize the GPU fully.

code
ffmpeg -f rawvideo -pix_fmt nv12 -s:v 1920x1080 -r 30 -i test_nv12.yuv \
  -c:v h264_nvenc -preset p1 -rc cbr -b:v 4M -f null -

Explanation:

  • -f rawvideo: Indicates raw uncompressed input.
  • -pix_fmt nv12: Specifies NVENC-compatible pixel format.
  • -s:v 1920x1080: Sets frame size.
  • -r 30: Input frame rate.
  • -i test_nv12.yuv: Input file path.
  • -c:v h264_nvenc: Selects NVENC encoder for H.264 output.
  • -preset p1: Uses the fastest preset (p1 = maximum throughput, p7 = best quality).
  • -rc cbr: Enables constant bitrate mode.
  • -b:v 4M: Sets average target bitrate to 4 Mbps.
  • -f null -: Sends output to the null muxer, which discards it and removes disk write I/O from the measurement.
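The null muxer removes write I/O, but FFmpeg still reads the uncompressed input at frame size × encoding rate, and at NVENC speeds that is a lot of data. A back-of-the-envelope estimate shows whether storage could cap the measured FPS instead of the encoder. This C++ sketch uses a hypothetical helper name and assumes 8-bit 4:2:0 input.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per second FFmpeg must read from a raw 8-bit 4:2:0 input (NV12 or
// yuv420p, 1.5 bytes per pixel) to sustain `fps` encoded frames per second.
double raw_read_bytes_per_sec(std::uint64_t width, std::uint64_t height, double fps) {
    assert(width % 2 == 0 && height % 2 == 0);  // 4:2:0 needs even dimensions
    const double frame_bytes = static_cast<double>(width * height * 3 / 2);
    return frame_bytes * fps;
}
```

At the 850 fps example figure used for 1080p in this document, the input must be read at 3,110,400 × 850 = 2,643,840,000 bytes/s (about 2.6 GB/s), well beyond what a SATA SSD can deliver. Serving the input from the OS page cache or a RAM disk avoids this; the 300-frame clip (about 933 MB) fits in memory on most benchmark machines.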

Metrics Collection

During encoding, FFmpeg reports the running average frames per second, which serves as the primary throughput metric. NVENC engine utilization can be monitored with nvidia-smi dmon to assess how close the hardware is to saturation. Together, the two metrics show whether throughput is limited by the encoder itself or by another part of the pipeline: FPS that plateaus while NVENC utilization stays well below 100% points to a bottleneck elsewhere.

a. Frames Per Second (FPS)

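FFmpeg prints a progress line while encoding; the final line reports averages for the whole run:

code
frame= 300 fps=850 q=31.0 Lsize= ... speed=28.3x

Explanation:

  • fps=850: Average number of frames encoded per second over the run.
  • speed=28.3x: Throughput as a multiple of real time (30 fps × 28.3 ≈ 849 fps, consistent with the fps field).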
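When runs are scripted, the fps and speed values can be extracted from FFmpeg's captured stderr. FFmpeg rewrites the progress line in place with carriage returns, so a captured log may hold many updates on one line; the sketch below therefore searches from the end to pick up the final averages. It is a C++ sketch with hypothetical helper names, and it assumes the key=value layout shown above, which is FFmpeg's human-readable output rather than a stable interface (FFmpeg's -progress option provides a machine-readable alternative).

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>

// Returns the number that follows the LAST occurrence of `key` (e.g. "fps="
// or "speed=") in captured FFmpeg progress output, or nullopt if the key is
// missing or not followed by a number (e.g. "N/A").
std::optional<double> progress_field(const std::string& line, const std::string& key) {
    const std::size_t pos = line.rfind(key);
    if (pos == std::string::npos) return std::nullopt;
    const char* begin = line.c_str() + pos + key.size();
    char* end = nullptr;
    const double value = std::strtod(begin, &end);  // skips padding such as "fps= 870"
    if (end == begin) return std::nullopt;           // nothing numeric after the key
    return value;                                    // "28.3x" parses as 28.3
}

// Cross-check: average FPS should be roughly speed x source frame rate.
double fps_from_speed(double speed, double source_fps) {
    return speed * source_fps;
}
```

For example, progress_field(log, "fps=") on the final line above yields 850, and fps_from_speed(28.3, 30.0) gives about 849, so a large gap between the two usually means the source frame rate was specified differently than assumed.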

b. NVENC Utilization

Use:

code
nvidia-smi dmon -s u

Explanation:

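  • -s u: Selects the utilization metrics: SM, memory, encoder (enc) and decoder (dec), sampled once per second by default.
  • The enc column reports NVENC utilization as a percentage for each sample; values near 100% indicate that the encoder is saturated.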
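To turn a dmon capture into the peak and average NVENC utilization figures listed in the benchmark goals, the log can be summarized after the run. This C++ sketch (hypothetical names) assumes the usual dmon layout: header rows starting with #, a first column holding the GPU index, and - for unavailable values. Because the set of columns differs between driver versions, it locates the enc column by name rather than by position.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

struct EncUtilSummary {
    double mean = 0.0;  // average NVENC utilization over the samples (%)
    int peak = 0;       // highest single sample (%)
    int samples = 0;    // number of samples used
};

// Summarizes the "enc" column of `nvidia-smi dmon -s u` output for one GPU.
std::optional<EncUtilSummary> summarize_enc(const std::string& dmon_text, int gpu = 0) {
    std::istringstream in(dmon_text);
    std::string line;
    int enc_col = -1;
    long long total = 0;
    EncUtilSummary s;
    while (std::getline(in, line)) {
        const std::size_t first = line.find_first_not_of(" \t");
        if (first == std::string::npos) continue;  // blank line
        const bool header = line[first] == '#';
        if (header) line[first] = ' ';              // "# gpu sm mem enc ..." -> names
        std::istringstream fields(line);
        std::vector<std::string> f;
        for (std::string t; fields >> t;) f.push_back(t);
        if (header) {                               // headers repeat; re-read each time
            for (std::size_t i = 0; i < f.size(); ++i)
                if (f[i] == "enc") enc_col = static_cast<int>(i);
            continue;
        }
        if (enc_col < 0 || f.size() <= static_cast<std::size_t>(enc_col)) continue;
        if (f[0] != std::to_string(gpu)) continue;  // first column is the GPU index
        if (f[enc_col] == "-") continue;            // value not reported this sample
        const int pct = std::stoi(f[enc_col]);      // throws on unexpected tokens
        total += pct;
        s.peak = std::max(s.peak, pct);
        ++s.samples;
    }
    if (s.samples == 0) return std::nullopt;
    s.mean = static_cast<double>(total) / s.samples;
    return s;
}
```

The log can be captured while the encode runs, for example with nvidia-smi dmon -s u > dmon.log, and the capture stopped once FFmpeg finishes.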

Testing Multiple Presets and Bitrates

To assess performance under different encoder workloads, vary:

  • Presets: p1 (fastest) to p7 (highest quality)
  • Bitrates: 2M, 4M, 8M, 16M
  • Resolutions: 720p, 1080p, 4K

Example matrix:

GPU        Resolution  Preset  Bitrate  FPS  NVENC Util
RTX 3060   1080p       p1      4M       850  95%
RTX 3060   1080p       p7      4M       400  90%
RTX 3060   4K          p1      8M       300  98%
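The full sweep over presets p1 to p7, the four bitrates, and the three resolutions is 7 × 4 × 3 = 84 runs, so generating the commands is less error-prone than typing them. A C++ sketch, assuming hypothetical per-resolution input files named test_nv12_<W>x<H>.yuv (prepared as in the test video step) and the same CBR settings as the single-run command:

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Resolution { std::string label; int width; int height; };

// Builds one FFmpeg NVENC command per (resolution, preset, bitrate) combination.
// The input file naming scheme test_nv12_<W>x<H>.yuv is hypothetical.
std::vector<std::string> build_nvenc_matrix() {
    const std::vector<Resolution> resolutions = {
        {"720p", 1280, 720}, {"1080p", 1920, 1080}, {"4K", 3840, 2160}};
    const std::vector<std::string> bitrates = {"2M", "4M", "8M", "16M"};
    std::vector<std::string> cmds;
    for (const auto& r : resolutions) {
        const std::string size = std::to_string(r.width) + "x" + std::to_string(r.height);
        for (int preset = 1; preset <= 7; ++preset)      // p1 (fastest) .. p7
            for (const auto& b : bitrates)
                cmds.push_back("ffmpeg -f rawvideo -pix_fmt nv12 -s:v " + size +
                               " -r 30 -i test_nv12_" + size + ".yuv"
                               " -c:v h264_nvenc -preset p" + std::to_string(preset) +
                               " -rc cbr -b:v " + b + " -f null -");
    }
    assert(cmds.size() == resolutions.size() * 7 * bitrates.size());
    return cmds;
}
```

Each command can then be run several times in sequence while its FFmpeg output and a dmon log are captured, filling one row of a results table like the one above per combination.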

Comparing Multi-Stream Encoding

Simultaneous encoding capacity is tested by running multiple FFmpeg processes in parallel, each encoding a separate stream. Observing how per-stream FPS drops and how NVENC utilization rises as streams are added reveals the GPU's multi-session limits and helps identify bottlenecks in parallel workloads. Note that on consumer (GeForce) GPUs the driver limits the number of concurrent NVENC sessions; the exact limit depends on the driver version, and sessions beyond it fail to initialize.

code
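ffmpeg -re -stream_loop -1 -f rawvideo -pix_fmt nv12 -s 1920x1080 -r 30 -i test_nv12.yuv \
  -c:v h264_nvenc -gpu 0 -preset p1 -b:v 4M -f null -

Explanation:

  • -re: Reads input at its native frame rate, so each stream is paced to real time; a stream whose fps falls below 30 (speed below 1x) is no longer keeping up. Omit -re to measure maximum aggregate throughput instead.
  • -stream_loop -1: Loops the input indefinitely; stop each process with q or Ctrl+C.
  • -r 30: Input frame rate; raw input otherwise defaults to 25 fps, which -re would then enforce.
  • -gpu 0: Selects the GPU (by index) that performs the encoding.
  • Launch 2 to 4 instances and monitor the FPS of each alongside nvidia-smi metrics.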
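Each parallel process reports its own fps. Summing them gives aggregate throughput, and counting the streams that still reach the target rate shows where real-time capacity runs out. This C++ sketch uses hypothetical names; the 0.98 tolerance (within 2% of target) is an arbitrary illustrative threshold, not an NVIDIA or FFmpeg figure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct MultiStreamResult {
    double aggregate_fps = 0.0;        // sum of per-stream FPS
    std::size_t realtime_streams = 0;  // streams at or near the target rate
    bool all_realtime = false;         // true if every stream kept up
};

// Summarizes per-stream FPS readings from parallel FFmpeg processes run with -re.
MultiStreamResult summarize_streams(const std::vector<double>& stream_fps,
                                    double target_fps, double tolerance = 0.98) {
    assert(target_fps > 0.0 && tolerance > 0.0 && tolerance <= 1.0);
    MultiStreamResult r;
    for (double fps : stream_fps) {
        r.aggregate_fps += fps;
        if (fps >= target_fps * tolerance) ++r.realtime_streams;
    }
    r.all_realtime = !stream_fps.empty() && r.realtime_streams == stream_fps.size();
    return r;
}
```

Repeating this for 2, 3, 4, and more instances shows the stream count at which all_realtime first becomes false, which is the practical real-time session capacity for that preset and resolution.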

Encoding Latency Measurement (SDK-only)

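When using the NVIDIA Video Codec SDK directly, per-frame latency can be measured by timestamping before and after each EncodeFrame call. Averaging these measurements over many frames quantifies real-time responsiveness, which matters for live streaming and other low-latency applications. Note that this captures only the time spent inside the call: if the encoder buffers frames (for example, with B-frames or lookahead enabled), a frame's bitstream can be returned by a later call, so end-to-end latency may be higher.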

code
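#include <chrono>

// `encoder` is the application's encoder object (for example, the NvEncoder
// helper class from the Video Codec SDK samples); arguments are elided.
auto t_start = std::chrono::steady_clock::now();
encoder.EncodeFrame(...);
auto t_end = std::chrono::steady_clock::now();
double latency_ms = std::chrono::duration<double, std::milli>(t_end - t_start).count();

Explanation:

  • t_start and t_end: Timestamps taken immediately before and after the call. steady_clock is used because it is monotonic, whereas high_resolution_clock may be an alias of the adjustable system clock.
  • latency_ms: Time spent in the EncodeFrame call for this frame, in milliseconds.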
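Averages hide the occasional slow frame that matters most for live streaming, so it is worth reporting percentiles and the maximum alongside the mean. A C++ sketch with hypothetical names, using the nearest-rank percentile definition and an optional number of warm-up samples to discard (the first calls can include one-time setup work):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct LatencyStats {
    double mean_ms = 0.0;
    double p50_ms = 0.0;
    double p95_ms = 0.0;
    double max_ms = 0.0;
};

// Nearest-rank percentile of a sorted, non-empty vector (pct in 1..100).
static double percentile_sorted(const std::vector<double>& sorted, unsigned pct) {
    const std::size_t n = sorted.size();
    const std::size_t rank = (pct * n + 99) / 100;  // ceil(pct / 100 * n), integer math
    return sorted[rank - 1];
}

// Summarizes per-frame latencies (ms) collected around EncodeFrame calls,
// dropping the first `warmup` samples.
LatencyStats summarize_latency(std::vector<double> samples_ms, std::size_t warmup = 0) {
    assert(samples_ms.size() > warmup);
    samples_ms.erase(samples_ms.begin(),
                     samples_ms.begin() + static_cast<std::ptrdiff_t>(warmup));
    std::sort(samples_ms.begin(), samples_ms.end());
    LatencyStats s;
    s.mean_ms = std::accumulate(samples_ms.begin(), samples_ms.end(), 0.0) /
                static_cast<double>(samples_ms.size());
    s.p50_ms = percentile_sorted(samples_ms, 50);
    s.p95_ms = percentile_sorted(samples_ms, 95);
    s.max_ms = samples_ms.back();
    return s;
}
```

At 30 fps the per-frame budget is 1000 / 30 ≈ 33.3 ms, so a p95 well below that indicates headroom for real-time encoding of a single stream.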

Recommendations for Consistent Benchmarking

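  • Stop the desktop environment (or at least its compositor) so that no display workload shares the GPU
  • Pin FFmpeg to specific CPU cores with taskset (for example, taskset -c 0-3 ffmpeg ...)
  • Enable persistence mode so the driver stays initialized between runs: nvidia-smi -pm 1 (Linux, requires root)
  • Repeat each test 3 to 5 times and average the results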
  • Use a dedicated SSD or RAMDISK to avoid I/O bottlenecks if outputting to disk
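An average alone hides how much the repeated runs disagree; reporting the spread alongside the mean shows whether 3 to 5 repetitions were enough. A C++ sketch with hypothetical names, computing the mean, the sample standard deviation, and the coefficient of variation for one metric such as FPS:

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

struct RunSummary {
    double mean = 0.0;
    double stddev = 0.0;  // sample standard deviation (n - 1 denominator)
    double cv = 0.0;      // coefficient of variation: stddev / mean
};

// Aggregates one metric (e.g. FPS) over repeated runs of the same test.
RunSummary summarize_runs(const std::vector<double>& runs) {
    assert(runs.size() >= 2);  // at least two runs are needed for a spread estimate
    const double n = static_cast<double>(runs.size());
    RunSummary r;
    r.mean = std::accumulate(runs.begin(), runs.end(), 0.0) / n;
    double sq = 0.0;
    for (double x : runs) sq += (x - r.mean) * (x - r.mean);
    r.stddev = std::sqrt(sq / (n - 1.0));
    r.cv = r.mean != 0.0 ? r.stddev / r.mean : 0.0;
    return r;
}
```

With illustrative (not measured) FPS readings of 850, 845, and 855, the mean is 850 and the standard deviation is 5, a coefficient of variation of about 0.6%. As a rule of thumb, a much larger spread suggests background load or thermal throttling rather than encoder behavior.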