Efficient management of CUDA memory is a fundamental requirement in video applications where large frame buffers, high memory bandwidth, and low latency are critical factors. The performance of CUDA-based video pipelines depends heavily on how memory is allocated, accessed, and transferred between the host and GPU.

This includes leveraging pinned host memory for faster transfers, using zero-copy buffering to minimize overhead, and organizing processing tasks across multiple CUDA streams to maximize concurrency. Proper cleanup and synchronization further ensure stability and resource efficiency in demanding real-time video workloads.

Types of CUDA Memory

Global Memory

Global memory is the primary memory space accessible by all threads on the GPU. It is large in capacity but has high access latency compared to other memory types. In video applications, global memory is typically used for storing full-resolution frames, inference input/output tensors, and batch buffers. Efficient access patterns (e.g., coalesced reads/writes) are important for maximizing throughput.

Pinned (Page-Locked) Host Memory

Pinned, or page-locked, host memory is allocated on the CPU and locked so it cannot be paged out by the operating system. This enables direct memory access (DMA) transfers between host and device, significantly improving host-to-device (H2D) and device-to-host (D2H) transfer speeds.

Pinned memory is required for asynchronous CUDA memory copies (e.g., cudaMemcpyAsync) and is critical for real-time video streaming and low-latency applications. However, excessive use of pinned memory can reduce overall system performance, as it limits available pageable memory for the OS.

Unified Memory

Unified memory provides a single address space shared between the CPU and the GPU. It simplifies programming by automatically migrating data between the host and the device as needed. However, for high-throughput video workloads, unified memory can introduce unpredictable latency due to page migration and may significantly reduce frame rates and increase GPU utilization. It is generally not recommended for performance-critical video pipelines.

Device Memory

Device memory is allocated directly on the GPU using cudaMalloc. All CUDA kernels and most DeepStream plugins operate on device memory for maximum efficiency. Device memory is used for intermediate frame buffers, inference tensors, and results. Proper management is crucial to avoid fragmentation and leaks, especially in long-running or multi-stream applications.

Asset Management

Frame Buffer Allocation Strategy

Static Allocation

Static allocation involves pre-allocating memory buffers for the maximum expected batch size and frame resolution at application startup. This avoids runtime cudaMalloc calls, which are expensive and can cause fragmentation or latency spikes. For example, Pre-allocate memory buffers for the expected maximum batch size and resolution. Avoids runtime cudaMalloc calls.

code
cudaMalloc(&frame_buffer, width * height * channels);

Explanation:

  • For NV12 or P010, ensure 1.5 or 2 bytes per pixel, respectively.
  • For multiple frames, allocate a pool indexed by frame count or stream ID.

Memory Reuse and Buffer Pooling

Frequent allocation and deallocation of device buffers inside processing loops can lead to fragmentation and performance degradation. Instead, allocate a fixed pool of frame buffers at startup and reuse them across frames. For example:

Allocation

code
std::vector<uint8_t*> frame_pool;
for (int i = 0; i < max_frames; ++i) {
uint8_t* d_buf;
cudaMalloc(&d_buf, frame_size);
frame_pool.push_back(d_buf);
}

Explanation:

  • Track usage via flags or circular indexing.
  • Prevents GPU heap fragmentation and allocation overhead.

Zero-Copy Buffering

Zero-copy allows CUDA kernels to access host memory directly without explicit host-to-device copies. This is achieved by mapping pinned host memory into the device address space or by using EGL images for interoperability between CUDA and other APIs (such as OpenGL or video decoders)

code
cudaHostAlloc(&host_buf, size, cudaHostAllocMapped);
cudaHostGetDevicePointer(&dev_ptr, host_buf, 0);

Explanation:

  • host_buf is page-locked system RAM.
  • dev_ptr is a device-accessible alias to the same memory.
  • Suitable for use cases like live camera ingest or direct socket capture into GPU-accessible memory.
  • Bandwidth is lower than device memory, but it eliminates copy overhead.

Copying Data to/from GPU

Efficient data transfer between the host and the device is crucial in video applications. Use cudaMemcpy2D() or cudaMemcpy2DAsync() to move frame data, as these functions handle pitch (row alignment) and allow for asynchronous operation

Synchronous Copy

Synchronous copies block the CPU thread until the entire transfer is complete. These copies occur in the default stream and ensure that all previously issued CUDA work (on all streams) is completed before and after the copy.

code
cudaMemcpy2D(d_frame, pitch, h_frame, width, width, height, cudaMemcpyHostToDevice);

Explanation:

  • Use for simple or one-off transfers.
  • Requires all previous GPU work to be completed before and after the copy.

Asynchronous Copy

Asynchronous copies allow memory transfers to proceed in parallel with kernel execution, provided they occur on different CUDA streams. This is critical for maximizing throughput in pipelined video processing, where decoding, filtering, and encoding happen simultaneously.

code
cudaMemcpy2DAsync(d_frame, pitch, h_frame, width, width, height, cudaMemcpyHostToDevice, stream);

Explanation:

  • Use asynchronous copies to overlap data transfer and computation.
  • Ensure that h_frame is allocated with cudaHostAlloc() or cudaHostRegister() for async transfer compatibility.

Using Page-Locked (Pinned) Host Memory

Pinned memory ensures that the host buffer is not paged out and enables DMA transfers to/from the GPU.

Allocation Example

code
uint8_t* h_pinned;cudaHostAlloc(&h_pinned, buffer_size, cudaHostAllocDefault);

Explanation:

  • cudaHostAllocDefault: Allocates memory for both synchronous and asynchronous access.
  • Other flags include cudaHostAllocMapped for zero-copy and cudaHostAllocWriteCombined for write-optimized memory.

Stream-Optimized Frame Processing

To achieve full pipeline throughput, allocate multiple buffers and process them using separate CUDA streams.

Example:

code
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Decode in stream1
cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, stream1);

// Process in stream2
my_kernel<<<grid, block, 0, stream2>>>(d_input, d_output);

Explanation:

  • Synchronize streams only at critical points (e.g., encoder handoff).
  • Assign one stream per stage (decode, process, encode) to decouple frame operations.

Memory Cleanup

At application shutdown, it is essential to free all allocated resources to prevent memory leaks and ensure a clean exit.

code
cudaFree(d_frame);
cudaFreeHost(h_pinned);
cudaHostUnregister(h_buf);
cudaStreamDestroy(stream1);

Explanation:

  • Always synchronize streams before freeing associated memory.
  • For buffer pools, deallocate each cudaMalloc pointer in the pool vector.
  • Use cudaDeviceReset() at application exit to release all remaining allocations (for debug/test only).