How to use PyTorch to measure your memory bandwidth

Published on Mar 30, 2025 in Benchmarks  PyTorch  

In my first PyTorch tutorial, I showed you how to create a neural network with PyTorch.

Although we will most often use torch.nn to build neural networks, plain PyTorch can also be used simply for tensor computation.

In this tutorial, you’ll learn how to measure your memory bandwidth with PyTorch: host to GPU, GPU to host, and intra-GPU. To do this, I’m going to create tensors on the host and on the GPU, then time copies between them to measure each bandwidth.

The code is quite simple and doesn’t need much more explanation. The torch.randn function creates a randomly initialized tensor, and torch.empty creates an uninitialized tensor. If the device is not specified, the tensor is created by default on the host (cpu device in torch).
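Before the full script, here is a minimal CPU-only sketch of the calls just mentioned. The small size `n` is an arbitrary value chosen for illustration; it assumes PyTorch is installed and that the default device has not been changed.

```python
import torch

# Small size for illustration; the benchmark script uses 25 million elements.
n = 1_000

src = torch.randn(n, dtype=torch.float32)   # values drawn from a normal distribution
dst = torch.empty(n, dtype=torch.float32)   # memory allocated, contents undefined

# Without a device argument, both tensors live in host memory.
print(src.device, dst.device)

# Size in bytes = bytes per element * number of elements.
tensor_bytes = src.element_size() * src.nelement()
print(tensor_bytes)

# copy_ overwrites dst in place with the contents of src.
dst.copy_(src)
print(torch.equal(src, dst))
```

The same `element_size() * nelement()` product is what the script uses to convert a copy time into a bandwidth.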

```python
import torch
import time

# Number of elements, here a tensor with 25 million float32 (4 bytes per element) ~100 MB
SIZE = 25 * 10**6
ITERATIONS = 100

def measure_transfer(src, dst, iterations=ITERATIONS):
    # Using CUDA events to precisely measure the transfer time
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    # Warm-up pass
    dst.copy_(src)
    torch.cuda.synchronize()
    # Measure over several passes
    elapsed_times = []
    for _ in range(iterations):
        start_event.record()
        dst.copy_(src)
        end_event.record()
        # Ensure the device is ready before calling elapsed_time
        torch.cuda.synchronize()
        t_ms = start_event.elapsed_time(end_event)
        elapsed_times.append(t_ms)
    avg_time_ms = sum(elapsed_times) / len(elapsed_times)
    tensor_bytes = src.element_size() * src.nelement()
    # Convert: GB/s = (bytes / (ms/1000)) / (10**9)
    bandwidth = (tensor_bytes / (avg_time_ms / 1000)) / 10**9
    return avg_time_ms, bandwidth

def main():
    if not torch.cuda.is_available():
        print("CUDA is not available on this system.")
        return
    device = torch.device("cuda")
    # CPU tensor -> GPU tensor
    src_cpu = torch.randn(SIZE, dtype=torch.float32)
    dst_gpu = torch.empty(SIZE, dtype=torch.float32, device=device)
    t_ms, bw = measure_transfer(src_cpu, dst_gpu)
    print(f"Host to Device: Average time = {t_ms:.3f} ms, Bandwidth = {bw:.3f} GB/s")
    # GPU tensor -> CPU tensor
    src_gpu = torch.randn(SIZE, dtype=torch.float32, device=device)
    dst_cpu = torch.empty(SIZE, dtype=torch.float32)
    t_ms, bw = measure_transfer(src_gpu, dst_cpu)
    print(f"Device to Host: Average time = {t_ms:.3f} ms, Bandwidth = {bw:.3f} GB/s")
    # GPU tensor -> GPU tensor
    src_gpu_intra = torch.randn(SIZE, dtype=torch.float32, device=device)
    dst_gpu_intra = torch.empty(SIZE, dtype=torch.float32, device=device)
    t_ms, bw = measure_transfer(src_gpu_intra, dst_gpu_intra)
    print(f"Intra-Device: Average time = {t_ms:.3f} ms, Bandwidth = {bw:.3f} GB/s")

if __name__ == "__main__":
    main()
```

Here’s the result of the script on a VM with an RTX A5000 GPU.

```plaintext
Host to Device: Average time = 14.112 ms, Bandwidth = 7.086 GB/s
Device to Host: Average time = 14.340 ms, Bandwidth = 6.974 GB/s
Intra-Device: Average time = 0.315 ms, Bandwidth = 317.149 GB/s
```
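As a quick sanity check, the bandwidths can be recomputed from the tensor size and the average times with the same conversion the script uses. This is only a sketch: the helper name `bandwidth_gb_s` is hypothetical, and the times are the rounded values printed above, so the last digits may differ slightly from the script's output.

```python
# Tensor size used in the script: 25 million float32 values, 4 bytes each.
TENSOR_BYTES = 25 * 10**6 * 4  # 100,000,000 bytes

def bandwidth_gb_s(tensor_bytes, avg_time_ms):
    """Hypothetical helper: same conversion as in measure_transfer (decimal GB/s)."""
    return (tensor_bytes / (avg_time_ms / 1000)) / 10**9

# Average times (ms) copied from the printed results, already rounded to 3 decimals.
reported_ms = {
    "Host to Device": 14.112,
    "Device to Host": 14.340,
    "Intra-Device": 0.315,
}

for name, t_ms in reported_ms.items():
    print(f"{name}: {bandwidth_gb_s(TENSOR_BYTES, t_ms):.1f} GB/s")
```

The intra-device value recomputed this way is a little higher than the printed 317.149 GB/s, because 0.315 ms is itself a rounded figure.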

The memory bandwidth specified by the manufacturer is 768 GB/s. Since a copy requires both a read and a write, each byte counted by the script crosses the memory bus twice, which I think is why the measured figure is closer to half of this value. For various reasons (driver and PyTorch optimization…), the result is still lower than the theoretical value.
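The read-plus-write reasoning can be put into numbers. The sketch below assumes that each copied byte is read once and written once, which is a simplification of what the hardware actually does; the variable names are mine.

```python
measured_copy_gb_s = 317.149   # intra-device result printed by the script
spec_gb_s = 768.0              # manufacturer figure quoted in the text

# Assumption: every copied byte is read once and written once,
# so the memory traffic is twice the number of bytes copied.
traffic_gb_s = 2 * measured_copy_gb_s
fraction_of_spec = traffic_gb_s / spec_gb_s

print(f"Estimated memory traffic: {traffic_gb_s:.1f} GB/s")
print(f"Fraction of the quoted figure: {fraction_of_spec:.0%}")
```

Under that assumption, the estimated traffic is roughly 634 GB/s, or about 83% of the quoted figure.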

Even so, the script is useful for taking measurements under real conditions and for comparing different configurations.