How to use PyTorch to measure your memory bandwidth
Published on Mar 30, 2025 in Benchmarks, PyTorch
In my first PyTorch tutorial, I showed you how to create a neural network with PyTorch. But while we will most often use torch.nn to build neural networks, we can also use plain PyTorch simply to do tensor computations.
In this tutorial, you’ll learn how to measure your memory bandwidth with PyTorch: host to GPU, GPU to host, and intra-GPU. To do this, I’m going to create host and GPU tensors and time the different copies between them.
The code is quite simple and doesn’t need much more explanation. The torch.randn function creates a randomly initialized tensor, and torch.empty creates an uninitialized one. If no device is specified, the tensor is created on the host by default (the cpu device in torch).
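For instance, here is a minimal illustration of these two functions and the device argument (assuming a CUDA-capable GPU is available):

import torch

host_tensor = torch.randn(4)                  # random values, on the CPU by default
gpu_tensor = torch.empty(4, device="cuda")    # uninitialized, allocated on the GPU
print(host_tensor.device, gpu_tensor.device)  # cpu cuda:0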
import torch

# Number of elements: a tensor of 25 million float32 values (4 bytes each) is ~100 MB
SIZE = 25 * 10**6
ITERATIONS = 100


def measure_transfer(src, dst, iterations=ITERATIONS):
    # Use CUDA events to precisely measure the transfer time
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    # Warm-up pass
    dst.copy_(src)
    torch.cuda.synchronize()

    # Measure over several passes
    elapsed_times = []
    for _ in range(iterations):
        start_event.record()
        dst.copy_(src)
        end_event.record()
        # Ensure the device is ready before calling elapsed_time
        torch.cuda.synchronize()
        t_ms = start_event.elapsed_time(end_event)
        elapsed_times.append(t_ms)

    avg_time_ms = sum(elapsed_times) / len(elapsed_times)
    tensor_bytes = src.element_size() * src.nelement()
    # Convert: GB/s = (bytes / (ms / 1000)) / 10**9
    bandwidth = (tensor_bytes / (avg_time_ms / 1000)) / 10**9
    return avg_time_ms, bandwidth


def main():
    if not torch.cuda.is_available():
        print("CUDA is not available on this system.")
        return
    device = torch.device("cuda")

    # CPU tensor -> GPU tensor
    src_cpu = torch.randn(SIZE, dtype=torch.float32)
    dst_gpu = torch.empty(SIZE, dtype=torch.float32, device=device)
    t_ms, bw = measure_transfer(src_cpu, dst_gpu)
    print(f"Host to Device: Average time = {t_ms:.3f} ms, Bandwidth = {bw:.3f} GB/s")

    # GPU tensor -> CPU tensor
    src_gpu = torch.randn(SIZE, dtype=torch.float32, device=device)
    dst_cpu = torch.empty(SIZE, dtype=torch.float32)
    t_ms, bw = measure_transfer(src_gpu, dst_cpu)
    print(f"Device to Host: Average time = {t_ms:.3f} ms, Bandwidth = {bw:.3f} GB/s")

    # GPU tensor -> GPU tensor
    src_gpu_intra = torch.randn(SIZE, dtype=torch.float32, device=device)
    dst_gpu_intra = torch.empty(SIZE, dtype=torch.float32, device=device)
    t_ms, bw = measure_transfer(src_gpu_intra, dst_gpu_intra)
    print(f"Intra-Device: Average time = {t_ms:.3f} ms, Bandwidth = {bw:.3f} GB/s")


if __name__ == "__main__":
    main()
Here’s the output of the script on a VM with an RTX A5000 GPU:
Host to Device: Average time = 14.112 ms, Bandwidth = 7.086 GB/s
Device to Host: Average time = 14.340 ms, Bandwidth = 6.974 GB/s
Intra-Device: Average time = 0.315 ms, Bandwidth = 317.149 GB/s
The memory bandwidth announced by the manufacturer is 768 GB/s. A device-to-device copy both reads and writes every byte, so it actually moves twice the tensor size through memory, while the script only counts the tensor size once; I think that’s why the reported figure is close to half the spec. Even accounting for this, the result remains somewhat below the theoretical value, for various reasons (driver overhead, PyTorch implementation details…).
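A quick back-of-the-envelope check with the numbers above illustrates this:

# A copy reads the source and writes the destination, so the
# effective memory traffic is twice the tensor size.
reported_bw = 317.149           # GB/s, as printed by the script
effective_bw = 2 * reported_bw  # ~634 GB/s of actual memory traffic
spec_bw = 768                   # GB/s, manufacturer figure for the RTX A5000
print(f"{effective_bw:.0f} GB/s vs {spec_bw} GB/s spec "
      f"({effective_bw / spec_bw:.0%} of theoretical)")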
Still, this script is useful for taking measurements under real conditions and for comparing different configurations.
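For example, one configuration worth comparing is pinned (page-locked) host memory, which typically speeds up host-device transfers. Here is a minimal sketch reusing the measure_transfer function from the script above; the exact gain depends on your hardware:

# Variation to try: allocate the host tensor in pinned memory.
# Pinned pages can be transferred by DMA directly, which usually gives
# higher host <-> device bandwidth than pageable memory.
src_cpu_pinned = torch.randn(SIZE, dtype=torch.float32).pin_memory()
dst_gpu = torch.empty(SIZE, dtype=torch.float32, device="cuda")
t_ms, bw = measure_transfer(src_cpu_pinned, dst_gpu)
print(f"Host to Device (pinned): Average time = {t_ms:.3f} ms, Bandwidth = {bw:.3f} GB/s")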