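Efficient Computing with CUDA

From environment setup to the programming model and optimization strategies: how to use the GPU for speedups that can reach an order of magnitude on suitable workloads.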

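Intended Audience

Engineers and researchers who need substantial GPU acceleration for numerical computing, deep learning, image processing, time-series processing, and similar tasks.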

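Core Concepts at a Glance

GPUs suit large-scale data-parallel (SIMT) and throughput-bound workloads.
CUDA is NVIDIA's parallel computing platform and programming model, accessible from C/C++, Fortran, and the Python ecosystem, among others.
Key components: CUDA Toolkit (compiler, runtime, and math libraries), cuDNN (deep learning primitives), NCCL (multi-GPU communication), and TensorRT (inference optimization).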

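Environment Setup

1) Install the NVIDIA driver and the CUDA Toolkit

On Windows or Ubuntu, install the official NVIDIA driver first, then a CUDA version that matches your framework (check the target framework's official compatibility matrix).
Prefer framework builds that bundle CUDA (for example, the PyTorch pip/conda packages built for a specific CUDA version) to reduce the risk of version mismatches.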

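2) cuDNN and NCCL

Install cuDNN for deep learning and NCCL for multi-GPU distributed training (Linux). Many frameworks already bundle both in their prebuilt packages.

3) Quickly verify that the GPU is usable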

import torch
print('cuda available:', torch.cuda.is_available())
print('device count  :', torch.cuda.device_count())
print('device name    :', torch.cuda.get_device_name(0) if torch.cuda.is_available() else '-')
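
A framework-independent check can be sketched by querying nvidia-smi directly. The helper names parse_gpu_csv and query_gpus are hypothetical; the --query-gpu and --format options are standard nvidia-smi flags. On a machine without the tool, the function simply returns an empty list.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory.total, driver_version' CSV rows into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip anything that is not a well-formed row
        name, mem_mib, driver = parts
        gpus.append({
            "name": name,
            # memory may be reported as "[N/A]" on some devices
            "memory_mib": int(mem_mib) if mem_mib.isdigit() else None,
            "driver": driver,
        })
    return gpus


def query_gpus():
    """Return one dict per visible GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

An empty result means either that the driver is not installed or that nvidia-smi is not on the PATH, which is worth ruling out before debugging the framework itself.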

Three Main Paths in the Python Ecosystem

A. PyTorch (deep learning and tensor computation)

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn((4096, 4096), device=device)
w = torch.randn((4096, 4096), device=device)
for _ in range(10):
    y = x @ w  # runs an optimized GEMM on the GPU
torch.cuda.synchronize()
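
Because GPU kernels launch asynchronously, timing a loop like the one above needs explicit synchronization. The following is a minimal sketch, not a rigorous benchmark: the function name time_matmul and its default sizes are assumptions, and it falls back to wall-clock timing when no GPU is present.

```python
import time

import torch


def time_matmul(n=1024, iters=10, warmup=3):
    """Return the average time in milliseconds of one n x n matmul."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.randn(n, n, device=device)
    w = torch.randn(n, n, device=device)

    for _ in range(warmup):  # warm-up keeps one-time setup costs out of the timing
        x @ w

    if device.type == "cuda":
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            x @ w
        end.record()
        torch.cuda.synchronize()  # wait until `end` has actually been reached
        return start.elapsed_time(end) / iters  # elapsed_time is in milliseconds

    t0 = time.perf_counter()
    for _ in range(iters):
        x @ w
    return (time.perf_counter() - t0) * 1000.0 / iters


if __name__ == "__main__":
    print(f"avg matmul time: {time_matmul():.3f} ms")
```

Without the synchronization, the measured time would mostly reflect how long it takes to enqueue the kernels rather than to run them.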

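Key points:

Place tensors on the cuda device and keep the whole operator chain on the GPU to avoid frequent CPU/GPU copies.
For training, enable automatic mixed precision (AMP) and cudnn benchmark: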

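import torch

# Assumes `model`, `loader`, and `device` are already defined.
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()  # assumed loss; substitute your own
scaler = torch.cuda.amp.GradScaler(enabled=True)  # newer releases prefer torch.amp.GradScaler('cuda')

torch.backends.cudnn.benchmark = True  # autotunes convolutions; helps most when input shapes are fixed

for inputs, targets in loader:  # avoid shadowing the built-in `input`
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()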
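
The loop above depends on objects defined elsewhere. As a self-contained sketch of the same pattern, the following uses a toy model and synthetic data (both assumptions made for illustration) and also runs on a CPU-only machine, where it switches to BF16 autocast and disables loss scaling.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"

# Toy model and synthetic data, for illustration only.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# FP16 needs loss scaling; BF16 (used here on CPU) generally does not.
amp_dtype = torch.float16 if use_cuda else torch.bfloat16
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

losses = []
for step in range(5):
    inputs = torch.randn(16, 32, device=device)
    targets = torch.randint(0, 4, (16,), device=device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device.type, dtype=amp_dtype):
        logits = model(inputs)                   # forward pass in reduced precision
    loss = criterion(logits.float(), targets)    # compute the loss in FP32
    scaler.scale(loss).backward()                # plain backward when the scaler is disabled
    scaler.step(optimizer)                       # skips the step if gradients hold inf/NaN
    scaler.update()
    losses.append(loss.item())

print(losses)
```

Computing the loss in FP32 outside the autocast region is a conservative choice made here; autocast already promotes many loss functions to FP32 on its own.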

B. CuPy (a GPU drop-in replacement for NumPy)

import cupy as cp
a = cp.random.randn(10_000, 10_000)
b = cp.random.randn(10_000, 10_000)
c = a.dot(b)  # dispatches to cuBLAS
cp.cuda.Stream.null.synchronize()

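Advantage: an API nearly identical to NumPy's. Drawback: interoperability with third-party libraries that expect NumPy arrays needs care, typically an explicit conversion such as cp.asnumpy().

C. Numba CUDA (custom kernels)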

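from numba import cuda
import numpy as np

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)  # global thread index
    if i < x.size:    # guard threads past the end of the array
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

# Copy inputs to the device explicitly; passing NumPy arrays also works
# but triggers an implicit host/device round trip on every launch.
d_x = cuda.to_device(x)
d_y = cuda.to_device(y)
d_out = cuda.device_array_like(d_x)

threads = 256
blocks = (n + threads - 1) // threads  # ceiling division

saxpy[blocks, threads](np.float32(2.0), d_x, d_y, d_out)
cuda.synchronize()
out = d_out.copy_to_host()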

Well suited to operator-level optimization or to special computations that existing libraries do not cover.
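
The launch configuration used in the kernel call above is a ceiling division: enough blocks to cover every element, with the bounds check masking off the surplus threads in the last block. A small sketch of that arithmetic (the helper name launch_config is hypothetical):

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks, threads) such that blocks * threads >= n."""
    if n <= 0:
        raise ValueError("n must be positive")
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, threads_per_block


if __name__ == "__main__":
    for n in (1, 256, 257, 1_000_000):
        blocks, threads = launch_config(n)
        idle = blocks * threads - n  # threads masked off by the bounds check
        print(f"n={n:>9}  blocks={blocks:>5}  idle threads={idle}")
```

For n = 1,000,000 and 256 threads per block this gives 3907 blocks, of which the last one leaves 192 threads idle.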

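10 Performance Optimization Practices

Keep data resident on the GPU: avoid frequent .cpu()/.numpy() round trips, and move data to the GPU in batches.
Batching and fusion: increase the batch size and merge small kernels to cut kernel launch overhead.
Mixed precision (FP16/BF16): exploits Tensor Cores. Watch numerical stability; AMP's loss scaling handles the common FP16 underflow and overflow cases.
cuDNN/cuBLAS autotuning: enable cudnn.benchmark when shapes are fixed; with irregular shapes, turn it off to avoid repeated tuning cost.
Multiple streams and asynchrony: use torch.cuda.Stream() or cp.cuda.Stream() to pipeline H2D copies with compute.
Static shapes and graph compilation: on PyTorch 2.x, try torch.compile() (compatibility depends on the operators used).
Memory reuse: optimizer.zero_grad(set_to_none=True) and gradient checkpointing, which trades recomputation for memory.
I/O and preprocessing: use multi-process DataLoader workers, pinned memory, and prefetching to relieve the data-feeding bottleneck.
Multi-GPU distributed training: prefer torchrun + DDP with the NCCL backend; shard the data per process (for example with DistributedSampler) and set the per-process batch size and num_workers accordingly.
Profile to locate bottlenecks: measure the timeline before changing code, to avoid blind optimization.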
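
Of the practices above, gradient checkpointing is perhaps the least self-explanatory. The sketch below, using a toy block chosen purely for illustration, runs the same forward and backward pass with and without torch.utils.checkpoint and checks that the gradients agree; the checkpointed version recomputes the block's activations during backward instead of storing them.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

block = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128)).to(device)
x = torch.randn(8, 128, device=device, requires_grad=True)

# Baseline: activations inside `block` are kept for the backward pass.
block(x).sum().backward()
grad_plain = x.grad.clone()

# Checkpointed: activations are discarded and recomputed during backward.
x.grad = None
block.zero_grad(set_to_none=True)
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(grad_plain, grad_ckpt, atol=1e-5))
```

The memory saving only becomes visible on deep models with large activations; on a block this small the recomputation cost dominates.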

Streams and Asynchrony Example (PyTorch)

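import torch

# Assumes `model` is on the GPU and `batch_cpu` is a CPU tensor.
stream_h2d = torch.cuda.Stream()
stream_compute = torch.cuda.Stream()

batch_cpu = batch_cpu.pin_memory()  # non_blocking copies need pinned host memory

with torch.cuda.stream(stream_h2d):
    batch_gpu = batch_cpu.to('cuda', non_blocking=True)

# Compute must not start before the copy has finished.
stream_compute.wait_stream(stream_h2d)
with torch.cuda.stream(stream_compute):
    batch_gpu.record_stream(stream_compute)  # tell the allocator about cross-stream use
    y = model(batch_gpu)

torch.cuda.synchronize()  # synchronize only where the result is needed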

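Memory and Copy Tips

Use a DataLoader with pin_memory=True to speed up H2D transfers, combined with non_blocking=True.
Avoid creating many temporary tensors on hot paths; operate in place or reuse buffers where possible.
For large matrix multiplications, prefer library routines; in custom kernels, pay attention to coalesced memory access, shared memory, and register usage.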
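
The first tip can be sketched as follows. The dataset is synthetic and the helper name make_loader is hypothetical; pin_memory is enabled only when a GPU is present so that the sketch also runs on CPU-only machines.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(num_samples=256, batch_size=32, num_workers=0):
    """Build a DataLoader over synthetic data; pin memory only when a GPU exists."""
    features = torch.randn(num_samples, 16)
    labels = torch.randint(0, 2, (num_samples,))
    return DataLoader(
        TensorDataset(features, labels),
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,  # raise this for real, CPU-heavy preprocessing
        pin_memory=torch.cuda.is_available(),
    )


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for features, labels in make_loader():
        # non_blocking only overlaps with compute when the source is pinned.
        features = features.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
    print(features.shape, features.device)
```

With num_workers greater than zero, keep the loop under the `if __name__ == "__main__":` guard so that worker processes can be spawned safely on every platform.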

Multi-GPU and Distributed Training (PyTorch DDP)

torchrun --nproc_per_node=4 train.py --config config.yaml

In the training script:

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group('nccl')
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun for each process
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])
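
DDP replicates the model, but each process still needs its own slice of the data. A minimal sketch of how DistributedSampler splits indices across ranks follows; the world size of 4 and the 10-element dataset are assumptions, and the ranks are simulated in one process by passing num_replicas and rank explicitly.

```python
from torch.utils.data import DistributedSampler

dataset = range(10)  # any object with __len__ is enough for this illustration
world_size = 4       # hypothetical number of processes (one per GPU)

shards = {}
for rank in range(world_size):
    # Passing num_replicas and rank explicitly avoids needing an initialized
    # process group; inside a real DDP job both default to the current group.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    shards[rank] = list(sampler)
    print(rank, shards[rank])
```

The sampler pads the index list so that every rank receives the same number of samples, which means a few samples can appear on more than one rank. When shuffling is enabled in real training, call sampler.set_epoch(epoch) at the start of each epoch so the order changes between epochs.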

Containerization and Deployment

Docker (NVIDIA containers)

# After installing nvidia-container-toolkit
docker run --gpus all -it --rm \
  -v $PWD:/workspace nvcr.io/nvidia/pytorch:24.08-py3 nvidia-smi

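Advantages: driver and CUDA version compatibility is explicit, and the environment is reproducible. Prefer NVIDIA's official images or the framework's official images.

Windows and WSL2

On Windows, consider running CUDA on WSL inside WSL2 (Ubuntu); the experience is closer to a Linux production environment.
Installation steps: update to a WSL kernel that supports CUDA → install the NVIDIA Windows driver → install the CUDA Toolkit inside WSL (do not install a Linux NVIDIA driver inside WSL; the Windows driver is exposed to it).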

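Profiling and Debugging

PyTorch Profiler: from torch.profiler import profile, record_function to collect per-operator timings.
Nsight Systems (nsys) and Nsight Compute (ncu): system-level and kernel-level profiling to locate kernel, memory, and stream-concurrency bottlenecks.
nvcc --ptxas-options=-v: reports per-kernel register usage (C++/CUDA).
nvidia-smi, or watch -n 0.5 nvidia-smi: monitors GPU memory and utilization.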
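
A minimal sketch of the PyTorch Profiler entry in the list above follows. The label matmul_block and the tensor sizes are arbitrary choices for illustration; CUDA activity is recorded only when a GPU is available.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
activities = [ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(512, 512, device=device)
w = torch.randn(512, 512, device=device)

with profile(activities=activities) as prof:
    with record_function("matmul_block"):  # user-defined label in the report
        for _ in range(5):
            y = x @ w
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure GPU work is captured

report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

The table gives a first ranking of operators by time; for timeline views and kernel-level detail, Nsight Systems and Nsight Compute remain the better tools.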

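Common Compatibility Problems and Fixes

Framework and CUDA version mismatch: pick the build for a specific CUDA version from the install page on the framework's website.
Driver too old: upgrade to at least the minimum driver version required by your CUDA version (drivers are backward compatible: a newer driver runs applications built against older CUDA versions).
DLL/SO not found: check environment variables and the ldconfig/PATH settings (LD_LIBRARY_PATH on Linux); inside containers, official images are the least troublesome.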
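
The driver rule above amounts to a numeric version comparison. The sketch below shows one way to script it; the function names are hypothetical and the version strings are illustrative values only, so the real minimum should be taken from the release notes of the CUDA version in use.

```python
def parse_version(text):
    """Turn '535.104.05' into (535, 104, 5) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_satisfies(installed, minimum):
    """True if the installed driver version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # Illustrative values only; look up the real minimum for your CUDA version.
    print(driver_satisfies("535.104.05", "525.60.13"))
    print(driver_satisfies("470.82.01", "525.60.13"))
```

Comparing parsed tuples rather than raw strings matters here, since a string comparison would rank "99.0" above "535.0".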