How to run on GPU (CUDA / MPS / Blackwell)¶
QQA4CO is a thin layer over PyTorch — every call accepts a device
keyword and forwards it down. The pitfalls below are the ones we have
hit in practice; treat this page as the GPU operations manual.
Pick a device¶
import torch
# Most common: pick CUDA when present, fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Apple Silicon — supported because we never call CUDA-only kernels.
# device = "mps" if torch.backends.mps.is_available() else "cpu"
qqa.anneal() and qqa.pignn.train_* both raise an early, actionable
RuntimeError when you pass device="cuda" without a CUDA-enabled
PyTorch build. You will not get a deep, cryptic stack trace.
Always construct the problem on the same device you train on¶
This is the #1 source of GPU bugs for Python-API users:
# WRONG — will crash on the first epoch:
problem = qqa.MaximumIndependentSet(g, penalty=2, device="cpu")
qqa.anneal(problem, device="cuda", num_epochs=1000)
The constructor materialises tensors (Q_mat, etc.) on whatever
device= you pass. If the solver later runs on a different device, the
inner einsum raises a CUDA/CPU device-mismatch error.
- qqa.anneal() does not auto-migrate the problem. You must construct it on the right device.
- qqa.pignn.train_cra_pi_gnn / train_cpra_pi_gnn do auto-migrate silently via _ensure_problem_on_device. The CLI always builds on the right device, so this only matters for Python-API users of the PyG backend.
Recommended pattern¶
device = "cuda" if torch.cuda.is_available() else "cpu"
qqa.fix_seed(0)
g = nx.random_regular_graph(d=3, n=200, seed=0)
problem = qqa.MaximumIndependentSet(g, penalty=2, device=device)
result = qqa.anneal(problem, sol_size=128, num_epochs=2000, device=device)
Note that device= goes to both the problem and the solver. Always do both.
Determinism¶
qqa.fix_seed(seed) seeds Python, NumPy, and PyTorch (CPU + CUDA) and
flips torch.backends.cudnn.deterministic = True. That is enough for
QQA's pure tensor ops to be reproducible. The PyG GCN forward includes
some non-deterministic kernels by default; if you need bit-identical
runs across processes you may also need PyTorch's global deterministic
mode (a stock PyTorch switch, not a QQA API):
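import os
# cuBLAS requires this workspace config for deterministic matmuls;
# set it before the first CUDA matmul runs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
# Error out on (or deterministically replace) remaining
# non-deterministic kernels.
torch.use_deterministic_algorithms(True)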
This will trade a few percent of throughput for full determinism.
Picking sol_size¶
sol_size is the parallel population: the candidate solutions run as a
single batch on the GPU. Larger is better, up to your VRAM ceiling.
| GPU memory | Safe sol_size for MaximumIndependentSet (N=1024) |
|---|---|
| 8 GB | ~256 |
| 24 GB | ~1024 |
| 48 GB | ~2048 |
| 80 GB (H100) | ~4096 |
| 192 GB (B200) | ~8192 |
If you OOM, halve sol_size and try again — the loop is otherwise
identical.
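A minimal sketch of that halve-on-OOM retry, assuming CUDA OOM inside
qqa.anneal is the only failure mode (torch.cuda.OutOfMemoryError is
stock PyTorch):

import torch

sol_size = 4096
while sol_size >= 1:
    try:
        result = qqa.anneal(problem, sol_size=sol_size,
                            num_epochs=2000, device="cuda")
        break
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release the failed allocation
        sol_size //= 2            # retry with half the population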
Blackwell / B200 / sm_100¶
QQA's default backend uses only stock PyTorch ops, so any CUDA build
that supports your card works. As of April 2026 you need PyTorch
≥ 2.7 with CUDA 12.8 for B200; the project's pyproject.toml
keeps the lower bound permissive (torch >= 2.2) but our development
environment uses the cu130 wheel from https://download.pytorch.org/whl/cu130.
The optional qqa.pignn backend uses
torch_geometric, which itself uses pure PyTorch kernels (no
hand-written CUDA). It is therefore Blackwell-ready, unlike the
upstream DGL-based CRA reference which lacks a Blackwell wheel.
# Blackwell-friendly install (pip)
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install "qqa[pignn]"
MPS (Apple Silicon)¶
device="mps" works for the default qqa.anneal backend. The
qqa.pignn backend has not been validated on MPS and may fall back to
CPU for unsupported ops; for the PyG path we recommend CUDA or CPU.
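For example, the recommended pattern from above with an MPS fallback
(g and the qqa calls are unchanged):

device = "mps" if torch.backends.mps.is_available() else "cpu"
problem = qqa.MaximumIndependentSet(g, penalty=2, device=device)
result = qqa.anneal(problem, sol_size=128, num_epochs=2000, device=device)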
Slurm / cluster recipe¶
A Slurm template script lives at
scripts/sanity_pignn_gpu.sbatch. Adapt it for your cluster:
#!/bin/bash
#SBATCH --gres=gpu:1      # request a single GPU of any type
#SBATCH --time=00:10:00   # ten-minute wall-clock limit
module load cuda/12.8     # match the CUDA toolkit to your PyTorch wheel
uv run python scripts/demo_pignn_mis.py
Use qqa solve --device cuda --output result.pkl if you want to run
the CLI from a job script and post-process the result later.
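Post-processing is then an ordinary pickle load (a sketch; it relies
only on the result.runtime field described in the next section):

import pickle

with open("result.pkl", "rb") as f:
    result = pickle.load(f)
print(result.runtime)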
Diagnosing slow GPU runs¶
- Look at time per epoch. Print result.runtime / num_epochs. For MIS on N=200 the QQA backend should be < 1 ms/epoch on an A100.
- Profile. torch.profiler.profile() works unmodified on qqa.anneal; the hot path is loss_fn + optimizer.step(). See the sketch after this list.
- Check pin_memory and async copies. QQA does not move data between host and device per epoch; if your problem does so in its loss_fn, factor those tensors out into the constructor.
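A minimal profiling sketch (standard torch.profiler usage; problem and
device are assumed to come from the recommended pattern above):

import torch
from torch.profiler import ProfilerActivity, profile

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    qqa.anneal(problem, sol_size=128, num_epochs=100, device=device)
# Rank ops by total CUDA time to see where the epochs actually go.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))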