Speeds and feeds
Accelerator Catalog
A selectable comparison matrix for AI training and inference accelerators: memory capacity, HBM bandwidth, low-precision peaks, interconnects, power envelopes, and source-backed detail pages.
How to read this table
This page keeps chip-level and accelerator-level numbers separate from node, rack, and pod numbers where possible. Vendor peak FLOPS are useful for orientation, but sustained model throughput depends on kernels, parallelism, memory traffic, collective communication, and software maturity.
Dense and sparse tensor peaks are not interchangeable; sparse figures typically assume structured sparsity and roughly double the dense number. FP4, FP6, FP8, BF16, and INT8 formats also differ in implementation across vendors, so this catalog is a starting point for analysis rather than a final benchmark ranking. A worked example of these caveats follows.
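As a rough illustration of the peak-versus-sustained and dense-versus-sparse caveats, here is a minimal sketch that scales a dense peak by an assumed model-FLOPS-utilization (MFU) factor and applies the nominal 2x structured-sparsity multiplier. The 40% MFU, the 2x factor, and the helper names are illustrative assumptions, not measured values; the only catalog figure used is the H200 FP8 dense peak (1.98 PFLOPS) from the table below.

```python
# Minimal sketch: turning a vendor peak into a rough sustained estimate.
# The 40% MFU and the 2x sparsity factor are assumptions for illustration,
# not measured values from this catalog.

def sustained_tflops(dense_peak_tflops: float, mfu: float = 0.40) -> float:
    """Estimate delivered throughput from a dense tensor peak.

    MFU (model FLOPS utilization) folds in kernel efficiency, parallelism
    overhead, memory traffic, and collective communication.
    """
    return dense_peak_tflops * mfu

def sparse_peak_tflops(dense_peak_tflops: float, sparsity_factor: float = 2.0) -> float:
    """Vendor 'sparse' peaks usually assume structured sparsity.

    They only apply when weights are actually pruned to that pattern,
    so they are not comparable to dense peaks.
    """
    return dense_peak_tflops * sparsity_factor

if __name__ == "__main__":
    peak = 1980.0  # H200 SXM FP8 dense peak from the table below, in TFLOPS
    print(f"dense peak:            {peak:.0f} TFLOPS")
    print(f"~sustained @ 40% MFU:  {sustained_tflops(peak):.0f} TFLOPS")
    print(f"sparse peak (assumed): {sparse_peak_tflops(peak):.0f} TFLOPS")
```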
| Metric | Instinct MI300X | Instinct MI325X | Instinct MI355X | H200 SXM | B200 SXM | Trainium | Trainium2 | Cloud TPU v5p | Cloud TPU v6e / Trillium |
|---|---|---|---|---|---|---|---|---|---|
| Vendor | AMD | AMD | AMD | NVIDIA | NVIDIA | AWS | AWS | Google | Google |
| Architecture | CDNA 3 | CDNA 3 | CDNA 4 | Hopper | Blackwell | NeuronCore-v2 | NeuronCore-v3 | TPU v5p TensorCore | TPU v6e TensorCore |
| Unit | OAM accelerator | OAM accelerator | OAM accelerator | SXM GPU | SXM GPU | Cloud accelerator chip | Cloud accelerator chip | Cloud TPU chip | Cloud TPU chip |
| Form factor | OAM module | OAM module | OAM module | SXM | SXM | Trn1 instance chip | Trn2 instance chip | Cloud TPU slice chip | Cloud TPU slice chip |
| Launch | 2023-12-06 | 2024-10-10 | 2025-06-12 | 2023-11-13 | 2024-03-18 | 2022-11-28 | 2024-12-03 | 2023-12-06 | 2024-12-11 |
| Memory | 192 GB HBM3 | 256 GB HBM3E | 288 GB HBM3E | 141 GB HBM3E | 180 GB HBM3E | 32 GiB HBM | 96 GiB HBM | 95 GB HBM2e | 32 GB HBM |
| HBM bandwidth | 5.3 TB/s | 6 TB/s | 8 TB/s | 4.8 TB/s | 8 TB/s | 0.82 TB/s | 2.9 TB/s | 2.765 TB/s | 1.6 TB/s |
| BF16 peak | 1.3 PFLOPS | 1.3 PFLOPS | 2.5 PFLOPS | 989 TFLOPS | 2.25 PFLOPS | 190 TFLOPS | 667 TFLOPS | 459 TFLOPS | 918 TFLOPS |
| FP16 peak | 1.3 PFLOPS | 1.3 PFLOPS | 2.5 PFLOPS | 989 TFLOPS | 2.25 PFLOPS | 190 TFLOPS | 667 TFLOPS | n/a | n/a |
| FP8 dense peak | 2.61 PFLOPS | 2.61 PFLOPS | 5 PFLOPS | 1.98 PFLOPS | 4.5 PFLOPS | n/a | 1.3 PFLOPS | n/a | n/a |
| FP8 sparse peak | 5.22 PFLOPS | 5.22 PFLOPS | 10.1 PFLOPS | 3.96 PFLOPS | 9 PFLOPS | n/a | 2.56 PFLOPS | n/a | n/a |
| FP4 dense peak | n/a | n/a | 10.1 PFLOPS | n/a | 9 PFLOPS | n/a | n/a | n/a | n/a |
| FP4 sparse peak | n/a | n/a | n/a | n/a | 18 PFLOPS | n/a | n/a | n/a | n/a |
| FP64 peak | 81.7 TFLOPS | 81.7 TFLOPS | 78.6 TFLOPS | 34 TFLOPS | 40 TFLOPS | n/a | n/a | n/a | n/a |
| INT8 peak | 2.6 POPS | 2.6 POPS | 5 POPS | 1.98 POPS | n/a | 380 TOPS | n/a | n/a | 1.84 POPS |
| Interconnect | AMD Infinity Fabric - 8 links, 128 GB/s peak per link | AMD Infinity Fabric - 8 links, 128 GB/s peak per link | AMD Infinity Fabric - 7 links, 153 GB/s peak per link | NVIDIA NVLink - 900 GB/s per GPU | NVIDIA NVLink 5 / NVSwitch - 1.8 TB/s per GPU, derived from DGX B200 aggregate | NeuronLink-v2 - 384 GB/s inter-chip | NeuronLink-v3 - 1.28 TB/s per chip | ICI 3D torus - 1200 GB/s bidirectional per chip | ICI 2D torus - 800 GB/s bidirectional per chip |
| Power | 750 W peak TBP | 1000 W peak TBP | 1400 W TBP | Up to 700 W configurable TDP | Platform dependent; DGX B200 is ~14.3 kW max for 8 GPUs | Not published per chip | Not published per chip | Not published per chip | Not published per chip |
| Software stack | ROCm | ROCm | ROCm | CUDA, TensorRT-LLM, NVIDIA AI Enterprise | CUDA, TensorRT-LLM, NVIDIA AI Enterprise | AWS Neuron SDK | AWS Neuron SDK | JAX, XLA, TensorFlow, PyTorch/XLA | JAX, XLA, TensorFlow, PyTorch/XLA |
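To make the bandwidth row concrete, here is a minimal sketch of a purely memory-bandwidth-bound decode ceiling: it assumes each generated token streams the full weight set from HBM exactly once. The 70B-parameter model size, 1 byte per parameter (FP8 weights), and the single-stream framing are assumptions chosen for illustration; the bandwidth figures are the HBM numbers from the table above, and real decode speed also depends on KV-cache traffic, batching, and kernel efficiency.

```python
# Minimal sketch, assuming a purely memory-bandwidth-bound decode step:
# every generated token must stream the full weight set from HBM once.
# Model size and bytes-per-parameter below are illustrative assumptions.

def decode_tokens_per_s(params_billion: float, bytes_per_param: float,
                        hbm_bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode rate if weights are read once per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return hbm_bandwidth_tb_s * 1e12 / weight_bytes

if __name__ == "__main__":
    # HBM bandwidths (TB/s) taken from the table above.
    for name, bw in [("MI300X", 5.3), ("H200 SXM", 4.8), ("B200 SXM", 8.0)]:
        rate = decode_tokens_per_s(70, 1.0, bw)
        print(f"{name}: ~{rate:.0f} tokens/s ceiling (hypothetical 70B model @ 1 byte/param)")
```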
Hardware Roofline
Selected devices plotted as peak throughput ceilings against arithmetic intensity on log-log axes.
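A minimal sketch of the roofline model behind this chart, assuming the standard formulation: attainable throughput at arithmetic intensity I is min(peak compute, memory bandwidth × I), and the ridge point peak/bandwidth marks where a device shifts from bandwidth-bound to compute-bound. The BF16 peaks and HBM bandwidths are the catalog values from the table above; the intensity range and plotting choices are illustrative.

```python
# Minimal roofline sketch: attainable throughput is the lower of the compute
# ceiling and the bandwidth slope, plotted on log-log axes. Device numbers are
# the BF16 peaks (TFLOPS) and HBM bandwidths (TB/s) from the catalog table;
# everything else is illustrative.
import numpy as np
import matplotlib.pyplot as plt

def roofline(intensity_flop_per_byte, peak_tflops, bandwidth_tb_s):
    """Attainable TFLOPS at a given arithmetic intensity (FLOP/byte)."""
    return np.minimum(peak_tflops, bandwidth_tb_s * intensity_flop_per_byte)

devices = {
    "MI300X (BF16)":   (1300.0, 5.3),
    "H200 SXM (BF16)": (989.0, 4.8),
    "B200 SXM (BF16)": (2250.0, 8.0),
}

intensity = np.logspace(0, 4, 200)  # 1 to 10,000 FLOP/byte
for name, (peak, bw) in devices.items():
    plt.loglog(intensity, roofline(intensity, peak, bw), label=name)
    ridge = peak / bw  # intensity where the device turns compute-bound
    print(f"{name}: ridge point ~{ridge:.0f} FLOP/byte")

plt.xlabel("Arithmetic intensity (FLOP/byte)")
plt.ylabel("Attainable throughput (TFLOPS)")
plt.legend()
plt.show()
```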