Speeds and feeds
Accelerator Catalog
A selectable comparison matrix for AI training and inference accelerators: memory capacity, HBM bandwidth, low-precision peaks, interconnects, power envelopes, and source-backed detail pages.
How to read this table
This page keeps chip-level and accelerator-level numbers separate from node, rack, and pod numbers where possible. Vendor peak FLOPS are useful for orientation, but sustained model throughput depends on kernels, parallelism, memory traffic, collective communication, and software maturity.
Dense and sparse tensor peaks are not interchangeable; sparse figures typically assume structured sparsity and roughly double the dense number. FP4, FP6, FP8, BF16, and INT8 formats also differ in implementation across vendors, so this catalog is a starting point for analysis rather than a final benchmark ranking. A worked example of these caveats follows.
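As a rough illustration of the peak-versus-sustained and dense-versus-sparse caveats, here is a minimal sketch that scales a dense peak by an assumed model-FLOPS-utilization (MFU) factor and applies the nominal 2x structured-sparsity multiplier. The 40% MFU, the 2x factor, and the helper names are illustrative assumptions, not measured values; the only catalog figure used is the H200 FP8 dense peak (1.98 PFLOPS) from the table below.

```python
# Minimal sketch: turning a vendor peak into a rough sustained estimate.
# The 40% MFU and the 2x sparsity factor are assumptions for illustration,
# not measured values from this catalog.

def sustained_tflops(dense_peak_tflops: float, mfu: float = 0.40) -> float:
    """Estimate delivered throughput from a dense tensor peak.

    MFU (model FLOPS utilization) folds in kernel efficiency, parallelism
    overhead, memory traffic, and collective communication.
    """
    return dense_peak_tflops * mfu

def sparse_peak_tflops(dense_peak_tflops: float, sparsity_factor: float = 2.0) -> float:
    """Vendor 'sparse' peaks usually assume structured sparsity.

    They only apply when weights are actually pruned to that pattern,
    so they are not comparable to dense peaks.
    """
    return dense_peak_tflops * sparsity_factor

if __name__ == "__main__":
    peak = 1980.0  # H200 SXM FP8 dense peak from the table below, in TFLOPS
    print(f"dense peak:            {peak:.0f} TFLOPS")
    print(f"~sustained @ 40% MFU:  {sustained_tflops(peak):.0f} TFLOPS")
    print(f"sparse peak (assumed): {sparse_peak_tflops(peak):.0f} TFLOPS")
```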
| Metric | Instinct MI300X | Instinct MI325X | Instinct MI355X | H200 SXM | B200 SXM | Trainium | Trainium2 | Cloud TPU v5p | Cloud TPU v6e / Trillium |
|---|---|---|---|---|---|---|---|---|---|
| Vendor | AMD | AMD | AMD | NVIDIA | NVIDIA | AWS | AWS | Google | Google |
| Architecture | CDNA 3 | CDNA 3 | CDNA 4 | Hopper | Blackwell | NeuronCore-v2 | NeuronCore-v3 | TPU v5p TensorCore | TPU v6e TensorCore |
| Unit | OAM accelerator | OAM accelerator | OAM accelerator | SXM GPU | SXM GPU | Cloud accelerator chip | Cloud accelerator chip | Cloud TPU chip | Cloud TPU chip |
| Form factor | OAM module | OAM module | OAM module | SXM | SXM | Trn1 instance chip | Trn2 instance chip | Cloud TPU slice chip | Cloud TPU slice chip |
| Launch | 2023-12-06 | 2024-10-10 | 2025-06-12 | 2023-11-13 | 2024-03-18 | 2022-11-28 | 2024-12-03 | 2023-12-06 | 2024-12-11 |
| Memory | 192 GB HBM3 | 256 GB HBM3E | 288 GB HBM3E | 141 GB HBM3E | 180 GB HBM3E | 32 GiB HBM | 96 GiB HBM | 95 GB HBM2e | 32 GB HBM |
| HBM bandwidth | 5.3 TB/s | 6 TB/s | 8 TB/s | 4.8 TB/s | 8 TB/s | 0.82 TB/s | 2.9 TB/s | 2.765 TB/s | 1.6 TB/s |
| BF16 peak | 1.3 PFLOPS | 1.3 PFLOPS | 2.5 PFLOPS | 989 TFLOPS | 2.25 PFLOPS | 190 TFLOPS | 667 TFLOPS | 459 TFLOPS | 918 TFLOPS |
| FP16 peak | 1.3 PFLOPS | 1.3 PFLOPS | 2.5 PFLOPS | 989 TFLOPS | 2.25 PFLOPS | 190 TFLOPS | 667 TFLOPS | n/a | n/a |
| FP8 dense peak | 2.61 PFLOPS | 2.61 PFLOPS | 5 PFLOPS | 1.98 PFLOPS | 4.5 PFLOPS | n/a | 1.3 PFLOPS | n/a | n/a |
| FP8 sparse peak | 5.22 PFLOPS | 5.22 PFLOPS | 10.1 PFLOPS | 3.96 PFLOPS | 9 PFLOPS | n/a | 2.56 PFLOPS | n/a | n/a |
| FP4 dense peak | n/a | n/a | 10.1 PFLOPS | n/a | 9 PFLOPS | n/a | n/a | n/a | n/a |
| FP4 sparse peak | n/a | n/a | n/a | n/a | 18 PFLOPS | n/a | n/a | n/a | n/a |
| FP64 peak | 81.7 TFLOPS | 81.7 TFLOPS | 78.6 TFLOPS | 34 TFLOPS | 40 TFLOPS | n/a | n/a | n/a | n/a |
| INT8 peak | 2.6 POPS | 2.6 POPS | 5 POPS | 1.98 POPS | n/a | 380 TOPS | n/a | n/a | 1.84 POPS |
| Interconnect | AMD Infinity Fabric - 8 links, 128 GB/s peak per link | AMD Infinity Fabric - 8 links, 128 GB/s peak per link | AMD Infinity Fabric - 7 links, 153 GB/s peak per link | NVIDIA NVLink - 900 GB/s per GPU | NVIDIA NVLink 5 / NVSwitch - 1.8 TB/s per GPU, derived from DGX B200 aggregate | NeuronLink-v2 - 384 GB/s inter-chip | NeuronLink-v3 - 1.28 TB/s per chip | ICI 3D torus - 1200 GB/s bidirectional per chip | ICI 2D torus - 800 GB/s bidirectional per chip |
| Power | 750 W peak TBP | 1000 W peak TBP | 1400 W TBP | Up to 700 W configurable TDP | Platform dependent; DGX B200 is ~14.3 kW max for 8 GPUs | Not published per chip | Not published per chip | Not published per chip | Not published per chip |
| Software stack | ROCm | ROCm | ROCm | CUDA, TensorRT-LLM, NVIDIA AI Enterprise | CUDA, TensorRT-LLM, NVIDIA AI Enterprise | AWS Neuron SDK | AWS Neuron SDK | JAX, XLA, TensorFlow, PyTorch/XLA | JAX, XLA, TensorFlow, PyTorch/XLA |
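To make the bandwidth row concrete, here is a minimal sketch of a purely memory-bandwidth-bound decode ceiling: it assumes each generated token streams the full weight set from HBM exactly once. The 70B-parameter model size, 1 byte per parameter (FP8 weights), and the single-stream framing are assumptions chosen for illustration; the bandwidth figures are the HBM numbers from the table above, and real decode speed also depends on KV-cache traffic, batching, and kernel efficiency.

```python
# Minimal sketch, assuming a purely memory-bandwidth-bound decode step:
# every generated token must stream the full weight set from HBM once.
# Model size and bytes-per-parameter below are illustrative assumptions.

def decode_tokens_per_s(params_billion: float, bytes_per_param: float,
                        hbm_bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode rate if weights are read once per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return hbm_bandwidth_tb_s * 1e12 / weight_bytes

if __name__ == "__main__":
    # HBM bandwidths (TB/s) taken from the table above.
    for name, bw in [("MI300X", 5.3), ("H200 SXM", 4.8), ("B200 SXM", 8.0)]:
        rate = decode_tokens_per_s(70, 1.0, bw)
        print(f"{name}: ~{rate:.0f} tokens/s ceiling (hypothetical 70B model @ 1 byte/param)")
```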
Hardware Roofline
Selected devices plotted as peak throughput ceilings against arithmetic intensity on log-log axes.
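A minimal sketch of the roofline model behind this chart, assuming the standard formulation: attainable throughput at arithmetic intensity I is min(peak compute, memory bandwidth × I), and the ridge point peak/bandwidth marks where a device shifts from bandwidth-bound to compute-bound. The BF16 peaks and HBM bandwidths are the catalog values from the table above; the intensity range and plotting choices are illustrative.

```python
# Minimal roofline sketch: attainable throughput is the lower of the compute
# ceiling and the bandwidth slope, plotted on log-log axes. Device numbers are
# the BF16 peaks (TFLOPS) and HBM bandwidths (TB/s) from the catalog table;
# everything else is illustrative.
import numpy as np
import matplotlib.pyplot as plt

def roofline(intensity_flop_per_byte, peak_tflops, bandwidth_tb_s):
    """Attainable TFLOPS at a given arithmetic intensity (FLOP/byte)."""
    return np.minimum(peak_tflops, bandwidth_tb_s * intensity_flop_per_byte)

devices = {
    "MI300X (BF16)":   (1300.0, 5.3),
    "H200 SXM (BF16)": (989.0, 4.8),
    "B200 SXM (BF16)": (2250.0, 8.0),
}

intensity = np.logspace(0, 4, 200)  # 1 to 10,000 FLOP/byte
for name, (peak, bw) in devices.items():
    plt.loglog(intensity, roofline(intensity, peak, bw), label=name)
    ridge = peak / bw  # intensity where the device turns compute-bound
    print(f"{name}: ridge point ~{ridge:.0f} FLOP/byte")

plt.xlabel("Arithmetic intensity (FLOP/byte)")
plt.ylabel("Attainable throughput (TFLOPS)")
plt.legend()
plt.show()
```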