Accelerator Catalog

A selectable comparison matrix for AI training and inference accelerators: memory capacity, HBM bandwidth, low-precision peaks, interconnects, power envelopes, and source-backed detail pages.

How to read this table

This page keeps chip-level and accelerator-level numbers separate from node, rack, and pod numbers where possible. Vendor peak FLOPS are useful for orientation, but sustained model throughput depends on kernels, parallelism, memory traffic, collective communication, and software maturity.
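One way to quantify the gap between peak and sustained is model FLOPs utilization (MFU): the ratio of FLOPs a training run actually performs to the vendor dense peak. A minimal sketch using the standard ~6 × params FLOPs-per-token approximation for transformer training; the workload numbers are hypothetical:

```python
# MFU (model FLOPs utilization) sketch: ratio of FLOPs a training run
# actually performs to the vendor dense peak. Uses the standard
# ~6 * params FLOPs-per-token approximation for transformer training.

def mfu(params: float, tokens_per_s: float, peak_flops: float) -> float:
    achieved_flops = 6 * params * tokens_per_s   # forward + backward pass
    return achieved_flops / peak_flops

# Hypothetical workload: 70B-parameter model sustaining 350 tokens/s per
# chip against a 989 TFLOPS dense BF16 peak (H200 SXM from the matrix).
print(f"MFU = {mfu(70e9, 350, 989e12):.1%}")     # ~14.9%
```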

Dense and sparse tensor peaks are not interchangeable, and FP4, FP6, FP8, BF16, and INT8 differ across vendors in numeric definition, scaling scheme, and accumulation behavior, so this catalog is a starting point for analysis rather than a final benchmark ranking.
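One concrete pitfall: vendors quote peaks in different units (TFLOPS vs. PFLOPS) and different sparsity regimes, so any comparison should first normalize to a common baseline. A minimal Python sketch of that normalization, using three dense BF16 figures from the matrix below (chip subset and helper names are illustrative):

```python
# A minimal sketch: normalize vendor-quoted peaks to raw FLOP/s before
# comparing, and keep dense and sparse figures in separate buckets --
# sparse peaks assume structured sparsity and are not comparable to dense.

UNITS = {"TFLOPS": 1e12, "PFLOPS": 1e15}

def to_flops(value: float, unit: str) -> float:
    """Convert a quoted peak like (989, "TFLOPS") into FLOP/s."""
    return value * UNITS[unit]

# Dense BF16 peaks from the matrix below.
bf16_dense = {
    "Instinct MI300X": to_flops(1.3, "PFLOPS"),
    "H200 SXM": to_flops(989, "TFLOPS"),
    "Trainium2": to_flops(667, "TFLOPS"),
}

for chip, flops in sorted(bf16_dense.items(), key=lambda kv: -kv[1]):
    print(f"{chip:16s} {flops / 1e12:7.0f} dense BF16 TFLOP/s")
```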

Comparison Matrix

Instinct MI300X (AMD, CDNA 3)
Unit: OAM accelerator, OAM module form factor
Launch: 2023-12-06
Memory: 192 GB HBM3 at 5.3 TB/s
BF16 / FP16 peak: 1.3 PFLOPS
FP8 peak: 2.61 PFLOPS dense, 5.22 PFLOPS sparse
FP4 peak: n/a
FP64 peak: 81.7 TFLOPS
INT8 peak: 2.6 POPS
Interconnect: AMD Infinity Fabric, 8 links, 128 GB/s peak per link
Power: 750 W peak TBP
Software stack: ROCm

Instinct MI325X (AMD, CDNA 3)
Unit: OAM accelerator, OAM module form factor
Launch: 2024-10-10
Memory: 256 GB HBM3E at 6 TB/s
BF16 / FP16 peak: 1.3 PFLOPS
FP8 peak: 2.61 PFLOPS dense, 5.22 PFLOPS sparse
FP4 peak: n/a
FP64 peak: 81.7 TFLOPS
INT8 peak: 2.6 POPS
Interconnect: AMD Infinity Fabric, 8 links, 128 GB/s peak per link
Power: 1000 W peak TBP
Software stack: ROCm

Instinct MI355X (AMD, CDNA 4)
Unit: OAM accelerator, OAM module form factor
Launch: 2025-06-12
Memory: 288 GB HBM3E at 8 TB/s
BF16 / FP16 peak: 2.5 PFLOPS
FP8 peak: 5 PFLOPS dense, 10.1 PFLOPS sparse
FP4 peak: 10.1 PFLOPS dense, sparse n/a
FP64 peak: 78.6 TFLOPS
INT8 peak: 5 POPS
Interconnect: AMD Infinity Fabric, 7 links, 153 GB/s peak per link
Power: 1400 W TBP
Software stack: ROCm

H200 SXM (NVIDIA, Hopper)
Unit: SXM GPU, SXM form factor
Launch: 2023-11-13
Memory: 141 GB HBM3E at 4.8 TB/s
BF16 / FP16 peak: 989 TFLOPS
FP8 peak: 1.98 PFLOPS dense, 3.96 PFLOPS sparse
FP4 peak: n/a
FP64 peak: 34 TFLOPS
INT8 peak: 1.98 POPS
Interconnect: NVIDIA NVLink, 900 GB/s per GPU
Power: up to 700 W configurable TDP
Software stack: CUDA, TensorRT-LLM, NVIDIA AI Enterprise

B200 SXM (NVIDIA, Blackwell)
Unit: SXM GPU, SXM form factor
Launch: 2024-03-18
Memory: 180 GB HBM3E at 8 TB/s
BF16 / FP16 peak: 2.25 PFLOPS
FP8 peak: 4.5 PFLOPS dense, 9 PFLOPS sparse
FP4 peak: 9 PFLOPS dense, 18 PFLOPS sparse
FP64 peak: 40 TFLOPS
INT8 peak: n/a
Interconnect: NVIDIA NVLink 5 / NVSwitch, 1.8 TB/s per GPU (derived from the DGX B200 aggregate)
Power: platform dependent; DGX B200 is ~14.3 kW max for 8 GPUs
Software stack: CUDA, TensorRT-LLM, NVIDIA AI Enterprise

Trainium (AWS, NeuronCore-v2)
Unit: cloud accelerator chip, Trn1 instance
Launch: 2022-11-28
Memory: 32 GiB HBM at 0.82 TB/s
BF16 / FP16 peak: 190 TFLOPS
FP8 / FP4 / FP64 peaks: n/a
INT8 peak: 380 TOPS
Interconnect: NeuronLink-v2, 384 GB/s inter-chip
Power: not published per chip
Software stack: AWS Neuron SDK

Trainium2 (AWS, NeuronCore-v3)
Unit: cloud accelerator chip, Trn2 instance
Launch: 2024-12-03
Memory: 96 GiB HBM at 2.9 TB/s
BF16 / FP16 peak: 667 TFLOPS
FP8 peak: 1.3 PFLOPS dense, 2.56 PFLOPS sparse
FP4 / FP64 / INT8 peaks: n/a
Interconnect: NeuronLink-v3, 1.28 TB/s per chip
Power: not published per chip
Software stack: AWS Neuron SDK

Cloud TPU v5p (Google, TPU v5p TensorCore)
Unit: cloud TPU chip, Cloud TPU slice form factor
Launch: 2023-12-06
Memory: 95 GB HBM2e at 2.765 TB/s
BF16 peak: 459 TFLOPS; FP16: n/a
FP8 / FP4 / FP64 / INT8 peaks: n/a
Interconnect: ICI 3D torus, 1200 GB/s bidirectional per chip
Power: not published per chip
Software stack: JAX, XLA, TensorFlow, PyTorch/XLA

Cloud TPU v6e / Trillium (Google, TPU v6e TensorCore)
Unit: cloud TPU chip, Cloud TPU slice form factor
Launch: 2024-12-11
Memory: 32 GB HBM at 1.6 TB/s
BF16 peak: 918 TFLOPS; FP16: n/a
FP8 / FP4 / FP64 peaks: n/a
INT8 peak: 1.84 POPS
Interconnect: ICI 2D torus, 800 GB/s bidirectional per chip
Power: not published per chip
Software stack: JAX, XLA, TensorFlow, PyTorch/XLA
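
For inference sizing, the memory lines are often more predictive than the FLOPS lines: batch-1 autoregressive decode must stream every weight byte from HBM for each generated token, so bandwidth sets a hard ceiling on tokens per second. A rough sketch under that assumption (the 70B-parameter FP8 model is hypothetical, and KV-cache and activation traffic are ignored):

```python
# Back-of-envelope decode ceiling: tokens/s <= HBM bandwidth / weight bytes.
# Assumes batch-1 autoregressive decode that reads all weights per token and
# ignores KV-cache traffic, so real throughput will land below this line.

def decode_ceiling_tok_s(hbm_tb_s: float, params_b: float,
                         bytes_per_param: float) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param
    return (hbm_tb_s * 1e12) / weight_bytes

# Hypothetical 70B-parameter model quantized to FP8 (1 byte per parameter),
# on two bandwidth figures from the matrix above.
for name, bw_tb_s in [("H200 SXM", 4.8), ("Instinct MI355X", 8.0)]:
    print(f"{name}: <= {decode_ceiling_tok_s(bw_tb_s, 70, 1):.0f} tokens/s")
```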

Hardware Roofline

Selected devices plotted as peak throughput ceilings against arithmetic intensity on log-log axes. Peaks are the dense BF16 figures from the matrix; the knee marks where each device transitions from bandwidth-bound to compute-bound.

Instinct MI300X: 1.3 PFLOPS peak, knee at 245.28 FLOP/byte
Instinct MI325X: 1.3 PFLOPS peak, knee at 216.67 FLOP/byte
Instinct MI355X: 2.5 PFLOPS peak, knee at 312.5 FLOP/byte
H200 SXM: 989 TFLOPS peak, knee at 206.04 FLOP/byte
B200 SXM: 2.25 PFLOPS peak, knee at 281.25 FLOP/byte
Trainium: 190 TFLOPS peak, knee at 231.71 FLOP/byte
Trainium2: 667 TFLOPS peak, knee at 230 FLOP/byte
Cloud TPU v5p: 459 TFLOPS peak, knee at 166 FLOP/byte
Cloud TPU v6e / Trillium: 918 TFLOPS peak, knee at 573.75 FLOP/byte
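
The knee values above follow directly from the matrix: knee = peak FLOP/s ÷ peak HBM bytes/s, the arithmetic intensity at which a kernel stops being bandwidth-bound. A short sketch reproducing the H200 SXM point:

```python
# Roofline model: attainable FLOP/s at arithmetic intensity I (FLOP/byte)
# is min(peak compute, I * memory bandwidth); the knee is where they cross.

def roofline(peak_flops: float, bw_bytes_s: float, intensity: float) -> float:
    return min(peak_flops, intensity * bw_bytes_s)

peak, bw = 989e12, 4.8e12          # H200 SXM: dense BF16 peak, HBM bandwidth
print(f"knee = {peak / bw:.2f} FLOP/byte")                 # 206.04, as listed
print(f"I=64:  {roofline(peak, bw, 64) / 1e12:.0f} TFLOP/s (bandwidth-bound)")
print(f"I=512: {roofline(peak, bw, 512) / 1e12:.0f} TFLOP/s (compute-bound)")
```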