FPGA vs GPU for Low-Latency Inference—Cost-Benefit Analysis
High-frequency trading demands microsecond-level inference latency. Machine learning models must score market data and generate trading decisions faster than competitors. Two hardware accelerators dominate low-latency ML inference: FPGAs (Field-Programmable Gate Arrays) and GPUs (Graphics Processing Units). Understanding the tradeoffs is critical for infrastructure decisions.
FPGA Advantages
FPGAs are custom circuits that can be reprogrammed to implement any computation. For low-latency inference, they offer several advantages:
- Parallelism: Thousands of small processing elements work simultaneously, suited to ML operations
- Latency: Sub-microsecond latency possible with optimized designs; nanosecond-level in specialized cases
- Power efficiency: Lower power consumption than GPUs for equivalent throughput
- Predictability: Latency is deterministic by design; there is no OS-scheduling or kernel-launch jitter of the kind CPUs and GPUs exhibit
- Custom precision: Can implement 8-bit, 12-bit, or other non-standard precisions, reducing computation and logic usage
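The custom-precision point can be made concrete with a small sketch. The code below is illustrative only (plain Python, not an FPGA toolchain): it quantizes a dot product to signed 8-bit integers, the kind of reduced-precision multiply-accumulate an FPGA can implement natively; all function names are invented for this example.

```python
# Sketch: symmetric 8-bit quantization of a dot product, the kind of
# reduced-precision arithmetic an FPGA implements cheaply in logic.
# Names and structure are illustrative, not from any specific toolchain.

def quantize(values, bits=8):
    """Map floats to signed integers of the given width.
    Returns (quantized values, scale factor)."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def int_dot(weights, inputs):
    """Integer dot product with a single dequantization at the output."""
    qw, sw = quantize(weights)
    qx, sx = quantize(inputs)
    acc = sum(w * x for w, x in zip(qw, qx))  # pure integer MACs
    return acc * sw * sx                      # rescale once at the end

w = [0.12, -0.55, 0.90, 0.33]
x = [1.0, 0.5, -0.25, 2.0]
exact = sum(a * b for a, b in zip(w, x))
approx = int_dot(w, x)
print(exact, approx)  # approx tracks exact up to small quantization error
```

The integer accumulator is where the savings come from: an 8-bit multiplier uses a fraction of the FPGA resources of a 32-bit floating-point unit, so many more of them fit on one chip.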
FPGA Disadvantages
- Development complexity: Requires hardware design expertise (Verilog, VHDL) or higher-level tools; much more complex than GPU coding
- Limited flexibility: Reconfiguration (re-synthesis, place-and-route) takes minutes to hours; models cannot be swapped in seconds
- Throughput: Optimized for latency rather than throughput; a single inference is fast, but aggregate throughput across many concurrent inferences typically trails a GPU's
- Cost: Hardware and development costs are high; $50k+ for specialized boards plus significant engineering time
- Ecosystem: Fewer pre-built models and libraries compared to GPU ecosystem
GPU Advantages
GPUs excel at parallel numerical computation and benefit from a massive software ecosystem:
- Ease of use: CUDA, OpenCL, PyTorch, TensorFlow make GPU programming accessible
- Throughput: Thousands of inferences per second on a single GPU
- Flexibility: Models can be swapped in seconds; iterative development is fast
- Cost: Consumer and data-center GPUs are comparatively inexpensive ($1k-$10k); development tools are free
- Ecosystem: Enormous library of pre-trained models, frameworks, and community knowledge
GPU Disadvantages
- Latency: 100+ microseconds end-to-end (kernel launch overhead, execution, and host-device data transfer)
- Variability: Kernel scheduling and memory access patterns create unpredictable latency variations
- Power: High power consumption (300W+ for professional GPUs)
- Batch-size dependency: Latency and throughput depend strongly on batch size; per-inference latency can rise severalfold at small batch sizes because fixed launch overheads are no longer amortized
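The batch-size effect follows from a simple cost model: each GPU launch pays a roughly fixed overhead regardless of batch size, so per-item latency falls as the batch grows while total batch latency rises. The sketch below uses assumed, illustrative numbers, not measurements from any particular GPU.

```python
# Toy model of GPU batch-size dependency. The two constants are
# assumptions chosen for illustration, not benchmark results.

LAUNCH_OVERHEAD_US = 100.0   # assumed fixed cost per kernel launch
PER_ITEM_US = 2.0            # assumed marginal cost per inference

def batch_latency_us(batch_size):
    """Total latency to score one batch, in microseconds."""
    return LAUNCH_OVERHEAD_US + PER_ITEM_US * batch_size

for b in (1, 8, 64):
    total = batch_latency_us(b)
    print(f"batch={b:3d}  total={total:7.1f} us  per-item={total / b:6.1f} us")
```

This is why GPUs shine for throughput-oriented serving (large batches amortize the overhead) but struggle in HFT, where each decision must be made on a batch of one.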
Latency Benchmarks
Real-world latencies (inference only, excluding data preparation and network):
- FPGA (optimized): 1-5 microseconds for small neural networks; 10-50 microseconds for larger models
- GPU (optimized): 50-200 microseconds for small networks; 200-1000 microseconds for large models
- CPU (optimized): 500-5000+ microseconds depending on model size
For microsecond-sensitive HFT, these latencies matter significantly: a 100-microsecond difference corresponds to roughly 20 kilometers of propagation delay in optical fiber, far more than co-location or route optimization can recover.
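The fiber-distance comparison is a one-line calculation, shown here as a sketch. It assumes light in optical fiber travels at roughly two-thirds of c, about 200,000 km/s; the exact figure varies by fiber type.

```python
# Convert a latency difference into equivalent fiber-optic distance.
# Assumes light in fiber travels at ~2/3 c (about 2e8 m/s).

C_FIBER_M_PER_S = 2.0e8

def latency_to_fiber_km(latency_us):
    """Distance light travels in fiber during latency_us microseconds."""
    return latency_us * 1e-6 * C_FIBER_M_PER_S / 1000.0

print(latency_to_fiber_km(100))  # 100 us of latency ~ 20 km of fiber
```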
Hybrid Approaches
Some systems use both: FPGAs for ultra-low-latency critical path (e.g., order-book prediction), GPUs for secondary models (e.g., longer-horizon price forecasting). This balances latency needs with development cost.
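One way to structure such a hybrid is a dispatch layer that routes by latency budget. The sketch below is a stand-in: the threshold, function names, and the `run_on_fpga` placeholder are all invented for illustration; a real system would call device-specific driver APIs.

```python
# Sketch of a hybrid dispatch policy: latency-critical requests go to
# a synchronous FPGA path, everything else to a queue that is later
# batched for the GPU. All interfaces here are hypothetical.

from collections import deque

LATENCY_BUDGET_US = 50.0     # assumed threshold separating the two paths

gpu_queue = deque()          # drained and batched by a GPU worker (not shown)

def run_on_fpga(model_name, inputs):
    """Placeholder for a synchronous FPGA driver call."""
    return f"fpga:{model_name}"

def dispatch(model_name, inputs, deadline_us):
    """Route to FPGA if the deadline is tight, else enqueue for the GPU."""
    if deadline_us <= LATENCY_BUDGET_US:
        return run_on_fpga(model_name, inputs)   # critical path
    gpu_queue.append((model_name, inputs))       # batched, best-effort
    return None

print(dispatch("orderbook", [1, 2, 3], deadline_us=5))    # FPGA path
print(dispatch("forecast", [1, 2, 3], deadline_us=5000))  # queued for GPU
print(len(gpu_queue))
```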
Cost-Benefit Tradeoff
For most applications, GPUs are preferred: lower development cost, easier maintenance, sufficient latency (if system design is optimized). Only specialized ultra-low-latency strategies justify FPGA investment.
The breakeven point depends on:
- Required latency (if < 50 microseconds, FPGA likely necessary)
- Model complexity (larger models favor GPU due to better throughput)
- Model change frequency (frequent changes favor GPU's flexibility)
- Development team expertise (GPU easier if team lacks hardware design skills)
- Budget constraints (GPU cheaper for most organizations)
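The breakeven factors above can be combined into a toy cost model. Every figure below is an assumption for illustration (only the $50k board cost comes from the text earlier); real engineering and revision costs vary widely by team.

```python
# Toy breakeven model for FPGA vs GPU total cost of ownership.
# All constants are illustrative assumptions, not market prices.

FPGA_HW = 50_000       # board cost (figure from the text above)
FPGA_ENG = 300_000     # assumed initial engineering cost for an FPGA design
GPU_HW = 10_000        # assumed GPU hardware cost
GPU_ENG = 50_000       # assumed engineering cost for a GPU pipeline

def total_cost(hw, eng, model_revisions, cost_per_revision):
    """Hardware plus engineering plus ongoing model-change costs."""
    return hw + eng + model_revisions * cost_per_revision

# Assume each model change costs far more on FPGA (re-synthesis, timing
# closure) than on GPU (redeploy a new checkpoint).
for revisions in (0, 5, 20):
    fpga = total_cost(FPGA_HW, FPGA_ENG, revisions, 20_000)
    gpu = total_cost(GPU_HW, GPU_ENG, revisions, 1_000)
    print(f"revisions={revisions:2d}  fpga=${fpga:,}  gpu=${gpu:,}")
```

Under these assumptions the gap widens with every model revision, which is the quantitative version of the "model change frequency" factor above: FPGA investment pays off only when the latency advantage earns back a large and growing cost difference.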
Future Trends
Specialized AI accelerators (TPUs, IPUs) are emerging as competitors to FPGAs and GPUs. These offer better latency-to-cost ratios for specific model types. Also, GPU latency is improving—newer GPUs and optimized kernels reduce inference time significantly.
Conclusion
FPGAs and GPUs represent different points on the latency-complexity-cost tradeoff. FPGAs dominate when single-digit-microsecond latency is mission-critical and budgets allow. GPUs are the pragmatic choice for most practitioners, offering sufficient latency with far lower development cost and greater flexibility.