Building Low-Latency Trading Systems
2024-12-20 · 3 min read
In high-frequency trading, microseconds matter. Building systems that consistently deliver sub-millisecond latency requires careful attention to every layer of the stack.
The Latency Budget
Understanding where time goes is crucial:
| Component | Typical Latency |
|---|---|
| Network (same DC) | 10-50 μs |
| Kernel bypass (DPDK) | 1-5 μs |
| Application logic | 5-20 μs |
| Market data parsing | 1-3 μs |
| Order generation | 2-5 μs |
Every microsecond saved compounds across millions of messages.
Hardware Considerations
The foundation starts with hardware:
- CPU: Modern processors with high clock speeds and large L3 caches
- NUMA awareness: Keep data and threads on same socket
- NIC: Low-latency network cards with kernel bypass
- Memory: Fast RAM with low CAS latency
Software Architecture
Kernel Bypass
Traditional networking goes through the kernel. Kernel bypass (DPDK, RDMA) eliminates this overhead:
```cpp
// Traditional socket: a syscall plus a trip through the kernel network stack
recv(socket_fd, buffer, size, 0);                      // ~50 μs

// DPDK poll-mode driver: packets read directly from the NIC in user space
rte_eth_rx_burst(port_id, queue_id, pkts, MAX_BURST);  // ~2 μs
```
Lock-Free Data Structures
Locks are latency killers. Lock-free queues enable thread communication without blocking:
```cpp
#include <atomic>
#include <cstddef>

// Lock-free ring buffer (single producer, single consumer).
// One slot is left empty so a full queue is distinguishable from an empty one.
template<typename T, size_t SIZE>
class LockFreeQueue {
    std::atomic<size_t> head{0};  // next slot to consume
    std::atomic<size_t> tail{0};  // next slot to produce into
    T buffer[SIZE];

public:
    bool push(const T& item) {
        size_t current_tail = tail.load(std::memory_order_relaxed);
        size_t next_tail = (current_tail + 1) % SIZE;
        if (next_tail == head.load(std::memory_order_acquire))
            return false;  // queue full
        buffer[current_tail] = item;
        tail.store(next_tail, std::memory_order_release);  // publish the item
        return true;
    }
};
```
Memory Management
Dynamic allocation is unpredictable. Pre-allocate everything:
- Object pools for order structures
- Ring buffers for message passing
- Huge pages to reduce TLB misses
Optimization Techniques
CPU Pinning
Bind critical threads to specific cores:
```sh
taskset -c 2,3 ./trading_engine
```
Compiler Optimizations
```cpp
// Branch prediction hint (GCC/Clang builtin): is_market_order is usually true
if (__builtin_expect(is_market_order, 1)) {
    // Fast path
}

// Prefetch the next element before it is needed
__builtin_prefetch(&market_data[i + 1]);

// C++20 standard attributes express the same hints portably
if (is_market_order) [[likely]] {
    // Fast path
} else [[unlikely]] {
    // Slow path
}
```
Cache Optimization
- Align data structures to cache lines (64 bytes)
- False sharing elimination
- Hot/cold path separation
Monitoring and Measurement
You can't optimize what you don't measure:
- TSC (Time Stamp Counter): Cycle-accurate timing
- Percentile analysis: P50, P99, P99.9
- Histograms: Latency distribution visualization
```cpp
#include <x86intrin.h>  // __rdtsc

uint64_t start = __rdtsc();
process_order(order);
uint64_t end = __rdtsc();
uint64_t cycles = end - start;  // divide by the TSC frequency for wall-clock time
```
Trade-offs
Low latency comes with costs:
- Increased complexity
- Higher resource usage (dedicated cores, memory)
- More difficult debugging
- Reduced flexibility
The key is knowing when these trade-offs are justified.