Deniz Kartal

Building Low-Latency Trading Systems

2024-12-20 · 3 min read
engineering · computer-science

In high-frequency trading, microseconds matter. Building systems that consistently deliver sub-millisecond latency requires careful attention to every layer of the stack.

The Latency Budget

Understanding where time goes is crucial:

Component               Typical Latency
Network (same DC)       10-50 μs
Kernel bypass (DPDK)    1-5 μs
Application logic       5-20 μs
Market data parsing     1-3 μs
Order generation        2-5 μs

Every microsecond saved compounds across millions of messages.

Hardware Considerations

The foundation starts with hardware:

  • CPU: Modern processors with high clock speeds and large L3 caches
  • NUMA awareness: Keep data and threads on same socket
  • NIC: Low-latency network cards with kernel bypass
  • Memory: Fast RAM with low CAS latency

Software Architecture

Kernel Bypass

Traditional networking goes through the kernel. Kernel bypass (DPDK, RDMA) eliminates this overhead:

// Traditional socket
recv(socket_fd, buffer, size, 0);  // ~50μs

// DPDK
rte_eth_rx_burst(port_id, queue_id, pkts, MAX_BURST);  // ~2μs

Lock-Free Data Structures

Locks are latency killers. Lock-free queues enable thread communication without blocking:

// Single-producer/single-consumer lock-free ring buffer
template<typename T, size_t SIZE>
class LockFreeQueue {
    std::atomic<size_t> head{0};
    std::atomic<size_t> tail{0};
    T buffer[SIZE];

public:
    bool push(const T& item) {
        size_t current_tail = tail.load(std::memory_order_relaxed);
        size_t next_tail = (current_tail + 1) % SIZE;

        // Full: advancing tail would collide with head
        if (next_tail == head.load(std::memory_order_acquire))
            return false;

        buffer[current_tail] = item;
        tail.store(next_tail, std::memory_order_release);
        return true;
    }

    bool pop(T& item) {
        size_t current_head = head.load(std::memory_order_relaxed);

        // Empty: head has caught up with tail
        if (current_head == tail.load(std::memory_order_acquire))
            return false;

        item = buffer[current_head];
        head.store((current_head + 1) % SIZE, std::memory_order_release);
        return true;
    }
};

Memory Management

Dynamic allocation is unpredictable. Pre-allocate everything:

  • Object pools for order structures
  • Ring buffers for message passing
  • Huge pages to reduce TLB misses
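As a minimal sketch of the object-pool idea: all slots are allocated up front in a fixed array, and acquire/release just move indices on a free list, so no call ever touches the heap. The `Order` struct and `OrderPool` name are illustrative, not from a real codebase:

```cpp
#include <array>
#include <cstddef>

struct Order { int id; double price; };

// Fixed-capacity pool: every Order is pre-allocated; the free list
// holds slot indices, so acquire/release never allocate.
template <size_t N>
class OrderPool {
    std::array<Order, N> slots{};
    std::array<size_t, N> free_list{};
    size_t free_count = N;
public:
    OrderPool() {
        for (size_t i = 0; i < N; ++i) free_list[i] = i;
    }
    Order* acquire() {
        if (free_count == 0) return nullptr;  // pool exhausted, no fallback
        return &slots[free_list[--free_count]];
    }
    void release(Order* o) {
        free_list[free_count++] = static_cast<size_t>(o - slots.data());
    }
};
```

Exhaustion returns `nullptr` rather than allocating: on the hot path, a dropped message is often preferable to an unpredictable `malloc`.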

Optimization Techniques

CPU Pinning

Bind critical threads to specific cores:

taskset -c 2,3 ./trading_engine
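The same pinning can be done from inside the process on Linux via `pthread_setaffinity_np`; here is a small sketch (the `pin_to_core` helper is hypothetical):

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one CPU core (Linux-specific).
// Returns true on success.
bool pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(),
                                  sizeof(cpu_set_t), &set) == 0;
}
```

In-process pinning lets each thread choose its own core at startup, which composes well with isolating those cores from the scheduler (e.g. `isolcpus`).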

Compiler Optimizations

// Branch prediction hints (GCC/Clang builtin)
if (__builtin_expect(is_market_order, 1)) {
    // Fast path
}

// Prefetching
__builtin_prefetch(&market_data[i+1]);

// C++20 standard attributes
if (is_market_order) [[likely]] {
    // Fast path
} else [[unlikely]] {
    // Slow path
}

Cache Optimization

  • Align data structures to cache lines (64 bytes)
  • Eliminate false sharing between writer threads
  • Separate hot and cold code paths
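Alignment and false-sharing elimination can both be expressed with `alignas`. A sketch: giving each per-thread counter its own 64-byte cache line stops two cores from invalidating each other's line on every increment (the 64-byte line size is the common x86 value, not universal):

```cpp
#include <atomic>
#include <cstdint>

// One counter per cache line: concurrent writers on different
// counters no longer contend for the same line (no false sharing).
struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
};
```

`sizeof(PaddedCounter)` is padded out to 64 bytes, so an array of these places each element on its own line.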

Monitoring and Measurement

You can't optimize what you don't measure:

  • TSC (Time Stamp Counter): Cycle-accurate timing
  • Percentile analysis: P50, P99, P99.9
  • Histograms: Latency distribution visualization

uint64_t start = __rdtsc();
process_order(order);
uint64_t end = __rdtsc();
uint64_t cycles = end - start;
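Once cycle counts are collected, percentile analysis is a sort and an index lookup. A nearest-rank sketch (the `percentile` helper is illustrative, not a standard API):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile over a set of latency samples.
// p is in [0, 100]; samples is taken by value so the caller's
// data is not reordered.
uint64_t percentile(std::vector<uint64_t> samples, double p) {
    std::sort(samples.begin(), samples.end());
    size_t idx = static_cast<size_t>(p / 100.0 * (samples.size() - 1));
    return samples[idx];
}
```

P99 and P99.9 matter more than the mean here: a single slow order at the wrong moment costs more than many fast ones save.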

Trade-offs

Low latency comes with costs:

  • Increased complexity
  • Higher resource usage (dedicated cores, memory)
  • More difficult debugging
  • Reduced flexibility

The key is knowing when these trade-offs are justified.