Ng Ting Sheng

Building Low-Latency Systems in Go - Part 1

Mar 25, 2024 - 7 minute read

Low-latency systems are crucial in domains like high-frequency trading and other financial services, where every microsecond matters. This post explores the theoretical principles behind low-latency Go systems and provides practical code examples you can use in your own applications. We’ll cover goroutine management, memory optimization, profiling techniques, and lock contention strategies.

Goroutines for Efficient Concurrency

Go’s concurrency model revolves around goroutines, lightweight threads managed by the Go runtime rather than the operating system. Unlike OS threads, goroutines have a small initial stack (2KB, growing as needed), enabling thousands to run efficiently on a single machine. This makes them ideal for parallelizing tasks in low-latency systems.

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func heavyTask(id int, wg *sync.WaitGroup) {
    defer wg.Done()

    // Simulate CPU-intensive work
    count := 0
    for i := 0; i < 1000000; i++ {
        count += i
    }
}

func main() {
    numTasks := 1000
    var wg sync.WaitGroup
    
    fmt.Printf("Starting %d goroutines\n", numTasks)
    fmt.Printf("Number of OS threads: %d\n", runtime.GOMAXPROCS(0))
    
    start := time.Now()
    
    for i := 0; i < numTasks; i++ {
        wg.Add(1)
        go heavyTask(i, &wg)
    }
    
    wg.Wait()
    fmt.Printf("All tasks completed in %v\n", time.Since(start))
}

Worker Pool Pattern

While goroutines are lightweight, creating unlimited goroutines can still overwhelm the system. The worker pool pattern helps control concurrency and resource usage. Here’s an example of a worker pool processing tasks concurrently using goroutines and buffered channels:

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

type Job struct {
    ID   int
    Data string
}

type Result struct {
    Job    Job
    Output string
    Error  error
}

func worker(id int, jobs <-chan Job, results chan<- Result) {
    for job := range jobs {
        // Simulate work
        time.Sleep(time.Millisecond * 100)
        
        result := Result{
            Job:    job,
            Output: fmt.Sprintf("Worker %d processed job %d: %s", id, job.ID, job.Data),
        }
        
        results <- result
    }
}

func main() {
    numWorkers := runtime.NumCPU()
    numJobs := 20
    
    jobs := make(chan Job, numJobs)
    results := make(chan Result, numJobs)
    
    // Start workers
    var wg sync.WaitGroup
    for i := 1; i <= numWorkers; i++ {
        wg.Add(1)
        go func(workerID int) {
            defer wg.Done()
            worker(workerID, jobs, results)
        }(i)
    }
    
    // Send jobs
    go func() {
        for i := 1; i <= numJobs; i++ {
            jobs <- Job{
                ID:   i,
                Data: fmt.Sprintf("data-%d", i),
            }
        }
        close(jobs)
    }()
    
    // Collect results
    go func() {
        wg.Wait()
        close(results)
    }()
    
    // Process results
    for result := range results {
        fmt.Println(result.Output)
    }
}

Memory Allocation and Garbage Collection

Go’s garbage collector (GC) is designed for low pause times, but excessive memory allocation can trigger frequent GC cycles, increasing latency. Understanding allocation and GC behavior is crucial for low-latency systems.

Key Concepts

  • Memory Allocation: Allocating memory (e.g., creating new objects) triggers heap growth, potentially invoking GC.
  • GC Pause: At points in each mark-and-sweep cycle the runtime must briefly stop the world, pausing your goroutines and introducing latency.
  • GC Pressure: Frequent allocations increase GC frequency, degrading performance.
  • Mitigation Strategies:
    • Use sync.Pool to reuse short-lived objects.
    • Pre-allocate slices to avoid dynamic resizing (see the sketch after this list).
    • Minimize allocations in hot paths.
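
To make the pre-allocation point concrete, here is a minimal sketch (the loop bound and variable names are illustrative, not from a real workload): growing a slice from zero capacity forces append to repeatedly allocate a larger backing array and copy the old contents, while sizing the slice once with make incurs a single allocation.

package main

import "fmt"

func main() {
    const n = 100000

    // Naive: append starts from zero capacity, so the runtime must
    // repeatedly allocate a bigger backing array and copy the elements.
    var grown []int
    for i := 0; i < n; i++ {
        grown = append(grown, i)
    }

    // Pre-allocated: one allocation up front, no copying in the loop.
    pre := make([]int, 0, n)
    for i := 0; i < n; i++ {
        pre = append(pre, i)
    }

    fmt.Println(len(grown), len(pre), cap(pre))
}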

Object Pooling

Object pooling is a critical technique for reducing garbage collection pressure in low-latency systems. Instead of constantly allocating and discarding objects, you maintain a pool of reusable objects that can be borrowed and returned. This eliminates the allocation overhead and prevents these objects from becoming garbage that needs to be collected. The key insight is that object creation and garbage collection both consume CPU time and can cause unpredictable latency spikes.

package msgpool // illustrative package name; embed in your own package

import (
    "sync"
    "time"
)

// Message represents a typical message in a low-latency system
type Message struct {
    ID        uint64
    Timestamp int64
    Payload   []byte
    processed bool
}

// Reset clears the message for reuse
func (m *Message) Reset() {
    m.ID = 0
    m.Timestamp = 0
    m.Payload = m.Payload[:0] // Reset slice but keep capacity
    m.processed = false
}

// MessagePool manages message object reuse
type MessagePool struct {
    pool sync.Pool
}

func NewMessagePool() *MessagePool {
    return &MessagePool{
        pool: sync.Pool{
            New: func() interface{} {
                return &Message{
                    Payload: make([]byte, 0, 1024), // Pre-allocate capacity
                }
            },
        },
    }
}

func (p *MessagePool) Get() *Message {
    return p.pool.Get().(*Message)
}

func (p *MessagePool) Put(msg *Message) {
    msg.Reset()
    p.pool.Put(msg)
}

// Example: High-frequency message processing
func ProcessMessages(pool *MessagePool, count int) {
    for i := 0; i < count; i++ {
        // Get from pool instead of allocating
        msg := pool.Get()
        
        // Use the message
        msg.ID = uint64(i)
        msg.Timestamp = time.Now().UnixNano()
        msg.Payload = append(msg.Payload, byte(i%256))
        
        // Process message (simulation)
        processMessage(msg)
        
        // Return to pool
        pool.Put(msg)
    }
}

func processMessage(msg *Message) {
    // Simulate processing without allocations
    msg.processed = true
}

Buffer Reuse and Slice Management

Efficient slice and buffer management is crucial because slices are among the most commonly allocated data structures in Go programs. The fundamental principle is to reuse the underlying arrays rather than creating new ones. When you append to a slice that has reached its capacity, Go allocates a new, larger array and copies the existing data; this allocation and copying can introduce latency. By pre-allocating buffers with sufficient capacity and resetting their length (but not their capacity) when reusing them, you can eliminate these unexpected allocations.

package main

import "sync"

type ByteBufferPool struct {
    pool sync.Pool
}

func NewByteBufferPool(initialSize int) *ByteBufferPool {
    return &ByteBufferPool{
        pool: sync.Pool{
            New: func() interface{} {
                return make([]byte, 0, initialSize)
            },
        },
    }
}

func (p *ByteBufferPool) Get() []byte {
    return p.pool.Get().([]byte)
}

// Put returns a buffer to the pool. Caveat: a []byte slice header is
// larger than a single word, so storing it in the pool boxes it into an
// interface{} and allocates on every Put (staticcheck SA6002); pools on
// hot paths often store *[]byte instead to avoid this.
func (p *ByteBufferPool) Put(buf []byte) {
    // Reset length but preserve capacity
    buf = buf[:0]
    p.pool.Put(buf)
}

// Example: allocation-free byte processing, as long as the pooled
// buffer's pre-allocated capacity is large enough for the output
func ProcessData(data []byte, bufPool *ByteBufferPool) []byte {
    // Get buffer from pool
    result := bufPool.Get()
    
    // Process data without allocations
    for _, b := range data {
        if b != 0 { // Simple filtering example
            result = append(result, b^0xFF) // Simple transformation
        }
    }
    
    // Caller responsible for returning buffer to pool
    return result
}
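
To make the ownership contract at the end of ProcessData concrete, here is a minimal usage sketch (the input bytes are arbitrary; it assumes the main function lives alongside the pool code above, completing the program): the caller consumes the result, then hands the buffer back.

func main() {
    bufPool := NewByteBufferPool(4096)

    out := ProcessData([]byte{0x01, 0x00, 0x02, 0x03}, bufPool)
    println("output bytes:", len(out)) // consume the result first

    // After Put, the pool may hand the same backing array to another
    // caller, so out must not be used beyond this point.
    bufPool.Put(out)
}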

GC Tuning

Garbage Collection (GC) tuning is one of the most critical aspects of building low-latency systems in Go. The garbage collector automatically manages memory by periodically cleaning up unused objects, but this cleanup process can introduce unpredictable pauses that are detrimental to latency-sensitive applications.

What is GC Tuning?

GC tuning involves configuring the garbage collector’s behavior to minimize pause times and make them more predictable. The key insight is that there’s always a trade-off between GC frequency and pause duration. More frequent GC cycles mean shorter individual pauses, but more overall GC overhead.

Why GC Pauses Matter for Low Latency

When the garbage collector runs, it needs to pause your application (called “stop-the-world” pauses) to safely examine and clean up memory. Even though Go’s GC is concurrent and these pauses are typically short, they can still be problematic for systems that need to respond within microseconds. A single 1-millisecond GC pause can ruin the performance profile of a high-frequency trading system or real-time game server.

The GOGC Parameter

The most important GC tuning parameter is GOGC, which controls when garbage collection is triggered. By default, GOGC=100, meaning GC runs when the heap has grown 100% (doubled) since the last collection. Lower values trigger GC more frequently with smaller heap sizes, resulting in shorter pauses but more frequent interruptions.
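
As a quick illustration (a minimal sketch; the value 50 is arbitrary, not a recommendation for any particular workload), GOGC can be set either from the environment, e.g. GOGC=50 ./yourapp, or at runtime via runtime/debug.SetGCPercent:

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // Equivalent to starting the process with GOGC=50: collect when the
    // heap has grown 50% since the last collection. SetGCPercent returns
    // the previous setting (100 by default, unless GOGC overrides it).
    old := debug.SetGCPercent(50)
    fmt.Printf("GOGC changed from %d to 50\n", old)

    // SetGCPercent(-1) disables the collector entirely; some systems do
    // this and manage memory manually, at the cost of unbounded heap growth.
}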

To be continued

The article has unexpectedly grown too long, so I will continue this topic in Part 2.