Ng Ting Sheng

Building Low-Latency Systems in Go - Part 2

Apr 1, 2024 - 5 minute read

Profiling and Benchmarking

Profiling and benchmarking are essential for building low-latency systems because performance intuition is often wrong - what seems like it should be fast might have hidden bottlenecks, and apparent optimizations might actually make things slower. Go provides powerful tools like pprof and GODEBUG to analyze CPU, memory, and GC performance.

Why Profile?

  • Identify Hot Paths: Pinpoint functions consuming the most CPU or memory.
  • Optimize GC: Detect excessive allocations triggering frequent GC cycles.
  • Measure Latency: Quantify the impact of changes on performance.

Example: Profiling with pprof

Here’s a simple program with a potential bottleneck, followed by steps to profile it:

package main

import (
	"fmt"
	"time"
)

func slowFunction(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i
		time.Sleep(1 * time.Millisecond) // Simulate delay
	}
	return sum
}

func main() {
	for i := 0; i < 5; i++ {
		result := slowFunction(100)
		fmt.Println("Result:", result)
	}
}

To profile:

  • Add import _ "net/http/pprof" to your program and start an HTTP server (go http.ListenAndServe("localhost:6060", nil)); a minimal sketch follows this list.
  • Collect a 10-second CPU profile: curl -o profile.out "http://localhost:6060/debug/pprof/profile?seconds=10" (or point go tool pprof directly at that URL).
  • Analyze with go tool pprof -http=:8080 profile.out to visualize bottlenecks in your browser.
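
Here is a minimal sketch of the first step; only the blank pprof import and the background server are essential, and the port is arbitrary:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the pprof endpoints in the background; the rest of the
	// program runs unchanged.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... run your workload here ...
	select {} // block so the server stays up while you profile
}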

One caveat: a CPU profile samples on-CPU time, so the time.Sleep calls here will not appear in it - a sleeping goroutine consumes no CPU. Wall-clock delays like this are better diagnosed with the execution tracer: collect one with curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5" and open it with go tool trace trace.out. For CPU-bound bottlenecks, the CPU profile will show the hot function dominating the samples, and optimizing it reduces latency directly.

GC Profiling

Set GODEBUG=gctrace=1 when running your program to log GC activity. For example:

GODEBUG=gctrace=1 go run main.go

This outputs GC pause times and heap statistics, helping you identify excessive GC activity.
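
Each line summarizes one collection. An illustrative line (the values here are made up; yours will differ, and newer Go versions append a few extra fields):

gc 12 @4.104s 2%: 0.063+2.0+0.058 ms clock, 0.50+0.1/1.9/4.1+0.46 ms cpu, 4->4->2 MB, 5 MB goal, 8 P

Reading left to right: the GC cycle number and the seconds since program start; the fraction of CPU time spent in GC so far; the "ms clock" triple, giving wall-clock times for the stop-the-world sweep termination, concurrent mark, and stop-the-world mark termination phases (the "ms cpu" figures are CPU times for the same phases); the heap size at GC start, at GC end, and the live heap; the heap goal; and the number of processors used. One practical note: go run passes GODEBUG on to the go tool itself, so building a binary first gives cleaner output.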

CPU and Memory Profiling

package main

import (
    "context"
    "log"
    "os"
    "runtime"
    "runtime/pprof"
    "time"
)

func EnableProfiling(ctx context.Context) {
    // CPU profiling: samples go to cpu.prof until ctx is cancelled
    cpuFile, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }

    if err := pprof.StartCPUProfile(cpuFile); err != nil {
        log.Fatal(err)
    }
    go func() {
        <-ctx.Done()
        pprof.StopCPUProfile()
        cpuFile.Close()
    }()
    
    // Memory profiling - periodic snapshots
    go func() {
        ticker := time.NewTicker(10 * time.Second)
        defer ticker.Stop()
        
        for {
            select {
            case <-ticker.C:
                memFile, err := os.Create("mem.prof")
                if err != nil {
                    continue
                }
                
                runtime.GC() // Force GC before snapshot
                pprof.WriteHeapProfile(memFile)
                memFile.Close()
                
            case <-ctx.Done():
                return
            }
        }
    }()
}

// Example usage
func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    
    EnableProfiling(ctx)
    
    // Run your low-latency workload here, then block until the
    // profiling window ends so the CPU profile is flushed to disk
    // ...
    <-ctx.Done()
}
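
Once the context expires, inspect the captured profiles with go tool pprof -http=:8080 cpu.prof (and likewise for mem.prof). Note that each snapshot overwrites the previous mem.prof; writing timestamped filenames is a simple extension if you need a history.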

Minimizing Lock Contention

Locks (e.g., sync.Mutex) serialize access to shared resources, introducing contention and delays in concurrent systems. Excessive locking can queue goroutines, increasing latency.

Strategies to Reduce Contention

  • Avoid Locks: Use atomic operations (sync/atomic) for simple counters and flags, or restructure so a single goroutine owns the state and others reach it through channels (see the sketch after this list).
  • Sharded Locks: Split data into partitions with separate locks so goroutines contend only within a shard.
  • sync.Map: Optimized for read-mostly workloads where keys are written once and read many times; benchmark it against a plain map guarded by a sync.RWMutex before adopting it.
  • Message Passing: Prefer channels for communication over shared state.
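
As a minimal sketch of the first strategy, here is a contended counter that needs no mutex at all (the counter is illustrative; atomic.Int64 requires Go 1.19 or newer):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// A shared request counter: atomic.Int64 replaces a
	// mutex-protected int for this read-modify-write pattern.
	var requests atomic.Int64

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			requests.Add(1) // lock-free increment
		}()
	}
	wg.Wait()
	fmt.Println("handled:", requests.Load())
}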

Example: Sharded Locks vs. sync.Map

Here’s a comparison of a sharded map (using multiple RWMutex-protected maps) vs. sync.Map:

package main

import (
	"fmt"
	"sync"
	"time"
)

const shards = 4

type ShardedMap struct {
	shards []map[string]int
	locks  []sync.RWMutex
}

func NewShardedMap() *ShardedMap {
	m := &ShardedMap{
		shards: make([]map[string]int, shards),
		locks:  make([]sync.RWMutex, shards),
	}
	for i := 0; i < shards; i++ {
		m.shards[i] = make(map[string]int)
	}
	return m
}

func (m *ShardedMap) Set(key string, value int) {
	shard := uint32(hash(key)) % shards
	m.locks[shard].Lock()
	m.shards[shard][key] = value
	m.locks[shard].Unlock()
}

func (m *ShardedMap) Get(key string) (int, bool) {
	shard := uint32(hash(key)) % shards
	m.locks[shard].RLock()
	defer m.locks[shard].RUnlock()
	v, ok := m.shards[shard][key]
	return v, ok
}

// hash is a deliberately naive additive hash to keep the example short;
// real code should use hash/fnv or hash/maphash for better distribution.
func hash(s string) int {
	h := 0
	for _, c := range s {
		h += int(c)
	}
	return h
}

func main() {
	// ShardedMap
	sm := NewShardedMap()
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sm.Set(fmt.Sprintf("key-%d", i), i)
		}(i)
	}
	wg.Wait()
	fmt.Printf("ShardedMap took: %v\n", time.Since(start))

	// sync.Map
	var smap sync.Map
	start = time.Now()
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			smap.Store(fmt.Sprintf("key-%d", i), i)
		}(i)
	}
	wg.Wait()
	fmt.Printf("sync.Map took: %v\n", time.Since(start))
}

This example suggests that sharded locks can outperform sync.Map in write-heavy, high-contention scenarios by shrinking the scope of each lock. The wall-clock timing in main is only a rough illustration, though - at 1,000 operations, goroutine startup costs dominate - so always benchmark to confirm the best approach for your workload.
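
A fairer comparison uses Go's benchmark harness instead of wall-clock timing in main. A minimal sketch, assuming the ShardedMap code above lives in the same package (put this in a _test.go file; the benchmark names are illustrative):

package main

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkShardedMap(b *testing.B) {
	m := NewShardedMap()
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			m.Set(fmt.Sprintf("key-%d", i), i)
			i++
		}
	})
}

func BenchmarkSyncMap(b *testing.B) {
	var m sync.Map
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			m.Store(fmt.Sprintf("key-%d", i), i)
			i++
		}
	})
}

Run it with go test -bench=. -cpu=1,4,8 to see how each implementation scales as parallelism increases.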

Closing thoughts

Go’s concurrency model (goroutines and channels), efficient runtime, and profiling tools make it a powerful choice for low-latency systems. Key takeaways:

  • Use goroutines and worker pools to parallelize tasks efficiently, but avoid leaks.
  • Minimize allocations with sync.Pool and pre-allocated slices to reduce GC pressure.
  • Profile with pprof and GODEBUG to identify and eliminate bottlenecks.
  • Reduce lock contention with sharded locks, atomic operations, or channels.

The examples in this post demonstrate practical patterns that can be adapted to your specific use case. Remember that optimization is about trade-offs—sometimes the most readable code isn’t the fastest, and sometimes the fastest code isn’t the most maintainable. Always measure the impact of your optimizations and ensure they actually improve your system’s performance characteristics.

Building low-latency systems is an iterative process. Start with a working system, measure its performance, identify the bottlenecks, and optimize incrementally. Go’s excellent tooling makes this process much more manageable than in many other languages, giving you the insights needed to build systems that can handle the most demanding latency requirements.