Fractal Ideas


Aymeric Augustin · Feb. 23, 2021 · 11 min read

How fast can you Go?

Today I will explore how profiling tools can help us make a Go program run faster.

If you expected an inspirational think piece on personal development, well, perhaps another day :-)

Like many Pythonistas, I have been frustrated by Python’s relatively slow runtime. The promise of fast execution is a key reason for learning Go. The other key reason would be goroutines, but we won’t encounter them today.

So I set out to read The Go Programming Language.

Of course, there’s no substitute for writing code when learning a language. I decided that writing a Sudoku solver in Go would be a good first step. It’s easy enough to do in a few hours and interesting enough to get a feeling for how a language works.

The algorithm we’ll optimize

If you never solved a Sudoku, you can get familiar with the rules on Wikipedia.

Since I experimented with Sudoku in other programming languages before, I already know an algorithm that does the job¹:

  1. Start with an empty 9x9 grid: all cells are empty and all values from 1 to 9 are possible in all cells.
  2. Set initial values in cells; along the way, record that these values are now impossible in other cells from the same row, column, or 3x3 subgrid.
  3. When the value of a cell becomes known because the other eight values are impossible, set it as well. Do this recursively.
  4. If there are empty cells left, pick a cell and try all possible values, following the same process as in step 3. See which one leads to a solution.

This is pretty much a mechanized version of how a human would solve the problem, taking advantage of the computer’s larger memory space.

This algorithm lends itself fairly well to an object-oriented implementation. An object stores the state of the grid; its methods mutate that state.
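As a sketch of that object, with hypothetical names rather than the ones used in solver.go, each cell can hold a bitmask of its remaining candidate values, and a queue can record cells whose value just became known:

```go
package main

import (
	"fmt"
	"math/bits"
)

// solver is a minimal sketch of the state described above (hypothetical
// names). Each cell holds a bitmask of the values 1-9 still possible.
type solver struct {
	cells [81]uint16 // bit v-1 set = value v is still possible in this cell
	next  []int      // queue of cells whose value just became known
}

func newSolver() *solver {
	var s solver
	for i := range s.cells {
		s.cells[i] = 0x1FF // all nine values possible initially
	}
	return &s
}

// eliminate records that value v is impossible in cell i. When exactly one
// value remains, the cell is queued so the caller can propagate further.
func (s *solver) eliminate(i, v int) {
	before := s.cells[i]
	s.cells[i] &^= 1 << (v - 1)
	if bits.OnesCount16(s.cells[i]) == 1 && bits.OnesCount16(before) == 2 {
		s.next = append(s.next, i)
	}
}

func main() {
	s := newSolver()
	for v := 1; v <= 8; v++ {
		s.eliminate(0, v) // rule out values 1 through 8 in cell 0
	}
	fmt.Println(s.cells[0] == 1<<8, s.next) // only 9 remains; cell 0 is queued
}
```

Step 3 of the algorithm then amounts to popping cells off the queue and propagating their now-known values.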

The most interesting methods, which we’ll encounter below, are mark, candidate, and search.

Here’s my first attempt at implementing the Solve function: solver.go. Surely there’s room for improving that code, but with only a few hours of Go experience at this point, I’m reasonably happy with it.

For good measure, here’s the Generate function: generator.go. As the name implies, it generates a random puzzle. It builds upon the solver to create a valid grid with a value in every cell, then it clears some cells while ensuring there’s still exactly one solution in order to produce the final puzzle.

Let’s benchmark

Time to see how fast this code runs!

If you’re completely unfamiliar with benchmarking in Go, you may want to read Dave Cheney’s introduction first.

Let’s start with Generate. Since it doesn’t take any parameters, it’s easier to test.

func BenchmarkGenerate(b *testing.B) {
    for n := 0; n < b.N; n++ {
        rand.Seed(int64(42 * n))
        Generate()
    }
}

Now we can run this benchmark:

$ go test -bench=BenchmarkGenerate -run=X
goos: darwin
goarch: amd64
pkg: github.com/aaugustin/sudoku/go/sudoku
BenchmarkGenerate-8          148       7763004 ns/op
PASS
ok      github.com/aaugustin/sudoku/go/sudoku   2.034s

7.8 milliseconds seems to be in the right ballpark. I have a distant memory from previous experiments that generating a grid took hundreds of milliseconds in Python and milliseconds in C.

Perhaps you’re wondering about -run=X. That pattern doesn’t match any test, so nothing other than the benchmark runs.

I made the benchmark deterministic by seeding the random generator with a constant value. I also chose to change that value for each n. This provides a larger sample of the performance of Generate, which can vary quite a bit depending on the puzzle it creates.

In theory, changing the seed could make the benchmark less stable. However, running it for 20 seconds shows exactly the same performance, so it doesn’t seem like we have a stability problem. Besides, if BenchmarkGenerate can churn through more values of n in one second, for sure it’s getting faster, and I care more about measuring gains than about the absolute performance of Generate.

Now let’s turn our attention to Solve.

I have a file of 95 hard Sudoku problems lying around. After writing a quick loadPuzzles function to read these problems from disk, we can benchmark the solver.

func BenchmarkSolve(b *testing.B) {
    grids, err := loadPuzzles()
    if err != nil {
        b.Errorf("failed to load grids: %s", err)
        return
    }
    for n := 0; n < b.N; n++ {
        grid := grids[n%len(grids)]
        Solve(&grid)
    }
}

Let’s run the benchmark:

$ go test -bench=BenchmarkSolve -run=X
goos: darwin
goarch: amd64
pkg: github.com/aaugustin/sudoku/go/sudoku
BenchmarkSolve-8         201       6476909 ns/op
PASS
ok      github.com/aaugustin/sudoku/go/sudoku   1.965s

Again, 6.5 milliseconds seems reasonable.

Perhaps you’re surprised that Solve takes roughly the same amount of time as Generate in these benchmarks, even though the generator runs the solver many times. This happens because we’re benchmarking the solver on a set of hard problems that require exploring lots of possibilities. Conversely, the generator usually solves a lot of very easy problems.

All this looks pretty good. Now… can we make this code run faster?

Enter profiling

The first step of optimizing a program is to figure out where the execution time goes. Then we can look for wasted time.

At this stage:

Let’s run our benchmark again with CPU profiling turned on.

$ go test -bench=BenchmarkSolve -run=X -cpuprofile cpu.prof
goos: darwin
goarch: amd64
pkg: github.com/aaugustin/sudoku/go/sudoku
BenchmarkSolve-8         183       6964865 ns/op
PASS
ok      github.com/aaugustin/sudoku/go/sudoku   2.398s

We can see that CPU profiling slows down execution by a bit less than 10% for this benchmark. We’re now solving in 7.0 milliseconds instead of 6.5 milliseconds.

Out of the box, Go provides a nice web-based UI to visualize profiles.

$ go tool pprof -http :8000 sudoku.test cpu.prof
Serving web UI on http://localhost:8000

Let’s select sudoku Solve, the function we’re benchmarking, then Refine > Focus in the menu, and we get this visualization.

1 - CPU profile of Solve

We’re spending more than half of our time in the mark function, which makes sense, given that it’s the core of the algorithm.

We can also see that mark spends almost 40% of its time running growslice. This is interesting! growslice must be the routine in charge of allocating a larger array when we’re calling append on a slice that reached the capacity of its underlying array. Can we optimize this?
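To see what growslice costs, here is a standalone comparison, separate from the solver’s code, using testing.AllocsPerRun: appending into a nil slice reallocates repeatedly as the capacity doubles, while a pre-sized slice never has to grow.

```go
package main

import (
	"fmt"
	"testing"
)

// grow appends 81 values into a nil slice. Each time append runs out of
// capacity, the runtime (growslice) allocates a larger backing array.
func grow() {
	var s []int
	for i := 0; i < 81; i++ {
		s = append(s, i)
	}
}

// prealloc sizes the slice up front, so append never reallocates.
func prealloc() {
	s := make([]int, 0, 81)
	for i := 0; i < 81; i++ {
		s = append(s, i)
	}
}

func main() {
	// AllocsPerRun reports the average number of heap allocations per call.
	fmt.Println("grow:", testing.AllocsPerRun(100, grow))
	fmt.Println("prealloc:", testing.AllocsPerRun(100, prealloc))
}
```

Exact numbers depend on the compiler version, but growing incrementally always allocates more than pre-sizing.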

To understand where these allocations happen in mark, let’s select mark, then
View > Source in the menu, and we get line by line profiling data.

2 - CPU profile of mark

There’s only one call to append in mark, at line 86. We’re spending 430ms there. When the value of a cell becomes known, this line pushes it to a queue, s.next, with the intent to pop it from the queue and call mark for that cell later, at line 90.

Allocating memory is expensive

Given how the algorithm works, s.next cannot contain more than the 81 cells of the grid.

Let’s add a function to pre-allocate a slice with a capacity of 81, instead of relying on the Go runtime to resize the slice as needed:

func (s *solver) init() {
    // Allocate memory only once instead of growing the slice incrementally.
    if s.next != nil {
        panic("must init only once")
    }
    s.next = make([]int, 0, 81)
}

While doing this, it appears that we can share the array underlying s.next between grids in search, as long as we don’t explore possibilities concurrently. As a consequence, we can call init exactly once in Solve, rather than every time we make a copy of a grid. See this commit.

Let’s look at the results.

$ go test -bench=BenchmarkSolve -run=X -cpuprofile cpu.prof
goos: darwin
goarch: amd64
pkg: github.com/aaugustin/sudoku/go/sudoku
BenchmarkSolve-8         271       4501117 ns/op
PASS
ok      github.com/aaugustin/sudoku/go/sudoku   1.846s

Well that’s a nice improvement! We removed over a third of the run time!

The memory allocations for growing slices were costing us a lot of performance.

I really like that Go provides the convenience of slices and the ability to manage the underlying array if we want to do low-level optimizations.

How did the CPU profile change?

$ go tool pprof -http :8000 sudoku.test cpu.prof
Serving web UI on http://localhost:8000

3 - CPU profile of Solve

Better algorithms win

Now that mark is much faster, we notice that candidate takes an unexpectedly large amount of time. This method looks for the cell with the fewest possible values, in order to minimize the number of branches that search has to explore.

It turns out that we don’t have to iterate over all cells. We can stop as soon as we find a cell with only two possibilities, since that’s the fewest possible cases. That’s a very easy change to make. See this commit.
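The early exit can be sketched like this (hypothetical names; counts stands in for however the grid tracks the number of remaining values per cell, with 1 meaning the cell is solved):

```go
package main

import "fmt"

// candidate returns the unsolved cell with the fewest remaining values,
// stopping early at two, since no unsolved cell can do better than that.
// This is a sketch of the technique, not the code from solver.go.
func candidate(counts [81]int) int {
	best, bestCount := -1, 10
	for i, c := range counts {
		if c > 1 && c < bestCount {
			best, bestCount = i, c
			if c == 2 {
				break // two candidates is the minimum for an unsolved cell
			}
		}
	}
	return best
}

func main() {
	var counts [81]int
	for i := range counts {
		counts[i] = 1 // start with every cell solved...
	}
	counts[4], counts[7], counts[20] = 5, 2, 3 // ...except these three
	fmt.Println(candidate(counts))             // stops at cell 7, never scans cell 20
}
```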

$ go test -bench=BenchmarkSolve -run=X -cpuprofile cpu.prof
goos: darwin
goarch: amd64
pkg: github.com/aaugustin/sudoku/go/sudoku
BenchmarkSolve-8         282       4243857 ns/op
PASS
ok      github.com/aaugustin/sudoku/go/sudoku   1.847s

It’s a small gain, but it didn’t require much effort either, so we’ll take it happily.

With such a small gain, it looks like we’re reaching the end of how far we can optimize.

Let’s take one last look at the profile.

$ go tool pprof -http :8000 sudoku.test cpu.prof
Serving web UI on http://localhost:8000

4 - CPU profile of Solve

duffcopy and mallocgc are provided by the Go runtime. In addition to being outside of our control, they must be highly optimized.

mark seems completely optimized at this point. Looking at the source, it appears that it isn’t doing anything besides running the logic of the algorithm.

5 - CPU profile of mark

search is a very small function. It doesn’t seem to do much besides calling mark and calling itself recursively.

6 - CPU profile of search

Hmm.

Can we do something about these 290ms at line 126? We’re making a copy of the grid, which requires allocating memory, and now we know this can be slow.

Let’s take a memory profile to better understand the situation.

$ go test -bench=BenchmarkSolve -run=X -memprofile mem.prof
goos: darwin
goarch: amd64
pkg: github.com/aaugustin/sudoku/go/sudoku
BenchmarkSolve-8         284       3999008 ns/op
PASS
ok      github.com/aaugustin/sudoku/go/sudoku   1.609s

We can visualize it like a CPU profile.

$ go tool pprof -http :8000 sudoku.test mem.prof
Serving web UI on http://localhost:8000

7 - memory profile of Solve

Running the benchmark allocates 1 gigabyte of RAM, entirely consisting of 288-byte data structures, which is the size of the grid objects that hold the solver’s state.

Unsurprisingly, making all these memory allocations, and then garbage-collecting them, takes a bit of time!

Well, if we don’t want to allocate memory on the heap, either we can pre-allocate it, like we did earlier, or we can use memory on the stack.

Remove all memory allocations!

Go makes no guarantee about whether variables will live on the heap or on the stack, which is a completely reasonable position for compiler authors.

However, this sentence sounds interesting:

When possible, the Go compilers will allocate variables that are local to a function in that function’s stack frame.

What if we try our luck and rewrite the program to use only local variables? See this commit.
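The idea looks roughly like this (names are hypothetical, not the ones from the commit): make the trial copy a local variable so that no pointer to it escapes the function, leaving the compiler free to keep it in the function’s stack frame, which `go build -gcflags=-m` can confirm.

```go
package main

import "fmt"

type grid [81]uint16

// trial copies the grid into a local variable, mutates the copy, and
// returns a result. Because no pointer to the copy leaves the function,
// escape analysis can place it on trial's stack: no heap allocation,
// nothing for the garbage collector to track.
func trial(g *grid, cell int, value uint16) uint16 {
	c := *g         // local copy, a candidate for stack allocation
	c[cell] = value // mutate the copy, not the caller's grid
	return c[cell]
}

func main() {
	var g grid
	fmt.Println(trial(&g, 0, 5)) // prints 5
	fmt.Println(g[0])            // prints 0: the caller's grid is untouched
}
```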

How did the memory profile change?

$ go test -bench=BenchmarkSolve -run=X -memprofile mem.prof
goos: darwin
goarch: amd64
pkg: github.com/aaugustin/sudoku/go/sudoku
BenchmarkSolve-8         429       2916189 ns/op
PASS
ok      github.com/aaugustin/sudoku/go/sudoku   1.571s
$ go tool pprof -http :8000 sudoku.test mem.prof
Serving web UI on http://localhost:8000

8 - memory profile of Solve

Looks like it worked! Not only did we allocate 2000x less memory, but we’re also 25% faster!

What does the CPU profile look like now?

$ go test -bench=BenchmarkSolve -run=X -cpuprofile cpu.prof
goos: darwin
goarch: amd64
pkg: github.com/aaugustin/sudoku/go/sudoku
BenchmarkSolve-8         423       3024014 ns/op
PASS
ok      github.com/aaugustin/sudoku/go/sudoku   1.789s
$ go tool pprof -http :8000 sudoku.test cpu.prof
Serving web UI on http://localhost:8000

9 - CPU profile of Solve

Now we’re spending all our CPU time executing the algorithm — or copying memory with duffcopy: indeed, even though grid copies now live on the stack, we still need to copy their contents.

Of course, this last optimization is very fragile. In fact, I broke it with the next commit in the repository!

That commit restructured the program in order to support exiting search as soon as two solutions are found. This provides a large optimization for Generate, which only cares about whether search finds zero, one, or more than one solution.

I didn’t notice the regression until I wrote this blog post! Fixing it required some refactoring and made the logic simpler. Dumb code for the win.

While this was a fun experiment, I wouldn’t recommend relying on this level of optimization in a serious project — unless you find a way to test it, perhaps by running the code with profiling enabled then making assertions on the generated profile.

Would C be faster?

I tried to beat that performance by porting the program to C and performing similar optimizations. Actually, I ported only the core of the algorithm and wrapped it in a Python module to handle external interfaces.

Not only was the C version unbelievably painful to write and debug, but it provided only marginal performance gains compared to the Go version. The solver is only 4% faster in C. This shouldn’t come as a surprise to anyone reading this.

Furthermore, the generator is 40% slower in C. I have no desire to figure out why.

¯\_(ツ)_/¯

Final score

Even though I thought my original implementation was reasonably tight, with careful profiling, I could make the code faster by a large factor:

If you’d like to reproduce these results, everything is available on GitHub. Take a look at the README or jump to the Go, Python, and C implementations.

I hope you enjoyed discovering Go’s profiling tools as much as I did — they’re pretty cool!


Thanks to Philip Stevens for reviewing this post.


  1. There may be a smarter algorithm, but I’m more interested in learning Go than in solving Sudoku puzzles, so this one will do. 

