Announcing Go-Grid: Scaling Cloud Load Testing to Millions of Users

Our Principal Engineer, Lachie, discusses recent changes we've made to the grid that underlies your load testing on Flood

We recently released a big change to Flood that you might not notice at all. The change was massive for our position as the leading cloud load testing platform, and in fact we’re so certain that you didn’t notice that we decided to write a blog post to tell you about it.

What are Grids in Flood?

If you use Flood you’ve already used Grids, but you may not know just what “Grid” means.

In Flood terminology Grids are the groups of servers where your load tests run (the name persists from way back in the mists of time when Flood was once known as “GRIDinit”). “Grid” is also the name of the agent which runs on each Grid node to orchestrate running your test script and to route the results back to the Flood data ingestion pipeline.

How Go Grid Scales Million-User Load Tests

As you may have been able to tell from the potted history lesson, the Grid agent is one of the more venerable parts of the Flood platform. OG Grid is written in Ruby and bash and deployed as a Docker container. Recently we decided to port Grid to Go. With some luck and a lot of hard work you won't notice the difference at all! So why do it, you ask?

  • Go code is portable and easy to deploy. Go produces statically linked binaries which can simply be copied to a machine and run. Although by and large we still use Docker, we now have the option of using Go grid in scenarios where Docker isn't feasible. This opens up potential for additional hosting platforms in the future.
  • Go programs generally use fewer resources than their Ruby equivalents, especially for long-running servers. This directly supports our mission to help companies run some of the largest load tests ever executed.
  • Refactored grid code paves the way for future Flood features, including support for new projects such as Element.
  • Extracting and re-examining assumptions is important to the health of a platform -- and the team supporting it. If nothing else it increases the “bus number” of people who understand the system in depth.

Some normal wobbles notwithstanding, the rollout went swimmingly. Flood's continuous deployment philosophy makes fixing and testing bugs quick and easy. I learnt a lot and upped my Go game significantly on the journey to reverse-engineer Grid.

For the rest of the article, I thought I’d share some of the more technical things I discovered along the way.

Reverse engineering, testing in prod & feature flags

Reverse engineering is challenging. Simply understanding raw code is part of it, but it's also about extracting the assumptions from the code's mastermind. So to see how the code integrates with the system as a whole, it really needs to be tested in production.

Thankfully the Flood backend has multiple layers of version switching. Generally speaking we use release channels to control the versions of our servers, and feature flags for testing and gradual rollout of web UI features. Initially I only thought to use release channels (Ruby grid was “stable” and Go grid was “beta”). However, feature flags turned out to be useful too.

Although feature flags are generally associated with progressively enabling parts of web UIs, we ultimately extended our approach to allow fine-grained, instant switching of grid versions. Even now customers who hit unanticipated edge cases can be reverted to “grid classic” within seconds.
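To make that concrete, the per-account switch boils down to something like the sketch below. Everything here is hypothetical and for illustration only: the FlagClient interface, the "go-grid" flag name and the image names are not Flood's real internals.

```go
package main

import "fmt"

// FlagClient abstracts whatever feature-flag service backs the check.
// The interface, the "go-grid" flag name and the image names below are
// hypothetical -- illustrative only, not Flood's actual code.
type FlagClient interface {
	Enabled(flag, accountID string) bool
}

// staticFlags is a stand-in flag source for the sketch.
type staticFlags map[string]bool

func (f staticFlags) Enabled(flag, accountID string) bool { return f[flag+":"+accountID] }

// gridImage picks which grid agent to launch for an account. Turning the
// flag off reverts that account to "grid classic" on its next launch,
// with no redeploy required.
func gridImage(flags FlagClient, accountID string) string {
	if flags.Enabled("go-grid", accountID) {
		return "grid-go" // the new Go agent
	}
	return "grid-classic" // the original Ruby/bash agent
}

func main() {
	flags := staticFlags{"go-grid:acme": true}
	fmt.Println(gridImage(flags, "acme"))  // grid-go
	fmt.Println(gridImage(flags, "other")) // grid-classic
}
```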

Supervision trees with context.Context and errgroup

Grid is written as a few independent-but-linked components. It listens on a queue for control messages, it reacts to these messages to start (and stop) the load test tools, it phones home to the Flood platform with health checks, and it runs the test data ingestion worker for collecting test results. If any one of these components encounters an error, grid needs to clean up the other components and exit.

When a load test starts, grid sets up a work environment for the appropriate load testing tool, fires up the tool and then cleans up after it finishes -- but grid is a long-running server, so once the tool exits grid still needs to keep going, listening for the next message and reporting system health.

To achieve this setup I used Go's standard context.Context, as well as a variation on the less well-known “error group” construct. The combination results in something similar to Erlang's supervision trees.

Go contexts have multiple uses, but I only used them for nested cancellation. When a context is cancelled, it and all its child contexts are cancelled.
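Here's a minimal, self-contained illustration of that nesting (my own toy example, not grid code):

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// parent is the server-wide context; child covers one running load test.
	parent, cancelParent := context.WithCancel(context.Background())
	child, cancelChild := context.WithCancel(parent)
	defer cancelChild() // avoid leaking the child if we return early

	// Cancelling the parent also cancels every context derived from it...
	cancelParent()

	<-child.Done() // ...so the child is done too.
	fmt.Println("child cancelled because parent was:", child.Err())
}
```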

Error groups allow for coordinated running and cancellation of related goroutines. A number of goroutines are added to the error group; when the first one returns an error, the entire group is cancelled.

Together, contexts and error groups allow for running concurrent tasks which are also easy to clean up once they exit. For example, a crashing load testing tool will cancel its context, clean up its environment and return resources (goroutines, memory, disk space) to the system, all without affecting the higher-level parts of the server (the queue listener, etc.).
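As a sketch of the overall shape: grid's real components obviously do actual work and its error group is a variation, but the wiring looks roughly like this, shown here with the semi-official golang.org/x/sync/errgroup package.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

// runGrid wires the long-running components together. If any component
// returns an error, the group's context is cancelled and the others
// shut themselves down.
func runGrid(ctx context.Context) error {
	g, ctx := errgroup.WithContext(ctx)

	g.Go(func() error { return listenForControlMessages(ctx) })
	g.Go(func() error { return reportHealth(ctx) })
	g.Go(func() error { return ingestResults(ctx) })

	// Wait blocks until all goroutines return, and reports the first error.
	return g.Wait()
}

// The component bodies below are placeholders that just wait for cancellation.
func listenForControlMessages(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func reportHealth(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func ingestResults(ctx context.Context) error {
	// Simulate a component failing: its error cancels the whole group.
	time.Sleep(100 * time.Millisecond)
	return errors.New("ingestion pipeline unreachable")
}

func main() {
	err := runGrid(context.Background())
	fmt.Println("grid exited:", err) // grid exited: ingestion pipeline unreachable
}
```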

There are some downsides to contexts, though. The biggest is that support in the standard library isn't as complete as you'd expect from an officially recommended approach. This is particularly clear when a method call closely maps onto a syscall and there's no choice but to follow the syscall's semantics.
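For example, net.Listener.Accept maps onto accept(2) and takes no context, so you end up writing glue like the following (my own sketch, not grid's code) to make an accept loop cancellable by closing the listener when the context is done:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// acceptLoop shows the kind of glue contexts require when the underlying
// call knows nothing about them: a helper goroutine closes the listener
// on cancellation, which unblocks Accept.
func acceptLoop(ctx context.Context, ln net.Listener) error {
	go func() {
		<-ctx.Done()
		ln.Close() // unblocks the Accept below
	}()

	for {
		conn, err := ln.Accept()
		if err != nil {
			if ctx.Err() != nil {
				return ctx.Err() // shutdown was requested, not a real failure
			}
			return err
		}
		conn.Close() // a real server would hand the connection off here
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	fmt.Println("accept loop exited:", acceptLoop(ctx, ln))
}
```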

The second downside to contexts is the boilerplate they introduce, and the error-proneness that comes with it. Contexts and cancellation are all about preventing goroutine leaks, but forgetting the ever-present “wait for ctx.Done()” code can itself result in a leaked goroutine.
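The boilerplate in question is the select on ctx.Done() that every long-lived goroutine ends up carrying. A contrived sketch:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// healthCheckLoop shows the recurring boilerplate: a select with a
// ctx.Done() case. Forget that case and the goroutine outlives its
// context -- a leak. (Illustrative only, not grid's real health check.)
func healthCheckLoop(ctx context.Context, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // without this case, the loop never exits
		case <-ticker.C:
			fmt.Println("phoning home") // stand-in for the real health check
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	fmt.Println("loop exited:", healthCheckLoop(ctx, 100*time.Millisecond))
}
```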

The error group implementation I used is a small variation on the construct; there's also a semi-official implementation in the golang.org/x/sync/errgroup package.

Docker API versioning

To interact with Docker I opted to use the Docker client API as provided by the docker/moby codebase itself. This turned out to be a mistake, due to API versioning.

The original use-case for Go grid was to be as self-sufficient as possible, requiring as few extra tools as possible to be installed. That way we could drop a Go grid binary onto a fresh VM and it would just run. I'll also admit that I thought using the Docker API directly was way smarter than shelling out to the docker command.

The problem is that Go is quite static. The Docker client API is carefully versioned and only works with Docker daemons running a compatible version of the API. However, to use the Docker client API in a Go program, you have to pick a version and stick with it. And so, when Go grid is built against a certain client API version and then dropped onto a legacy VM running an older Docker, things don't work.
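A minimal sketch of the mismatch, using the current Docker Go client package. The pinned version string is just an example, and grid's real code differs; the point is that the version is baked in at build time.

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/client"
)

func main() {
	// The API version is fixed when the binary is built ("1.38" is just an
	// example pin). A daemon that only speaks an older API will reject calls
	// made against a newer one.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithVersion("1.38"))
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ping, err := cli.Ping(context.Background())
	if err != nil {
		fmt.Println("daemon rejected us:", err) // e.g. a "client version is too new" error
		return
	}
	fmt.Println("daemon speaks API", ping.APIVersion)
}
```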

Ultimately I think I was humbled by the simple wisdom of the shell script: the most stable API in all of Docker is the command line interface. The underlying reason is that docker-the-daemon and docker-the-cli have always been tightly coupled. Originally they were the same all-in-one binary, and even nowadays they're almost always installed onto a box together. So since Go grid is always controlling the local Docker daemon, 99% of the time I can assume that the CLI on a box speaks the same API version as the daemon on that box.

For now, we're being a bit more careful about the Docker versions we run on our grid VMs, but eventually I'll switch Go grid over to the way less clever but much more robust -- and actually working -- approach of shelling out to the CLI.
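The shelling-out approach is about as unglamorous as it sounds. A rough sketch, where the image name, flags and helper function are purely illustrative rather than grid's real invocation:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runContainer drives the local docker CLI directly and lets it worry
// about matching the daemon's API version.
func runContainer(ctx context.Context, image string, args ...string) ([]byte, error) {
	cmdArgs := append([]string{"run", "--rm", image}, args...)
	cmd := exec.CommandContext(ctx, "docker", cmdArgs...)
	return cmd.CombinedOutput() // cancelling the context kills the CLI process too
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	out, err := runContainer(ctx, "hello-world")
	fmt.Println(string(out), err)
}
```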

Conclusion

So there you have it, an exciting journey to a key feature in our mission to scale cloud load testing.

Reverse engineering is a challenge best overcome by testing in production using fine-grained version controls.

Go’s as good as they say for writing servers. The language is maturing and accruing quirks as you’d expect, but is still a good default choice due to its performance and deployability. Contexts are really handy despite their downsides, and paired with error groups they make writing robust long-running servers easy.

And sometimes the cleverest approach is to take the simplest one. :)

Start load testing now

It only takes 30 seconds to create an account and get access to our free tier, so you can begin load testing without any risk.
