In his recent webinar, Continuous Load Testing, Tim Koopmans tackled the topic of load testing for the DevOps environment, and where load testing fits into the increasingly prevalent approach to QA: continuous testing.
You can watch the Continuous Load Testing webinar here, or read below for the full transcript.
Continuous load testing. What does it mean? Where is it going? And why should we care? My name is Tim Koopmans, and I have the privilege of living in a country with some of the worst internet speeds in the world. As Director of Load Testing at Tricentis, this gives me a great sandbox to experience and work with performance issues of every type, especially those which are web-based.
I also like competitive player-vs-player games online, so living in Australia, I suffer from high ping, packet loss, and ultimately dropped connections. You could say that performance issues are near and dear to my heart both professionally and personally. In this presentation, I’ll be discussing what it actually means to do continuous testing, and I’ll explore where load testing fits in that context.
I’ll examine why we actually care about load testing and what it is we’re typically trying to measure. I’ll finish off the presentation discussing what we need or want in a load testing platform, and how Tricentis provides that with Tosca and Flood. Firstly, let’s get some definitions out of the way.
Load testing can simply be defined as putting demand on a system and measuring it for performance.
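As a rough illustration of that definition, here is a minimal Python sketch that puts demand on a system and measures it. The `fake_request` function is a hypothetical stand-in for a real call to the system under test, and the user and request counts are arbitrary assumptions rather than anything Flood or Tosca prescribes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request():
    """Hypothetical stand-in for a call to the system under test."""
    time.sleep(0.01)  # pretend the system takes about 10 ms to respond


def timed_call(func):
    """Run one request and return its response time in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def run_load(func, users, requests_per_user):
    """Put demand on `func` with concurrent workers and collect response times."""
    total = users * requests_per_user
    with ThreadPoolExecutor(max_workers=users) as pool:
        return list(pool.map(lambda _: timed_call(func), range(total)))


if __name__ == "__main__":
    timings = run_load(fake_request, users=5, requests_per_user=4)
    print(f"requests: {len(timings)}")
    print(f"mean response time: {sum(timings) / len(timings):.4f}s")
    print(f"max response time: {max(timings):.4f}s")
```

A real load test would swap `fake_request` for an actual browser or protocol-level interaction, but the shape stays the same: generate demand, then measure.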
The word continue means to persist, carry on, or resume. Now, many make the mistake of thinking that this is all it means to do continuous load testing: we persist with a test effort or we maintain status quo or we resume from the many interruptions to our load test effort, when in fact, by ditching the E and adding an O-U-S, it takes on a new meaning. And the theme of my presentation today is to explore load testing in the context of an uninterrupted, unbroken, and progressive approach.
But enough of the dictionary definitions. I want to explore why we care about continuous testing and what that means to load testing. If you Google the term “continuous testing,” you’ll find Tricentis at the top. That’s not to say we pioneered the term, but we care deeply about continuous testing. We care about how it fits in a continuous integration and continuous delivery lifecycle, we care about its impact on development operations processes, and it is indeed the cornerstone of much of our product thinking and approach to test automation.
If you visit our site, you’ll find this infinity logo, which on the left-hand side shows activities related to continuous integration (CI) and on the right-hand side activities typically associated with continuous delivery (CD) or deployment. And in the middle, continuous testing seems to link those two sides together. But given that testing is a nexus between development and operations, it’s fair to say that the quicker you can complete the testing, the quicker you can move between the two spheres of development and operations.
CI and CD are just common acronyms that people use when they talk about modern DevOps practices, and testing is really common to both. It’s a mistake, though, to think that testing is simply a one-off activity or a single bridge between CI and CD.
Testing really enables us to make decisions, decisions which are crucial to a feedback loop. And continuous testing means more decisions and more feedback.
Now, CI encourages tight feedback loops that help validate developers’ changes by creating builds and running automated tests against them. This form of early feedback is sometimes referred to as shift left, or as I think of it, failing early. The emphasis is on test automation, and it’s used to expose the brokenness of a build so that it can be fixed before making it into production.
CD, on the other hand, is about delivery and deployment, and this is more in line with a shift right approach. It’s a business-facing practice, making sure that all of the builds coming out of development can make their way into production and hopefully fail forwards and not backwards. CD is great at accelerating the feedback loops, for example, with your customers.
And continuous testing is a process of executing automated tests as part of the whole delivery pipeline. It ties together feedback from both sides of the development operations fence, and it helps us learn about business risks associated with software that is released. It’s progressive in nature, and it helps form that unbroken whole which we’re chasing.
If you tried to explain continuous testing in pure math, it would be expressed as a ratio between the number of decisions and time – effectively a rate which you could measure. You can therefore ask yourself, “Is your continuous testing process expressed as a rate of, say, decisions per year, or is it decisions per quarter?” How often do you end up in production with a poor outcome?
How many decisions did your DevOps process allow you to make before landing in production with sub-standard software? Hopefully, it can be measured in terms of decisions per week and decisions per day. The more decisions you can make, the faster you can work your way through those decisions and their possible consequences and the more likely it is that you’ll have software that your customers love.
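To make the ratio concrete, the following sketch expresses continuous testing as a rate of decisions over time. The function name `decision_rate` and the two example teams are hypothetical, and the numbers are made up purely for illustration.

```python
def decision_rate(decisions, days):
    """Return decisions per day for a given observation window."""
    if days <= 0:
        raise ValueError("days must be positive")
    return decisions / days


if __name__ == "__main__":
    # Hypothetical teams observed over one year; the counts are invented.
    quarterly_team = decision_rate(decisions=4, days=365)
    daily_team = decision_rate(decisions=250, days=365)
    print(f"quarterly team: {quarterly_team:.3f} decisions/day")
    print(f"daily team: {daily_team:.3f} decisions/day")
```

The absolute numbers matter less than the trend: a higher rate means more chances to catch a poor outcome before it lands in production.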
This type of decision making is exactly what we aim to do at Flood with our own product development. Now, load testing has a part to play very early in the life cycle. Before we even code or build a product, we should be thinking about and planning for performance requirements. On the flip side, performance is very much part of the conversation in response to operations and monitoring activities.
Perhaps we’ve triggered a threshold alert or observed abnormal behavior in production which we want to avoid again in the future. This is the shift right approach to load testing, originating from and perhaps targeting the production environment only. Back to the left again, as we start to code and build, we have a strong interest in performance at the commit level, so we can know as early as possible what commit contributed to a change in performance.
We can also kick off load testing much earlier by focusing on smaller pieces of the system perhaps not integrated yet, but which can still be tested at an API level. Or we can drift to the right again and load test builds that are coming out of development and entering into our delivery pipeline, utilizing the fully integrated stack in either production or pre-production environments.
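One way to picture performance at the commit level is a simple check that compares response times measured for a new commit against a known baseline. This is only a sketch under stated assumptions: the function name `has_regressed`, the 10% tolerance, and the sample timings are all hypothetical, and a real pipeline would need more careful statistics.

```python
from statistics import median


def has_regressed(baseline_ms, candidate_ms, tolerance=0.10):
    """Flag a commit whose median response time exceeds the
    baseline median by more than `tolerance` (a fraction)."""
    return median(candidate_ms) > median(baseline_ms) * (1 + tolerance)


if __name__ == "__main__":
    baseline = [100, 110, 105]    # hypothetical timings from the last good build
    slow_commit = [130, 128, 135] # hypothetical timings from a slower commit
    fine_commit = [108, 104, 110] # hypothetical timings within tolerance
    print("slow commit regressed:", has_regressed(baseline, slow_commit))
    print("fine commit regressed:", has_regressed(baseline, fine_commit))
```

Run on every commit, a check like this points at the change that moved the numbers, rather than leaving the question for a later, system-level test.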
“The point I’m trying to make is that there’s no single place to define and execute load tests from in a modern DevOps process.”
So at Flood, instead of trying to position load testing as something that shifts left or right early or late, I like to imagine continuous load testing as something that gets shifted always in a clockwise direction. We want to screw the nut in; not reverse it. We want to fail forwards and not backwards.
Yesterday’s risk in development is today’s issue in production, so by testing earlier in the CI/CD lifecycle, we can decouple ourselves from a system view of performance and concentrate on, say, component views of performance in a highly repeatable and perhaps even easier way to test. It helps us identify risk, but maybe not solve it. It also helps us identify defects closer to a commit level or unit level of code.
In testing much later in the lifecycle, we become tightly coupled to the system view of performance, and we can start to tackle other production-sized problems around availability or scalability or reliability in more production-like environments. The repeatability of tests can be compromised to a certain extent. It’s harder to wind back the nut in terms of configuration and data, but we gain valuable insight into production behavior, and we can help validate risk mitigation strategies that we’ve identified earlier in the cycle.
So continuous load testing is really more about applying the spanner in that clockwise direction and screwing the nut down, winding down the issues. Sure, you may need to go anticlockwise, and no doubt the thread of the bolt just keeps growing, but I guess that’s another point. Load testing, by its association with risk management, is really hard to place into your traditional definitions of “done”.
On the topic of risk, load testing is essentially a risk management activity. It helps us identify, learn, and control risk to production applications. Application risks generally transpose to much larger business risks, be it financial or simply face value of the company. This is especially true for performance-related risks, which often have a wide surface area in terms of impact.
There are many examples where large software failures have led to severe business repercussions. We just need to look at the existing Flood customer base for examples. Why do we care so much about load testing? We know the reasons around application and business risk, but what is it that we’re actually trying to measure as part of the definition? Typically, there are four areas or dimensions that we want to measure.
I refer to them loosely as performance, availability, reliability, and scalability. Interestingly, even a novice load tester who has zero background in load testing will still have a gut feel for the things we’re trying to measure through the course of load testing. Let’s take a look at the first area of performance. Response time. How long does it take to load a page or deliver content to your users?
I think it’s fair to say all of us have experienced page load performance. Looking at an overseas train timetable website from Austria – it takes around 20 seconds to load. We can just revel in the sheer joy of waiting for this page to load. We definitely care about this dimension of performance, but we might also care about other dimensions for the same domain:
- How many concurrent users are looking at the page?
- How many times per minute is this page viewed?
- What is the weight in terms of number of requests made and the size of each request?
- Where are the page components being served from?
- How well does that browser cache page items?
Performance is really a loose umbrella term for a whole range of application performance metrics that we can measure.
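A few of these metrics can be derived from the same set of raw samples. The sketch below assumes each sample is a hypothetical `(response_ms, bytes)` pair collected over a known time window; the names `percentile` and `summarize` are illustrative and not part of any particular tool.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples, window_seconds):
    """Summarize (response_ms, bytes) samples collected over `window_seconds`."""
    times = [s[0] for s in samples]
    sizes = [s[1] for s in samples]
    return {
        "views_per_minute": len(samples) * 60 / window_seconds,
        "median_ms": percentile(times, 50),
        "p95_ms": percentile(times, 95),
        "total_bytes": sum(sizes),
    }


if __name__ == "__main__":
    # Hypothetical page views recorded over a two-minute window.
    samples = [(100, 1000), (200, 2000), (300, 3000), (400, 4000)]
    for name, value in summarize(samples, window_seconds=120).items():
        print(f"{name}: {value}")
```

Each line of output answers a different question about the same page, which is why "performance" ends up being an umbrella term rather than a single number.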
And what about something more tangible and broken like the infamous HTTP 503? If we’re lucky, we get something that looks maybe a little prettier like the 503 Unicorn for GitHub with corresponding links to status pages. It’s important to note that 503s do happen, so it can help public perception if you move away from boilerplate messages from the web server.
And if you’re unlucky, you might get a blank page from, say, the Elastic Load Balancer, that looks like this. This is really just a nice way of letting your customers know that no, you have not tested for high availability scenarios in your load test effort. Availability issues are perhaps the worst dimension to experience as a customer, and they should feature really high in your load test effort.
Reliability is a different story. Does your application return the expected response? Maybe you uncover generic 5xx ranges of error codes, or maybe you have HTTP 200s, but they’re simply showing the wrong data. It’s worse if you start showing data to a customer that they shouldn’t see. And even worse if you show customers other customers’ data. The list of reliability issues is really long and exhausting and can be difficult to uncover in, say, a scripted scenario.
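The distinction between an error code and a successful response with the wrong data can be sketched as a small classifier. Everything here is an assumption for illustration: the `classify` function, the category names, and the idea of checking for an expected piece of text in the response body.

```python
from collections import Counter


def classify(status, body, expected_text):
    """Classify one response by status code and content."""
    if 500 <= status <= 599:
        return "server_error"
    if status == 200:
        # A 200 is only reliable if it also carries the data we expect.
        return "ok" if expected_text in body else "wrong_content"
    return "other_error"


if __name__ == "__main__":
    # Hypothetical responses captured during a load test for user "alice".
    responses = [
        (200, "Welcome back, alice"),
        (200, "Welcome back, bob"),   # succeeded, but shows another user's data
        (503, "Service Unavailable"),
    ]
    counts = Counter(classify(s, b, "alice") for s, b in responses)
    print(dict(counts))
```

Counting only status codes would report two successes here; checking content as well shows that one of them is a reliability problem.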
Exploratory load testing, which I guess is a topic for another day, is a great way to delve into reliability issues under load. Last, but not least, scalability issues. Everyone knows what it’s like waiting in queues. Long wait times, reduced services, incomplete transactions, they all have the knack of driving up customer rage and frustration, and you’ll surely feel the effects of this on social media.
Having infrastructure which is correctly sized for the workload helps deliver operations for a known cost, and having infrastructure which can scale out or up on demand in response to increased workloads, can help you survive unexpected events. The reverse is true as well – being able to tear down infrastructure can lead to better cost savings overall. All of these dimensions of scalability should really be part of your load testing.
Unfortunately, when we get into load testing, the areas of interest are often represented like the chart on the left: a whole lot of focus around more readily measured performance metrics like concurrency and response time and transaction rate, with not a lot of thought or effort going into the other dimensions. The more pragmatic view might be the chart on the right with a strong emphasis on performance and availability and effort-dependent coverage of reliability and scalability scenarios.
Thinking about these four areas of concern is a nice, qualitative way of thinking about the desired outcomes from your load test effort and, to a certain extent, a measure of the maturity of your load test effort. Now we know what load testing is and what measurements or dimensions we care about. And if it’s that important, you could ask, “Well why doesn’t everybody load test?”
Working with Flood these past years has given me the unique opportunity of listening to thousands of customers who have made the jump and got started with load testing on our platform. We’ve also had the luxury of talking to customers who have attempted to start load testing but didn’t get as far as we’d like on our own platform. The answers seem pretty obvious, and there are three major trains of thought.
“We want to load test, but we don’t know how to get started.”
Now, this is at odds with our continuous testing approach, as load testing is interrupted before it even starts. The big blocker here, of course, is not knowing how to create what we call a load test script. This has been a long-term issue for us, and I’ll talk in a bit more detail shortly about the complexities and issues around creating load tests themselves and what we’re doing to circumvent these issues.
The other impediment to load testing is that customers simply don’t have enough time. Once again, this is at odds with a continuous approach as it’s pretty hard for load testing to be progressive and part of an unbroken whole if there’s not enough time to load test to begin with. A lot of this is based on factors which are external to the load test, but which the load test depends on.
Things like a suitable environment to test in, a suitable build to test against, and of course having the data and the scripts in place to be able to test can often mean, in a traditional test approach, that you simply run out of time. So with Flood, we aim to remove any time-blockers from your load testing, making sure that you can provision assets just in time and aren’t locked behind clumsy licensing or resource constraints.
And with Tosca, we’re really excited about some of the complementary technologies that we can take advantage of as a platform, such as orchestrated service virtualization and test data management to take out any of the environment or data-blockers to your load testing.
The other answer often sounds like this:
“We don’t do any load testing because it’s simply too complicated. It’s too expensive or too hard.”
These really fall into the realm of “How accessible is load testing to your company?” What do we need or want in a load test platform? Given the things that we want to measure and the holistic goal of making a load test platform for everyone, I like to frame the requirements in three broad functional areas: the ability to create or define load-test scenarios, a way to execute those scenarios at scale, and an intuitive way to analyze and interpret results.
At Flood, you’ve always been able to create load tests, which we call Floods, using any of your favorite open source tools, including JMeter, Gatling, or Selenium, and this has been a great choice for many of our customers wanting to adopt open source. And provided you have the bandwidth to learn and write code, these tools provide an open and generally extensible way of writing load tests.
The barrier to entry, however, is still high. As I mentioned earlier, many of our potential customers can’t proceed past go in that they want to load test, but they have no idea how to get started in terms of creating the load test itself. This is where Tricentis Tosca plays an important part in load-test creation, specifically its model-based approach. This begs the question then, what does Tosca’s model-based test automation approach really provide us?
If you could imagine a Lego set, where you would have to 3D print the Lego pieces themselves before beginning to build your model, this is kind of what classic test automation means to me. Before you even get down to the business of automating things, you need to build the framework first and then the test data and the code and the models and the keywords and all of the data attributes and so on.
Tricentis Tosca helps you define your test cases by first scanning the application. And the automation model consists of the automation logic, which is decoupled from the test logic, which is specified as a test case. Once the required models are defined, they can be used to execute both manual and automated test cases with input and verification data. These models are dynamic in the sense that they’re synchronized with the application under test and they can be updated to reflect any changes in the applications.
That eliminates a lot of the maintenance challenges around the traditional approach of rolling your own framework. What that means in layman’s terms is that I can dive straight into the practice of building reusable test automation. There’s no need to build the building blocks themselves. They’re already there to assemble into a model that represents your business logic.
So what does reusable test automation really give me as a functional tester? It gives me one platform with one test-case definition and of course the ability to simulate a single user. And what does it give to me as a load tester? It gives me the same platform, the same test cases, and now I have the ability to simulate hundreds or thousands of users at scale. What that means for Tosca customers is that we’ll be able to support execution of browser-based test cases which are defined in Tosca and then executed at scale on Flood.
Now, Browser Level Users are great for simulating real user behavior in the browser without a lot of the complexity at the protocol level. This is a great choice for customers that already have a functional test automation background, all the assets, and we’re achieving around 50 browsers per node with this approach. It’s also a great approach for solving the ever-increasing complexity of simulating load for modern web applications.
We use SAP Fiori as our benchmark of complexity, and not needing to simulate load at the protocol level for this means we get to trade in a lot of the complexity for concurrency. We also support API-based test cases in the same fashion, which are executed at scale on Flood. This time users are simulated at the protocol level. This is a great choice for really high-concurrency or high-volume requirements.
It’s especially suited to testing well-defined APIs or other web-based applications at the HTTP protocol level. As a planning figure, you can achieve around 1,000 users per node with this approach. At the end of the day, the complexity of the target application and to some extent the target numbers required will define what tool you use. We’ll be able to comfortably support browsers up around the 50,000 mark and Protocol Level Users numbering the millions.
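Using the planning figures quoted above (around 50 browser-level users or around 1,000 protocol-level users per node), a back-of-the-envelope node count can be sketched as follows. The function name `nodes_needed` is hypothetical, and the per-node figures are rough planning numbers rather than guarantees.

```python
BROWSER_USERS_PER_NODE = 50      # planning figure for browser-level users
PROTOCOL_USERS_PER_NODE = 1000   # planning figure for protocol-level users


def nodes_needed(target_users, users_per_node):
    """Round up to the number of whole nodes needed for `target_users`."""
    if target_users < 0 or users_per_node <= 0:
        raise ValueError("target_users must be >= 0 and users_per_node > 0")
    return (target_users + users_per_node - 1) // users_per_node


if __name__ == "__main__":
    print("50,000 browsers:",
          nodes_needed(50_000, BROWSER_USERS_PER_NODE), "nodes")
    print("1,000,000 protocol users:",
          nodes_needed(1_000_000, PROTOCOL_USERS_PER_NODE), "nodes")
    print("120 browsers:",
          nodes_needed(120, BROWSER_USERS_PER_NODE), "nodes")
```

The same arithmetic shows the trade-off described here: for a given number of nodes, protocol-level simulation buys roughly twenty times the concurrency of browser-level simulation.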
All of this is taking advantage of Tosca’s excellent ability to create, maintain, and reuse test automation assets. In fact, with Tosca, you don’t really need to worry about the underlying configuration required to launch load tests. You can simply define load-test scenarios from Tosca Commander with the same test steps and test cases that you’ve already been using for functional testing.
So Tosca is not going to make you a performance test engineer overnight, but it is going to remove the barriers to getting started with load testing itself. That ultimately means you can spend less time coding and more time measuring your application’s performance, availability, reliability, and scalability. Now, you might ask, “How do we actually do all of that on the Flood side of things?”
We consider ourselves masters of running distributed systems at scale, and we’ve helped over 5,000 companies simulate millions of users under load. We really do all of that by leveraging the cloud. If you had asked me seven years ago to launch, say, 50,000 browsers on a distributed infrastructure, I would’ve felt sorry for the penguins at the South Pole, for fear of punching an even greater hole in the ozone layer with all the machines it would’ve required to run such a load.
These days, however, thanks to advances in technology around headless browsers and particularly thanks to the Google Chrome project, we’re able to achieve much higher levels of concurrency per node. And thanks to our distributed, cluster-less design, we can comfortably run hundreds or thousands of load generators at scale across the world. And also thanks to a cloud-first design, all of our infrastructure is provisioned in near real time without having to wait for resources to start and without the waste of expensive infrastructure just sitting idle when you’re not testing.
To that end, cloud gives us this unprecedented economy of scale, and along with that comes a test-friendly pricing structure that doesn’t gouge you per virtual user or per test like in the past. You simply pay for the infrastructure that you use. This means that your continuous load testing is largely uninterrupted. We currently support 15 geographic regions, which are linked to Amazon Web Services. And later this year we’ll be supporting up to 36 additional regions through Microsoft Azure.
We’ll also add another 14 regions from Google Cloud, and we’ll be working with select customers on supporting a hybrid cloud model for those customers who have applications sitting behind a corporate firewall. As I round out this presentation, I want to touch briefly on the area of analysis and how I see that fitting into the continuous load test picture. Let me use this quote from Edward Tufte because in many ways it sets the scene for how we approach visual analysis of results in Flood.
He says that “Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.” With that quote in mind, I like to use train maps as an example of what I consider graphical excellence. No matter where I travel in the world, a train map is instantly recognizable in its form. Strong use of color, lots of white space, and from the map itself I can infer distance, start and end points, directions, and stops along the way. I can see how lines interact with each other and what stations intersect different lines.
If I’m in New York, London, Melbourne, or Tokyo – regardless of country – the information is the same. It’s pretty much a global standard for trains. The same can be applied to load test results. As we start to measure things, charts become a natural way to visualize what we measure. Consistent use of color, white space, and once again, lines, can tell us a story about system behavior. From these lines, we can deduce things like bottlenecks and points of interest or cause for further investigation.
It becomes a very powerful tool. This is why in Flood we really focus on a single chart, as we feel this is the best way to convey information about the system under test. From there, of course, we can drill down and obtain more information about specific transactions, but what really ties this all together is the charts themselves. As we look towards a more continuous view of performance, we need to decouple ourselves from just a single test view of results.
That’s not to say that a single test doesn’t have merit in terms of metrics at that point in time, but what you can expect from Flood, as we move the product forward, are more features that let us view and consume more data from different data sources, visualizations which help us identify or detect bottlenecks, and features that let us compare multiple tests or analyze performance over time.
In conclusion (and referring back to our theme for this presentation of continuous being something that forms an unbroken whole or without interruption), I covered how load testing fits inside the CI/CD lifecycle and that it’s a mistake to think it shifts in just one direction. I spoke to the importance of feedback loops both in development and also in operations and how that impacts customers. We touched on risk management and load testing’s central role in identifying and mitigating application and business risks.
We also spoke about areas of measurement, what we care most about in terms of performance, availability, reliability, and scalability. Finally, I covered at a really high level what the platform requirements are in terms of load-test creation, execution, and analysis, and in particular the marriage of Tosca’s model-based test automation with Flood’s execution at scale. Thanks very much for your time.