How to Keep Your Cloud Tests Running Smoothly with Flood’s Monitoring

Node resources aren't infinite, and it's possible to exhaust them during a load test. Here are some suggestions for avoiding that.

Good news: you’ve written an excellent load test in a tool like JMeter, Gatling, or even Flood Element, and you’re ready to run it as a cloud load test. You may already know that JMeter and Gatling can typically drive 1,000 threads or virtual users on a single 4 CPU/16 GB RAM node. While this number is a good rule of thumb, every test runs with slightly different efficiency. How do you know if you have the right number of threads per node for your specific test?

What Are the Key Resources that Drive Node Health?

A healthy node keeps its resource consumption at a level safely below its limits while it runs the test. An unhealthy node is typically running out of one or more of the following primary resources:

  • Memory (RAM)
  • CPU
  • Network Bandwidth

Why Do My Nodes Run Out of Resources?

Nodes run out of resources when the cloud load testing scenario consumes more than the typical JMeter or Gatling benchmark assumes. This usually happens when the scenario involves more intense processing, such as video streaming or post-processing of data, which consumes more resources than a simpler scenario would.

Predicting whether your cloud load injector will run out of memory can be more art than science. Still, you can estimate the number of threads to use per node so that you stay within the resource utilization boundary. This boundary can be roughly determined by observing the resources (RAM and CPU) a single thread needs and extrapolating that figure to the total resources available on a node, as in the sketch below.
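
As a rough illustration of that extrapolation, here is a minimal TypeScript sketch. The per-thread figures are assumptions you would measure yourself with a short trial run; they are not Flood, JMeter, or Gatling defaults.

```typescript
// Rough threads-per-node estimate from measured per-thread cost.
// The per-thread numbers are illustrative assumptions -- measure them
// with a short, small-scale trial run of your own script.
const node = { vCpus: 4, memoryMb: 16 * 1024 } // m5.xlarge-class node
const perThread = { cpuFraction: 0.003, memoryMb: 12 } // assumed cost of one thread

const safetyMargin = 0.8 // keep utilization under roughly 80%

const maxByCpu = Math.floor((node.vCpus * safetyMargin) / perThread.cpuFraction)
const maxByMemory = Math.floor((node.memoryMb * safetyMargin) / perThread.memoryMb)

// The tighter of the two limits decides how many threads one node can hold.
const threadsPerNode = Math.min(maxByCpu, maxByMemory)
console.log({ maxByCpu, maxByMemory, threadsPerNode }) // roughly 1,000 with these assumptions
```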

For Flood’s on-demand product, we use AWS and spin up m5.xlarge instances in most of our available regions; where that instance type isn’t supported, we spin up m4.xlarge instead. This kind of instance has 4 vCPUs, 16 GB of memory, and up to 10 Gbps of network bandwidth, running Debian Linux.

How Do I Monitor the Health of My Grid Nodes?

CPU and memory utilization can easily be monitored through the Grids option in Flood’s toolbar. Once in the Grids area of the dashboard, select your grid, and detailed statistics for its servers will be displayed in the right-hand pane. These values are a calculated average (assuming your grid has more than one node). The green line reflects user consumption and the blue line reflects system usage. We recommend configuring your test so that CPU and memory never exceed 80% consumption; going beyond that can compromise or otherwise skew your test results.
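
If you jot down the dashboard values during a run, a small check like the sketch below mirrors that 80% guideline. The sample figures are hypothetical; the real numbers come from the dashboard described above.

```typescript
// Flag a grid whose averaged CPU or memory utilization breaches the 80% guideline.
interface NodeSample { cpu: number; memory: number } // percentages, 0-100

const THRESHOLD = 80

function gridIsHealthy(samples: NodeSample[]): boolean {
  // Mirror the dashboard: average across all nodes in the grid.
  const avgCpu = samples.reduce((sum, s) => sum + s.cpu, 0) / samples.length
  const avgMemory = samples.reduce((sum, s) => sum + s.memory, 0) / samples.length
  return avgCpu < THRESHOLD && avgMemory < THRESHOLD
}

console.log(gridIsHealthy([{ cpu: 72, memory: 65 }, { cpu: 91, memory: 70 }])) // false: average CPU is 81.5%
```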

What Should I Do if My Node Runs Out of Resources?

If you are reaching 100% of CPU or memory, don’t worry. Here’s what we recommend you do to keep your cloud load test on track:

  1. Stop your test using the “Stop” button in the Flood analysis view. This stops all nodes and the Flood gracefully. We don’t recommend stopping the grids themselves from the Grids area.
  2. In the drop-down for your test, select “Start more like this” to populate a stream with the same settings as the test you just stopped.
  3. In the stream view, decrease the number of threads to make sure there are adequate resources to run your test (see the sketch below for a rough way to estimate the new count). Another option worth investigating is disabling listeners or debuggers and/or improving your script with a more efficient design. We only recommend modifying timers as a last resort, since the slower request rate means you end up generating a lower true load.
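
As referenced in step 3, one rough way to pick the new thread count is to scale the current count by how far the busiest resource overshot the 80% target. The sketch below assumes utilization grows roughly linearly with the number of threads, which is a simplification.

```typescript
// Scale the thread count down so the busiest resource lands near the 80% target.
// Assumes utilization grows roughly linearly with threads -- a simplification.
function suggestedThreads(currentThreads: number, peakCpuPct: number, peakMemoryPct: number): number {
  const target = 80
  const worst = Math.max(peakCpuPct, peakMemoryPct)
  return Math.floor(currentThreads * (target / worst))
}

// Example: 1,000 threads drove CPU to 100%, so try roughly 800 on the next run.
console.log(suggestedThreads(1000, 100, 74)) // 800
```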

Monitoring bandwidth isn't as easy as monitoring CPU or memory, but it can be done with the right knowledge. The process is twofold:

First, we need to monitor the network throughput counter on the Flood dashboard and do a relatively simple calculation. If a single node delivers up to 10 Gbps and 200 threads are already consuming 6 Gbps, we know we will fail to ramp up to 1,000 users on a single node, because at that rate the full 1,000 users would require about 30 Gbps. As an example, the images below were taken while ramping up a test of an eCommerce site with 1,000 threads.
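
The arithmetic behind that conclusion is simple enough to sketch. The figures below are the ones from the example above; reducing per-thread throughput with timers, as discussed next, would change the outcome.

```typescript
// Extrapolate bandwidth needs from a reading taken partway through ramp-up.
const nodeBandwidthGbps = 10      // roughly what a single node can deliver
const observedThreads = 200       // threads running when the reading was taken
const observedThroughputGbps = 6  // throughput at that point
const targetThreads = 1000

const perThreadGbps = observedThroughputGbps / observedThreads   // 0.03 Gbps per thread
const projectedGbps = perThreadGbps * targetThreads              // about 30 Gbps at full load
const nodesNeeded = Math.ceil(projectedGbps / nodeBandwidthGbps) // at least 3 nodes at this rate

console.log({ projectedGbps, nodesNeeded })
```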

Second, we need to follow a similar stop-and-restart process to resolve the issue. To make sure we stay within the available bandwidth, we have several options, such as:

  • Increasing the number of nodes.
  • Changing constant timers to random timers.
  • Increasing the timers' values to reduce throughput.
  • Limiting the number of parallel downloads for embedded resources.
  • Enabling caching and limiting the number of cached elements.
  • Adding pacing time to space out the requests (see the sketch after this list).
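
For scripts written in Flood Element, a few of these options map onto the test settings, as mentioned in the last item above. The sketch below is a minimal example with illustrative delay values and a placeholder URL, not a drop-in fix for any particular test.

```typescript
import { step, TestSettings, Browser } from '@flood/element'

// Pacing via Flood Element's built-in delays. The values are illustrative
// assumptions -- tune them against your own throughput target.
export const settings: TestSettings = {
  actionDelay: 2.5, // seconds to pause between actions within a step
  stepDelay: 7.5,   // seconds to pause between steps
}

export default () => {
  step('Browse catalogue', async (browser: Browser) => {
    await browser.visit('https://example.com/catalogue') // placeholder URL

    // A randomised think time (in seconds) spreads requests out further,
    // much like swapping a constant timer for a random timer in JMeter.
    await browser.wait(2 + Math.random() * 3)
  })
}
```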

To get this scenario to run smoothly, we advised the customer to split the load across two nodes and adjust the timers slightly to reduce throughput. As a result, the Flood charts changed drastically: the oscillation decreased and the counters flattened out.

In contrast with the previous images, this one was taken at steady state; over 15 minutes we confirmed that throughput never exceeded 20 Gbps, thanks to the new configuration.

Putting It All Together

Once you have created your script, you can upload it to Flood, execute it with hundreds or thousands of concurrent users, and try your hand at monitoring your load injectors. The free trial provides enough node hours to run the test outlined here for 1 hour with roughly 5,000 concurrent users.

If you are load testing using Flood, we’d love to hear from you. Drop us a note and share any ideas that are working for you and feel free to ask our team any of your tough questions!

Start load testing now

It only takes 30 seconds to create an account and get access to our free tier, so you can begin load testing without any risk.
