Analyzing the performance of Red Hat Enterprise Linux OpenStack Platform using Rally

In a recent blog post, we discussed the steps involved in determining the performance and scalability of a Red Hat Enterprise Linux OpenStack Platform environment. To recap, we recommended the following:

  1. Validate the underlying hardware performance using AHC
  2. Deploy Red Hat Enterprise Linux OpenStack Platform
  3. Validate the newly deployed infrastructure using Tempest
  4. Run Rally with specific scenarios that stress the control plane of the OpenStack environment
  5. Run CloudBench (cbtool) experiments that stress applications running in virtual machines within the OpenStack environment

In this post, we would like to focus on step 4: Running Rally with a specific scenario to stress the control plane of the OpenStack environment. The main objectives are:

  1. Provide a brief introduction to Rally
  2. Provide a specific scenario used within the Guidelines and Considerations for Performance and Scaling of Red Hat Enterprise Linux OpenStack Platform 6-based cloud reference architecture
  3. Demonstrate how the captured results led us to tune the HAProxy timeout value

What is Rally?

Rally is a benchmarking tool created to answer the underlying question, “How does OpenStack work at scale?” Rally answers this question by automating the processes involved in OpenStack deployment, cloud verification, benchmarking, and profiling. While Rally offers an assortment of actions to test and validate an OpenStack cloud, this blog focuses specifically on using the benchmarking tool to test one scenario against an existing Red Hat Enterprise Linux OpenStack Platform-based cloud and to generate an HTML report from the captured results.

Benchmarking with Rally

Rally runs different types of scenarios based on the information provided in a user-defined .json file. While Rally has many scenarios to choose from, we show one key scenario that focuses on testing end-user usability of the RHEL OpenStack Platform-based cloud. The scenario is called NovaServers.boot_server.

In order to create the user-defined .json file, an understanding of how to assign parameter values is required. The following example breaks down an existing .json file that runs the NovaServers.boot_server scenario.

A .json file consists of the following:

  • An opening curly bracket {, followed by the name of the Rally scenario, e.g. “NovaServers.boot_server”, followed by a colon : and an opening bracket [. The syntax is critical when creating a .json file; otherwise, the Rally task fails. Each assigned value requires a trailing comma, unless it is the final argument in a section.
  • args – consists of parameters that are assigned user-defined values. The most notable parameters include:
    • auto_assign_nic – When set to true, a random network is chosen. Otherwise, a network ID can be specified
    • flavor – The size of the guest instances to be created, e.g. “m1.small”
    • image – The name of the image file used for creating guest instances
    • quotas – Specification of quotas for the CPU cores, instances, and memory (ram). Setting a value of -1 for cores, instances, and ram allows for use of all the resources available within the RHEL OpenStack Platform 6 cloud
    • tenants – The total number of tenants to be created
    • users_per_tenant – The number of users to be created within each tenant
    • concurrency – The number of guest instances to launch in parallel
    • times – The number of iterations to perform
  • A closing bracket ] and a closing curly bracket } are required to properly close the .json file.
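
The structure described above can be sketched in Python by building the task as a dictionary and serializing it with the standard json module. This is a minimal sketch, not the task file used in the reference architecture: the image name and all parameter values are placeholders, and the grouping into args, context, and runner sections follows the layout of Rally's sample tasks as we understand it, so verify it against the Rally version you are running.

```python
import json

# Placeholder values for illustration only; adjust them for your own cloud.
# The args/context/runner grouping is an assumption based on Rally's samples.
task = {
    "NovaServers.boot_server": [
        {
            "args": {
                "auto_assign_nic": True,
                "flavor": {"name": "m1.small"},
                "image": {"name": "cirros"},  # hypothetical image name
            },
            "context": {
                # -1 means "unlimited" for each quota
                "quotas": {"nova": {"cores": -1, "instances": -1, "ram": -1}},
                "users": {"tenants": 1, "users_per_tenant": 1},
            },
            # Start small so that errors surface quickly
            "runner": {"type": "constant", "concurrency": 1, "times": 1},
        }
    ]
}


def write_task(path, task_definition):
    """Write the task to disk; json.dump emits the commas and brackets."""
    with open(path, "w") as handle:
        json.dump(task_definition, handle, indent=4)


if __name__ == "__main__":
    # Print the generated .json so it can be inspected before use.
    print(json.dumps(task, indent=4))
```

Generating the file this way sidesteps the missing-comma and unbalanced-bracket mistakes that cause a Rally task to fail. Calling write_task with a filename of your choice produces a file that can then be passed to rally task start.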

When benchmarking with Rally, the initial objective is to use small values for the times and concurrency parameters in order to diagnose any errors as quickly as possible. In a .json file, concurrency and times are static values that dictate the maximum number of guests to launch for a specified scenario. To overcome this limitation, we created the rally-wrapper.sh script (included in the reference architecture).

The script increments the values of concurrency and times by a specified amount for as long as the required success rate is met, thereby increasing the maximum number of running guests.
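
The ramp-up logic can be sketched in Python as follows. This is not the rally-wrapper.sh script itself, only an illustration of the loop it describes: the function name, its parameters, and the run_task callable are hypothetical, and the stand-in cloud in the example is invented for demonstration.

```python
def ramp_up(run_task, concurrency, times, step, required_rate, max_rounds=20):
    """Raise concurrency and times by `step` while the success rate holds.

    run_task is a hypothetical callable: it receives (concurrency, times),
    runs the benchmark, and returns the success rate as a percentage.
    Returns the last (concurrency, times) pair that met required_rate,
    or None if even the starting values failed.
    """
    last_good = None
    for _ in range(max_rounds):
        rate = run_task(concurrency, times)
        if rate < required_rate:
            break  # success rate no longer met; stop increasing the load
        last_good = (concurrency, times)
        concurrency += step
        times += step
    return last_good


if __name__ == "__main__":
    # Invented stand-in for a real Rally run: this pretend cloud
    # boots up to 30 guests cleanly and fails half of them beyond that.
    def fake_run(concurrency, times):
        return 100.0 if concurrency <= 30 else 50.0

    print(ramp_up(fake_run, concurrency=10, times=10, step=10, required_rate=100.0))
```

In practice, run_task would rewrite the concurrency and times values in the .json file, start the Rally task, and read the success rate from its results.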

Below is an example of how to use Rally in a practical situation.

Initial boot-storm Rally Results

A good first step for stressing the control plane of a RHEL OpenStack Platform-based environment using Rally is to run boot-storm tests that attempt to launch as many guests as the environment can handle simultaneously. The initial results gathered by rally-wrapper.sh showed 50 guests booting concurrently with a success rate of merely 66%, as shown in the screen capture below.

Such a low success rate called for further investigation of the boot-storm results, which revealed the following error:

The error shows that the connection was aborted, but the BadStatusLine error does not give any definitive reason why. It does suggest, however, that we must investigate what is causing incoming client connection requests to be aborted. Taking a top-down approach, this led us to investigate HAProxy. HAProxy is a load balancer that spreads incoming connection requests across multiple servers. The default HAProxy timeout value within the RHEL OpenStack Platform-based reference environment is 30 seconds. We determined that this default is too low to handle incoming Rally client connection requests, because Rally reuses client connections instead of creating new ones. Rally has a default client timeout of 180 seconds in the /etc/rally/rally.conf file, so we increased the HAProxy timeout value from 30 seconds to 180 seconds to align it with Rally’s client connection timeout value. As a result of this investigation, Red Hat Bugzilla 1199568 was filed against the low HAProxy timeout value that produces a ConnectionError.

To address this issue, we took the steps described below. On the Provisioning node, the common.pp file located within the /etc/puppet/environments/production/modules/quickstack/manifests/load_balancer/ directory was modified to change the value 'client 30s' to 'client 180s', as shown.

Once the above changes have been propagated to each Controller node, run the following puppet command on each Controller node for the changes to take effect.

After we applied the HAProxy configuration change and reran the Rally boot-storm tests, the number of guests booting concurrently increased from 50 (with a 66% success rate) to 170 (with a 100% success rate). This increased the maximum number of guests by more than 3x with no reported failures.
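
The arithmetic behind the "more than 3x" figure can be checked with a few lines of Python. The sketch below uses only the numbers reported above and assumes the 66% success rate applies to all 50 attempted guests; since that percentage is likely rounded, the count of successful guests before the change is approximate.

```python
def successful_guests(attempted, success_rate_percent):
    """Approximate number of guests that booted successfully."""
    return round(attempted * success_rate_percent / 100)


# Figures reported in this post
before_attempted, before_rate = 50, 66
after_attempted, after_rate = 170, 100

before_ok = successful_guests(before_attempted, before_rate)
after_ok = successful_guests(after_attempted, after_rate)

# Gain measured on guests attempted concurrently
attempted_gain = after_attempted / before_attempted

# Gain measured on guests that actually booted
successful_gain = after_ok / before_ok

if __name__ == "__main__":
    print(f"successful guests: {before_ok} -> {after_ok}")
    print(f"gain in attempted guests: {attempted_gain:.1f}x")
    print(f"gain in successful guests: {successful_gain:.2f}x")
```

Measured on attempted guests the gain is 3.4x; counting only guests that booted successfully, the improvement is larger still.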

Conclusion

Benchmarking tools such as Rally and its scenarios play a key role in achieving optimal performance in a given environment and are handy for troubleshooting. It is worth familiarizing yourself with the different arguments within a scenario, especially the concurrency and times values, since they control the maximum number of guests to launch. As we just demonstrated, increasing these values allowed us to identify the low HAProxy timeout value that prevented us from achieving an acceptable number of running guests. By modifying the HAProxy timeout value, we achieved more than 3x the performance of the “out-of-the-box” RHEL OpenStack Platform environment and had no failures when launching guest instances.