Evaluating Performance for Cloud Data Warehousing, Part 1 - Design

Many decision makers who are responsible for the quality, security, and accessibility of business data today are faced with the following question: is now the time to invest in a cloud data warehouse? In this 5-part “how to” blog series, I’ll explore that question by drilling down into performance test design, execution, and evaluation of a cloud data warehouse and why it's a viable solution for organizations looking for a more cost-effective option for storing data sources.

Cloud data warehousing is essentially a Data Warehouse as a Service (DWaaS) approach that helps to reduce time-consuming management and tuning activities that are traditionally found in on-prem data warehouses. The choice whether or not to make the switch or leverage one now can become easier by understanding its benefits and starting with the design of a performance test.

Part 1 (of 5): How to design a performance test

Benchmarking is at the heart of any good performance test design. When determining how to benchmark a cloud data warehouse, many of the same fundamental considerations should be made for the cloud as they are for on-prem:

  • Single-stream query performance: how fast can you get an answer from an otherwise quiet system?

  • Query performance under concurrency: how many reports or ad-hoc queries can get done concurrently in a set amount of time?

  • Realistic workloads: do a variety of query styles and query predicates meet expectation?

  • Realistic database sizes and compute resources: will you need to benchmark all of your environments?

There are also important dimensions to benchmark design that are unique to the cloud though so you have to take it past the fundamentals. For example:

  • Infrastructure: with the cloud you can test large-scale infrastructure with no gap between your planning and execution phases. This lets you test multiple scenarios and alternative platforms in a shorter time frame since you do not have to wait for hardware, sourcing, or provisioning within your organization. These things are available almost instantly in the cloud so your scope of evaluation will likely be larger.

  • Operating Costs: with cloud you typically only pay for what you use so the operating model is quite different from the cost of purchasing the needed equipment and bandwidth for operating on-prem solutions. Your Price / Performance equation is different.

  • Elasticity: all cloud data warehouse vendors advertise the value of their elasticity so this will be a major area of focus during your benchmarking. Does doubling the size of your cluster cut the runtime in half? Can you deploy multiple clusters that are querying the same data sets, and how long does this configuration take?

  • End-to-end usability/reliability: test the whole system, including ingestion using your primary, secondary and tertiary methods (ODBC, JDBC, native drivers) as well as how the system responds to things like transient disconnects. If you’re in a mobile environment and you’re running a very long query, does your intermittent connectivity get handled gracefully?

Let’s drill down further into the design of a proper cloud data warehouse performance test. The most relevant approach, some would say the best approach, is to use your own data and your own queries from your existing data warehouse when performing your benchmarking. This involves unloading data from your current on-prem environment and moving it to the cloud. Even if it’s not feasible to move an entire application’s data set, which may have several years of data due to a retention policy, it’s usually feasible to move just one month of data then leverage the power of the database to insert into larger tables a selection from the original table while expanding the date key as you go. You should also be able to capture a realistic mix of queries from the logs of your on-prem system, and group them into common patterns; queries that power fast dashboards or interactive work, queries that power standard reports, and a representative sample of complex/ad-hoc queries so you can test periodic or analytical workloads that may be more taxing on the system.

You also want to pay attention to the kinds of business intelligence (BI) and dashboard tools that your organization uses. Often, these tools create their own queries and they expect fast performance so it’s important to consider these too. By using your own data, you can get a highly accurate estimate of the cost of running a cloud data warehouse in your own environment.

Sometimes though, perhaps due to security and/or regulatory concerns, you don’t have the luxury of using your own data for benchmarking. In this case, we encourage the TPC-DS benchmark be used to evaluate cloud data warehouse performance due to the range of complexity and variety of queries included with that benchmark. It’s free, open source, and you can find a lot of other blogs describing how it can be used to determine results. TPC-DS also includes an open source script generator and data generator.

There are a few other dimensions to benchmark design that should also be mentioned:

  • Make sure to take the time to tune the database for each platform you test, and measure the amount of effort it takes to do that tuning. For example, cloud platforms provide a way to partition data. What level of effort does that take?

  • Document what level of expertise the administrator would require to do his/her job. Do they have to be a database administrator (DBA)? Do they have to be a developer?

  • Don’t underestimate how much you should test concurrency! You shouldn’t plan to use the same queries over and over again because by doing so you’ll over-optimize the caching mechanisms of the cloud platforms and artificially boost your performance results. Instead, create a variety of scripts that mix up the query order and use different predicates.

  • It’s a good idea to take the time to document the steps you took to tune each platform since you’ll have to do that for each schema anyways once you select a cloud data warehouse. An ounce of diligent documentation today will provide a pound of development time tomorrow!

These are some high-level thoughts and considerations for designing your cloud data warehouse performance test. In my 2nd blog entry, I’ll go into how to best execute your performance test so you can gain measurable results. Stay connected for these updates.