Comparing Cloud Data Warehouse Prices – Part 2
In part 1 of our series on cloud data warehouse pricing, we compared and summarized the primary compute and pricing variables for BigQuery, Redshift, and Snowflake. Revisit this first blog in the series to view the detailed comparative summary table.
With part 2, we begin to look at hypothetical scenarios that provide a guide to understanding price operations. We are an agnostic iPaaS solution that ties together and orchestrates data flows between on-premises apps and data, cloud SaaS applications, and a variety of cloud data warehouses.
What follows is a hypothetical setup, including a common on-demand compute scenario that may be useful to illustrate how basic pricing differences between cloud data warehouses work. Future blogs will cover additional, more advanced pricing scenarios.
Hypothetical On-Demand Scenario – BigQuery vs. Redshift vs. Snowflake Pricing
Rather than a minimum configuration, which would create misleading comparisons, we engage a hypothetical setup that’s typical for large data query operations that are on-demand in nature – a great use case for cloud-based query tools. These days, with enterprises flooded with data, even with on-demand or exploratory query workloads, it’s not uncommon for companies to require a specific maximum response time against large data sets.
Therefore, for our hypothetical scenario, we have three different computational heavy queries. Each query is required to return in 30-seconds (or less) and must scan an entire half-terabyte (500 GB) dataset to produce the query result. The data originates from an on-premises system of record and it is loaded to the cloud data warehouse via the respective SnapLogic Snap Pack – for BigQuery, Redshift, or Snowflake. For this scenario, the queries are run sporadically throughout a month – meaning, not a constantly running 24×7 environment, which will be investigated with a different hypothetical scenario.
For the purpose of illustrating pricing differences, we assume the following cloud data warehouse sizes are necessary to achieve the required maximum 30-second response time (it is assumed smaller configurations would produce slower, unacceptable results):
- BigQuery: 2,000 slots, on-demand or with flex pricing
- Redshift: 8 nodes x dc2.8xlarge or 5 nodes of ra3.xlarge, on-demand
- Snowflake: Standard Edition, sized at X-Large, on-demand
Use the BigQuery Snap in your SnapLogic pipeline to move data to BigQuery. Once there, with on-demand pricing, the price to run one of the 30-second queries is $2.50 (.5 * 1 TB * $5 per TB scanned). Run the queries whenever – batch them up or run them independently. Regardless, you will always pay $2.50 per query because of the half- terabyte of data scanned per query.
With BigQuery flex pricing, the cost is $80 per hour for 2,000 slots ($4 per hour per 100 slots * 20). Since a 30-second query consumes all 2,000 slots, if each query comes in sporadically, you’ll always pay the 60-second minimum charge, which is $1.34 per query (60 seconds * ($80 / 3,600 sec./hr.)) for each individual query.
Therefore given the 60-second minimum, you are motivated to batch up the queries, if possible, and hope for some level of concurrency with the expectation that the maximum you’ll pay is for 90 seconds of compute to run all three queries at a cost of $2 (90 * $80 / 3,600 (sec./hr.)).
However, note, flex pricing for BigQuery is effectively spot pricing, and slot resources are not guaranteed to be available at the time of your query request. In addition, the meter continues to run until you cancel the slot resources. Therefore, your actual compute costs may be higher than the above BigQuery flex pricing estimates.
Also note, because the data is coming from an outside source, you must also pay for Google Cloud Storage, which is approximately $9.80 per month (500 GB – 10 GB (free))/1,000 GB * $20), which can be prorated down if the data is not in place for an entire month.
With this compute scenario, BigQuery flat-rate pricing, which removes the dependency of the amount of data scanned, may not be a price-practical option because query activity is sporadic and not enough to keep BigQuery busy to justify the flat-rate pricing of $40,000 per month (20 * $2,000 per month per 100 slots).
Thanks to relatively new pause and resume features, when using Redshift on-demand, you now have tighter control of Redshift costs. Use the Redshift Snap Pack in your SnapLogic pipeline to move data to Redshift. Then, as long as the three queries are batched or individually run within an hour, you will be able to keep Redshift costs down. Otherwise, without pause and resume, for dc2.8xlarge nodes, you’re exposed to an hourly cost of $38.40 (8 nodes * $4.80 per hour per node) for the three queries. Or, with ra3-16xlarge nodes, a cost of $65 (5 nodes * $13.04 per hour per node).
The challenge in this case, however, is Redshift pause and resume is set manually or is scheduled. You will have to know in advance when the queries will need to run on the Redshift cluster. If the Redshift admin is in control of this, then timing of pause and resume can be made tightly around when the queries need to run. This keeps the cost down because obviously, you’ll accrue costs that are less than the full hourly rate. If the admin is not in control of timing, costs will balloon because of the idle time when the Redshift cluster is active, but not running a query.
When the Redshift cluster is paused, you’re still on the hook for storage pricing. For Redshift dc2.8xlarge nodes, which includes 2.56 TB of live storage, your estimated cost is $492 per month for storage (8 nodes * 2.56 TB per node * 24 TB per month). Storage charges for Redshift ra3-12xlarge nodes is calculated differently. With ra3 nodes, you pay only for the storage that you consume, $12 per month (.5 TB * 24 TB per month).
Use the Snowflake Snap in your SnapLogic pipeline to move data to Snowflake. Once data is moved to Snowflake, storage charges immediately kick-in at the on-demand rate of $40 per TB per month, compressed. Typical data compression for Snowflake is assumed to be ~3:1, actual compression may vary. Therefore, your estimated storage charge is $7 per month (.5 TB / 3 * $40 per TB per month).
Snowflake has the advantage of enabling automatic suspend and automatic resume, enabling it to operate like a query engine, much in the same way BigQuery operates. Enable auto suspend and auto-resume at the time a Snowflake warehouse is created, or at any time post-creation via SQL scripting.
Once the warehouse is idle, auto suspend kicks in after a certain set time, which you can select. If selected from the Snowflake “Create Warehouse” configuration drop-down menu, five minutes is the minimum. This can be reduced even further through SQL scripting. For example, “AUTO_SUSPEND = 60” (value is always in seconds) reduces the time to 1 minute. While it may be tempting to reduce time as short as possible to save costs, be aware there are potential warehouse operational consequences if the suspend time is too tight when queries arrive frequently and sporadically, and the data warehouse size is big in size.
There’s a one-minute minimum compute charge with Snowflake. If the three hypothetical queries are run concurrently and all three queries complete within 30 seconds, your estimated compute charge for a Snowflake Standard edition at an X-Large size is $.53 ($2 per hour for Standard type * 16 credits for X-Large size / 60 minutes per hour) + the cost of the auto suspend wait time. If the auto suspend wait time is set at two minutes, the wait-time costs is $1.07 (rounded) – resulting in a total cost of $1.60 ($.53 + $1.07).
If the three hypothetical queries are run sequentially, then the cost is $1.87 total (90 seconds of compute time ($.80) + 2 minutes of suspend wait-time ($1.07)).
Especially for on-demand operations, as with the Redshift example, it’s important to have Snowflake auto suspend and auto-resume features enabled. Otherwise, the meter is running while the data warehouse is idle, exposing you to higher than necessary costs.
Watch carefully when operating on demand
As the hypothetical scenario illustrates, Google BigQuery, Amazon Redshift, and Snowflake provide the capability to operate in an on-off manner and with low costs, ideal for when query loads are light and sporadic. Comparative compute costs are similar. On-off operation is the key to avoid runaway costs. BigQuery is most natural with on-off operation since it will run queries only when presented, while Redshift and Snowflake will require settings to be in place.
With the next installment of this blog series, we’ll look at continuous operation scenarios and compare prices.