Halving your Data Warehouse Costs in the Cloud
Data warehouses (DWH) are critical to modern businesses, enabling them to store, process, and analyze large volumes of data. However, as organizations increasingly migrate to cloud-based data warehouses, managing and reducing costs becomes a significant concern. Efficiency is important for an on premise data warehouse, for cloud based data warehouses it's critical. Whether you run your data warehouse as IAAS, PAAS or SAAS solution or you use a cloud native data warehouse like Amazon Redshift, Google BigQuery, or Snowflake, inefficiency translates directly to extra cost. And we're not talking marginal improvements, last week a query in a customers data warehouse was reduced by a factor of 24.000.000. Nobody had ever thought to check if that query was efficient yet it went from one of the biggest cost factors to costing almost nothing. In this blog, I'll explore various strategies to reduce data warehouse costs in the cloud.
The usual way is often not efficient
One of our customers had a very large table with 55 true/false columns (boolean columns for the technical people). To get the information in their data warehouse the dimension of each column was first determined, then a count was done for each value in the dimension. Now the dimension for a boolean is always [true, false] so determining it's dimension is not needed. That would have saved half the time already. Reading the table once and keeping 55 true counters and 55 false counters further reduced processing time by a factor of 55. With the exact same end result in the data warehouse.
Because of the size of the table it had to be read from disk every time. Reading it once instead of 110 times reduced processing time from about 2 hours to 1 minute. And because it ran so fast now, and a number of other optimizations where made, total memory need for the data warehouse was reduced, saving significant cost, especially on licenses.
Software has wear and tear
And a data warehouse is built with software. It's not like a physical warehouse where people walk in and to fill it with stuff and the collect stuff. A data warehouse requires software to put the data in and get the data out. To many of you that will be an open door, but I still meet people that think physical and digital warehouses work the same.
Now software degrades over time if you don't maintain it. Mainly because their surroundings change. And they need to cater to a larger population. A small craft beer brewery may deliver their brew with small vans. As they grow they will shift to truck and even trains to transport their product. A data warehouse is no different. If it grows and you still deliver the data with the equivalent of vans, you're not efficient and paying way too much.
Cloud forces you to buy resources you don't use
If your workload is outside standard cloud resource ratio's you need to buy resources you're not going to use. And often even pay license costs over those unused resources. Inefficient data warehouses generally need a ton of memory to make it run at all within available time. That means you need to buy extra CPU you don't use, and even more expensive, licenses for all those CPU's.
Moving parts of the workload from memory and disk to compute you don't use will reduce cost. Traditionally data warehouses are more memory and disk than compute intensive. Using the compute you're already paying for but not using will reduce the need for memory and disk.
Cloud storage is slower and more expensive
If you get 8TB of premium SSD in the cloud, it will set you back the same amount each month that it would cost you to buy an SSD the same size for a server. Only the SSD is 14 times as fast as the cloud storage. So in the cloud you pay 60 times the price (assuming hardware lasts 5 years) for 7% of the performance. The SSD has an 840 times better value for money. Note: your mileage may vary depending on the hardware you buy, but even if I'm off by a factor of 10, cloud still has 84 times less value for money.
Now I'm not saying you should move back to an on premise data warehouse. But optimizing I/O in the cloud is much more important than it was on premise. It saves more money.
Group processes for the same data
If you have processes that use the same data, make them run sequentially. That way, the data will still be present in memory when the second, third and further processes run. if you run the processes in no particular order, data will have to be re-fetched from disk, which is slow and expensive.
Most data warehouses first run all data quality checks on source data, then do all ETL, the run data quality checks on the whole resulting data warehouse.
Running the data quality checks on a piece of data immediately after it's been loaded by ETL, for example, will give you a much better chance of still finding it in memory instead of having to re-fetch it from disk.
Check your memory bandwidth
Memory bandwidth is one of the most important resources for a data warehouse. It's the speed at which data can move from the main memory to the CPU's. And it's always overlooked when sizing a data warehouse. The faster data is moved to the CPU, the shorter that data needs to be present in main memory, reducing the amount of memory required. There is no way to measure the memory bandwidth directly, but the bandwidth you have and the bandwidth you need can both be calculated. Matching the two parameters will speed up you data warehouse and reduce the amount of internal memory you need, which again saves significant cost.
Rightsizing memory bandwidth requires selecting the correct virtual hardware. There is no one size fits all answer to what's right for you. After your data warehouse load is optimized a proper sizing need to be made based on your cloud providers offering.
Get help
If you're not measuring, you're guessing. Sure, many data warehouse platforms offer you some metrics and statistics, but at most they'll that you that you have a problem and not why or how to solve it. Often they'll tell you you need more resources and that's not what you're looking for if you want to save cost.
Finding theses optimizations may feel like finding a needle in a haystack. More often it's more like finding a needle in a needle stack. We help you find these optimizations quickly and for a fraction of the cost savings you'll realize. Make a 15 minute appointment and I'll explain how it's done. Click here to make the appointment.