In the last five years, we’ve seen the cloud data warehouse, exemplified by Snowflake and BigQuery, become the dominant tool for large and small businesses that need to combine and analyze data. The initial use cases are usually classic decision support. What is my revenue? How many customers do I have? How are these metrics changing and why?
But the iron law of databases is that data attracts workloads. When you have all of your data in one place, clever people on your team will come up with unexpected uses for it. The cloud data warehouse enables these new use cases with its elasticity. As you discover new things you’d like to do with data, you can add new compute capacity, effectively without limit.
However, these new workloads often don’t look like the classic analytical queries that data warehouses are optimized for. For the last 20 years, commercial data warehouses have been optimized for handling a small number of large queries that scan entire tables and aggregate them into summary statistics. They are well-optimized for questions like:
How many new customers did I add, in each state, in each month, for the last year?
But they are less well optimized for questions like:
What are all the interactions I have had with one particular customer?
These queries require many data sources to be in one place, but they touch only a small percentage of data from any particular source. They have both analytical and operational characteristics, and they are typical of the new workloads we see as cloud data warehouses have become ubiquitous.
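The contrast between the two query shapes can be made concrete with a small sketch in Python's built-in sqlite3 module, standing in for a warehouse. The table and column names here are invented for illustration; the point is the shape of the two queries, not the engine.

```python
import sqlite3

# An in-memory database with a toy "interactions" table standing in
# for a warehouse fact table. All names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE interactions (
        customer_id INTEGER,
        state       TEXT,
        occurred_at TEXT,
        channel     TEXT
    )
""")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?, ?)",
    [
        (1, "CA", "2021-01-05", "email"),
        (1, "CA", "2021-02-11", "phone"),
        (2, "NY", "2021-01-20", "email"),
        (3, "CA", "2021-03-02", "chat"),
    ],
)

# Classic analytical shape: scan the whole table and aggregate by group.
by_state = conn.execute(
    "SELECT state, COUNT(*) FROM interactions GROUP BY state ORDER BY state"
).fetchall()

# Operational-analytical shape: touch only one customer's rows.
one_customer = conn.execute(
    "SELECT occurred_at, channel FROM interactions "
    "WHERE customer_id = 1 ORDER BY occurred_at"
).fetchall()

print(by_state)      # aggregate computed over every row
print(one_customer)  # a tiny slice of the table
```

The first query is the kind of full-scan aggregation a columnar warehouse excels at; the second touches a handful of rows out of a potentially enormous table, which is exactly where scan-oriented engines waste work.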
The major data warehouse vendors are making changes to better support these types of queries. Snowflake recently released the search optimization service, which allows you to have indexes in your data warehouse. Indexes are ubiquitous in operational databases, but in the past most data warehouses did not support them, because they were thought to be irrelevant to analytical workloads. Meanwhile, BigQuery has released BI Engine, which allows you to store a subset of your database in memory for faster access.
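Why an index matters for these selective queries is easy to see in any SQL engine. The sketch below uses Python's sqlite3 as a stand-in (Snowflake's search optimization service and BI Engine work quite differently internally, so this is only an analogy for the principle: an auxiliary structure lets a selective query seek to the relevant rows instead of scanning everything).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (customer_id INTEGER, detail TEXT)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?)",
    [(i % 1000, f"event-{i}") for i in range(10_000)],
)

query = "SELECT * FROM interactions WHERE customer_id = 42"

# Without an index, the planner has no choice but a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# With an index, the same selective lookup becomes a seek, not a scan.
conn.execute("CREATE INDEX idx_customer ON interactions (customer_id)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_before[0][-1])  # a SCAN over the table (exact wording varies by SQLite version)
print(plan_after[0][-1])   # a SEARCH using idx_customer
```

In a warehouse, the same idea cuts a selective query's cost from "read the whole table" to "read a few blocks", which is what makes operational-style lookups viable at all.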
Over the next five years, these operational-analytical use cases will come to dominate cloud data warehouse workloads. The leading cloud data warehouses will continue to pivot to better support these workloads, but we may also see the emergence of a new database architecture optimized for this scenario. Several new database engines from the academic world explore a new point in the design space, one that in theory is optimized for analytical queries, operational queries, and everything in between. Notable examples are Umbra from the Technical University of Munich and NoisePage from Carnegie Mellon.
The evolution of technology is hard to predict, and highly path-dependent. Ten years ago, many smart commentators expected Hadoop to displace the traditional SQL data warehouse, but that trend abruptly reversed with the rise of the cloud-native data warehouse. The Hadoop ecosystem evolved too slowly, and new commercial database systems were able to leverage the unique characteristics of the cloud to provide a dramatically better user experience. In the next ten years, the growth of operational-analytical workloads will either cause an evolution of the now-incumbent cloud data warehouse—or a revolution.
George Fraser is the CEO of Fivetran.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to email@example.com.
Copyright © 2021 IDG Communications, Inc.