Data Lakehouse

The Database Catalysed the Digital Revolution

Data storage has evolved numerous times since the dawn of the computer age. In the mainframe era, the primary storage mechanism was initially magnetic tape, which was superseded by more economical disk storage. In the 1970s and 80s, relational databases gave programmers an abstraction that let them manipulate a logical representation of the data, detached from the physical disk. This made mapping applications to databases far easier.

Over time, the language for working with relational databases converged on SQL (Structured Query Language), which has an easy-to-use, English-like syntax and is still the industry standard today. This made databases accessible to a broad range of techie and non-techie employees, and was a major catalyst for digitalisation in the 1990s.
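As a flavour of that English-like syntax, a simple query reads close to a plain sentence. The sketch below uses hypothetical table and column names purely for illustration:

    -- "Show the name and city of every customer based in Germany"
    SELECT customer_name, city
    FROM customers
    WHERE country = 'Germany';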

Database Fragmentation

The PC revolution and the rise of the Internet, combined with the establishment of standards such as SQL, led to the creation of lots of applications and systems, each needing its own RDBMS (Relational Database Management System). Then came the mobile, cloud, and SaaS era, which resulted in even more database fragmentation.

Databases are transactional by nature - designed for simple insert, update, and delete operations at maximum write and read speeds - and hence are not suitable for retrieving large datasets for analysis. The fragmentation and limited analytical capabilities of orgs' database estates left them with large blind spots. With hindsight, it's no surprise that, over the past few decades, many once-dominant corporations have lost their leadership, and many have been liquidated altogether.
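To make the transactional-versus-analytical contrast concrete, here is a minimal sketch with hypothetical table and column names. The first two statements are the small, row-at-a-time operations an RDBMS is built for; the last scans the whole table, which is the kind of analytical workload it handles poorly:

    -- Transactional (OLTP) work: touch one row at a time, as fast as possible
    INSERT INTO orders (order_id, customer_id, amount) VALUES (1001, 42, 59.99);
    UPDATE orders SET amount = 64.99 WHERE order_id = 1001;

    -- Analytical work: aggregate over every row in the table
    SELECT customer_id, SUM(amount) AS lifetime_spend
    FROM orders
    GROUP BY customer_id;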

Data Warehouse 1.0

To the rescue came data warehouses - one location, and a single source of truth, for all of an org's structured data (prices, sales, products, business units, etc.). Databases feed data into the data warehouse, which is optimised for querying and retrieving large volumes of data in order to analyse and discover business insights that support decision making.
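As a sketch of the kind of query a warehouse is built for - again with hypothetical table and column names - a single statement can aggregate years of sales history across dimensions such as business unit and quarter:

    -- Hypothetical star-schema query: total sales per business unit per quarter
    SELECT d.business_unit,
           d.fiscal_quarter,
           SUM(f.sales_amount) AS total_sales
    FROM fact_sales AS f
    JOIN dim_org AS d ON f.org_key = d.org_key
    GROUP BY d.business_unit, d.fiscal_quarter
    ORDER BY d.business_unit, d.fiscal_quarter;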

Enterprises finally had holistic views of their data, though managing the on-prem hardware required high capex and opex. IT had to constantly project peak demand for data warehouse usage and buy and operationalise the necessary number of servers accordingly, which invariably led to time lags and wasted spend.

Data Warehouse 2.0

Data warehouse 2.0 was cloud-based - provided by the hyperscalers - and solved some of the aforementioned operational issues. However, scalability was limited because compute and storage were bundled together, which posed two problems. Firstly, storage is not elastic in real time in the way compute is, which means data warehouse analytical operations can't be instantaneously scaled up and down according to usage demand. Secondly, orgs may have high storage needs but infrequent compute needs, or vice versa, so it's more economical if the two can be scaled separately.

Data Warehouse 3.0

Data warehouse 3.0 - that is, the Snowflake (NASDAQ ticker: SNOW) architecture - solves the scalability and cost challenges with a pioneering technology that separates compute from storage. SNOW customers can pay for storage upfront and elastically scale their compute according to usage demand. Sounds fantastic, right? And it is, but data warehouses store structured data only, and since the advent of Web 2.0, enterprises receive and generate orders of magnitude more unstructured data.
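For illustration, Snowflake exposes compute as virtual warehouses that can be created, resized, and suspended independently of the data they query. The sketch below uses a hypothetical warehouse name and is indicative only:

    -- Compute is provisioned and resized independently of storage
    CREATE WAREHOUSE reporting_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60       -- pause after 60 idle seconds so no compute is billed
      AUTO_RESUME = TRUE;

    -- Scale up for a heavy month-end workload, then back down afterwards
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XSMALL';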

Text-based content, images, audio, and video from webpages, emails, and countless other sources are not suitable for data warehouses. Such data has a lot of potential value but needs to be kept in raw form because the end use is typically unknown. This necessitated an alternative technology for the cost-efficient storage and exploration of very large volumes of loosely defined data - i.e., big data. Data lake was the analogy that emerged to succinctly articulate this requirement.

Data Lakes

Data lakes are great for cheap and scalable storage but are not optimised for querying and retrieval. Furthermore, due to weak governance controls, many data lakes become what is derided as a data swamp, riddled with poor-quality data. Data scientists have the skills to circumvent such issues and derive value, but other consumers, like business analysts, do not. To make data lakes more broadly accessible, some structure needs to be added so the data can be queried with common languages like SQL, ensuring orgs can extract the full potential of their data assets.
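One common way of adding that structure is to define an external table over files already sitting in the lake, so SQL engines can query them in place. A hedged sketch in Spark-style SQL, with a hypothetical bucket path and columns:

    -- Expose raw Parquet files in the lake to SQL consumers
    CREATE TABLE web_clickstream (
      event_time TIMESTAMP,
      user_id    STRING,
      page_url   STRING
    )
    USING PARQUET
    LOCATION 's3://example-data-lake/raw/clickstream/';

    -- Business analysts can now query the lake with plain SQL
    SELECT page_url, COUNT(*) AS views
    FROM web_clickstream
    GROUP BY page_url
    ORDER BY views DESC;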

In recent years, there has been innovation around data lakes to make them interoperable with a broader range of data consumers. Some data lakes have evolved by providing something of a staging area that houses standardised datasets ready to supply data warehouses, receive SQL-based queries, and/or feed ML models. However, this requires frequent ETL (Extract, Transform, Load) operations to place data in a data warehouse, which takes time (several hours or even days), is expensive (in terms of computation and because of double storage), and creates additional overhead (because both the data lake and the data warehouse have to be managed).
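To illustrate that overhead, a typical load step copies already-staged files from the lake into a warehouse table, so the same data ends up stored twice and the pipeline must be re-run whenever fresh data arrives. A minimal sketch in Snowflake-style SQL, with hypothetical stage and table names:

    -- The "L" in ETL: load staged lake files into a warehouse table (a second copy of the data)
    COPY INTO analytics.web_clickstream
    FROM @lake_stage/clickstream/
    FILE_FORMAT = (TYPE = 'PARQUET')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;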

The Latest Evolution - The Data Lakehouse

This brings us to the latest architectural evolution - the Data Lakehouse - which aims to address the limitations of the two storage types. This architecture blends together 1) the cheap, scalable, and explorable storage of a data lake, and 2) the data quality and querying/retrieval optimisation of a data warehouse. A few of the main features that make the combined capabilities possible include:

  • Enhanced governance controls to prevent poor-quality data and inappropriate usage of data.
  • Enforcing schemas, familiar from data warehouses, onto unstructured data to make it queryable.
  • The use of open file formats to give standardisation to the unstructured data - this allows business intelligence tools to query the unstructured data directly, without it being staged beforehand (see the sketch after this list).
  • The implementation of a metadata layer that is detached from the storage layer but provides attributes about the unstructured data.
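As a brief sketch of how these features come together in practice, the statements below use Databricks-style SQL over the open Delta format, with hypothetical table names and paths. A table is declared directly on lake storage with an enforced schema and can then be queried like any warehouse table:

    -- A schema-enforced table stored in an open format directly on the lake
    CREATE TABLE support_emails (
      received_at TIMESTAMP,
      customer_id STRING,
      body        STRING
    )
    USING DELTA
    LOCATION 's3://example-data-lake/lakehouse/support_emails/';

    -- BI tools and ML pipelines query the same table directly - no staging into a warehouse
    SELECT DATE_TRUNC('month', received_at) AS month,
           COUNT(*) AS email_volume
    FROM support_emails
    GROUP BY DATE_TRUNC('month', received_at);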

In general, the data lakehouse infuses the data management capabilities of a data warehouse into the low-cost and scalable storage of a data lake. Moreover, by keeping unstructured data in its raw form while making both unstructured and structured data queryable, it can serve machine learning and business intelligence requirements alike.

Databricks (DTBK) has spearheaded the data lakehouse concept, but SNOW is close behind in innovation. As a result, a vibrant ecosystem of tools and new startups is being built around these two vendors, making for a very intriguing market over the next few years.

For institutional investors (public and private), we can produce tailored research on request. For all types of investors, here are individual reports related to data management that you can purchase:

For institutional inquiries please email admin@g.convequity.com for more information.

Reports

Snowflake: Apple of MDS (Part 3) & 1Q24 Update (May 2023)

Snowflake: Apple of MDS (Part 1) (December 2022)

Snowflake: Apple of MDS (Part 2) (December 2022)

Databricks (coming soon)

Confluent (coming soon)
