Data warehousing has evolved considerably, as shown in the following figure from the paper.
- Management features such as ACID transactions, data versioning, auditing.
- Performance features such as indexing, caching and query optimization.
- Cannot access internal format.
- Cannot store unstructured data.
- Cannot handle data science or machine learning workloads: datasets are large compared to BI queries, they need to be processed with non-SQL code, and reading data via JDBC is slow.
Data first goes to the lake and then gets synced to warehouses.
- Lake provides low-cost and directly-accessible storage (previously HDFS, now S3) in open format (Parquet).
- Support unstructured data.
- Lake loses management and performance features.
- Difficult to keep lake and warehouse data in sync.
- Warehouse data can be stale relative to the lake.
- Duplicated storage between lake and warehouse.
Best of both worlds! Here's how:
- Add a transactional metadata layer that supports management features.
- Implement performance optimizations:
- Indexing: (not discussed yet).
- Caching: use SSDs and RAM as the cache.
  - Query optimization: record ordering (e.g., clustering related records so file-level statistics can skip more data).
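The metadata-layer idea can be sketched with a tiny, stdlib-only transaction log in the style of Delta Lake's `_delta_log` directory. This is a hypothetical `TransactionLog` class for illustration, not Delta Lake's actual API: each commit is a JSON file of `add`/`remove` actions, atomically renamed to the next version number (standing in for an object store's put-if-absent primitive), and a snapshot is computed by replaying the log.

```python
import json
import os
import tempfile

class TransactionLog:
    """Sketch of a Delta-style transaction log over a table directory.
    Hypothetical API for illustration; not the real Delta Lake code."""

    def __init__(self, table_dir):
        self.log_dir = os.path.join(table_dir, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Committed versions are the numerically named *.json log entries.
        return sorted(int(f.split(".")[0]) for f in os.listdir(self.log_dir)
                      if f.endswith(".json"))

    def commit(self, actions):
        """Atomically append one log entry. The rename to a versioned
        filename stands in for put-if-absent on an object store."""
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        fd, tmp = tempfile.mkstemp(dir=self.log_dir)
        with os.fdopen(fd, "w") as f:
            json.dump(actions, f)
        final = os.path.join(self.log_dir, f"{version:010d}.json")
        if os.path.exists(final):  # lost a race with a concurrent writer
            os.remove(tmp)
            raise RuntimeError("concurrent commit; retry")
        os.rename(tmp, final)
        return version

    def snapshot(self):
        """Replay the log: 'add' actions introduce data files, 'remove'
        actions drop them; the result is the current set of live files."""
        live = set()
        for v in self._versions():
            with open(os.path.join(self.log_dir, f"{v:010d}.json")) as f:
                for action in json.load(f):
                    if action["op"] == "add":
                        live.add(action["file"])
                    else:
                        live.discard(action["file"])
        return live
```

Because readers only see files that a committed log entry points to, writers can stage new Parquet files and publish them in one atomic commit, which is how the lake gains ACID transactions without changing the underlying storage format.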
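One concrete record-ordering technique (used by Delta Lake, among others) is Z-ordering: interleave the bits of several columns into a single sort key so records close in all dimensions end up near each other, which makes per-file min/max statistics effective for multi-column filters. A minimal sketch of the bit-interleaving:

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    """Morton / Z-order encoding: interleave the bits of x and y so that
    sorting by the result clusters records close in both dimensions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x takes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y takes the odd bit positions
    return z

# Sorting records by Z-value groups nearby (x, y) pairs into the same
# files, so queries filtering on either column can skip more files.
records = [(3, 1), (0, 0), (1, 2), (2, 2)]
records.sort(key=lambda r: z_value(*r))
```

This is only the encoding step; a real implementation would also repartition files along the Z-order curve and maintain the min/max statistics used for data skipping.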
Note that Lakehouse is more of a specification, while Delta Lake is an implementation by Databricks. Go check it out!