
    A Guide to Modern Data Lakehouse: History, Architecture, Differences, and Technology

    Author: Sudha, Data & BI Addict
    "When you theorize before data, insensibly one begins to twist facts to suit theories, instead of theories to suit facts."
    20-February-2023
    Featured
    • Data Lake
    • IT/ITeS
    • Data Analytics

    Editor’s Note:

    In this article on the Data Lakehouse, we answer the million-dollar question of data warehouse vs. data lake vs. data lakehouse, with a brief synopsis of why and how the lakehouse emerged. Keep reading if you want to understand the nuances of the lakehouse with respect to Databricks and Microsoft Fabric!

    Introduction: The emergence of Data Lakehouse

    Are you in this state: bored that everyone is talking about generative AI when all you can see is the true obstacle behind it, the data? A testament to this confused data ecosystem is the current data, machine learning, and AI landscape, which looks something like this:

    Grateful to Matt Turck & FirstMark for this wonderful research

    TLDR: The MAD landscape is as chaotic as neural network hidden layers.

    But with 72% of top-performing CEOs (according to IBM) saying competitive advantage depends on who has the most advanced generative AI, and with ML ranking as the number-one priority of companies in multiple reports, we believe it is important to ask this question: is your data actually capable of supporting all this?

    To answer this question, let's take a trip back along the data timeline to see how data needs and data management needs have evolved:

    • Era 1.0: Businesses wanted insights from their data, so they built Data Warehouses on top of it, with schema-on-write provisions to support business intelligence. But problems emerged with the growing variety of unstructured data sets, paying for peak user loads, and the growing need for complex analysis: none of the leading ML systems, such as TensorFlow, PyTorch, and XGBoost, works well on top of warehouses, because ML systems need to process large datasets using complex non-SQL code.

    • Era 2.0: To support offloading raw data into low-cost, open formats like Parquet while letting systems process it with non-SQL code, we entered the era of Data Lakes with schema-on-read architecture. But they lost the rich data management features that existed in data warehouses, such as ACID transactions and indexing.

    • Era 3.0: The combination of data lake + data warehouse, or two-tier architecture, where a subset of data from the data lake is ETLed to a downstream warehouse for subsequent analysis and business intelligence applications. Though this serves both BI and AI, the added ETL steps bring complexity; the semantics, SQL dialects, and supported data types can differ between the tiers; and the probability of failures and bugs increases.

    • Era 4.0: The current era of the Data Lakehouse, which combines the semantic flexibility and storage of a data lake with the computation and delivery of a data warehouse, i.e. it eliminates or at least reduces the friction between data ingestion and data utilization.
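The schema-on-write vs. schema-on-read distinction behind Eras 1.0 and 2.0 can be illustrated with a toy sketch in plain Python. The schema and records here are entirely hypothetical, and real systems are far more involved, but the timing of validation is the point:

```python
# Toy illustration: schema-on-write (warehouse-style) rejects bad records
# at ingestion time; schema-on-read (lake-style) stores anything and
# applies the schema only when the data is queried.

SCHEMA = {"order_id": int, "amount": float}  # hypothetical table schema

def write_with_schema(store, record):
    """Schema-on-write: validate before storing, like a warehouse."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad field {field!r} in {record!r}")
    store.append(record)

def read_with_schema(store):
    """Schema-on-read: the store holds raw records; coerce or skip at query time."""
    for record in store:
        try:
            yield {f: t(record[f]) for f, t in SCHEMA.items()}
        except (KeyError, TypeError, ValueError):
            continue  # raw record does not fit the schema; ignore it

warehouse, lake = [], []
write_with_schema(warehouse, {"order_id": 1, "amount": 9.99})
lake.extend([{"order_id": "2", "amount": "5.00"}, {"junk": True}])
print(list(read_with_schema(lake)))  # the junk record is filtered out
```

The warehouse refuses malformed records up front, while the lake accepts everything and pushes the cleanup cost onto every reader, which is exactly the trade-off the lakehouse era tries to resolve.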

    So, what is a Data Lakehouse?

    A Data Lakehouse is a hybrid data storage and processing platform that combines the best of traditional data lake and data warehousing technologies: low-cost storage in an open format accessible by a variety of systems from the former, and powerful management and optimization features from the latter.

    The evolution from a two-tier architecture to the Data Lakehouse

    Lakehouse architecture, which you can see in Databricks, Microsoft Fabric, etc., aims to reduce the cost, operational overhead, and complexity of serving data for multiple purposes, from business intelligence to artificial intelligence.

    Though the data lakehouse journey started with Uber, then Netflix, the majority of Fortune 500 companies have now started their own. Three of the main perceivable benefits of using a Data Lakehouse are:

    • Reduced data staleness: Roughly 70-80% of analysts work with out-of-date data. With all the data stored in one place and one format, you get more control over its freshness.

    • Unified platform for all data analytics: Data lakehouses provide one location for BI, SQL analytics, and more advanced analytics, including machine learning, since many ML libraries, such as TensorFlow and Spark MLlib, can already read data lake file formats like Parquet.

    • Reduced cost: You may be able to eliminate the need to pay to store the same data twice, as organizations often must when they run both a data warehouse and a data lake. Also, commercial data warehouses often lock data into proprietary formats, which can be expensive to move away from, while data lakehouses use open formats.

    Understanding the differences between Data Warehouse, Data Lake, and Data Lakehouses

    We have touched on all three and on how the evolution of data management made each of them necessary; here we compare them in the simplest way possible.

    The data lakehouse balances cost against the growing demands of data management, helping organizations stay agile and adapt quickly.

    Differences between Data Warehouse, Data Lake, and Data Lakehouse

    The differences across various parameters such as data format, schema, openness, and performance

    TL;DR: A data lakehouse stores data in a similar format to a data lake, but a transactional metadata layer defines which objects are part of each table version, like a data warehouse. With the interoperability and flexibility of data lakes and the ACID transactions of data warehouses, the data lakehouse is an amalgamation of both ideas.
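The role of that transactional metadata layer can be sketched in plain Python. This is a hypothetical, drastically simplified log, loosely inspired by how Delta Lake, Hudi, and Iceberg track table versions, and not their real on-disk formats:

```python
# Toy metadata layer: the data files themselves sit in cheap object
# storage; a separate log records, per commit, which files make up the
# table. Readers pick a version from the log instead of listing files.

class TableLog:
    def __init__(self):
        self._versions = []  # each entry: the set of live files at that version

    def commit(self, add=(), remove=()):
        """Atomically publish a new table version."""
        current = self._versions[-1] if self._versions else set()
        self._versions.append((current | set(add)) - set(remove))
        return len(self._versions) - 1  # the new version number

    def files(self, version=None):
        """Which files a reader should scan ('time travel' via version)."""
        if version is None:
            version = len(self._versions) - 1
        return sorted(self._versions[version])

log = TableLog()
log.commit(add=["part-000.parquet"])                 # version 0
log.commit(add=["part-001.parquet"])                 # version 1
log.commit(add=["part-002.parquet"],
           remove=["part-000.parquet"])              # version 2: a rewrite
print(log.files())    # files of the latest version
print(log.files(0))   # time travel back to version 0
```

Because readers consult the log rather than the raw file listing, a half-finished write never becomes visible, which is what gives the lakehouse its warehouse-like ACID behavior on top of plain object storage.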

    Shedding some light on the architecture behind Data Lakehouse

    The 'why' behind the Data Lakehouse should be clear by now, so let's dive into the 'how'. A typical data lakehouse architecture consists of five layers:

    • Ingestion Layer: Gathers data from multiple sources, such as transactional systems, CRMs, and NoSQL databases, and transforms it into a storable format.

    • Storage Layer: Data is kept in low-cost storage, typically cloud object storage such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS), in open, directly accessible formats like Apache Parquet and ORC.

    • Metadata Layer: This layer, unique to the data lakehouse, resides on top of the lake storage. It manages the table format and tracks the files, while enabling features like schema enforcement, data versioning, and auditing. Examples of metadata layers include Delta Lake, Apache Hudi, and Apache Iceberg.

    • API Layer: Enables data querying, with SQL APIs for traditional BI and SQL analytics, and declarative DataFrame APIs for data science and machine learning workloads.

    • Data consumption layer: From BI tools for dashboarding, data science and machine learning for modelling, real-time applications for streaming, and data sharing across collaborators, this layer lets users consume and interact with data in multiple ways.
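The dual API layer can be illustrated with a small, hypothetical example: the same question answered once through a SQL API (stdlib sqlite3 stands in for a lakehouse SQL endpoint here) and once through a declarative, DataFrame-style pass over the same rows (plain Python stands in for e.g. a Spark DataFrame chain):

```python
import sqlite3

# Hypothetical rows that the storage and metadata layers would surface.
rows = [("EMEA", 120.0), ("APAC", 80.0), ("EMEA", 40.0), ("AMER", 95.0)]

# 1) SQL API, as a BI tool would use it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_result = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# 2) DataFrame-style aggregation, as a data scientist might express it.
df_result = {}
for region, amount in rows:
    df_result[region] = df_result.get(region, 0.0) + amount

assert sql_result == df_result  # both APIs read the same underlying data
print(sql_result)
```

The point of the layer is exactly this equivalence: analysts and ML engineers use different interfaces, but both are served from one copy of the data.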

    The five layers of a data lakehouse architecture

    What’s the current Data lakehouse Technology?

    With the advent of Microsoft Fabric and OneLake, the Azure data lakehouse has become more mainstream than it was before. It's worth mentioning that the "Curated layer" in a Microsoft Fabric architecture can be replaced with a Data Warehouse when needed. A lakehouse in Fabric can be created very easily, with the following options in place.

    You can find more details on Azure data lakehouse architecture and where to get started here.

    Lakehouse creation options in Microsoft Fabric

    Databricks' Lakehouse makes it easy to automate and orchestrate ETL pipelines. You may previously have come across Azure Databricks (often used with Azure Synapse) or Delta Live Tables, which navigate the complexities of infrastructure management, task orchestration, error handling, and more.

    Source: Databricks

    There are obviously other players in the market, ranging from Amazon Redshift, Google Cloud BigQuery, and Salesforce Data Cloud to Apache Hudi and more. The key considerations when choosing between them include performance, cost, data variety, scalability, integration capabilities, and governance.

    What's next with the Data Lakehouse?

    We've seen the benefits of the data lakehouse: more economical than the two-tier architecture, a simplified architecture for preparing data, increased reliability through reduced data quality issues and duplication, improved governance through consolidation, and increased scalability for the future.

    As organizations recognize its value, especially in this era of AI, the data lakehouse will play a central role in driving data-driven decision making and innovation. The future of data lakehouses is bright, and their continued evolution will shape the next generation of data platforms; all that's left is for you to adopt it well.

    Talk to our data engineering experts to see where and how this fits into your data management strategy.
