Between the many articles, conference talks, and vendor marketing hype machines, and the lack of use cases proving out the vision once held above your head, you’ve probably been led astray about what a ‘Data Lake’ is. This post takes a brief look at how a Data Lake compares to a Data Warehouse. But first, let’s define a Data Lake for what it really is: a storage repository for data. The original value proposition was that all data could be stored in a Data Lake storage system. It is very often compared to a Data Warehouse, which has left business management and non-technical people confused. Conversations often ensue questioning whether a data warehouse is even necessary, or whether we should stop work on (or get rid of) the existing data warehouse and just go build a data lake.
This post compares certain characteristics of each, the Data Warehouse and Data Lake Storage. Hopefully, this gives you some idea why we believe both are necessary for organizations to be data-driven. The DataLakeHouse platform aims to help organizations of all sizes understand how these two systems complement each other, and to show value sooner rather than later by providing its take on a unified data analytics solution.
You learned from a previous post how a Data Warehouse relates to an Operational Database. You may also have read this article about why Data Lake really equals Data Lake Storage. Here we briefly point out that, with the latest tools, a Data Lake storage system can ingest and store any structured or semi-structured data. We say semi-structured because we believe all data has some level of structure, but you’ll also see the term ‘unstructured’ used for data objects such as video files, mp3/mp4 files, text files, etc., which also get stored in Data Lake storage. This is why most vendors that provide Data Lake storage capability refer to their offering as ‘Object Storage’, which can be confusing to newcomers to the technology.
A Data Lake storage system has no immediate analytical benefit other than distributed object storage, which is the crux of modern Big Data. Tools must sit on top of Data Lake storage for it to bring any analytical value. Data Lake storage has no defined schema; it is raw data that must be organized, processed, and evaluated at some stage. Newer technology is enabling direct access to data lake object storage for querying, but the lack of a defined schema is probably the main point of difference between data lake storage and a data warehouse.
A data warehouse brings a defined schema optimized for analytics. Some data warehouse systems enable machine learning, forecasting, data mining, prediction, and other advanced capabilities directly in the data warehouse. The best part of a data warehouse is that it is typically business focused and synonymous with reporting and analytics, because of the confidence in its curated data definitions and consistency. Our opinion is that you need both to be competitive in the post-2000s world of Big Data. How will you get there? We have some ideas.
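To make the ‘Object Storage’ idea a little more concrete, here is a minimal Python sketch. It is not any vendor’s API: the `ToyObjectStore` class, its `put`/`get` methods, and the object keys are all hypothetical. It only illustrates the point made above, that the store accepts any bytes under a key, and that analytical value shows up only when a tool on top interprets those bytes.

```python
import csv
import io
import json


class ToyObjectStore:
    """Hypothetical in-memory stand-in for object storage: key -> raw bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, body: bytes) -> None:
        # No schema check: any bytes are accepted as-is.
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def keys(self):
        return sorted(self._objects)


store = ToyObjectStore()

# Structured, semi-structured, and "unstructured" objects all land the same way.
store.put("sales/2021.csv", b"region,amount\nEast,100\nWest,250\n")
store.put("events/click.json", json.dumps({"user": "a1", "page": "/home"}).encode())
store.put("media/intro.mp3", bytes([0x49, 0x44, 0x33]))  # placeholder bytes, not real audio

# The store itself answers no business question. A tool on top must
# fetch the object and decide how to interpret it.
rows = list(csv.DictReader(io.StringIO(store.get("sales/2021.csv").decode())))
total_sales = sum(int(row["amount"]) for row in rows)

print(store.keys())
print(total_sales)
```

The store never looks inside an object; the CSV parsing at the end plays the role of the tools that must sit on top of Data Lake storage.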
Here is a basic breakdown of some simple characteristics of a Data Warehouse and of Data Lake Storage for your review.
Characteristic Item | Data Warehouse | Data Lake Storage |
---|---|---|
Purpose / Identified By | Trend Analysis, Analytics, Reporting, typically from relational or multi-dimensional data, and operational systems (GL, ERP, CRM) | Object Storage – Any data, any object, of almost any size distributed for high-availability |
Transactional / Analytical / Other | Analytical, Historical Data | Other: Event Object Storage |
Sources of Information/Data | Structured data, any logical data source, CSV, Flat File, Operational Database | Any type of source that can be electronically submitted, stored, and retrieved, i.e.: Object Storage |
Automation Type | Transformation process, usually in batch, micro-batch, or near real-time, typically ETL or ELT processes | Any programmatic approach to moving data objects from one location to another. Typically managed by REST APIs and GUIs. |
Schema/Model Form | Schema-on-Write (i.e.: the schema/DDL is pre-designed and written on the data warehouse to create its structure in advance, so the data model is consistent and predictable). Typically dimensionally structured (Star schemas, satellites, etc.) | Schema-on-Read (i.e.: the form/structure of the model/information is applied at the time of data/object retrieval or query) |
Data Quality | Decades of process and strategy used to deliver a highly curated set of data to provide a single source of truth and key performance indicators (KPIs) for organizations. | Newer system that allows any type of data or object, usually not curated, but should aim to be organized and given some curation and thought to prevent systemic issues. |
Storage Form | File system based and often in memory for faster query performance; columnar storage is often used for MPP capabilities | Distributed object storage – originally popularized by the Hadoop Distributed File System (HDFS) and similar projects for Big Data |
Security / Access | Based on model developed, but often follows constraints of database vendor or any attached BI/Analytics/Reporting application used to query the data. | Based on object policies, buckets, and other identity access management protocols. |
Price & Performance | Massively Parallel Processing (MPP) speeds up query results, especially on aggregated data, for both hot and cold data, typically with a higher storage cost. | Object storage is considerably less expensive. Querying object storage directly has query latency considerations on large volumes of data, and without a schema, speed and cost are difficult to predict. |
Users (typical) | Line of Business Managers, Business Analysts, Data Engineers, Executives, Data Scientists | Data Scientists, Developers, Analysts, Machine Learning Developers |
Use Cases (primary) | Near real-time, intra-day, batch, micro-batch analytics/reporting, Business Intelligence, Data Visualizations, Executive Reporting, Enterprise Business Analytics | Machine Learning, Data Wrangling, Predictive Analytics, Data Discovery, Data Profiling, Data Science, Pre-Data Warehouse ingestion analysis |
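
The Schema-on-Write versus Schema-on-Read row in the table is probably the easiest one to show in code. The sketch below is only an illustration under stated assumptions: Python’s built-in `sqlite3` stands in for a data warehouse, a plain list of JSON strings stands in for objects in Data Lake storage, and the `sales` table, the sample records, and the `read_with_schema` helper are all hypothetical.

```python
import json
import sqlite3

# --- Schema-on-write: the structure is declared before any data is loaded ---
conn = sqlite3.connect(":memory:")  # sqlite3 is only a stand-in for a warehouse
conn.execute("CREATE TABLE sales (region TEXT NOT NULL, amount INTEGER NOT NULL)")
conn.execute("INSERT INTO sales VALUES (?, ?)", ("East", 100))

try:
    # A record that violates the declared schema is rejected at write time.
    conn.execute("INSERT INTO sales VALUES (?, ?)", ("West", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

warehouse_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# --- Schema-on-read: raw objects are stored as-is, structure is applied later ---
raw_objects = [
    '{"region": "East", "amount": 100}',
    '{"region": "West"}',                   # missing field, stored without complaint
    '{"region": "North", "amount": "75"}',  # amount arrives as a string
]


def read_with_schema(raw: str) -> dict:
    """Apply a structure at query time; the defaults are the reader's choice."""
    record = json.loads(raw)
    return {
        "region": str(record.get("region", "unknown")),
        "amount": int(record.get("amount", 0)),
    }


records = [read_with_schema(raw) for raw in raw_objects]
lake_total = sum(record["amount"] for record in records)

print(rejected, warehouse_total, lake_total)
```

In the first half, the bad row never gets in, which is where the confidence in curated warehouse data comes from. In the second half, everything gets in, and each reader decides how to treat missing or oddly typed fields, which is flexible but also why results are harder to predict.
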
Please comment and leave feedback on any additional characteristics you think distinguish a typical Data Lake object storage system from a Data Warehouse, and we’ll seek to add them to the list.