We’ve been implementing, discussing, designing, and generally talking about Data Lakes, Lambda Architectures, and Object Storage since before it was cool. The idea that you can treat all of your organization’s data as an asset – the new oil – is amazing when you think about the potential. The term Data Lake, in its earliest incarnations, described a place to store, load, or otherwise offload data and artifacts of all kinds: CSV, JSON, media files (e.g., MP4, MPG), and any other type of file that could be parsed, referenced, or accessed at a later date. These artifacts are just ‘objects’, hence the term object storage, regardless of which vendor’s platform (e.g., AWS S3, Alibaba OSS) provides it. The Hadoop Distributed File System (HDFS) drove the proliferation of distributed storage. Cloud vendors then made distributed object storage inexpensive, and made accessing those objects convenient and reasonably secure.
The proliferation of the term ‘Data Lake’ has made it seem like a panacea for all things analytics, reporting, data science, and machine learning. But we all realize (or at least should realize) that most references to a ‘Data Lake’ simply refer to the object storage and the activity that surrounds it. Unfortunately, in the early days, cloud vendor marketing, among others transitioning from other areas of technology, often compared the ‘Data Lake’ to the ‘Data Warehouse’, usually in the context of ‘Big Data’, which at the time mainly meant Hadoop and the activity around Apache Hive, MapReduce, and the like. These days, most large-scale data processing via MapReduce, Spark, etc. can be done directly against data in HDFS or in other object storage systems such as Google Cloud Storage and AWS S3.
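To make that concrete, here is a minimal sketch of what processing data directly in object storage looks like with Spark. The bucket name, path, and column are hypothetical, and reading via s3a:// assumes the hadoop-aws connector and AWS credentials are already configured in your environment:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-storage-demo").getOrCreate()

# Read CSV objects straight out of S3 -- no load into a database first.
# "my-data-lake" and the path are placeholders for this sketch.
events = (spark.read
          .option("header", "true")
          .csv("s3a://my-data-lake/landing/events/*.csv"))

# Aggregate in place, exactly as you would against a database table.
events.groupBy("event_type").count().show()
```

The same code works against HDFS or Google Cloud Storage by swapping the URI scheme (hdfs://, gs://); the compute engine, not the storage layer, is what changes the economics.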
So, in order to achieve any material value from Data Lake Storage, you need additional systems to leverage that object storage. The aforementioned MapReduce is one of those systems; Spark is another. And since the object storage can contain just about any type of data or file, amazing systems such as Apache Drill and Apache Arrow give you the ability to query files quickly, from almost anywhere on the globe, using familiar SQL, without having to ELT or otherwise load the files into a relational database. Apache Drill even gives you the ability to join a CSV or JSON file with a relational database table as if both lived in the same relational data store, as the sketch below shows. This has amazing implications, which we’ll talk about in the future, but for now just remember that a ‘Data Lake’ – that is, ‘Data Lake Storage’ – needs some other system or program to sit on top of it in order to surface any value or insights. Otherwise it is, albeit necessary and amazing, just a dumping ground for files and stuff.
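Here is a rough sketch of that file-to-database join through Drill’s SQL interface, using the pydrill client against a local Drill instance. The storage plugin names (dfs, pg), the file path, and the table are all hypothetical placeholders, and named CSV columns assume the dfs plugin’s text format is configured with extractHeader enabled:

```python
from pydrill.client import PyDrill

# Connect to Drill's REST API (assumes Drill is running locally on 8047).
drill = PyDrill(host='localhost', port=8047)

# Join a raw CSV file with a relational table as if both were SQL tables.
# 'dfs' and 'pg' are storage-plugin names configured in Drill; the file
# path and the orders table are placeholders for this sketch.
result = drill.query("""
    SELECT c.customer_id, c.name, o.order_total
    FROM dfs.`/data/landing/customers.csv` AS c
    JOIN pg.public.orders AS o
      ON o.customer_id = c.customer_id
""")

for row in result:
    print(row)
```

The point is not the client library but the shape of the query: one side of the join is a flat file sitting in object storage, the other is a live relational table, and neither had to be moved first.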
So, this brings us back to the key point we all need to observe and retain when discussing Big Data, Data Lakes, etc.: ‘Data Lake’ is a generalized term that mainly refers to object storage. This is where other terms such as DataLakeHouse and DataLakeAnalytics come into play. The implied concept is that a system sits on top of the Data Lake Storage and brings some perspective of value to the offloaded data elements/objects. Since those objects are just sitting around waiting for something to happen, building a DataLakeHouse, or a DataLakeAnalytics platform, on top of the object storage is what needs to happen next. While there are several (at this point, many) systems that enable a simple integration with Data Lake Storage, there are important considerations that must be addressed to achieve a solid foundation and best leverage your object storage: everything from security, organization, and repeatability to orchestration and data curation, as the sketch below illustrates for one of these. The DataLakeHouse provides a framework for all these considerations and more, because it focuses on the big picture from ingress to egress, with an opinionated perspective on bringing business value from the data value chain pipeline.
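As one small illustration of the ‘organization’ consideration, a common convention is to partition the object store into zones so that raw, curated, and analytics-ready data never mix. A minimal sketch, with hypothetical bucket, zone, and key names, using the boto3 S3 client:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake"  # placeholder bucket name

# Zone prefixes keep raw ingests separate from curated, query-ready data.
ZONES = {
    "raw": "raw/",             # untouched landing data, write-once
    "curated": "curated/",     # cleaned, conformed, deduplicated
    "analytics": "analytics/"  # aggregated, business-ready outputs
}

def zone_key(zone: str, source: str, filename: str) -> str:
    """Build an object key that encodes the zone and source system."""
    return f"{ZONES[zone]}{source}/{filename}"

# e.g. land a raw ERP extract; a curated version gets promoted later.
s3.upload_file("orders.csv", BUCKET, zone_key("raw", "erp", "orders.csv"))
```

The naming scheme itself is the point: repeatable, predictable object keys are what let downstream orchestration and curation tools find and promote data without guesswork.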
Summary
It seems safe to say that, looking back in about ten years, the term Data Lake will feel much like terms such as ‘Decision Support’ and ‘Data Repository’ do now: most of us were just figuring it out as we went along, with some wins along the way, until a framework came along that provided repeatable processes and conforming approaches – enabling best practices and a pool of resources trained to deliver the same value from project to project. The best intention is always to codify what we learn and adjust terminology as new facts are presented. What we didn’t know then about Data Lake Storage was how rapidly technology would accelerate to meet the challenges of Big Data. It expanded a space in technology that continues to change, and that change also needs to be codified. Such is the cycle of technology.