What is a Data Lake

- A data lake is a central repository that holds big data from many different data sources.
- The data can be structured, semi-structured, or unstructured.
- The goal is to ingest data as quickly as possible and make it available as soon as possible.
- Used extensively for machine learning and analytical solutions.
- Has to be secure and scalable.
- Hardware should be inexpensive so you can store as much data as possible.
- The idea is to store as much data as possible so it can be made available to others, who can make use of it later.
- Supports R&D on data products.
- The structure of the data cannot always be defined up front.

Data Lake vs Data Warehouse

| Data Lake | Data Warehouse |
| --- | --- |
| Data is typically unstructured | Data is structured |
| Used by Data Scientists or Data Analysts | Used by Business Analysts |
| Stores data on the scale of petabytes | Data size is generally small |
| Used for Stream Processing, Machine Learning, and real-time analysis | Used for batch processing, Business Intelligence, and Reporting |
| Data is undefined, with no relations between data | Contains historic and relational data |

Gotchas of Data Lake

- Starts with good intentions, but soon turns into a Data Swamp: very hard to make useful
...
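The ingestion points from the first section (land data quickly, accept any format, define structure later) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `raw/<source>/ingest_date=<day>` folder layout, the function names `ingest_raw` and `read_clicks`, and the sample records are all hypothetical and do not come from any particular data lake product. A local temporary directory stands in for inexpensive storage.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, payload: bytes,
               filename: str, day: date) -> Path:
    """Land a payload in the lake unchanged, partitioned by source and date.

    The folder layout is an assumption made for this sketch.
    """
    target_dir = lake_root / "raw" / source / f"ingest_date={day.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing or validation at write time
    return target


def read_clicks(lake_root: Path) -> list:
    """Apply structure only when reading: the consumer picks the fields it needs."""
    rows = []
    clicks_dir = lake_root / "raw" / "clickstream"
    for path in sorted(clicks_dir.rglob("*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Extra fields in the raw record are ignored, missing ones become None.
            rows.append({"user": record.get("user"), "page": record.get("page")})
    return rows


def demo() -> list:
    """Ingest one semi-structured and one unstructured payload, then read the clicks."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        day = date(2024, 1, 15)
        clicks = (b'{"user": "a", "page": "/home"}\n'
                  b'{"user": "b", "page": "/cart", "ref": "ad"}\n')
        ingest_raw(lake, "clickstream", clicks, "events.jsonl", day)
        ingest_raw(lake, "support", b"free-form ticket text", "ticket-1.txt", day)
        return read_clicks(lake)


if __name__ == "__main__":
    print(demo())
```

Writing is deliberately cheap here: nothing is parsed on the way in, so a free-form text file and a JSON Lines file land the same way. The cost moves to the reader, who has to know what each folder contains, which is one way a lake can drift toward the Data Swamp described above.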