What is a Data Lake
- A data lake is a central repository that holds large volumes of data from many different sources
- The data can be structured, semi-structured, or unstructured
- The goal is to ingest data as quickly as possible and make it available for use right away (see the ingestion sketch after this list)
- Used extensively for machine learning and analytical solutions
- Has to be secure and able to scale
- Storage should be inexpensive so that as much data as possible can be kept
- The idea is to store as much data as possible now so that others can make use of it later
- Supports R&D on data products
- The structure of the data cannot always be defined up front
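A minimal sketch of what raw ingestion into an object-store-backed lake can look like, assuming an S3 bucket named `example-data-lake` and a date-partitioned `raw/` prefix (both illustrative, not part of these notes). The point is that the file is stored as-is and the structure is decided later:

```python
# Minimal sketch of raw ingestion into an S3-backed data lake.
# Bucket name, prefix layout, and file paths are illustrative assumptions.
import datetime
import boto3

s3 = boto3.client("s3")

def ingest_raw(local_path: str, source: str, bucket: str = "example-data-lake") -> str:
    """Upload a file as-is into a date-partitioned 'raw' zone; no schema is enforced."""
    today = datetime.date.today()
    filename = local_path.split("/")[-1]
    key = f"raw/{source}/year={today.year}/month={today.month:02d}/day={today.day:02d}/{filename}"
    s3.upload_file(local_path, bucket, key)  # store the data untouched; structure is applied later
    return key

# Example: land a clickstream export exactly as received
# ingest_raw("exports/clickstream.json", source="web")
```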
Data Lake vs Data Warehouse
Data Lake | Data Warehouse |
---|---|
Data is unstructured | Data is structured |
Used by Data Scientists and Data Analysts | Used by Business Analysts |
Stores data on the scale of petabytes | Data size is generally smaller |
Stream processing, machine learning, and real-time analysis | Batch processing, Business Intelligence, and reporting |
Data is undefined, with no relations between data sets | Contains historic, relational data |
Gotchas of Data Lake
- Starts with good intentions, but soon turns into a data swamp that is very hard to make useful
- No versioning: incompatible schemas and different file types across data sets
- No metadata associated with the data, so its usefulness is unclear (see the sketch after this list)
- Joins between different data sets are not possible
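One common mitigation for the metadata gotcha is to attach ownership and schema-version information at write time. A minimal sketch, assuming the same illustrative `example-data-lake` bucket and made-up metadata keys (`owner`, `schema-version`):

```python
# Minimal sketch of attaching metadata at write time so the lake stays searchable.
# Bucket, key, and metadata field names are illustrative assumptions, not a standard.
import json
import boto3

s3 = boto3.client("s3")

def write_with_metadata(bucket: str, key: str, payload: dict, owner: str, schema_version: str) -> None:
    """Store the object together with owner and schema-version metadata."""
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
        Metadata={                      # object metadata answers "what is this and who owns it?"
            "owner": owner,
            "schema-version": schema_version,
        },
    )

# write_with_metadata("example-data-lake", "raw/web/clicks.json",
#                     {"user": 1, "page": "/home"}, owner="web-team", schema_version="2")
```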
ELT vs. ETL
ELT | ETL |
---|---|
Extract, Load, and Transform | Extract, Transform, and Load |
Used for large amounts of data | Used for smaller amounts of data |
Data lake solution | Data warehouse solution |
Schema on read (sketched after the table) | Schema on write |
The raw data is written first, and the schema is determined when the data is read | The schema and relationships are defined first, and then the data is loaded |
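A minimal sketch of the schema-on-read vs. schema-on-write distinction, assuming a hypothetical `raw/events.json` file with `user_id`, `page`, and `ts` fields and a local SQLite database standing in for the warehouse:

```python
# Sketch: schema on read (ELT) vs. schema on write (ETL).
# File name, column names, and the SQLite "warehouse" are illustrative assumptions.
import sqlite3
import pandas as pd

# Schema on read (ELT): the raw file was landed as-is; structure and types
# are only decided at the moment the data is read and transformed.
raw = pd.read_json("raw/events.json", lines=True)   # newline-delimited JSON, types inferred here
events = raw.astype({"user_id": "int64"})            # transform after loading

# Schema on write (ETL): the table schema is defined first, and only
# rows that conform to it are loaded.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, page TEXT, ts TEXT)")
events[["user_id", "page", "ts"]].to_sql("events", conn, if_exists="append", index=False)
conn.commit()
```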
Alternatives to components (S3/HDFS, Redshift, Snowflake, etc.)
- GCP -> Cloud Storage
- AWS -> S3
- Azure -> Azure Blob Storage