What is a Data Lake
- A data lake is a central repository that holds large volumes of data from many different sources
- The data can be structured, semi-structured, or unstructured
- The goal is to ingest data as quickly as possible and make it available for use right away (see the ingestion sketch after this list)
- Used extensively for machine learning and analytical solutions
- Has to be secure and able to scale
- Storage should be inexpensive so that as much data as possible can be kept
- The idea is to store as much data as possible now so that others can make use of it later
- Supports R&D on data products
- The structure of the data cannot always be defined up front
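A minimal sketch of what raw ingestion into an object-store-backed lake can look like, assuming an S3 bucket named `example-data-lake` and a date-partitioned `raw/` prefix (both illustrative, not part of these notes). The point is that the file is stored as-is and the structure is decided later:

```python
# Minimal sketch of raw ingestion into an S3-backed data lake.
# Bucket name, prefix layout, and file paths are illustrative assumptions.
import datetime
import boto3

s3 = boto3.client("s3")

def ingest_raw(local_path: str, source: str, bucket: str = "example-data-lake") -> str:
    """Upload a file as-is into a date-partitioned 'raw' zone; no schema is enforced."""
    today = datetime.date.today()
    filename = local_path.split("/")[-1]
    key = f"raw/{source}/year={today.year}/month={today.month:02d}/day={today.day:02d}/{filename}"
    s3.upload_file(local_path, bucket, key)  # store the data untouched; structure is applied later
    return key

# Example: land a clickstream export exactly as received
# ingest_raw("exports/clickstream.json", source="web")
```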
Data Lake vs Data Warehouse
Data Lake | Data Warehouse |
---|---|
Data is unstructured | Data is structured |
Used by Data Scientists and Data Analysts | Used by Business Analysts |
Stores data on the scale of petabytes | Data size is generally smaller |
Stream processing, machine learning, and real-time analysis | Batch processing, Business Intelligence, and reporting |
Data is undefined, with no relations between data sets | Contains historic, relational data |
Gotchas of Data Lake
- Starts with good intentions, but soon turns into a data swamp that is very hard to make useful
- No versioning: incompatible schemas and different file types across data sets
- No metadata associated with the data, so its usefulness is unclear (see the sketch after this list)
- Joins between different data sets are not possible
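One common mitigation for the metadata gotcha is to attach ownership and schema-version information at write time. A minimal sketch, assuming the same illustrative `example-data-lake` bucket and made-up metadata keys (`owner`, `schema-version`):

```python
# Minimal sketch of attaching metadata at write time so the lake stays searchable.
# Bucket, key, and metadata field names are illustrative assumptions, not a standard.
import json
import boto3

s3 = boto3.client("s3")

def write_with_metadata(bucket: str, key: str, payload: dict, owner: str, schema_version: str) -> None:
    """Store the object together with owner and schema-version metadata."""
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
        Metadata={                      # object metadata answers "what is this and who owns it?"
            "owner": owner,
            "schema-version": schema_version,
        },
    )

# write_with_metadata("example-data-lake", "raw/web/clicks.json",
#                     {"user": 1, "page": "/home"}, owner="web-team", schema_version="2")
```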
ELT vs. ETL
ELT | ETL |
---|---|
Extract, Load, and Transform | Extract, Transform, and Load |
Used for large amounts of data | Used for smaller amounts of data |
Data lake solution | Data warehouse solution |
Schema on read (sketched after the table) | Schema on write |
The raw data is written first, and the schema is determined when the data is read | The schema and relationships are defined first, and then the data is loaded |
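A minimal sketch of the schema-on-read vs. schema-on-write distinction, assuming a hypothetical `raw/events.json` file with `user_id`, `page`, and `ts` fields and a local SQLite database standing in for the warehouse:

```python
# Sketch: schema on read (ELT) vs. schema on write (ETL).
# File name, column names, and the SQLite "warehouse" are illustrative assumptions.
import sqlite3
import pandas as pd

# Schema on read (ELT): the raw file was landed as-is; structure and types
# are only decided at the moment the data is read and transformed.
raw = pd.read_json("raw/events.json", lines=True)   # newline-delimited JSON, types inferred here
events = raw.astype({"user_id": "int64"})            # transform after loading

# Schema on write (ETL): the table schema is defined first, and only
# rows that conform to it are loaded.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, page TEXT, ts TEXT)")
events[["user_id", "page", "ts"]].to_sql("events", conn, if_exists="append", index=False)
conn.commit()
```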
Alternatives to components (S3/HDFS, Redshift, Snowflake, etc.)
- GCP -> Cloud Storage
- AWS -> S3
- Azure -> Azure Blob Storage