- Type: Webinar
- Location: Online Event
- Date: 26-10-2022
A data lake is an architectural pattern rather than a specific platform: it is built around big data repositories and a schema-on-read approach. A data lake stores large amounts of raw, often unstructured data in object storage such as Amazon S3 without structuring it in advance, while retaining the flexibility to apply ETL or ELT transformations later as needs evolve. This makes it ideal for companies that need to analyze constantly changing data or very large data sets.
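To make the schema-on-read idea concrete, here is a minimal Python sketch that reads raw, newline-delimited JSON events straight from S3 and shapes them only at query time. The bucket name, prefix, and field names are hypothetical, and the boto3 client assumes AWS credentials are already configured.

```python
import json

import boto3  # assumes AWS credentials are configured

# Hypothetical bucket and prefix; events were written as-is, with no schema.
BUCKET = "my-data-lake"
PREFIX = "events/2022/10/"

s3 = boto3.client("s3")

def read_events(bucket: str, prefix: str):
    """Schema-on-read: parse and shape raw objects only when they are queried."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            for line in body.iter_lines():  # newline-delimited JSON
                record = json.loads(line)
                # The "schema" is applied here, on read, not on write:
                yield {
                    "user_id": record.get("user_id"),        # tolerate missing fields
                    "event_type": record.get("event_type"),
                    "ts": record.get("timestamp"),
                }

for event in read_events(BUCKET, PREFIX):
    print(event)
```

Because the projection lives in the reader rather than the writer, new fields or sources can land in the bucket at any time without breaking ingestion; only the queries that care about them need to change.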
A data lake architecture is simply the combination of tools used to build and operationalize this approach: event processing tools, ingestion and transformation pipelines, and analytics and query engines. As the examples below show, there are many different combinations of these tools you can use to build your data lake, depending on the specific skills and tools available in your organization.
Design Principles and Best Practices for Building Data Lakes
Schema Visibility: Ingested data should be understood in terms of its schema, sparsely populated fields, and metadata properties for each data source. Gaining this visibility on read, rather than trying to enforce it on write, helps you build your ETL pipeline on the most accurate and available data, avoiding many problems later; a profiling sketch follows below.
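As a rough illustration of what that visibility could look like in practice, the following Python sketch profiles a batch of ingested records and reports each field's observed types and sparsity. The sample records and field names are invented for the example; a real pipeline would run this against data read from the lake.

```python
from collections import Counter, defaultdict

def profile_source(records):
    """Report observed fields, inferred types, and sparsity for one data source."""
    total = 0
    field_counts = Counter()
    field_types = defaultdict(set)
    for record in records:  # each record is a parsed JSON event (a dict)
        total += 1
        for field, value in record.items():
            field_counts[field] += 1
            field_types[field].add(type(value).__name__)
    for field, count in field_counts.most_common():
        sparsity = 1 - count / total
        print(f"{field}: types={sorted(field_types[field])}, "
              f"present in {count}/{total} records (sparsity {sparsity:.0%})")

# Toy sample standing in for newly ingested events:
sample = [
    {"user_id": 1, "event_type": "click", "referrer": "news"},
    {"user_id": 2, "event_type": "view"},
    {"user_id": 3, "event_type": "click", "referrer": None},
]
profile_source(sample)
```

Running a profile like this per source surfaces sparse fields (here, `referrer` appears in only two of three records) and type drift before an ETL job hard-codes assumptions about them.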