2023-12-09, 14:20–14:50 (Asia/Taipei), NYCU
In today's data-driven world, we constantly face problems of large-volume data storage, analytics, and machine-learning applications. In the past, we used databases, data lakes, or data warehouses to store different kinds of data, including structured, unstructured, and semi-structured data. Although many storage systems and tools exist to address these problems and scenarios, they still have limitations and shortcomings.
To address these, one concept has been increasingly discussed in recent years: the lakehouse, which combines the advantages of the data lake and the data warehouse into a powerful architecture for implementing the modern data stack. Several mature services and tools implement this concept, including Databricks - Delta Lake, Apache Iceberg, and Apache Hudi.
In this session, I will briefly describe and analyze the concepts, benefits, and drawbacks of the database, data lake, data warehouse, and lakehouse, and introduce some representative services. Finally, I will show a lakehouse demo so that attendees can understand it more concretely.
Brief Description
I will start this talk by sharing the differences between databases, data lakes, and data warehouses, along with their respective concepts, benefits, and drawbacks. This introduction should help attendees quickly pick up the basics. Next, I will introduce the architecture and importance of the lakehouse. Then I will present demo code (e.g. Python and Delta Lake) that shows how a lakehouse operates, so that attendees can understand it in depth and picture it. Finally, I will describe a use case from my past production experience to show how it works in the real world.
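To give a feel for the operating principle before the session, here is a minimal sketch using only the Python standard library. It is not the talk's demo code and not the Delta Lake API or file format. It assumes a toy design loosely inspired by the transaction-log idea: data files are immutable, and an ordered log of JSON commits decides which files belong to each table version. The class `ToyLakehouseTable`, the `_log` directory, and the file naming are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class ToyLakehouseTable:
    """Toy table: immutable data files plus an ordered JSON commit log.

    Hypothetical illustration only; real table formats are far more involved.
    """

    def __init__(self, root):
        self.root = Path(root)
        self.log_dir = self.root / "_log"  # hypothetical log directory name
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _versions(self):
        # Each commit is stored as <version>.json inside the log directory.
        return sorted(int(p.stem) for p in self.log_dir.glob("*.json"))

    def latest_version(self):
        versions = self._versions()
        return versions[-1] if versions else -1

    def live_files(self, version=None):
        """Replay the log up to `version` to find the files that make up the table."""
        if version is None:
            version = self.latest_version()
        live = []
        for v in self._versions():
            if v > version:
                break
            actions = json.loads((self.log_dir / f"{v:05d}.json").read_text())
            for action in actions:
                if "add" in action:
                    live.append(action["add"])
                else:
                    live.remove(action["remove"])
        return live

    def commit(self, rows, overwrite=False):
        """Write a new data file, then publish it with a single log entry."""
        version = self.latest_version() + 1
        data_file = f"part-{version:05d}.json"
        (self.root / data_file).write_text(json.dumps(rows))
        actions = []
        if overwrite:
            # Old files are only marked as removed, so older versions stay readable.
            actions += [{"remove": name} for name in self.live_files()]
        actions.append({"add": data_file})
        # Mode "x" fails if the file already exists, so two writers
        # cannot both claim the same version number.
        with open(self.log_dir / f"{version:05d}.json", "x") as fh:
            json.dump(actions, fh)
        return version

    def read(self, version=None):
        """Read the table as of `version` (latest by default)."""
        rows = []
        for name in self.live_files(version):
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table = ToyLakehouseTable(tmp)
        table.commit([{"id": 1, "event": "login"}])               # version 0
        table.commit([{"id": 2, "event": "click"}])               # version 1
        table.commit([{"id": 3, "event": "logout"}], overwrite=True)  # version 2
        print(table.read())           # latest version: only the overwriting row
        print(table.read(version=1))  # "time travel": the two rows before the overwrite
```

Because readers only trust files listed in the log, a half-written data file is never visible, and reading an older version is just a matter of replaying fewer log entries. That is the warehouse-like behavior layered on top of plain lake storage.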
Beyond lakehouse applications, this talk also introduces additional knowledge and concepts, including batch processing, streaming, and MLOps, so that attendees can learn these skills and understand the important architectures and trends in data engineering today. With a good data architecture and pipeline, a team can build many interesting and innovative applications from data and lighten the load on data team members such as data analysts, data scientists, and machine learning engineers, establishing a healthy data cycle.
Session Outline
- Database vs. Data Lake vs. Data Warehouse
  - Introducing these concepts.
  - Listing their respective benefits and drawbacks.
  - What are the pain points?
- What is the lakehouse, and why use it?
  - Introducing the importance and trend of the lakehouse.
  - Real-world data architecture integration.
- Representative services
  - Databricks - Delta Lake.
  - Apache Iceberg.
  - Apache Hudi.
- Code & Lakehouse Demo
- Use case: lakehouse and streaming integration.
  - Sharing an application architecture that integrates Apache Kafka, Apache Spark Structured Streaming, and Delta Lake.
- Future and Conclusion
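The streaming use case in the outline can be pictured with a small standard-library sketch. It does not use Kafka, Spark, or Delta Lake. It assumes a simplified micro-batch model in which a checkpoint stores the next source offset and batch id, and the sink skips any batch id it has already committed, so a replay after a crash does not duplicate rows. `ToyTopic`, `ToySink`, and `run_micro_batches` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """Stand-in for one topic partition: an append-only list addressed by offset."""
    records: list = field(default_factory=list)

    def poll(self, offset, max_records):
        return self.records[offset:offset + max_records]


@dataclass
class ToySink:
    """Stand-in for a lakehouse table that remembers committed batch ids."""
    rows: list = field(default_factory=list)
    committed_batches: set = field(default_factory=set)

    def write_batch(self, batch_id, rows):
        if batch_id in self.committed_batches:
            return False  # replayed batch: already committed, so skip it
        self.rows.extend(rows)
        self.committed_batches.add(batch_id)
        return True


def run_micro_batches(topic, sink, checkpoint, batch_size=2):
    """Drain the topic in micro-batches, advancing the checkpoint after each write."""
    while True:
        offset = checkpoint["offset"]
        batch = topic.poll(offset, batch_size)
        if not batch:
            return checkpoint
        batch_id = checkpoint["batch_id"]
        # A trivial transformation step standing in for real stream processing.
        transformed = [{"user": r["user"], "event": r["event"].upper()} for r in batch]
        sink.write_batch(batch_id, transformed)
        # The checkpoint moves only after the sink write has returned.
        checkpoint["offset"] = offset + len(batch)
        checkpoint["batch_id"] = batch_id + 1


if __name__ == "__main__":
    topic = ToyTopic([{"user": f"u{i}", "event": "click"} for i in range(5)])
    sink = ToySink()
    checkpoint = run_micro_batches(topic, sink, {"offset": 0, "batch_id": 0})
    print(checkpoint, len(sink.rows))

    # Simulate a crash after the last write but before the checkpoint was saved:
    # the stale checkpoint replays batch 2, and the sink ignores the duplicate.
    run_micro_batches(topic, sink, {"offset": 4, "batch_id": 2})
    print(len(sink.rows))
```

The design choice to note is the split of responsibilities: the checkpoint makes the source replayable, and the sink's record of committed batches makes the replay harmless.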
Reference
- What is a Data Lakehouse?
- Delta Lake Architecture: A Bridge Between Data Lakes & Data Warehouses
- Databricks - Delta Lake
- Apache Iceberg
- Apache Hudi
Slides
Example code on GitHub
No previous knowledge expected
Language: Mandarin talk with English slides
https://www.linkedin.com/in/huei-yuan-su-a16458134/
Work Experience
- Trend Micro - Staff Data Engineer (Present)
- Gogolook - Sr. Data/ML Engineer
- Wavenet - Data Scientist
- adGeek - Advertising AI Engineer
Important Experience
- PyCon APAC 2022 Speaker
- Published book - Apache NiFi 讓你輕鬆設計 Data Pipeline
- iThome 2021 AI & Data - Champion
Education
- National Taiwan University of Science and Technology - Master's Degree, IM
- National Changhua University of Education - Bachelor's Degree, IM