Data Lakehouse Architecture Evolution and Future sciwork 2023

Data Lakehouse Architecture Evolution and Future
2023-12-09, 14:20–14:50 (Asia/Taipei), NYCU

In today's data-driven world, we constantly face the problems of storing large volumes of data and supporting analytics and machine-learning applications. In the past, we used databases, data lakes, or data warehouses to store different kinds of data, including structured, semi-structured, and unstructured data. Although many storage systems and tools now exist to address these problems and scenarios, they still have limitations and shortcomings.

To address these, one concept has been increasingly discussed in recent years: the lakehouse, which combines the advantages of the data lake and the data warehouse into a powerful architecture for implementing the modern data stack. Several mature services and tools implement this concept, including Delta Lake (Databricks), Apache Iceberg, and Apache Hudi.

In this session, I will briefly describe and analyze the concepts, benefits, and drawbacks of the database, data lake, data warehouse, and lakehouse, and introduce some representative services. Finally, I will show a lakehouse demo so that attendees can understand it more concretely.

Brief Description

I will start this talk by sharing the differences between the database, data lake, and data warehouse, along with their respective concepts, benefits, and drawbacks. This introduction should help attendees quickly pick up the basics. Next, I will introduce the architecture and importance of the lakehouse. Then I will present demo code (e.g. Python and Delta Lake) to show how a lakehouse operates, so that attendees can understand and picture it in depth. Finally, I will describe a use case from my past production experience to show how it works in the real world.
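As a rough illustration of the operating principle mentioned above, the following is a minimal sketch in plain Python, not the talk's demo code and not the Delta Lake API. It assumes a simplified model in which a table is a set of immutable data files plus an ordered commit log, loosely inspired by the transaction-log idea used by table formats such as Delta Lake. The class `ToyLakehouseTable`, the `_log` directory, and the `add`/`remove` action names are all hypothetical and chosen only for this example.

```python
import json
import tempfile
import uuid
from pathlib import Path


class ToyLakehouseTable:
    """Toy table: immutable JSON data files plus an ordered commit log.

    Hypothetical illustration only; real table formats store columnar
    files and far richer metadata.
    """

    def __init__(self, root):
        self.root = Path(root)
        self.log_dir = self.root / "_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _versions(self):
        # Commit files are named 00000000.json, 00000001.json, ...
        return sorted(int(p.stem) for p in self.log_dir.glob("*.json"))

    def latest_version(self):
        versions = self._versions()
        return versions[-1] if versions else -1

    def _write_data_file(self, rows):
        # Data files are never modified after they are written.
        name = f"part-{uuid.uuid4().hex}.json"
        (self.root / name).write_text(json.dumps(rows))
        return name

    def _commit(self, actions):
        version = self.latest_version() + 1
        commit_path = self.log_dir / f"{version:08d}.json"
        # Mode "x" fails if the file exists, so a version is claimed only once.
        with open(commit_path, "x") as f:
            json.dump(actions, f)
        return version

    def _active_files(self, version):
        # Replay the log up to `version` to find the files that make up the table.
        active = []
        for v in self._versions():
            if v > version:
                break
            commit = json.loads((self.log_dir / f"{v:08d}.json").read_text())
            for action in commit:
                if "add" in action:
                    active.append(action["add"])
                elif "remove" in action:
                    active.remove(action["remove"])
        return active

    def append(self, rows):
        return self._commit([{"add": self._write_data_file(rows)}])

    def overwrite(self, rows):
        # Old files stay on disk; the log only marks them as removed.
        actions = [{"remove": f} for f in self._active_files(self.latest_version())]
        actions.append({"add": self._write_data_file(rows)})
        return self._commit(actions)

    def read(self, version=None):
        # Reading an older version is a simple form of "time travel".
        if version is None:
            version = self.latest_version()
        rows = []
        for name in self._active_files(version):
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table = ToyLakehouseTable(tmp)
        table.append([{"id": 1, "event": "signup"}])
        v1 = table.append([{"id": 2, "event": "login"}])
        table.overwrite([{"id": 3, "event": "purchase"}])
        print("latest:", table.read())
        print("as of version", v1, ":", table.read(version=v1))
```

The design choice to note is that data files are only ever added, while the log decides which files belong to each version. That is what lets cheap file storage behave more like a warehouse table, with atomic commits and the ability to read earlier versions.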

Beyond lakehouse applications, this talk also introduces additional knowledge and concepts, including batch processing, streaming, and MLOps, so that attendees can learn about these skills and about the important architectures and trends in data engineering today. With a good data architecture and pipeline, a team can build many interesting and innovative applications from data and lighten the load on data team members such as data analysts, data scientists, and machine learning engineers, establishing a healthy data cycle.

Session Outline

Database vs. Data Lake vs. Data Warehouse
- Introducing these concepts.
- Listing their respective benefits and drawbacks.
- What are the pain points of each?
What is the lakehouse, and why?
- Introducing the importance and trend of the lakehouse.
- Real-world data architecture integration.
Representative services
- Databricks - Delta Lake.
- Apache Iceberg.
- Apache Hudi.
Code & Lakehouse Demo
Use case: lakehouse and streaming integration
- Sharing an application architecture that integrates Apache Kafka, Apache Spark Structured Streaming, and Delta Lake.
Future and Conclusion

Reference

Ask questions at slido

Slides

Slides - Data Lakehouse Architecture Evolution and Future

Example code on GitHub

sciwork-2023-conf-lakehouse-example

Prior Knowledge Expected? –

No previous knowledge expected

Language –

Mandarin talk w. English slides

Mars Su

https://www.linkedin.com/in/huei-yuan-su-a16458134/

Work Experience

Trend Micro - Staff Data Engineer (Present)
Gogolook - Sr. Data/ML Engineer
Wavenet - Data Scientist
adGeek - Advertising AI Engineer

Important Experience

PyCon APAC 2022 Speaker
Published book - Apache NiFi 讓你輕鬆設計 Data Pipeline
iThome 2021 AI&Data - Champion

Education

National Taiwan University of Science and Technology - Master's Degree, IM
National Changhua University of Education - Bachelor's Degree, IM
