Jun-05 14:40-15:10 in Rhapsody

Data quality plays a crucial role in data engineering to enable efficient and insightful data pipelines at scale. In this session, we will leverage Apache Iceberg as the scalable table format with ACID guarantee, Apache Toree’s interactive computation capabilities and orchestrate the automated data workflow on Apache Airflow. We will start by talking about how iceberg can use its column level statistics stored in metadata for efficient and reliable data quality validation. Following this, we will present a practical example using a Jupyter notebook together with apache toree to customize data audit and analysis step before publication. Finally, we will demonstrate how Apache airflow can integrate sensor and operator to orchestrate workflow at scale. Attendees will equipped with right tooling and knowledge about proactive data quality assurance, backed by power and flexibility of open-source Apache projects.