Going beyond Parquet: Handling large-scale datasets with Iceberg and Dask

Track:
Data Engineering and MLOps
Type:
Talk
Level:
Intermediate
Duration:
30 minutes

Abstract

Apache Iceberg is a high-performance open table format for huge analytic tables. It brings the reliability and simplicity of SQL tables to big data. Originally built at Netflix to handle datasets at the petabyte scale, Iceberg is becoming the de facto open standard for storing tabular datasets. Its adoption can be seen across the industry, from Databricks' acquisition of Tabular to the release of Snowflake's Iceberg Tables and Amazon's S3 Tables.

In this talk, we start by looking at Iceberg's storage model and how it enables reliable, concurrent data operations on top of Parquet files. We then turn to Dask's DataFrame API, which lets users scale pandas workloads to clusters of machines in a pure Python environment, and discuss how it integrates with Apache Iceberg for ACID guarantees and high-performance queries. Equipped with this understanding, we investigate more advanced Iceberg features such as hidden partitioning and time travel. Throughout the talk, we compare Dask workloads executed on Iceberg tables against the same workloads on plain Parquet files to gain a practical understanding of Iceberg's benefits.
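
To make this concrete, the snippet below is a minimal sketch of the kind of workflow the talk covers: loading an Iceberg table with PyIceberg, letting Iceberg's metadata prune the underlying Parquet files, and handing the result to Dask. The catalog name "demo", the table "analytics.events", and the filter column are hypothetical placeholders, and the sketch deliberately ignores details such as Iceberg delete files.

import dask.dataframe as dd
from pyiceberg.catalog import load_catalog

# Load the Iceberg table through its catalog; the catalog resolves the current
# table metadata (schema, snapshots, partition spec).
catalog = load_catalog("demo")                    # hypothetical catalog name
table = catalog.load_table("analytics.events")    # hypothetical table name

# Plan the scan: Iceberg prunes manifests and data files using its metadata,
# so only Parquet files that can match the filter are returned.
scan = table.scan(row_filter="event_date >= '2024-01-01'")
parquet_files = [task.file.file_path for task in scan.plan_files()]

# Hand the pruned file list to Dask and continue with the familiar
# pandas-style DataFrame API, now scaled across a cluster of machines.
ddf = dd.read_parquet(parquet_files)
print(ddf.groupby("event_type").size().compute())

# Time travel: scanning an older snapshot id reads the table as of that commit
# (the snapshot id here is a placeholder).
old_scan = table.scan(snapshot_id=1234567890123456789)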