Sharing is caring: Efficient Data Exchange with pyarrow

Track:
Data Engineering and MLOps
Type:
Talk (long session)
Level:
advanced
Duration:
45 minutes

Abstract

Apache Arrow was designed with multiple goals in mind, one of the most important being the ability to exchange data between systems efficiently. In this talk we will explore what that really means and how the Arrow project's data exchange capabilities have evolved over the years.

We will cover how to share Arrow data in-process using the C Data Interface, C Device Interface, and C Stream Interface, along with the Arrow PyCapsule Interface. We will show examples of how popular dataframe libraries (pandas, polars) use those exchange methods.

We will also give an overview of the Inter-Process Communication (IPC) protocol used to share Arrow data between processes, and of how to build your own network exchange on top of the Arrow format with Flight RPC. These overviews will be accompanied by Python examples.

By the end of the session, attendees will have a clear understanding of how pyarrow can be used to exchange data faster within and between their data applications. We will provide examples of each exchange method and share our tips on when to use which.