Computer Vision Data Version Control and Reproducibility at Scale

Track:
Machine Learning: Research & Applications
Type:
Talk (long session)
Level:
intermediate
Duration:
45 minutes

Abstract

Petabytes of unstructured data are the foundation on which successful Machine Learning (ML) models are built. A common workflow is for researchers to copy subsets of that data into their local environments for model training. This supports iterative experimentation, but it also makes data management inefficient when developing ML models: reproducibility suffers, data transfer is slow, and local compute power is limited.

Data version control technologies help computer vision researchers overcome these challenges. In this session we'll cover:

  • How to use open source tooling to version control your data when working with it locally (see the first sketch after this list).
  • Best practices that remove the need to copy data locally while enabling model training at scale directly on the cloud (see the second sketch after this list). This will be demoed with an OSS stack:
      • LangChain
      • TensorFlow
      • PyTorch
      • Keras
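
The first point could look something like the following minimal sketch. It assumes DVC as the open source versioning tool (the proposal does not name a specific tool), and the repository URL, file path, and data tag are hypothetical placeholders.

# Minimal sketch of versioned data access, assuming DVC (dvc.api) as the
# open source tool. Repo URL, path, and tag are hypothetical placeholders.
import dvc.api
from PIL import Image

REPO = "https://github.com/example/cv-dataset"  # hypothetical Git repo tracked with DVC

# Open one image exactly as it existed at the "v1.0" data tag,
# without copying the whole dataset locally.
with dvc.api.open("data/images/0001.jpg", repo=REPO, rev="v1.0", mode="rb") as f:
    image = Image.open(f)
    image.load()  # force the read while the file handle is still open

print(image.size)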
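
The second point, training against remote data instead of a local copy, could look roughly like the PyTorch sketch below. The URLs, labels, and model are hypothetical placeholders, not the stack used in the talk demo; it assumes the images are reachable over HTTPS.

# Minimal sketch of training directly on remote data, assuming images are
# reachable over HTTPS. URLs, labels, and model are hypothetical placeholders.
import io

import requests
import torch
from PIL import Image
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

class RemoteImageDataset(Dataset):
    """Fetches each image lazily from remote storage at access time."""

    def __init__(self, samples):
        self.samples = samples  # list of (url, label) pairs
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        url, label = self.samples[idx]
        raw = requests.get(url, timeout=30).content  # fetch one item, no bulk copy
        image = Image.open(io.BytesIO(raw)).convert("RGB")
        return self.transform(image), label

samples = [("https://example.com/data/0001.jpg", 0)]  # placeholder manifest
loader = DataLoader(RemoteImageDataset(samples), batch_size=32, num_workers=4)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()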

You will come away with practical methods for managing data while developing and iterating on Machine Learning models for modern computer vision research.