Computer Vision Data Version Control and Reproducibility at Scale
- Track: Machine Learning: Research & Applications
- Type: Talk (long session)
- Level: Intermediate
- Duration: 45 minutes
Abstract
Petabytes of unstructured data are the foundation on which successful Machine Learning (ML) models are built. A common workflow is for researchers to copy subsets of that data to their local environments for model training. This allows for iterative experimentation, but it also introduces data management challenges when developing ML models: limited reproducibility, inefficient data transfer, and constrained local compute.
Data version control technologies can help computer vision researchers overcome these challenges. In this workshop we'll cover:
- How to use open source tooling to version control your data when working with data locally.
- Best practices for working with data that avoid copying it locally, enabling models to be trained at scale directly against cloud storage. This will be demoed with an OSS stack:
- LangChain
- TensorFlow
- PyTorch
- Keras
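To make the first bullet concrete: a minimal sketch of the content-addressing idea behind open source data versioning tools (such as DVC). The function names and directory layout here are illustrative, not any particular tool's API; the point is that only a small manifest of hashes needs to live in git, while the data itself sits deduplicated in a cache.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return a content hash that uniquely identifies this file version."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(data_dir: Path, cache_dir: Path) -> dict:
    """Copy each file into a content-addressed cache and return a manifest
    mapping relative path -> hash. Committing only the manifest to git
    versions the dataset without storing large binaries in git itself."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for p in sorted(data_dir.rglob("*")):
        if p.is_file():
            digest = hash_file(p)
            target = cache_dir / digest
            if not target.exists():  # dedup: identical content is stored once
                target.write_bytes(p.read_bytes())
            manifest[str(p.relative_to(data_dir))] = digest
    return manifest
```

Re-running `snapshot` after editing one image stores only the changed file again; unchanged files resolve to hashes already in the cache, which is why switching between dataset versions stays cheap.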
You will come away with practical methods to improve your data management when developing and iterating on Machine Learning models for modern computer vision research.
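The "train directly on the cloud" pattern from the outline can be sketched as lazy batch streaming: samples are fetched on demand rather than copied to local disk up front. The `fetch` callable here is a placeholder assumption standing in for a cloud read (e.g. an S3 or HTTP GET); everything else is plain Python.

```python
from typing import Callable, Iterable, Iterator, List


def stream_batches(
    keys: Iterable[str],
    fetch: Callable[[str], bytes],
    batch_size: int = 32,
) -> Iterator[List[bytes]]:
    """Yield batches of raw samples lazily. Each object is fetched only
    when its batch is consumed, so the full dataset is never materialized
    locally -- the core idea behind training against cloud storage."""
    batch: List[bytes] = []
    for key in keys:
        batch.append(fetch(key))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield batch
```

A training loop would wrap this generator (for example, inside a PyTorch `IterableDataset`) and decode each batch into tensors, keeping peak local storage at one batch rather than one dataset.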