ML Data Version Control and Reproducibility at Scale
In machine learning (ML), data is the foundation on which successful models are built. However, as ML projects grow to include larger and more complex datasets, managing and versioning that data efficiently becomes increasingly difficult.
Breaking Down Conventional Approaches: The Copy/Paste Predicament
Data scientists commonly copy subsets of data to their local environments for model training. This approach allows for iterative experimentation, but it introduces challenges that slow the progress of ML projects:
Reproducibility Constraints: Copying and modifying data locally provides neither the version control nor the auditability that reproducibility requires. Once a model has been iterated on with several data subsets, it becomes hard to tell which subset produced which result.
Inefficient Data Transfer: Repeatedly moving data between the central repository and local environments costs time and resources, especially when each training run uses a different subset of the data.
Limited Compute Power: Working in a local environment limits access to parallel computing and to distributed processing systems such as Apache Spark.
In this session, we will demonstrate:
- How to use lakeFS to version control your data when working locally.
- How to use lakeFS to train your model at scale directly in the cloud, without copying data locally.
We will be leveraging the following technology stack:
- AWS S3
- Databricks Delta Lake
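As a rough illustration of the second approach, here is a minimal sketch of how a training job might address versioned data in place rather than copying it. It assumes data is read through the lakeFS S3-compatible endpoint, where a path has the form repository/ref/path and the ref can be a branch name or a commit ID. The helper names (lakefs_uri, read_delta) and the repository, branch, and dataset names are hypothetical, chosen only for this example.

```python
def lakefs_uri(repo: str, ref: str, path: str, scheme: str = "s3a") -> str:
    """Build a URI for an object or prefix inside a lakeFS repository.

    Through the lakeFS S3-compatible endpoint, the repository takes the
    place of the bucket and the ref (a branch name or commit ID) is the
    first path component. Pinning `ref` to a commit ID means a training
    run always reads the same data.
    """
    if not repo or not ref:
        raise ValueError("repo and ref must be non-empty")
    return f"{scheme}://{repo}/{ref}/{path.lstrip('/')}"


def read_delta(spark, uri: str):
    """Load a Delta table from the given URI with an existing SparkSession.

    Assumption: the session is already configured to talk to lakeFS, for
    example with spark.hadoop.fs.s3a.endpoint set to the lakeFS server
    and spark.hadoop.fs.s3a.path.style.access set to true, and the Delta
    Lake library is available on the cluster.
    """
    return spark.read.format("delta").load(uri)


if __name__ == "__main__":
    # Hypothetical repository, branch, and dataset path.
    print(lakefs_uri("example-repo", "main", "datasets/train/"))
```

Swapping the branch name for a commit ID in the call above is what ties a given model to an exact snapshot of the data, with no local copy involved.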