Data versioning and reproducible ML with DVC and MLflow
Sep 30, 2020
- Categories
- Data Science
- DevOps & SRE
- Events
- Tags
- Data Engineering
- Databricks
- Delta Lake
- Git
- Machine Learning
- MLflow
- Storage
Our talk on data versioning and reproducible Machine Learning, proposed to the Data + AI Summit (formerly known as Spark+AI), has been accepted. The summit will take place online on November 17-19. Registration is now open: register and join our talk. The discussion builds upon two previous articles on how to use DVC for data versioning and how to reproduce Data Science experiments with MLflow. Below is our submitted proposal.
Machine Learning development involves comparing models and storing the artifacts they produce. We often compare several algorithms to select the most efficient ones, and we assess different hyper-parameters to fine-tune the model. Git, the standard code versioning tool in software development, helps us store multiple versions of our code. Additionally, we need to keep track of the datasets we are using. This is important not only for audit purposes but also for assessing the performance of models developed at a later time. Git can be used to store datasets, but it is not an optimal solution for big files.
An alternative solution is to use Data Version Control (DVC). Despite its name, it is not just a data versioning tool, but also enables model and pipeline tracking. It runs on top of Git, which makes it easy to learn for Git users. At the same time, it overcomes the limitations of storing big files by storing them remotely (e.g. Azure, S3) and keeping in Git only their metadata.
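To make the idea of "big files stored remotely, only metadata in Git" concrete, here is a minimal sketch in plain Python. It is not DVC's implementation or file format: the function names `track_file` and `restore_file`, the `.pointer.json` suffix, and the use of SHA-256 are all assumptions made for illustration, and a local directory stands in for remote storage such as Azure or S3.

```python
import hashlib
import json
import shutil
from pathlib import Path


def track_file(data_path: Path, remote_dir: Path) -> Path:
    """Copy a data file to the 'remote' under its content hash and
    write a small pointer file that could be committed to Git instead."""
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    remote_dir.mkdir(parents=True, exist_ok=True)
    # The stored copy is named after its content, so identical data is stored once.
    shutil.copyfile(data_path, remote_dir / digest)
    # Hypothetical pointer format: just enough metadata to find the data again.
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    pointer.write_text(json.dumps({
        "path": data_path.name,
        "sha256": digest,
        "size": data_path.stat().st_size,
    }))
    return pointer


def restore_file(pointer: Path, remote_dir: Path) -> Path:
    """Recreate the data file next to its pointer from the 'remote' copy."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(remote_dir / meta["sha256"], target)
    return target
```

Checking out an older Git commit would bring back an older pointer file, and `restore_file` would then fetch the matching version of the dataset. That is the property that ties a code version to a data version.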
MLflow is a tool that is easily integrated with the code of your model and can track dependencies, model parameters, metrics, and artifacts. Every run is linked with its corresponding Git commit. Once the model is trained, MLflow can package it in different flavors (e.g. Python/R function, H2O, Spark, TensorFlow…) ready to be deployed. Like MLflow, DVC works alongside Git. While MLflow helps you manage the Machine Learning lifecycle, DVC helps you manage your datasets.
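As a rough illustration of how tracking fits into training code, the sketch below logs the parameters and the resulting metric of a toy model using MLflow's tracking functions `mlflow.start_run`, `mlflow.log_param`, and `mlflow.log_metric`. The toy model and the names `train_toy_model` and `log_run` are assumptions made for this example, not part of the talk's sample code.

```python
def train_toy_model(learning_rate: float, epochs: int) -> dict:
    """Stand-in for real training: minimise (w - 3)^2 by gradient descent."""
    w = 0.0
    for _ in range(epochs):
        w -= learning_rate * 2 * (w - 3)  # gradient of (w - 3)^2 is 2 * (w - 3)
    return {"weight": w, "loss": (w - 3) ** 2}


def log_run(learning_rate: float, epochs: int) -> dict:
    """Train the toy model and record its parameters and loss as one MLflow run."""
    # Imported here so the toy model above stays usable without MLflow installed.
    import mlflow

    with mlflow.start_run():
        mlflow.log_param("learning_rate", learning_rate)
        mlflow.log_param("epochs", epochs)
        result = train_toy_model(learning_rate, epochs)
        mlflow.log_metric("loss", result["loss"])
    return result
```

Calling `log_run(0.1, 50)` several times with different values would produce one run per call, which can then be compared side by side. By default MLflow writes these runs to a local `mlruns` directory unless a tracking server is configured.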
In this tutorial, we will learn how to leverage the capabilities of these tools. We will go through a toy ML project and look at sample code showing how to increase the reproducibility of individual steps.