Version your datasets with Data Version Control (DVC) and Git
By Grégor JOUET
Sep 3, 2020
- Categories
- Data Science
- DevOps & SRE
- Tags
- DevOps
- Infrastructure
- Operation
- Git
- GitOps
- SCM
Using a Version Control System such as Git for source code is a good practice and an industry standard. As projects focus more and more on data, shouldn't we adopt a similar approach for it, namely data versioning?
The Data Version Control (DVC) project aims at bringing Git to projects that use a lot of data. In such projects, you often find a link of some sort to download the data, or sometimes Git LFS is used. However, doing so has a major drawback: the data itself is not tracked by Git, which means that another tool is required to track the changes to the data, and more often than not, this tool is a spreadsheet.
Data-intensive projects, such as deep learning projects, strongly rely on the quality of the dataset to produce good results. It is fair to say that the data is often more important than the model processing it. Having a way to track the versions of and changes to your data, and to exchange it with your colleagues without resorting to a .tar.gz archive, sounds like a good idea. And considering that Git is the most widely used versioning system, coupling Git with a data version control system sounds like an even better idea.
This way, when our dataset version is tracked, we can later use MLflow to link the accuracy of a model to the version of the dataset that we used.
DVC Usage
DVC usage is very similar to Git: the commands look almost the same. Basically, DVC allows you to choose files in your repository and push them to an alternative remote storage, which can be S3, HDFS, SSH, and more. All of these are more suitable for large files than Git. To install DVC, simply use pip with the command sudo pip install dvc.
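As a quick sanity check after installation, DVC can print its own version and environment details:
dvc version   # confirms the CLI is installed and on the PATH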
A DVC repository should first be tracked by an SCM tool: the SCM tool tracks the index files from DVC, which indicate where to get the actual files that are too big to put directly in Git. As a reminder, SCM stands for Source Control Management. It is a family of tools that includes Git, SVN and Mercurial. If you are only using Git, you can replace the term SCM with Git.
So first, initialize a Git repo, as usual: git init. Then, initialize a DVC repo in the same folder: dvc init.
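Putting the two together, a minimal initialization could look like this (the commit message is just an example; dvc init stages its internal files such as .dvc/config so that a plain commit records them):
git init
dvc init
git commit -m "Initialize DVC"   # records DVC's internal files in Git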
This way your folder is tracked by both Git and DVC. You can still use Git to track the changes to your project as usual, but the files you choose to put under DVC are removed from Git, added to the .gitignore and managed by DVC.
DVC will create a .dvc file containing information about your file. For example, if you have a file dataset, the corresponding dataset.dvc would look like:
md5: a33606741514b870f609eb1510d8c6cf
outs:
- md5: b2455b259b1c3b5d86eac7dfbb3bbe6d
path: dataset
cache: true
metric: false
persist: false
This file describes a file called dataset which is present on the remote (or in the local DVC cache) and available for checkout, just as in Git. The md5 hash is used for version control and indicates which version of the file should be pulled and used in the project.
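You can verify this yourself: the hash under outs in dataset.dvc is simply the MD5 checksum of the file (the output below is illustrative):
md5sum dataset
b2455b259b1c3b5d86eac7dfbb3bbe6d  dataset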
The .dvc files should be committed to the SCM, as DVC is not an SCM tool: DVC has commits, but they are not linked to each other. Commits do not have a parent; this is handled by the SCM software. In the example above, the md5 hash is not tied to any other DVC hash. Hashes are independent and tracked by the SCM software.
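This split is what makes going back in time possible: Git moves the .dvc pointer, and DVC materializes the matching data. A minimal sketch, assuming the dataset.dvc file from above and at least two commits of history:
git checkout HEAD~1 dataset.dvc   # restore the previous pointer from Git history
dvc checkout dataset.dvc          # fetch the matching version of the data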
The commands to commit, push and pull from a remote storage like S3 or HDFS are very simple:
- Use dvc add to track a file with DVC (and automatically append it to the .gitignore), which will create the associated .dvc file.
- Use dvc commit to commit the changes to the DVC local cache.
- Use dvc pull/push to receive and checkout, or to send, the files to the remote.
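Between these steps, dvc status shows where your data stands. A small sketch, assuming a default remote is configured:
dvc status      # compare the workspace against the .dvc files
dvc status -c   # compare the local cache against the default remote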
So when cloning a Data Science project, simply use the following sequence to clone the project and get the associated large files such as datasets:
git clone <url> project
cd project
dvc pull
Use this sequence to add and push the file dataset.zip to DVC:
dvc add dataset.zip
dvc commit
git add dataset.zip.dvc .gitignore
git commit -m "[DVC] Move dataset to dvc"
dvc push
git push
Using DVC with MLflow
DVC can be used with other Data Science tools to make the most of it. One of those tools is MLflow, which is used to track the efficiency of Machine Learning models.
MLflow can be used with a lot of Data Science frameworks: Tensorflow, Pytorch, Spark… The interesting thing is that MLflow's runs can be tagged with the Git commit hash. Before, only the code would go on Git and the dataset information would typically go in some Excel files, passed from department to department. Now, with DVC, the dataset is integrated in Git. You can keep track of the modifications made to it in the Git commit messages and have a branch dedicated to the dataset tracking. You can also see the influence of each modification on the model's performance with MLflow. Because DVC has nothing to do with the code of the project, it is very easy to add to an existing project.
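For instance, with an MLflow Project (the MLproject file and the epochs parameter below are hypothetical), a run launched from the repository is tagged with the current Git commit, which in turn pins the .dvc files and therefore the dataset version:
dvc pull                    # fetch the dataset version pinned by the current commit
mlflow run . -P epochs=10   # MLflow records the Git commit of this run
git rev-parse HEAD          # the hash you will find attached to the run in MLflow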
We use MLflow here as an example, but any other framework using an SCM tool would work as well.
DVC storage caveat
The main problem with DVC is the initial configuration of the remote, especially with HDFS, which requires some configuration to have a usable client. One of the easiest ways to set up DVC is to use an S3 bucket:
dvc remote add myremote s3://bucket/path
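You can then make it the default remote, so that later push and pull commands use it implicitly:
dvc remote default myremote   # subsequent dvc push/pull target this remote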
But DVC can also be used with many other remotes, like Google Drive:
dvc remote add myremote gdrive://root/my-dvc-root
dvc remote modify myremote gdrive_client_id my_gdrive_client_id
dvc remote modify myremote gdrive_client_secret gdrive_client_secret
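And for HDFS, mentioned above as the trickiest to configure, the remote declaration itself is just as short (the URL below is a placeholder); the hard part is having a working HDFS client on the machine:
dvc remote add myremote hdfs://user@example.com/path/to/dir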
More information on remote storages is available on their website.
Also be aware that DVC has a local cache for the files being tracked. The files tracked by DVC can be hardlinked to the cache: keep this in mind if you decide to remove DVC from your project. More information about the DVC cache structure is available on their website.
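Before removing DVC, you can check the link strategy and turn the links back into real copies. A minimal sketch, assuming the tracked file dataset from the earlier example:
dvc config cache.type   # show the current link type, e.g. hardlink
dvc unprotect dataset   # replace the link with a real copy of the file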