Version your datasets with Data Version Control (DVC) and Git
By Grégor JOUET
Sep 3, 2020
- Categories
- Data Science
- DevOps & SRE
- Tags
- DevOps
- Infrastructure
- Operation
- Git
- GitOps
- SCM
Using a Version Control System such as Git for source code is a good practice and an industry standard. As projects focus more and more on data, shouldn't we adopt a similar approach for it, namely data versioning?
The Data Version Control (DVC) project aims at bringing Git to projects that use a lot of data. In such projects, you often find a link of some sort to download the data, or sometimes Git LFS is used. However, doing so has a major drawback: the data itself is not tracked by Git, which means that another tool is required to track the changes to the data, and more often than not, this tool is a spreadsheet.
Data-intensive projects, such as deep learning projects, strongly rely on the quality of the dataset to produce good results. It is fair to say that the data is often more important than the model processing it. Having a way to track the versions of and changes to your data, and to exchange it with your colleagues without resorting to a .tar.gz archive, sounds like a good idea. And considering that Git is the most widely used versioning system, coupling Git with a data version control system sounds like an even better idea.
This way, when our dataset version is tracked, we can later use MLflow to link the accuracy of a model to the version of the dataset that we used.
DVC Usage
DVC usage is very similar to Git: the commands look almost the same. Basically, DVC allows you to choose files in your repository and push them to an alternative remote storage, which can be S3, HDFS, SSH, and more. All of these are more suitable for large files than Git. To install DVC, simply use pip with the command sudo pip install dvc.
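As a quick sanity check after installation, DVC can print its own version and environment details:
dvc version   # confirms the CLI is installed and on the PATH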
A DVC repository should first be tracked by an SCM tool: the SCM tool tracks the index files from DVC, which indicate where to get the actual files that are too big to put directly in Git. As a reminder, SCM stands for Source Control Management. It is a family of tools that includes Git, SVN and Mercurial. If you are only using Git, you can replace the term SCM with Git.
So first, initialize a Git repo, as usual: git init. Then, initialize a DVC repo in the same folder: dvc init.
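Putting the two together, a minimal initialization could look like this (the commit message is just an example; dvc init stages its internal files such as .dvc/config so that a plain commit records them):
git init
dvc init
git commit -m "Initialize DVC"   # records DVC's internal files in Git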
This way your folder is tracked by both Git and DVC. You can still use Git to track the changes to your project as usual, but the files you choose to put under DVC are removed from Git, added to the .gitignore and managed by DVC.
DVC will create a .dvc file containing information about your file. For example, if you have a file dataset, the corresponding dataset.dvc would look like:
md5: a33606741514b870f609eb1510d8c6cf
outs:
- md5: b2455b259b1c3b5d86eac7dfbb3bbe6d
path: dataset
cache: true
metric: false
persist: false
This file describes a file called dataset which is present on the remote (or in the local DVC cache) and available for checkout, just as in Git. The md5 hash is used for version control and indicates which version of the file should be pulled and used in the project.
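You can verify this yourself: the hash under outs in dataset.dvc is simply the MD5 checksum of the file (the output below is illustrative):
md5sum dataset
b2455b259b1c3b5d86eac7dfbb3bbe6d  dataset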
The .dvc files should be committed to the SCM, as DVC is not an SCM tool: DVC has commits, but they are not linked to each other. Commits do not have a parent; this is handled by the SCM software. In the example above, the md5 hash is not tied to any other DVC hash. Hashes are independent and tracked by the SCM software.
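This split is what makes going back in time possible: Git moves the .dvc pointer, and DVC materializes the matching data. A minimal sketch, assuming the dataset.dvc file from above and at least two commits of history:
git checkout HEAD~1 dataset.dvc   # restore the previous pointer from Git history
dvc checkout dataset.dvc          # fetch the matching version of the data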
The commands to commit, push and pull from a remote storage like S3 or HDFS are very simple:
- Use dvc add to track a file with DVC (and automatically append it to the .gitignore), which will create the associated .dvc file.
- Use dvc commit to commit the changes to the DVC local cache.
- Use dvc pull/push to receive and checkout, or to send, the files to the remote.
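Between these steps, dvc status shows where your data stands. A small sketch, assuming a default remote is configured:
dvc status      # compare the workspace against the .dvc files
dvc status -c   # compare the local cache against the default remote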
So when cloning a Data Science project, simply use the following sequence to clone the project and get the associated large files such as datasets:
git clone <url> project
cd project
dvc pull
Use this sequence to add and push the file dataset.zip to DVC:
dvc add dataset.zip
dvc commit
git add dataset.zip.dvc .gitignore
git commit -m "[DVC] Move dataset to dvc"
dvc push
git push
Using DVC with MLflow
DVC can be used with other Data Science tools to make the most of it. One of those tools is MLflow, which is used to track the efficiency of Machine Learning models.
MLflow can be used with a lot of Data Science frameworks: Tensorflow, Pytorch, Spark… The interesting thing is that MLflow's runs can be tagged with the Git commit hash. Before, only the code would go on Git and the dataset information would typically go in some Excel files, passed from department to department. Now, with DVC, the dataset is integrated in Git. You can keep track of the modifications made to it in the Git commit messages and have a branch dedicated to the dataset tracking. You can also see the influence of each modification on the model's performance with MLflow. Because DVC has nothing to do with the code of the project, it is very easy to add to an existing project.
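For instance, with an MLflow Project (the MLproject file and the epochs parameter below are hypothetical), a run launched from the repository is tagged with the current Git commit, which in turn pins the .dvc files and therefore the dataset version:
dvc pull                    # fetch the dataset version pinned by the current commit
mlflow run . -P epochs=10   # MLflow records the Git commit of this run
git rev-parse HEAD          # the hash you will find attached to the run in MLflow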
We use MLflow here as an example, but any other framework using an SCM tool would work as well.
DVC storage caveat
The main problem with DVC is the initial configuration of the remote, especially with HDFS, which requires some configuration to have a usable client. One of the easiest ways to set up DVC is to use an S3 bucket:
dvc remote add myremote s3://bucket/path
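You can then make it the default remote, so that later push and pull commands use it implicitly:
dvc remote default myremote   # subsequent dvc push/pull target this remote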
But DVC can also be used with many other remotes, like Google Drive:
dvc remote add myremote gdrive://root/my-dvc-root
dvc remote modify myremote gdrive_client_id my_gdrive_client_id
dvc remote modify myremote gdrive_client_secret gdrive_client_secret
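And for HDFS, mentioned above as the trickiest to configure, the remote declaration itself is just as short (the URL below is a placeholder); the hard part is having a working HDFS client on the machine:
dvc remote add myremote hdfs://user@example.com/path/to/dir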
More information on remote storages is available on their website.
Also be aware that DVC has a local cache for the files being tracked. The files tracked by DVC can be hardlinked to the cache: keep this in mind if you decide to remove DVC from your project. More information about the DVC cache structure is available on their website.
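Before removing DVC, you can check the link strategy and turn the links back into real copies. A minimal sketch, assuming the tracked file dataset from the earlier example:
dvc config cache.type   # show the current link type, e.g. hardlink
dvc unprotect dataset   # replace the link with a real copy of the file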