💿 DVC (Data Version Control) Integration
This document details the role of DVC in managing and versioning large data files and models within the project, ensuring reproducibility and collaboration.
📝 1. Overview and Purpose
DVC (Data Version Control) extends Git's capabilities to handle large files, datasets, and machine learning models. In this project, DVC is used to:
- Version Data and Models: Track changes to datasets and trained models alongside the code, ensuring that every experiment is fully reproducible.
- Enable Reproducibility: Allow switching between different versions of data and models by simply checking out a Git commit and running `dvc pull`.
- Manage Large Files: Keep large files out of the Git repository, storing them efficiently in remote storage (such as AWS S3) while Git tracks only small `.dvc` metadata files.
- Facilitate Collaboration: Provide a streamlined way for team members to work with consistent versions of data and models.
⚙️ 2. DVC Setup and Configuration
2.1. Initial Setup
After initializing a Git repository, DVC is initialized in the project root:
```bash
dvc init
```
The primary configuration involves setting up a remote storage location, which in this project is a Backblaze B2 S3-compatible bucket.
2.2. Dynamic Remote Configuration
For production and containerized environments, this project uses a dynamic approach to configure the DVC remote storage at runtime. This is handled by the docker-entrypoint.sh script and relies on environment variables.
The script executes the following commands to configure the DVC remote:
```bash
# Add the remote storage location
dvc remote add -d -f "$DVC_REMOTE_NAME" s3://"$DVC_BUCKET_NAME"

# Modify the remote with specific credentials and settings
dvc remote modify "$DVC_REMOTE_NAME" endpointurl "$DVC_S3_ENDPOINT_URL"
dvc remote modify --local "$DVC_REMOTE_NAME" region "$DVC_S3_ENDPOINT_REGION"
dvc remote modify --local "$DVC_REMOTE_NAME" access_key_id "$DVC_AWS_ACCESS_KEY_ID"
dvc remote modify --local "$DVC_REMOTE_NAME" secret_access_key "$DVC_AWS_SECRET_ACCESS_KEY"
```
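The same sequence can be mirrored in Python, which is handy for testing the entrypoint logic without a live DVC installation. The helper below is an illustrative sketch, not project code; it only builds the command lines, leaving execution to `subprocess`:

```python
import subprocess


def build_dvc_remote_commands(env):
    """Build the `dvc remote` command lines the entrypoint runs.

    `env` is a mapping containing the DVC_* variables described above.
    Illustrative sketch only; the project does this in docker-entrypoint.sh.
    """
    name = env["DVC_REMOTE_NAME"]
    return [
        ["dvc", "remote", "add", "-d", "-f", name, f"s3://{env['DVC_BUCKET_NAME']}"],
        ["dvc", "remote", "modify", name, "endpointurl", env["DVC_S3_ENDPOINT_URL"]],
        ["dvc", "remote", "modify", "--local", name, "region", env["DVC_S3_ENDPOINT_REGION"]],
        ["dvc", "remote", "modify", "--local", name, "access_key_id", env["DVC_AWS_ACCESS_KEY_ID"]],
        ["dvc", "remote", "modify", "--local", name, "secret_access_key", env["DVC_AWS_SECRET_ACCESS_KEY"]],
    ]


def configure_remote(env):
    """Execute each command in order, failing fast on the first error."""
    for cmd in build_dvc_remote_commands(env):
        subprocess.run(cmd, check=True)
```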
2.3. Environment Variables for Configuration
DVC requires credentials to access the S3 bucket. These are provided via the following environment variables:
- `DVC_REMOTE_NAME`: The name to assign to the DVC remote (e.g., `myremote`).
- `DVC_BUCKET_NAME`: The name of the S3 bucket where data and models are stored.
- `DVC_S3_ENDPOINT_URL`: The S3 endpoint URL (for S3-compatible storage like Backblaze).
- `DVC_S3_ENDPOINT_REGION`: The AWS region of the S3 bucket.
- `DVC_AWS_ACCESS_KEY_ID`: The AWS access key ID.
- `DVC_AWS_SECRET_ACCESS_KEY`: The AWS secret access key.
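A missing variable would otherwise surface as a cryptic DVC error at container startup, so a fail-fast check before configuring the remote is useful. The snippet below is a suggested sketch, not part of the entrypoint as documented:

```python
import os

# Required DVC_* variables, as listed above.
REQUIRED_DVC_VARS = [
    "DVC_REMOTE_NAME",
    "DVC_BUCKET_NAME",
    "DVC_S3_ENDPOINT_URL",
    "DVC_S3_ENDPOINT_REGION",
    "DVC_AWS_ACCESS_KEY_ID",
    "DVC_AWS_SECRET_ACCESS_KEY",
]


def missing_dvc_vars(environ=None):
    """Return the names of required DVC_* variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [v for v in REQUIRED_DVC_VARS if not environ.get(v)]
```

Calling `missing_dvc_vars()` at startup and exiting with a clear message when the list is non-empty turns a confusing remote-access failure into an actionable configuration error.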
For more details on securely managing these credentials, refer to the MLflow Integration documentation.
🔄 3. Core Workflow
3.1. Versioning Data and Models
To version a file or directory with DVC:
```bash
dvc add data/raw/flights.csv
dvc add models/
```
Each command creates a small `.dvc` metadata file next to the data (and adds the data path to `.gitignore` so Git ignores the large files themselves); the `.dvc` files are then tracked by Git:
```bash
git add data/raw/flights.csv.dvc models.dvc
git commit -m "Add DVC-tracked raw data and models"
```
To push the actual data to remote storage:
```bash
dvc push
```
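Behind `dvc add` and `dvc push` sits a content-addressed cache: the file's hash (MD5 by default) names the stored object, with the first two hex characters used as a subdirectory. A minimal sketch of that layout (the DVC 2.x-style scheme; newer DVC versions nest objects under an additional prefix inside the cache):

```python
import hashlib
from pathlib import PurePosixPath


def dvc_cache_path(data: bytes, cache_dir=".dvc/cache") -> str:
    """Where a DVC 2.x-style cache would store `data` (illustrative sketch)."""
    digest = hashlib.md5(data).hexdigest()
    # First two hex chars become a directory, the rest the file name.
    return str(PurePosixPath(cache_dir) / digest[:2] / digest[2:])
```

This is why `dvc push` and `dvc pull` are cheap for unchanged files: identical content hashes to the same cache path, so it is transferred and stored only once.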
3.2. Retrieving Data and Models
To retrieve the data and models associated with your current Git commit, use `dvc pull`:
```bash
dvc pull
```
This command downloads the data from the DVC remote to your local machine.
3.3. Reproducing Past States
One of DVC's most powerful features is the ability to reproduce past experiments by checking out a specific Git commit and retrieving the exact data and models used at that time:
```bash
git checkout <commit_hash>
dvc pull
```
This ensures full reproducibility of any experiment or model version.
🏗️ 4. DVC in the Project
4.1. DVC in the Docker Entrypoint
For the prediction server, DVC plays a critical role during container startup. The docker-entrypoint.sh script is configured to pull the necessary models before the FastAPI application launches. This ensures the server always starts with the correct model version.
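In outline, the startup logic amounts to "pull the models if they are not already present, then start the app". A hedged Python sketch of that behavior follows; the real project does this in `docker-entrypoint.sh`, and the `models` directory name and injectable `run` callable here are illustrative:

```python
import subprocess
from pathlib import Path


def ensure_models(model_dir="models", run=subprocess.run):
    """Run `dvc pull` for the model directory unless it already has content.

    Illustrative sketch of entrypoint behavior; `run` is injectable for testing.
    """
    path = Path(model_dir)
    if not path.exists() or not any(path.iterdir()):
        run(["dvc", "pull", model_dir], check=True)
        return True  # a pull was triggered
    return False  # models already present, nothing to do
```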
4.2. DVC Pipelines
This project also utilizes DVC pipelines to define and manage the data processing and model training workflows. The entire end-to-end pipeline is defined in dvc.yaml.
To run the full DVC pipeline:
```bash
dvc repro
```
This command executes all stages defined in dvc.yaml in the correct order, ensuring that data dependencies are met and only necessary stages are re-executed when inputs change.
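For orientation, a `dvc.yaml` generally follows the shape below: each stage declares a command plus its dependencies and outputs, and DVC derives the execution order from those declarations. The stage names, scripts, and paths here are hypothetical, not the project's actual pipeline:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/raw/flights.csv
    deps:
      - src/prepare.py
      - data/raw/flights.csv
    outs:
      - data/processed/
  train:
    cmd: python src/train.py data/processed/
    deps:
      - src/train.py
      - data/processed/
    outs:
      - models/
```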