Introduction

This post explains what Git LFS is and provides a step-by-step guide for migrating your LFS objects to Bitbucket Cloud using two custom scripts.

Understanding Git LFS

Git LFS (Large File Storage) is a Git extension that handles large files efficiently by storing lightweight references to them in the repository while keeping the file contents on a separate server. This approach keeps large files from bloating the repository, so everyday Git operations stay fast.
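
In practice, every file tracked by LFS is replaced in the repository by a small text pointer that tells Git LFS which object to download. The pointer below is only an illustrative sketch (the hash and size are invented), but it shows the fields a real pointer contains:

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345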

Why Migrate Git LFS?

When moving to Bitbucket Cloud, migrating your Git LFS objects is crucial to maintain the integrity of your repositories and ensure that all necessary assets are available in the cloud environment. The process involves transferring LFS objects from your local or on-premises storage to Bitbucket Cloud’s LFS storage.

How to Migrate Git LFS

The migration relies on two scripts: one that fetches repository information from both your Bitbucket Server (Data Center) instance and Bitbucket Cloud, and another that clones the repositories and syncs their LFS objects to the cloud.

Prerequisites

  • Python 3 installed on your migration machine
  • Git and Git LFS installed (see the verification commands below)
  • Access credentials for both your Bitbucket Server and Cloud accounts
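
A quick way to confirm the tooling is ready is to run the commands below; they assume standard installations of Python, Git, and Git LFS:

python --version   # or python3 --version, depending on your setup
git --version
git lfs version
git lfs install    # registers the Git LFS filters for your user account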

Script 1: Fetching Repository Information

The first script extracts information about your repositories from both Bitbucket Server and Cloud and saves it into CSV files. The data includes repository names, slugs, and clone URLs.

  • Configuration: Fill in the config.py file with your Bitbucket Server and Cloud credentials, including workspace, username, and token, as shown below:
# Bitbucket Cloud configurations
cloud = {
    'workspace': 'rodolfobortolin',  # The workspace ID for Bitbucket Cloud
    'username': 'rodolfobortolin',  # Your Bitbucket Cloud username: https://bitbucket.org/account/settings/
    'token': '<your_token>',  # Your Bitbucket Cloud app password: https://bitbucket.org/account/settings/app-passwords/new
    'bitbucket_cloud_repositories' : 'bitbucket_cloud_repositories.csv'  # Output CSV file for Cloud repositories
}

# Bitbucket Server configurations
on_prem = {
    'base_url': 'http://localhost:7990',  # The base URL of your Bitbucket Server instance
    'username': 'rbortolin',  # Your Bitbucket Server username
    'password': 'admin',  # Your Bitbucket Server password
    'domain': 'localhost:7990',
    'bitbucket_server_repositories' : 'bitbucket_server_repositories.csv'  # Output CSV file for Server repositories
}

repository_folder = 'repositories'  # The directory where the script will clone the repositories
  • Running the Script: Execute the script to generate two CSV files, one for your Bitbucket Server repositories and another for your Bitbucket Cloud repositories.
  • Merging CSV Files: The script then merges these files into a single CSV, mapping each server repository to its cloud counterpart by repository name. The full script follows, and a sample of the merged output appears after it.
import csv
import json
import os
import requests
import logging
from requests.auth import HTTPBasicAuth
from config import cloud, on_prem

# Initialize logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def get_bitbucket_cloud_repos(workspace, username, token, output_file):
    logging.info("Starting to fetch Bitbucket Cloud repositories.")
    url = f"https://api.bitbucket.org/2.0/repositories/{workspace}"
    auth = HTTPBasicAuth(username, token)
    headers = {"Accept": "application/json"}

    try:
        with open(output_file, 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['uuid', 'slug', 'name', 'scm', 'https', 'ssh'])

            while url:
                response = requests.get(url, auth=auth, headers=headers)
                if response.status_code == 200:
                    data = response.json()
                    for repo in data.get('values', []):
                        # Pick clone links by name rather than relying on their order in the response
                        clone_https = next((link['href'] for link in repo['links']['clone'] if link['name'] == 'https'), None)
                        clone_ssh = next((link['href'] for link in repo['links']['clone'] if link['name'] == 'ssh'), None)
                        writer.writerow([repo['uuid'], repo['slug'], repo['name'], repo['scm'], clone_https, clone_ssh])
                    url = data.get('next', None)
                else:
                    logging.error(f"Failed to fetch repositories: {response.text}")
                    break
        logging.info("Successfully fetched and saved Bitbucket Cloud repositories.")
    except Exception as e:
        logging.error(f"Error fetching Bitbucket Cloud repositories: {e}")

def get_bitbucket_server_repos(base_url, username, password, output_file):
    logging.info("Starting to fetch Bitbucket Server repositories.")
    auth = HTTPBasicAuth(username, password)  # Authentication setup
    headers = {"Accept": "application/json"}  # Request headers
    project_limit = 100  # Define the project limit
    repo_limit = 1000  # Define the repository limit

    try:
        with open(output_file, 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['id', 'slug', 'name', 'scmId', 'project_key', 'https', 'ssh'])  # Header row for CSV

            # Pagination setup for projects
            projects_start = 0
            projects_is_last_page = False

            while not projects_is_last_page:
                projects_url = f"{base_url}/rest/api/1.0/projects?start={projects_start}&limit={project_limit}"
                projects_response = requests.get(projects_url, auth=auth, headers=headers)
                if projects_response.status_code == 200:
                    projects_data = projects_response.json()
                    projects = projects_data.get('values', [])
                    projects_is_last_page = projects_data.get('isLastPage', True)

                    for project in projects:
                        # Logging each project being processed
                        logging.info(f"Processing project: {project['key']}")
                        repos_start = 0
                        repos_is_last_page = False

                        while not repos_is_last_page:
                            repos_url = f"{base_url}/rest/api/1.0/projects/{project['key']}/repos?start={repos_start}&limit={repo_limit}"
                            repos_response = requests.get(repos_url, auth=auth, headers=headers)
                            if repos_response.status_code == 200:
                                repos_data = repos_response.json()
                                repos = repos_data.get('values', [])
                                repos_is_last_page = repos_data.get('isLastPage', True)
                                # Advance the repository page cursor so the loop terminates when a project has more repos than the limit
                                repos_start = repos_data.get('nextPageStart', repos_start + repo_limit)

                                for repo in repos:
                                    clone_https = None
                                    clone_ssh = None
                                    for clone_link in repo['links']['clone']:
                                        if clone_link['name'] == 'http':
                                            clone_https = clone_link['href']
                                        elif clone_link['name'] == 'ssh':
                                            clone_ssh = clone_link['href']
                                    if clone_https or clone_ssh:
                                        writer.writerow([repo['id'], repo['slug'], repo['name'], repo['scmId'], project['key'], clone_https, clone_ssh])
                                        logging.info(f"Added repository '{repo['name']}' to CSV.")
                            else:
                                logging.error(f"Failed to fetch repositories for project {project['key']}. Status code: {repos_response.status_code}")
                                break

                    # Advance the project page cursor after all projects on this page have been processed
                    if not projects_is_last_page:
                        projects_start = projects_data.get('nextPageStart', projects_start + project_limit)
                else:
                    logging.error(f"Failed to fetch projects. Status code: {projects_response.status_code}")
                    break
            logging.info("Successfully fetched and saved Bitbucket Server repositories.")
    except Exception as e:
        logging.error(f"Error while fetching Bitbucket Server repositories: {e}")


def merge_repos_to_csv(server_csv, cloud_csv, output_csv):
    """Merges data from Bitbucket Server and Cloud CSV files into a single CSV file."""
    try:
        script_dir = os.path.dirname(__file__)
        server_csv_path = os.path.join(script_dir, server_csv)
        cloud_csv_path = os.path.join(script_dir, cloud_csv)
        
        server_repos = {}
        with open(server_csv_path, 'r', encoding='utf-8') as file:
            reader = csv.DictReader(file)
            for row in reader:
                server_repos[row['name']] = row

        cloud_repos = {}
        with open(cloud_csv_path, 'r', encoding='utf-8') as file:
            reader = csv.DictReader(file)
            for row in reader:
                cloud_repos[row['name']] = row

        output_csv_path = os.path.join(script_dir, output_csv)  # Write the merged file next to the script, where the second script expects it
        with open(output_csv_path, 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['name', 'project', 'match', 'source', 'target'])

            for name, server_repo in server_repos.items():
                if name in cloud_repos:
                    cloud_repo = cloud_repos[name]
                    project = server_repo['project_key']
                    match = "yes"
                    source = server_repo['https']
                    target = cloud_repo['https']
                    writer.writerow([name, project, match, source, target])
                else:
                    logging.warning(f"No Bitbucket Cloud repository named '{name}' was found; it will be left out of the merged CSV.")
        logging.info("Successfully merged repositories into a single CSV.")
    except Exception as e:
        logging.error(f"Error merging repositories: {e}")

# Fetch and merge repositories
if __name__ == "__main__":
    get_bitbucket_cloud_repos(cloud['workspace'], cloud['username'], cloud['token'], cloud['bitbucket_cloud_repositories'])
    get_bitbucket_server_repos(on_prem['base_url'], on_prem['username'], on_prem['password'], on_prem['bitbucket_server_repositories'])
    merge_repos_to_csv(on_prem['bitbucket_server_repositories'], cloud['bitbucket_cloud_repositories'], 'merged_repositories.csv')
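
For illustration, a row in the resulting merged_repositories.csv might look like the following; the project key, repository name, and URLs are placeholders, not real values:

name,project,match,source,target
design-assets,DES,yes,http://localhost:7990/scm/des/design-assets.git,https://bitbucket.org/<workspace>/design-assets.git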

Script 2: Cloning and Syncing LFS Objects

The second script uses the merged CSV file to clone repositories from the server to a local machine, add a remote to the corresponding cloud repository, and sync the Git LFS objects.

  1. Clone Repositories: The script clones repositories from the Bitbucket Server to your local machine if they haven’t been cloned already.
  2. Add Cloud Remote: It adds a remote link to the corresponding Bitbucket Cloud repository.
  3. Sync LFS Objects: Fetches all LFS objects from the server and pushes them to the cloud repository, as sketched in the commands below.
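
For a single repository, those three steps boil down to the git commands below (the names and URLs in angle brackets are placeholders):

git clone http://<user>:<password>@localhost:7990/scm/<project>/<repo>.git
cd <repo>
git remote add cloud https://bitbucket.org/<workspace>/<repo>.git
git lfs fetch --all        # download every LFS object from the server
git lfs push --all cloud   # upload them all to the Bitbucket Cloud remote

The full script that automates this for every row of the merged CSV: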
import csv
import os
import subprocess
import urllib.parse
import logging
from config import on_prem, repository_folder

# Initialize logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Determine the path to the 'repositories' subfolder relative to this script's location
script_location = os.path.dirname(os.path.abspath(__file__))
save_folder = os.path.join(script_location, repository_folder)

# Ensure the 'repositories' subfolder exists
os.makedirs(save_folder, exist_ok=True)

# Predefined configuration variables
input_csv = os.path.join(script_location, "merged_repositories.csv")  # Adjusted for script location
should_clone = True  # Set to True to clone repositories
should_sync_lfs = True  # Set to True to sync LFS files

def run_command(command, cwd=None):
    """Execute a system command with optional working directory."""
    logging.info(f"Executing: {command}")
    try:
        result = subprocess.run(command, shell=True, cwd=cwd, capture_output=True, text=True)
        if result.stdout:
            logging.debug(result.stdout)
        if result.stderr:
            logging.error(result.stderr)
    except Exception as e:
        logging.exception("Failed to execute command")

def clone_and_sync_repos():
    """Clone and sync repositories from a CSV file."""
    with open(input_csv, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            # Construct source and target URLs with credentials
            credentials = f"{on_prem['username']}:{urllib.parse.quote_plus(on_prem['password'])}"
            source_url = row['source']

            # Embed the Server credentials only for http(s) clone URLs; other URLs are left untouched
            if source_url.startswith('http://'):
                source_url = source_url.replace('http://', f"http://{credentials}@", 1)
            elif source_url.startswith('https://'):
                source_url = source_url.replace('https://', f"https://{credentials}@", 1)
            target_url = row['target']
            repo_folder = os.path.join(save_folder, row['name'])

            logging.info(f"Processing [{row['name']}] repository...")

            if should_clone:
                # Clone repository if it does not exist
                if not os.path.exists(repo_folder):
                    clone_command = f"git clone {source_url} \"{repo_folder}\""
                    run_command(clone_command)
                else:
                    logging.info("Repository already exists, skipping clone.")

                # Add the Bitbucket Cloud remote; if it already exists from a previous run, git logs an error and the existing remote is reused
                remote_add_command = f"git remote add cloud {target_url}"
                run_command(remote_add_command, cwd=repo_folder)

            if should_sync_lfs:
                # Fetch and push LFS files
                logging.info("Fetching LFS files...")
                run_command("git lfs fetch --all", cwd=repo_folder)
                logging.info("Pushing LFS files...")
                run_command("git lfs push --all cloud", cwd=repo_folder)

# Main execution
if __name__ == "__main__":
    clone_and_sync_repos()

Step-by-Step Migration

  1. Prepare Your Environment: Ensure all prerequisites are met and your config.py is correctly filled out.
  2. Execute the First Script: Fetch and merge repository information from your Bitbucket Server and Cloud.
  3. Run the Second Script: Clone the repositories and sync their Git LFS objects to Bitbucket Cloud.
  4. Verify the Migration: After the scripts complete, check your Bitbucket Cloud repositories to confirm that all LFS objects were migrated; a quick verification sketch follows.
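
One way to spot-check the result is to clone a migrated repository from Bitbucket Cloud and inspect its LFS objects; the workspace and repository names below are placeholders:

git clone https://bitbucket.org/<workspace>/<repo>.git
cd <repo>
git lfs ls-files     # lists the files tracked by Git LFS in the checkout
git lfs fetch --all  # reports errors if any LFS object is missing from Cloud storage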