Introduction
This article documents the process that we used to create the cardano-coin-selection
repository.
Contents
Overview
The cardano-coin-selection
repository was created by taking a clone of the pre-existing cardano-wallet
repository and filtering out a relevant subset of the version control history.
This article provides a record of the steps we used to perform this operation.
Background
Filtering the version control history of a repository is non-trivial.
One issue is that files in a repository are often renamed several times over the course of their history, and other files are composed of content from multiple ancestor files.
Assuming that we’d like to generate a new repository from a subset of the files in some source repository, we have to find a way to retain the history of not just the subset of files that we’re interested in, but also the histories of all files that served as content ancestors for that subset of files.
If we just naively filter for commits that affect the subset of files we’re interested in, we run the risk of losing those commits that affect older versions of files that existed at different paths.
In general, we’d like to keep commits for both:
- all files of interest; and
- all ancestors of files that are of interest.
Example
For example, suppose that the module Cardano.Wallet.Primitive.Types
includes content from files that once existed at the following paths:
-
lib/core/src/Cardano/Wallet/Types.hs
-
lib/core/src/Cardano/Wallet/Primitive/Types.hs
-
src/Cardano/Wallet/Primitive.hs
-
src/Cardano/Wallet/Primitive/Types.hs
We’d ideally like to keep commits relating to all of those paths.
Process
Here is a record of the steps we used to create the cardano-coin-selection
repository.
Step 1: Clone Source Repository
We start with a fresh clone of the source repository (in our case, the cardano-wallet
repository).
Step 2: Remove Irrelevant Files
In this step, we identify the files that we want to keep, and remove all files that are irrelevant.
To achieve this, we make a single commit to the master
branch that removes all unwanted files from the repository. The result of applying this commit should be precisely the set of files we want to keep.
Note that in the case of Haskell modules, we need to be somewhat careful, and avoid deleting any modules that define functions imported by the modules we want to keep. To avoid deleting too much, we need to determine the transitive closure of module dependencies required by the modules that we’re interested in.
A safe way to achieve this is to iteratively remove files that we’re not interested in, while confirming that it is still possible to build the remaining subset, repeating the process until all unwanted files are deleted.
Example
Suppose that we want to keep file src/ImportantModule.hs
, but that it imports functions defined in the following modules:
-
src/Wibble.hs
-
src/Wobble.hs
Furthermore, suppose that src/Wibble.hs
imports functions from the following modules:
-
src/Foo.hs
-
src/Bar.hs
If we wish to keep src/ImportantModule.hs
, we should therefore also keep:
-
src/Wibble.hs
-
src/Wobble.hs
-
src/Foo.hs
-
src/Bar.hs
Step 3: Identify Content Ancestors
In this step, we identify the historic ancestors of all files that we want to keep.
We generate a list of path names for all current files, as well as path names for all historical ancestor files, using the following script:
find-paths.sh
:
#!/usr/bin/env bash
git ls-tree -r master --name-only | while read -r file; do
git log --follow --name-status -- "$file" \
| awk '/^R[0-9]+/ { print $2; print $3 } {}'
done | sort -u
Run the script from the root of the repository, as follows:
$ find-paths.sh > files-to-keep
Step 4: Filter Commit History
In this step, we filter the version control history, removing all commits that are unrelated to the list of files identified in the previous step.
Run the following command, using the git-filter-repo
tool:
$ git filter-repo --paths-from-file files-to-keep
Step 5: Verify Commit History
In this step, we verify that we have retained all relevant parts of the history.
Re-run the find-paths
script from the root of the repository, as follows:
$ find-paths.sh > files-kept
The content of files-kept
should be identical to files-to-keep
.
Step 6: Push to GitHub
First follow the New Repo Checklist to create a repo on GitHub. Then:
$ git remote set-url origin git@github.com:input-output-hk/newrepo.git
$ git push --force -u origin master