Clearvision Technical Consultant Philip Armour tackles the topic of "Brexgit" - how to split a Git repository into smaller ones, or combine several repositories into one.
Divide and Conquer?
Unity and connectivity are usually positive things, but as we’ve seen recently, sometimes people want to split up and be more independent. The same can be true in software. If you are using Git it is well worth understanding the methods by which a software repository can be split up into several smaller ones, or indeed the opposite: what to do when you want to combine several smaller repositories into one.
In this post we explore both of these methods, but it is perhaps first worth comparing the positives and negatives of having a multiple repository set-up versus a monolithic one.
Good reasons for using multiple repositories include:
Wishing to split the software along functional or architectural lines into independent logical components
Wishing to split the software along organizational lines – according to ownership and how access is to be granted (different teams working on different repositories)
The multi-repository model may also fit well with the use of microservices to implement complex applications.
… Or Better Together?
I may not have been on the winning side in the Brexit vote, but when it comes to Git, I find it much easier to convince people of the tangible positives of the monolithic approach (one big repo for all related software), assuming there are no compelling reasons to avoid it in a specific case. The main benefit is that you sidestep the cross-repository dependency headache, which may otherwise push you towards adding complexity to your workflow by using one of the following ‘solutions’:
git submodules, which have a bad, bad reputation for being complex (even among Git evangelists it is hard to find someone who recommends them)
git subtrees, which are easier to use but effectively move you towards the mono-repository
repo (a tool originating from Google/Android development), which is an interesting option, although it is uncertain how well it works with Git-hosting solutions other than Gerrit
There are other options, but these are arguably the main ones.
When Repos Collide
Let’s get back to methods of splitting and combining Git repos. Combining Git repositories is, on the surface, easy. In fact it can even be done without changing the commit histories (SHA1s) of either repository. It is also a great illustration of the flexibility of the Git DAG (directed acyclic graph): if you want to ‘import’ commits from one (completely different) repo into another repo, just add the ‘foreign’ repo (if you are not suspicious of foreign repos) as a ‘remote’ and fetch or pull the commits.
Imagine we wish to combine repository foo with a completely different repository bar while keeping full commit history of both. For simplicity, we will only consider a simple case, where each repository has a single master branch. In more realistic cases, the procedure must take other branches into account also.
Our starting point is depicted below:
When you have a local clone of a Git repository there is usually a remote (called origin by default) which is the address of the repository you cloned from. For our local clone of repository foo, let’s assume that origin has the address: email@example.com:example/foo.git
We now define a new remote in foo which points to the address of the second repository (bar). Let’s say this is email@example.com:example/bar.git
Now from foo, if we run the following git commands:
git remote add bar_repo email@example.com:example/bar.git
git checkout master
git fetch bar_repo master
The result is depicted in the following diagram. I think it is pretty cool that Git does not mind at all that there are two completely separate histories in its database.
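The final step is to connect the two histories with a merge. Since Git 2.9, git merge refuses to join histories that share no common ancestor unless the --allow-unrelated-histories option is given:
git merge --allow-unrelated-histories bar_repo/master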
Which results in:
So we have successfully merged both histories without changing any commit IDs. We can now remove the bar_repo remote, which also removes the bar_repo/master remote-tracking branch.
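The whole procedure can be tried out safely on throwaway repositories. The following is a minimal sketch, assuming Git 2.9 or later; the temporary directory, the file names and the "Demo" identity are made up for illustration, and a local path stands in for the remote address used above.

```shell
#!/bin/sh
# Sketch: combine two unrelated throwaway repositories and check that the
# original commit IDs survive. All names and paths here are hypothetical.
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.invalid"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.invalid"

work=$(mktemp -d)

# Helper: create a repository with a single commit on master.
make_repo() {
    git init -q "$work/$1"
    git -C "$work/$1" symbolic-ref HEAD refs/heads/master
    echo "$1 content" > "$work/$1/$1.txt"
    git -C "$work/$1" add "$1.txt"
    git -C "$work/$1" commit -q -m "Initial commit of $1"
}

make_repo foo
make_repo bar

# Remember the commit IDs before combining.
foo_tip=$(git -C "$work/foo" rev-parse master)
bar_tip=$(git -C "$work/bar" rev-parse master)

cd "$work/foo"
git remote add bar_repo "$work/bar"
git fetch -q bar_repo master

# Before the merge, the two histories share no common ancestor.
if git merge-base master bar_repo/master >/dev/null; then
    echo "unexpected: histories are already related" >&2
    exit 1
fi

git merge -q --allow-unrelated-histories -m "Merge bar into foo" bar_repo/master

# Removing the remote also removes its remote-tracking branch.
git remote remove bar_repo

# Both original commits are still reachable under their original IDs.
git merge-base --is-ancestor "$foo_tip" HEAD
git merge-base --is-ancestor "$bar_tip" HEAD
git log --oneline --graph
```

The two git merge-base --is-ancestor calls at the end are the actual check: they only succeed if the original tips of foo and bar are part of the combined history with their IDs unchanged.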
The only downside of combining two repos with a merge as described above is that you may end up with a mish-mash of files and directories at the top level. This is arguably an argument for always putting the content of a Git repository in a single subdirectory at the top level. Where that’s not the case, it’s possible to first use a tool like git-filter-branch to restructure the repos before combining them (accepting that this will rewrite all commit IDs).
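One way such a restructuring might look is sketched below, using the --tree-filter option of git filter-branch to move everything into a bar/ subdirectory in every commit. This is only an illustration under assumptions: the throwaway repository, its files, the "Demo" identity and the .bar_tmp scratch name are all invented, and the find options used are the common GNU/BSD ones.

```shell
#!/bin/sh
# Sketch: rewrite a throwaway repository so that all of its content lives
# under bar/ in every commit. Names and paths are hypothetical.
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.invalid"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.invalid"
# Recent Git versions warn about filter-branch and pause; silence that here.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
git init -q "$work/bar"
cd "$work/bar"
git symbolic-ref HEAD refs/heads/master

echo "readme" > README
git add README
git commit -q -m "Add README"

mkdir src
echo "code" > src/main.c
git add src
git commit -q -m "Add source"

old_tip=$(git rev-parse master)

# For each commit: move every top-level entry into a scratch directory,
# then rename that directory to bar.
git filter-branch --tree-filter '
    mkdir -p .bar_tmp
    find . -mindepth 1 -maxdepth 1 ! -name .bar_tmp -exec mv {} .bar_tmp/ \;
    mv .bar_tmp bar
' -- master >/dev/null

new_tip=$(git rev-parse master)

# The commit IDs have changed, and the files now sit under bar/.
echo "old tip: $old_tip"
echo "new tip: $new_tip"
git ls-tree -r --name-only master
```

After a rewrite like this, merging the repository into foo as described earlier would leave its files neatly under bar/ rather than scattered across the top level. The price is that every commit ID in the rewritten repository is new.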
Finally, the ‘Brexgit’ option. Splitting up a Git repo is also done with git-filter-branch, for which there are standard cookbook recipes. For example, one method is documented here by Atlassian.
This also results in rebuilding the commit histories and modifying all SHA1s.
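As a rough illustration of such a recipe (not necessarily the one Atlassian documents), the sketch below uses the --subdirectory-filter option of git filter-branch to carve one directory out of a throwaway repository. The repository layout, file names and "Demo" identity are assumptions made up for the example.

```shell
#!/bin/sh
# Sketch: split the backend/ directory of a throwaway "monolith" repository
# into its own repository. Names and paths are hypothetical.
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.invalid"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.invalid"
# Recent Git versions warn about filter-branch and pause; silence that here.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
git init -q "$work/monolith"
cd "$work/monolith"
git symbolic-ref HEAD refs/heads/master

mkdir frontend backend
echo "ui" > frontend/app.js
git add frontend
git commit -q -m "Add frontend"

echo "api" > backend/server.py
git add backend
git commit -q -m "Add backend"

echo "more api" >> backend/server.py
git commit -q -am "Extend backend"

# Work on a fresh clone so the original repository is left untouched.
git clone -q "$work/monolith" "$work/backend"
cd "$work/backend"

# Keep only the history touching backend/ and make that directory the root.
git filter-branch --subdirectory-filter backend -- master >/dev/null

# The new repository should no longer point back at the monolith.
git remote remove origin

git log --oneline
git ls-tree -r --name-only master
```

In this sketch the new repository ends up with only the two commits that touched backend/, with server.py at its top level, while the monolith keeps all three of its commits. Recent Git versions print a warning when filter-branch runs and point to alternatives such as git filter-repo, which may be worth evaluating for larger repositories.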
With this method, if after the split you later decide that things weren’t really so bad with the big old repository you had before, you can always eat some humble pie and combine them together again.