Git repository granularity
How big should a Git repository be? Repository size may seem like a matter of personal taste, but it actually has a huge impact on development and release workflow. I have a lot of repositories of varying size and purpose, so I researched this a bit and here's what I found.
The primary constraint on repository size comes from shared repository-wide resources:
- Namespaces: There is only one repository-wide namespace for tags and branches. When the repository is too big, you have to add prefixes, which makes things inconvenient and makes the repo incompatible with some standard tools. On the other hand, if the repo is too small, tags and branches have to span repositories, which is even more inconvenient.
- Branching and merging: If you use branches a lot, this might be quite important. With too big a repository, you end up having to do a lot of merging from the main branch to any long-lived branch. With too small a repo, branches have to span repositories, which makes them very unwieldy.
- Code reuse: While open-source projects can rely on excellent dependency management tools, which generally work best with fine-grained repositories, smaller commercial projects usually have to make do with locally resolved dependencies, which work better in one big repository. Tightly coupled open-source components are also easier to develop together in one repo.
- Move history: Moving code between repositories normally loses its history unless you take extra steps to carry it over. On the other hand, losing history may be desirable if the move is intended to create a new component with its own tags and release history.
- Tools: Continuous integration, GitHub Actions, and repository-aware development and deployment tools assume they are being applied to the entire repository. Some of these tools can be configured to work on a subdirectory, but that is inconvenient where it is possible at all. On the other hand, tools almost never work well across numerous small repositories.
- Hosting: GitHub and other code hosting services tie additional resources to the repository: releases, downloads, homepage, etc. Overly large repositories end up with a complicated mess in all these resources. Having too small a repository is less problematic, but there are still downsides, for example when common releases and downloads have to be found under a related repository.
- Community: GitHub and similar sites organize communities of contributors around repositories. If the repository is too big, multiple tangentially related communities will step on each other's toes. If the repository is instead too small, people interested in the project will have to hop around several repositories to get things done.
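To make two of the workarounds above concrete, here is a minimal Python sketch of what an oversized repository tends to force on you: tags prefixed with a component name, and tooling that has to work out which subdirectories a change touches. The function names, the `component/version` tag layout, and the example directory names are all assumptions made for illustration, not a convention that Git or any particular tool prescribes.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_tags_by_component(tags):
    """Group tags such as 'parser/v1.2.0' by their component prefix.

    The 'component/version' layout is an assumed convention.
    Tags without a prefix are filed under the empty string.
    """
    groups = defaultdict(list)
    for tag in tags:
        # rpartition splits on the last '/', so 'libs/parser/v1.0'
        # yields the prefix 'libs/parser' and the version 'v1.0'.
        prefix, _, version = tag.rpartition("/")
        groups[prefix].append(version)
    return dict(groups)


def affected_components(changed_paths, component_dirs):
    """Return the component directories that contain a changed path.

    Paths are compared segment by segment, so 'parser-extras/x.py'
    does not count as a change inside 'parser'.
    """
    affected = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        for component in component_dirs:
            component_parts = PurePosixPath(component).parts
            if parts[:len(component_parts)] == component_parts:
                affected.add(component)
    return sorted(affected)


if __name__ == "__main__":
    # Hypothetical tags and paths, purely for demonstration.
    print(group_tags_by_component(["parser/v1.2.0", "cli/v0.3.1", "parser/v1.3.0"]))
    print(affected_components(["parser/src/lexer.py", "docs/index.md"], ["parser", "cli"]))
```

In practice the tag list might come from `git tag --list` and the changed paths from `git diff --name-only`. Needing this kind of glue at all is a hint that the repository may be bigger than its tools would like.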
Okay, those are constraints we can use to evaluate any given repository granularity, but what granularities can we choose from? As I see it, there are three basic choices:
- One repo per artifact: An artifact in this case usually means a library or a binary. One repo per artifact works best for open-source libraries. The small size makes the repo accessible to contributors. Standard tools usually work out of the box.
- One repo per product: A product is a set of related artifacts released together. A product-level repo simplifies branching, tagging, and sharing code among closely related components. People working on the product only have to deal with one repository. This works best for multi-artifact open-source projects, commercial projects delivered in source form, and siloed teams in larger organizations.
- Monorepo: A monorepo is as large as possible, often encompassing everything an organization is doing. A monorepo is often the simplest and most productive option for private repositories, which do not need hosting or community but benefit strongly from code reuse and all-encompassing branches and tags.
These three basic choices can be blended by deciding, component by component, between in-tree development and a separate repository. Tightly coupled components, for example add-ons enabled by default in the app, would be developed in-tree, while experimental and third-party add-ons get their own repositories. Things get even more complicated with large files and unruly developers, both of which might be better off isolated in separate repos.
There is no one perfect solution. The choice of repository granularity has to take into account the nature of the project, development workflow, tooling, and interaction with users, clients, and contributors. If I am to make a generalization from all of this, I would say that a repository is a unit of sharing. It's not a matter of what is in the repository but rather of who is working on it. Repository granularity should match the granularity of sharing, teams, and community.