Git is great for version control, but it is inefficient at storing large or numerous binary blobs (videos, photos) in a way that keeps them immediately accessible in the file system -- if you manage your photo collection in Git, the disk space used is roughly twice the size of the bare photo collection.
git-annex works around those shortcomings, but it is better at tracking storage locations (and "recently" also at providing a GUI and metadata management) than at actually storing the data -- a task a native Git approach could handle just as well.
When working on a copy-on-write file system like Btrfs or ZFS, Git should leverage the file system's ability to present the same data in different places (in the committed object store and in the checkout) while still leaving the original version unmodified when one copy (the checkout) gets changed.
Currently, Git has two ways of storing an object: regular objects (in `.git/objects/[hex]/`), which are prefixed with type and length and then zlib-compressed, and pack files (which can contain more than one object and apply delta compression).
Both forms of compression perform poorly on blobs, especially when they are already compressed, as media data typically is.
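For reference, the loose-object format is simple enough to sketch in a few lines of Python: the object's name is the SHA-1 of the type/length header plus content, and zlib is applied only afterwards -- which is exactly where the gain evaporates for media files:

```python
import hashlib
import zlib

def loose_blob(data: bytes) -> tuple[str, bytes]:
    """Encode data the way Git stores a loose blob object."""
    store = b"blob %d\x00" % len(data) + data  # type and length prefix
    oid = hashlib.sha1(store).hexdigest()      # the object's name
    return oid, zlib.compress(store)           # ~no savings on media data
```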
There should be a third kind of object that will be called "bare objects" here: a bare object is always of type blob and uncompressed, so its size can be read directly from the file system. All that bare objects would need is a distinct path in the hierarchy; I'll use `.git/objects/bare/XX/YY` in examples (where `XX` and `YY` are the first two characters and the rest of the object's hash, just as with regular objects).
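In a sketch, that hypothetical layout would look like this (the function name and the `bare` directory are made up here, mirroring the fan-out of regular loose objects):

```python
from pathlib import Path

def bare_object_path(repo: Path, oid: str) -> Path:
    # same two-character fan-out as regular loose objects,
    # but under a distinct "bare" directory
    return repo / ".git" / "objects" / "bare" / oid[:2] / oid[2:]
```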
A first implementation would be relatively easy: it would just need to hook into hash lookup and storage, where the storage part needs a way of deciding whether it makes sense to store a given blob as a bare object or not.
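What such a decision could look like is sketched below; the threshold and the compressibility probe are made-up placeholders for whatever heuristic turns out to work in practice:

```python
import zlib

BARE_THRESHOLD = 1 << 20  # hypothetical cut-off: 1 MiB

def should_store_bare(data: bytes) -> bool:
    """Hypothetical heuristic: store large, incompressible blobs bare."""
    if len(data) < BARE_THRESHOLD:
        return False
    sample = data[:64 * 1024]
    # if zlib barely shrinks a sample, the blob is probably media data
    return len(zlib.compress(sample, 1)) > 0.9 * len(sample)
```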
The simple implementation outlined above would not yield any savings yet; it further needs:
A fast lane for COW copies: As far as I understand Git's internals, the "store as object" function takes a memory region (or a file handle to be consumed to the end), and stores that. Unless it actually works with file handles and a COW copy can be created from that alone, callers of `git-hash-object` must be provided with an additional function that copies the input file to the bare hash location with the appropriate COW flags set. The same goes for the other direction.
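On Linux, such a copy can be requested with the clone ioctl (`BTRFS_IOC_CLONE`, later generalized as `FICLONE`); a minimal sketch in Python, with all error handling left out:

```python
import fcntl
import os

FICLONE = 0x40049409  # the Linux clone ioctl; shares extents instead of copying

def cow_copy(src: str, dst: str) -> None:
    """Create dst as a copy-on-write clone of src; fails on non-COW file systems."""
    src_fd = os.open(src, os.O_RDONLY)
    try:
        dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o444)
        try:
            fcntl.ioctl(dst_fd, FICLONE, src_fd)
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```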
Heuristics for `git-gc` (or whatever it calls to create pack files) to know what is better put into pack files, and what to keep on disk because it's free thanks to the checkout anyway. As a first step, `git-gc` could just be told never to pack files present in bare form, provided the storage mechanism from above is clever enough not to store all files bare.
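That first-cut rule is trivial to express; `keep_out_of_pack` is a made-up name for wherever git-gc's object selection would hook in:

```python
from pathlib import Path

def keep_out_of_pack(repo: Path, oid: str) -> bool:
    # never repack what is stored bare: it is free thanks to the
    # COW-shared checkout, and packing it would only cost space
    return (repo / ".git" / "objects" / "bare" / oid[:2] / oid[2:]).exists()
```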
Repacking of pack files to get big blobs into bare objects. This is not primarily an issue with legacy Git trees, but with receiving data from the network. It is particularly difficult because the information on whether a file will finally be checked out might only become available later in the received pack than the blob itself, meaning that Git would have to store the complete received pack and dissect it later. (The receiver could peek into the received data and, for example, decide to store binary data >10MB as a bare object unconditionally.)
Handling of deltas: If those blobs start changing in a way where deltas would be a reasonable compression (I'm thinking of tagging of audio files), the storage-efficient method would be to keep the currently checked-out version as a bare object and the delta in the pack file. As far as I know, pack files currently don't allow references to objects outside the pack, so that might pose additional difficulties.
Git already knows sparse checkouts (documented in `git-read-tree`), but that only refers to the files being copied for checkout. When the above is implemented, that kind of sparse checkout does not give a disk usage benefit any more -- instead, we can aim to avoid fetching currently unneeded large objects at all. Such checkouts have been called "narrow clones"; I'll stick to that term here.
Git already allows clones without history (shallow clones, with `--depth`), and since version 1.9 they are not as limited as they used to be. There, to avoid every part of Git having to deal with the resulting dead ends, the commits that would reference a nonexistent commit are grafted to hide their ancestry. The same can be achieved using the `git-replace` mechanism, where the historically correct parents that are not actually available are removed from the commit in a kind of overlay. (Even though `git-replace` has a `--graft` option that does this, the grafts from `--depth` are different beasts, as discussed on Stack Overflow.)
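For illustration, hiding a commit's ancestry with the replace mechanism boils down to a single command; the wrapper below is just a made-up convenience around it:

```python
import subprocess

def hide_ancestry(commit: str) -> None:
    """Record a parentless overlay of `commit` using the replace mechanism."""
    # `git replace --graft <commit>` with no parents listed drops them all
    subprocess.run(["git", "replace", "--graft", commit], check=True)
```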
Narrow clones could employ the replace mechanism on blobs that are known not to be of interest to the current clone. The problem can be split up like this:
Checkout. This is the easiest to address. The objects known to be missing (there has to be a list somewhere, to be discussed further down) could all be git-replaced with a symlink to `.git/unavailable-object`. With some clever smudge filtering, those symlinks could be replaced with files without read permissions. Those files could even be sparse files (i.e. zero-filled without needing much space) of the correct size, if that size is known.
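A sketch of what such a smudge filter could produce, assuming the size is known (the function name is made up):

```python
import os

def unavailable_placeholder(path: str, size: int) -> None:
    """Check out a missing blob as a sparse, unreadable file of the right size."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o000)  # no read permissions
    try:
        os.ftruncate(fd, size)  # sparse: reads as zeros, occupies next to no blocks
    finally:
        os.close(fd)
```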
Transfer. This is the hardest to address. I don't know much about the Git transfer protocol, but to my understanding things run like "I have commit A and B, give me C and everything I need to know for that". If the large blob X is in C and neither in A nor B, it gets transferred.
I have experimented with hacking things up so that the client sends "I have commit A and commit B as well as X, give me C", and as far as I remember that worked out, but it's a long way from disabling checks and adding hard-coded hashes to that working out of the box.
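Very roughly (glossing over capabilities and the multi-round negotiation), such a request is a series of pkt-lines, and the hack above amounts to smuggling X's hash into the `have` lines:

```python
def pkt_line(payload: bytes) -> bytes:
    # Git pkt-line framing: four hex digits of total length, then the payload
    return b"%04x" % (len(payload) + 4) + payload

def negotiation(wants: list[str], haves: list[str]) -> bytes:
    """Sketch of an upload-pack request body; real clients do more rounds."""
    req = b"".join(pkt_line(b"want %s\n" % w.encode()) for w in wants)
    req += b"0000"  # flush-pkt separates the want section from the haves
    req += b"".join(pkt_line(b"have %s\n" % h.encode()) for h in haves)
    req += pkt_line(b"done\n")
    return req
```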
Knowing which files not to fetch. This needs collaboration from somewhere. Ideally, the Git server would not need to know about anything discussed on this page, and thus it cannot be relied on.
As things are, even if the client managed to obtain only the commit and tree objects but not the blobs, it would not have a way to tell whether `README` is a 10-line text file or a 4GB video -- but ideally, README should be checked out in a `git clone --narrow="size < 1MB"` clone.
The best solution I can currently think of is collaborating clients maintaining a list of possibly not checkout-worthy blobs in a separate branch; that could be a tree similar to `.git/objects` that contains metadata about all blobs a client ever deemed checkout-unworthy.
A client managing a narrow clone that wants to pull would first fetch the metadata branch, decide which of the blobs it will not want, and then update its master branch, sending a large set of hashes in the "don't send this" section of the request. A careful implementation might even track which master branch version the metadata branch was last updated for, so that the client could fetch the master branch in baby steps (i.e. fetch an object without fetching any of its dependencies, if that is allowed in the protocol at all), refusing to download any blob. (Whether a tree entry is a blob or another tree follows from its file mode.)
A completely different approach would be to filter only by file extensions or other gitattributes. In that case, the client would need to fetch new commits with complete trees but without blobs, which may require server-side changes or baby-stepping through the objects.
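Whichever source the information comes from (metadata branch or gitattributes), the client-side decision boils down to a predicate like the following sketch, in which all names and values are made up:

```python
import fnmatch

NARROW_PATTERNS = ["*.mp4", "*.flac"]  # hypothetical narrowness specification
SIZE_LIMIT = 1 << 20                   # the `size < 1MB` example from above

def wanted(path: str, size: int) -> bool:
    """Decide whether a narrow clone should fetch and check out a blob."""
    if any(fnmatch.fnmatch(path, pat) for pat in NARROW_PATTERNS):
        return False
    return size < SIZE_LIMIT
```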
Modifying the narrowness parameters: If a user wants a file currently narrowed out, he needs a command to make that file available -- that is, fetch the object, remove the replacement, check out the unreplaced version.
Same goes for dropping an object. There, additional care has to be taken not to drop the last good copy of it; that task would best be handled by a different mechanism on top of Git (such as git-annex).
Merging: As with shallow clones, that might work in a particular situation if sufficient context is available, or simply might not if the common ancestor is missing. That's a reasonable limitation.
This topic has come up on the mailing list before:
Narrow clone implementation difficulty estimate
It indicates that there can be a limit on the object sizes in pack files, which would make things a lot easier. Also, the gitattributes mentioned there that prevent delta-ification could be useful for bare objects.
The COW savings part obviously only works on COW file systems. I have no compassion for people stuck on other file systems. Users of other file systems might need to resort to mechanisms like those currently employed by git-annex; apart from that, Git would behave to them like it always has.
This document is an elaboration on what was previously (Feb. 2011) described in the git-annex wiki.
Mercurial has a NarrowClonePlan.
This primarily needs feedback from people familiar with the Git packing and wire formats, or more detailed research into those topics, or links to reports on similar attempts in that area.
--chrysn 2015-03-11
This page is part of chrysn's public personal idea incubator; go up for its other entries, or read about the idea of having an idea incubator for more information on what this is.