The issue

Git is great for version control, but it is inefficient at storing large or many binary blobs (videos, photos) in a way that keeps them immediately accessible in the file system -- if you manage your photo collection in git, the disk space used is roughly twice the size of the bare photo collection, because every file exists once in the working tree and once in the object store.

git-annex works around those shortcomings, but it is better at tracking storage locations (and, more recently, at providing a GUI and metadata management) than at actually storing the data -- a job a native Git approach could do just as well.

Changes I'd like to see in Git

Storing objects in a COW friendly way

When working on a copy-on-write file system like Btrfs or ZFS, Git should leverage the file system's ability to present the same data in different places (in the committed object store and in the checkout) while still leaving the original version unmodified when one of them (the checkout) gets changed.
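
For illustration, here is a minimal sketch of what such sharing looks like from user space on Linux file systems with reflink support (Btrfs, XFS); the example paths and the direct ioctl call are assumptions for the sketch, not anything Git does today:

    # Sketch: share a file's data blocks copy-on-write instead of copying
    # the bytes. Requires a Linux file system with reflink support
    # (Btrfs, XFS); the example paths are hypothetical.
    import fcntl

    FICLONE = 0x40049409  # from linux/fs.h: _IOW(0x94, 9, int)

    def reflink(source: str, destination: str) -> None:
        """Make destination share source's extents; writes to either
        side are copy-on-write and leave the other side untouched."""
        with open(source, "rb") as src, open(destination, "wb") as dst:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())

    # e.g. materializing a checkout from an uncompressed object:
    # reflink(".git/objects/bare/ab/cdef...", "photos/holiday.jpg")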

First steps

Currently, Git has two ways of storing an object: regular ("loose") objects (one file per object under .git/objects/, prefixed with type and length and then zlib-compressed), and pack files (which can contain more than one object and apply delta compression).

Both forms of compression perform poorly on binary blobs, especially when the data is already compressed, as media files typically are.
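
For reference, here is a sketch (in Python, purely for illustration -- this is not Git's code) of what the loose object form does to a blob; the layout itself is how Git stores blobs with SHA-1:

    # Sketch of Git's loose object storage for a blob: a "blob <size>\0"
    # header is prepended, the result is SHA-1 hashed, zlib-compressed
    # and written to .git/objects/<first two hex chars>/<rest>.
    import hashlib
    import zlib

    def loose_object(data: bytes):
        store = b"blob %d\x00" % len(data) + data
        sha = hashlib.sha1(store).hexdigest()
        path = ".git/objects/%s/%s" % (sha[:2], sha[2:])
        return sha, zlib.compress(store), path

    # For media data that is already compressed, zlib.compress() mostly
    # burns CPU and can even make the stored object slightly larger.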

There should be a third kind of object, called "bare objects" here: a bare object is always of type blob and stored uncompressed, so its size can be read directly from the file system. All that bare objects would need is a distinct path in the hierarchy; I'll use .git/objects/bare/XX/YY in examples (where XX and YY are the first two characters and the rest of the object's hash, just as with regular objects).
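
With that layout (the bare/ directory is this proposal's invention, nothing in Git looks there today), locating a bare object and reading its size would be trivial; a sketch:

    # Sketch: path and size of a blob under the proposed bare object
    # layout. No header, no zlib -- the file's size is the blob's size.
    import os

    def bare_object_path(git_dir: str, blob_sha: str) -> str:
        return os.path.join(git_dir, "objects", "bare",
                            blob_sha[:2], blob_sha[2:])

    def bare_object_size(git_dir: str, blob_sha: str):
        """Return the blob's size, or None if it is not stored bare."""
        try:
            return os.path.getsize(bare_object_path(git_dir, blob_sha))
        except FileNotFoundError:
            return None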

A first implementation would be relatively easy: it would just need to hook into hash lookup and storage, where the storage part needs a way of telling whether it makes sense to store a given blob as a bare object or not.
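
That "does it make sense" decision could be as simple as a size threshold combined with a quick compressibility test; a sketch, with arbitrary example numbers:

    # Sketch of a "store this blob bare?" heuristic: large blobs whose
    # data barely compresses (JPEG, video, ...) gain nothing from zlib.
    # Threshold and sample size are arbitrary illustration values.
    import zlib

    SIZE_THRESHOLD = 1024 * 1024  # small blobs stay regular objects
    SAMPLE_SIZE = 64 * 1024       # only compress a prefix, for speed

    def should_store_bare(data: bytes) -> bool:
        if len(data) < SIZE_THRESHOLD:
            return False
        sample = data[:SAMPLE_SIZE]
        ratio = len(zlib.compress(sample)) / len(sample)
        return ratio > 0.95  # hardly compressible: keep it uncompressed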

Optimizations

The simple implementation outlined above would not yield any savings yet; it further needs:

Better handling of sparse / narrow checkouts

Git already knows sparse checkouts (documented in git-read-tree), but those only affect which files are copied into the checkout. Once the above is implemented, that kind of sparse checkout no longer saves any disk space -- instead, the goal becomes not fetching currently unneeded large objects at all. Such checkouts have been called "narrow clones"; I'll stick to that term here.

Git already allows clones without history (shallow clones, created with --depth), and since version 1.9 they are not as limited as they used to be.

There, to avoid every part of Git having to deal with the resulting dead ends, commits that would reference a nonexistent commit are grafted to hide their ancestry. The same can be achieved using the git-replace mechanism, where the historically correct parents that are not actually available are removed from the commit in a kind of overlay.

(Even though git-replace has a --graft option that does this, the grafts created by --depth are different beasts, as discussed on Stack Overflow.)
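
To make the mechanism concrete, here is a sketch of hiding one blob behind a small placeholder blob; the placeholder content is made up, and the sketch assumes the original blob is still present locally (which it would not be in a real narrow clone):

    # Sketch: put a small placeholder blob into the replace overlay so
    # that lookups of a large blob see the placeholder instead.
    import subprocess

    def replace_with_placeholder(big_blob_sha: str) -> str:
        placeholder = subprocess.run(
            ["git", "hash-object", "-w", "--stdin"],
            input=b"not fetched in this narrow clone\n",
            capture_output=True, check=True,
        ).stdout.decode().strip()
        # refs/replace/<big_blob_sha> now points at the placeholder;
        # anything resolving objects through the overlay sees it instead.
        subprocess.run(["git", "replace", big_blob_sha, placeholder],
                       check=True)
        return placeholder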

Narrow clones could employ the replace mechanism on blobs that are known not to be of interest to the current clone. They could be split like this:

This topic has come up on the mailing list before:

Assorted notes

Incubator status

This primarily needs feedback from people familiar with the Git packing and wire formats, or more detailed research into those topics, or links to reports on similar attempts in that area.

--chrysn 2015-03-11

