From: "C. Scott Ananian" <cscott@cscott.net>
To: git@vger.kernel.org
Subject: space compression (again)
Date: Fri, 15 Apr 2005 13:19:30 -0400 (EDT) [thread overview]
Message-ID: <Pine.LNX.4.61.0504151232160.27637@cag.csail.mit.edu> (raw)
I've been reading the archives (a bad idea, I know). Here's a concrete
suggestion for GIT space-compression which is (I believe) consistent with
the philosophy of GIT.
Why are blobs per-file? [After all, Linus insists that files are an
illusion.] Why not just have 'chunks', and assemble *these*
into blobs (read, 'files')? A good chunk size would fit evenly into some
number of disk blocks (no wasted space!).
We already have the rsync algorithm, which can scan through a file and
efficiently tell which existing chunks match (portions of) it, using a
rolling checksum. (Here's a refresher:
http://samba.anu.edu.au/rsync/tech_report/node2.html
). Why not treat the 'chunk' as the fundamental unit, and compose files
from chunks?
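The rolling property is what makes the scan cheap: the weak checksum of a window can be updated in O(1) as the window slides one byte, instead of being recomputed from scratch. A minimal Python sketch of the two-part sum described in the rsync tech report (function names are mine, for illustration):

```python
# Sketch of rsync's weak rolling checksum: a = sum of bytes,
# b = sum of partial sums, both mod 2^16, packed into one int.

M = 1 << 16

def weak_checksum(block: bytes) -> int:
    """Compute the checksum of a window from scratch."""
    a = b = 0
    n = len(block)
    for i, byte in enumerate(block):
        a = (a + byte) % M
        b = (b + (n - i) * byte) % M   # earlier bytes get larger weights
    return (b << 16) | a

def roll(checksum: int, out_byte: int, in_byte: int, block_len: int) -> int:
    """Slide the window one byte right: drop out_byte, add in_byte, O(1)."""
    a = checksum & 0xFFFF
    b = checksum >> 16
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return (b << 16) | a
```

Sliding with `roll` gives bit-identical results to recomputing `weak_checksum` on the shifted window, which is what lets the encoder test every byte offset cheaply.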
This should give better space utilization: a small change to file X
will only require storage for the changed chunk, plus metadata
describing the chunks composing the new file. I propose keeping this only
one level deep: we can only specify chunks, not pieces of files.
Unlike xdelta schemes, there is no 'file' dependency. Chunks for a blob
can be and are shared among *all the other files and versions in the
repository*. Moving pieces from file 'a' to file 'b' "just works".
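Concretely, a blob becomes an ordered list of chunk hashes over a single repository-wide chunk store. A hypothetical sketch with fixed-size chunks (a real implementation would pick boundaries with the rolling checksum; all names here are illustrative):

```python
# Sketch of the proposal: blobs are recipes (ordered chunk-hash lists),
# and chunks are deduplicated across every file and version.

import hashlib

CHUNK_SIZE = 4096  # one page: "fits evenly into some number of disk blocks"

store = {}  # chunk hash -> chunk bytes, shared by the whole repository

def write_blob(data: bytes) -> list[str]:
    """Split a blob into chunks, store only new ones, return the recipe."""
    recipe = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        h = hashlib.sha1(chunk).hexdigest()
        store.setdefault(h, chunk)  # a repeated chunk costs no new space
        recipe.append(h)
    return recipe

def read_blob(recipe: list[str]) -> bytes:
    """Reassemble a blob from its chunk list."""
    return b"".join(store[h] for h in recipe)
```

Because the store is global, a chunk that moves from file 'a' to file 'b' (or recurs across versions) is stored exactly once and merely referenced again.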
Best of all, I believe this can be done in a completely layered fashion.
From git's perspective, it's still 'open this blob' or 'write this blob'.
It just turns out that the filesystem representation of a blob is slightly
more fragmented. Even better, you ought to be able to convert your
on-disk store from one representation to the other: the named blob doesn't
change, just 'how to fetch the blob' changes. So, for example, Linus'
tree can be unchunked for speed, but the release tree (say) can pull
pruned history from Linus into a chunked on-disk representation that can
be efficiently wget'ted (only new chunks need be transferred).
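The invariant behind that layering is that a blob's name depends only on its content, never on how the bytes sit on disk, so whole and chunked layouts are freely interchangeable. A small sketch, assuming a git-style `blob <len>\0` header for naming (an assumption for illustration):

```python
# A blob named from its full content hashes the same whether the store
# keeps it whole or reassembles it from chunks.

import hashlib

def blob_name(data: bytes) -> str:
    """Name a blob by content, git-style header assumed for illustration."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

def chunked(data: bytes, size: int = 4096) -> list[bytes]:
    """A chunked on-disk layout of the same bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

data = b"x" * 10000
assert blob_name(data) == blob_name(b"".join(chunked(data)))  # same blob
```

So converting a store between representations is a purely local operation; every tree and commit referring to the blob by name is untouched.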
My first concern is possible fragmentation: would we end up with a large
number of very small chunks, and end up representing files as a list of
lines (effectively)? Maybe someone can think of an effective coalescing
strategy, or maybe it is sufficient just to avoid creating chunks smaller
than a certain size (i.e., possibly writing redundant data to a new chunk
just to improve the likelihood of reuse).
I'm also not sure what the best 'chunk' size is. Smaller chunks save more
space but cost more to access (# of disk seeks per file/blob). Picking a
chunk half the average file size should reduce space by ~50% while only
requiring ~2 additional seeks per file-read. OTOH, rsync experience
suggests 500-1000 byte chunk sizes. Probably empirical testing is best.
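Some back-of-envelope arithmetic frames the tradeoff. Assuming 20-byte SHA-1 chunk references and a made-up 20 KB average file size, the per-file metadata overhead and chunk count for a few candidate sizes:

```python
# Back-of-envelope numbers (assumed, not measured): chunks per file and
# metadata overhead for several candidate chunk sizes.

HASH_LEN = 20          # bytes per SHA-1 chunk reference
AVG_FILE = 20 * 1024   # assumed average file size in the repository

for chunk in (512, 1024, 4096, AVG_FILE // 2):
    chunks_per_file = -(-AVG_FILE // chunk)   # ceiling division
    overhead = chunks_per_file * HASH_LEN / AVG_FILE
    print(f"chunk={chunk:6d}  chunks/file={chunks_per_file:3d}  "
          f"metadata overhead={overhead:.1%}")
```

Under these assumptions, rsync-sized 512-byte chunks cost 40 references (and potentially 40 seeks) per 20 KB file, while half-file chunks cost 2; the metadata itself stays under a few percent either way, so seeks dominate.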
Lastly, we want to avoid hitting the dcache to check the existence of
chunks while encoding. In a large repository, there will be a very large
number of chunks. We don't *have* to index all of them, but our
compression gets better the more chunks we know about. The rsync
algorithm creates hash tables of chunks at different levels of granularity
to avoid doing a full check at every byte of the input file. How large
should this cached-on-disk chunk hash table be to avoid saturating it as
the repository grows (maybe the standard grow-as-you-go hash table is
fine; you only need one bit per entry anyway)?
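The one-bit-per-entry table is essentially a presence filter: a clear bit means the chunk is definitely new (no disk lookup needed), a set bit means "maybe seen" and forces a full check. A sketch with a fixed size (growing it would mean rehashing from the full on-disk chunk index; all names are illustrative):

```python
# One bit per slot: false positives only cost an extra full lookup,
# false negatives never happen, so new chunks usually skip the disk.

class ChunkFilter:
    def __init__(self, bits: int = 1 << 20):
        self.bits = bits
        self.table = bytearray(bits // 8)

    def add(self, checksum: int) -> None:
        i = checksum % self.bits
        self.table[i >> 3] |= 1 << (i & 7)

    def might_contain(self, checksum: int) -> bool:
        i = checksum % self.bits
        return bool(self.table[i >> 3] & (1 << (i & 7)))
```

At one bit per slot, even a filter for millions of chunks fits in a few hundred kilobytes, and the false-positive rate stays low as long as the table is resized before it saturates.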
Thoughts? Is the constant-factor overhead of indirection-per-blob going
to kill git's overwhelming speed?
--scott
JUBILIST explosion MKULTRA HTAUTOMAT Indonesia Shoal Bay RUCKUS ammunition
GPFLOOR Hager SDI MKDELTA KUBARK Dictionary Soviet BLUEBIRD Delta Force
( http://cscott.net/ )