space compression (again) - C. Scott Ananian

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: "C. Scott Ananian" <cscott@cscott.net>
To: git@vger.kernel.org
Subject: space compression (again)
Date: Fri, 15 Apr 2005 13:19:30 -0400 (EDT)	[thread overview]
Message-ID: <Pine.LNX.4.61.0504151232160.27637@cag.csail.mit.edu> (raw)

I've been reading the archives (a bad idea, I know).  Here's a concrete 
suggestion for GIT space-compression which is (I believe) consistent with 
the philosophy of GIT.

Why are blobs per-file?  [After all, Linus insists that files are an 
illusion.]  Why not just have 'chunks', and assemble *these* 
into blobs (read, 'files')?  A good chunk size would fit evenly into some 
number of disk blocks (no wasted space!).

We already have the rsync algorithm which can scan through a file and 
efficiently tell which existing chunks match (portions of) it, using a 
rolling checksum. (Here's a refresher:
    http://samba.anu.edu.au/rsync/tech_report/node2.html
).  Why not treat the 'chunk' as the fundamental unit, and compose files 
from chunks?

This should get better space utilization: a small change to file X 
will only require storage to save the changed chunk, plus meta data to 
describe the chunks composing the new file.  I propose keeping this only 
one-level deep: we can only specify chunks, not pieces of files.

Unlike xdelta schemes, there is no 'file' dependency.  Chunks for a blob 
can be and are shared among *all the other files and versions in the 
repository*.  Moving pieces from file 'a' to file 'b' "just works".

Best of all, I believe this can be done in a completely layered fashion. 
From git's perspective, it's still 'open this blob' or 'write this blob'. 
It just turns out that the filesystem representation of a blob is slightly 
more fragmented.  Even better, you ought to be able to convert your 
on-disk store from one representation to the other: the named blob doesn't 
change, just 'how to fetch the blob' changes.  So, for example, Linus' 
tree can be unchunked for speed, but the release tree (say) can pull 
pruned history from Linus into a chunked on-disk representation that can 
be efficiently wget'ted (only new chunks need be transferred).

My first concern is possible fragmentation: would we end up with a large 
number of very small chunks, and end up representing files as a list of 
lines (effectively)?  Maybe someone can think of an effective coalescing 
strategy, or maybe it is sufficient just to avoid creating chunks smaller 
than a certain size (ie, possibly writing redundant data to a new chunk, 
just to improve the possibility of reuse).

I'm also not sure what the best 'chunk' size is.  Smaller chunks save more 
space but cost more to access (# of disk seeks per file/blob).  Picking a 
chunk half the average file size should reduce space by ~50% while only 
requiring ~2 additional seeks per file-read. OTOH, rsync experience 
suggests 500-1000 byte chunk sizes.  Probably empirical testing is best.

Lastly, we want to avoid hitting the dcache to check the existence of 
chunks while encoding.  In a large repository, there will be a very large 
number of chunks.  We don't *have* to index all of them, but our 
compression gets better the more chunks we know about.  The rsync 
algorithm creates hash tables of chunks at different levels of granularity 
to avoid doing a full check at every byte of the input file.  How large 
should this cached-on-disk chunk hash table be to avoid saturating it as 
the repository grows (maybe the standard grow-as-you-go hash table is 
fine; you only need one bit per entry anyway)?

Thoughts?  Is the constant-factor overhead of indirection-per-blob going 
to kill git's overwhelming speed?
  --scott

JUBILIST explosion MKULTRA HTAUTOMAT Indonesia Shoal Bay RUCKUS ammunition 
GPFLOOR Hager SDI MKDELTA KUBARK Dictionary Soviet  BLUEBIRD Delta Force
                          ( http://cscott.net/ )

next             reply	other threads:[~2005-04-15 17:16 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-15 17:19 C. Scott Ananian [this message]
2005-04-15 18:34 ` space compression (again) Linus Torvalds
2005-04-15 18:45   ` C. Scott Ananian
2005-04-15 19:00     ` Derek Fawcus
2005-04-15 19:11     ` Linus Torvalds
2005-04-16 14:39       ` Martin Uecker
2005-04-16 15:11         ` C. Scott Ananian
2005-04-16 17:37           ` Martin Uecker
2005-04-19 12:39             ` Martin Uecker
2005-04-15 18:50 ` Derek Fawcus
  -- strict thread matches above, loose matches on Subject: below --
2005-04-15 19:33 Ray Heasman
2005-04-16 12:29 ` David Lang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.61.0504151232160.27637@cag.csail.mit.edu \
    --to=cscott@cscott.net \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).