git@vger.kernel.org mailing list mirror (one of many)
From: "Shawn O. Pearce" <spearce@spearce.org>
To: Steven Grimm <koreth@midwinter.com>
Cc: Junio C Hamano <junkio@cox.net>,
	Daniel Barkalow <barkalow@iabervon.org>,
	Theodore Ts'o <tytso@mit.edu>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: [PATCH] Add --no-reuse-delta option to git-gc
Date: Wed, 9 May 2007 15:10:52 -0400
Message-ID: <20070509191052.GD3141@spearce.org>
In-Reply-To: <46418E24.9020309@midwinter.com>

Steven Grimm <koreth@midwinter.com> wrote:
> On that note, has any thought been given to looking at other compression 
> algorithms? Gzip is a great high-speed compressor, but there are others 
> out there (some a bit slower, some much slower at both compression and 
> decompression) that produce substantially smaller output.

It's been discussed once before on the list, in very recent history,
but not at any length.  As Junio pointed out, I don't think there
ever really was any discussion of whether gzip is the best way to
deflate the objects.  I think gzip was chosen simply because it was
readily available in libz, stable, and has a pretty decent speed/size
ratio.
 
> I think it'd be kind of neat to have my .git directory shrink by another 
> 20+%. That's conservative; on maximumcompression.com's test of a mix of 
> different file types including images, gzip compresses 64% and the 
> best-scoring one does 80%. On English text gzip does 71% and the top 
> scorer does 89%. Most of the top-tier compressors are proprietary, but 

Yes.  But in many cases we might actually be able to do even better
by going with a pack-wide dictionary.  Why?

Think about source code structure.  E.g.

  $ git grep --cached 'struct object'| cut -d: -f1|wc -l
     402

So 402 files in git.git use the term 'struct object', and that's just
the current revision I had in my index.  With our current packfile
organization we are likely to store this string at least 402 times.
We'll store it once in each file's delta chain, assuming each
file's blobs largely fall into a single delta chain for that file
(reasonable assumption, but certainly not always true).

That's just one string that does appear somewhat frequently in any
file it's used in.  Now try 'unsigned char' (it's 944 files, but with
an even higher frequency-per-file).
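
To make the dictionary idea concrete, here is a minimal sketch using
zlib's preset-dictionary support through Python's zlib module.  It is
not the pack v4 code or any existing git format; the three blobs, the
shared_dict contents, and the deflate/inflate helper names are all
made up for illustration.  It only shows that several small blobs
sharing common strings can deflate smaller when they are all primed
with one shared dictionary.

```python
import zlib

# Hypothetical stand-ins for blobs from three different files; these are
# illustrative strings, not real git.git contents.
blobs = [
    b"struct object *parse_object(const unsigned char *sha1);\n",
    b"static struct object *lookup_object(const unsigned char *sha1)\n",
    b"int mark_object(struct object *obj, unsigned char *sha1)\n",
]

# Assumed pack-wide dictionary holding strings common to many files.
shared_dict = b"const unsigned char *sha1 struct object *"


def deflate(data, zdict=None):
    """Compress one blob, optionally priming zlib with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(9)
    else:
        comp = zlib.compressobj(9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def inflate(data, zdict=None):
    """Decompress one blob; the same dictionary must be supplied again."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()


# Total compressed size when each blob is deflated on its own,
# versus when every blob is primed with the shared dictionary.
plain_total = sum(len(deflate(b)) for b in blobs)
dict_total = sum(len(deflate(b, shared_dict)) for b in blobs)
print("without dictionary:", plain_total, "bytes")
print("with dictionary:   ", dict_total, "bytes")
```

Note that the dictionary itself has to be stored once somewhere (in
this scheme, once per pack), so the win only shows up when it is
amortized over many objects.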

So anyway, for the past year I've been thinking about trying to
implement a blob-level dictionary prototype to see if it helps on a
project like linux-2.6.git, but I haven't gotten to it.  The pack v4
work was about applying that basic dictionary principle to trees
and commits, and I think it pays off nicely there.  Just need to
get it cleaned up, rebased onto current master, and submitted to
the list for wider testing.  ;-)

-- 
Shawn.

Thread overview: 40+ messages
2007-05-08  2:54 [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc Theodore Ts'o
2007-05-08  3:13 ` Nicolas Pitre
2007-05-08  3:21   ` Theodore Tso
2007-05-08  3:38     ` Dana How
2007-05-08  4:43     ` Junio C Hamano
2007-05-08 13:46       ` Nicolas Pitre
2007-05-08 13:28 ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Theodore Ts'o
2007-05-08 13:28   ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Theodore Ts'o
2007-05-08 13:28     ` [PATCH] Add --no-reuse-delta option to git-gc Theodore Ts'o
2007-05-08 15:35       ` Nicolas Pitre
2007-05-09  5:05       ` Daniel Barkalow
2007-05-09  8:15         ` Junio C Hamano
2007-05-09  9:02           ` Steven Grimm
2007-05-09 11:35             ` Other compression?, was " Johannes Schindelin
2007-05-09 15:15             ` Junio C Hamano
2007-05-09 19:10             ` Shawn O. Pearce [this message]
2007-06-10  7:40               ` Sam Vilain
2007-06-11  1:51                 ` Nicolas Pitre
2007-06-11  6:20                   ` Steven Grimm
2007-06-11  6:31                     ` Shawn O. Pearce
2007-06-11 10:20                   ` Johannes Schindelin
2007-06-11 14:01                     ` Nicolas Pitre
2007-06-11 21:40                       ` Johannes Schindelin
2007-05-09 19:48           ` [PATCH] Add --aggressive option to 'git gc' Theodore Tso
2007-05-09 20:19             ` Junio C Hamano
2007-05-09 22:22               ` Theodore Tso
2007-05-10  7:38             ` Junio C Hamano
2007-05-08 15:38     ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Nicolas Pitre
2007-05-08 16:30       ` Theodore Tso
2007-05-08 16:49         ` Johannes Schindelin
2007-05-08 18:09           ` Theodore Tso
2007-05-08 18:46             ` Nicolas Pitre
2007-05-09 13:49               ` Theodore Tso
2007-05-09 14:17                 ` Johannes Schindelin
2007-05-08 17:07         ` Dana How
2007-05-08 17:35         ` Nicolas Pitre
2007-05-09  5:03           ` Junio C Hamano
2007-05-08 15:30   ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Nicolas Pitre
2007-05-08 21:12     ` Junio C Hamano
2007-05-08 23:59       ` Nicolas Pitre