git@vger.kernel.org mailing list mirror (one of many)
* [RFC] Improving compression in git network protocols
@ 2005-08-13 16:22 Sergey Vlasov
  2005-08-13 16:23 ` [RFC] [PATCH] Add "--compression-level=N" option to git-pack-objects Sergey Vlasov
From: Sergey Vlasov @ 2005-08-13 16:22 UTC
  To: git

Hello!

The git pack format has two uses:

1) A space-optimized format for local repository storage.

2) A compact format for transferring repository data over network.

However, these uses have conflicting requirements, and the current
pack format is not as well suited to network transfer as it could be.
In particular, because using pack files in a local repository requires
random access to the contained objects, every object in the pack is
compressed separately, which hurts the overall compression ratio.

I have made a patch which adds the "--compression-level=N" option to
git-pack-objects, and used it to see how much the compression ratio
improves if we compress the whole pack file instead of the individual
objects.  The patch is in a separate message; however, I'm not sure
whether it should be applied immediately: currently there is no
infrastructure for using this option, and we may choose to implement
the same idea in some different way.

Here are the results of my tests:

===========================================================================

1. Packing the whole linux-2.4 repository (13390 objects):

-rw-r--r--  1 vsu 159632105 Aug 13 17:12 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack
-rw-r--r--  1 vsu  30878501 Aug 13 17:16 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack.bz2
-rw-r--r--  1 vsu  38035157 Aug 13 17:14 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack.gz
-rw-r--r--  1 vsu  37739041 Aug 13 17:17 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack.gz-9
-rw-r--r--  1 vsu  43035924 Aug 13 17:12 pack-9-ef502d3d97088a8b1da4594fda438c268dc5c692.pack
-rw-r--r--  1 vsu  43288931 Aug 13 17:11 pack-default-ef502d3d97088a8b1da4594fda438c268dc5c692.pack

The "pack-default-*" file is made without the --compression-level
option; the "pack-0-*" and "pack-9-*" files are made with level 0 (no
compression) and 9 (max compression) respectively.  From this we can
see:

- Using maximum compression for objects in the pack provides little
  benefit - about 0.6%.
  
- Creating a pack with uncompressed objects and compressing it with
  gzip gives a 12% improvement over the pack with compressed objects.
  Using "gzip -9" at this stage gives about 0.7% more compression.

- For offline compression, bzip2 could be used instead of gzip:
  the pack compressed with bzip2 is 28% smaller than the pack with
  zlib-compressed objects.

2. Packing the whole linux-2.6 repository (67111 objects):

-rw-r--r--  1 vsu 232977930 Aug 13 17:47 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack
-rw-r--r--  1 vsu  49245784 Aug 13 17:52 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack.bz2
-rw-r--r--  1 vsu  59767656 Aug 13 17:49 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack.gz
-rw-r--r--  1 vsu  59323808 Aug 13 17:50 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack.gz-9
-rw-r--r--  1 vsu  70067732 Aug 13 17:45 pack-9-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack
-rw-r--r--  1 vsu  70415173 Aug 13 17:43 pack-default-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack

- Again, --compression-level=9 does not help much - only 0.5%
  reduction.

- Using gzip on the pack with uncompressed objects gives 15%
  improvement over the pack with compressed objects; "gzip -9" does
  not help much.

- The pack with uncompressed objects compressed with bzip2 is 30%
  smaller than the pack with zlib-compressed objects.

3. Creating an incremental pack for the linux-2.6 repository (743
objects):

-rw-r--r--  1 vsu 4270645 Aug 13 17:54 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack
-rw-r--r--  1 vsu 1068277 Aug 13 17:56 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack.bz2
-rw-r--r--  1 vsu 1221308 Aug 13 17:56 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack.gz
-rw-r--r--  1 vsu 1214597 Aug 13 17:56 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack.gz-9
-rw-r--r--  1 vsu 1314817 Aug 13 17:55 pack-9-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack
-rw-r--r--  1 vsu 1319322 Aug 13 17:54 pack-default-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack

- Once again, --compression-level=9 is next to useless - 0.3%
  improvement.

- The pack with uncompressed objects compressed with gzip is 7%
  smaller than the pack with zlib-compressed objects.

- The pack with uncompressed objects compressed with bzip2 is 19%
  smaller than the pack with zlib-compressed objects.

===========================================================================
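
The percentages quoted in the results can be re-derived from the listed
file sizes.  Below is a minimal sketch of that arithmetic; the helper
name reduction_percent is made up for this illustration and is not part
of git, and the constants are simply the byte counts from the linux-2.4
listing (test 1).

```c
#include <assert.h>

/*
 * Percentage by which `after` is smaller than `before`.
 * Hypothetical helper for illustration only; not part of git.
 */
double reduction_percent(unsigned long before, unsigned long after)
{
	assert(before > 0);	/* avoid dividing by zero */
	return 100.0 * ((double)before - (double)after) / (double)before;
}

/* Sizes in bytes, copied from the linux-2.4 listing (test 1). */
const unsigned long pack_default_size = 43288931UL;	/* pack-default-*.pack */
const unsigned long pack_0_gz_size    = 38035157UL;	/* pack-0-*.pack.gz    */
const unsigned long pack_0_bz2_size   = 30878501UL;	/* pack-0-*.pack.bz2   */
```

With these inputs, reduction_percent(pack_default_size, pack_0_gz_size)
comes out at roughly 12.1 and reduction_percent(pack_default_size,
pack_0_bz2_size) at roughly 28.7, in line with the figures quoted for
test 1.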

As you see, compressing the pack as a whole can give noticeable
improvements (less on smaller files, more on bigger files).  Now we
need to find a way to use this:

- For methods which use git tools on both ends (git-clone-pack,
  git-ssh-pull, git-daemon) we could just create pipes to gzip/gunzip
  in the appropriate places.

- For non-git-aware methods (rsync, http) we can still use these
  improvements, but there are additional complications because of the
  pack index file.  We could keep globally-compressed pack files in a
  separate directory together with their index files, and write a
  utility which takes such a pack file with its index, recompresses
  all objects, and produces a pack file with compressed objects and a
  new index.  In theory, we could reconstruct the index from the pack
  file alone, but this procedure may be expensive (it would need to
  reconstruct all objects represented by deltas to find their hash
  values).

BTW, it might be possible to improve the global compression even more
by optimizing the order of objects in the pack file (currently trees
and blobs seem to be intermixed).  I have not tried this yet, however.
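
One way the reordering idea could be tried is to group the object list
by type before writing, keeping the existing relative order within each
type.  The sketch below only illustrates that grouping step: the struct,
the enum, and the function names are hypothetical simplifications, not
the real definitions from pack-objects.c, and whether such an order
actually helps gzip or bzip2 is untested.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified object types; not git's real definitions. */
enum sketch_type { SKETCH_COMMIT = 1, SKETCH_TREE, SKETCH_BLOB, SKETCH_TAG };

/* Hypothetical stand-in for an entry in the list of objects to pack. */
struct sketch_entry {
	enum sketch_type type;
	unsigned int pos;	/* position in the original object list */
};

/* qsort comparator: order by type, then by original position. */
int sketch_type_cmp(const void *a, const void *b)
{
	const struct sketch_entry *x = a, *y = b;

	if (x->type != y->type)
		return x->type < y->type ? -1 : 1;
	if (x->pos != y->pos)
		return x->pos < y->pos ? -1 : 1;
	return 0;
}

/* Rearrange entries so that objects of the same type are adjacent. */
void sketch_group_by_type(struct sketch_entry *entries, size_t nr)
{
	assert(entries != NULL || nr == 0);
	if (nr > 1)
		qsort(entries, nr, sizeof(*entries), sketch_type_cmp);
}
```

The original position is used as a tie-breaker because qsort() is not
guaranteed to be stable; without it the order inside each type group
would be unspecified.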

-- 
Sergey Vlasov


* [RFC] [PATCH] Add "--compression-level=N" option to git-pack-objects
  2005-08-13 16:22 [RFC] Improving compression in git network protocols Sergey Vlasov
@ 2005-08-13 16:23 ` Sergey Vlasov
From: Sergey Vlasov @ 2005-08-13 16:23 UTC
  To: git

[PATCH] Add "--compression-level=N" option to git-pack-objects

Setting the compression level for objects in the pack is useful in some
cases; in particular, disabling compression of the individual objects and
then compressing the whole pack can improve the overall compression ratio.

Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
---

 Documentation/git-pack-objects.txt |   14 +++++++++++++-
 csum-file.c                        |    5 +++--
 csum-file.h                        |    2 +-
 pack-objects.c                     |   15 +++++++++++++--
 4 files changed, 30 insertions(+), 6 deletions(-)

e54e13c2f057e71998ed39299545717311c7678d
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -9,7 +9,7 @@ git-pack-objects - Create a packed archi
 
 SYNOPSIS
 --------
-'git-pack-objects' [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list
+'git-pack-objects' [--incremental] [--window=N] [--depth=N] [--compression-level=N] {--stdout | base-name} < object-list
 
 
 DESCRIPTION
@@ -61,6 +61,18 @@ base-name::
 	side, because delta data needs to be applied that many
 	times to get to the necessary object.
 
+--compression-level::
+	Set zlib compression level for the object data.  The
+	compression level is a number from 0 to 9, where 0 means
+	no compression, 1 indicates the fastest compression
+	method (less compression), and 9 indicates the slowest
+	compression method (best compression).  The default
+	compression level is 6.  Setting the compression level
+	to 0 may be useful when the pack will be compressed as a
+	whole at a later stage (in the pack format every object
+	is compressed separately to allow random access, which
+	is less efficient than compressing the whole file).
+
 --incremental::
 	This flag causes an object already in a pack ignored
 	even if it appears in the standard input.
diff --git a/csum-file.c b/csum-file.c
--- a/csum-file.c
+++ b/csum-file.c
@@ -117,14 +117,15 @@ struct sha1file *sha1fd(int fd, const ch
 	return f;
 }
 
-int sha1write_compressed(struct sha1file *f, void *in, unsigned int size)
+int sha1write_compressed(struct sha1file *f, void *in, unsigned int size,
+			 int compression_level)
 {
 	z_stream stream;
 	unsigned long maxsize;
 	void *out;
 
 	memset(&stream, 0, sizeof(stream));
-	deflateInit(&stream, Z_DEFAULT_COMPRESSION);
+	deflateInit(&stream, compression_level);
 	maxsize = deflateBound(&stream, size);
 	out = xmalloc(maxsize);
 
diff --git a/csum-file.h b/csum-file.h
--- a/csum-file.h
+++ b/csum-file.h
@@ -14,6 +14,6 @@ extern struct sha1file *sha1fd(int fd, c
 extern struct sha1file *sha1create(const char *fmt, ...) __attribute__((format (printf, 1, 2)));
 extern int sha1close(struct sha1file *, unsigned char *, int);
 extern int sha1write(struct sha1file *, void *, unsigned int);
-extern int sha1write_compressed(struct sha1file *, void *, unsigned int);
+extern int sha1write_compressed(struct sha1file *, void *, unsigned int, int);
 
 #endif
diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -5,7 +5,7 @@
 #include "pack.h"
 #include "csum-file.h"
 
-static const char pack_usage[] = "git-pack-objects [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list";
+static const char pack_usage[] = "git-pack-objects [--incremental] [--window=N] [--depth=N] [--compression-level=N] {--stdout | base-name} < object-list";
 
 struct object_entry {
 	unsigned char sha1[20];
@@ -21,6 +21,7 @@ struct object_entry {
 static unsigned char object_list_sha1[20];
 static int non_empty = 0;
 static int incremental = 0;
+static int compression_level = Z_DEFAULT_COMPRESSION;
 static struct object_entry **sorted_by_sha, **sorted_by_type;
 static struct object_entry *objects = NULL;
 static int nr_objects = 0, nr_alloc = 0;
@@ -103,7 +104,7 @@ static unsigned long write_object(struct
 		sha1write(f, entry->delta, 20);
 		hdrlen += 20;
 	}
-	datalen = sha1write_compressed(f, buf, size);
+	datalen = sha1write_compressed(f, buf, size, compression_level);
 	free(buf);
 	return hdrlen + datalen;
 }
@@ -421,6 +422,16 @@ int main(int argc, char **argv)
 					usage(pack_usage);
 				continue;
 			}
+			if (!strncmp("--compression-level=", arg, 20)) {
+				char *end;
+				compression_level = strtoul(arg+20, &end, 0);
+				if (!arg[20] || *end)
+					usage(pack_usage);
+				if (compression_level < Z_NO_COMPRESSION
+				    || compression_level > Z_BEST_COMPRESSION)
+					die("invalid compression level");
+				continue;
+			}
 			if (!strcmp("--stdout", arg)) {
 				pack_to_stdout = 1;
 				continue;
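
A note on the option parsing in the last hunk: the value returned by
strtoul() is an unsigned long, but it is stored in an int before the
range check.  On platforms where unsigned long is wider than int, a very
large value might therefore end up looking like a valid level after the
conversion.  The following is a minimal standalone sketch of a stricter
parser, not a replacement for the patch: the function name and the
SKETCH_* macros are made up for this illustration, and it accepts
decimal input only, whereas the patch passes base 0 to strtoul().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * zlib defines Z_NO_COMPRESSION as 0 and Z_BEST_COMPRESSION as 9;
 * the values are spelled out so this sketch does not need zlib.h.
 */
#define SKETCH_MIN_LEVEL 0
#define SKETCH_MAX_LEVEL 9

/*
 * Parse "--compression-level=N" (hypothetical helper, not part of git).
 * Returns 0 and stores N in *level on success.  Returns -1 if arg is a
 * different option, or if the value is empty, has trailing junk, or is
 * outside the 0..9 range; *level is left untouched in that case.
 */
int sketch_parse_compression_level(const char *arg, int *level)
{
	static const char prefix[] = "--compression-level=";
	const size_t prefix_len = sizeof(prefix) - 1;
	const char *value;
	char *end;
	long n;

	assert(arg != NULL && level != NULL);
	if (strncmp(arg, prefix, prefix_len))
		return -1;
	value = arg + prefix_len;
	if (!*value)
		return -1;

	errno = 0;
	n = strtol(value, &end, 10);
	if (errno || end == value || *end)
		return -1;
	/* Check the range on the wide type, before narrowing to int. */
	if (n < SKETCH_MIN_LEVEL || n > SKETCH_MAX_LEVEL)
		return -1;
	*level = (int)n;
	return 0;
}
```

The key difference from the hunk above is that the range check happens
on the long value, so nothing is narrowed to int until it is known to
fit.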
