* [RFC] Improving compression in git network protocols
@ 2005-08-13 16:22 Sergey Vlasov
2005-08-13 16:23 ` [RFC] [PATCH] Add "--compression-level=N" option to git-pack-objects Sergey Vlasov
From: Sergey Vlasov @ 2005-08-13 16:22 UTC
To: git
Hello!
The git pack format has two uses:
1) A space-optimized format for local repository storage.
2) A compact format for transferring repository data over network.
However, these two uses have conflicting requirements, and the pack
format is currently not as well suited to network transfer as it could
be. In particular, because using pack files in a local repository
requires random access to the contained objects, every object in the
pack is compressed separately, which hurts the compression ratio.
I have made a patch which adds the "--compression-level=N" option to
git-pack-objects, and measured how much the compression ratio improves
if we compress the whole pack file instead of individual objects. The
patch is in a separate message; however, I'm not sure it should be
applied immediately: there is currently no infrastructure for using
this option, and we may choose to implement the same idea in a
different way.
Here are the results of my tests:
===========================================================================
1. Packing the whole linux-2.4 repository (13390 objects):
-rw-r--r-- 1 vsu 159632105 Aug 13 17:12 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack
-rw-r--r-- 1 vsu 30878501 Aug 13 17:16 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack.bz2
-rw-r--r-- 1 vsu 38035157 Aug 13 17:14 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack.gz
-rw-r--r-- 1 vsu 37739041 Aug 13 17:17 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack.gz-9
-rw-r--r-- 1 vsu 43035924 Aug 13 17:12 pack-9-ef502d3d97088a8b1da4594fda438c268dc5c692.pack
-rw-r--r-- 1 vsu 43288931 Aug 13 17:11 pack-default-ef502d3d97088a8b1da4594fda438c268dc5c692.pack
The "pack-default-*" file is made without the --compression-level
option; the "pack-0-*" and "pack-9-*" files are made with level 0 (no
compression) and 9 (max compression) respectively. From this we can
see:
- Using maximum compression for objects in the pack provides little
benefit - about 0.6%.
- Creating a pack with uncompressed objects and compressing it with
gzip gives a 12% improvement over the pack with compressed objects.
Using "gzip -9" at this stage gives about 0.7% more compression.
- For offline compression, bzip2 could be used instead of gzip -
the pack compressed with bzip2 is about 29% smaller than the pack
with zlib-compressed objects.
2. Packing the whole linux-2.6 repository (67111 objects):
-rw-r--r-- 1 vsu 232977930 Aug 13 17:47 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack
-rw-r--r-- 1 vsu 49245784 Aug 13 17:52 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack.bz2
-rw-r--r-- 1 vsu 59767656 Aug 13 17:49 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack.gz
-rw-r--r-- 1 vsu 59323808 Aug 13 17:50 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack.gz-9
-rw-r--r-- 1 vsu 70067732 Aug 13 17:45 pack-9-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack
-rw-r--r-- 1 vsu 70415173 Aug 13 17:43 pack-default-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack
- Again, --compression-level=9 does not help much - only 0.5%
reduction.
- Using gzip on the pack with uncompressed objects gives 15%
improvement over the pack with compressed objects; "gzip -9" does
not help much.
- The pack with uncompressed objects compressed with bzip2 is 30%
smaller than the pack with zlib-compressed objects.
3. Creating an incremental pack for the linux-2.6 repository (743
objects):
-rw-r--r-- 1 vsu 4270645 Aug 13 17:54 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack
-rw-r--r-- 1 vsu 1068277 Aug 13 17:56 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack.bz2
-rw-r--r-- 1 vsu 1221308 Aug 13 17:56 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack.gz
-rw-r--r-- 1 vsu 1214597 Aug 13 17:56 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack.gz-9
-rw-r--r-- 1 vsu 1314817 Aug 13 17:55 pack-9-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack
-rw-r--r-- 1 vsu 1319322 Aug 13 17:54 pack-default-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack
- Once again, --compression-level=9 is next to useless - 0.3%
improvement.
- The pack with uncompressed objects compressed with gzip is 7%
smaller than the pack with zlib-compressed objects.
- The pack with uncompressed objects compressed with bzip2 is 19%
smaller than the pack with zlib-compressed objects.
===========================================================================
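The percentages quoted in the three tests can be rechecked from the
listed file sizes. The following is only a sketch in C: the sizes are
copied by hand from the listings, and the names percent_smaller,
pack_sizes and print_savings are made up for illustration. Every
figure is relative to the "pack-default-*" file of the same test.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage by which `other` is smaller than `base` (sizes in bytes). */
double percent_smaller(unsigned long base, unsigned long other)
{
	assert(base > 0);
	return 100.0 * ((double)base - (double)other) / (double)base;
}

/* File sizes in bytes, copied from the "ls -l" listings. */
struct pack_sizes {
	const char *name;
	unsigned long pack_default; /* objects compressed at the default level */
	unsigned long pack_9;       /* objects compressed at level 9 */
	unsigned long gz;           /* level-0 pack compressed with gzip */
	unsigned long bz2;          /* level-0 pack compressed with bzip2 */
};

static const struct pack_sizes tests[] = {
	{ "linux-2.4 full",        43288931UL, 43035924UL, 38035157UL, 30878501UL },
	{ "linux-2.6 full",        70415173UL, 70067732UL, 59767656UL, 49245784UL },
	{ "linux-2.6 incremental",  1319322UL,  1314817UL,  1221308UL,  1068277UL },
};

/* Print one line per test with the savings relative to the default pack. */
void print_savings(void)
{
	size_t i;

	for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
		const struct pack_sizes *t = &tests[i];

		printf("%-22s level 9: %.1f%%  gzip: %.1f%%  bzip2: %.1f%%\n",
		       t->name,
		       percent_smaller(t->pack_default, t->pack_9),
		       percent_smaller(t->pack_default, t->gz),
		       percent_smaller(t->pack_default, t->bz2));
	}
}
```

Calling print_savings() from main() prints one line per test; the
"gzip -9" files can be checked the same way by passing their sizes to
percent_smaller().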
As you can see, compressing the pack as a whole can give noticeable
improvements (less on smaller files, more on bigger files). Now we
need to find a way to use this:
- For methods which use git tools on both ends (git-clone-pack,
git-ssh-pull, git-daemon) we could just create pipes to gzip/gunzip
in the appropriate places.
- For non-git-aware methods (rsync, http) we can still use these
improvements, but there are additional complications because of the
pack index file. We could keep globally-compressed pack files in a
separate directory together with their index files, and write a
utility which takes such a pack file with its index, recompresses all
objects and produces a pack file with compressed objects and a new
index. In theory, we could reconstruct the index from the pack file
alone, but this procedure may be expensive (it would need to
reconstruct all objects represented by deltas to find their hash
values).
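For the first case, the sending side only has to push the pack bytes
through an external compressor. The following is a rough sketch, not
code from git: write_through_filter is a hypothetical helper built on
POSIX popen(), and the command string is supplied by the caller. A
real implementation inside the git tools would more likely use
fork/exec with pipes around the existing socket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write `len` bytes from `buf` to the standard
 * input of the shell command `cmd`, e.g. "gzip -c > out.pack.gz".
 * Returns 0 if everything was written and the command exited with
 * status 0, and -1 otherwise.  If the command exits before reading
 * all of its input, the caller may receive SIGPIPE.
 */
int write_through_filter(const char *cmd, const void *buf, size_t len)
{
	FILE *p;
	int status;

	assert(cmd != NULL);
	p = popen(cmd, "w");
	if (!p)
		return -1;
	if (fwrite(buf, 1, len, p) != len) {
		pclose(p);
		return -1;
	}
	status = pclose(p); /* waits for the command to finish */
	return status == 0 ? 0 : -1;
}
```

From the shell the same idea is simply
"git-pack-objects --compression-level=0 --stdout < object-list | gzip",
with gunzip in front of the unpacking tool on the receiving side.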
BTW, it may be possible to improve the global compression even more
by optimizing the order of objects in the pack file (currently trees
and blobs seem to be intermixed). I have not tried this yet, however.
--
Sergey Vlasov
* [RFC] [PATCH] Add "--compression-level=N" option to git-pack-objects
2005-08-13 16:22 [RFC] Improving compression in git network protocols Sergey Vlasov
@ 2005-08-13 16:23 ` Sergey Vlasov
From: Sergey Vlasov @ 2005-08-13 16:23 UTC
To: git
[PATCH] Add "--compression-level=N" option to git-pack-objects
Setting the compression level for objects in the pack is useful in some
cases; in particular, disabling compression of the individual objects and
then compressing the whole pack can improve the overall compression ratio.
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
---
Documentation/git-pack-objects.txt | 14 +++++++++++++-
csum-file.c | 5 +++--
csum-file.h | 2 +-
pack-objects.c | 15 +++++++++++++--
4 files changed, 30 insertions(+), 6 deletions(-)
e54e13c2f057e71998ed39299545717311c7678d
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -9,7 +9,7 @@ git-pack-objects - Create a packed archi
SYNOPSIS
--------
-'git-pack-objects' [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list
+'git-pack-objects' [--incremental] [--window=N] [--depth=N] [--compression-level=N] {--stdout | base-name} < object-list
DESCRIPTION
@@ -61,6 +61,18 @@ base-name::
side, because delta data needs to be applied that many
times to get to the necessary object.
+--compression-level::
+ Set zlib compression level for the object data. The
+ compression level is a number from 0 to 9, where 0 means
+ no compression, 1 indicates the fastest compression
+ method (less compression), and 9 indicates the slowest
+ compression method (best compression). The default
+ compression level is 6. Setting the compression level
+ to 0 may be useful when the pack will be compressed as a
+ whole at a later stage (in the pack format every object
+ is compressed separately to allow random access, which
+ is less efficient than compressing the whole file).
+
--incremental::
This flag causes an object already in a pack ignored
even if it appears in the standard input.
diff --git a/csum-file.c b/csum-file.c
--- a/csum-file.c
+++ b/csum-file.c
@@ -117,14 +117,15 @@ struct sha1file *sha1fd(int fd, const ch
return f;
}
-int sha1write_compressed(struct sha1file *f, void *in, unsigned int size)
+int sha1write_compressed(struct sha1file *f, void *in, unsigned int size,
+ int compression_level)
{
z_stream stream;
unsigned long maxsize;
void *out;
memset(&stream, 0, sizeof(stream));
- deflateInit(&stream, Z_DEFAULT_COMPRESSION);
+ deflateInit(&stream, compression_level);
maxsize = deflateBound(&stream, size);
out = xmalloc(maxsize);
diff --git a/csum-file.h b/csum-file.h
--- a/csum-file.h
+++ b/csum-file.h
@@ -14,6 +14,6 @@ extern struct sha1file *sha1fd(int fd, c
extern struct sha1file *sha1create(const char *fmt, ...) __attribute__((format (printf, 1, 2)));
extern int sha1close(struct sha1file *, unsigned char *, int);
extern int sha1write(struct sha1file *, void *, unsigned int);
-extern int sha1write_compressed(struct sha1file *, void *, unsigned int);
+extern int sha1write_compressed(struct sha1file *, void *, unsigned int, int);
#endif
diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -5,7 +5,7 @@
#include "pack.h"
#include "csum-file.h"
-static const char pack_usage[] = "git-pack-objects [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list";
+static const char pack_usage[] = "git-pack-objects [--incremental] [--window=N] [--depth=N] [--compression-level=N] {--stdout | base-name} < object-list";
struct object_entry {
unsigned char sha1[20];
@@ -21,6 +21,7 @@ struct object_entry {
static unsigned char object_list_sha1[20];
static int non_empty = 0;
static int incremental = 0;
+static int compression_level = Z_DEFAULT_COMPRESSION;
static struct object_entry **sorted_by_sha, **sorted_by_type;
static struct object_entry *objects = NULL;
static int nr_objects = 0, nr_alloc = 0;
@@ -103,7 +104,7 @@ static unsigned long write_object(struct
sha1write(f, entry->delta, 20);
hdrlen += 20;
}
- datalen = sha1write_compressed(f, buf, size);
+ datalen = sha1write_compressed(f, buf, size, compression_level);
free(buf);
return hdrlen + datalen;
}
@@ -421,6 +422,16 @@ int main(int argc, char **argv)
usage(pack_usage);
continue;
}
+ if (!strncmp("--compression-level=", arg, 20)) {
+ char *end;
+ compression_level = strtoul(arg+20, &end, 0);
+ if (!arg[20] || *end)
+ usage(pack_usage);
+ if (compression_level < Z_NO_COMPRESSION
+ || compression_level > Z_BEST_COMPRESSION)
+ die("invalid compression level");
+ continue;
+ }
if (!strcmp("--stdout", arg)) {
pack_to_stdout = 1;
continue;
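The option parsing in the last hunk can also be written as a small
standalone function, which makes the accepted range easy to test. This
is only a sketch and not part of the patch: parse_compression_level is
a hypothetical name, the limits 0 and 9 are written out instead of
using the zlib constants so that the example compiles without zlib.h,
and it differs from the patch in that it uses strtol() with base 10
and keeps the value in a long until the range check has passed.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch.
 * Parses "--compression-level=N" with N in the range 0..9.
 * Returns 0 and stores N in *level on success; returns -1 if `arg`
 * is a different option or N is missing, malformed or out of range.
 */
int parse_compression_level(const char *arg, int *level)
{
	static const char prefix[] = "--compression-level=";
	const size_t prefix_len = sizeof(prefix) - 1;
	const char *value;
	char *end;
	long n;

	assert(arg != NULL && level != NULL);
	if (strncmp(arg, prefix, prefix_len))
		return -1;
	value = arg + prefix_len;
	errno = 0;
	n = strtol(value, &end, 10);
	/* Reject an empty value, trailing garbage and overflow. */
	if (end == value || *end || errno == ERANGE)
		return -1;
	if (n < 0 || n > 9)
		return -1;
	*level = (int)n;
	return 0;
}
```

A caller in the style of pack-objects.c would invoke usage() or die()
when the function returns -1 for an argument that begins with
"--compression-level=".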