git@vger.kernel.org mailing list mirror (one of many)
From: Junio C Hamano <gitster@pobox.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Nicolas Pitre <nico@cam.org>, Nix <nix@esperi.org.uk>,
	Steven Grimm <koreth@midwinter.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Subject: [PATCH] git-merge-pack
Date: Thu, 06 Sep 2007 16:12:56 -0700
Message-ID: <7v1wdb9ymf.fsf_-_@gitster.siamese.dyndns.org>
In-Reply-To: <alpine.LFD.0.999.0709061906010.5626@evo.linux-foundation.org> (Linus Torvalds's message of "Thu, 6 Sep 2007 19:15:58 +0100 (BST)")

This is the beginning of "git-merge-pack", which combines smaller
packs into one.  Currently it does not actually create a new
pack; instead it acts as a (dumb) "git-rev-list --objects" that
lists the objects in the affected packs.  You have to pipe its
output to "git-pack-objects".

The command reads names of pack-*.pack files from the standard
input and outputs the objects' names in the order they are stored
in the original packs (i.e. the offset order).  The sorting
emulates the order in which the original "git-rev-list --objects"
that was used to create each existing pack listed its objects.
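
To make the offset ordering concrete, here is a minimal
standalone sketch, not the patch itself.  The names obj_entry,
entry_cmp and sort_by_offset are made up for illustration, and
the object name is held in a plain 20-byte array instead of a
pointer into a pack index.  Note that the sort key is the offset
alone, so entries that come from different packs interleave in
the output.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for the patch's struct in_pack_object. */
struct obj_entry {
	off_t offset;            /* position of the object in its pack */
	unsigned char sha1[20];  /* raw object name */
};

/* Order by pack offset; break ties between packs by object name. */
static int entry_cmp(const void *a_, const void *b_)
{
	const struct obj_entry *a = a_;
	const struct obj_entry *b = b_;

	if (a->offset < b->offset)
		return -1;
	if (a->offset > b->offset)
		return 1;
	return memcmp(a->sha1, b->sha1, sizeof(a->sha1));
}

/* Sort a combined list of entries gathered from one or more packs. */
static void sort_by_offset(struct obj_entry *list, size_t nr)
{
	qsort(list, nr, sizeof(*list), entry_cmp);
}
```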

While this approach would give the resulting packfile locality
of access very similar to the original, it does not give the
"name" component you would see in "git-rev-list --objects"
output.  This information is used as the clustering cue while
computing deltas, and the lack of it means you can get horrible
delta selection.  You do _not_ want to run the downstream
"git-pack-objects" without the optimization/heuristics to reuse
deltas.  IOW, do not run it with --no-reuse-delta.

To consolidate all packs that are smaller than a megabyte into
one, you would use it in its current form like this:

    $ old=$(find .git/objects/pack -type f -name '*.pack' -size 1M)
    $ new=$(echo "$old" | git merge-pack | git pack-objects pack)
    $ for p in $old; do rm -f $p ${p%.pack}.idx; done
    $ for s in pack idx; do mv pack-$new.$s .git/objects/pack/; done
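
Each line fed to the command on standard input is the path of a
.pack file, which has to be turned into the matching .idx name
before the pack can be opened.  A rough standalone sketch of that
step follows; pack_to_idx_name is a hypothetical helper, not a
git function, and the explicit length check guards against input
lines shorter than the suffix.

```c
#include <string.h>

/*
 * Hypothetical helper: strip trailing CR/LF from one input line and
 * rewrite "<name>.pack" to "<name>.idx" in place.
 * Returns the new length, or -1 if the line does not end in ".pack".
 */
static int pack_to_idx_name(char *name)
{
	size_t len = strlen(name);

	while (len > 0 && (name[len - 1] == '\n' || name[len - 1] == '\r'))
		name[--len] = '\0';
	if (len < 5 || strcmp(name + len - 5, ".pack"))
		return -1;
	/* ".idx" is one byte shorter than ".pack", so it always fits */
	strcpy(name + len - 5, ".idx");
	return (int)(len - 1);
}
```

A caller would run this on every line it reads and stop with an
error as soon as it returns -1.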

Obvious next steps that can be done in parallel by interested
parties would be:

 (1) come up with a way to give "name" aka "clustering cue" (I
     think this is very hard);

 (2) run the above four-command sequence internally without
     having to resort to a shell wrapper (easy).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

  Linus Torvalds <torvalds@linux-foundation.org> writes:

  > IOW, if you get lots of small incrmental packs, after a while you really 
  > *do* need to do "git gc" to get the real pack generated.

  'auto' should do a lesser-impact repack than the usual one.
  In particular, we do not want to lose objects that do not look
  like they are reachable from this repository, to help people
  with alternate object stores, aka "repo.or.cz style _forked_
  repositories".  However, a full repack with "-a -d" discards
  unreferenced objects that are only in packs.

  We need a middle ground between "pack and prune-pack only loose
  ones" and "full repack".

  Here is one.

 Makefile             |    1 +
 builtin-merge-pack.c |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
 builtin.h            |    1 +
 git.c                |    1 +
 4 files changed, 90 insertions(+), 0 deletions(-)
 create mode 100644 builtin-merge-pack.c

diff --git a/Makefile b/Makefile
index dace211..cdff756 100644
--- a/Makefile
+++ b/Makefile
@@ -343,6 +343,7 @@ BUILTIN_OBJS = \
 	builtin-mailsplit.o \
 	builtin-merge-base.o \
 	builtin-merge-file.o \
+	builtin-merge-pack.o \
 	builtin-mv.o \
 	builtin-name-rev.o \
 	builtin-pack-objects.o \
diff --git a/builtin-merge-pack.c b/builtin-merge-pack.c
new file mode 100644
index 0000000..c98da80
--- /dev/null
+++ b/builtin-merge-pack.c
@@ -0,0 +1,87 @@
+#include "builtin.h"
+#include "cache.h"
+#include "pack.h"
+
+struct in_pack_object {
+	off_t offset;
+	const unsigned char *sha1;
+};
+
+static uint32_t get_packed_object_list(struct packed_git *p, struct in_pack_object *list, uint32_t loc)
+{
+	uint32_t n;
+
+	for (n = 0; n < p->num_objects; n++) {
+		list[loc].sha1 = nth_packed_object_sha1(p, n);
+		list[loc].offset = find_pack_entry_one(list[loc].sha1, p);
+		loc++;
+	}
+	return loc;
+}
+
+static int ofscmp(const void *a_, const void *b_)
+{
+	struct in_pack_object *a = (struct in_pack_object *)a_;
+	struct in_pack_object *b = (struct in_pack_object *)b_;
+	if (a->offset < b->offset)
+		return -1;
+	else if (a->offset > b->offset)
+		return 1;
+	else
+		return hashcmp(a->sha1, b->sha1);
+}
+
+int cmd_merge_pack(int ac, const char **av, const char *prefix)
+{
+	char filename[PATH_MAX];
+	struct packed_git **pack = NULL;
+	int pack_nr = 0;
+	int pack_alloc = 0;
+	uint32_t max_objs, cnt;
+	struct in_pack_object *objs;
+	int i;
+
+	while (fgets(filename, sizeof(filename), stdin) != NULL) {
+		int len = strlen(filename);
+		struct packed_git *p;
+
+		while (0 < len) {
+			if (filename[len-1] != '\n' &&
+			    filename[len-1] != '\r')
+				break;
+			filename[--len] = '\0';
+		}
+		if (len < 5 || strcmp(filename + len - 5, ".pack"))
+			goto error;
+
+		/* add-packed-git wants the name of .idx file */
+		strcpy(filename + len - 5, ".idx");
+		len--;
+		p = add_packed_git(filename, len, 1);
+		if (!p)
+			goto error;
+		if (open_pack_index(p))
+			goto error;
+
+		if (pack_alloc <= pack_nr) {
+			pack_alloc = alloc_nr(pack_nr);
+			pack = xrealloc(pack, pack_alloc * sizeof(*pack));
+		}
+		pack[pack_nr++] = p;
+		continue;
+	error:
+		die("Cannot add a pack .idx file: %s", filename);
+	}
+
+	max_objs = 0;
+	for (i = 0; i < pack_nr; i++)
+		max_objs += pack[i]->num_objects;
+	objs = xmalloc(sizeof(*objs) * max_objs);
+	cnt = 0;
+	for (i = 0; i < pack_nr; i++)
+		cnt = get_packed_object_list(pack[i], objs, cnt);
+	qsort(objs, cnt, sizeof(*objs), ofscmp);
+	for (cnt = 0; cnt < max_objs; cnt++)
+		printf("%s\n", sha1_to_hex(objs[cnt].sha1));
+	return 0;
+}
diff --git a/builtin.h b/builtin.h
index bb72000..aff28ca 100644
--- a/builtin.h
+++ b/builtin.h
@@ -49,6 +49,7 @@ extern int cmd_mailinfo(int argc, const char **argv, const char *prefix);
 extern int cmd_mailsplit(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_base(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
+extern int cmd_merge_pack(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
 extern int cmd_name_rev(int argc, const char **argv, const char *prefix);
 extern int cmd_pack_objects(int argc, const char **argv, const char *prefix);
diff --git a/git.c b/git.c
index fd3d83c..69e86bc 100644
--- a/git.c
+++ b/git.c
@@ -353,6 +353,7 @@ static void handle_internal_command(int argc, const char **argv)
 		{ "mailsplit", cmd_mailsplit },
 		{ "merge-base", cmd_merge_base, RUN_SETUP },
 		{ "merge-file", cmd_merge_file },
+		{ "merge-pack", cmd_merge_pack },
 		{ "mv", cmd_mv, RUN_SETUP | NEED_WORK_TREE },
 		{ "name-rev", cmd_name_rev, RUN_SETUP },
 		{ "pack-objects", cmd_pack_objects, RUN_SETUP },
-- 
1.5.3.1.860.g2cce2
