Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Duy Nguyen <pclouds@gmail.com>,
	"Shawn O. Pearce" <spearce@spearce.org>,
	Nicolas Pitre <nico@fluxnic.net>
Subject: Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
Date: Fri, 23 Aug 2013 14:24:10 -0400	[thread overview]
Message-ID: <20130823182409.GA30130@sigill.intra.peff.net> (raw)
In-Reply-To: <xmqqob8opdey.fsf@gitster.dls.corp.google.com>

On Fri, Aug 23, 2013 at 09:41:57AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Furthermore, we know that one of our endpoints must be
> > the edge of the run of duplicates. For example, given this
> > sequence:
> >
> >  idx 0 1 2 3 4 5
> >  key A C C C C D
> >
> > If we are searching for "B", we might hit the duplicate run
> > at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can
> > never have lo > 1, because B < C. That is, if our key is
> > less than the run, we know that "lo" is the edge, but we can
> > say nothing of "hi". Similarly, if our key is greater than
> > the run, we know that "hi" is the edge, but we can say
> > nothing of "lo". But that is enough for us to return not
> > only "not found", but show the position at which we would
> > insert the new item.
> 
> This is somewhat tricky and may deserve an in-code comment.

Do you want me to re-roll, pushing it down into the comment, or do you
want to mark it up yourself? I think there might be some value in the
latter as your re-writing of it as a comment may cross-check that my
logic is sound.

> > +			if (ofs == 20) {
> > +				mi = lo;
> > +				mi_key = base + elem_size * mi + key_offset;
> > +				cmp = memcmp(mi_key, key, 20);
> 
> It think we already know that mi_key[0:ofs_0] and key[0:ofs_0] are
> the same at this point and we do not have to compare full 20 bytes
> again, but this is done only once and a better readablity of the
> above trumps micro-optimization possibility, I think.

Yes, I had the same idea, and came to the same conclusion. Though if
anybody did want to try it, note that we have just overwritten the old
ofs_0, so you would want to bump the new code up above that line).

> > +uint32_octal() {
> 
> micronit (style):
> 
> 	uint32_octal () {

Hmph. I always forget which one we prefer, and we seem to have equal
numbers of both already. Again, want a re-roll or to mark it up
yourself?

> > +# Print the pack data for object $1, as a delta against object $2 (or as a full
> > +# object if $2 is missing or empty). The output is suitable for including
> > +# directly in the packfile, and represents the entirety of the object entry.
> > +# Doing this on the fly (especially picking your deltas) is quite tricky, so we
> > +# have hardcoded some well-known objects. See the case statements below for the
> > +# complete list.
> 
> Cute ;-) I like the idea of having this function with a right API in
> place, and cheating by limiting its implementation to what is
> necessary.

Just for reference, the procedure I used to generate the "base" data is
reasonably straight forward:

  sha1=$(printf %s "$content" | git hash-object -w --stdin)
  echo $sha1 | git pack-objects --stdout >tmp.pack
  tail -c +13 tmp.pack >no-header.pack
  head -c -20 no-header.pack >no-trailer.pack
  od -b no-trailer.pack | grep ' ' | cut -d' ' -f2- | tr ' ' '\\'

Since we want binary, we can skip the "od" call at the end (I needed it
to convert to something readable to hand "printf"). But "head -c" is not
portable, nor is head with a negative count.

To find items in the same fanout, I just used for-loops to calculate the
sha1s of all 2-byte blobs. And that is why we have the odd magic "\7\76"
blob.

Making the deltas was considerably less elegant, since we cannot provoke
pack-objects to pick arbitrary deltas (and it will not even try to delta
tiny objects, anyway, which would bloat our samples). I ended up with
the horrible patch below. We _could_ clean it up (error-checking? Who
needs it?) and make it a debug-and-testing-only option for pack-objects,
but I just didn't think the grossness was worth it. Still, it's probably
worth documenting here on the list in case somebody else ever needs to
add new samples to lib-pack.sh.

---
 builtin/pack-objects.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 8da2a66..e8937f5 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2442,6 +2442,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	const char *rp_av[6];
 	int rp_ac = 0;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
+	int magic = 0;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
 			    N_("do not show progress meter"), 0),
@@ -2505,6 +2506,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			    N_("pack compression level")),
 		OPT_SET_INT(0, "keep-true-parents", &grafts_replace_parents,
 			    N_("do not hide commits by grafts"), 0),
+		OPT_BOOL(0, "magic", &magic, "make deltas"),
 		OPT_END(),
 	};
 
@@ -2520,6 +2522,34 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	argc = parse_options(argc, argv, prefix, pack_objects_options,
 			     pack_usage, 0);
 
+	if (magic) {
+		unsigned char sha1[20];
+		struct delta_index *index;
+		void *src, *trg, *delta;
+		enum object_type src_type, trg_type;
+		unsigned long src_size, trg_size, delta_size, z_delta_size;
+		unsigned char header[10];
+		unsigned long header_len;
+
+		get_sha1(argv[0], sha1);
+		trg = read_sha1_file(sha1, &trg_type, &trg_size);
+
+		get_sha1(argv[1], sha1);
+		src = read_sha1_file(sha1, &src_type, &src_size);
+
+		index = create_delta_index(src, src_size);
+		delta = create_delta(index, trg, trg_size, &delta_size, 8192);
+
+		z_delta_size = do_compress(&delta, delta_size);
+		header_len = encode_in_pack_object_header(OBJ_REF_DELTA, delta_size,
+							  header);
+		fwrite(header, 1, header_len, stdout);
+		fwrite(sha1, 1, 20, stdout);
+		fwrite(delta, 1, z_delta_size, stdout);
+		return 0;
+	}
+
+
 	if (argc) {
 		base_name = argv[0];
 		argc--;

next prev parent reply	other threads:[~2013-08-23 18:24 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-14 18:17 duplicate objects in packfile Jeff King
2013-08-14 18:29 ` Junio C Hamano
2013-08-14 18:39   ` Junio C Hamano
2013-08-14 18:54     ` Nicolas Pitre
2013-08-16 15:01     ` Jeff King
2013-08-21 20:49       ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
2013-08-21 20:51         ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-21 20:52         ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 13:45           ` Duy Nguyen
2013-08-22 14:43             ` Jeff King
2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
2013-08-22 23:12                 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-22 23:14                 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-23 16:41                   ` Junio C Hamano
2013-08-23 18:24                     ` Jeff King [this message]
2013-08-23 18:54                       ` Nicolas Pitre
2013-08-23 18:56                         ` Jeff King
2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
2013-08-24  0:01                         ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-24  0:02                         ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-24  0:02                         ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-24  0:02                         ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-24  0:02                         ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-24  0:02                         ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-23 19:41                   ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
2013-08-23 23:37                     ` Jeff King
2013-08-22 23:14                 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-22 23:15                 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-22 23:15                 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 23:16                 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-22 23:35                 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre
2013-08-21 20:53         ` [PATCH 3/4] reject duplicates when indexing a packfile we created Jeff King
2013-08-21 20:55         ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King
2013-08-21 23:20           ` Junio C Hamano
2013-08-22  0:47             ` Jeff King
2013-08-21 22:17         ` [RFC/PATCH 0/4] duplicate objects in packfiles Junio C Hamano
2013-08-14 18:50 ` duplicate objects in packfile Nicolas Pitre

find likely ancestor, descendant, or conflicting patches for this message:
dfblob:8da2a66 dfblob:e8937f5
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130823182409.GA30130@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=nico@fluxnic.net \
    --cc=pclouds@gmail.com \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).