From: Jeff King <firstname.lastname@example.org> To: Shawn Pearce <email@example.com> Cc: git <firstname.lastname@example.org> Subject: Re: [PATCH 4/4] pack-revindex: radix-sort the revindex Date: Mon, 8 Jul 2013 03:57:12 -0400 [thread overview] Message-ID: <20130708075712.GC25072@sigill.intra.peff.net> (raw) In-Reply-To: <CAJo=hJugvqBEQwPYcttNH+R8xUKxy1uDm5EjiWaye-wEuTxkemail@example.com> On Sun, Jul 07, 2013 at 04:52:23PM -0700, Shawn O. Pearce wrote: > On Sun, Jul 7, 2013 at 3:14 AM, Jeff King <firstname.lastname@example.org> wrote: > > The pack revindex stores the offsets of the objects in the > > pack in sorted order, allowing us to easily find the on-disk > > size of each object. To compute it, we populate an array > > with the offsets from the sha1-sorted idx file, and then use > > qsort to order it by offsets. > > > > That does O(n log n) offset comparisons, and profiling shows > > that we spend most of our time in cmp_offset. However, since > > we are sorting on a simple off_t, we can use numeric sorts > > that perform better. A radix sort can run in O(k*n), where k > > is the number of "digits" in our number. For a 64-bit off_t, > > using 16-bit "digits" gives us k=4. > > Did you try the simple bucket sort Colby now uses in JGit? > > The sort is pretty simple: > > bucket_size = pack_length / object_count; > buckets = malloc(object_count * sizeof(int)); > > foreach obj in idx: > push_chain(buckets[obj.offset / bucket_size], obj.idx_nth); > > foreach bucket: > insertion sort by offset I did do something similar (though I flattened my buckets into a single list afterwards), but I ended up closer to 700ms (down from 830ms, but with the radix sort around 200ms). It's entirely possible I screwed up something in the implementation (the bucket insertion can be done in a lot of different ways, many of which are terrible), but I didn't keep a copy of that attempt. If you try it and have better numbers, I'd be happy to see them. > We observed on linux.git that most buckets have an average number of > objects. IIRC the bucket_size was ~201 bytes and most buckets had very > few objects each. For lookups we keep the bucket_size parameter and a > bucket index table. This arrangement uses 8 bytes per object in the > reverse index, making it very memory efficient. Searches are typically > below O(log N) time because each bucket has <log N entries. I didn't measure lookups at all; I was focused on time to build the index. So if there were benefits there that make up for a longer setup time, I wouldn't have measured them (of course, we also care about the case with few lookups, so it would be a tradeoff). You could also leave each bucket unsorted and only lazily sort it when a lookup hits the bucket, which might help that case (I didn't look to see if you do that in JGit). -Peff
next prev parent reply other threads:[~2013-07-08 7:57 UTC|newest] Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top 2013-07-07 10:01 [RFC/PATCH 0/4] cat-file --batch-disk-sizes Jeff King 2013-07-07 10:03 ` [PATCH 1/4] zero-initialize object_info structs Jeff King 2013-07-07 17:34 ` Junio C Hamano 2013-07-07 10:04 ` [PATCH 2/4] teach sha1_object_info_extended a "disk_size" query Jeff King 2013-07-07 10:09 ` [PATCH 3/4] cat-file: add --batch-disk-sizes option Jeff King 2013-07-07 17:49 ` Junio C Hamano 2013-07-07 18:19 ` Jeff King 2013-07-08 11:04 ` Duy Nguyen 2013-07-08 12:00 ` Ramkumar Ramachandra 2013-07-08 13:13 ` Duy Nguyen 2013-07-08 13:37 ` Ramkumar Ramachandra 2013-07-09 2:55 ` Duy Nguyen 2013-07-09 10:32 ` Ramkumar Ramachandra 2013-07-10 11:16 ` Jeff King 2013-07-08 16:40 ` Junio C Hamano 2013-07-10 11:04 ` Jeff King 2013-07-11 16:35 ` Junio C Hamano 2013-07-07 21:15 ` brian m. carlson 2013-07-10 10:57 ` Jeff King 2013-07-07 10:14 ` [PATCH 4/4] pack-revindex: radix-sort the revindex Jeff King 2013-07-07 23:52 ` Shawn Pearce 2013-07-08 7:57 ` Jeff King [this message] 2013-07-08 15:38 ` Shawn Pearce 2013-07-08 20:50 ` Brandon Casey 2013-07-08 21:35 ` Brandon Casey 2013-07-10 10:57 ` Jeff King 2013-07-10 10:52 ` Jeff King 2013-07-10 11:34 ` [PATCHv2 00/10] cat-file formats/on-disk sizes Jeff King 2013-07-10 11:35 ` [PATCH 01/10] zero-initialize object_info structs Jeff King 2013-07-10 11:35 ` [PATCH 02/10] teach sha1_object_info_extended a "disk_size" query Jeff King 2013-07-10 11:36 ` [PATCH 03/10] t1006: modernize output comparisons Jeff King 2013-07-10 11:38 ` [PATCH 04/10] cat-file: teach --batch to stream blob objects Jeff King 2013-07-10 11:38 ` [PATCH 05/10] cat-file: refactor --batch option parsing Jeff King 2013-07-10 11:45 ` [PATCH 06/10] cat-file: add --batch-check=<format> Jeff King 2013-07-10 11:57 ` Eric Sunshine 2013-07-10 14:51 ` Ramkumar Ramachandra 2013-07-11 11:24 ` Jeff King 2013-07-10 11:46 ` [PATCH 07/10] cat-file: add %(objectsize:disk) format atom Jeff King 2013-07-10 11:48 ` [PATCH 08/10] cat-file: split --batch input lines on whitespace Jeff King 2013-07-10 15:29 ` Ramkumar Ramachandra 2013-07-11 11:36 ` Jeff King 2013-07-11 17:42 ` Junio C Hamano 2013-07-11 20:45 ` [PATCHv3 " Jeff King 2013-07-10 11:50 ` [PATCH 09/10] pack-revindex: use unsigned to store number of objects Jeff King 2013-07-10 11:55 ` [PATCH 10/10] pack-revindex: radix-sort the revindex Jeff King 2013-07-10 12:00 ` Jeff King 2013-07-10 13:17 ` Ramkumar Ramachandra 2013-07-11 11:03 ` Jeff King 2013-07-10 17:10 ` Brandon Casey 2013-07-11 11:17 ` Jeff King 2013-07-11 12:16 ` [PATCHv3 " Jeff King 2013-07-11 21:12 ` Brandon Casey
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style List information: http://vger.kernel.org/majordomo-info.html * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20130708075712.GC25072@sigill.intra.peff.net \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --subject='Re: [PATCH 4/4] pack-revindex: radix-sort the revindex' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Code repositories for project(s) associated with this inbox: https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).