git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Brandon Casey <drafnel@gmail.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	Ramkumar Ramachandra <artagnon@gmail.com>,
	Duy Nguyen <pclouds@gmail.com>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 10/10] pack-revindex: radix-sort the revindex
Date: Thu, 11 Jul 2013 07:17:15 -0400	[thread overview]
Message-ID: <20130711111705.GB6015@sigill.intra.peff.net> (raw)
In-Reply-To: <CA+sFfMeL1a1cQXj+3OXvp5hxLXz8Qc70U_+MMg4LOHcvuH4wcw@mail.gmail.com>

On Wed, Jul 10, 2013 at 10:10:16AM -0700, Brandon Casey wrote:

> > On the linux.git repo, with about 3M objects to sort, this
> > yields a 400% speedup. Here are the best-of-five numbers for
> > running "echo HEAD | git cat-file --batch-disk-size", which
> > is dominated by time spent building the pack revindex:
> >
> >           before     after
> >   real    0m0.834s   0m0.204s
> >   user    0m0.788s   0m0.164s
> >   sys     0m0.040s   0m0.036s
> >
> > On a smaller repo, the radix sort would not be
> > as impressive (and could even be worse), as we are trading
> > the log(n) factor for the k=4 of the radix sort. However,
> > even on git.git, with 173K objects, it shows some
> > improvement:
> >
> >           before     after
> >   real    0m0.046s   0m0.017s
> >   user    0m0.036s   0m0.012s
> >   sys     0m0.008s   0m0.000s
> 
> k should only be 2 for git.git.  I haven't packed in a while, but I
> think it should all fit within 4G.  :)  The pathological case would be
> a pack file with very few very very large objects, large enough to
> push the pack size over the 2^48 threshold so we'd have to do all four
> radixes.

Yeah, even linux.git fits into k=2. And that does more or less explain
the numbers in both cases.

For git.git, With 173K objects, log(n) is ~18, so regular sort is 18n.
With a radix sort of k=2, which has a constant factor of 2 (you can see
by looking at the code that we go through the list twice per radix), we
have 4n. So there should be a 4.5x speedup. We don't quite get that,
which is probably due to the extra bookkeeping on the buckets.

For linux.git, with 3M objects, log(n) is ~22, so the speedup we hope
for is 5.5x. We end up with 4x.

> It's probably worth mentioning here and/or in the code that k is
> dependent on the pack file size and that we can jump out early for
> small pack files.  That's my favorite part of this code by the way. :)

Yeah, I agree it is probably worth mentioning along with the numbers; it
is where half of our speedup is coming from. I think the "max >> bits"
loop condition deserves to be commented, too. I'll add that.

Also note that my commit message still refers to "--batch-disk-size"
which does not exist anymore. :) I didn't update the timings in the
commit message for my re-roll, but I did confirm that they are the same.

> > +       /*
> > +        * We need O(n) temporary storage, so we sort back and forth between
> > +        * the real array and our tmp storage. To keep them straight, we always
> > +        * sort from "a" into buckets in "b".
> > +        */
> > +       struct revindex_entry *tmp = xcalloc(n, sizeof(*tmp));
> 
> Didn't notice it the first time I read this, but do we really need
> calloc here?  Or will malloc do?

No, a malloc should be fine. I doubt it matters much, but there's no
reason not to go the cheap route.

> > +       struct revindex_entry *a = entries, *b = tmp;
> > +       int bits = 0;
> > +       unsigned *pos = xmalloc(BUCKETS * sizeof(*pos));
> > +
> > +       while (max >> bits) {
> > +               struct revindex_entry *swap;
> > +               int i;
> 
> You forgot to make i unsigned.  See below too...

Oops. Thanks for catching.

> > +               /*
> > +                * Now we can drop the elements into their correct buckets (in
> > +                * our temporary array).  We iterate the pos counter backwards
> > +                * to avoid using an extra index to count up. And since we are
> > +                * going backwards there, we must also go backwards through the
> > +                * array itself, to keep the sort stable.
> > +                */
> > +               for (i = n - 1; i >= 0; i--)
> > +                       b[--pos[BUCKET_FOR(a, i, bits)]] = a[i];
> 
> ...which is why the above loop still works.

Since we are iterating by ones, I guess I can just compare to UINT_MAX.

-Peff

  reply	other threads:[~2013-07-11 11:17 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-07 10:01 [RFC/PATCH 0/4] cat-file --batch-disk-sizes Jeff King
2013-07-07 10:03 ` [PATCH 1/4] zero-initialize object_info structs Jeff King
2013-07-07 17:34   ` Junio C Hamano
2013-07-07 10:04 ` [PATCH 2/4] teach sha1_object_info_extended a "disk_size" query Jeff King
2013-07-07 10:09 ` [PATCH 3/4] cat-file: add --batch-disk-sizes option Jeff King
2013-07-07 17:49   ` Junio C Hamano
2013-07-07 18:19     ` Jeff King
2013-07-08 11:04     ` Duy Nguyen
2013-07-08 12:00       ` Ramkumar Ramachandra
2013-07-08 13:13         ` Duy Nguyen
2013-07-08 13:37           ` Ramkumar Ramachandra
2013-07-09  2:55             ` Duy Nguyen
2013-07-09 10:32               ` Ramkumar Ramachandra
2013-07-10 11:16             ` Jeff King
2013-07-08 16:40           ` Junio C Hamano
2013-07-10 11:04     ` Jeff King
2013-07-11 16:35       ` Junio C Hamano
2013-07-07 21:15   ` brian m. carlson
2013-07-10 10:57     ` Jeff King
2013-07-07 10:14 ` [PATCH 4/4] pack-revindex: radix-sort the revindex Jeff King
2013-07-07 23:52   ` Shawn Pearce
2013-07-08  7:57     ` Jeff King
2013-07-08 15:38       ` Shawn Pearce
2013-07-08 20:50   ` Brandon Casey
2013-07-08 21:35     ` Brandon Casey
2013-07-10 10:57       ` Jeff King
2013-07-10 10:52     ` Jeff King
2013-07-10 11:34 ` [PATCHv2 00/10] cat-file formats/on-disk sizes Jeff King
2013-07-10 11:35   ` [PATCH 01/10] zero-initialize object_info structs Jeff King
2013-07-10 11:35   ` [PATCH 02/10] teach sha1_object_info_extended a "disk_size" query Jeff King
2013-07-10 11:36   ` [PATCH 03/10] t1006: modernize output comparisons Jeff King
2013-07-10 11:38   ` [PATCH 04/10] cat-file: teach --batch to stream blob objects Jeff King
2013-07-10 11:38   ` [PATCH 05/10] cat-file: refactor --batch option parsing Jeff King
2013-07-10 11:45   ` [PATCH 06/10] cat-file: add --batch-check=<format> Jeff King
2013-07-10 11:57     ` Eric Sunshine
2013-07-10 14:51     ` Ramkumar Ramachandra
2013-07-11 11:24       ` Jeff King
2013-07-10 11:46   ` [PATCH 07/10] cat-file: add %(objectsize:disk) format atom Jeff King
2013-07-10 11:48   ` [PATCH 08/10] cat-file: split --batch input lines on whitespace Jeff King
2013-07-10 15:29     ` Ramkumar Ramachandra
2013-07-11 11:36       ` Jeff King
2013-07-11 17:42         ` Junio C Hamano
2013-07-11 20:45         ` [PATCHv3 " Jeff King
2013-07-10 11:50   ` [PATCH 09/10] pack-revindex: use unsigned to store number of objects Jeff King
2013-07-10 11:55   ` [PATCH 10/10] pack-revindex: radix-sort the revindex Jeff King
2013-07-10 12:00     ` Jeff King
2013-07-10 13:17     ` Ramkumar Ramachandra
2013-07-11 11:03       ` Jeff King
2013-07-10 17:10     ` Brandon Casey
2013-07-11 11:17       ` Jeff King [this message]
2013-07-11 12:16     ` [PATCHv3 " Jeff King
2013-07-11 21:12       ` Brandon Casey

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130711111705.GB6015@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=artagnon@gmail.com \
    --cc=drafnel@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --subject='Re: [PATCH 10/10] pack-revindex: radix-sort the revindex' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).