expanding pack idx fanout table

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* expanding pack idx fanout table
@ 2013-07-08 15:54 Shawn Pearce
  2013-07-08 17:37 ` Junio C Hamano
  2013-07-09  4:19 ` Jeff King
  0 siblings, 2 replies; 4+ messages in thread
From: Shawn Pearce @ 2013-07-08 15:54 UTC (permalink / raw)
  To: git

Has anyone studied the impact of converting the pack idx fanout table
from 256 entries to 65536 entries?

Back of the envelope estimates for 3.1M objects in linux.git suggests
a 2^16 fanout table would decrease the number of binary search
iterations from ~14 to ~6. The increased table costs an extra 255 KiB
of disk. On a 70M idx file this is noise.

I'm starting to wonder if increasing the fanout table once the object
count is above a certain threshold is a reasonable optimization for
larger repositories.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: expanding pack idx fanout table
  2013-07-08 15:54 expanding pack idx fanout table Shawn Pearce
@ 2013-07-08 17:37 ` Junio C Hamano
  2013-07-08 17:58   ` Shawn Pearce
  2013-07-09  4:19 ` Jeff King
  1 sibling, 1 reply; 4+ messages in thread
From: Junio C Hamano @ 2013-07-08 17:37 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Shawn Pearce <spearce@spearce.org> writes:

> Has anyone studied the impact of converting the pack idx fanout table
> from 256 entries to 65536 entries?
>
> Back of the envelope estimates for 3.1M objects in linux.git suggests
> a 2^16 fanout table would decrease the number of binary search
> iterations from ~14 to ~6. The increased table costs an extra 255 KiB
> of disk. On a 70M idx file this is noise.
>
> I'm starting to wonder if increasing the fanout table once the object
> count is above a certain threshold is a reasonable optimization for
> larger repositories.

Yeah, and I do not think we have to be worried too much about
backward compatibility for .idx files, as they are local and can be
regenerated if an older version cannot read it.

I also wonder if we can generate a finer-grained fan-out table on
the fly, perhaps lazily, without changing the on-disk format ;-)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: expanding pack idx fanout table
  2013-07-08 17:37 ` Junio C Hamano
@ 2013-07-08 17:58   ` Shawn Pearce
  0 siblings, 0 replies; 4+ messages in thread
From: Shawn Pearce @ 2013-07-08 17:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Jul 8, 2013 at 10:37 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> Has anyone studied the impact of converting the pack idx fanout table
>> from 256 entries to 65536 entries?
>>
>> Back of the envelope estimates for 3.1M objects in linux.git suggests
>> a 2^16 fanout table would decrease the number of binary search
>> iterations from ~14 to ~6. The increased table costs an extra 255 KiB
>> of disk. On a 70M idx file this is noise.
>>
>> I'm starting to wonder if increasing the fanout table once the object
>> count is above a certain threshold is a reasonable optimization for
>> larger repositories.
>
> Yeah, and I do not think we have to be worried too much about
> backward compatibility for .idx files, as they are local and can be
> regenerated if an older version cannot read it.
>
> I also wonder if we can generate a finer-grained fan-out table on
> the fly, perhaps lazily, without changing the on-disk format ;-)

Heh. Yes. The reader at runtime could expand the 256 fanout table to a
65536 fanout table without changing the on disk format. Unfortunately
doing the expansion will require O(N) time as the 2nd byte of each
SHA-1 must be examined from every bucket. If you are going to perform
O(N) lookups this expansion might save time as it lowers the "log N"
bound of the O(N log N) algorithm for N lookups. It doesn't help
enough when doing only N/20 lookups.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: expanding pack idx fanout table
  2013-07-08 15:54 expanding pack idx fanout table Shawn Pearce
  2013-07-08 17:37 ` Junio C Hamano
@ 2013-07-09  4:19 ` Jeff King
  1 sibling, 0 replies; 4+ messages in thread
From: Jeff King @ 2013-07-09  4:19 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

On Mon, Jul 08, 2013 at 08:54:24AM -0700, Shawn O. Pearce wrote:

> Has anyone studied the impact of converting the pack idx fanout table
> from 256 entries to 65536 entries?
> 
> Back of the envelope estimates for 3.1M objects in linux.git suggests
> a 2^16 fanout table would decrease the number of binary search
> iterations from ~14 to ~6. The increased table costs an extra 255 KiB
> of disk. On a 70M idx file this is noise.

No, I hadn't considered it, but I think your analysis is correct that it
would decrease the lookup cost. Having run "perf" on git a lot of times,
I do see find_pack_entry_one turn up as a noticeable hot spot, but
usually not more than a few percent.

Which kind of makes sense, because of the way we cache objects in
memory. If you are doing something like `rev-list --objects`, you are
going to do O(n) packfile lookups, but you end up doing _way_ more
lookups in the object hash. Not just one per tree entry, but one per
tree entry for each commit in which the entry was untouched.

I tried (maybe a year or so ago) to make the object hash faster using a
similar fan-out trick, but I don't remember it having any real impact
(because we keep our hash tables relatively collision free already). I
think I also tried a full prefix-trie, with no luck.

Obviously the exact results would depend on your workload, but in
general I'd expect the object hash to take the brunt of the load for
repeated lookups, and for non-repeated lookups to be dominated by the
time to actually access the object, as opposed to saving a few hashcmps.
That may change as we start using the pack .idx for more clever things
than simply accessing the objects, though (e.g., it might make a
difference when coupled with my commit cache patches).

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2013-07-09  4:19 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-08 15:54 expanding pack idx fanout table Shawn Pearce
2013-07-08 17:37 ` Junio C Hamano
2013-07-08 17:58   ` Shawn Pearce
2013-07-09  4:19 ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).