Re: Commit cache to speed up rev-list and merge

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Re: Commit cache to speed up rev-list and merge
@ 2012-09-27 15:51 Shawn Pearce
  2012-09-27 17:39 ` Jeff King
  2012-09-28  2:14 ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 10+ messages in thread
From: Shawn Pearce @ 2012-09-27 15:51 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger

On Thu, Sep 27, 2012 at 5:17 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> I'd like to see some sort of extension mechanism like in
> $GIT_DIR/index, so that we don't have to increase pack index version
> often. What I have in mind is optional commit cache to speed up
> rev-list and merge, which could be stored in pack index too.

Can you share some of your ideas?

In Linus' Linux kernel tree there are currently about 323,178 commits.
If we store just the pre-parsed commit time as an int32 field this is
an additional 1.2 MiB of data in the pack-*.idx file, assuming we can
use additional data like pack offset position to correlate commit to
the parsed int. If we stored parent pointers in a similar way you
probably need at least 3.6 MiB of additional disk space on the index.
For example, use 12 bytes for each commit to store enough of the
parsed commit time to sort commits, and up to 2 parent pointers per
commit.... with a reserved magic value for octopus merges to mean the
commit itself has to be parsed to get the graph structure correct.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Commit cache to speed up rev-list and merge
  2012-09-27 15:51 Commit cache to speed up rev-list and merge Shawn Pearce
@ 2012-09-27 17:39 ` Jeff King
  2012-09-27 17:45   ` Shawn Pearce
  2012-09-28  1:43   ` Nguyen Thai Ngoc Duy
  2012-09-28  2:14 ` Nguyen Thai Ngoc Duy
  1 sibling, 2 replies; 10+ messages in thread
From: Jeff King @ 2012-09-27 17:39 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger

On Thu, Sep 27, 2012 at 08:51:51AM -0700, Shawn O. Pearce wrote:

> On Thu, Sep 27, 2012 at 5:17 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> > I'd like to see some sort of extension mechanism like in
> > $GIT_DIR/index, so that we don't have to increase pack index version
> > often. What I have in mind is optional commit cache to speed up
> > rev-list and merge, which could be stored in pack index too.
> 
> Can you share some of your ideas?

Some of it is here:

  http://article.gmane.org/gmane.comp.version-control.git/203308

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Commit cache to speed up rev-list and merge
  2012-09-27 17:39 ` Jeff King
@ 2012-09-27 17:45   ` Shawn Pearce
  2012-09-27 18:32     ` Jeff King
  2012-09-28  1:43   ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 10+ messages in thread
From: Shawn Pearce @ 2012-09-27 17:45 UTC (permalink / raw)
  To: Jeff King; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger

On Thu, Sep 27, 2012 at 10:39 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Sep 27, 2012 at 08:51:51AM -0700, Shawn O. Pearce wrote:
>> On Thu, Sep 27, 2012 at 5:17 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> > I'd like to see some sort of extension mechanism like in
>> > $GIT_DIR/index, so that we don't have to increase pack index version
>> > often. What I have in mind is optional commit cache to speed up
>> > rev-list and merge, which could be stored in pack index too.
>>
>> Can you share some of your ideas?
>
> Some of it is here:
>
>   http://article.gmane.org/gmane.comp.version-control.git/203308

Quoting from that patch:

On  2012-08-12 Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> Long term we might gain slight lookup speedup if we know object type
> as search region is made smaller. But for that to happen, we need to
> propagate object type hint down to find_pack_entry_one() and friends.
> Possible thing to do, I think.

I'm not sure reclustering the index by object type is going to make a
worthwhile difference. Of 2.2m objects in the Linux tree, 320k are
commits. The difference between doing the binary search through all
objects vs. just commits is only 2 iterations more of binary search if
we assume the per-type ranges have their own fan-out tables.

> The main reason to group objects by type is to make it possible to
> create another sha1->something mapping for a particular object type,
> without wasting space for storing sha-1 keys again. For example, we
> can store commit caches, tree caches... at the end of the index as
> extensions.

Using ordinal position in the pack also works, and doesn't require
clustering objects by type.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Commit cache to speed up rev-list and merge
  2012-09-27 17:45   ` Shawn Pearce
@ 2012-09-27 18:32     ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2012-09-27 18:32 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger

On Thu, Sep 27, 2012 at 10:45:32AM -0700, Shawn O. Pearce wrote:

> On  2012-08-12 Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> > Long term we might gain slight lookup speedup if we know object type
> > as search region is made smaller. But for that to happen, we need to
> > propagate object type hint down to find_pack_entry_one() and friends.
> > Possible thing to do, I think.
> 
> I'm not sure reclustering the index by object type is going to make a
> worthwhile difference. Of 2.2m objects in the Linux tree, 320k are
> commits. The difference between doing the binary search through all
> objects vs. just commits is only 2 iterations more of binary search if
> we assume the per-type ranges have their own fan-out tables.

To me the big win would be implicit indexing for items that are present
for every instance of a particular object type. So if we wanted to keep
the timestamp for every commit, you could have a "pack-*.timestamps"
that is literally just a packed list of uint32's, one per commit, where
the position of a commit's timestamp in the list is the same as its
position in the index of sha1s in the pack index.

That's simple to do if your index is just commits. But if it includes
all objects, then your list is sparse. So either you waste space by
making an empty slot for the non-commit objects, or you have an extra
level of indirection mapping the commit into the packed list, which is
going to double the storage in this case (though you could reuse that
extra mapping for the parent, generation number, etc, so it at least
gets amortized as you store more data). Or is there some clever solution
I'm missing?

For your extension, I don't think it matters. You're sparse even in the
commit-object space, so you have to store the mapping anyway. And your
data is big enough that the overhead isn't too painful.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Commit cache to speed up rev-list and merge
  2012-09-27 17:39 ` Jeff King
  2012-09-27 17:45   ` Shawn Pearce
@ 2012-09-28  1:43   ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 10+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-09-28  1:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, git, Colby Ranger

On Fri, Sep 28, 2012 at 12:39 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Sep 27, 2012 at 08:51:51AM -0700, Shawn O. Pearce wrote:
>
>> On Thu, Sep 27, 2012 at 5:17 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> > I'd like to see some sort of extension mechanism like in
>> > $GIT_DIR/index, so that we don't have to increase pack index version
>> > often. What I have in mind is optional commit cache to speed up
>> > rev-list and merge, which could be stored in pack index too.
>>
>> Can you share some of your ideas?
>
> Some of it is here:
>
>   http://article.gmane.org/gmane.comp.version-control.git/203308

And an experiment was done earlier

http://thread.gmane.org/gmane.comp.version-control.git/194306/focus=194596
-- 
Duy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Commit cache to speed up rev-list and merge
  2012-09-27 15:51 Commit cache to speed up rev-list and merge Shawn Pearce
  2012-09-27 17:39 ` Jeff King
@ 2012-09-28  2:14 ` Nguyen Thai Ngoc Duy
  2012-10-01  1:49   ` Shawn Pearce
  1 sibling, 1 reply; 10+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-09-28  2:14 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger

On Thu, Sep 27, 2012 at 10:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
> In Linus' Linux kernel tree there are currently about 323,178 commits.
> If we store just the pre-parsed commit time as an int32 field this is
> an additional 1.2 MiB of data in the pack-*.idx file, assuming we can
> use additional data like pack offset position to correlate commit to
> the parsed int. If we stored parent pointers in a similar way you
> probably need at least 3.6 MiB of additional disk space on the index.
> For example, use 12 bytes for each commit to store enough of the
> parsed commit time to sort commits, and up to 2 parent pointers per
> commit.... with a reserved magic value for octopus merges to mean the
> commit itself has to be parsed to get the graph structure correct.

This is much better than my naive approach (storing sha-1 and
timestamps). We could use less space by storing parent pointer of
non-merge commits only. Merge commits linux-2.6 is 6% the number of
commits. git.git has higher percentage, 21%. I bet many projects do
not merge as much and the number of merge commits is less than 5%.
-- 
Duy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Commit cache to speed up rev-list and merge
  2012-09-28  2:14 ` Nguyen Thai Ngoc Duy
@ 2012-10-01  1:49   ` Shawn Pearce
  2012-10-01  2:05     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 10+ messages in thread
From: Shawn Pearce @ 2012-10-01  1:49 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger

On Thu, Sep 27, 2012 at 7:14 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Thu, Sep 27, 2012 at 10:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> In Linus' Linux kernel tree there are currently about 323,178 commits.
>> If we store just the pre-parsed commit time as an int32 field this is
>> an additional 1.2 MiB of data in the pack-*.idx file, assuming we can
>> use additional data like pack offset position to correlate commit to
>> the parsed int. If we stored parent pointers in a similar way you
>> probably need at least 3.6 MiB of additional disk space on the index.
>> For example, use 12 bytes for each commit to store enough of the
>> parsed commit time to sort commits, and up to 2 parent pointers per
>> commit.... with a reserved magic value for octopus merges to mean the
>> commit itself has to be parsed to get the graph structure correct.
>
> This is much better than my naive approach (storing sha-1 and
> timestamps). We could use less space by storing parent pointer of
> non-merge commits only. Merge commits linux-2.6 is 6% the number of
> commits. git.git has higher percentage, 21%. I bet many projects do
> not merge as much and the number of merge commits is less than 5%.

Some projects merge quite often. Android's frameworks/base repository
has a very large number of merges. Out of 79905 commits reachable from
the master branch, 65.3% are merges. So actually there are more merge
commits in the Android history than there are code commits. A cache of
only non-merges may be worthless on such a history.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Commit cache to speed up rev-list and merge
  2012-10-01  1:49   ` Shawn Pearce
@ 2012-10-01  2:05     ` Nguyen Thai Ngoc Duy
  2012-10-01  2:27       ` Shawn Pearce
  0 siblings, 1 reply; 10+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-10-01  2:05 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger

On Mon, Oct 1, 2012 at 8:49 AM, Shawn Pearce <spearce@spearce.org> wrote:
> On Thu, Sep 27, 2012 at 7:14 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> On Thu, Sep 27, 2012 at 10:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> In Linus' Linux kernel tree there are currently about 323,178 commits.
>>> If we store just the pre-parsed commit time as an int32 field this is
>>> an additional 1.2 MiB of data in the pack-*.idx file, assuming we can
>>> use additional data like pack offset position to correlate commit to
>>> the parsed int. If we stored parent pointers in a similar way you
>>> probably need at least 3.6 MiB of additional disk space on the index.
>>> For example, use 12 bytes for each commit to store enough of the
>>> parsed commit time to sort commits, and up to 2 parent pointers per
>>> commit.... with a reserved magic value for octopus merges to mean the
>>> commit itself has to be parsed to get the graph structure correct.
>>
>> This is much better than my naive approach (storing sha-1 and
>> timestamps). We could use less space by storing parent pointer of
>> non-merge commits only. Merge commits linux-2.6 is 6% the number of
>> commits. git.git has higher percentage, 21%. I bet many projects do
>> not merge as much and the number of merge commits is less than 5%.
>
> Some projects merge quite often. Android's frameworks/base repository
> has a very large number of merges. Out of 79905 commits reachable from
> the master branch, 65.3% are merges. So actually there are more merge
> commits in the Android history than there are code commits. A cache of
> only non-merges may be worthless on such a history.

The good thing about these cache is it's configurable. Merge-preferred
projects can choose to cache the first two parents. Non-merge projects
can choose to cache just the first parent. We don't need a fixed
format for both.
-- 
Duy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Commit cache to speed up rev-list and merge
  2012-10-01  2:05     ` Nguyen Thai Ngoc Duy
@ 2012-10-01  2:27       ` Shawn Pearce
  2012-10-01  5:16         ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 10+ messages in thread
From: Shawn Pearce @ 2012-10-01  2:27 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger

On Sun, Sep 30, 2012 at 7:05 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Mon, Oct 1, 2012 at 8:49 AM, Shawn Pearce <spearce@spearce.org> wrote:
>> On Thu, Sep 27, 2012 at 7:14 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>>> On Thu, Sep 27, 2012 at 10:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>>> In Linus' Linux kernel tree there are currently about 323,178 commits.
>>>> If we store just the pre-parsed commit time as an int32 field this is
>>>> an additional 1.2 MiB of data in the pack-*.idx file, assuming we can
>>>> use additional data like pack offset position to correlate commit to
>>>> the parsed int. If we stored parent pointers in a similar way you
>>>> probably need at least 3.6 MiB of additional disk space on the index.
>>>> For example, use 12 bytes for each commit to store enough of the
>>>> parsed commit time to sort commits, and up to 2 parent pointers per
>>>> commit.... with a reserved magic value for octopus merges to mean the
>>>> commit itself has to be parsed to get the graph structure correct.
>>>
>>> This is much better than my naive approach (storing sha-1 and
>>> timestamps). We could use less space by storing parent pointer of
>>> non-merge commits only. Merge commits linux-2.6 is 6% the number of
>>> commits. git.git has higher percentage, 21%. I bet many projects do
>>> not merge as much and the number of merge commits is less than 5%.
>>
>> Some projects merge quite often. Android's frameworks/base repository
>> has a very large number of merges. Out of 79905 commits reachable from
>> the master branch, 65.3% are merges. So actually there are more merge
>> commits in the Android history than there are code commits. A cache of
>> only non-merges may be worthless on such a history.
>
> The good thing about these cache is it's configurable. Merge-preferred
> projects can choose to cache the first two parents. Non-merge projects
> can choose to cache just the first parent. We don't need a fixed
> format for both.

Git has enough magic switches. It doesn't need yet another magic
switch that one group of users needs to set, and another can safely
ignore because their project's usage just happens to align with Linus
Torvald's current world view.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Commit cache to speed up rev-list and merge
  2012-10-01  2:27       ` Shawn Pearce
@ 2012-10-01  5:16         ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 10+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-10-01  5:16 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger

On Mon, Oct 1, 2012 at 9:27 AM, Shawn Pearce <spearce@spearce.org> wrote:
> Git has enough magic switches. It doesn't need yet another magic
> switch that one group of users needs to set, and another can safely
> ignore because their project's usage just happens to align with Linus
> Torvald's current world view.

I see it as tuning, not switching.  It's like setting the number of
commits where bitmaps are taken. We can see the commit cache as a
table, where columns are commit properties. We have a column for date,
one for the first parent. Users can choose to have a third column for
the second parent, or another one for root tree sha-1. The
implementation could be made generic to support caching any <n>
sha1-columns.
-- 
Duy

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-10-01  5:17 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-27 15:51 Commit cache to speed up rev-list and merge Shawn Pearce
2012-09-27 17:39 ` Jeff King
2012-09-27 17:45   ` Shawn Pearce
2012-09-27 18:32     ` Jeff King
2012-09-28  1:43   ` Nguyen Thai Ngoc Duy
2012-09-28  2:14 ` Nguyen Thai Ngoc Duy
2012-10-01  1:49   ` Shawn Pearce
2012-10-01  2:05     ` Nguyen Thai Ngoc Duy
2012-10-01  2:27       ` Shawn Pearce
2012-10-01  5:16         ` Nguyen Thai Ngoc Duy

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).