Question about the ahead-behind computation and status

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Question about the ahead-behind computation and status
@ 2017-12-14 21:49 Jeff Hostetler
  2017-12-15 10:08 ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff Hostetler @ 2017-12-14 21:49 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano, Jeff King, Ben Peart,
	Jameson Miller, Derrick Stolee

We working with the (enormous) Windows repo, we observed
a performance problem in the ahead-behind computation and
were considering a few options.

We had a local repo with a local branch that was 150K
commits behind the upstream branch[*].  There was a ~20
second different in the run times for these 2 commands:

     $ git status --porcelain=v2
     $ git status --porcelain=v2 --branch

Profiling showed the additional time was spent computing
the ahead/behind values for the branch.  (The problem is
not specific to porcelain V2, that was just the command
where we discovered it; for example, there is a similar
perf problem in "git branch" vs "git branch -vv".)

I don't want to jump into the graph algorithm at this time,
but was wondering about adding a --no-ahead-behind flag (or
something similar or a config setting) that would disable
the a/b computation during status.

For status V2 output, we could omit the "# branch.ab x y"
line.  For normal status output, change the prose a/b
message to say something like "are [not] up to date".

Thoughts or suggestions ???

Thanks,
Jeff

[*] Sadly, the local repo was only about 20 days out of
     date (including the Thanksgiving holidays)....

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Question about the ahead-behind computation and status
  2017-12-14 21:49 Question about the ahead-behind computation and status Jeff Hostetler
@ 2017-12-15 10:08 ` Jeff King
  2017-12-15 15:08   ` Jeff Hostetler
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2017-12-15 10:08 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Git Mailing List, Junio C Hamano, Ben Peart, Jameson Miller,
	Derrick Stolee

On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote:

> I don't want to jump into the graph algorithm at this time,
> but was wondering about adding a --no-ahead-behind flag (or
> something similar or a config setting) that would disable
> the a/b computation during status.
> 
> For status V2 output, we could omit the "# branch.ab x y"
> line.  For normal status output, change the prose a/b
> message to say something like "are [not] up to date".

Is it actually "status --porcelain=v2" that you care about here? Or is
it normal "git status"?

Do you care about seeing the branch at all? I.e., would "--no-branch" be
a viable option (though I notice it seems to be a silent noop with the
long-form, which should probably be fixed).

If you really do want to see all branch details but just skip the
ahead/behind, then yeah, I'd agree that adding an option to do so makes
sense. Is "status" the only command that needs it? I think we do it
unconditionally with "git checkout", too.

> [*] Sadly, the local repo was only about 20 days out of
>     date (including the Thanksgiving holidays)....

Taking 20 seconds to traverse 20 days worth of history sounds pretty
awful. How many commits is it?

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Question about the ahead-behind computation and status
  2017-12-15 10:08 ` Jeff King
@ 2017-12-15 15:08   ` Jeff Hostetler
  2017-12-15 15:43     ` Derrick Stolee
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff Hostetler @ 2017-12-15 15:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Junio C Hamano, Ben Peart, Jameson Miller,
	Derrick Stolee

On 12/15/2017 5:08 AM, Jeff King wrote:
> On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote:
> 
>> I don't want to jump into the graph algorithm at this time,
>> but was wondering about adding a --no-ahead-behind flag (or
>> something similar or a config setting) that would disable
>> the a/b computation during status.
>>
>> For status V2 output, we could omit the "# branch.ab x y"
>> line.  For normal status output, change the prose a/b
>> message to say something like "are [not] up to date".
> 
> Is it actually "status --porcelain=v2" that you care about here? Or is
> it normal "git status"?

My primary concern is for porcelain V2 because Visual Studio
uses --branch with V2 to get the other branch details with
one process rather than two on inotify events.

Fixing/Helping normal "git status" would be nice for interactive
users.

> Do you care about seeing the branch at all? I.e., would "--no-branch" be
> a viable option (though I notice it seems to be a silent noop with the
> long-form, which should probably be fixed).

VS does not use the a/b counts at all, so it is just overhead.
VS does use some of the other branch fields, so it would be nice
to still have them.

It would be nice to minimize changes to the user experience.
Not printing anything about the a/b in the normal/short/long forms
might give the user a false sense of security.  I was thinking if
we changed the message to be "you are [not] up to date", they would
still know they needed to push.

As it is, at this scale, having the message give an exact count
just isn't that useful -- who cares if you are 150,000 or 149,000
behind.

Maybe we could do something like GCC and just give up at 100
and print "you are 100+ ...".

> If you really do want to see all branch details but just skip the
> ahead/behind, then yeah, I'd agree that adding an option to do so makes
> sense. Is "status" the only command that needs it? I think we do it
> unconditionally with "git checkout", too.

ok. thanks. i'll look at doing that.

Yeah, "checkout" and "branch -vv" could use it too.  I haven't looked
yet to see anywhere else we might want it, but these can wait.

>> [*] Sadly, the local repo was only about 20 days out of
>>      date (including the Thanksgiving holidays)....
> 
> Taking 20 seconds to traverse 20 days worth of history sounds pretty
> awful. How many commits is it?

150,000 commits, give or take.  The graph is many thousands of lanes
wide because of the merges and that isn't helping.

(But I should give you folks lots of credit -- it *only* took 20
seconds to find, unzip and decode 150,000 commit objects and walk
the commit graph.)

For a true solution, I think we want to wait for client-side
generation numbers before we spend any time looking at the
algorithm.

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Question about the ahead-behind computation and status
  2017-12-15 15:08   ` Jeff Hostetler
@ 2017-12-15 15:43     ` Derrick Stolee
  2017-12-15 18:30       ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Derrick Stolee @ 2017-12-15 15:43 UTC (permalink / raw)
  To: Jeff Hostetler, Jeff King
  Cc: Git Mailing List, Junio C Hamano, Ben Peart, Jameson Miller

On 12/15/2017 10:08 AM, Jeff Hostetler wrote:
> On 12/15/2017 5:08 AM, Jeff King wrote:
>> On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote:
>>> [*] Sadly, the local repo was only about 20 days out of
>>>      date (including the Thanksgiving holidays)....
>>
>> Taking 20 seconds to traverse 20 days worth of history sounds pretty
>> awful. How many commits is it?
>
> 150,000 commits, give or take.  The graph is many thousands of lanes
> wide because of the merges and that isn't helping.
>
> (But I should give you folks lots of credit -- it *only* took 20
> seconds to find, unzip and decode 150,000 commit objects and walk
> the commit graph.)

To give a few more data points, I created similar situation by checking 
out a big repo I hadn't updated in three months and it was 16,000 
commits behind. That took 1.5s to calculate the ahead/behind. Moving it 
100,000 commits behind it took 5s. This repo has about 300,000 total 
commits versus the 1.5 million commits in the Windows repo.

The biggest reason for the 20 seconds is not just the number of commits 
in the ahead/behind but how many commits are walked (including common to 
both branches) before paint_down_to_common() breaks its while loop due 
to queue_has_nonstale().

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Question about the ahead-behind computation and status
  2017-12-15 15:43     ` Derrick Stolee
@ 2017-12-15 18:30       ` Junio C Hamano
  2017-12-15 19:40         ` Derrick Stolee
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2017-12-15 18:30 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jeff Hostetler, Jeff King, Git Mailing List, Ben Peart,
	Jameson Miller

Derrick Stolee <stolee@gmail.com> writes:

> The biggest reason for the 20 seconds is not just the number of
> commits in the ahead/behind but how many commits are walked (including
> common to both branches) before paint_down_to_common() breaks its
> while loop due to queue_has_nonstale().

Hmm, queue_has_nonstale() looks to see if any element is not STALE
(where the definition of STALE is "known to be a common ancestor")
by potentially checking all elements in the queue.  I wonder if we
have an opportunity for a trivial optimization?  When the caller
knows that it dug one level and added the parents that are not
stale, it does not have to ask queue_has_nonstale() if there is any
non stale element, for example.

What do you exactly mean by "not just the number of commits in the
ahead/behind"?  Aren't the number of these commits pretty much
proportional to the number of commits we need to paint down to
common ancestors?  Is the latter a lot larger than the former
(i.e. are we somehow not stopping when we _could_ notice that we
can with better information)?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Question about the ahead-behind computation and status
  2017-12-15 18:30       ` Junio C Hamano
@ 2017-12-15 19:40         ` Derrick Stolee
  0 siblings, 0 replies; 6+ messages in thread
From: Derrick Stolee @ 2017-12-15 19:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff Hostetler, Jeff King, Git Mailing List, Ben Peart,
	Jameson Miller

On 12/15/2017 1:30 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> The biggest reason for the 20 seconds is not just the number of
>> commits in the ahead/behind but how many commits are walked (including
>> common to both branches) before paint_down_to_common() breaks its
>> while loop due to queue_has_nonstale().
> Hmm, queue_has_nonstale() looks to see if any element is not STALE
> (where the definition of STALE is "known to be a common ancestor")
> by potentially checking all elements in the queue.  I wonder if we
> have an opportunity for a trivial optimization?  When the caller
> knows that it dug one level and added the parents that are not
> stale, it does not have to ask queue_has_nonstale() if there is any
> non stale element, for example.

I thought this, too, but my tracing did not show significant time spent 
in this method. 99% of the time is spent unzipping and parsing commits.

If this was taking too long, then we could track a minimum timestamp for 
a commit that entered the queue in a non-stale state, but this will 
delay the termination condition a bit since commits can be marked stale 
after they enter the queue.

> What do you exactly mean by "not just the number of commits in the
> ahead/behind"?  Aren't the number of these commits pretty much
> proportional to the number of commits we need to paint down to
> common ancestors?  Is the latter a lot larger than the former
> (i.e. are we somehow not stopping when we _could_ notice that we
> can with better information)?
>
>

With the wide history, there is actually a large set of commits that are 
in the common history but have newer commit times than the oldest commit 
in only one history. Consider the following ASCII art:

   A
   |
   1
   |
   2
   |
   3
   |\
   4 B
   |\|
   5 C
   |\|
   6 D
   |\|
    .
    .
    .
   |\|
   N Z
   |/
   0

Between A and B, A is ahead by commits {A,1,2,3,4,5,6,...,N}. Meanwhile, 
commits B,C,D,...,Z are in the common history, but still must be walked.

Now imagine these two sets are actually much MUCH wider (thousands of 
commits that are pairwise non-ancestors). This causes the number of 
walked commits to be larger than just the number of commits in the 
symmetric difference of the histories.

Unfortunately, generation numbers are not the only optimization needed 
to make this call be sub-100ms. A graph data structure that lists the 
edges between commits would prevent the time spent in zlib decompressing 
and parsing commits. I'm working on investigating how such a data 
structure (and corresponding file format) could integrate in the 
commit-walking code.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-12-15 19:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-14 21:49 Question about the ahead-behind computation and status Jeff Hostetler
2017-12-15 10:08 ` Jeff King
2017-12-15 15:08   ` Jeff Hostetler
2017-12-15 15:43     ` Derrick Stolee
2017-12-15 18:30       ` Junio C Hamano
2017-12-15 19:40         ` Derrick Stolee

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).