* Question about the ahead-behind computation and status @ 2017-12-14 21:49 Jeff Hostetler 2017-12-15 10:08 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Jeff Hostetler @ 2017-12-14 21:49 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano, Jeff King, Ben Peart, Jameson Miller, Derrick Stolee We working with the (enormous) Windows repo, we observed a performance problem in the ahead-behind computation and were considering a few options. We had a local repo with a local branch that was 150K commits behind the upstream branch[*]. There was a ~20 second different in the run times for these 2 commands: $ git status --porcelain=v2 $ git status --porcelain=v2 --branch Profiling showed the additional time was spent computing the ahead/behind values for the branch. (The problem is not specific to porcelain V2, that was just the command where we discovered it; for example, there is a similar perf problem in "git branch" vs "git branch -vv".) I don't want to jump into the graph algorithm at this time, but was wondering about adding a --no-ahead-behind flag (or something similar or a config setting) that would disable the a/b computation during status. For status V2 output, we could omit the "# branch.ab x y" line. For normal status output, change the prose a/b message to say something like "are [not] up to date". Thoughts or suggestions ??? Thanks, Jeff [*] Sadly, the local repo was only about 20 days out of date (including the Thanksgiving holidays).... ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question about the ahead-behind computation and status 2017-12-14 21:49 Question about the ahead-behind computation and status Jeff Hostetler @ 2017-12-15 10:08 ` Jeff King 2017-12-15 15:08 ` Jeff Hostetler 0 siblings, 1 reply; 6+ messages in thread From: Jeff King @ 2017-12-15 10:08 UTC (permalink / raw) To: Jeff Hostetler Cc: Git Mailing List, Junio C Hamano, Ben Peart, Jameson Miller, Derrick Stolee On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote: > I don't want to jump into the graph algorithm at this time, > but was wondering about adding a --no-ahead-behind flag (or > something similar or a config setting) that would disable > the a/b computation during status. > > For status V2 output, we could omit the "# branch.ab x y" > line. For normal status output, change the prose a/b > message to say something like "are [not] up to date". Is it actually "status --porcelain=v2" that you care about here? Or is it normal "git status"? Do you care about seeing the branch at all? I.e., would "--no-branch" be a viable option (though I notice it seems to be a silent noop with the long-form, which should probably be fixed). If you really do want to see all branch details but just skip the ahead/behind, then yeah, I'd agree that adding an option to do so makes sense. Is "status" the only command that needs it? I think we do it unconditionally with "git checkout", too. > [*] Sadly, the local repo was only about 20 days out of > date (including the Thanksgiving holidays).... Taking 20 seconds to traverse 20 days worth of history sounds pretty awful. How many commits is it? -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question about the ahead-behind computation and status 2017-12-15 10:08 ` Jeff King @ 2017-12-15 15:08 ` Jeff Hostetler 2017-12-15 15:43 ` Derrick Stolee 0 siblings, 1 reply; 6+ messages in thread From: Jeff Hostetler @ 2017-12-15 15:08 UTC (permalink / raw) To: Jeff King Cc: Git Mailing List, Junio C Hamano, Ben Peart, Jameson Miller, Derrick Stolee On 12/15/2017 5:08 AM, Jeff King wrote: > On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote: > >> I don't want to jump into the graph algorithm at this time, >> but was wondering about adding a --no-ahead-behind flag (or >> something similar or a config setting) that would disable >> the a/b computation during status. >> >> For status V2 output, we could omit the "# branch.ab x y" >> line. For normal status output, change the prose a/b >> message to say something like "are [not] up to date". > > Is it actually "status --porcelain=v2" that you care about here? Or is > it normal "git status"? My primary concern is for porcelain V2 because Visual Studio uses --branch with V2 to get the other branch details with one process rather than two on inotify events. Fixing/Helping normal "git status" would be nice for interactive users. > Do you care about seeing the branch at all? I.e., would "--no-branch" be > a viable option (though I notice it seems to be a silent noop with the > long-form, which should probably be fixed). VS does not use the a/b counts at all, so it is just overhead. VS does use some of the other branch fields, so it would be nice to still have them. It would be nice to minimize changes to the user experience. Not printing anything about the a/b in the normal/short/long forms might give the user a false sense of security. I was thinking if we changed the message to be "you are [not] up to date", they would still know they needed to push. As it is, at this scale, having the message give an exact count just isn't that useful -- who cares if you are 150,000 or 149,000 behind. Maybe we could do something like GCC and just give up at 100 and print "you are 100+ ...". > If you really do want to see all branch details but just skip the > ahead/behind, then yeah, I'd agree that adding an option to do so makes > sense. Is "status" the only command that needs it? I think we do it > unconditionally with "git checkout", too. ok. thanks. i'll look at doing that. Yeah, "checkout" and "branch -vv" could use it too. I haven't looked yet to see anywhere else we might want it, but these can wait. >> [*] Sadly, the local repo was only about 20 days out of >> date (including the Thanksgiving holidays).... > > Taking 20 seconds to traverse 20 days worth of history sounds pretty > awful. How many commits is it? 150,000 commits, give or take. The graph is many thousands of lanes wide because of the merges and that isn't helping. (But I should give you folks lots of credit -- it *only* took 20 seconds to find, unzip and decode 150,000 commit objects and walk the commit graph.) For a true solution, I think we want to wait for client-side generation numbers before we spend any time looking at the algorithm. Thanks, Jeff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question about the ahead-behind computation and status 2017-12-15 15:08 ` Jeff Hostetler @ 2017-12-15 15:43 ` Derrick Stolee 2017-12-15 18:30 ` Junio C Hamano 0 siblings, 1 reply; 6+ messages in thread From: Derrick Stolee @ 2017-12-15 15:43 UTC (permalink / raw) To: Jeff Hostetler, Jeff King Cc: Git Mailing List, Junio C Hamano, Ben Peart, Jameson Miller On 12/15/2017 10:08 AM, Jeff Hostetler wrote: > On 12/15/2017 5:08 AM, Jeff King wrote: >> On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote: >>> [*] Sadly, the local repo was only about 20 days out of >>> date (including the Thanksgiving holidays).... >> >> Taking 20 seconds to traverse 20 days worth of history sounds pretty >> awful. How many commits is it? > > 150,000 commits, give or take. The graph is many thousands of lanes > wide because of the merges and that isn't helping. > > (But I should give you folks lots of credit -- it *only* took 20 > seconds to find, unzip and decode 150,000 commit objects and walk > the commit graph.) To give a few more data points, I created similar situation by checking out a big repo I hadn't updated in three months and it was 16,000 commits behind. That took 1.5s to calculate the ahead/behind. Moving it 100,000 commits behind it took 5s. This repo has about 300,000 total commits versus the 1.5 million commits in the Windows repo. The biggest reason for the 20 seconds is not just the number of commits in the ahead/behind but how many commits are walked (including common to both branches) before paint_down_to_common() breaks its while loop due to queue_has_nonstale(). Thanks, -Stolee ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question about the ahead-behind computation and status 2017-12-15 15:43 ` Derrick Stolee @ 2017-12-15 18:30 ` Junio C Hamano 2017-12-15 19:40 ` Derrick Stolee 0 siblings, 1 reply; 6+ messages in thread From: Junio C Hamano @ 2017-12-15 18:30 UTC (permalink / raw) To: Derrick Stolee Cc: Jeff Hostetler, Jeff King, Git Mailing List, Ben Peart, Jameson Miller Derrick Stolee <stolee@gmail.com> writes: > The biggest reason for the 20 seconds is not just the number of > commits in the ahead/behind but how many commits are walked (including > common to both branches) before paint_down_to_common() breaks its > while loop due to queue_has_nonstale(). Hmm, queue_has_nonstale() looks to see if any element is not STALE (where the definition of STALE is "known to be a common ancestor") by potentially checking all elements in the queue. I wonder if we have an opportunity for a trivial optimization? When the caller knows that it dug one level and added the parents that are not stale, it does not have to ask queue_has_nonstale() if there is any non stale element, for example. What do you exactly mean by "not just the number of commits in the ahead/behind"? Aren't the number of these commits pretty much proportional to the number of commits we need to paint down to common ancestors? Is the latter a lot larger than the former (i.e. are we somehow not stopping when we _could_ notice that we can with better information)? ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Question about the ahead-behind computation and status 2017-12-15 18:30 ` Junio C Hamano @ 2017-12-15 19:40 ` Derrick Stolee 0 siblings, 0 replies; 6+ messages in thread From: Derrick Stolee @ 2017-12-15 19:40 UTC (permalink / raw) To: Junio C Hamano Cc: Jeff Hostetler, Jeff King, Git Mailing List, Ben Peart, Jameson Miller On 12/15/2017 1:30 PM, Junio C Hamano wrote: > Derrick Stolee <stolee@gmail.com> writes: > >> The biggest reason for the 20 seconds is not just the number of >> commits in the ahead/behind but how many commits are walked (including >> common to both branches) before paint_down_to_common() breaks its >> while loop due to queue_has_nonstale(). > Hmm, queue_has_nonstale() looks to see if any element is not STALE > (where the definition of STALE is "known to be a common ancestor") > by potentially checking all elements in the queue. I wonder if we > have an opportunity for a trivial optimization? When the caller > knows that it dug one level and added the parents that are not > stale, it does not have to ask queue_has_nonstale() if there is any > non stale element, for example. I thought this, too, but my tracing did not show significant time spent in this method. 99% of the time is spent unzipping and parsing commits. If this was taking too long, then we could track a minimum timestamp for a commit that entered the queue in a non-stale state, but this will delay the termination condition a bit since commits can be marked stale after they enter the queue. > What do you exactly mean by "not just the number of commits in the > ahead/behind"? Aren't the number of these commits pretty much > proportional to the number of commits we need to paint down to > common ancestors? Is the latter a lot larger than the former > (i.e. are we somehow not stopping when we _could_ notice that we > can with better information)? > > With the wide history, there is actually a large set of commits that are in the common history but have newer commit times than the oldest commit in only one history. Consider the following ASCII art: A | 1 | 2 | 3 |\ 4 B |\| 5 C |\| 6 D |\| . . . |\| N Z |/ 0 Between A and B, A is ahead by commits {A,1,2,3,4,5,6,...,N}. Meanwhile, commits B,C,D,...,Z are in the common history, but still must be walked. Now imagine these two sets are actually much MUCH wider (thousands of commits that are pairwise non-ancestors). This causes the number of walked commits to be larger than just the number of commits in the symmetric difference of the histories. Unfortunately, generation numbers are not the only optimization needed to make this call be sub-100ms. A graph data structure that lists the edges between commits would prevent the time spent in zlib decompressing and parsing commits. I'm working on investigating how such a data structure (and corresponding file format) could integrate in the commit-walking code. Thanks, -Stolee ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-12-15 19:40 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-12-14 21:49 Question about the ahead-behind computation and status Jeff Hostetler 2017-12-15 10:08 ` Jeff King 2017-12-15 15:08 ` Jeff Hostetler 2017-12-15 15:43 ` Derrick Stolee 2017-12-15 18:30 ` Junio C Hamano 2017-12-15 19:40 ` Derrick Stolee
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).