git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Derrick Stolee <stolee@gmail.com>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, Jeff King <peff@peff.net>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
Date: Fri, 26 Oct 2018 22:56:09 +0200	[thread overview]
Message-ID: <8636ss5swm.fsf@gmail.com> (raw)
In-Reply-To: <xmqqk1m6ig52.fsf@gitster-ct.c.googlers.com> (Junio C. Hamano's message of "Thu, 25 Oct 2018 17:28:57 +0900")

Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <stolee@gmail.com> writes:
>
>>     time git log --topo-order -10 master >/dev/null
>>
>>     time git log --topo-order -10 maint..master >/dev/null
>>
>> I get 0.39s for the first call and 0.01s for the second. (Note: I
>> specified "-10" to ensure we are only writing 10 commits and the
>> output size does not factor into the time.) This is because the first
>> walks the entire history, while the second uses the heuristic walk to
>> identify a much smaller subgraph that the topo-order algorithm uses.
>
> The algorithm can be fooled by skewed timestamps (i.e. that SLOP
> thing tries to work around), but is helped by being able to leave
> early, and it will give us the correct answer as long as there is no
> timestamp inversion.
>
> But monotonically increasing "timestamp" without inversion is what
> we invented "generation numbers" for, no?  When there is no
> timestamp inversion, would a walk based on commit timestamps walk
> smaller set than a walk based on commit generation numbers?

The problem, as far as I understand it, is in B..A case, to be more
exact with the walk from the exclusion set.  There can be cases when
there are two paths from the commit in the exclusion set, and sorting by
generation number will always walk the longer one, while date-order
based heuristic walk traverses smaller subgraph.

This problem was described in "[PATCH 1/1] commit: don't use generation
numbers if not needed" [1]; relevant fragment cited below:

DS> For instance, computing the merge-base between consecutive versions of
DS> the Linux kernel has no effect for versions after v4.9, but 'git
DS> merge-base v4.8 v4.9' presents a performance regression [...]
DS> 
DS> The topology of this case can be described in a simplified way
DS> here:
DS> 
DS>   v4.9
DS>    |  \
DS>    |   \
DS>   v4.8  \
DS>    | \   \
DS>    |  \   |
DS>   ...  A  B
DS>    |  /  /
DS>    | /  /
DS>    |/__/
DS>    C
DS> 
DS> Here, the "..." means "a very long line of commits". By generation
DS> number, A and B have generation one more than C. However, A and B
DS> have commit date higher than most of the commits reachable from
DS> v4.8. When the walk reaches v4.8, we realize that it has PARENT1
DS> and PARENT2 flags, so everything it can reach is marked as STALE,
DS> including A. B has only the PARENT1 flag, so is not STALE.
DS> 
DS> When paint_down_to_common() is run using
DS> compare_commits_by_commit_date, A and B are removed from the queue
DS> early and C is inserted into the queue. At this point, C and the
DS> rest of the queue entries are marked as STALE. The loop then
DS> terminates.
DS> 
DS> When paint_down_to_common() is run using
DS> compare_commits_by_gen_then_commit_date, B is removed from the
DS> queue only after the many commits reachable from v4.8 are explored.
DS> This causes the loop to run longer. The reason for this regression
DS> is simple: the queue order is intended to not explore a commit
DS> until everything that _could_ reach that commit is explored. From
DS> the information gathered by the original ordering, we have no
DS> guarantee that there is not a commit D reachable from v4.8 that
DS> can also reach B. We gained absolute correctness in exchange for
DS> a performance regression.

[1]: https://public-inbox.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/T/#u

>> Just as before, by using this algorithm for the B..A case, we are
>> adding an extra restriction on the algorithm: always be correct. This
>> results in us walking a larger set (everything reachable from B or A
>> with generation number at least the smallest generation of a commit
>> reachable from only one).
>>
>> I believe this can be handled by using a smarter generation number
>> (one that relies on commit date as a heuristic, but still have enough
>> information to guarantee topological relationships), and I've already
>> started testing a few of these directions.
>
> Good to hear.

I'm not sure if you can enhance generation numbers, or rather using /
sorting of generation numbers in such way.

With generation numbers, if you have two possible paths to the merge
commit, and one is much longer than the other, then the commits on this
longer path will have larger generation numbers.


        0     1     2     3     4     5     6     7       = gen(c) - gen(B)
    --- B<----*<----*<----*<----*<----*<----*<----M ---
        ^                                        /
         \--------------------------x<--------x</         = gen(c) - gen(B)
                                    1         2

    ===================== commit date ======================>

Using priority-queue which selects max generation number would then
always walk the longer path (or walk longer path first).

This does not matter for the walk from the inclusive set (A in B..A), as
we want to walk all commits anyway; we walk both paths, it does not
matter which we walk first (it may change the unimportant parts of
topological sort, though).

For the walk from the exclusive set (B in B..A), any path is good; we
don't need and do not want to walk all paths.  Sorting by generation
numbers will always choose the longer path, while sorting by commit date
may walk the "shortcut" first, thus marking commits reachable from the
inclusive subset (from A) STALE earlier, thus earlier notice of
all-stale, and end of walk.

I think that in the case of walking from the exclusion subset, using
positive-cut reachbility index would be more useful than trying to come
up with "smarter generation number" (though I am not saying that it
would be impossible).  For example using reachability bitmap indices, or
post-order in spanning tree (the latter descibed in [2]) to mark more
commits stale than just parents, may help more.

Actually, this is something independent from "smarter generation
numbers", and can only help...

[2]: "[RFC] Other chunks for commit-graph, part 2 - reachability indexes"
     https://public-inbox.org/git/86muxcuyod.fsf@gmail.com/

-- 
Jakub Narębski

  reply	other threads:[~2018-10-26 20:56 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 2/6] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 3/6] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 4/6] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 5/6] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 6/6] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
2018-08-27 21:23 ` [PATCH 0/6] Use generation numbers for --topo-order Junio C Hamano
2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 2/6] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
2018-09-18 18:02     ` SZEDER Gábor
2018-09-19 19:31       ` Junio C Hamano
2018-09-19 19:38         ` Junio C Hamano
2018-09-20 21:18           ` Junio C Hamano
2018-09-18  4:08   ` [PATCH v2 3/6] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 4/6] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 5/6] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 6/6] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
2018-09-18  5:51     ` Ævar Arnfjörð Bjarmason
2018-09-18  6:05   ` [PATCH v2 0/6] Use generation numbers for --topo-order Ævar Arnfjörð Bjarmason
2018-09-21 15:47     ` Derrick Stolee
2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
2018-09-21 17:39     ` [PATCH v3 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
2018-09-26 19:15       ` Derrick Stolee
2018-10-11 13:54       ` Jeff King
2018-09-21 17:39     ` [PATCH v3 2/7] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
2018-10-11 13:57       ` Jeff King
2018-09-21 17:39     ` [PATCH v3 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
2018-10-11 13:58       ` Jeff King
2018-10-12  4:34         ` Junio C Hamano
2018-09-21 17:39     ` [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
2018-10-11 14:06       ` Jeff King
2018-10-12  6:33       ` Junio C Hamano
2018-10-12 12:32         ` Derrick Stolee
2018-10-12 16:15         ` Johannes Sixt
2018-10-13  8:05           ` Junio C Hamano
2018-09-21 17:39     ` [PATCH v3 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
2018-10-11 14:21       ` Jeff King
2018-09-21 17:39     ` [PATCH v3 6/7] revision.h: add whitespace in flag definitions Derrick Stolee via GitGitGadget
2018-09-21 17:39     ` [PATCH v3 7/7] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
2018-09-27 17:57       ` Derrick Stolee
2018-10-06 16:56         ` Jakub Narebski
2018-10-11 15:35       ` Jeff King
2018-10-11 16:21         ` Derrick Stolee
2018-10-25  9:43           ` Jeff King
2018-10-25 13:00             ` Derrick Stolee
2018-10-11 22:32       ` Stefan Beller
2018-09-21 21:22     ` [PATCH v3 0/7] Use generation numbers for --topo-order Junio C Hamano
2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2018-10-16 22:36       ` [PATCH v4 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
2018-10-16 22:36       ` [PATCH v4 2/7] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
2018-10-16 22:36       ` [PATCH v4 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
2018-10-21 10:21         ` Jakub Narebski
2018-10-21 15:28           ` Derrick Stolee
2018-10-16 22:36       ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
2018-10-21 15:55         ` Jakub Narebski
2018-10-22  1:12           ` Junio C Hamano
2018-10-22  1:51             ` Derrick Stolee
2018-10-22  1:55               ` [RFC PATCH] revision.c: use new algorithm in A..B case Derrick Stolee
2018-10-25  8:28               ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Junio C Hamano
2018-10-26 20:56                 ` Jakub Narebski [this message]
2018-10-16 22:36       ` [PATCH v4 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
2018-10-21 21:17         ` Jakub Narebski
2018-10-16 22:36       ` [PATCH v4 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee via GitGitGadget
2018-10-22 13:37         ` Jakub Narebski
2018-10-23 13:54           ` Derrick Stolee
2018-10-26 16:55             ` Jakub Narebski
2018-10-16 22:36       ` [PATCH v4 7/7] t6012: make rev-list tests more interesting Derrick Stolee via GitGitGadget
2018-10-23 15:48         ` Jakub Narebski
2018-10-21 12:57       ` [PATCH v4 0/7] Use generation numbers for --topo-order Jakub Narebski
2018-11-01  5:21       ` Junio C Hamano
2018-11-01 13:49         ` Derrick Stolee
2018-11-01 23:54           ` Junio C Hamano
2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 1/7] prio-queue: add 'peek' operation Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 2/7] test-reach: add run_three_modes method Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 3/7] test-reach: add rev-list tests Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee
2018-11-01 15:48           ` SZEDER Gábor
2018-11-01 16:12             ` Derrick Stolee
2019-11-08  2:50           ` Mike Hommey
2019-11-11  1:07             ` Derrick Stolee
2019-11-18 23:04               ` SZEDER Gábor
2018-11-01 13:46         ` [PATCH v5 7/7] t6012: make rev-list tests more interesting Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8636ss5swm.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).