git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: "Jeff King" <peff@peff.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>, Eric Wong <e@80x24.org>,
	git@vger.kernel.org
Subject: Re: [PATCH v3] repack: enable bitmaps by default on bare repos
Date: Thu, 23 May 2019 08:53:56 -0400	[thread overview]
Message-ID: <1e0d216b-5e68-faff-883c-778a04f55bde@gmail.com> (raw)
In-Reply-To: <20190523113031.GA17448@sigill.intra.peff.net>

On 5/23/2019 7:30 AM, Jeff King wrote:
> On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote:
> 
>>   2. The answer we get from a bitmap versus a regular traversal are not
>>      apples-to-apples equivalent. The regular traversal walks down to
>>      the UNINTERESTING commits, marks the boundaries trees and blobs as
>>      UNINTERESTING, and then adds in all the interesting trees and blobs
>>      minus the UNINTERESTING parts. So it can sometimes give the wrong
>>      answer, claiming something is interesting when it is not.
>>
>>      Whereas with bitmaps we fill in the trees and blobs as we walk, and
>>      you get the true answer. But it means we may open up a lot more
>>      trees than the equivalent traversal would.
>>
>>      So one thing I've never really experimented with (and indeed, never
>>      really thought about until writing this email) is that the bitmaps
>>      could try to do that looser style of traversal, knowing that we
>>      might err on the side of calling things interesting in a few cases.
>>      But hopefully spending a lot less time opening trees.
>>
>>      I'm not even 100% sure what that would look like in code, but just
>>      thinking about it from a high level, I don't there's a particular
>>      mathematical reason it couldn't work.
> 
> I spent a while thinking and experimenting with this tonight. The result
> is the patch below. Ævar, do you still have a copy of the repo that
> misbehaved? I'd be curious to see how it fares.

This patch caught my attention, and I think I understand some of the issues
at hand. I'm not well-versed specifically in Git's bitmap implementation.
The fundamental ideas are there, but my perspective is biased by my own
independent bitmap implementation for Azure Repos. What worked there may not
work at all here.

> Finding the right trees to explore is a little tricky with bitmaps.  In
> a normal traversal, we consider the "edges" to be worth exploring.
> I.e., the places where an UNINTERESTING commit is the parent of an
> interesting one.

This is the "commit frontier" which I bring up again below.

> But with bitmaps, we don't have that information in the same way,
> because we're trying to avoid walking the commits in the first place. So
> e.g., in a history like this:
> 
>   A--B--C
>       \
>        D
> 
> Let's imagine we're computing "C..D", and that D has a bitmap on disk
> but C does not.

(As I read your discussion below, I'm confused. For "C..D", C is a have
and D is a want. We should explore all the haves _first_, then walk the
wants, right?)

> When we visit D, we'll see that it has a bitmap, fill in
> the results in our cumulative "want" bitmap, and then stop walking its
> parents (because we know everything they could tell us already).

I may be misreading something, but we would construct _a_ bitmap for C
by walking its reachable objects until hitting a bitmap we know about.
Perhaps A or B have a bitmap to start the construction. If we never
construct a bitmap for C, then we don't know what to remove from the "D"
bitmap to show the difference.

If "C" is not even in the bitmap pack, then we don't use bitmaps for
this calculation because of this condition:

        /*
         * if we have a HAVES list, but none of those haves is contained
         * in the packfile that has a bitmap, we don't have anything to
         * optimize here
         */
        if (haves && !in_bitmapped_pack(bitmap_git, haves))
                goto cleanup;

> Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal
> traversal, we can't stop walking even though there are only
> UNINTERESTING commits left, because we have to fill the complete bitmap,
> including A and B (and you can imagine there might be thousands of
> ancestors of A, too). The reason is that we skipped adding ancestors of
> D when we saw its bitmap, so no still_interesting() cutoff will work
> there.
> 
> But how do we know that it's worth looking into the tree of "B"? If we
> assume we visit commits before their ancestors (because of the commit
> timestamp ordering), then we'd see that "B" is in the "want" bitmap
> already, which makes it worth visiting its tree.
> 
> Unfortunately we'd then continue on to "A", and it would _also_ look
> interesting, because it's also in the "want" bitmap. We don't have an
> easy way of knowing that "A" is basically already covered by "B", and is
> therefore not worth pursuing. In this example, we could pass down a bit
> from B to A as we traverse. But you can imagine another interesting
> commit branched from A, which _would_ make A's tree worth considering.
> 
> Fundamentally, as soon as we load a bitmap and stop traversing, we lose
> all information about the graph structure.
> 
> So the patch below just looks at every tree that might be worth
> exploring (so both "A" and "B" here, but not "C"). That shouldn't be any
> _worse_ than what the current code does (because it looks at all the
> trees). It appears to behave correctly, at least so far as passing the
> test suite. Running p5310 and p5311 against git.git and linux.git, it
> seems to perform worse. I'm not quite sure why that is (i.e., if I
> screwed up something in the implementation, or if the algorithm is
> fundamentally worse).

I'm of the opinion that the old method was better. It followed a very clear
three-step process:

 1. Compute the "haves" bitmap.

 2. Compute the "wants" bitmap, but don't explore into the "haves" bitmap.

 3. Subtract the "haves" bitmap from the "wants" (in case we added bits to
    the wants before they were in the haves).

But there is hope in your idea to improve some cases. As noted, we give up
if all of the haves are not in the bitmapped pack. By starting with a
commit walk, we can find the "commit frontier," which is the set of commits
that are explored by paint_down_to_common(). In this case, the set {B, C, D}.

After walking commits and finding a set of UNINTERESTING commits, we could
look for any UNINTERESTING commits that have bitmaps (or simply are in the
bitmapped pack) and use those to see our "haves" bitmap. So, if B has a
bitmap, but C is too new for the bitmap, then we can still create the haves
based on B and then take a bitmap diff "D minus B".

In fact, using the "merge base set" for the diff reflects what the non-bitmap
algorithm does in this case. It ignores C and only explores B.

I learned a lot looking at both versions of this method side-by-side. Thanks!

-Stolee

  reply	other threads:[~2019-05-23 12:54 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-14  4:31 [PATCH 0/3] some prune optimizations Jeff King
2019-02-14  4:35 ` [PATCH 1/3] prune: lazily perform reachability traversal Jeff King
2019-02-14 10:54   ` Eric Sunshine
2019-02-14 11:07     ` Jeff King
2019-02-14  4:37 ` [PATCH 2/3] prune: use bitmaps for " Jeff King
2019-03-09  2:49   ` bitmaps by default? [was: prune: use bitmaps for reachability traversal] Eric Wong
2019-03-10 23:39     ` Jeff King
2019-03-12  3:13       ` [PATCH] repack: enable bitmaps by default on bare repos Eric Wong
2019-03-12  9:07         ` Ævar Arnfjörð Bjarmason
2019-03-12 10:49         ` Jeff King
2019-03-12 12:05           ` Jeff King
2019-03-13  1:51           ` Eric Wong
2019-03-13 14:54             ` Jeff King
2019-03-14  9:12               ` [PATCH v3] " Eric Wong
2019-03-14 16:02                 ` Jeff King
2019-03-15  6:21                   ` [PATCH 0/2] enable bitmap hash-cache by default Jeff King
2019-03-15  6:22                     ` [PATCH 1/2] t5310: correctly remove bitmaps for jgit test Jeff King
2019-03-15 13:25                       ` SZEDER Gábor
2019-03-15 18:36                         ` Jeff King
2019-03-15  6:25                     ` [PATCH 2/2] pack-objects: default to writing bitmap hash-cache Jeff King
2019-04-09 15:10                 ` [PATCH v3] repack: enable bitmaps by default on bare repos Ævar Arnfjörð Bjarmason
2019-04-10 22:57                   ` Jeff King
2019-04-25  7:16                     ` Junio C Hamano
2019-05-04  1:37                       ` Jeff King
2019-05-04  6:52                         ` Ævar Arnfjörð Bjarmason
2019-05-04 13:23                           ` SZEDER Gábor
2019-05-08 20:17                             ` Ævar Arnfjörð Bjarmason
2019-05-09  4:24                               ` Junio C Hamano
2019-05-07  7:45                           ` Jeff King
2019-05-07  8:12                             ` Ævar Arnfjörð Bjarmason
2019-05-08  7:11                               ` Jeff King
2019-05-08 14:20                                 ` Derrick Stolee
2019-05-08 16:13                                 ` Ævar Arnfjörð Bjarmason
2019-05-08 22:25                                   ` Jeff King
2019-05-23 11:30                     ` Jeff King
2019-05-23 12:53                       ` Derrick Stolee [this message]
2019-05-24  7:24                         ` Jeff King
2019-05-24 10:33                           ` Derrick Stolee
2019-05-23 19:26                       ` Ævar Arnfjörð Bjarmason
2019-05-24  7:27                         ` Jeff King
2019-05-24  7:55                           ` Ævar Arnfjörð Bjarmason
2019-05-24  8:26                             ` Jeff King
2019-05-24  9:01                               ` Ævar Arnfjörð Bjarmason
2019-05-24  9:29                                 ` SZEDER Gábor
2019-05-24 11:17                                   ` Ævar Arnfjörð Bjarmason
2019-05-24 11:41                                     ` SZEDER Gábor
2019-05-24 11:58                                       ` Ævar Arnfjörð Bjarmason
2019-05-24 12:34                                         ` SZEDER Gábor
2019-05-24 13:41                                           ` Ævar Arnfjörð Bjarmason
2019-05-24 11:31                       ` [PATCH] pack-bitmap: look for an uninteresting bitmap Derrick Stolee
2019-04-15 15:00   ` [PATCH 2/3] prune: use bitmaps for reachability traversal Derrick Stolee
2019-04-18 19:49     ` Jeff King
2019-04-18 20:08       ` [PATCH] t5304: add a test for pruning with bitmaps Jeff King
2019-04-20  1:01         ` Derrick Stolee
2019-04-20  3:24           ` Jeff King
2019-04-20 21:01             ` Derrick Stolee
2019-02-14  4:38 ` [PATCH 3/3] prune: check SEEN flag for reachability Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1e0d216b-5e68-faff-883c-778a04f55bde@gmail.com \
    --to=stolee@gmail.com \
    --cc=avarab@gmail.com \
    --cc=e@80x24.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).