git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Derrick Stolee <stolee@gmail.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>, "Eric Wong" <e@80x24.org>,
	git@vger.kernel.org
Subject: Re: [PATCH v3] repack: enable bitmaps by default on bare repos
Date: Fri, 24 May 2019 03:24:26 -0400
Message-ID: <20190524072426.GG25694@sigill.intra.peff.net>
In-Reply-To: <1e0d216b-5e68-faff-883c-778a04f55bde@gmail.com>

On Thu, May 23, 2019 at 08:53:56AM -0400, Derrick Stolee wrote:

> > I spent a while thinking and experimenting with this tonight. The result
> > is the patch below. Ævar, do you still have a copy of the repo that
> > misbehaved? I'd be curious to see how it fares.
> 
> This patch caught my attention, and I think I understand some of the issues
> at hand. I'm not well-versed specifically in Git's bitmap implementation.
> The fundamental ideas are there, but my perspective is biased by my own
> independent bitmap implementation for Azure Repos. What worked there may not
> work at all here.

Thanks for looking at this. There are a lot of shortcomings in the
current bitmap implementation, so if there's a better way to do things,
I'm not opposed to moving to a new format. :)

> > Finding the right trees to explore is a little tricky with bitmaps.  In
> > a normal traversal, we consider the "edges" to be worth exploring.
> > I.e., the places where an UNINTERESTING commit is the parent of an
> > interesting one.
> 
> This is the "commit frontier" which I bring up again below.

Right. I actually had trouble coming up with a succinct way of
describing this, and stole the definition from your recent blog post. ;)

> > But with bitmaps, we don't have that information in the same way,
> > because we're trying to avoid walking the commits in the first place. So
> > e.g., in a history like this:
> > 
> >   A--B--C
> >       \
> >        D
> > 
> > Let's imagine we're computing "C..D", and that D has a bitmap on disk
> > but C does not.
> 
> (As I read your discussion below, I'm confused. For "C..D", C is a have
> and D is a want. We should explore all the haves _first_, then walk the
> wants, right?)

I think I may have confused things by starting my description with a
hypothetical combined want/have walk. To take a step back: the problem
we were discussing is that we spend a lot of time reading trees to fill
in the "have" bitmap, even though most of those objects are unlikely to
be in the "want" in the first place (only the frontier trees are really
useful).

The current code does indeed fill the "have" bitmap first (so that while
walking "want", it can avoid walking down paths whose bits we know we're
going to clear eventually anyway). But we can't know the frontier if we
completely fill the "have" bitmap first. We have to walk both sides
together, looking for the frontier.

> > When we visit D, we'll see that it has a bitmap, fill in
> > the results in our cumulative "want" bitmap, and then stop walking its
> > parents (because we know everything they could tell us already).
> 
> I may be misreading something, but we would construct _a_ bitmap for C
> by walking its reachable objects until hitting a bitmap we know about.
> Perhaps A or B have a bitmap to start the construction. If we never
> construct a bitmap for C, then we don't know what to remove from the "D"
> bitmap to show the difference.

Yes, and that's how the current code works. If we walk back to B and it
has a bitmap, we can stop walking there and just OR in its bitmap, and
look at the trees for any intermediate commits (in this case just C) to
fill in the rest.

The problem is that we don't have a bitmap for every commit. So you can
imagine a history like this:

  A_1 -- A_2 -- ... -- A_1000 -- B -- C
                                   \
                                    D

where we have a bitmap _only_ for D. Filling in an accurate bitmap for C
requires walking a thousand commits (and their trees!), when the
non-bitmap algorithm would find the frontier at B and call that good
enough.
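
To make that cost concrete, here is a toy sketch of the fill step. It is
not git's actual code: it assumes a single-parent history, bits that
stand for commits only (no trees or blobs), and names (toy_history,
fill_reachable, build_example) that are invented for illustration. The
walk stops as soon as it reaches a commit with a stored bitmap, and
returns how many commits it had to visit one at a time.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMMITS 1024
#define WORDS (MAX_COMMITS / 64)

/* Toy model (hypothetical): single-parent history, one bit per commit. */
struct toy_history {
	int parent[MAX_COMMITS];             /* -1 marks the root */
	const uint64_t *bitmap[MAX_COMMITS]; /* stored bitmap, or NULL if none */
};

static void set_bit(uint64_t *bits, int pos)
{
	bits[pos / 64] |= (uint64_t)1 << (pos % 64);
}

/* Commits 0..n-1 are A_1..A_n, then B = n, C = n + 1, D = n + 2. */
static void build_example(struct toy_history *h, int n)
{
	int i;

	assert(n >= 1 && n + 3 <= MAX_COMMITS);
	for (i = 0; i < MAX_COMMITS; i++) {
		h->parent[i] = -1;
		h->bitmap[i] = NULL;
	}
	for (i = 1; i <= n + 1; i++)
		h->parent[i] = i - 1; /* A_1 .. A_n, B, C form one chain */
	h->parent[n + 2] = n;         /* D branches off B */
}

/*
 * Fill "out" with every commit reachable from "tip". Returns the number
 * of commits visited individually before a stored bitmap (or the root)
 * ended the walk.
 */
static int fill_reachable(const struct toy_history *h, int tip, uint64_t *out)
{
	int walked = 0;
	int c, i;

	memset(out, 0, WORDS * sizeof(*out));
	for (c = tip; c >= 0; c = h->parent[c]) {
		assert(c < MAX_COMMITS);
		if (h->bitmap[c]) {
			/* a stored bitmap covers everything below; OR it in */
			for (i = 0; i < WORDS; i++)
				out[i] |= h->bitmap[c][i];
			break;
		}
		set_bit(out, c);
		walked++; /* a real walk would also open this commit's trees */
	}
	return walked;
}
```

With n = 1000 and a bitmap only on D, fill_reachable() starting at C
visits 1002 commits (C, B, and every A_i). Give B a bitmap and the same
call visits only C.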

> If "C" is not even in the bitmap pack, then we don't use bitmaps for
> this calculation because of this condition:
> 
>         /*
>          * if we have a HAVES list, but none of those haves is contained
>          * in the packfile that has a bitmap, we don't have anything to
>          * optimize here
>          */
>         if (haves && !in_bitmapped_pack(bitmap_git, haves))
>                 goto cleanup;

Right, but it may be in the pack without having a bitmap of its own. We
don't store a bitmap for every commit. The idea was to save space by
selecting a few key commits that generally let us find a bitmap with a
small amount of walking. And most of the time it works fine. But we seem
to hit pathological cases where we do quite a lot of walking.

As I said earlier in the thread, I suspect our commit selection is not
great. It's mostly some heuristics we threw together in 2013, and I
don't think it was tested very thoroughly. So fixing that may be a
simpler approach.

What I was wondering here was whether we could get an easy fix based on
the same frontier ideas that the non-bitmap walk uses.

> I'm of the opinion that the old method was better. It followed a very clear
> three-step process:
> 
>  1. Compute the "haves" bitmap.
> 
>  2. Compute the "wants" bitmap, but don't explore into the "haves" bitmap.
> 
>  3. Subtract the "haves" bitmap from the "wants" (in case we added bits to
>     the wants before they were in the haves).

Right, this is _way_ simpler. But it necessarily means spending effort
to find the complete "haves", because we don't know which parts are
useful.
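
For reference, the three steps can be sketched on a toy model like the
one below. This is only an illustration under stated assumptions, not
what pack-bitmap.c does: at most 64 commits, single-parent history, one
bit per commit and no trees, and hypothetical names (toy_repo, walk,
wants_minus_haves).

```c
#include <assert.h>
#include <stdint.h>

#define NCOMMITS 64

/* Toy model (hypothetical): commit i is bit i of a 64-bit word. */
struct toy_repo {
	int parent[NCOMMITS];      /* -1 marks a root */
	uint64_t stored[NCOMMITS]; /* on-disk bitmap for a commit; 0 means none */
};

/*
 * Walk from "tip", collecting reachable commits. Stop at a stored
 * bitmap, at the root, or at any commit whose bit is set in "stop".
 */
static uint64_t walk(const struct toy_repo *r, int tip, uint64_t stop)
{
	uint64_t acc = 0;
	int c;

	for (c = tip; c >= 0; c = r->parent[c]) {
		uint64_t bit;

		assert(c < NCOMMITS);
		bit = (uint64_t)1 << c;
		if (stop & bit)
			break;               /* don't explore into the haves */
		if (r->stored[c]) {
			acc |= r->stored[c]; /* may pull in bits that are in "stop" */
			break;
		}
		acc |= bit;
	}
	return acc;
}

static uint64_t wants_minus_haves(const struct toy_repo *r, int have, int want)
{
	uint64_t haves = walk(r, have, 0);     /* step 1: complete "haves" */
	uint64_t wants = walk(r, want, haves); /* step 2: "wants", bounded by haves */

	return wants & ~haves;                 /* step 3: subtract */
}
```

On the A--B--C / D example with a stored bitmap only on D, step 2 ORs
in D's whole bitmap (A and B included), and step 3 is what strips those
shared bits back out, leaving just D. Note that step 1 still walks all
of C's ancestry, which is exactly the cost in question.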

> But there is hope in your idea to improve some cases. As noted, we give up
> if none of the haves are in the bitmapped pack. By starting with a
> commit walk, we can find the "commit frontier," which is the set of commits
> that are explored by paint_down_to_common(). In this case, the set {B, C, D}.
> 
> After walking commits and finding a set of UNINTERESTING commits, we could
> look for any UNINTERESTING commits that have bitmaps (or simply are in the
> bitmapped pack) and use those to seed our "haves" bitmap. So, if B has a
> bitmap, but C is too new for the bitmap, then we can still create the haves
> based on B and then take a bitmap diff "D minus B".

But doing that commit walk to find the frontier negates part of the
purpose of using the bitmaps, which is avoiding even walking commits.
Going back to a variant of our example:

  A -- B -- C_1 -- .. -- C_1000
        \
         D_1 -- .. -- D_1000

If we have a bitmap at C_1000 and D_1000, we don't have to walk any
commits at all. But finding the frontier requires walking 2000 commits.

Is there a way to find it with just bitmaps? You can certainly find the
intersection, but you don't have any idea which of the many shared
commits is the merge base. Of course in this example you don't actually
care about the frontier (you have the full answer immediately). But how
do you decide which way to optimize: try to avoid walking commits to
get a quick answer from bitmaps, or prefer to walk some commits to find
the frontier and avoid opening too many trees?
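
A shortened version of that example (two commits per branch instead of
1000) shows both halves of the problem. This is a sketch only: the
commit-to-bit numbering and the helper names are made up, and the bits
cover commits but not trees.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions for A -- B -- C_1 -- C_2, with D_1 -- D_2 off B. */
enum { A, B, C_1, C_2, D_1, D_2 };
#define BIT(x) ((uint64_t)1 << (x))

/* Stored bitmaps at the two tips: everything reachable from each. */
static const uint64_t c_tip = BIT(A) | BIT(B) | BIT(C_1) | BIT(C_2);
static const uint64_t d_tip = BIT(A) | BIT(B) | BIT(D_1) | BIT(D_2);

/* "have..want" straight from two stored bitmaps, with no commit walk. */
static uint64_t only_in_want(uint64_t have, uint64_t want)
{
	return want & ~have;
}

/* The intersection: every shared commit, in no particular order. */
static uint64_t shared_bits(uint64_t have, uint64_t want)
{
	return have & want;
}

static int count_bits(uint64_t x)
{
	int n = 0;

	while (x) {
		x &= x - 1; /* clear the lowest set bit */
		n++;
	}
	return n;
}
```

Here only_in_want(c_tip, d_tip) yields {D_1, D_2} without touching a
single commit object. But shared_bits(c_tip, d_tip) is just {A, B} as
an unordered set: nothing in the bits says that B, rather than A, is
the merge base, because a bitmap records reachability and not parent
links.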

-Peff
