git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Christian Couder <christian.couder@gmail.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Christian Couder <chriscool@tuxfamily.org>,
	Ramsay Jones <ramsay@ramsayjones.plus.com>,
	Jonathan Tan <jonathantanmy@google.com>
Subject: Re: [PATCH v3 9/9] pack-objects: improve partial packfile reuse
Date: Mon, 9 Dec 2019 03:11:29 -0500	[thread overview]
Message-ID: <20191209081129.GA1546451@coredump.intra.peff.net> (raw)
In-Reply-To: <20191115141541.11149-10-chriscool@tuxfamily.org>

On Fri, Nov 15, 2019 at 03:15:41PM +0100, Christian Couder wrote:

> The old code to reuse deltas from an existing packfile
> just tried to dump a whole segment of the pack verbatim.
> That's faster than the traditional way of actually adding
> objects to the packing list, but it didn't kick in very
> often. This new code is really going for a middle ground:
> do _some_ per-object work, but way less than we'd
> traditionally do.

This is a good intro to the general problem, but...

> For instance, packing torvalds/linux on GitHub servers
> just now reused 6.5M objects, but only needed ~50k
> chunks.

What the heck is a chunk? :)

I know, because I wrote the patches, but I think we need to sketch out
the solution a bit for the reader.

I think the flow here is:

  - we want to do some per-object work (i.e., your intro above)

  - the general strategy is to make a bitmap of objects from the
    packfile we'll include, and then iterate over it, writing out each
    object exactly as it is in our on-disk pack, but _not_ adding it to
    our packlist (which costs memory, and increases the search space for
    deltas)

  - one complication is that if we're omitting some objects, we can't
    set a delta against a base that we're not sending. So we have to
    check each object in try_partial_reuse() to make sure we have its
    delta (actually, that third big comment in that function is
    incomplete, I think; it talks about sending the object later, not as
    part of the reused pack, but we might not send it at all!).

  - another complication is that when we omit parts of the packfile,
    that screws up delta base offsets. So to handle that, we have to
    keep track of which chunks we send (which is the bits you included
    below about chunks)

  - and then we can talk about performance; worst case we might have
    interleaved objects that we are sending or not sending, and we'd
    have as many chunks as objects. But in practice we send big chunks,
    so that's where the 6.5M objects vs 50k chunks comes in.

> Additional checks are added in have_duplicate_entry()
> and obj_is_packed() to avoid duplicate objects in the
> reuse bitmap. It was probably buggy to not have such a
> check before.
> 
> If a client both asks for a tag by sha1 and specifies
> "include-tag", we may end up including the tag in the reuse
> bitmap (due to the first thing), and then later adding it to
> the packlist (due to the second). This results in duplicate
> objects in the pack, which git chokes on. We should notice
> that we are already including it when doing the include-tag
> portion, and avoid adding it to the packlist.
> 
> The simplest place to fix this is right in add_ref_tag,
> where we could avoid peeling the tag at all if we know that
> we are already including it. However, I've pushed the check
> instead into have_duplicate_entry(). This fixes not only
> this case, but also means that we cannot have any similar
> problems lurking in other code.

I think this part could go in its own commit. If we introduce
reuse_packfile_bitmap early, even if it's always an all-or-nothing chunk
at the beginning of the file with the existing code, then we can
introduce the extra checks in have_duplicate_entry() on top of that.

And then it would be safe to do the cleanup in show_objects_from_type()
that triggers the test failure in patch 4.

>  builtin/pack-objects.c | 222 ++++++++++++++++++++++++++++++++---------
>  pack-bitmap.c          | 150 ++++++++++++++++++++--------
>  pack-bitmap.h          |   3 +-
>  3 files changed, 288 insertions(+), 87 deletions(-)

It might be worth having a perf test here. The case this is helping the
most, I think, is when somebody cloning wants all of the objects from
the beginning of the pack, but then there's a bunch of _other_ stuff.

That could be unreachable objects kept by "repack -k", or maybe objects
reachable outside of heads and tags. Or objects that are part of other
delta islands. This last is the most plausible, really, because we'll
put all of the root-island objects at the beginning of the pack. So
maybe a good perf test would be to take an existing repository add a
bunch of new commits in refs/foo/, and then repack with delta islands
such that refs/heads and refs/tags are in one (root) island, but
refs/foo is in another.

The old code should fail to do the pack-reuse thing, but we should get
pretty good reuse with the new code (and less CPU and peak RAM,
hopefully, though the perf suite doesn't measure RAM directly).


Answering some questions from Jonathan's response in the last round
(some of the quotes are his, some are yours):

> I still don't understand this part. Going off what's written in the
> commit message here, it seems to me that the issue is that (1) an object
> can be added to the reuse bitmap without going through to_pack, and (2)
> this is done when the client asks for a tag by SHA-1. But all objects
> when specified by SHA-1 go through to_pack, don't they?

No, if it's part of the reused chunk, we'd avoid adding it to to_pack at
all (and I think the commit message should make that more clear, as I
described above).

> >> No tests, because git does not actually exhibit this "ask
> >> for it and also include-tag" behavior. We do one or the
> >> other on clone, depending on whether --single-branch is set.
> >> However, libgit2 does both.
>
> So my wild guess sould be that maybe the reason is that some of this
> code was shared (or intended to be shared) with libgit2?

No, the question is how the client behaves, and how we on the server
react to it. Git as a client would never ask for both include-tag and
for the tag itself by sha1. But libgit2 will, so a libgit2 client
cloning from a Git server would trigger the bug.

> > +     if (pos >= bitmap_git->pack->num_objects)
> > +             return; /* not actually in the pack */
>
> Is this case possible? try_partial_reuse() is only called when there is
> a 1 at the bit location.

Yes, it's possible. The result of a bitmap walk is larger than a given
pack, because we have to extend it to match whatever objects the caller
asked for. E.g., imagine we have commit A, repack into into a bitmapped
pack, and then somebody pushes up commit B. Now I want to clone, getting
both A and B.

Our bitmap result represents the whole answer, and so must include both
objects. But since "B" isn't in the pack, it doesn't have an assigned
bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
bit (valid only for that walk result!). But of course for pack reuse, we
only care about the objects that were actually in the bitmapped pack.

> > +     /* Don't mark objects not in the packfile */
> > +     if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> > +             i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
>
> Why is this necessary? Let's say we have 42 fully-1 words, and therefore
> i is 42. Shouldn't we allocate 42 words in our reuse bitmap?

This is the same issue as above. Imagine instead of two objects, imagine
we have 42 words worth. But if only 20 words worth are in the packfile,
then we have to stop there.

Now we _could_ figure this out for each individual object through the
similar logic in try_partial_reuse(). But we try to take an even faster
path for the initial chunk of objects. We don't even walk over them
looking to see if they're deltas or not. We just send that chunk of the
pack verbatim.

I don't have numbers for how often we hit this path versus the
individual try_partial_reuse() path. Possibly the stats could
differentiate that.

Thinking on it more, though, I wonder if there's a bug hiding here. We
know that we can send the whole initial chunk verbatim for OFS_DELTA
objects, because the relative offsets will remain unchanged (i.e., there
are no "holes" that trigger our chunk code). But if there were a
REF_DELTA in that initial chunk, we have no certainty that the base is
being sent.

In practice, this doesn't happen because the objects in question have to
be in a pack with bitmaps, which means it was written ourselves by
git-repack. And we'd never write REF_DELTA objects there.

But I wonder if it's worth being a bit more careful (and what the
performance impact is; it would mean checking the type of every object
in that initial chunk).

-Peff

  reply	other threads:[~2019-12-09  8:11 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
2019-11-15 14:15 ` [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
2019-12-09  6:24   ` Jeff King
2019-12-11 13:48     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 2/9] packfile: expose get_delta_base() Christian Couder
2019-12-09  6:26   ` Jeff King
2019-11-15 14:15 ` [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
2019-12-09  6:28   ` Jeff King
2019-12-11 13:50     ` Christian Couder
2019-12-12  5:45       ` Jeff King
2019-11-15 14:15 ` [PATCH v3 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
2019-12-09  6:47   ` Jeff King
2019-12-13 13:26     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
2019-12-09  7:06   ` Jeff King
2019-12-13 13:27     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 6/9] csum-file: introduce hashfile_total() Christian Couder
2019-12-09  7:07   ` Jeff King
2019-11-15 14:15 ` [PATCH v3 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
2019-12-09  7:14   ` Jeff King
2019-12-13 13:27     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
2019-12-09  7:14   ` Jeff King
2019-11-15 14:15 ` [PATCH v3 9/9] pack-objects: improve partial packfile reuse Christian Couder
2019-12-09  8:11   ` Jeff King [this message]
2019-12-18 11:26     ` Christian Couder
2019-12-19  0:42       ` Jeff King
2020-01-23 22:29         ` Christian Couder
2019-11-15 18:03 ` [PATCH v3 0/9] Rewrite packfile reuse code Jonathan Tan
2019-11-25  6:30   ` Junio C Hamano
2019-11-25  6:36     ` Junio C Hamano
2019-12-06 21:42       ` Junio C Hamano
2019-12-07 10:12         ` Christian Couder
2019-12-07 20:47           ` Johannes Schindelin
2019-12-08  7:53             ` Christian Couder
2019-12-08  8:54               ` Johannes Schindelin
2019-12-08 10:26                 ` Christian Couder
2019-12-08 10:45                   ` Johannes Schindelin
2019-12-09  6:18                 ` Jeff King
2019-12-09  9:28                   ` Johannes Schindelin
2019-12-09 19:00                   ` Junio C Hamano
2019-12-09 19:05                   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191209081129.GA1546451@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=chriscool@tuxfamily.org \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=ramsay@ramsayjones.plus.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).