git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Christian Couder <christian.couder@gmail.com>
To: Jeff King <peff@peff.net>
Cc: git <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>,
	Christian Couder <chriscool@tuxfamily.org>,
	Ramsay Jones <ramsay@ramsayjones.plus.com>,
	Jonathan Tan <jonathantanmy@google.com>
Subject: Re: [PATCH v3 9/9] pack-objects: improve partial packfile reuse
Date: Thu, 23 Jan 2020 23:29:24 +0100	[thread overview]
Message-ID: <CAP8UFD0h1TFVNqH7g823psaQzmEmzoz200CkZuDOV8GqNV7mrQ@mail.gmail.com> (raw)
In-Reply-To: <20191219004232.GA24434@coredump.intra.peff.net>

On Thu, Dec 19, 2019 at 1:42 AM Jeff King <peff@peff.net> wrote:
>
> On Wed, Dec 18, 2019 at 12:26:28PM +0100, Christian Couder wrote:
>
> > >   - one complication is that if we're omitting some objects, we can't
> > >     set a delta against a base that we're not sending. So we have to
> > >     check each object in try_partial_reuse() to make sure we have its
> > >     delta (actually, that third big comment in that function is
> > >     incomplete, I think; it talks about sending the object later, not as
> > >     part of the reused pack, but we might not send it at all!).
> >
> > Are you talking about this comment:
> >
> >         /*
> >          * And finally, if we're not sending the base as part of our
> >          * reuse chunk, then don't send this object either. The base
> >          * would come after us, along with other objects not
> >          * necessarily in the pack, which means we'd need to convert
> >          * to REF_DELTA on the fly. Better to just let the normal
> >          * object_entry code path handle it.
> >          */
> >
> > ?
> >
> > I don't see how it's talking about sending the object later, so I have
> > left it as is. Maybe you can check it again in the v4 series I am
> > going to send.
>
> Yes, that's the comment. It says "the base would come after us". That
> could be true, but it also could be that we are not sending the object
> at all (in fact, that seems more likely in practice). The outcome is the
> same, though: we don't verbatim reuse the delta'd object if its base is
> not also being reused.

In the V4 (https://public-inbox.org/git/20191218112547.4974-11-chriscool@tuxfamily.org/)
there is now the following in the commit message:

One complication is that if we're omitting some objects,
we can't set a delta against a base that we're not
sending. So we have to check each object in
try_partial_reuse() to make sure we have its delta.

Would you also want the comment to be improved to something like:

         /*
          * And finally, if we're not sending the base as part of our
          * reuse chunk, then don't send this object either. Because we
          * should not send the object at all. Or because the base
          * would come after us, along with other objects not
          * necessarily in the pack, which means we'd need to convert
          * to REF_DELTA on the fly. Better to just let the normal
          * object_entry code path handle it.
          */

> > > It might be worth having a perf test here. The case this is helping the
> > > most, I think, is when somebody cloning wants all of the objects from
> > > the beginning of the pack, but then there's a bunch of _other_ stuff.
> > >
> > > That could be unreachable objects kept by "repack -k", or maybe objects
> > > reachable outside of heads and tags. Or objects that are part of other
> > > delta islands. This last is the most plausible, really, because we'll
> > > put all of the root-island objects at the beginning of the pack. So
> > > maybe a good perf test would be to take an existing repository add a
> > > bunch of new commits in refs/foo/,
> >
> > Not sure how I could do so. How would you do that?
> >
> > I think it would be best to add completely new realistic commits that
> > are changing the main code base.
>
> I agree that would be the most realistic for the "forked repository
> network" case that we originally wrote this for. But I think a more
> mundane (and possibly easier to generate) example might be having a
> bunch of refs/notes.
>
> So perhaps something like this:

[...]

Ok, I will add this test to my patch series if I need to resend a V5.
Otherwise I will send it separately.

> From my limited prodding, it doesn't trigger pack-reuse with the current
> code, but would after your series. On linux.git, it produces:
>
> Test                          origin              origin/jk/packfile-reuse-cleanup
> ----------------------------------------------------------------------------------
> 5312.3: clone without notes   14.16(14.26+4.84)   10.80(10.40+0.44) -23.7%
> 5312.4: clone size                       1.4G                1.4G +0.0%

Yeah, nice improvement.

> though I suspect the more interesting savings is in RAM (since we don't
> have to allocate a "struct object_entry" for most of the objects). I
> don't know how hard it would be to collect that data in the perf suite.

I don't know either.

> Running the final pack-objects manually with massif, I get peak heap
> usage dropping from ~1500MB to ~380MB.

Very nice!

> For git.git it seems less impressive (very little CPU saved, and the
> resulting pack is slightly larger, perhaps due to not re-considering
> some deltas whose on-disk versions had to be thrown away):
>
> Test                          origin            origin/jk/packfile-reuse-cleanup
> --------------------------------------------------------------------------------
> 5312.3: clone without notes   1.22(3.60+0.16)   1.20(3.56+0.16) -1.6%
> 5312.4: clone size                      65.3M             67.0M +2.5%

Yeah, I think this is acceptable.

> > > > Is this case possible? try_partial_reuse() is only called when there is
> > > > a 1 at the bit location.
> > >
> > > Yes, it's possible. The result of a bitmap walk is larger than a given
> > > pack, because we have to extend it to match whatever objects the caller
> > > asked for. E.g., imagine we have commit A, repack into into a bitmapped
> > > pack, and then somebody pushes up commit B. Now I want to clone, getting
> > > both A and B.
> > >
> > > Our bitmap result represents the whole answer, and so must include both
> > > objects. But since "B" isn't in the pack, it doesn't have an assigned
> > > bit. We add it to the in-memory bitmap_git->ext_index, which gives it a
> > > bit (valid only for that walk result!). But of course for pack reuse, we
> > > only care about the objects that were actually in the bitmapped pack.
> >
> > Not sure if these explanations should go into the commit message or a
> > comment in the code. So I haven't added them for now.
>
> I think this is really outside the scope of this series, even, and gets
> into how the bitmap traversal works in general. I'm sure the
> documentation for that could be a bit better.

Ok, I will take a look at improving the documentation, either in this
patch series if I resend it, or in a follow up small series.

> > > Thinking on it more, though, I wonder if there's a bug hiding here. We
> > > know that we can send the whole initial chunk verbatim for OFS_DELTA
> > > objects, because the relative offsets will remain unchanged (i.e., there
> > > are no "holes" that trigger our chunk code). But if there were a
> > > REF_DELTA in that initial chunk, we have no certainty that the base is
> > > being sent.
> > >
> > > In practice, this doesn't happen because the objects in question have to
> > > be in a pack with bitmaps, which means it was written ourselves by
> > > git-repack. And we'd never write REF_DELTA objects there.
> > >
> > > But I wonder if it's worth being a bit more careful (and what the
> > > performance impact is; it would mean checking the type of every object
> > > in that initial chunk).
> >
> > I haven't done anything related to that. Maybe something can be done
> > in a follow up patch.
>
> Hmm.  I was thinking the problem was introduced in this series, but the
> old code should suffer from this as well. I wondered if it might simply
> be that the old code did not trigger often enough for anybody to notice,
> but we have been running the code in this series for several years at
> GitHub. So probably my reasoning above is sound (but it might still be
> worth addressing).

Ok, I will take a look.

Thanks,
Christian.

  reply	other threads:[~2020-01-23 22:29 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-15 14:15 [PATCH v3 0/9] Rewrite packfile reuse code Christian Couder
2019-11-15 14:15 ` [PATCH v3 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
2019-12-09  6:24   ` Jeff King
2019-12-11 13:48     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 2/9] packfile: expose get_delta_base() Christian Couder
2019-12-09  6:26   ` Jeff King
2019-11-15 14:15 ` [PATCH v3 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
2019-12-09  6:28   ` Jeff King
2019-12-11 13:50     ` Christian Couder
2019-12-12  5:45       ` Jeff King
2019-11-15 14:15 ` [PATCH v3 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
2019-12-09  6:47   ` Jeff King
2019-12-13 13:26     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
2019-12-09  7:06   ` Jeff King
2019-12-13 13:27     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 6/9] csum-file: introduce hashfile_total() Christian Couder
2019-12-09  7:07   ` Jeff King
2019-11-15 14:15 ` [PATCH v3 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
2019-12-09  7:14   ` Jeff King
2019-12-13 13:27     ` Christian Couder
2019-11-15 14:15 ` [PATCH v3 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
2019-12-09  7:14   ` Jeff King
2019-11-15 14:15 ` [PATCH v3 9/9] pack-objects: improve partial packfile reuse Christian Couder
2019-12-09  8:11   ` Jeff King
2019-12-18 11:26     ` Christian Couder
2019-12-19  0:42       ` Jeff King
2020-01-23 22:29         ` Christian Couder [this message]
2019-11-15 18:03 ` [PATCH v3 0/9] Rewrite packfile reuse code Jonathan Tan
2019-11-25  6:30   ` Junio C Hamano
2019-11-25  6:36     ` Junio C Hamano
2019-12-06 21:42       ` Junio C Hamano
2019-12-07 10:12         ` Christian Couder
2019-12-07 20:47           ` Johannes Schindelin
2019-12-08  7:53             ` Christian Couder
2019-12-08  8:54               ` Johannes Schindelin
2019-12-08 10:26                 ` Christian Couder
2019-12-08 10:45                   ` Johannes Schindelin
2019-12-09  6:18                 ` Jeff King
2019-12-09  9:28                   ` Johannes Schindelin
2019-12-09 19:00                   ` Junio C Hamano
2019-12-09 19:05                   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAP8UFD0h1TFVNqH7g823psaQzmEmzoz200CkZuDOV8GqNV7mrQ@mail.gmail.com \
    --to=christian.couder@gmail.com \
    --cc=chriscool@tuxfamily.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=peff@peff.net \
    --cc=ramsay@ramsayjones.plus.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).