git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Christian Couder <christian.couder@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>,
	Christian Couder <chriscool@tuxfamily.org>
Subject: Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
Date: Sat, 26 Oct 2019 11:29:17 +0200	[thread overview]
Message-ID: <CAP8UFD0PfCb+NfPDZ36c9MLcOa2xN9C9viOEEaLrbeA8Y9Rd7A@mail.gmail.com> (raw)
In-Reply-To: <20191022194805.139215-1-jonathantanmy@google.com>

On Tue, Oct 22, 2019 at 9:48 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> First, the commit message could probably be reordered to start with the
> main point (reusing of packfiles at object granularity instead of
> allowing reuse of only one contiguous region). To specific points:

Done in my current version.

> > The dynamic array of `struct reused_chunk` is useful
> > because we need to know not just the number of zero bits,
> > but the accumulated offset due to missing objects.
>
> The number of zero bits is irrelevant, I think. (I know I introduced
> this idea, but at that time I was somehow confused that the offset was a
> number of objects and not a number of bytes.)

Ok, I removed "not just the number of zero bits,".

> > If a client both asks for a tag by sha1 and specifies
> > "include-tag", we may end up including the tag in the reuse
> > bitmap (due to the first thing), and then later adding it to
> > the packlist (due to the second). This results in duplicate
> > objects in the pack, which git chokes on. We should notice
> > that we are already including it when doing the include-tag
> > portion, and avoid adding it to the packlist.
>
> I still don't understand this part. Going off what's written in the
> commit message here, it seems to me that the issue is that (1) an object
> can be added to the reuse bitmap without going through to_pack, and (2)
> this is done when the client asks for a tag by SHA-1. But all objects
> when specified by SHA-1 go through to_pack, don't they?

That's how Peff explained it in response to one of your previous
messages. Peff, would you mind answering Jonathan's questions?

At the end of the commit message I added the following which might help though:

>> No tests, because git does not actually exhibit this "ask
>> for it and also include-tag" behavior. We do one or the
>> other on clone, depending on whether --single-branch is set.
>> However, libgit2 does both.

So my wild guess sould be that maybe the reason is that some of this
code was shared (or intended to be shared) with libgit2?

> On to the review of the code. The ordering of - and + lines are
> confusing, so I have just removed all - lines.
>
> > +/*
> > + * Record the offsets needed in our reused packfile chunks due to
> > + * "gaps" where we omitted some objects.
> > + */
> > +static struct reused_chunk {
> > +     off_t start;
> > +     off_t offset;
> > +} *reused_chunks;
> > +static int reused_chunks_nr;
> > +static int reused_chunks_alloc;
>
> A chunk is a set of objects from the original packfile that we are
> reusing; all objects in a chunk have the same relative position in the
> original packfile and the generated packfile. (This does not mean that
> the bytes are exactly the same or, even, that the last object in the
> chunk has the same size in the original packfile and the generated
> packfile.)
>
> "start" is the offset of the first object of the chunk in the original
> packfile.
>
> "offset" is "start" minus the offset of the first object of the chunk in
> the generated packfile. (I think it is easier to understand if this was
> "new minus old" instead of "old minus new", that is, the negation of its
> current value.)
>
> So I would prefer:
>
>   /*
>    * A reused set of objects. All objects in a chunk have the same
>    * relative position in the original packfile and the generated
>    * packfile.
>    */
>   struct reused_chunk {
>     /* The offset of the first object of this chunk in the original
>      * packfile. */
>     off_t original;
>     /* The offset of the first object of this chunk in the generated
>      * packfile minus "original". */
>     off_t difference;
>   }

Ok, I have used that though it introduces some discrepancies, as
record_reused_object() for example still has an "offset" argument.

> > +static size_t write_reused_pack_verbatim(struct hashfile *out,
> > +                                      struct pack_window **w_curs)
> > +{
> > +     size_t pos = 0;
> > +
> > +     while (pos < reuse_packfile_bitmap->word_alloc &&
> > +                     reuse_packfile_bitmap->words[pos] == (eword_t)~0)
> > +             pos++;
> > +
> > +     if (pos) {
> > +             off_t to_write;
> > +
> > +             written = (pos * BITS_IN_EWORD);
> > +             to_write = reuse_packfile->revindex[written].offset
> > +                     - sizeof(struct pack_header);
> > +
> > +             record_reused_object(sizeof(struct pack_header), 0);
>
> Probably worth a comment, since we're recording one chunk, not one
> object.

Ok, added a comment.

> The code is correct, though - one big chunk with original offset
> as written, and no difference between original and generated offsets.


> > @@ -2661,6 +2785,7 @@ static void prepare_pack(int window, int depth)
> >
> >       if (nr_deltas && n > 1) {
> >               unsigned nr_done = 0;
> > +
> >               if (progress)
> >                       progress_state = start_progress(_("Compressing objects"),
> >                                                       nr_deltas);
>
> Unnecessary added line - please remove.

We often separate variable declaration from the rest of function code
with a blank line, so I don't think the added line is completely
unnecessary. It helps make the code more standard and therefore more
readable, so I left it.

You might say that it should be in a preparatory patch, but a
preparatory patch for just such a small change is a bit wasteful. I
think it's ok to add this blank line "while at it" in this patch.

> On to pack-bitmap.c.
>
> > +static void try_partial_reuse(struct bitmap_index *bitmap_git,
> > +                           size_t pos,
> > +                           struct bitmap *reuse,
> > +                           struct pack_window **w_curs)
> >  {
> > +     struct revindex_entry *revidx;
> > +     off_t offset;
> > +     enum object_type type;
> > +     unsigned long size;
> > +
> > +     if (pos >= bitmap_git->pack->num_objects)
> > +             return; /* not actually in the pack */
>
> Is this case possible? try_partial_reuse() is only called when there is
> a 1 at the bit location.

I'd like Peff to answer here too, as I don't think I fully understand
what's going on here.

> > +     /* Don't mark objects not in the packfile */
> > +     if (i > bitmap_git->pack->num_objects / BITS_IN_EWORD)
> > +             i = bitmap_git->pack->num_objects / BITS_IN_EWORD;
>
> Why is this necessary? Let's say we have 42 fully-1 words, and therefore
> i is 42. Shouldn't we allocate 42 words in our reuse bitmap?

I'd also prefer Peff to answer here.

[...]

> And thanks for this patch - other than the difficult-to-understand parts
> I described above, I found the overall flow easy to discern.
>
> I presume tests would also need to be written. One way is to introduce
> yet another test variable, but maybe that is overkill.

This part of the commit message might again help understand why there
is no test:

>> No tests, because git does not actually exhibit this "ask
>> for it and also include-tag" behavior. We do one or the
>> other on clone, depending on whether --single-branch is set.
>> However, libgit2 does both.

But I'd be again happy if Peff could confirm.

Thanks for another review,
Christian.

      reply	other threads:[~2019-10-26  9:29 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-19 10:35 [PATCH v2 0/9] Rewrite packfile reuse code Christian Couder
2019-10-19 10:35 ` [PATCH v2 1/9] builtin/pack-objects: report reused packfile objects Christian Couder
2019-10-19 10:35 ` [PATCH v2 2/9] packfile: expose get_delta_base() Christian Couder
2019-10-19 10:35 ` [PATCH v2 3/9] ewah/bitmap: introduce bitmap_word_alloc() Christian Couder
2019-10-22 17:46   ` Jonathan Tan
2019-10-26  9:29     ` Christian Couder
2019-10-19 10:35 ` [PATCH v2 4/9] pack-bitmap: don't rely on bitmap_git->reuse_objects Christian Couder
2019-10-19 10:35 ` [PATCH v2 5/9] pack-bitmap: introduce bitmap_walk_contains() Christian Couder
2019-10-19 15:25   ` Philip Oakley
2019-10-19 18:55     ` Christian Couder
2019-10-19 20:15       ` Philip Oakley
2019-10-19 23:18     ` Jeff King
2019-10-19 10:35 ` [PATCH v2 6/9] csum-file: introduce hashfile_total() Christian Couder
2019-10-19 10:35 ` [PATCH v2 7/9] pack-objects: introduce pack.allowPackReuse Christian Couder
2019-10-19 10:35 ` [PATCH v2 8/9] builtin/pack-objects: introduce obj_is_packed() Christian Couder
2019-10-19 10:35 ` [PATCH v2 9/9] pack-objects: improve partial packfile reuse Christian Couder
2019-10-19 15:30   ` Philip Oakley
2019-10-19 19:20     ` Christian Couder
2019-10-19 23:23       ` Jeff King
2019-10-20 11:26         ` Philip Oakley
2019-10-22 19:48   ` Jonathan Tan
2019-10-26  9:29     ` Christian Couder [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAP8UFD0PfCb+NfPDZ36c9MLcOa2xN9C9viOEEaLrbeA8Y9Rd7A@mail.gmail.com \
    --to=christian.couder@gmail.com \
    --cc=chriscool@tuxfamily.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).