git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Martin Fick <mfick@codeaurora.org>,
	Git Mailing List <git@vger.kernel.org>, "Jansen\,
	Geert" <gerardu@amazon.com>,
	Jonathan Tan <jonathantanmy@google.com>
Subject: Re: Resolving deltas dominates clone time
Date: Tue, 23 Apr 2019 09:07:06 +0200	[thread overview]
Message-ID: <8736m9td2d.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20190422184329.GA20304@sigill.intra.peff.net>


On Mon, Apr 22 2019, Jeff King wrote:

> On Mon, Apr 22, 2019 at 08:01:15PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Your patch is optionally removing the "woah, we got an object with a
>> > duplicate sha1, let's check that the bytes are the same in both copies"
>> > check. But Martin's problem is a clone, so we wouldn't have any existing
>> > objects to duplicate in the first place.
>>
>> Right, but we do anyway, as reported by Geert at @amazon.com preceding
>> that patch of mine. But it is 99.99% irrelevant to *performance* in this
>> case after the loose object cache you added (but before that could make
>> all the difference depending on the FS).
>
> I scratched my head at this a bit. If we don't have any other objects,
> then what are we comparing against? But I think you mean that we have
> the overhead of doing the object lookups to find that out. Yes, that can
> add up if your filesystem has high latency, but I think in this case it
> is a drop in the bucket compared to dealing with the actual object data.

There was no "we have no objects" clause, so this bit is what dominated
clone time before the loose object cache...

>> I just mentioned it to plant a flag on another bit of the code where
>> index-pack in general has certain paranoias/validation the user might be
>> willing to optionally drop just at "clone" time.
>
> Yeah, I agree it may be worth pursuing independently. I just don't think
> it will help Martin's case in any noticeable way.

...indeed, as noted just mentioning this in the context of things in
index-pack that *in general* might benefit from some "we had no objects
before" special-case.

>> > Right, that would work. I will note one thing, though: the total time to
>> > do a 1-depth clone followed by an unshallow is probably much higher than
>> > doing the whole clone as one unit, for two reasons:
>>
>> Indeed. The hypothesis is that the user doesn't really care about the
>> clone-time, but the clone-to-repo-mostly-usable time.
>
> There was a little bit of self-interest in there for me, too, as a
> server operator. While it does add to the end-to-end time, most of the
> resource use for the shallow fetch gets put on the server. IOW, I don't
> think we'd be happy to see clients doing this depth-1-and-then-unshallow
> strategy for every clone.

Just change from per-seat pricing to charging a premium for CPU &
I/O. Now your problem is a solution :)

More seriously, yeah I think we definitely need to be careful about
changes to git that'll eat someone "free" server time to save the client
time/work.

At the same time we have dedicated internal operators who wouldn't mind
spending that CPU. So hopefully we can in general find some reasonable
middle-ground.

>> > So in general, I think you'd need some cooperation from the server side
>> > to ask it to generate and send the .idx that matches the .pack it is
>> > sending you. Or even if not the .idx format itself, some stable list of
>> > sha1s that you could use to reproduce it without hashing each
>> > uncompressed byte yourself.
>>
>> Yeah, depending on how jt/fetch-cdn-offload is designed (see my
>> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it
>> could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
>> around for such a purpose.
>>
>> So in such a case you'd serve up that recent-enough *.{pack,idx} for the
>> client to "wget", and the client would then trust it (or not) and do the
>> equivalent of a "fetch" from that point to be 100% up-to-date.
>
> I think it's sort of orthogonal. Either way you have to teach the client
> how to get a .pack/.idx combo. Whether it learns to receive it inline
> from the first fetch, or whether it is taught to expect it from the
> out-of-band fetch, most of the challenge is the same.

I think it is, but maybe we're talking about different things.

I suspect a few of us have experimented with similar rsync-and-pull
hacks as a replacement for "clone". It's much faster (often 50-90%
faster).

I.e. just an rsync of a recent-enough .git dir (or .git/objects),
followed by a 'reset --hard' to get the worktree and then a 'git pull'.

>> > This could even be stuffed into the pack format and stripped out by
>> > the receiving index-pack (i.e., each entry is prefixed with "and by
>> > the way, here is my sha1...").
>>
>> That would be really interesting. I.e. just having room for that (or
>> anything else) in the pack format.
>>
>> I wonder if it could be added to the delta-chain in the current format
>> as a nasty hack :)
>
> There's definitely not "room" in any sense of the word in the pack
> format. :) However, as long as all parties agreed, we can stick whatever
> we want into the on-the-wire format. So I was imagining something more
> like:
>
>   1. pack-objects learns a --report-object-id option that sticks some
>      additional bytes before each object (in its simplest form,
>      $obj_hash bytes of id)
>
>   2. likewise, index-pack learns a --parse-object-id option to receive
>      it and skip hashing the object bytes
>
>   3. we get a new protocol capability, "send-object-ids". If the server
>      advertises and the client requests it, then both sides turn on the
>      appropriate option
>
> You could even imagine generalizing it to "--report-object-metadata",
> and including 0 or more metadata packets before each object. With object
> id being one, but possibly other computable bits like "generation
> number" after that. I'm not convinced other metadata is worth the
> space/time tradeoff, though. After all, this is stuff that the client
> _could_ generate and cache themselves, so you're trading off bandwidth
> to save the client from doing the computation.
>
> Anyway, food for thought. :)

  reply	other threads:[~2019-04-23  7:07 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-19 21:47 Resolving deltas dominates clone time Martin Fick
2019-04-20  3:58 ` Jeff King
2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
2019-04-22 15:57     ` Jeff King
2019-04-22 18:01       ` Ævar Arnfjörð Bjarmason
2019-04-22 18:43         ` Jeff King
2019-04-23  7:07           ` Ævar Arnfjörð Bjarmason [this message]
2019-04-22 20:21   ` Martin Fick
2019-04-22 20:56     ` Jeff King
2019-04-22 21:02       ` Jeff King
2019-04-22 21:19       ` [PATCH] p5302: create the repo in each index-pack test Jeff King
2019-04-23  1:09         ` Junio C Hamano
2019-04-23  2:07           ` Jeff King
2019-04-23  2:27             ` Junio C Hamano
2019-04-23  2:36               ` Jeff King
2019-04-23  2:40                 ` Junio C Hamano
2019-04-22 22:32       ` Resolving deltas dominates clone time Martin Fick
2019-04-23  1:55         ` Jeff King
2019-04-23  4:21           ` Jeff King
2019-04-23 10:08             ` Duy Nguyen
2019-04-23 20:09               ` Martin Fick
2019-04-30 18:02                 ` Jeff King
2019-04-30 22:08                   ` Martin Fick
2019-04-30 17:50               ` Jeff King
2019-04-30 18:48                 ` Ævar Arnfjörð Bjarmason
2019-04-30 20:33                   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8736m9td2d.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=gerardu@amazon.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=mfick@codeaurora.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).