Re: [RFC PATCH] fetch-pack: space out sent "haves" in negotiation

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Stefan Beller <sbeller@google.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git <git@vger.kernel.org>
Subject: Re: [RFC PATCH] fetch-pack: space out sent "haves" in negotiation
Date: Mon, 21 May 2018 15:57:18 -0700	[thread overview]
Message-ID: <CAGZ79kb96Fxf1OBbnqmAtAm_EA5y9+0NKcNqhKjXhavWe6WzWA@mail.gmail.com> (raw)
In-Reply-To: <20180521204340.260572-1-jonathantanmy@google.com>

Hi Jonathan,

On Mon, May 21, 2018 at 1:43 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> I was thinking about fetch negotiation in some non-ideal situations
> (specifically, when the client repo contains two or more independent
> branches that meet only somewhere far in the past) and thought about
> skipping over intermediate commits, using exponentially larger skips as
> we proceed, when generating "have" lines. This is in the hope of
> reducing the bandwidth and roundtrips needed when fetching, and does not
> require a modification to the server.

In an ideal world, the server and client would both estimate the potential
reduction of the packfile to send, and base the decision if to continue
negotiating on the trade off if the packfile reduction savings are greater
than the cost of negotiation (in terms of bandwidth or time).
(e.g. the server could keep track of the "potential largest packfile to
sent" as well as the "potential smallest packfile to sent" given the
state of negotiation. And as soon as the difference between those
two packs is smaller than the size of one round of negotiation,
it is better to stop and just sent the large file).

You state that you do not want to change the server side, and stick to
the current protocol, which makes this ideal world scenario moot, but
shifts the problem to "picking haves more intelligently".

> I'm not sure if this is the best way,

I think it is the best for a short term gain, as the picking algorithm is
not part of the protocol, so it can be easily extended/reverted/improved
as we go. So I would continue this way.

> (1) The implementation that I have
>
> This patch contains some drop-in code that passes all existing tests,
> but the new negotiation algorithm is not tested.
>
> To mitigate the effect of skipping, I included functionality wherein
> the client will retry the commits in a skip if the server ACKs the
> destination of the skip, but this is currently imperfect - in
> particular, the server might end the negotiation early, and the commits
> retried in my current implementation are a superset due to the fact that
> I didn't store the commits in the skip.

So we start with exponential hops, fall back to linear probing and then
"make off by one errors" in the linear probes?

> (2) Possible future work for my implementation
>
> Since each sent commit maintains pointers to sent descendants and sent
> ancestors (strictly speaking, only the "close" ones - to find all of
> them, you need the transitive closure), this can be used for some sort
> of error correction when, during a stateless RPC negotiation, the server
> (which may be a group of eventually consistent servers behind a load
> balancer) reports that it no longer has a commit that it said it had.
> For example, we could in this case mark that commit as "they_have=NO"
> and for all its closest ancestors, set it to "they_have=YES" unless they
> in turn have a descendant with "they_have=YES" or
> "they_have=HAVE_DESCENDANT".
>
> (3) Other ways of improving negotiation
>
> If we're prepared to commit-walk a significant part of the entire local
> repo (as we are, in the situation I described in the first paragraph),

> and if we have access to corresponding remote-tracking information,

This is a dangerous assumption, as not everyone is having a 1:1 relationship
with their remote server (for e.g. code review), but there are these triangle
workflows in the kernel community for example, where you push in one
remote direction and (re-)obtain the history merged into the bigger picture
from another remote. And these two remotes are not special to each other
on the client side.

>  fetch-negotiator.c | 309 +++++++++++++++++++++++++++++++++++++++++++++
>  fetch-negotiator.h |  40 ++++++

This patch is moving the algorithm driving the selection of new
commits to pick to
a new file, but there is no new algorithm, yet?
As hinted at from (1), this is smarter than what we did before by
picking commits
non-linearly but with some sort of exponential back off, how does it end the
exponential phase?

The way forward out of RFC state, might be to separate the introduction of a new
improved algorithm and the refactoring. So first move code literally into the
fetch-negotiator file, and then add improvements in there, or is it
just not worth
the refactoring and directly put in the new algorithm?

Another use case we discussed was "open-ended bisection", where you know
you are in a bad state and "once upon a time it worked", and now you are tasked
to find the offending commit. To find such a commit, you probably
would also start
with such an exponential back off until you run into a "good frontier"
of commits
and then use conventional bisect to narrow down the exact commit.

> +enum they_have {
> +       /*
> +        * We do not know if the server has this commit, or we know that the
> +        * server does not have this commit.
> +        */
> +       NO,
> +
> +       /*
> +        * The server has this commit, and we do not know (or did not keep
> +        * track of) whether it has any of its descendants.
> +        */
> +       YES,
> +
> +       /*
> +        * The server has at least one of this commit's descendants, and that
> +        * descendant is marked with YES. When resending "have" lines, we do
> +        * not need to resend this commit, because doing so is redundant.
> +        */
> +       HAVE_DESCENDANT
> +};

Thanks for the docs!

> diff --git a/fetch-negotiator.h b/fetch-negotiator.h
> new file mode 100644
> index 0000000000..c51d52a0d2
> --- /dev/null
> +++ b/fetch-negotiator.h
> @@ -0,0 +1,40 @@
> +#ifndef FETCH_NEGOTIATOR_H
> +#define FETCH_NEGOTIATOR_H
> +
> +#include "prio-queue.h"
> +
> +struct sent_commit;
> +
> +struct fetch_negotiator {
> +       struct sent_commit **sent_commits;
> +       size_t sent_commit_nr, sent_commit_alloc;
> +       struct prio_queue candidates;
> +};

Maybe we can just declare the struct fetch_negotiator here and not
define it, such that nobody outside the actual implementation tries
to access its internals?

> +
> +void fetch_negotiator_init(struct fetch_negotiator *n, struct commit **tips,
> +                          int tip_nr);
> +
> +struct commit *fetch_negotiator_next(struct fetch_negotiator *n);
> +
> +/*
> + * Indicate that the server has this commit. The commits passed to this function
> + * should be in order of their return from fetch_negotiator_next().
> + *
> + * Invocations of this function on the same commit after the first time have no
> + * effect.
> + */
> +void fetch_negotiator_ack(struct fetch_negotiator *n,
> +                         const struct commit *commit);
> +
> +/*
> + * Iterate through the commits invoked with fetch_negotiator_ack. The negotiator
> + * makes an effort to remove redundant commits from the list.
> + *
> + * This is useful for stateless connections, in which information about what the
> + * client knows needs to be replayed in every request.

So even if we do not use the skip commit logic, this would be a benefit for any
http(-v0) and v2 users of the protocol?

Thanks,
Stefan

next prev parent reply	other threads:[~2018-05-21 22:57 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-21 20:43 [RFC PATCH] fetch-pack: space out sent "haves" in negotiation Jonathan Tan
2018-05-21 22:57 ` Stefan Beller [this message]
2018-05-22 18:44   ` Jonathan Tan
2018-05-22 19:01     ` Stefan Beller
2018-05-23  1:08 ` Junio C Hamano
2018-05-23  3:42 ` Junio C Hamano
2018-05-29 16:58   ` Jonathan Tan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAGZ79kb96Fxf1OBbnqmAtAm_EA5y9+0NKcNqhKjXhavWe6WzWA@mail.gmail.com \
    --to=sbeller@google.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).