[RFH] limiting ref advertisements

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFH] limiting ref advertisements
@ 2016-10-24 13:29 Jeff King
  2016-10-25 11:46 ` Duy Nguyen
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2016-10-24 13:29 UTC (permalink / raw)
  To: git

I'm looking into the oft-discussed idea of reducing the size of ref
advertisements by having the client say "these are the refs I'm
interested in". Let's set aside the protocol complexities for a
moment and imagine we magically have some way to communicate a set of
patterns to the server.

What should those patterns look like?

I had hoped that we could keep most of the pattern logic on the
client-side. Otherwise we risk incompatibilities between how the client
and server interpret a pattern. I had also hoped we could do some kind
of prefix-matching, which would let the server look only at the
interesting bits of the ref tree (so if you don't care about
refs/changes, and the server has some ref storage that is hierarchical,
they can literally get away without opening that sub-tree).

The patch at the end of this email is what I came up with in that
direction. It obviously won't compile without the twenty other patches
implementing transport->advertise_prefixes, but it gives you a sense of
what I'm talking about.

Unfortunately it doesn't work in all cases, because refspec sources may
be unqualified. If I ask for:

  git fetch $remote master:foo

then we have to actually dwim-resolve "master" from the complete list of
refs we get from the remote.  It could be "refs/heads/master",
"refs/tags/master", etc. Worse, it could be "refs/master". In that case,
at least, I think we are OK because we avoid advertising refs directly
below "refs/" in the first place. But if you have a slash, like:

  git fetch $remote jk/foo

then that _could_ be "refs/jk/foo". Likewise, we cannot even optimize
the common case of a fully-qualified ref, like "refs/heads/foo". If it
exists, we obviously want to use that. But if it doesn't, then it
could be refs/something-else/refs/heads/foo. That's unlikely, but it
_does_ work now, and optimizing the advertisement would break it.

So it seems like left-anchoring the refspecs can never be fully correct.
We can communicate "master" to the server, who can then look at every
ref it would advertise and ask "could this be called master"? But it
will be setting in stone the set of "could this be" patterns. Granted,
those haven't changed much over the history of git, but it seems awfully
fragile.

In an ideal world the client and server would negotiate to come to some
agreement on the patterns being used. But as we are bolting this onto
the existing protocol, I was really trying to do it without introducing
an extra capabilities phase or extra round-trips. I.e., something like
David Turner's "stick the refspec in the HTTP query parameters" trick,
but working everywhere[1].

Clever ideas?

-Peff

[1] I do have working patches to pass these "early capabilities"
    everywhere, but they're still somewhat rough. I got it to the point
    where I could flip the default to "on" to see what breaks. That's
    not something we'd want to do for real, but is good for running the
    test suite to uncover issues like this one.

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 7c10d70092..3a2585ffd7 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -302,6 +302,33 @@ static void find_non_local_tags(struct transport *transport,
 	string_list_clear(&remote_refs, 0);
 }

+static void add_advertise_prefixes(struct transport *transport,
+				   const struct refspec *refs, int nr)
+{
+	struct argv_array *list = &transport->advertise_prefixes;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		const struct refspec *rs = &refs[i];
+		size_t len;
+
+		if (!rs->pattern)
+			argv_array_push(list, rs->src);
+		else if (strip_suffix(rs->src, "/*", &len))
+			argv_array_pushf(list, "%.*s", (int)len, rs->src);
+		else {
+			/*
+			 * This refspec is too complex for us to communicate;
+			 * not only do we skip it, but we must avoid
+			 * communicating any prefixes, since we need to see
+			 * all refs.
+			 */
+			transport->ignore_advertise_prefixes = 1;
+			return;
+		}
+	}
+}
+
 static struct ref *get_ref_map(struct transport *transport,
 			       struct refspec *refspecs, int refspec_count,
 			       int tags, int *autotags)
@@ -314,12 +341,18 @@ static struct ref *get_ref_map(struct transport *transport,
 	/* opportunistically-updated references: */
 	struct ref *orefs = NULL, **oref_tail = &orefs;

-	const struct ref *remote_refs = transport_get_remote_refs(transport);
+	const struct ref *remote_refs;
+
+	if (tags == TAGS_SET || (tags == TAGS_DEFAULT && *autotags))
+		add_advertise_prefixes(transport, tag_refspec, 1);

 	if (refspec_count) {
 		struct refspec *fetch_refspec;
 		int fetch_refspec_nr;

+		add_advertise_prefixes(transport, refspecs, refspec_count);
+		remote_refs = transport_get_remote_refs(transport);
+
 		for (i = 0; i < refspec_count; i++) {
 			get_fetch_map(remote_refs, &refspecs[i], &tail, 0);
 			if (refspecs[i].dst && refspecs[i].dst[0])
@@ -373,6 +406,17 @@ static struct ref *get_ref_map(struct transport *transport,
 		    (remote->fetch_refspec_nr ||
 		     /* Note: has_merge implies non-NULL branch->remote_name */
 		     (has_merge && !strcmp(branch->remote_name, remote->name)))) {
+
+			add_advertise_prefixes(transport, remote->fetch,
+					       remote->fetch_refspec_nr);
+			if (has_merge && !strcmp(branch->remote_name, remote->name)) {
+				int i;
+				for (i = 0; i < branch->merge_nr; i++)
+					add_advertise_prefixes(transport, branch->merge[i], 1);
+			}
+
+			remote_refs = transport_get_remote_refs(transport);
+
 			for (i = 0; i < remote->fetch_refspec_nr; i++) {
 				get_fetch_map(remote_refs, &remote->fetch[i], &tail, 0);
 				if (remote->fetch[i].dst &&
@@ -393,6 +437,8 @@ static struct ref *get_ref_map(struct transport *transport,
 			    !strcmp(branch->remote_name, remote->name))
 				add_merge_config(&ref_map, remote_refs, branch, &tail);
 		} else {
+			argv_array_push(&transport->advertise_prefixes, "HEAD");
+			remote_refs = transport_get_remote_refs(transport);
 			ref_map = get_remote_ref(remote_refs, "HEAD");
 			if (!ref_map)
 				die(_("Couldn't find remote ref HEAD"));

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [RFH] limiting ref advertisements
  2016-10-24 13:29 [RFH] limiting ref advertisements Jeff King
@ 2016-10-25 11:46 ` Duy Nguyen
  2016-11-14 21:21   ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Duy Nguyen @ 2016-10-25 11:46 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

On Mon, Oct 24, 2016 at 8:29 PM, Jeff King <peff@peff.net> wrote:
> I'm looking into the oft-discussed idea of reducing the size of ref
> advertisements by having the client say "these are the refs I'm
> interested in". Let's set aside the protocol complexities for a
> moment and imagine we magically have some way to communicate a set of
> patterns to the server.
>
> What should those patterns look like?
>
> I had hoped that we could keep most of the pattern logic on the
> client-side. Otherwise we risk incompatibilities between how the client
> and server interpret a pattern. I had also hoped we could do some kind
> of prefix-matching, which would let the server look only at the
> interesting bits of the ref tree (so if you don't care about
> refs/changes, and the server has some ref storage that is hierarchical,
> they can literally get away without opening that sub-tree).
>
> The patch at the end of this email is what I came up with in that
> direction. It obviously won't compile without the twenty other patches
> implementing transport->advertise_prefixes

Yes! git-upload-pack-2 is making a come back, one form or another.

> but it gives you a sense of what I'm talking about.
>
> Unfortunately it doesn't work in all cases, because refspec sources may
> be unqualified. If I ask for:
>
>   git fetch $remote master:foo
>
> then we have to actually dwim-resolve "master" from the complete list of
> refs we get from the remote.  It could be "refs/heads/master",
> "refs/tags/master", etc. Worse, it could be "refs/master". In that case,
> at least, I think we are OK because we avoid advertising refs directly
> below "refs/" in the first place. But if you have a slash, like:
>
>   git fetch $remote jk/foo
>
> then that _could_ be "refs/jk/foo". Likewise, we cannot even optimize
> the common case of a fully-qualified ref, like "refs/heads/foo". If it
> exists, we obviously want to use that. But if it doesn't, then it
> could be refs/something-else/refs/heads/foo. That's unlikely, but it
> _does_ work now, and optimizing the advertisement would break it.
>
> So it seems like left-anchoring the refspecs can never be fully correct.
> We can communicate "master" to the server, who can then look at every
> ref it would advertise and ask "could this be called master"? But it
> will be setting in stone the set of "could this be" patterns. Granted,
> those haven't changed much over the history of git, but it seems awfully
> fragile.

The first thought that comes to mind is, if left anchoring does not
work, let's support both left and right anchoring. I guess you
considered and discarded this.

If prefix matching does not work, and assuming "some-prefix" sent by
client to be in fact "**/some-prefix" pattern at server side will set
the "could this be" in stone, how about use wildmatch? It's flexible
enough and we have full control over the pattern matching engine so C
Git <-> C Git should be good regardless of platforms. I understand
that wildmatch is still complicated enough that a re-implementation
can easily divert in behavior. But a pattern with only '*', '/**',
'/**/' and '**/' wildcards (in other words, no [] or ?) could make the
engine a lot simpler and still fit our needs (and give some room for
client-optimization).

> In an ideal world the client and server would negotiate to come to some
> agreement on the patterns being used. But as we are bolting this onto
> the existing protocol, I was really trying to do it without introducing
> an extra capabilities phase or extra round-trips. I.e., something like
> David Turner's "stick the refspec in the HTTP query parameters" trick,
> but working everywhere[1].
-- 
Duy

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFH] limiting ref advertisements
  2016-10-25 11:46 ` Duy Nguyen
@ 2016-11-14 21:21   ` Jeff King
  2016-11-16 13:42     ` Duy Nguyen
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2016-11-14 21:21 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

On Tue, Oct 25, 2016 at 06:46:21PM +0700, Duy Nguyen wrote:

> > So it seems like left-anchoring the refspecs can never be fully correct.
> > We can communicate "master" to the server, who can then look at every
> > ref it would advertise and ask "could this be called master"? But it
> > will be setting in stone the set of "could this be" patterns. Granted,
> > those haven't changed much over the history of git, but it seems awfully
> > fragile.
> 
> The first thought that comes to mind is, if left anchoring does not
> work, let's support both left and right anchoring. I guess you
> considered and discarded this.
> 
> If prefix matching does not work, and assuming "some-prefix" sent by
> client to be in fact "**/some-prefix" pattern at server side will set
> the "could this be" in stone, how about use wildmatch? It's flexible
> enough and we have full control over the pattern matching engine so C
> Git <-> C Git should be good regardless of platforms. I understand
> that wildmatch is still complicated enough that a re-implementation
> can easily divert in behavior. But a pattern with only '*', '/**',
> '/**/' and '**/' wildcards (in other words, no [] or ?) could make the
> engine a lot simpler and still fit our needs (and give some room for
> client-optimization).

Thanks for responding to this. I've been meaning to get back to it with
some code experiments, but they keep getting bumped down in priority. So
let me at least outline some of my thoughts, without code. :)

I was hoping to avoid right-anchoring because it's expensive to find all
of the right-anchored cases (assuming that ref storage is generally
hierarchical, which it is now and probably will be for future backends).

I also don't think it covers all cases. As bizarre as it is, I believe
you can currently do:

  git fetch $remote origin

and find refs/remotes/origin/HEAD.

So I think the best we can ever do is have the server look at a specific
set of patterns. Those patterns could be expressed by wildmatch. I was
just a little nervous to turn to wildmatch, because it's complicated and
we may want to update it in the future in a slightly-incompatible way.

We also want to give some preference-order to the patterns. If I give
you "refs/heads/master", and that ref exists, you do not need to tell me
whether you also have "refs/heads/refs/heads/master". So you have to
provide multiple patterns for each possible ref. And you need to group
them as "show the first one that matches from this group".

The pattern the client is using really is the ref_rev_parse_rules. So I
think the solution is more like one of:

  1. Specify the pattern set ahead of time, and then the server applies
     it to each refname. We need some pattern language that can express
     "fill in the thing in the middle". IOW, something like:

       advertise-pattern=%s
       advertise-pattern=refs/tags/%s
       advertise-pattern=refs/heads/%s
       advertise-lookup=master
       advertise-lookup=v1.0

     except that the thought of using snprintf() to handle formats
     provided by the user is vaguely terrifying. We could make sure they
     contain only a single "%s", but given the history there, it still
     makes me nervous. I guess we could write our own pseudo-%s parser
     that is much more careful and complains on bugs instead of
     executing arbitrary code. ;)

     I don't think wildmatch quite works for that, because it wants to
     have the full pattern.

  2. Declare the current set of ref_rev_parse_rules as "version 1", and
     send:

       advertise-lookup-v1=master
       advertise-lookup-v1=v1.0

     and the server would do the right thing. We could do a v2, but it
     gets hairy. Let's imagine we add "refs/notes/%s" to the lookup
     rules, and we'll call that v2.

     But remember that these are "early capabilities", before the server
     has spoken at all. So the client doesn't know if we can handle v2.
     So we have to send _both_ (and v2-aware servers can ignore the v1).

       advertise-lookup-v1=master
       advertise-lookup-v2=master

     But that's not quite enough. A v1 server won't look in refs/notes
     at all. So we have to say that, too:

       advertise-lookup-v1=refs/notes/master

     And of course the v1 server has no idea that this isn't necessary
     if we already found refs/heads/master.

     So I think you really do need the client to be able to say "also
     look at this pattern".

Of course we do still want left-anchoring, too. Wildcards like
"refs/heads/*" are always left-anchored. So I think we'd have two types,
and a full request for

  git fetch origin +refs/heads/*:refs/remotes/origin/* master:foo

would look like:

  (1) advertise-pattern-v1
  (2) advertise-pattern=refs/notes/%s
  (3) advertise-prefix=refs/heads
  (4) advertise-lookup=master

where the lines mean:

  1. Use the standard v1 patterns (we could spell them out, but this
     just saves bandwidth. In fact, it could just be implicit that v1
     patterns are included, and we could skip this line).

  2. This is for our fictional future version where the client knows
     added refs/notes/* to its DWIM but the server hasn't yet.

  3. Give me all of refs/heads/*

  4. Look up "master" using the advertise patterns and give me the first
     one you find.

So given that we can omit (1), and that (2) is just an example for the
future, it could look like:

  advertise-prefix=refs/heads
  advertise-lookup=master

which is pretty reasonable. It's not _completely_ bulletproof in terms
of backwards compatibility. The "v1" thing means the client can't insert
a new pattern in the middle (remember they're ordered by priority). So
maybe it is better to spell them all out (one thing that makes me
hesitate is that these will probably end up as URL parameters for the
HTTP version, which means our URL can start to get a little long).

Anyway. That's the direction I'm thinking. I haven't written the code
yet. The trickiest thing will probably be that the server would want to
avoid advertising the same ref twice via two mechanisms (or perhaps the
client just be tolerant of duplicates; that relieves the server of any
duplicate-storage requirements).

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFH] limiting ref advertisements
  2016-11-14 21:21   ` Jeff King
@ 2016-11-16 13:42     ` Duy Nguyen
  0 siblings, 0 replies; 4+ messages in thread
From: Duy Nguyen @ 2016-11-16 13:42 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

On Tue, Nov 15, 2016 at 4:21 AM, Jeff King <peff@peff.net> wrote:
> Thanks for responding to this.

Glad to help (or more precisely annoy you somewhat :D)

> I've been meaning to get back to it with
> some code experiments, but they keep getting bumped down in priority. So
> let me at least outline some of my thoughts, without code. :)
>
> I was hoping to avoid right-anchoring because it's expensive to find all
> of the right-anchored cases (assuming that ref storage is generally
> hierarchical, which it is now and probably will be for future backends).

Urgh.. I completely forgot about future refs backends. Yeah this would
kick both wildmatch and right-anchoring out of the window.

For the record I almost suggested BPF as well (to keep the server side
simple, but at the super high cost of client side). That would also go
out of the window the same way wildmatch and right-anchoring does.

>      But remember that these are "early capabilities", before the server
>      has spoken at all. So the client doesn't know if we can handle v2.
>      So we have to send _both_ (and v2-aware servers can ignore the v1).
>
>        advertise-lookup-v1=master
>        advertise-lookup-v2=master
>
>      But that's not quite enough. A v1 server won't look in refs/notes
>      at all. So we have to say that, too:
>
>        advertise-lookup-v1=refs/notes/master
>
>      And of course the v1 server has no idea that this isn't necessary
>      if we already found refs/heads/master.

We discussed a bit about upgrading upload-pack version 1 to 2 in more
than one session: the first fetch just does v1 as normal, the server
returns v1 response to but also advertises that v2 is supported. The
client keeps this info and skips v1 and tries v2 right away in the
following fetches, falling back to v1 (new fetch session) if v2 is
unsupported. Can it work the same way here too?

I'm in favor of this option 2 (without trying to be absolute backward
compatible with older lookup versions) since it allows us to optimize
for common case and experiment a bit. Once we know better we can make
the next version that hopefully suits everybody.

>      So I think you really do need the client to be able to say "also
>      look at this pattern".

What about the order of patterns? Does it matter "this pattern" is in
the middle or the end of the pattern list? I suppose not, just
checking...

But does this call for the ability to remove a pattern from the
pattern list as well, as a way to narrow down the search scope and
avoid sending unwanted refs?

> Of course we do still want left-anchoring, too. Wildcards like
> "refs/heads/*" are always left-anchored. So I think we'd have two types,
> and a full request for
>
>   git fetch origin +refs/heads/*:refs/remotes/origin/* master:foo
>
> would look like:
>
>   (1) advertise-pattern-v1
>   (2) advertise-pattern=refs/notes/%s
>   (3) advertise-prefix=refs/heads
>   (4) advertise-lookup=master
>
> where the lines mean:
>
>   1. Use the standard v1 patterns (we could spell them out, but this
>      just saves bandwidth. In fact, it could just be implicit that v1
>      patterns are included, and we could skip this line).
>
>   2. This is for our fictional future version where the client knows
>      added refs/notes/* to its DWIM but the server hasn't yet.
>
>   3. Give me all of refs/heads/*
>
>   4. Look up "master" using the advertise patterns and give me the first
>      one you find.

Well.. it sounds good to me. But I would not trust myself on refs matters :D

> So given that we can omit (1), and that (2) is just an example for the
> future, it could look like:
>
>   advertise-prefix=refs/heads
>   advertise-lookup=master
>
> which is pretty reasonable. It's not _completely_ bulletproof in terms
> of backwards compatibility. The "v1" thing means the client can't insert
> a new pattern in the middle (remember they're ordered by priority).

OK so pattern order probably matters...

> So
> maybe it is better to spell them all out (one thing that makes me
> hesitate is that these will probably end up as URL parameters for the
> HTTP version, which means our URL can start to get a little long).
>
> Anyway. That's the direction I'm thinking. I haven't written the code
> yet. The trickiest thing will probably be that the server would want to
> avoid advertising the same ref twice via two mechanisms (or perhaps the
> client just be tolerant of duplicates; that relieves the server of any
> duplicate-storage requirements).

Thanks for sharing.
-- 
Duy

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-11-16 13:42 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-24 13:29 [RFH] limiting ref advertisements Jeff King
2016-10-25 11:46 ` Duy Nguyen
2016-11-14 21:21   ` Jeff King
2016-11-16 13:42     ` Duy Nguyen

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).