git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>, git@vger.kernel.org
Cc: "Taylor Blau" <me@ttaylorr.com>,
	"Derrick Stolee" <dstolee@microsoft.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: Re: [PATCH] upload-pack: disable commit graph more gently for shallow traversal
Date: Thu, 12 Sep 2019 08:23:49 -0400	[thread overview]
Message-ID: <fdb81ce0-20ee-5880-2a6c-0c8b3f1739b9@gmail.com> (raw)
In-Reply-To: <20190912000414.GA31334@sigill.intra.peff.net>

On 9/11/2019 8:04 PM, Jeff King wrote:
> When the client has asked for certain shallow options like
> "deepen-since", we do a custom rev-list walk that pretends to be
> shallow. Before doing so, we have to disable the commit-graph, since it
> is not compatible with the shallow view of the repository. That's
> handled by 829a321569 (commit-graph: close_commit_graph before shallow
> walk, 2018-08-20). That commit literally closes and frees our
> repo->objects->commit_graph struct.
> 
> That creates an interesting problem for commits that have _already_ been
> parsed using the commit graph. Their commit->object.parsed flag is set,
> their commit->graph_pos is set, but their commit->maybe_tree may still
> be NULL. When somebody later calls repo_get_commit_tree(), we see that
> we haven't loaded the tree oid yet and try to get it from the commit
> graph. But since it has been freed, we segfault!

OOPS! That is certainly a bad thing. I'm glad you found it, but I
am sorry for how you (probably) found it.

> So the root of the issue is a data dependency between the commit's
> lazy-load of the tree oid and the fact that the commit graph can go
> away mid-process. How can we resolve it?
> 
> There are a couple of general approaches:
> 
>   1. The obvious answer is to avoid loading the tree from the graph when
>      we see that it's NULL. But then what do we return for the tree oid?
>      If we return NULL, our caller in do_traverse() will rightly
>      complain that we have no tree. We'd have to fallback to loading the
>      actual commit object and re-parsing it. That requires teaching
>      parse_commit_buffer() to understand re-parsing (i.e., not starting
>      from a clean slate and not leaking any allocated bits like parent
>      list pointers).
> 
>   2. When we close the commit graph, walk through the set of in-memory
>      objects and clear any graph_pos pointers. But this means we also
>      have to "unparse" any such commits so that we know they still need
>      to open the commit object to fill in their trees. So it's no less
>      complicated than (1), and is more expensive (since we clear objects
>      we might not later need).
> 
>   3. Stop freeing the commit-graph struct. Continue to let it be used
>      for lazy-loads of tree oids, but let upload-pack specify that it
>      shouldn't be used for further commit parsing.
> 
>   4. Push the whole shallow rev-list out to its own sub-process, with
>      the commit-graph disabled from the start, giving it a clean memory
>      space to work from.
> 
> I've chosen (3) here. Options (1) and (2) would work, but are
> non-trivial to implement. Option (4) is more expensive, and I'm not sure
> how complicated it is (shelling out for the actual rev-list part is
> easy, but we do then parse the resulting commits internally, and I'm not
> clear which parts need to be handling shallow-ness).

I agree that (3) is the best option. The commit-graph is just a database
of commit data, and we are choosing to interpret the parent data differently.
The tree data is perfectly good.

> The new test in t5500 triggers this segfault, but see the comments there
> for how horribly intimate it has to be with how both upload-pack and
> commit graphs work.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  commit-graph.c        | 10 ++++++++++
>  commit-graph.h        |  6 ++++++
>  t/t5500-fetch-pack.sh | 38 ++++++++++++++++++++++++++++++++++++++
>  upload-pack.c         |  2 +-
>  4 files changed, 55 insertions(+), 1 deletion(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index 9b02d2c426..bc5dd5913f 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -41,6 +41,8 @@
>  #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
>  			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
>  
> +static int commit_graph_disabled;

Should we be putting this inside the repository struct instead?

> +
>  char *get_commit_graph_filename(const char *obj_dir)
>  {
>  	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
> @@ -472,6 +474,9 @@ static int prepare_commit_graph(struct repository *r)
>  		die("dying as requested by the '%s' variable on commit-graph load!",
>  		    GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD);
>  
> +	if (commit_graph_disabled)
> +		return 0;
> +
>  	if (r->objects->commit_graph_attempted)
>  		return !!r->objects->commit_graph;
>  	r->objects->commit_graph_attempted = 1;
> @@ -2101,3 +2106,8 @@ void free_commit_graph(struct commit_graph *g)
>  	free(g->filename);
>  	free(g);
>  }
> +
> +void disable_commit_graph(void)
> +{
> +	commit_graph_disabled = 1;
> +}

Then this would take a struct repository *r.

> diff --git a/commit-graph.h b/commit-graph.h
> index 486e64e591..42d6e7c289 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -107,4 +107,10 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  void close_commit_graph(struct raw_object_store *);
>  void free_commit_graph(struct commit_graph *);
>  
> +/*
> + * Disable further use of the commit graph in this process when parsing a
> + * "struct commit".
> + */
> +void disable_commit_graph(void);
> +
>  #endif
> diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
> index 8210f63d41..7601664b74 100755
> --- a/t/t5500-fetch-pack.sh
> +++ b/t/t5500-fetch-pack.sh
> @@ -783,6 +783,44 @@ test_expect_success 'clone shallow since selects no commits' '
>  	)
>  '
>  
> +# A few subtle things about the request in this test:
> +#
> +#  - the server must have commit-graphs present and enabled
> +#
> +#  - the history is such that our want/have share a common ancestor ("base"
> +#    here)
> +#
> +#  - we send only a single have, which is fewer than a normal client would
> +#    send. This ensures that we don't parse "base" up front with
> +#    parse_object(), but rather traverse to it as a parent while deciding if we
> +#    can stop the "have" negotiation, and call parse_commit(). The former
> +#    sees the actual object data and so always loads the three oid, whereas the
> +#    latter will try to lod it lazily.
> +#
> +#  - we must use protocol v2, because it handles the "have" negotiation before
> +#    processing the shallow direectives
> +#
> +test_expect_success 'shallow since with commit graph and already-seen commit' '
> +	test_create_repo shallow-since-graph &&
> +	(
> +	cd shallow-since-graph &&
> +	test_commit base &&
> +	test_commit master &&
> +	git checkout -b other HEAD^ &&
> +	test_commit other &&
> +	git commit-graph write --reachable &&
> +	git config core.commitGraph true &&
> +
> +	GIT_PROTOCOL=version=2 git upload-pack . <<-EOF >/dev/null
> +	0012command=fetch
> +	00010013deepen-since 1
> +	0032want $(git rev-parse other)
> +	0032have $(git rev-parse master)
> +	0000
> +	EOF
> +	)
> +'
> +
>  test_expect_success 'shallow clone exclude tag two' '
>  	test_create_repo shallow-exclude &&
>  	(
> diff --git a/upload-pack.c b/upload-pack.c
> index 222cd3ad89..a260b6b50d 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -722,7 +722,7 @@ static void deepen_by_rev_list(struct packet_writer *writer, int ac,
>  {
>  	struct commit_list *result;
>  
> -	close_commit_graph(the_repository->objects);
> +	disable_commit_graph();
>  	result = get_shallow_commits_by_rev_list(ac, av, SHALLOW, NOT_SHALLOW);
>  	send_shallow(writer, result);
>  	free_commit_list(result);
> 

Your patch does not seem to actually cover the "I've already parsed some commits"
case, as you are only preventing the commit-graph from being prepared. Instead,
we need to have a short-circuit inside parse_commit() to avoid future parsing
from the commit-graph file.

(I'm going to the rest of the thread messages now to see if I just repeated
someone else's concerns.)

Thanks,
-Stolee

  parent reply	other threads:[~2019-09-12 12:23 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-12  0:04 [PATCH] upload-pack: disable commit graph more gently for shallow traversal Jeff King
2019-09-12  0:18 ` Jeff King
2019-09-12  1:11   ` [PATCH] list-objects: don't queue root trees unless revs->tree_objects is set Jeff King
2019-09-12  1:19     ` Jeff King
2019-09-12 12:31     ` Derrick Stolee
2019-09-12 16:52     ` Junio C Hamano
2019-09-12 22:34       ` Jeff King
2019-09-13 18:05         ` Junio C Hamano
2019-09-12  2:08   ` [PATCH] upload-pack: disable commit graph more gently for shallow traversal Taylor Blau
2019-09-12 14:03     ` Jeff King
2019-09-12  2:07 ` Taylor Blau
2019-09-12 11:06   ` SZEDER Gábor
2019-09-12 14:01     ` Jeff King
2019-09-12 12:46   ` Derrick Stolee
2019-09-12 13:59   ` Jeff King
2019-09-12 12:23 ` Derrick Stolee [this message]
2019-09-12 14:23   ` Jeff King
2019-09-12 19:27     ` Derrick Stolee
2019-09-12 14:41 ` [PATCH v2] upload-pack commit graph segfault fix Jeff King
2019-09-12 14:43   ` Jeff King
2019-09-12 14:44   ` [PATCH v2 1/2] commit-graph: bump DIE_ON_LOAD check to actual load-time Jeff King
2019-09-12 19:30     ` Derrick Stolee
2019-09-12 14:44   ` [PATCH v2 2/2] upload-pack: disable commit graph more gently for shallow traversal Jeff King
2019-09-13 13:26     ` Derrick Stolee
2019-09-12 16:56   ` [PATCH v2] upload-pack commit graph segfault fix Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fdb81ce0-20ee-5880-2a6c-0c8b3f1739b9@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).