git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <derrickstolee@github.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>, git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>, Bagas Sanjaya <bagasdotme@gmail.com>
Subject: Re: [PATCH] pack-objects: lazily set up "struct rev_info", don't leak
Date: Fri, 25 Mar 2022 10:57:04 -0400	[thread overview]
Message-ID: <d71fa968-be75-f4ad-ea6c-644f9d2b52d9@github.com> (raw)
In-Reply-To: <patch-1.1-193534b0f07-20220325T121715Z-avarab@gmail.com>

On 3/25/2022 10:25 AM, Ævar Arnfjörð Bjarmason wrote:
> In the preceding [1] (pack-objects: move revs out of
> get_object_list(), 2022-03-22) the "repo_init_revisions()" was moved
> to cmd_pack_objects() so that it unconditionally took place for all
> invocations of "git pack-objects".
> 
> We'd thus start leaking memory, which is easily reproduced in
> e.g. git.git by feeding e83c5163316 (Initial revision of "git", the
> information manager from hell, 2005-04-07) to "git pack-objects";
> 
>     $ echo e83c5163316f89bfbde7d9ab23ca2e25604af290 | ./git pack-objects initial
>     [...]
> 	==19130==ERROR: LeakSanitizer: detected memory leaks
> 
> 	Direct leak of 7120 byte(s) in 1 object(s) allocated from:
> 	    #0 0x455308 in __interceptor_malloc (/home/avar/g/git/git+0x455308)
> 	    #1 0x75b399 in do_xmalloc /home/avar/g/git/wrapper.c:41:8
> 	    #2 0x75b356 in xmalloc /home/avar/g/git/wrapper.c:62:9
> 	    #3 0x5d7609 in prep_parse_options /home/avar/g/git/diff.c:5647:2
> 	    #4 0x5d415a in repo_diff_setup /home/avar/g/git/diff.c:4621:2
> 	    #5 0x6dffbb in repo_init_revisions /home/avar/g/git/revision.c:1853:2
> 	    #6 0x4f599d in cmd_pack_objects /home/avar/g/git/builtin/pack-objects.c:3980:2
> 	    #7 0x4592ca in run_builtin /home/avar/g/git/git.c:465:11
> 	    #8 0x457d81 in handle_builtin /home/avar/g/git/git.c:718:3
> 	    #9 0x458ca5 in run_argv /home/avar/g/git/git.c:785:4
> 	    #10 0x457b40 in cmd_main /home/avar/g/git/git.c:916:19
> 	    #11 0x562259 in main /home/avar/g/git/common-main.c:56:11
> 	    #12 0x7fce792ac7ec in __libc_start_main csu/../csu/libc-start.c:332:16
> 	    #13 0x4300f9 in _start (/home/avar/g/git/git+0x4300f9)
> 
> 	SUMMARY: LeakSanitizer: 7120 byte(s) leaked in 1 allocation(s).
> 	Aborted
> 
> Narrowly fixing that commit would have been easy, just add call
> repo_init_revisions() right before get_object_list(), which is
> effectively what was done before that commit.
> 
> But an unstated constraint when setting it up early is that it was
> needed for the subsequent [2] (pack-objects: parse --filter directly
> into revs.filter, 2022-03-22), i.e. we might have a --filter
> command-line option, and need to either have the "struct rev_info"
> setup when we encounter that option, or later.
> 
> Let's just change the control flow so that we'll instead set up the
> "struct rev_info" only when we need it. Doing so leads to a bit more
> verbosity, but it's a lot clearer what we're doing and why.

This makes sense.

> We could furthermore combine the two get_object_list() invocations
> here by having repo_init_revisions() invoked on &pfd.revs, but I think
> clearly separating the two makes the flow clearer. Likewise
> redundantly but explicitly (i.e. redundant v.s. a "{ 0 }") "0" to
> "have_revs" early in cmd_pack_objects().

I disagree, especially when you later want to make sure we free
the data from revs using your release_revisions().

> This does add the future constraint to opt_parse_list_objects_filter()
> that we'll need to adjust this wrapper code if it looks at any other
> value of the "struct option" than the "value" member.

So we are coupling ourselves to the implementation of this method.

> But that regression should be relatively easy to spot. I'm
> intentionally not initializing the "struct wrap" with e.g. "{ 0 }" so
> that various memory sanity checkers would spot that, we just
> initialize the "value" in po_filter_cb(). By doing this e.g. we'll die
> on e.g. this test if we were to use another member of "opt" in
> opt_parse_list_objects_filter()>

So you are using uninitialized memory as a way to discover any
necessary changes to that coupling. I'm not sure this is super-safe
because we don't necessarily run memory checkers during CI builds.

I'd rather have a consistently initialized chunk of data that would
behave predictably (and hopefully we'd discover it is behaving
incorrectly with that predictable behavior).


> While we're at it add parentheses around the arguments to the OPT_*
> macros in in list-objects-filter-options.h, as we need to change those
> lines anyway. It doesn't matter in this case, but is good general
> practice.
> 
> 1. https://lore.kernel.org/git/619b757d98465dbc4995bdc11a5282fbfcbd3daa.1647970119.git.gitgitgadget@gmail.com
> 2. https://lore.kernel.org/git/97de926904988b89b5663bd4c59c011a1723a8f5.1647970119.git.gitgitgadget@gmail.com/
> 
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
> 
> This is on top of ds/partial-bundle-more: I thought the fix for this
> new leak was involved enough to propose it as a commit-on-top rather
> than a fixup for a re-roll, especially since aside from the newly
> leaked memory I don't think ds/partial-bundle-more is breaking
> anything by doing that.
> 
> Except that is, interacting badly with my release_revisions() series
> in "seen", which currently causes the "linux-leaks" job to fail there:
> https://lore.kernel.org/git/cover-v2-00.27-00000000000-20220323T203149Z-avarab@gmail.com/
> 
> This is proper fix for the issue the interaction with my topic
> revealed (not caused, we just started testing for this leak there),
> i.e. it obsoletes the suggestion of adding an UNLEAK() there.
> 
>  builtin/pack-objects.c        | 30 +++++++++++++++++++++++++-----
>  list-objects-filter-options.h |  8 +++++---
>  2 files changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index d39f668ad56..735080a4a95 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3857,6 +3857,23 @@ static int option_parse_unpack_unreachable(const struct option *opt,
>  	return 0;
>  }
>  
> +struct po_filter_data {
> +	unsigned have_revs:1;
> +	struct rev_info revs;
> +};
> +
> +static int po_filter_cb(const struct option *opt, const char *arg, int unset)
> +{
> +	struct po_filter_data *data = opt->value;
> +	struct option wrap; /* don't initialize! */
> +
> +	repo_init_revisions(the_repository, &data->revs, NULL);
> +	wrap.value = &data->revs.filter;
> +	data->have_revs = 1;
> +
> +	return opt_parse_list_objects_filter(&wrap, arg, unset);
> +}

The coupling here is unfortunate, but unavoidable. The future-proof
way to do it would be to modify opt->value and pass the rest of its
members as-is, but it's marked const so that's not an option.

One way to help make this coupling more obvious would be to move
this method into list-filter-options.c so we can have their
implementations adjacent and even refer to them.

Here is a potential version that looks like that:

--- >8 ---

diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index f02d8df142..55c7131814 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -281,6 +281,10 @@ void parse_list_objects_filter(
 		die("%s", errbuf.buf);
 }
 
+/*
+ * If you change this to use any member in 'opt' other than 'value',
+ * then be sure to update opt_parse_list_objects_filter_in_revs().
+ */
 int opt_parse_list_objects_filter(const struct option *opt,
 				  const char *arg, int unset)
 {
@@ -293,6 +297,18 @@ int opt_parse_list_objects_filter(const struct option *opt,
 	return 0;
 }
 
+int opt_parse_list_objects_filter_in_revs(const struct option *opt,
+					  const char *arg, int unset)
+{
+	struct rev_info_maybe_empty *ri = opt->value;
+	struct option wrap = { .value = &ri->revs.filter };
+
+	repo_init_revisions(the_repository, &ri->revs, NULL);
+	ri->has_revs = 1;
+
+	return opt_parse_list_objects_filter(&wrap, arg, unset);
+}
+
 const char *list_objects_filter_spec(struct list_objects_filter_options *filter)
 {
 	if (!filter->filter_spec.nr)
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
index 90e4bc9625..cf8b710a76 100644
--- a/list-objects-filter-options.h
+++ b/list-objects-filter-options.h
@@ -106,6 +106,8 @@ void parse_list_objects_filter(
 
 int opt_parse_list_objects_filter(const struct option *opt,
 				  const char *arg, int unset);
+int opt_parse_list_objects_filter_in_revs(const struct option *opt,
+					  const char *arg, int unset);
 
 #define OPT_PARSE_LIST_OBJECTS_FILTER(fo) \
 	OPT_CALLBACK(0, "filter", fo, N_("args"), \
diff --git a/revision.h b/revision.h
index 5bc59c7bfe..95aa3ee891 100644
--- a/revision.h
+++ b/revision.h
@@ -329,6 +329,11 @@ struct rev_info {
 	struct tmp_objdir *remerge_objdir;
 };
 
+struct rev_info_maybe_empty {
+	int has_revs;
+	struct rev_info revs;
+};
+
 int ref_excluded(struct string_list *, const char *path);
 void clear_ref_exclusion(struct string_list **);
 void add_ref_exclusion(struct string_list **, const char *exclude);

--- >8 ---


> +	} else if (pfd.have_revs) {
> +		get_object_list(&pfd.revs, rp.nr, rp.v);
>  	} else {
> +		struct rev_info revs;
> +
> +		repo_init_revisions(the_repository, &revs, NULL);
>  		get_object_list(&revs, rp.nr, rp.v);
>  	}

Here, I think it would be better to have

	else {
		if (!pfd.have_revs) {
			repo_init_revisions(the_repository, &pfd.revs, NULL);
			pfd.have_revs = 1;
		}
		get_object_list(&pfd.revs, rp.nr, rp.v);
	}

and then later you can add

	if (pfd.have_revs)
		release_revisions(&pfd.revs);

to clear the memory in exactly one place.

Thanks,
-Stolee

  reply	other threads:[~2022-03-25 14:57 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-22 17:28 [PATCH 0/5] Partial bundle follow ups Derrick Stolee via GitGitGadget
2022-03-22 17:28 ` [PATCH 1/5] list-objects-filter: remove CL_ARG__FILTER Derrick Stolee via GitGitGadget
2022-03-22 17:28 ` [PATCH 2/5] pack-objects: move revs out of get_object_list() Derrick Stolee via GitGitGadget
2022-03-22 17:28 ` [PATCH 3/5] pack-objects: parse --filter directly into revs.filter Derrick Stolee via GitGitGadget
2022-03-22 19:37   ` [-SPAM-] " Ramsay Jones
2022-03-23 13:48     ` Derrick Stolee
2022-03-22 21:15   ` Ævar Arnfjörð Bjarmason
2022-03-22 17:28 ` [PATCH 4/5] bundle: move capabilities to end of 'verify' Derrick Stolee via GitGitGadget
2022-03-23  7:08   ` Bagas Sanjaya
2022-03-23 13:39     ` Derrick Stolee
2022-03-22 17:28 ` [PATCH 5/5] bundle: output hash information in 'verify' Derrick Stolee via GitGitGadget
2022-03-23 21:27 ` [PATCH 0/5] Partial bundle follow ups Junio C Hamano
2022-03-25 14:25 ` [PATCH] pack-objects: lazily set up "struct rev_info", don't leak Ævar Arnfjörð Bjarmason
2022-03-25 14:57   ` Derrick Stolee [this message]
2022-03-25 16:00     ` Ævar Arnfjörð Bjarmason
2022-03-25 16:41       ` Derrick Stolee
2022-03-25 17:34         ` Ævar Arnfjörð Bjarmason
2022-03-25 19:08           ` Derrick Stolee
2022-03-26  0:52             ` Ævar Arnfjörð Bjarmason
2022-03-28 14:04               ` Derrick Stolee
2022-03-25 18:53   ` Junio C Hamano
2022-03-26  1:09     ` Ævar Arnfjörð Bjarmason
2022-03-28 15:43   ` [PATCH v2] " Ævar Arnfjörð Bjarmason
2022-03-28 15:58     ` Derrick Stolee
2022-03-28 17:10     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d71fa968-be75-f4ad-ea6c-644f9d2b52d9@github.com \
    --to=derrickstolee@github.com \
    --cc=avarab@gmail.com \
    --cc=bagasdotme@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).