git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <derrickstolee@github.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Bagas Sanjaya <bagasdotme@gmail.com>
Subject: Re: [PATCH] pack-objects: lazily set up "struct rev_info", don't leak
Date: Fri, 25 Mar 2022 12:41:35 -0400	[thread overview]
Message-ID: <8d368240-dae5-7a66-6c0c-9e0a960ca34c@github.com> (raw)
In-Reply-To: <220325.86r16qkodl.gmgdl@evledraar.gmail.com>

On 3/25/2022 12:00 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Mar 25 2022, Derrick Stolee wrote:
> 
>> On 3/25/2022 10:25 AM, Ævar Arnfjörð Bjarmason wrote:
>>> In the preceding [1] (pack-objects: move revs out of
>>> get_object_list(), 2022-03-22) the "repo_init_revisions()" was moved
>>> to cmd_pack_objects() so that it unconditionally took place for all
>>> invocations of "git pack-objects".
>>>
>>> We'd thus start leaking memory, which is easily reproduced in
>>> e.g. git.git by feeding e83c5163316 (Initial revision of "git", the
>>> information manager from hell, 2005-04-07) to "git pack-objects";
>>>
>>>     $ echo e83c5163316f89bfbde7d9ab23ca2e25604af290 | ./git pack-objects initial
>>>     [...]
>>> 	==19130==ERROR: LeakSanitizer: detected memory leaks
>>>
>>> 	Direct leak of 7120 byte(s) in 1 object(s) allocated from:
>>> 	    #0 0x455308 in __interceptor_malloc (/home/avar/g/git/git+0x455308)
>>> 	    #1 0x75b399 in do_xmalloc /home/avar/g/git/wrapper.c:41:8
>>> 	    #2 0x75b356 in xmalloc /home/avar/g/git/wrapper.c:62:9
>>> 	    #3 0x5d7609 in prep_parse_options /home/avar/g/git/diff.c:5647:2
>>> 	    #4 0x5d415a in repo_diff_setup /home/avar/g/git/diff.c:4621:2
>>> 	    #5 0x6dffbb in repo_init_revisions /home/avar/g/git/revision.c:1853:2
>>> 	    #6 0x4f599d in cmd_pack_objects /home/avar/g/git/builtin/pack-objects.c:3980:2
>>> 	    #7 0x4592ca in run_builtin /home/avar/g/git/git.c:465:11
>>> 	    #8 0x457d81 in handle_builtin /home/avar/g/git/git.c:718:3
>>> 	    #9 0x458ca5 in run_argv /home/avar/g/git/git.c:785:4
>>> 	    #10 0x457b40 in cmd_main /home/avar/g/git/git.c:916:19
>>> 	    #11 0x562259 in main /home/avar/g/git/common-main.c:56:11
>>> 	    #12 0x7fce792ac7ec in __libc_start_main csu/../csu/libc-start.c:332:16
>>> 	    #13 0x4300f9 in _start (/home/avar/g/git/git+0x4300f9)
>>>
>>> 	SUMMARY: LeakSanitizer: 7120 byte(s) leaked in 1 allocation(s).
>>> 	Aborted
>>>
>>> Narrowly fixing that commit would have been easy, just add call
>>> repo_init_revisions() right before get_object_list(), which is
>>> effectively what was done before that commit.
>>>
>>> But an unstated constraint when setting it up early is that it was
>>> needed for the subsequent [2] (pack-objects: parse --filter directly
>>> into revs.filter, 2022-03-22), i.e. we might have a --filter
>>> command-line option, and need to either have the "struct rev_info"
>>> setup when we encounter that option, or later.
>>>
>>> Let's just change the control flow so that we'll instead set up the
>>> "struct rev_info" only when we need it. Doing so leads to a bit more
>>> verbosity, but it's a lot clearer what we're doing and why.
>>
>> This makes sense.
>>
>>> We could furthermore combine the two get_object_list() invocations
>>> here by having repo_init_revisions() invoked on &pfd.revs, but I think
>>> clearly separating the two makes the flow clearer. Likewise
>>> redundantly but explicitly (i.e. redundant v.s. a "{ 0 }") "0" to
>>> "have_revs" early in cmd_pack_objects().
>>
>> I disagree, especially when you later want to make sure we free
>> the data from revs using your release_revisions().
> 
> Because we'll need two release_revisions() instead of one?

And it would be easy to miss one of the two if you are only testing
certain paths with leak-check on.

>>> This does add the future constraint to opt_parse_list_objects_filter()
>>> that we'll need to adjust this wrapper code if it looks at any other
>>> value of the "struct option" than the "value" member.
>>
>> So we are coupling ourselves to the implementation of this method.
> 
> Sure, but aren't all OPT_* where the macro is providing some custom
> callback coupling themselves to how it's implemented?

No, I mean that the callback function you are making here is coupling
itself to the existing callback function in a novel way.

I tried to find another example of a nested callback, but I didn't
even find another instance of creating a new 'struct option'.

>>> But that regression should be relatively easy to spot. I'm
>>> intentionally not initializing the "struct wrap" with e.g. "{ 0 }" so
>>> that various memory sanity checkers would spot that, we just
>>> initialize the "value" in po_filter_cb(). By doing this e.g. we'll die
>>> on e.g. this test if we were to use another member of "opt" in
>>> opt_parse_list_objects_filter()>
>>
>> So you are using uninitialized memory as a way to discover any
>> necessary changes to that coupling. I'm not sure this is super-safe
>> because we don't necessarily run memory checkers during CI builds.
>>
>> I'd rather have a consistently initialized chunk of data that would
>> behave predictably (and hopefully we'd discover it is behaving
>> incorrectly with that predictable behavior).
> 
> I'd like to keep this as-is, i.e. if you zero it out it might also
> behave unpredictably or even in undefined ways (e.g. a NULL ptr
> dereference). Likewise in my version, but it has the benefit of being
> caught by some tooling we have.

I guess by "predictable" I mean "well-defined" or "deterministic".

I don't want the build options to change how this works (for instance,
debug builds sometimes initialize data to zero).

>> +struct rev_info_maybe_empty {
>> +	int has_revs;
>> +	struct rev_info revs;
>> +};

Thinking about this a second time, perhaps it would be best to add
an "unsigned initialized:1;" to struct rev_info so we can look at
such a struct and know whether or not repo_init_revisions() has
been run or not. Avoids the custom struct and unifies a few things.

In particular, release_revisions() could choose to do nothing if
revs->initialized is false.

Further, a second repo_init_revisions() could do nothing if
revs->initialized is true. This allows us to safely "re-init"
without our own "if (has_revs)" checks...
>> 	else {
>> 		if (!pfd.have_revs) {
>> 			repo_init_revisions(the_repository, &pfd.revs, NULL);
>> 			pfd.have_revs = 1;
>> 		}
>> 		get_object_list(&pfd.revs, rp.nr, rp.v);
>> 	}

...making this just

	else {
		repo_init_revisions(the_repository, &revs, NULL);
		get_object_list(&revs. rp.nr, rp.v);
	}

> Conceptually I think the saving of that one line isn't worth it to the
> reader.
> 
> Then you'd need to read up and see exactly how pfd.revs gets mutated,
> and its callback etc., only to see we're just doing this to save
> ourselves a variable deceleration and a call to release_revisions().
> 
>> and then later you can add
>>
>> 	if (pfd.have_revs)
>> 		release_revisions(&pfd.revs);

And this would just be

	release_revisions(&revs);

>> to clear the memory in exactly one place.
> 
> Yeah, it would work. I'd just prefer control flow that's trivial to
> reason about over saving a couple of lines here.

I think having multiple revision things that can live at different
levels (one embedded in a custom struct, one not) is not trivial to
reason about. If we change the "is this initialized" indicator to be
within 'struct rev_info', then this gets simpler.

It seems to me that it is easier to track a single struct and release
the memory in one place based on its lifespan.

Thanks,
-Stolee

  reply	other threads:[~2022-03-25 16:41 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-22 17:28 [PATCH 0/5] Partial bundle follow ups Derrick Stolee via GitGitGadget
2022-03-22 17:28 ` [PATCH 1/5] list-objects-filter: remove CL_ARG__FILTER Derrick Stolee via GitGitGadget
2022-03-22 17:28 ` [PATCH 2/5] pack-objects: move revs out of get_object_list() Derrick Stolee via GitGitGadget
2022-03-22 17:28 ` [PATCH 3/5] pack-objects: parse --filter directly into revs.filter Derrick Stolee via GitGitGadget
2022-03-22 19:37   ` [-SPAM-] " Ramsay Jones
2022-03-23 13:48     ` Derrick Stolee
2022-03-22 21:15   ` Ævar Arnfjörð Bjarmason
2022-03-22 17:28 ` [PATCH 4/5] bundle: move capabilities to end of 'verify' Derrick Stolee via GitGitGadget
2022-03-23  7:08   ` Bagas Sanjaya
2022-03-23 13:39     ` Derrick Stolee
2022-03-22 17:28 ` [PATCH 5/5] bundle: output hash information in 'verify' Derrick Stolee via GitGitGadget
2022-03-23 21:27 ` [PATCH 0/5] Partial bundle follow ups Junio C Hamano
2022-03-25 14:25 ` [PATCH] pack-objects: lazily set up "struct rev_info", don't leak Ævar Arnfjörð Bjarmason
2022-03-25 14:57   ` Derrick Stolee
2022-03-25 16:00     ` Ævar Arnfjörð Bjarmason
2022-03-25 16:41       ` Derrick Stolee [this message]
2022-03-25 17:34         ` Ævar Arnfjörð Bjarmason
2022-03-25 19:08           ` Derrick Stolee
2022-03-26  0:52             ` Ævar Arnfjörð Bjarmason
2022-03-28 14:04               ` Derrick Stolee
2022-03-25 18:53   ` Junio C Hamano
2022-03-26  1:09     ` Ævar Arnfjörð Bjarmason
2022-03-28 15:43   ` [PATCH v2] " Ævar Arnfjörð Bjarmason
2022-03-28 15:58     ` Derrick Stolee
2022-03-28 17:10     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8d368240-dae5-7a66-6c0c-9e0a960ca34c@github.com \
    --to=derrickstolee@github.com \
    --cc=avarab@gmail.com \
    --cc=bagasdotme@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).