From: Eric Sunshine <sunshine@sunshineco.com>
To: Jeff King <peff@peff.net>
Cc: Jan Christoph Uhde <Jan@uhdejc.com>,
Git List <git@vger.kernel.org>,
Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] diff: discard blob data from stat-unmatched pairs
Date: Mon, 1 Jun 2020 16:53:50 -0400 [thread overview]
Message-ID: <CAPig+cRyrEHNxY+Y-_E6e4DZB-sAiVdSB2T53bKP6UJOx_DJDA@mail.gmail.com> (raw)
In-Reply-To: <20200601202218.GA2763518@coredump.intra.peff.net>
On Mon, Jun 1, 2020 at 4:22 PM Jeff King <peff@peff.net> wrote:
> Subject: [PATCH] diff: discard blob data from stat-unmatched pairs
>
> When performing a tree-level diff against the working tree, we may find
> that our index stat information is dirty, so we queue a filepair to be
> examined later. If the actual content hasn't changed, we call this a
> stat-unmatch; the stat information was out of date, but there's no
> actual diff. Normally diffcore_std() would detect and remove these
> identical filepairs via diffcore_skip_stat_unmatch(). However, when
> "--quiet" is used, we want to stop the diff as soon as we see any
> changes, so we check for stat-unmatches immediately in diff_change().
>
> That check may require us to actually load the file contents into the
> pair of diff_filespecs. If we find that the pair isn't a stat-unmatch,
> then no big deal; we'd likely load the contents later anyway to generate
> a patch, do rename detection, etc, so we want to hold on to it. But if
> it a stat-unmatch, then we have no more use for that data; the whole
s/it/& is/
> point is that we're going discard the pair. However, we never free the
> allocated diff_filespec data.
>
> In most cases, keeping that data isn't a problem. We don't expect a lot
> of stat-unmatch entries, and since we're using --quiet, we'd quit as
> soon as we saw such a real change anyway. However, there are extreme
> cases where it makes a big difference:
>
> 1. We'd generally mmap() the working tree half of the pair. And since
> the OS may limit the total number of maps, we can run afoul of this
> in large repositories. E.g.:
>
> $ cd linux
> $ git ls-files | wc -l
> 67959
> $ sysctl vm.max_map_count
> vm.max_map_count = 65530
> $ git ls-files | xargs touch ;# everything is stat-dirty!
> $ git diff --quiet
> fatal: mmap failed: Cannot allocate memory
>
> It should be unusual to have so many files stat-dirty, but it's
> possible if you've just run a script like "sed -i" or similar.
>
> After this patch, the above correctly exits with code 0.
>
> 2. Even if you don't hit mmap limits, the index half of the pair will
> have been pulled from the object database into heap memory. Again
> in a clone of linux.git, running:
>
> $ git ls-files | head -n 10000 | xargs touch
> $ git diff --quiet
>
> peaks at 145MB heap before this patch, and 94MB after.
>
> This patch solves the problem by freeing any diff_filespec data we
> picked up during the "--quiet" stat-unmatch check in diff_changes.
> Nobody is going to need that data later, so there's no point holding on
> to it. There are a few things to note:
>
> - we could skip queueing the pair entirely, which could in theory save
> a little work. But there's not much to save, as we need a
> diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway.
> And since we cache the result of the stat-unmatch checks, a later
> call to diffcore_skip_stat_unmatch() call will quickly skip over
> them. The diffcore code also counts up the number of stat-unmatched
> pairs as it removes them. It's doubtful any callers would care about
> that in combination with --quiet, but we'd have to reimplement the
> logic here to be on the safe side. So it's not really worth the
> trouble.
>
> - I didn't write a test, because we always produce the correct output
> unless we run up against system mmap limits, which are both
> unportable and expensive to test against. Measuring peak heap
> would be interesting, but our perf suite isn't yet capable of that.
>
> - note that diff without "--quiet" does not suffer from the same
> problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch
> entries and drop them immediately, so we're not carrying their data
> around.
>
> - you _can_ still trigger the mmap limit problem if you truly have
> that many files with actual changes. But it's rather unlikely. The
> stat-unmatch check avoids loading the file contents if the sizes
> don't match, so you'd need a pretty trivial change in every single
> file. Likewise, inexact rename detection might load the data for
> many files all at once. But you'd need not just 64k changes, but
> that many deletions and additions. The most likely candidate is
> perhaps break-detection, which would load the data for all pairs and
> keep it around for the content-level diff. But again, you'd need 64k
> actually changed files in the first place.
>
> So it's still possible to trigger this case, but it seems like "I
> accidentally made all my files stat-dirty" is the most likely case
> in the real world.
>
> Reported-by: Jan Christoph Uhde <Jan@UhdeJc.com>
> Signed-off-by: Jeff King <peff@peff.net>
next prev parent reply other threads:[~2020-06-01 20:55 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-31 10:46 mmap failure in master 1aa69c73577df21f5e37e47cc40cf44fc049121e Jan Christoph Uhde
2020-06-01 4:45 ` Jeff King
[not found] ` <cfc79aec-ec85-dbec-e37b-6b7035b4c5a4@UhdeJc.com>
2020-06-01 16:54 ` Jeff King
2020-06-01 20:22 ` [PATCH] diff: discard blob data from stat-unmatched pairs Jeff King
2020-06-01 20:53 ` Eric Sunshine [this message]
2020-06-02 16:49 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAPig+cRyrEHNxY+Y-_E6e4DZB-sAiVdSB2T53bKP6UJOx_DJDA@mail.gmail.com \
--to=sunshine@sunshineco.com \
--cc=Jan@uhdejc.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).