From: Eric Sunshine <sunshine@sunshineco.com>
To: Jeff King <peff@peff.net>
Cc: Jan Christoph Uhde <Jan@uhdejc.com>,
	Git List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] diff: discard blob data from stat-unmatched pairs
Date: Mon, 1 Jun 2020 16:53:50 -0400
Message-ID: <CAPig+cRyrEHNxY+Y-_E6e4DZB-sAiVdSB2T53bKP6UJOx_DJDA@mail.gmail.com>
In-Reply-To: <20200601202218.GA2763518@coredump.intra.peff.net>

On Mon, Jun 1, 2020 at 4:22 PM Jeff King <peff@peff.net> wrote:
> Subject: [PATCH] diff: discard blob data from stat-unmatched pairs
>
> When performing a tree-level diff against the working tree, we may find
> that our index stat information is dirty, so we queue a filepair to be
> examined later. If the actual content hasn't changed, we call this a
> stat-unmatch; the stat information was out of date, but there's no
> actual diff.  Normally diffcore_std() would detect and remove these
> identical filepairs via diffcore_skip_stat_unmatch().  However, when
> "--quiet" is used, we want to stop the diff as soon as we see any
> changes, so we check for stat-unmatches immediately in diff_change().
>
> That check may require us to actually load the file contents into the
> pair of diff_filespecs. If we find that the pair isn't a stat-unmatch,
> then no big deal; we'd likely load the contents later anyway to generate
> a patch, do rename detection, etc, so we want to hold on to it. But if
> it a stat-unmatch, then we have no more use for that data; the whole

s/it/& is/

> point is that we're going to discard the pair. However, we never free the
> allocated diff_filespec data.
>
> In most cases, keeping that data isn't a problem. We don't expect a lot
> of stat-unmatch entries, and since we're using --quiet, we'd quit as
> soon as we saw such a real change anyway. However, there are extreme
> cases where it makes a big difference:
>
>   1. We'd generally mmap() the working tree half of the pair. And since
>      the OS may limit the total number of maps, we can run afoul of this
>      in large repositories. E.g.:
>
>        $ cd linux
>        $ git ls-files | wc -l
>        67959
>        $ sysctl vm.max_map_count
>        vm.max_map_count = 65530
>        $ git ls-files | xargs touch ;# everything is stat-dirty!
>        $ git diff --quiet
>        fatal: mmap failed: Cannot allocate memory
>
>      It should be unusual to have so many files stat-dirty, but it's
>      possible if you've just run a script like "sed -i" or similar.
>
>      After this patch, the above correctly exits with code 0.
>
>   2. Even if you don't hit mmap limits, the index half of the pair will
>      have been pulled from the object database into heap memory. Again
>      in a clone of linux.git, running:
>
>        $ git ls-files | head -n 10000 | xargs touch
>        $ git diff --quiet
>
>      peaks at 145MB heap before this patch, and 94MB after.
>
> This patch solves the problem by freeing any diff_filespec data we
> picked up during the "--quiet" stat-unmatch check in diff_change().
> Nobody is going to need that data later, so there's no point holding on
> to it. There are a few things to note:
>
>   - we could skip queueing the pair entirely, which could in theory save
>     a little work. But there's not much to save, as we need a
>     diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway.
>     And since we cache the result of the stat-unmatch checks, a later
>     call to diffcore_skip_stat_unmatch() will quickly skip over
>     them. The diffcore code also counts up the number of stat-unmatched
>     pairs as it removes them. It's doubtful any callers would care about
>     that in combination with --quiet, but we'd have to reimplement the
>     logic here to be on the safe side. So it's not really worth the
>     trouble.
>
>   - I didn't write a test, because we always produce the correct output
>     unless we run up against system mmap limits, which are both
>     unportable and expensive to test against. Measuring peak heap
>     would be interesting, but our perf suite isn't yet capable of that.
>
>   - note that diff without "--quiet" does not suffer from the same
>     problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch
>     entries and drop them immediately, so we're not carrying their data
>     around.
>
>   - you _can_ still trigger the mmap limit problem if you truly have
>     that many files with actual changes. But it's rather unlikely. The
>     stat-unmatch check avoids loading the file contents if the sizes
>     don't match, so you'd need a pretty trivial change in every single
>     file. Likewise, inexact rename detection might load the data for
>     many files all at once. But you'd need not just 64k changes, but
>     that many deletions and additions. The most likely candidate is
>     perhaps break-detection, which would load the data for all pairs and
>     keep it around for the content-level diff. But again, you'd need 64k
>     actually changed files in the first place.
>
>     So it's still possible to trigger this case, but it seems like "I
>     accidentally made all my files stat-dirty" is the most likely case
>     in the real world.
>
> Reported-by: Jan Christoph Uhde <Jan@UhdeJc.com>
> Signed-off-by: Jeff King <peff@peff.net>
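
For anyone who wants to see the stat-unmatch case from item 1 without a
full linux.git clone, here is a minimal sketch on a scratch repository.
The repository layout, file names, and the demo identity are all
hypothetical, and three files are far too few to hit vm.max_map_count,
so this only shows the "stat-dirty but content-identical" behaviour,
not the mmap failure itself. It assumes git and a POSIX shell are
available.

```shell
#!/bin/sh
# Sketch only: a tiny throwaway repo, not the linux.git case quoted above.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .

# A few tracked files with known content (names are made up).
for i in 1 2 3; do
	echo "content $i" >"file$i"
done
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m initial

# Make every tracked file stat-dirty without touching its content:
# an old mtime cannot match what the index recorded at "git add" time.
git ls-files | xargs touch -t 200001010000

# The contents are unchanged, so these are stat-unmatches, not real
# changes, and "git diff --quiet" should report no differences.
status=0
git diff --quiet || status=$?
echo "git diff --quiet exited with $status"
```

With so few files, the exit status should be 0 with or without the
patch. The failure quoted above only appears once the number of
stat-dirty files approaches the OS limit on mappings.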

Thread overview: 6+ messages
2020-05-31 10:46 mmap failure in master 1aa69c73577df21f5e37e47cc40cf44fc049121e Jan Christoph Uhde
2020-06-01  4:45 ` Jeff King
     [not found]   ` <cfc79aec-ec85-dbec-e37b-6b7035b4c5a4@UhdeJc.com>
2020-06-01 16:54     ` Jeff King
2020-06-01 20:22       ` [PATCH] diff: discard blob data from stat-unmatched pairs Jeff King
2020-06-01 20:53         ` Eric Sunshine [this message]
2020-06-02 16:49         ` Junio C Hamano