From: Jeff King <peff@peff.net>
To: Jan Christoph Uhde <Jan@UhdeJc.com>
Cc: git@vger.kernel.org
Subject: Re: mmap failure in master 1aa69c73577df21f5e37e47cc40cf44fc049121e
Date: Mon, 1 Jun 2020 12:54:08 -0400
Message-ID: <20200601165408.GA2536619@coredump.intra.peff.net>
In-Reply-To: <cfc79aec-ec85-dbec-e37b-6b7035b4c5a4@UhdeJc.com>

[re-adding list to cc; I think there are some interesting bits for
discussion here]

On Mon, Jun 01, 2020 at 08:41:08AM +0200, Jan Christoph Uhde wrote:

> > Is it possible that your local repository has a large number of packs? Git
> > will leave open maps to each pack's index file, plus some packs
> > themselves (ones we're accessing, plus we map+close small ones), plus
> > whatever maps are used by libc to malloc.  The kernel default limit for
> > the number of maps is 65530. If you have on the order of 30,000 packs
> > you might run into this limit.
> > 
> > You can check the number of packs with "git count-objects -v", and the
> 
> » git count-objects -v
> count: 0
> size: 0
> in-pack: 2399361
> packs: 1
> size-pack: 919395
> prune-packable: 0
> garbage: 0
> size-garbage: 0

OK, so that's just one pack (with a few million objects, which is
expected).
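
As an aside, for anyone who wants to watch this limit on their own
machine, here is a minimal sketch. It assumes Linux with /proc mounted;
the helper names map_limit and map_count are made up for illustration
and are not git or kernel tools.

```shell
# Print the kernel's per-process map limit. The path is Linux-specific;
# if it is missing we just print "unknown".
map_limit() {
  if [ -r /proc/sys/vm/max_map_count ]; then
    cat /proc/sys/vm/max_map_count
  else
    echo unknown
  fi
}

# Count the maps currently held by a process (defaults to this shell).
# Each line of /proc/PID/maps describes one mapping.
map_count() {
  pid=${1:-$$}
  if [ -r "/proc/$pid/maps" ]; then
    wc -l <"/proc/$pid/maps" | tr -d ' '
  else
    echo unknown
  fi
}

echo "limit:   $(map_limit)"
echo "current: $(map_count)"
```

Passing the pid of a running git process to map_count should show how
close that process is to the limit, though a short-lived command may
exit before you can look.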

> > If that's the problem, the solution is to repack (which should also
> > generally improve performance). If you have trouble repacking due to the
> > limits, you can overcome the chicken and egg with:
> > 
> >    sysctl -w vm.max_map_count=131060
> 
> This fixes the problem indeed!

Now that surprises me. However, I can reproduce the issue with a fresh
clone of gcc:

  $ git clone https://github.com/gcc-mirror/gcc
  $ cd gcc
  $ git diff --quiet ;# works!
  $ find . -type f | xargs touch ;# dirty the index entry for each file
  $ git diff --quiet
  fatal: mmap failed: Cannot allocate memory

There are ~100k files in that repo. If we mmap each one as part of
diff_populate_filespec(), we're going to end up with too many maps.
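
To put rough numbers on that, here is a small sketch that compares a
file count against the map limit. It is only a heuristic, under the
assumption that each file costs about one map; the function name
would_exceed_maps and its default of 65530 (the kernel default
mentioned in the quoted text above) are my own choices for
illustration.

```shell
# Succeed (exit 0) if mapping one region per file would exceed the limit.
# $1 = number of files, $2 = map limit (defaults to 65530).
would_exceed_maps() {
  nfiles=$1
  limit=${2:-65530}
  [ "$nfiles" -gt "$limit" ]
}

for n in 1000 70000 100000; do
  if would_exceed_maps "$n"; then
    echo "$n files: over the limit"
  else
    echo "$n files: under the limit"
  fi
done
```

In a real repository the count could come from "git ls-files | wc -l".
Keep in mind that the process already holds other maps (libc, pack
index files), so the failure would be expected somewhat below the
limit itself.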

So we can easily reproduce this case without having to download the huge
gcc repo:

  git init repo
  cd repo
  for i in $(seq 70); do
    mkdir $i
    for j in $(seq 1000); do
      echo "foo" >$i/$j
    done
  done
  git add .
  git commit -m 'add a bunch of files'

  git diff --quiet ;# works
  git ls-files | xargs touch ;# make every index entry stat-dirty
  git diff --quiet ;# fails with the same mmap error

> What could be the reason that the problem only occurs with the `--quiet` flag
> that I use in my prompt command, but not when using `git diff` without the flag.

That is curious, and I can reproduce it here.

In the non-quiet case, we queue each filepair that we find to be
stat-dirty. And then we call diffcore_skip_stat_unmatch() from
diffcore_std() on the whole list. We load each file one at a time, see that
it's not really a change (the index entries are just stat-dirty because
of our touch), and then discard it.

Whereas in the --quiet case, we hit this code in diff_change() when
queuing each pair:

          if (options->flags.quick && options->skip_stat_unmatch &&
              !diff_filespec_check_stat_unmatch(options->repo, p))
                  return;

          options->flags.has_changes = 1;

That code is trying to make sure we accurately set the has_changes flag
in quick/quiet mode (because we can stop diffing as soon as we see a
single change). Since none of these changes is interesting (they're all
just stat-dirty entries), we have to keep going and look at each one.
But this code leaves the populated filespec (and its mmap) in the queue.
The following patch makes "diff --quiet" succeed in this case:

diff --git a/diff.c b/diff.c
index d1ad6a3c4a..2535614735 100644
--- a/diff.c
+++ b/diff.c
@@ -6758,8 +6758,15 @@ void diff_change(struct diff_options *options,
 		return;
 
 	if (options->flags.quick && options->skip_stat_unmatch &&
-	    !diff_filespec_check_stat_unmatch(options->repo, p))
+	    !diff_filespec_check_stat_unmatch(options->repo, p)) {
+		/*
+		 * discard any populated data; this entry is uninteresting;
+		 * we probably ought to avoid queuing the pair at all!
+		 */
+		diff_free_filespec_data(p->one);
+		diff_free_filespec_data(p->two);
 		return;
+	}
 
 	options->flags.has_changes = 1;
 }

But that's only because these aren't "real" changes. If we swap out our
"touch" for:

  git ls-files | while IFS= read -r fn; do
    echo bar >"$fn"
  done

then we have real changes. Our --quiet case will exit immediately once
it sees a change, so it's OK. But now the non-quiet one will fail,
because it's going to load each file to confirm that it's an actual
change, but leave the mmap in place for when we do the actual
content-level diff.

So I think in general we do have problems diffing more than 65k working
tree files in a single process because of this limit. It may still be
worth doing something along the lines of the patch above, though,
because it seems more likely for somebody to have 65k stat-dirty files
than 65k actual changes.

-Peff

Thread overview:
2020-05-31 10:46 mmap failure in master 1aa69c73577df21f5e37e47cc40cf44fc049121e Jan Christoph Uhde
2020-06-01  4:45 ` Jeff King
     [not found]   ` <cfc79aec-ec85-dbec-e37b-6b7035b4c5a4@UhdeJc.com>
2020-06-01 16:54     ` Jeff King [this message]
2020-06-01 20:22       ` [PATCH] diff: discard blob data from stat-unmatched pairs Jeff King
2020-06-01 20:53         ` Eric Sunshine
2020-06-02 16:49         ` Junio C Hamano