git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: Elijah Newren <newren@gmail.com>
Cc: Martin Melka <martin.melka@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Samuel Lijin <sxlijin@gmail.com>
Subject: Re: git status --ignored hangs when a deep directory structure present in working tree
Date: Tue, 28 Jan 2020 14:57:07 +0100	[thread overview]
Message-ID: <20200128135707.GD10482@szeder.dev> (raw)
In-Reply-To: <CABPp-BGvU_DHQu66bqPZ+WXg5mL8bCP5Uxp4g5393WnWyO1Dhg@mail.gmail.com>

On Mon, Jan 27, 2020 at 09:06:01PM -0800, Elijah Newren wrote:
> > the runtime of 'git status
> > --ignored' grows quickly with the depth of the untracked directory.
> > Running this shell loop produces the numbers below:
> >
> > for depth in $(seq 10 30)
> > do
> >         dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
> >         rm -rf dir
> >         mkdir -p $dirs
> >         >$dirs/untracked-file
> >         /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
> > done
> >
> > 10: 0.01
> > 11: 0.03
> > 12: 0.05
> > 13: 0.11
> > 14: 0.23
> > 15: 0.47
> > 16: 0.97
> > 17: 1.97
> > 18: 3.88
> > 19: 7.85
> > 20: 16.29
> > 21: 32.92
> > 22: 76.24
> 
> Wow.
> 
> Really nice testcase, though, thanks.
> 
> > Beautifully quadratic, isn't it? :)
> 
> I think you mean exponential (in particular 2^n rather than n^2).

Ouch, yeah, indeed.

> > Unless I messed up my numbers, with a depth of 120 directories it
> > would take over 6*10^23 years to complete... so yeah, it does qualify
> > as indefinitely.
> 
> No comment about how people today are spoiled and addicted to instant
> gratification?  I mean, can't folks just be a little patient?  ;-)

Nope.  Notice how my shell loop above goes to 30, but the results only
to 22 :)

> > This slowdown was caused by commit df5bcdf83a (dir: recurse into
> > untracked dirs for ignored files, 2017-05-18), which was part of a
> > patch series to fix 'git clean -d' deleting untracked directories even
> > if they contained ignored files.
> >
> > Cc'ing Samuel, author of that commit, and Elijah, who had quite some
> > fun with 'dir.c' recently.
> 
> Heh, yes, what "fun" it was.
> 
> Anyway, after digging around for quite a bit today... that commit
> added calling read_directory_recursive() directly from itself for
> certain untracked paths.  This means that read_directory_recursive()
> (which I'll abbreviate to r_d_r()), when we're dealing with certain
> untracked paths:
> 
>   * Calls treat_path() -> treat_one_path() -> treat_directory() -> r_d_r()
>   * Calls r_d_r() directly as well
> 
> So, from the toplevel directory, r_d_r() will call itself twice on the
> next directory down.  For each of those, it'll call r_d_r() twice on
> the second directory down.  From each of those, it'll call r_d_r()
> twice on the third directory in the hierarchy and so on until we have
> 2^n calls to r_d_r() for the nth level deep directory.

Got it, thanks.

> Trying to back out the underlying problem, I _think_ the cause behind
> all this is that r_d_r() and friends all use the path_treatment enum,
> which says that the treatment of any path has to be one of those four
> types. The dichotomy between path_untracked and path_recurse in
> particular means the system has no way to mark that something should
> be both marked as untracked and recursed into, yet we definitely need
> to have some directories marked as untracked and we also need to
> recurse into them.  This, I think led to Samuel's attempt to
> workaround that dichotomy by having the code in r_d_r() check for
> certain path_untracked cases which should also be recursed into.  I
> think adding another type to the enum and shifting the logic elsewhere
> might enable us to both simplify the logic and avoid this expensive
> exponential behavior, but I haven't gotten it working yet.  We'll see
> if my patch goes anywhere, or if it's just another dead-end among
> many.

I was wondering whether it would make sense to give the enum contants
power-of-two values, so we could say 'path_recurse | path_untracked'.
But while this particular combination makes sense, others, at least to
my superficial understanding, not at all (e.g. 'path_recurse |
path_exclude').


  reply	other threads:[~2020-01-28 13:57 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-27 10:55 git status --ignored hangs when a deep directory structure present in working tree Martin Melka
2020-01-27 12:08 ` SZEDER Gábor
2020-01-28  5:06   ` Elijah Newren
2020-01-28 13:57     ` SZEDER Gábor [this message]
2020-01-28 15:00       ` Elijah Newren
2020-01-29 22:10     ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200128135707.GD10482@szeder.dev \
    --to=szeder.dev@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=martin.melka@gmail.com \
    --cc=newren@gmail.com \
    --cc=sxlijin@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).