git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Patrick Marlier <patrick.marlier@gmail.com>
To: Elijah Newren <newren@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Optimization for "git clean -ffdx"
Date: Tue, 15 Feb 2022 23:16:09 +0100	[thread overview]
Message-ID: <CAKQMxzRsq5uHme03a8DqMg1-Tku+0teDv0mDnsbJERfORtTB_w@mail.gmail.com> (raw)
In-Reply-To: <CABPp-BE2q5_LVVUw=2Wfys0AF350tTMMr43ZLpfsjE4ecb1Tpw@mail.gmail.com>

Hi Elijah,

Thanks for this nice and very detailed email!

On Sun, Feb 13, 2022 at 4:25 AM Elijah Newren <newren@gmail.com> wrote:
> There are two steps -- collect the list of things that are removable,
> and then a separate step to remove the things that are removable.  It
> can be the case that we recurse into the same directory twice, once
> for each step.  For the first step, see the fill_directory() call.  If
> you left off "-x" or only had one "-f", or your directories had some
> tracked files, then fill_directory()'s job is finding out which things
> are removable and it'd likely only be a subset of those directories.
> For the second step, some of those removable things may be entire
> directories.  So when it hits one of those and later calls
> remove_dirs() to remove it, it has to recurse again.

I am not sure I understand perfectly why there are 2 steps but this is detail.

The observation I had in the second step for removal, it is this type
of patterns using strace:
getcwd("/tmp/repo", 129) = 46
newfstatat(AT_FDCWD, "/tmp/repo/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
rmdir("a/a/a/a")                        = 0
newfstatat(AT_FDCWD, "a/a/a/b", {st_mode=S_IFDIR|0700, st_size=4096,
...}, AT_SYMLINK_NOFOLLOW) = 0
openat(AT_FDCWD, "a/a/a/b", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 6
newfstatat(6, "", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_EMPTY_PATH) = 0
getdents64(6, 0x55b718d8d650 /* 2 entries */, 32768) = 48
getdents64(6, 0x55b718d8d650 /* 0 entries */, 32768) = 0
close(6)                                = 0
getcwd("/tmp/repo", 129) = 46
newfstatat(AT_FDCWD, "/tmp/repo/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a/a", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/tmp/repo/a/a/a/b", {st_mode=S_IFDIR|0700,
st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
rmdir("a/a/a/b")                        = 0

Here I have a repository with quite some nested and untracked directories.
We can see that many "newfstatat" are repeated for the same path. Also
"getcwd" is repeated many times.
I guess here "strbuf_realpath" would benefit from some caching.


> > Using "-ff" should not check for nested .git and no need to recurse if
> > the directory is already untracked.
>
> Not quite; the -x and lack of -e options is critical here, otherwise
> we cannot just nuke the whole directory.  (Also, -d is important, but
> that just kind of goes without saying.)

Indeed, you are right.


> Further, setting this flag is only solving part of the performance
> problem.  We'll still recurse into the directories to look for ignored
> files, which should be avoided since we don't need to differentiate.
> That basically means that we need to avoid all the code in the current
>
>    if (remove_directories && !ignored_only)
>
> block, whenever (remove_directories && ignored && !exclude_list.nr &&
> force > 1).

Good point!


> > Another thfing to note is that it shows "Removing XXX" but it shows it
> > when the directory is already gone. So we could change to "Removed
> > XXX" or display the "Removing XXX" before starting to remove the
> > directory.
>
> I commented on Bagas' patch, but I think this is minutiae that won't
> be relevant to the end user, and would rather either ignore it or just
> move the print statement earlier rather than increasing work for
> translators.  That is, unless we tend to use past tense elsewhere in
> the UI and we want to make a concerted effort to convert to using past
> tense.

If we want to be correct, then we should then move the "Removing"
printing before but then it makes the "gone" flag a bit useless.
I will defer to you if this is something we want to fix.


> Anyway, I'll be happy to review your patch(es) and
> will take whichever parts you don't want to tackle.

If you don't mind, I will let you adjust the documentation.
Following the patch series for the 2 other changes.

Thanks a lot Elijah. Very appreciated!
--
Patrick Marlier

      reply	other threads:[~2022-02-15 22:17 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-08 22:03 Optimization for "git clean -ffdx" Patrick Marlier
2022-02-13  3:25 ` Elijah Newren
2022-02-15 22:16   ` Patrick Marlier [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAKQMxzRsq5uHme03a8DqMg1-Tku+0teDv0mDnsbJERfORtTB_w@mail.gmail.com \
    --to=patrick.marlier@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).