git@vger.kernel.org mailing list mirror (one of many)
* git rm VERY slow for directories with many files.
@ 2017-10-28 17:02 Christopher Jefferson
  2017-10-28 22:31 ` brian m. carlson
  0 siblings, 1 reply; 6+ messages in thread
From: Christopher Jefferson @ 2017-10-28 17:02 UTC (permalink / raw)
  To: git@vger.kernel.org

I stupidly added a directory with many files ( ~450,000 ) to git, and want to delete them — later I plan to rebase/squash various commits to remove the files from the history altogether.

However, ‘git rm’ is VERY slow. For example, with a directory of 10,000 files (on a Mac, git v2.14.2):

Git add . : 5.95 secs
Git commit : 1.29 secs
Git rm -r : 22 secs

50,000 files

Git add . : 25 secs
Git commit : 11 secs
Git rm : After 20 minutes, I killed it.

Looking at an optimized profile, all the time seems to be spent in “get_tree_entry” — I assume there is some huge object representing the directory which is being re-expanded for each file?

Is there any way I can speed up removing this directory?

Chris

* Re: git rm VERY slow for directories with many files.
  2017-10-28 17:02 git rm VERY slow for directories with many files Christopher Jefferson
@ 2017-10-28 22:31 ` brian m. carlson
  2017-10-29  0:51   ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: brian m. carlson @ 2017-10-28 22:31 UTC (permalink / raw)
  To: Christopher Jefferson; +Cc: git@vger.kernel.org

On Sat, Oct 28, 2017 at 05:02:02PM +0000, Christopher Jefferson wrote:
> I stupidly added a directory with many files ( ~450,000 ) to git, and want to delete them — later I plan to rebase/squash various commits to remove the files from the history altogether.
> 
> However, ‘git rm’ is VERY slow. For example, in a directory with 10,000 files (on a Mac), git v2.14.2
> 
> Git add . : 5.95 secs
> Git commit : 1.29 secs
> Git rm -r : 22 secs
> 
> 50,000 files
> 
> Git add . : 25 secs
> Git commit : 11 secs
> Git rm : After 20 minutes, I killed it.
> 
> Looking at an optimized profile, all the time seems to be spent in “get_tree_entry” — I assume there is some huge object representing the directory which is being re-expanded for each file?

Yes, there's a tree object that represents each directory.

> Is there any way I can speed up removing this directory?

First, make sure your working directory is clean with no changes.  Then,
remove the directory (by hand) or move it somewhere else.  Then, run
"git add -u".

That should allow you to commit the removal of those files quickly.
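
A minimal sketch of the steps above in Python, for anyone who wants to
script them. The helper names (is_clean, remove_directory_fast) are made
up for illustration; only "git status --porcelain" and "git add -u" are
real git commands, and the sketch assumes "repo" is the top-level
directory of the working tree.

```python
import os
import shutil
import subprocess


def is_clean(porcelain_output):
    """True if `git status --porcelain` printed nothing at all."""
    return porcelain_output.strip() == ""


def remove_directory_fast(repo, directory):
    """Delete `directory` by hand, then let `git add -u` stage the deletions.

    Hypothetical helper: `repo` is assumed to be the top level of the
    working tree and `directory` a path relative to it.
    """
    # Step 1: refuse to continue unless the working tree is clean.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    if not is_clean(status):
        raise RuntimeError("working tree has changes; commit or stash them first")

    # Step 2: remove the directory outside of git.
    shutil.rmtree(os.path.join(repo, directory))

    # Step 3: stage every tracked path that changed or disappeared.
    subprocess.run(["git", "add", "-u"], cwd=repo, check=True)
```

After it returns, the removal is staged and "git commit" records it as
usual.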
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

* Re: git rm VERY slow for directories with many files.
  2017-10-28 22:31 ` brian m. carlson
@ 2017-10-29  0:51   ` Junio C Hamano
  2017-10-29  3:52     ` Junio C Hamano
  2017-10-29 16:52     ` brian m. carlson
  0 siblings, 2 replies; 6+ messages in thread
From: Junio C Hamano @ 2017-10-29  0:51 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Christopher Jefferson, git@vger.kernel.org

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>> Looking at an optimized profile, all the time seems to be spent in “get_tree_entry” — I assume there is some huge object representing the directory which is being re-expanded for each file?
>
> Yes, there's a tree object that represents each directory.
>
>> Is there any way I can speed up removing this directory?
>
> First, make sure your working directory is clean with no changes.  Then,
> remove the directory (by hand) or move it somewhere else.  Then, run
> "git add -u".
>
> That should allow you to commit the removal of those files quickly.

If get_tree_entry() shows up a lot in the profile, it would indicate
that a lot of cycles are spent in check_local_mod().  Bypassing it
with "-f" may be the first thing to try ;-)

The way "git rm" makes repeated calls to get_tree_entry() with deep
pathnames would be an easy recipe to get quadratic behaviour like
the one reported in the first message on this thread, as it always
goes from the root level, grabs a tree object and scans it to get
the entry for the next level, and (worse yet) a look-up of a path
component in each of these tree objects must be done as a linear
scan.
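
To make the cost concrete, here is a toy Python model of that lookup
pattern. It is not git's code: a tree is modelled as a sorted list of
(name, child) pairs, and lookup_from_root, build_repo and
total_comparisons are hypothetical names. Each lookup starts at the
root and scans every tree linearly, as described above.

```python
def lookup_from_root(root, path):
    """Resolve `path` starting at the root tree, scanning each tree linearly.

    Returns (entry, number_of_name_comparisons).
    """
    comparisons = 0
    node = root
    for component in path.split("/"):
        found = None
        for name, child in node:  # linear scan of one tree object
            comparisons += 1
            if name == component:
                found = child
                break
        if found is None:
            return None, comparisons
        node = found
    return node, comparisons


def build_repo(n):
    """A root tree holding one directory 'big' with n blobs in it."""
    names = sorted("f%06d" % i for i in range(n))
    big = [(name, "blob") for name in names]
    return [("big", big)], ["big/" + name for name in names]


def total_comparisons(n):
    """Name comparisons needed to look up every path once, root first."""
    root, paths = build_repo(n)
    return sum(lookup_from_root(root, p)[1] for p in paths)
```

In this model, looking up n files in one directory costs
n + n(n+1)/2 name comparisons in total, so doubling n roughly
quadruples the work.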

I wonder how fast "git diff-index --cached -r HEAD --", with the
same pathspec used for the problematic "git rm", runs in this same
50,000 path project.  

If it runs in a reasonable time, one easy way out may be to revamp
the codepath that calls check_local_mod() to:

 - first before making the call, do the "diff-index --cached" thing
   internally with the same pathspec to grab the list of paths that
   have local modifications; save the set of paths in a hashmap or
   something.

 - pass that hashmap to check_local_mod(), and where the function
   does the "staged_changes" check, consult the hashmap to see whether
   the path in question is different between the HEAD and the index.
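
The two steps above can be sketched in Python as follows. This is only
a model of the idea, not a patch: HEAD and the index are represented as
plain dicts mapping path to object id, the function names are
hypothetical, and only the "staged_changes" part of the check is
covered.

```python
def changed_paths(head, index):
    """One pass over both sides: paths whose object id differs.

    `head` and `index` are dicts of path -> object id (a simplification
    of what "diff-index --cached" would report).
    """
    changed = set()
    for path in head.keys() | index.keys():
        if head.get(path) != index.get(path):
            changed.add(path)
    return changed


def has_staged_changes(path, changed):
    """Constant-time stand-in for a per-path tree lookup."""
    return path in changed


def paths_blocking_removal(paths, head, index):
    """Paths that would be refused because the index differs from HEAD."""
    changed = changed_paths(head, index)  # computed once, up front
    return [p for p in paths if has_staged_changes(p, changed)]
```

The point of the design is that the expensive comparison happens once
for the whole pathspec, and each path then costs a single set lookup.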


* Re: git rm VERY slow for directories with many files.
  2017-10-29  0:51   ` Junio C Hamano
@ 2017-10-29  3:52     ` Junio C Hamano
  2017-10-29 16:52     ` brian m. carlson
  1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2017-10-29  3:52 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Christopher Jefferson, git@vger.kernel.org

Junio C Hamano <gitster@pobox.com> writes:

> I wonder how fast "git diff-index --cached -r HEAD --", with the
> same pathspec used for the problematic "git rm", runs in this same
> 50,000 path project.  
>
> If it runs in a reasonable time, one easy way out may be to revamp
> the codepath to call check_local_mod() to:
>
>  - first before making the call, do the "diff-index --cached" thing
>    internally with the same pathspec to grab the list of paths that
>    have local modifications; save the set of paths in a hashmap or
>    something.
>
>  - pass that hashmap to check_local_mod(), and where the function
>    does the "staged_changes" check, consult the hashmap to see the
>    path in question is different between the HEAD and the index.

And if we want to try a more localized band-aid, another approach
may be to add a caching version of get_tree_entry() where we keep
track of (stack of) tree, the path component we found during the
last call to the helper and the tree_desc.  That way, when we get
the next call, we descend that stack as long as the leading path
components are still the same, and when we see that the path
component we are looking for is different from what we used in the
last call, we either (1) reuse the tree_desc and keep going forward
if the name we looked for last sorts before what we are looking
for, or (2) discard and reopen the tree, rewinding the tree_desc to
the beginning and do the scan.

That way, the caller of the check_local_mod() does not have to know
the trick, and because the loop in check_local_mod() iterates over
the list that is already sorted in the index order, we'd not just
reduce the number of times we open the trees but also reduce the
number of times we scan and skip the entries in trees to find the
entries we are after.
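
A rough Python model of such a caching lookup, to show why sorted
callers benefit. Everything here is an assumption made for
illustration: the Frame and CachingLookup names, trees as sorted lists
of (name, child) pairs, and plain string ordering (git's real tree
ordering treats directory names specially, which this ignores).

```python
class Frame:
    def __init__(self, name, tree):
        self.name = name  # path component that led to this tree ("" for root)
        self.tree = tree  # sorted list of (name, child) pairs
        self.pos = 0      # scan cursor, playing the role of the tree_desc
        self.last = None  # name searched for by the previous call


class CachingLookup:
    def __init__(self, root):
        self.stack = [Frame("", root)]
        self.comparisons = 0

    def _find(self, frame, name):
        if frame.last is not None and name < frame.last:
            frame.pos = 0  # case (2): rewind and scan from the beginning
        frame.last = name
        while frame.pos < len(frame.tree):  # case (1): keep going forward
            entry_name, child = frame.tree[frame.pos]
            self.comparisons += 1
            if entry_name == name:
                return child
            if entry_name > name:
                return None  # sorted, so the name cannot appear later
            frame.pos += 1
        return None

    def get(self, path):
        parts = path.split("/")
        depth = 0
        for part in parts[:-1]:
            nxt = depth + 1
            if nxt < len(self.stack) and self.stack[nxt].name == part:
                depth = nxt  # same leading component: reuse the open tree
                continue
            del self.stack[nxt:]  # different component: drop deeper frames
            child = self._find(self.stack[depth], part)
            if not isinstance(child, list):
                return None  # missing, or a blob where a tree was expected
            self.stack.append(Frame(part, child))
            depth = nxt
        del self.stack[depth + 1:]
        return self._find(self.stack[depth], parts[-1])
```

When paths arrive in sorted order, each directory is opened once and
its cursor only moves forward, so the total number of comparisons in
this model grows linearly with the number of paths.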

* Re: git rm VERY slow for directories with many files.
  2017-10-29  0:51   ` Junio C Hamano
  2017-10-29  3:52     ` Junio C Hamano
@ 2017-10-29 16:52     ` brian m. carlson
  2017-10-30  1:36       ` Junio C Hamano
  1 sibling, 1 reply; 6+ messages in thread
From: brian m. carlson @ 2017-10-29 16:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christopher Jefferson, git@vger.kernel.org

On Sun, Oct 29, 2017 at 09:51:55AM +0900, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > First, make sure your working directory is clean with no changes.  Then,
> > remove the directory (by hand) or move it somewhere else.  Then, run
> > "git add -u".
> >
> > That should allow you to commit the removal of those files quickly.
> 
> If get_tree_entry() shows up a lot in the profile, it would indicate
> that a lot of cycles are spent in check_local_mod().  Bypassing it
> with "-f" may be the first thing to try ;-)

That is indeed faster.  I tested my solution by creating a directory
with 20,000 files in a temporary repo.  git rm -r took 17.96s, and git
rm -rf took .12s.  (This is on an SSD.)

That's also a nicer and more intuitive solution than mine.

> I wonder how fast "git diff-index --cached -r HEAD --", with the
> same pathspec used for the problematic "git rm", runs in this same
> 50,000 path project.

I'll let the original poster answer this one as well, but it was very
fast in my test repo.  I'm not very familiar with the code path in
question, but it definitely looks like we're avoiding the quadratic
behavior in this case.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

* Re: git rm VERY slow for directories with many files.
  2017-10-29 16:52     ` brian m. carlson
@ 2017-10-30  1:36       ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2017-10-30  1:36 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Christopher Jefferson, git@vger.kernel.org

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On Sun, Oct 29, 2017 at 09:51:55AM +0900, Junio C Hamano wrote:
>> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>> > First, make sure your working directory is clean with no changes.  Then,
>> > remove the directory (by hand) or move it somewhere else.  Then, run
>> > "git add -u".
>> >
>> > That should allow you to commit the removal of those files quickly.
>> 
>> If get_tree_entry() shows up a lot in the profile, it would indicate
>> that a lot of cycles are spent in check_local_mod().  Bypassing it
>> with "-f" may be the first thing to try ;-)
>
> That is indeed faster.  I tested my solution by creating a directory
> with 20,000 files in a temporary repo.  git rm -r took 17.96s, and git
> rm -rf took .12s.  (This is on an SSD.)
>
> That's also a nicer and more intuitive solution than mine.

Heh, the above was meant as a joke, though.  "-f" is bypassing an
important safety valve.  In fact in my early draft of the message,
the paragraph that followed started with "Jokes aside, ..." ;-)

>> I wonder how fast "git diff-index --cached -r HEAD --", with the
>> same pathspec used for the problematic "git rm", runs in this same
>> 50,000 path project.
>
> I'll let the original poster answer this one as well, but it was very
> fast in my test repo.  I'm not very familiar with the code path in
> question, but it definitely looks like we're avoiding the quadratic
> behavior in this case.

Because of the way "diff-index --cached" iterates over the index and
the tree in parallel, it should be a lot faster than doing
get_tree_entry() for each and every path you care about.  In
addition, the "--cached" form is further optimized to take advantage
of the cached-tree index extension, so you often can tell "all index
entries in this directory are untouched" without descending into
deep subdirectories.
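
The shortcut in the last sentence can be illustrated with a small
Python model. It is a sketch under loud assumptions, not how git stores
anything: both sides are nested dicts of the form
{"id": ..., "entries": {...}}, where the "cached" side stands in for
the index plus its cached-tree ids, and diff_trees is a hypothetical
name. When the two ids for a directory match, the whole subtree is
skipped without being opened.

```python
def diff_trees(head, cached, prefix="", stats=None):
    """Return paths that differ, skipping subtrees whose ids match.

    `head` and `cached` are dicts {"id": str, "entries": {name: child}},
    where child is either another such dict (a subtree) or a blob id.
    """
    if head["id"] == cached["id"]:
        return []  # everything below is untouched: no need to descend
    if stats is not None:
        stats["opened"] = stats.get("opened", 0) + 1

    changed = []
    # Walk both sides together, in name order.
    for name in sorted(head["entries"].keys() | cached["entries"].keys()):
        h = head["entries"].get(name)
        c = cached["entries"].get(name)
        path = prefix + name
        if isinstance(h, dict) and isinstance(c, dict):
            changed += diff_trees(h, c, path + "/", stats)
        elif h != c:
            # Added, removed, or modified; a directory that appears on
            # only one side is reported as a single path in this model.
            changed.append(path)
    return changed
```

With a large untouched directory next to a small modified one, only
the root and the modified directory are opened in this model; the
large one is dismissed by a single id comparison.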
