Git status extremely slow if any file is a multiple of 8GBi

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Git status extremely slow if any file is a multiple of 8GBi
@ 2022-05-04  0:15 Jason Hatton
  2022-05-04 13:55 ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Jason Hatton @ 2022-05-04  0:15 UTC (permalink / raw)
  To: git@vger.kernel.org

Git is extremely slow if any file is a multiple of 8GBi. Below is
a working example.

> git init
> dd if=/dev/zero of=8gb bs=1024 count=$((1024*1024*8))
> git lfs track 8gb
> git add .
> git commit
> time git status
> time git status
> time git status

This seems related to git using uint32_t for storing the file size in the index.
A file of zero size has a special meaning.

I have an open issue on git-for-windows where @dscho has a lot of good technical
information.

https://github.com/git-for-windows/git/issues/3833#issuecomment-1116544918

I have a proposed idea that may or may not help. Would it be possible for any
file that is a multiple of 2^32 to be adjusted to a file size of 1 instead of
zero? Git is already functioning with mangling file sizes over 4GB in the
index, so maybe bumping up the size of 2^32 multiple files would mitigate the
issue. Using different versions of git would just cause 2^32 multiple files to
be re-checked again, which is the current behavior anyway. 0 would retain its
special meaning. If the file size is changed from 2^32 to 2^32 + 1, git may not
notice that the file changed because both files will look like they are size 1.
The ctime and mtime may still catch this.

// Some pseudo code near: 
https://github.com/git-for-windows/git/blob/dc88e3cd72a2f0bbe2fe513acfc72bd66b577851/read-cache.c#L176sd-

sd->sd_size = st->st_size;
if (sd->sd_size == 0 && st->st_size != 0) {
    sd->sd_size = 1;
}

Thanks
--
Jason D. Hatton

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-04  0:15 Jason Hatton
@ 2022-05-04 13:55 ` Junio C Hamano
  2022-05-04 16:08   ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2022-05-04 13:55 UTC (permalink / raw)
  To: Jason Hatton; +Cc: git@vger.kernel.org

Jason Hatton <jhatton@globalfinishing.com> writes:

> I have a proposed idea that may or may not help. Would it be possible for any
> file that is a multiple of 2^32 to be adjusted to a file size of 1 instead of
> zero? Git is already functioning with mangling file sizes over 4GB in the
> index, so maybe bumping up the size of 2^32 multiple files would mitigate the
> issue.

Clever.  

The condition sd_size==0 is used as a signal for "no, we really need
to compare the contents", and causes the contents to be hashed, and
if the contents match the object name recorded in the index, the
on-disk size is stored in sd_size and the entry is marked as
CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
entry would have sd_size==0 again, so a workaround like what you
outlined is needed.

You'd need to make sure that you tweaked

	if (sd->sd_size != (unsigned int) st->st_size)
		changed |= DATA_CHANGED;

that appears at the end of read-cache.c::match_stat_data() in a way
that is consistent with how you munge the sd_size member in
fill_stat_data(), if you are going to go that route.

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-04 13:55 ` Junio C Hamano
@ 2022-05-04 16:08   ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2022-05-04 16:08 UTC (permalink / raw)
  To: git@vger.kernel.org; +Cc: Jason Hatton

Junio C Hamano <gitster@pobox.com> writes:

> The condition sd_size==0 is used as a signal for "no, we really need
> to compare the contents", and causes the contents to be hashed, and
> if the contents match the object name recorded in the index, the
> on-disk size is stored in sd_size and the entry is marked as
> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
> entry would have sd_size==0 again, so a workaround like what you
> outlined is needed.

This is of secondary importance, but the fact that Jason observed
8GBi files gets hashed over and over unnecessarily means that we
would do the same for an empty file, opening, reading 0-bytes,
hashing, and closing, without taking advantage of the fact that
CE_UPTODATE bit says the file contents should be up-to-date with
respect to the cached object name, doesn't it?

Or do we have "if st_size == 0 and sd_size == 0 then we know what it
hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
usual open-read-hash-close dance" logic (I didn't check)?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
@ 2022-05-04 17:47 Jason Hatton
  2022-05-05 21:04 ` René Scharfe
  0 siblings, 1 reply; 9+ messages in thread
From: Jason Hatton @ 2022-05-04 17:47 UTC (permalink / raw)
  To: git@vger.kernel.org; +Cc: Junio C Hamano

> > The condition sd_size==0 is used as a signal for "no, we really need
> > to compare the contents", and causes the contents to be hashed, and
> > if the contents match the object name recorded in the index, the
> > on-disk size is stored in sd_size and the entry is marked as
> > CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
> > entry would have sd_size==0 again, so a workaround like what you
> > outlined is needed.
> 
> Junio C Hamano <gitster@pobox.com> writes:
>
> This is of secondary importance, but the fact that Jason observed
> 8GBi files gets hashed over and over unnecessarily means that we
> would do the same for an empty file, opening, reading 0-bytes,
> hashing, and closing, without taking advantage of the fact that
> CE_UPTODATE bit says the file contents should be up-to-date with
> respect to the cached object name, doesn't it?
> 
> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
> usual open-read-hash-close dance" logic (I didn't check)?

Junio C Hamano

As best as I can tell, it rechecks the zero sized files. My Linux box can run
git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
file that is a multiple of 2^32 with every "git ls" on the other hand...

I managed to actually compile git with the proposed changes. It seems to correct
the problem and "make test" passes. If upgrading to the patched version if git,
git will rehash the 8GBi files once and work normally. If downgrading to an
unpatched version, git will perceive that the 8GBi files have changes. This
needs to be corrected with "git add" or "git checkout". If you people are
interested, I may be able to find a way to send a patch to the list or put it
on github.

Thanks
Jason D. Hatton

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-04 17:47 Git status extremely slow if any file is a multiple of 8GBi Jason Hatton
@ 2022-05-05 21:04 ` René Scharfe
  2022-05-05 22:55   ` Philip Oakley
  0 siblings, 1 reply; 9+ messages in thread
From: René Scharfe @ 2022-05-05 21:04 UTC (permalink / raw)
  To: Jason Hatton, git@vger.kernel.org; +Cc: Junio C Hamano

Am 04.05.22 um 19:47 schrieb Jason Hatton:
>>> The condition sd_size==0 is used as a signal for "no, we really need
>>> to compare the contents", and causes the contents to be hashed, and
>>> if the contents match the object name recorded in the index, the
>>> on-disk size is stored in sd_size and the entry is marked as
>>> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
>>> entry would have sd_size==0 again, so a workaround like what you
>>> outlined is needed.
>>
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>> This is of secondary importance, but the fact that Jason observed
>> 8GBi files gets hashed over and over unnecessarily means that we
>> would do the same for an empty file, opening, reading 0-bytes,
>> hashing, and closing, without taking advantage of the fact that
>> CE_UPTODATE bit says the file contents should be up-to-date with
>> respect to the cached object name, doesn't it?
>>
>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>> usual open-read-hash-close dance" logic (I didn't check)?
>
> Junio C Hamano
>
> As best as I can tell, it rechecks the zero sized files. My Linux box can run
> git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
> file that is a multiple of 2^32 with every "git ls" on the other hand...
>
> I managed to actually compile git with the proposed changes.

Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte instead
of 0 bytes?  Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?

> It seems to correct
> the problem and "make test" passes. If upgrading to the patched version if git,
> git will rehash the 8GBi files once and work normally. If downgrading to an
> unpatched version, git will perceive that the 8GBi files have changes. This
> needs to be corrected with "git add" or "git checkout".

Not nice, but safe.  Can there be an unsafe scenario as well?  Like if a
4GiB file gets added to the index by the new version, which records a
size of 1, then the file is extended by one byte while mtime stays the
same and then an old git won't detect the change?

> If you people are
> interested, I may be able to find a way to send a patch to the list or put it
> on github.

Patches are always welcome, they make discussions and testing easier.

René

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-05 21:04 ` René Scharfe
@ 2022-05-05 22:55   ` Philip Oakley
  2022-05-06  0:22     ` [Email External to GFS] " Jason Hatton
  0 siblings, 1 reply; 9+ messages in thread
From: Philip Oakley @ 2022-05-05 22:55 UTC (permalink / raw)
  To: René Scharfe, Jason Hatton, git@vger.kernel.org; +Cc: Junio C Hamano

On 05/05/2022 22:04, René Scharfe wrote:
> Am 04.05.22 um 19:47 schrieb Jason Hatton:
>>>> The condition sd_size==0 is used as a signal for "no, we really need
>>>> to compare the contents", and causes the contents to be hashed, and
>>>> if the contents match the object name recorded in the index, the
>>>> on-disk size is stored in sd_size and the entry is marked as
>>>> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
>>>> entry would have sd_size==0 again, so a workaround like what you
>>>> outlined is needed.
>>> Junio C Hamano <gitster@pobox.com> writes:
>>>
>>> This is of secondary importance, but the fact that Jason observed
>>> 8GBi files gets hashed over and over unnecessarily means that we
>>> would do the same for an empty file, opening, reading 0-bytes,
>>> hashing, and closing, without taking advantage of the fact that
>>> CE_UPTODATE bit says the file contents should be up-to-date with
>>> respect to the cached object name, doesn't it?
>>>
>>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>>> usual open-read-hash-close dance" logic (I didn't check)?
>> Junio C Hamano
>>
>> As best as I can tell, it rechecks the zero sized files. My Linux box can run
>> git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
>> file that is a multiple of 2^32 with every "git ls" on the other hand...
>>
>> I managed to actually compile git with the proposed changes.
> Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte instead
> of 0 bytes?  Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?

My thought on this. after considering a few options, would be that the
'sign bit' of the uint32_t size should be set to 1 when the high word of
the 64 bit filesize value is non zero.

This would result in file sizes of 0 to 4GiB-1 retaining their existing
values, and those from 4GiB onward produces a down-folded 2GiB to 4GiB-1
values.

This would mean, That we are able to detect almost all incremental and
decremental changes in filesizes, as well as retaining the 'zero is
racy' flag aspect.
>> It seems to correct
>> the problem and "make test" passes. If upgrading to the patched version if git,
>> git will rehash the 8GBi files once and work normally. If downgrading to an
>> unpatched version, git will perceive that the 8GBi files have changes. This
>> needs to be corrected with "git add" or "git checkout".
> Not nice, but safe.  Can there be an unsafe scenario as well?  Like if a
> 4GiB file gets added to the index by the new version, which records a
> size of 1, then the file is extended by one byte while mtime stays the
> same and then an old git won't detect the change?

There is still some potential for different Git versions to be
'confused' for these very large files, but I feel that it's relatively
safe (no worse than the 'set to unity' idea). For large files we will
always have that loss of precision at the 32bit rollover. It just a case
of choosing a least worst.

I haven't considered if my proposed 'truncation' overhead would be fast
code.

>> If you people are
>> interested, I may be able to find a way to send a patch to the list or put it
>> on github.
> Patches are always welcome, they make discussions and testing easier.
>
> René
Philip

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [Email External to GFS] Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-05 22:55   ` Philip Oakley
@ 2022-05-06  0:22     ` Jason Hatton
  2022-05-06  9:40       ` Philip Oakley
  2022-05-07  5:19       ` Carlo Marcelo Arenas Belón
  0 siblings, 2 replies; 9+ messages in thread
From: Jason Hatton @ 2022-05-06  0:22 UTC (permalink / raw)
  To: Philip Oakley, René Scharfe, git@vger.kernel.org; +Cc: Junio C Hamano

>On 05/05/2022 22:04, René Scharfe wrote:
>> Am 04.05.22 um 19:47 schrieb Jason Hatton:
>>>>> The condition sd_size==0 is used as a signal for "no, we really need
>>>>> to compare the contents", and causes the contents to be hashed, and
>>>>> if the contents match the object name recorded in the index, the
>>>>> on-disk size is stored in sd_size and the entry is marked as
>>>>> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
>>>>> entry would have sd_size==0 again, so a workaround like what you
>>>>> outlined is needed.
>>>> Junio C Hamano <gitster@pobox.com> writes:
>>>>
>>>> This is of secondary importance, but the fact that Jason observed
>>>> 8GBi files gets hashed over and over unnecessarily means that we
>>>> would do the same for an empty file, opening, reading 0-bytes,
>>>> hashing, and closing, without taking advantage of the fact that
>>>> CE_UPTODATE bit says the file contents should be up-to-date with
>>>> respect to the cached object name, doesn't it?
>>>>
>>>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>>>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>>>> usual open-read-hash-close dance" logic (I didn't check)?
>>> Junio C Hamano
>>>
>>> As best as I can tell, it rechecks the zero sized files. My Linux box can run
>>> git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
>>> file that is a multiple of 2^32 with every "git ls" on the other hand...
>>>
>>> I managed to actually compile git with the proposed changes.
>> Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte instead
>> of 0 bytes?  Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?
>
>My thought on this. after considering a few options, would be that the
>'sign bit' of the uint32_t size should be set to 1 when the high word of
>the 64 bit filesize value is non zero.
>
>This would result in file sizes of 0 to 4GiB-1 retaining their existing
>values, and those from 4GiB onward produces a down-folded 2GiB to 4GiB-1
>values.

I believe it would be best to only change the behavior of files that are
multiples of 2^32 exactly. Changing the behavior of all files larger than
4GBi may not be good. I like the idea of using 0x80000000 instead of 1.

>This would mean, That we are able to detect almost all incremental and
>decremental changes in filesizes, as well as retaining the 'zero is
>racy' flag aspect.
>>> It seems to correct
>>> the problem and "make test" passes. If upgrading to the patched version if git,
>>> git will rehash the 8GBi files once and work normally. If downgrading to an
>>> unpatched version, git will perceive that the 8GBi files have changes. This
>>> needs to be corrected with "git add" or "git checkout".
>> Not nice, but safe.  Can there be an unsafe scenario as well?  Like if a
>> 4GiB file gets added to the index by the new version, which records a
>> size of 1, then the file is extended by one byte while mtime stays the
>> same and then an old git won't detect the change?
>
>There is still some potential for different Git versions to be
>'confused' for these very large files, but I feel that it's relatively
>safe (no worse than the 'set to unity' idea). For large files we will
>always have that loss of precision at the 32bit rollover. It just a case
>of choosing a least worst.
>
>I haven't considered if my proposed 'truncation' overhead would be fast
>code.
>
>>> If you people are
>>> interested, I may be able to find a way to send a patch to the list or put it
>>> on github.
>> Patches are always welcome, they make discussions and testing easier.
>>
>> René
>Philip

I have a patch file, but I'm not sure how to actually submit it. I'm going to
attempt using outlook.

Jason


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Email External to GFS] Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-06  0:22     ` [Email External to GFS] " Jason Hatton
@ 2022-05-06  9:40       ` Philip Oakley
  2022-05-07  5:19       ` Carlo Marcelo Arenas Belón
  1 sibling, 0 replies; 9+ messages in thread
From: Philip Oakley @ 2022-05-06  9:40 UTC (permalink / raw)
  To: Jason Hatton, René Scharfe, git@vger.kernel.org; +Cc: Junio C Hamano

On 06/05/2022 01:22, Jason Hatton wrote:
> I believe it would be best to only change the behavior of files that are
> multiples of 2^32 exactly. Changing the behavior of all files larger than
> 4GBi may not be good. I like the idea of using 0x80000000 instead of 1.

I think, having had a further think, that the 0x80000000 would be a
better balance as it would, as you say, only affect exact multiples of
4GiB, and fix the 'racy' confusion.

> I have a patch file, but I'm not sure how to actually submit it. I'm going to
> attempt using outlook.
>
> Jason
You maybe able to use the `git format-patch` to generate an easily
applied patch, and `git send-email` to post it to the list without
damage from outlook/gmail/whatever.

Se the bottom part of
https://public-inbox.org/git/CY4PR16MB165558026798A6E51E6A005FAFC59@CY4PR16MB1655.namprd16.prod.outlook.com/
which has "Reply instructions:"

or just start with a simple .patch or .txt attachment (or in-line) of
the diff to get a good visual.

The full discussion about submitting patches is at
https://github.com/git/git/blob/master/Documentation/SubmittingPatches
(and other hosts), but don't let perfect be the enemy of useful;

--
Philip



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Email External to GFS] Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-06  0:22     ` [Email External to GFS] " Jason Hatton
  2022-05-06  9:40       ` Philip Oakley
@ 2022-05-07  5:19       ` Carlo Marcelo Arenas Belón
  1 sibling, 0 replies; 9+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2022-05-07  5:19 UTC (permalink / raw)
  To: Jason Hatton
  Cc: Philip Oakley, René Scharfe, git@vger.kernel.org,
	Junio C Hamano

On Fri, May 06, 2022 at 12:22:36AM +0000, Jason Hatton wrote:
> 
> I have a patch file, but I'm not sure how to actually submit it. I'm going to
> attempt using outlook.

While I haven't used it as much as I should, gitgitgadget[1] allows you to
send your patch through a github PR if that is your thing, which is also
very handy to keep track of feedback when there is too much of it and you
prefer a web interface.

Carlo

[1] https://github.com/gitgitgadget/gitgitgadget

PS. you'll need someone to approve you, but I'll be happy to if you don't know
    anyone else that might be using it.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-05-07  5:19 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-04 17:47 Git status extremely slow if any file is a multiple of 8GBi Jason Hatton
2022-05-05 21:04 ` René Scharfe
2022-05-05 22:55   ` Philip Oakley
2022-05-06  0:22     ` [Email External to GFS] " Jason Hatton
2022-05-06  9:40       ` Philip Oakley
2022-05-07  5:19       ` Carlo Marcelo Arenas Belón
  -- strict thread matches above, loose matches on Subject: below --
2022-05-04  0:15 Jason Hatton
2022-05-04 13:55 ` Junio C Hamano
2022-05-04 16:08   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).