Git status extremely slow if any file is a multiple of 8GBi

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Git status extremely slow if any file is a multiple of 8GBi
@ 2022-05-04  0:15 Jason Hatton
  2022-05-04 13:55 ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Hatton @ 2022-05-04  0:15 UTC (permalink / raw)
  To: git@vger.kernel.org

Git is extremely slow if any file is a multiple of 8GBi. Below is
a working example.

> git init
> dd if=/dev/zero of=8gb bs=1024 count=$((1024*1024*8))
> git lfs track 8gb
> git add .
> git commit
> time git status
> time git status
> time git status

This seems related to git using uint32_t for storing the file size in the index.
A file of zero size has a special meaning.

I have an open issue on git-for-windows where @dscho has a lot of good technical
information.

https://github.com/git-for-windows/git/issues/3833#issuecomment-1116544918

I have a proposed idea that may or may not help. Would it be possible for any
file that is a multiple of 2^32 to be adjusted to a file size of 1 instead of
zero? Git is already functioning with mangling file sizes over 4GB in the
index, so maybe bumping up the size of 2^32 multiple files would mitigate the
issue. Using different versions of git would just cause 2^32 multiple files to
be re-checked again, which is the current behavior anyway. 0 would retain its
special meaning. If the file size is changed from 2^32 to 2^32 + 1, git may not
notice that the file changed because both files will look like they are size 1.
The ctime and mtime may still catch this.

// Some pseudo code near: 
https://github.com/git-for-windows/git/blob/dc88e3cd72a2f0bbe2fe513acfc72bd66b577851/read-cache.c#L176sd-

sd->sd_size = st->st_size;
if (sd->sd_size == 0 && st->st_size != 0) {
    sd->sd_size = 1;
}

Thanks
--
Jason D. Hatton

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-04  0:15 Git status extremely slow if any file is a multiple of 8GBi Jason Hatton
@ 2022-05-04 13:55 ` Junio C Hamano
  2022-05-04 16:08   ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2022-05-04 13:55 UTC (permalink / raw)
  To: Jason Hatton; +Cc: git@vger.kernel.org

Jason Hatton <jhatton@globalfinishing.com> writes:

> I have a proposed idea that may or may not help. Would it be possible for any
> file that is a multiple of 2^32 to be adjusted to a file size of 1 instead of
> zero? Git is already functioning with mangling file sizes over 4GB in the
> index, so maybe bumping up the size of 2^32 multiple files would mitigate the
> issue.

Clever.  

The condition sd_size==0 is used as a signal for "no, we really need
to compare the contents", and causes the contents to be hashed, and
if the contents match the object name recorded in the index, the
on-disk size is stored in sd_size and the entry is marked as
CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
entry would have sd_size==0 again, so a workaround like what you
outlined is needed.

You'd need to make sure that you tweaked

	if (sd->sd_size != (unsigned int) st->st_size)
		changed |= DATA_CHANGED;

that appears at the end of read-cache.c::match_stat_data() in a way
that is consistent with how you munge the sd_size member in
fill_stat_data(), if you are going to go that route.

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-04 13:55 ` Junio C Hamano
@ 2022-05-04 16:08   ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2022-05-04 16:08 UTC (permalink / raw)
  To: git@vger.kernel.org; +Cc: Jason Hatton

Junio C Hamano <gitster@pobox.com> writes:

> The condition sd_size==0 is used as a signal for "no, we really need
> to compare the contents", and causes the contents to be hashed, and
> if the contents match the object name recorded in the index, the
> on-disk size is stored in sd_size and the entry is marked as
> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
> entry would have sd_size==0 again, so a workaround like what you
> outlined is needed.

This is of secondary importance, but the fact that Jason observed
8GBi files gets hashed over and over unnecessarily means that we
would do the same for an empty file, opening, reading 0-bytes,
hashing, and closing, without taking advantage of the fact that
CE_UPTODATE bit says the file contents should be up-to-date with
respect to the cached object name, doesn't it?

Or do we have "if st_size == 0 and sd_size == 0 then we know what it
hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
usual open-read-hash-close dance" logic (I didn't check)?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
@ 2022-05-04 17:47 Jason Hatton
  2022-05-05 21:04 ` René Scharfe
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Hatton @ 2022-05-04 17:47 UTC (permalink / raw)
  To: git@vger.kernel.org; +Cc: Junio C Hamano

> > The condition sd_size==0 is used as a signal for "no, we really need
> > to compare the contents", and causes the contents to be hashed, and
> > if the contents match the object name recorded in the index, the
> > on-disk size is stored in sd_size and the entry is marked as
> > CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
> > entry would have sd_size==0 again, so a workaround like what you
> > outlined is needed.
> 
> Junio C Hamano <gitster@pobox.com> writes:
>
> This is of secondary importance, but the fact that Jason observed
> 8GBi files gets hashed over and over unnecessarily means that we
> would do the same for an empty file, opening, reading 0-bytes,
> hashing, and closing, without taking advantage of the fact that
> CE_UPTODATE bit says the file contents should be up-to-date with
> respect to the cached object name, doesn't it?
> 
> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
> usual open-read-hash-close dance" logic (I didn't check)?

Junio C Hamano

As best as I can tell, it rechecks the zero sized files. My Linux box can run
git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
file that is a multiple of 2^32 with every "git ls" on the other hand...

I managed to actually compile git with the proposed changes. It seems to correct
the problem and "make test" passes. If upgrading to the patched version if git,
git will rehash the 8GBi files once and work normally. If downgrading to an
unpatched version, git will perceive that the 8GBi files have changes. This
needs to be corrected with "git add" or "git checkout". If you people are
interested, I may be able to find a way to send a patch to the list or put it
on github.

Thanks
Jason D. Hatton

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-04 17:47 Jason Hatton
@ 2022-05-05 21:04 ` René Scharfe
  2022-05-05 22:55   ` Philip Oakley
  0 siblings, 1 reply; 6+ messages in thread
From: René Scharfe @ 2022-05-05 21:04 UTC (permalink / raw)
  To: Jason Hatton, git@vger.kernel.org; +Cc: Junio C Hamano

Am 04.05.22 um 19:47 schrieb Jason Hatton:
>>> The condition sd_size==0 is used as a signal for "no, we really need
>>> to compare the contents", and causes the contents to be hashed, and
>>> if the contents match the object name recorded in the index, the
>>> on-disk size is stored in sd_size and the entry is marked as
>>> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
>>> entry would have sd_size==0 again, so a workaround like what you
>>> outlined is needed.
>>
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>> This is of secondary importance, but the fact that Jason observed
>> 8GBi files gets hashed over and over unnecessarily means that we
>> would do the same for an empty file, opening, reading 0-bytes,
>> hashing, and closing, without taking advantage of the fact that
>> CE_UPTODATE bit says the file contents should be up-to-date with
>> respect to the cached object name, doesn't it?
>>
>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>> usual open-read-hash-close dance" logic (I didn't check)?
>
> Junio C Hamano
>
> As best as I can tell, it rechecks the zero sized files. My Linux box can run
> git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
> file that is a multiple of 2^32 with every "git ls" on the other hand...
>
> I managed to actually compile git with the proposed changes.

Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte instead
of 0 bytes?  Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?

> It seems to correct
> the problem and "make test" passes. If upgrading to the patched version if git,
> git will rehash the 8GBi files once and work normally. If downgrading to an
> unpatched version, git will perceive that the 8GBi files have changes. This
> needs to be corrected with "git add" or "git checkout".

Not nice, but safe.  Can there be an unsafe scenario as well?  Like if a
4GiB file gets added to the index by the new version, which records a
size of 1, then the file is extended by one byte while mtime stays the
same and then an old git won't detect the change?

> If you people are
> interested, I may be able to find a way to send a patch to the list or put it
> on github.

Patches are always welcome, they make discussions and testing easier.

René

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-05 21:04 ` René Scharfe
@ 2022-05-05 22:55   ` Philip Oakley
  0 siblings, 0 replies; 6+ messages in thread
From: Philip Oakley @ 2022-05-05 22:55 UTC (permalink / raw)
  To: René Scharfe, Jason Hatton, git@vger.kernel.org; +Cc: Junio C Hamano

On 05/05/2022 22:04, René Scharfe wrote:
> Am 04.05.22 um 19:47 schrieb Jason Hatton:
>>>> The condition sd_size==0 is used as a signal for "no, we really need
>>>> to compare the contents", and causes the contents to be hashed, and
>>>> if the contents match the object name recorded in the index, the
>>>> on-disk size is stored in sd_size and the entry is marked as
>>>> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
>>>> entry would have sd_size==0 again, so a workaround like what you
>>>> outlined is needed.
>>> Junio C Hamano <gitster@pobox.com> writes:
>>>
>>> This is of secondary importance, but the fact that Jason observed
>>> 8GBi files gets hashed over and over unnecessarily means that we
>>> would do the same for an empty file, opening, reading 0-bytes,
>>> hashing, and closing, without taking advantage of the fact that
>>> CE_UPTODATE bit says the file contents should be up-to-date with
>>> respect to the cached object name, doesn't it?
>>>
>>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>>> usual open-read-hash-close dance" logic (I didn't check)?
>> Junio C Hamano
>>
>> As best as I can tell, it rechecks the zero sized files. My Linux box can run
>> git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
>> file that is a multiple of 2^32 with every "git ls" on the other hand...
>>
>> I managed to actually compile git with the proposed changes.
> Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte instead
> of 0 bytes?  Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?

My thought on this. after considering a few options, would be that the
'sign bit' of the uint32_t size should be set to 1 when the high word of
the 64 bit filesize value is non zero.

This would result in file sizes of 0 to 4GiB-1 retaining their existing
values, and those from 4GiB onward produces a down-folded 2GiB to 4GiB-1
values.

This would mean, That we are able to detect almost all incremental and
decremental changes in filesizes, as well as retaining the 'zero is
racy' flag aspect.
>> It seems to correct
>> the problem and "make test" passes. If upgrading to the patched version if git,
>> git will rehash the 8GBi files once and work normally. If downgrading to an
>> unpatched version, git will perceive that the 8GBi files have changes. This
>> needs to be corrected with "git add" or "git checkout".
> Not nice, but safe.  Can there be an unsafe scenario as well?  Like if a
> 4GiB file gets added to the index by the new version, which records a
> size of 1, then the file is extended by one byte while mtime stays the
> same and then an old git won't detect the change?

There is still some potential for different Git versions to be
'confused' for these very large files, but I feel that it's relatively
safe (no worse than the 'set to unity' idea). For large files we will
always have that loss of precision at the 32bit rollover. It just a case
of choosing a least worst.

I haven't considered if my proposed 'truncation' overhead would be fast
code.

>> If you people are
>> interested, I may be able to find a way to send a patch to the list or put it
>> on github.
> Patches are always welcome, they make discussions and testing easier.
>
> René
Philip

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-05-05 22:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-04  0:15 Git status extremely slow if any file is a multiple of 8GBi Jason Hatton
2022-05-04 13:55 ` Junio C Hamano
2022-05-04 16:08   ` Junio C Hamano
  -- strict thread matches above, loose matches on Subject: below --
2022-05-04 17:47 Jason Hatton
2022-05-05 21:04 ` René Scharfe
2022-05-05 22:55   ` Philip Oakley

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).