Re: Git status extremely slow if any file is a multiple of 8GBi

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Re: Git status extremely slow if any file is a multiple of 8GBi
@ 2022-05-04 17:47 Jason Hatton
  2022-05-05 21:04 ` René Scharfe
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Hatton @ 2022-05-04 17:47 UTC (permalink / raw)
  To: git@vger.kernel.org; +Cc: Junio C Hamano

> > The condition sd_size==0 is used as a signal for "no, we really need
> > to compare the contents", and causes the contents to be hashed, and
> > if the contents match the object name recorded in the index, the
> > on-disk size is stored in sd_size and the entry is marked as
> > CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
> > entry would have sd_size==0 again, so a workaround like what you
> > outlined is needed.
> 
> Junio C Hamano <gitster@pobox.com> writes:
>
> This is of secondary importance, but the fact that Jason observed
> 8GBi files gets hashed over and over unnecessarily means that we
> would do the same for an empty file, opening, reading 0-bytes,
> hashing, and closing, without taking advantage of the fact that
> CE_UPTODATE bit says the file contents should be up-to-date with
> respect to the cached object name, doesn't it?
> 
> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
> usual open-read-hash-close dance" logic (I didn't check)?

Junio C Hamano

As best as I can tell, it rechecks the zero sized files. My Linux box can run
git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
file that is a multiple of 2^32 with every "git ls" on the other hand...

I managed to actually compile git with the proposed changes. It seems to correct
the problem and "make test" passes. If upgrading to the patched version if git,
git will rehash the 8GBi files once and work normally. If downgrading to an
unpatched version, git will perceive that the 8GBi files have changes. This
needs to be corrected with "git add" or "git checkout". If you people are
interested, I may be able to find a way to send a patch to the list or put it
on github.

Thanks
Jason D. Hatton

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-04 17:47 Git status extremely slow if any file is a multiple of 8GBi Jason Hatton
@ 2022-05-05 21:04 ` René Scharfe
  2022-05-05 22:55   ` Philip Oakley
  0 siblings, 1 reply; 6+ messages in thread
From: René Scharfe @ 2022-05-05 21:04 UTC (permalink / raw)
  To: Jason Hatton, git@vger.kernel.org; +Cc: Junio C Hamano

Am 04.05.22 um 19:47 schrieb Jason Hatton:
>>> The condition sd_size==0 is used as a signal for "no, we really need
>>> to compare the contents", and causes the contents to be hashed, and
>>> if the contents match the object name recorded in the index, the
>>> on-disk size is stored in sd_size and the entry is marked as
>>> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
>>> entry would have sd_size==0 again, so a workaround like what you
>>> outlined is needed.
>>
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>> This is of secondary importance, but the fact that Jason observed
>> 8GBi files gets hashed over and over unnecessarily means that we
>> would do the same for an empty file, opening, reading 0-bytes,
>> hashing, and closing, without taking advantage of the fact that
>> CE_UPTODATE bit says the file contents should be up-to-date with
>> respect to the cached object name, doesn't it?
>>
>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>> usual open-read-hash-close dance" logic (I didn't check)?
>
> Junio C Hamano
>
> As best as I can tell, it rechecks the zero sized files. My Linux box can run
> git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
> file that is a multiple of 2^32 with every "git ls" on the other hand...
>
> I managed to actually compile git with the proposed changes.

Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte instead
of 0 bytes?  Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?

> It seems to correct
> the problem and "make test" passes. If upgrading to the patched version if git,
> git will rehash the 8GBi files once and work normally. If downgrading to an
> unpatched version, git will perceive that the 8GBi files have changes. This
> needs to be corrected with "git add" or "git checkout".

Not nice, but safe.  Can there be an unsafe scenario as well?  Like if a
4GiB file gets added to the index by the new version, which records a
size of 1, then the file is extended by one byte while mtime stays the
same and then an old git won't detect the change?

> If you people are
> interested, I may be able to find a way to send a patch to the list or put it
> on github.

Patches are always welcome, they make discussions and testing easier.

René

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-05 21:04 ` René Scharfe
@ 2022-05-05 22:55   ` Philip Oakley
  2022-05-06  0:22     ` [Email External to GFS] " Jason Hatton
  0 siblings, 1 reply; 6+ messages in thread
From: Philip Oakley @ 2022-05-05 22:55 UTC (permalink / raw)
  To: René Scharfe, Jason Hatton, git@vger.kernel.org; +Cc: Junio C Hamano

On 05/05/2022 22:04, René Scharfe wrote:
> Am 04.05.22 um 19:47 schrieb Jason Hatton:
>>>> The condition sd_size==0 is used as a signal for "no, we really need
>>>> to compare the contents", and causes the contents to be hashed, and
>>>> if the contents match the object name recorded in the index, the
>>>> on-disk size is stored in sd_size and the entry is marked as
>>>> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
>>>> entry would have sd_size==0 again, so a workaround like what you
>>>> outlined is needed.
>>> Junio C Hamano <gitster@pobox.com> writes:
>>>
>>> This is of secondary importance, but the fact that Jason observed
>>> 8GBi files gets hashed over and over unnecessarily means that we
>>> would do the same for an empty file, opening, reading 0-bytes,
>>> hashing, and closing, without taking advantage of the fact that
>>> CE_UPTODATE bit says the file contents should be up-to-date with
>>> respect to the cached object name, doesn't it?
>>>
>>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>>> usual open-read-hash-close dance" logic (I didn't check)?
>> Junio C Hamano
>>
>> As best as I can tell, it rechecks the zero sized files. My Linux box can run
>> git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
>> file that is a multiple of 2^32 with every "git ls" on the other hand...
>>
>> I managed to actually compile git with the proposed changes.
> Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte instead
> of 0 bytes?  Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?

My thought on this. after considering a few options, would be that the
'sign bit' of the uint32_t size should be set to 1 when the high word of
the 64 bit filesize value is non zero.

This would result in file sizes of 0 to 4GiB-1 retaining their existing
values, and those from 4GiB onward produces a down-folded 2GiB to 4GiB-1
values.

This would mean, That we are able to detect almost all incremental and
decremental changes in filesizes, as well as retaining the 'zero is
racy' flag aspect.
>> It seems to correct
>> the problem and "make test" passes. If upgrading to the patched version if git,
>> git will rehash the 8GBi files once and work normally. If downgrading to an
>> unpatched version, git will perceive that the 8GBi files have changes. This
>> needs to be corrected with "git add" or "git checkout".
> Not nice, but safe.  Can there be an unsafe scenario as well?  Like if a
> 4GiB file gets added to the index by the new version, which records a
> size of 1, then the file is extended by one byte while mtime stays the
> same and then an old git won't detect the change?

There is still some potential for different Git versions to be
'confused' for these very large files, but I feel that it's relatively
safe (no worse than the 'set to unity' idea). For large files we will
always have that loss of precision at the 32bit rollover. It just a case
of choosing a least worst.

I haven't considered if my proposed 'truncation' overhead would be fast
code.

>> If you people are
>> interested, I may be able to find a way to send a patch to the list or put it
>> on github.
> Patches are always welcome, they make discussions and testing easier.
>
> René
Philip

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [Email External to GFS] Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-05 22:55   ` Philip Oakley
@ 2022-05-06  0:22     ` Jason Hatton
  2022-05-06  9:40       ` Philip Oakley
  2022-05-07  5:19       ` Carlo Marcelo Arenas Belón
  0 siblings, 2 replies; 6+ messages in thread
From: Jason Hatton @ 2022-05-06  0:22 UTC (permalink / raw)
  To: Philip Oakley, René Scharfe, git@vger.kernel.org; +Cc: Junio C Hamano

>On 05/05/2022 22:04, René Scharfe wrote:
>> Am 04.05.22 um 19:47 schrieb Jason Hatton:
>>>>> The condition sd_size==0 is used as a signal for "no, we really need
>>>>> to compare the contents", and causes the contents to be hashed, and
>>>>> if the contents match the object name recorded in the index, the
>>>>> on-disk size is stored in sd_size and the entry is marked as
>>>>> CE_UPTODATE.  Alas, if the truncated st_size is 0, the resulting
>>>>> entry would have sd_size==0 again, so a workaround like what you
>>>>> outlined is needed.
>>>> Junio C Hamano <gitster@pobox.com> writes:
>>>>
>>>> This is of secondary importance, but the fact that Jason observed
>>>> 8GBi files gets hashed over and over unnecessarily means that we
>>>> would do the same for an empty file, opening, reading 0-bytes,
>>>> hashing, and closing, without taking advantage of the fact that
>>>> CE_UPTODATE bit says the file contents should be up-to-date with
>>>> respect to the cached object name, doesn't it?
>>>>
>>>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>>>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>>>> usual open-read-hash-close dance" logic (I didn't check)?
>>> Junio C Hamano
>>>
>>> As best as I can tell, it rechecks the zero sized files. My Linux box can run
>>> git ls in .006 seconds with 1000 zero sized files in the repo. Rehashing every
>>> file that is a multiple of 2^32 with every "git ls" on the other hand...
>>>
>>> I managed to actually compile git with the proposed changes.
>> Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte instead
>> of 0 bytes?  Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?
>
>My thought on this. after considering a few options, would be that the
>'sign bit' of the uint32_t size should be set to 1 when the high word of
>the 64 bit filesize value is non zero.
>
>This would result in file sizes of 0 to 4GiB-1 retaining their existing
>values, and those from 4GiB onward produces a down-folded 2GiB to 4GiB-1
>values.

I believe it would be best to only change the behavior of files that are
multiples of 2^32 exactly. Changing the behavior of all files larger than
4GBi may not be good. I like the idea of using 0x80000000 instead of 1.

>This would mean, That we are able to detect almost all incremental and
>decremental changes in filesizes, as well as retaining the 'zero is
>racy' flag aspect.
>>> It seems to correct
>>> the problem and "make test" passes. If upgrading to the patched version if git,
>>> git will rehash the 8GBi files once and work normally. If downgrading to an
>>> unpatched version, git will perceive that the 8GBi files have changes. This
>>> needs to be corrected with "git add" or "git checkout".
>> Not nice, but safe.  Can there be an unsafe scenario as well?  Like if a
>> 4GiB file gets added to the index by the new version, which records a
>> size of 1, then the file is extended by one byte while mtime stays the
>> same and then an old git won't detect the change?
>
>There is still some potential for different Git versions to be
>'confused' for these very large files, but I feel that it's relatively
>safe (no worse than the 'set to unity' idea). For large files we will
>always have that loss of precision at the 32bit rollover. It just a case
>of choosing a least worst.
>
>I haven't considered if my proposed 'truncation' overhead would be fast
>code.
>
>>> If you people are
>>> interested, I may be able to find a way to send a patch to the list or put it
>>> on github.
>> Patches are always welcome, they make discussions and testing easier.
>>
>> René
>Philip

I have a patch file, but I'm not sure how to actually submit it. I'm going to
attempt using outlook.

Jason


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Email External to GFS] Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-06  0:22     ` [Email External to GFS] " Jason Hatton
@ 2022-05-06  9:40       ` Philip Oakley
  2022-05-07  5:19       ` Carlo Marcelo Arenas Belón
  1 sibling, 0 replies; 6+ messages in thread
From: Philip Oakley @ 2022-05-06  9:40 UTC (permalink / raw)
  To: Jason Hatton, René Scharfe, git@vger.kernel.org; +Cc: Junio C Hamano

On 06/05/2022 01:22, Jason Hatton wrote:
> I believe it would be best to only change the behavior of files that are
> multiples of 2^32 exactly. Changing the behavior of all files larger than
> 4GBi may not be good. I like the idea of using 0x80000000 instead of 1.

I think, having had a further think, that the 0x80000000 would be a
better balance as it would, as you say, only affect exact multiples of
4GiB, and fix the 'racy' confusion.

> I have a patch file, but I'm not sure how to actually submit it. I'm going to
> attempt using outlook.
>
> Jason
You maybe able to use the `git format-patch` to generate an easily
applied patch, and `git send-email` to post it to the list without
damage from outlook/gmail/whatever.

Se the bottom part of
https://public-inbox.org/git/CY4PR16MB165558026798A6E51E6A005FAFC59@CY4PR16MB1655.namprd16.prod.outlook.com/
which has "Reply instructions:"

or just start with a simple .patch or .txt attachment (or in-line) of
the diff to get a good visual.

The full discussion about submitting patches is at
https://github.com/git/git/blob/master/Documentation/SubmittingPatches
(and other hosts), but don't let perfect be the enemy of useful;

--
Philip



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Email External to GFS] Re: Git status extremely slow if any file is a multiple of 8GBi
  2022-05-06  0:22     ` [Email External to GFS] " Jason Hatton
  2022-05-06  9:40       ` Philip Oakley
@ 2022-05-07  5:19       ` Carlo Marcelo Arenas Belón
  1 sibling, 0 replies; 6+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2022-05-07  5:19 UTC (permalink / raw)
  To: Jason Hatton
  Cc: Philip Oakley, René Scharfe, git@vger.kernel.org,
	Junio C Hamano

On Fri, May 06, 2022 at 12:22:36AM +0000, Jason Hatton wrote:
> 
> I have a patch file, but I'm not sure how to actually submit it. I'm going to
> attempt using outlook.

While I haven't used it as much as I should, gitgitgadget[1] allows you to
send your patch through a github PR if that is your thing, which is also
very handy to keep track of feedback when there is too much of it and you
prefer a web interface.

Carlo

[1] https://github.com/gitgitgadget/gitgitgadget

PS. you'll need someone to approve you, but I'll be happy to if you don't know
    anyone else that might be using it.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-05-07  5:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-04 17:47 Git status extremely slow if any file is a multiple of 8GBi Jason Hatton
2022-05-05 21:04 ` René Scharfe
2022-05-05 22:55   ` Philip Oakley
2022-05-06  0:22     ` [Email External to GFS] " Jason Hatton
2022-05-06  9:40       ` Philip Oakley
2022-05-07  5:19       ` Carlo Marcelo Arenas Belón

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).