git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* RE: [PATCH] Prevent git from rehashing 4GBi files
       [not found] <CY4PR16MB165501ED1B535592033C76F2AFC49@CY4PR16MB1655.namprd16.prod.outlook.com>
@ 2022-05-07 18:10 ` Jason Hatton
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Hatton @ 2022-05-07 18:10 UTC (permalink / raw)
  To: Philip Oakley; +Cc: René Scharfe, git, Junio C Hamano

> Philip wrote:
>
> Is this on Git for Windows or a 64 bit Linux?
> There are still some issues on GfW for 2GiB+ files (long Vs long long int).

It was on GfW 64 bit. The Linux 64 bit one seems OK.

Further testing shows that unpatched GfW 64 bit chokes on 4GiB files. Just
use git add and git fsck. At least this patch seems to work on Linux
and works on GfW when using LFS.

--
Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
  2022-05-10 22:45 ` Philip Oakley
@ 2022-05-11 22:24   ` Philip Oakley
  0 siblings, 0 replies; 14+ messages in thread
From: Philip Oakley @ 2022-05-11 22:24 UTC (permalink / raw)
  To: Jason D. Hatton, Jason Hatton, Junio C Hamano; +Cc: git, l.s.r

On 11/05/2022 18:47, Jason D. Hatton wrote:
>> Is there a problem that 1<<31, when on a 32bit long is MAX_NEG, 
>> rather than being MAX_POS? And the size would need to be positive to 
>> be an acceptable file size?
>> (The code is a bit of a mish-mash on the Windows LLP64 side, where 
>> long is only 32 bits).
>>
>> Philip
>> Apologies for the terseness.
>
> Philip
>
> I made a little test script and tried out several different
> things.
>
> tldr; It didn't make any difference.
>
> Files tested:
> 1, 2 and 4 GiB with and without LFS. Tested with 0, 1, 1<<30,
> and 1<<31 mung builds. I'm only listing the problems unless
> stated otherwise. The mung didn't appear to introduce any
> new issues with my limited tests.
>
> git 2.36.0.windows.1 release:    fails on 4GiB w/o LFS - corrupts pack 
> file
>    git status is very slow.
>    Sometimes stores zero file instead of corruption.
>
> git 2.36.0.windows.1 custom compile w/o patches:
>    fails on 4GiB w/o LFS - stores zero file
>    git status is very slow.
>
> git 2.36.0.windows.1 with 1U<<31 mung:
>    fails on 4GiB w/o LFS - stores zero file
>
> git 2.36.0.windows.1 with 1U<<30 mung:
>    fails on 4GiB w/o LFS - stores zero file
>
> git 2.36.0.windows.1 with 1 mung:
>    fails on 4GiB w/o LFS - stores zero file
>
> git 2.36.0 Ubuntu
>    unpatched works, but has the slow status issue.
>
> The test script I used is below:

Without the git-lfs (to grossly shorten the file size in the pack) I 
wasn't expecting much, given the use of 'long' in places in the code 
base for the file sizes, so 2GiB and 4GiB files would likely fail on the 
Windows LP32 parts.

I was under the impression that the core code for packs had been size_t 
hardened, but there may be some paths either in git-lfs or the actual 
file checkout that cause that fail.

There was a previous series by Matt Cooper on:
  "Allow clean/smudge filters to handle huge files in the LLP64 data model"
(https://lore.kernel.org/git/pull.1068.git.1635320952.gitgitgadget@gmail.com/t/#u)
  Merge commit f9ba6acaa9348ea7b733bf78adc2f084247a912f
'mc/clean-smudge-with-llp64'

That series had some in-code checks, and some test-suite tests, though 
the latter classed as EXPENSIVE (i.e. not normally run), which may add 
more insight.

>
>
> #!/bin/sh
>
> GB1=$((1 * 1024*1024*1024))
> GB2=$((2 * 1024*1024*1024))
> GB4=$((4 * 1024*1024*1024))
>
> die()
> {
>    echo "$1"
>    exit 1
> }
>
> test_file()
> {
>    echo "=== TESTING $2 ==="
>    rm -rf .git .gitattributes .gitignore .gitmodules &&
>        git init &&
>        git lfs track '*.big' &&
>        truncate --size "$1" "$2" &&
>        git add "$2" &&
>        git commit -m "$2" &&
>        git fsck &&
>        mv "$2" bak &&
>        git restore "$2" &&
>        cmp "$2" bak || die "$2"
>    git status && timeout 5 git status || die "$2 git status slow"
>    rm -rf .git .gitattributes .gitignore .gitmodules "$2" bak
> }
>
> test_file "$GB1" gb1.big
> test_file "$GB2" gb2.big
> test_file "$GB4" gb4.big
> test_file "$GB1" gb1
> test_file "$GB2" gb2
> test_file "$GB4" gb4
> echo done 
--
Philip

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
  2022-05-07  2:15 Jason Hatton
       [not found] ` <1DFD3E42-3EF3-4420-8E01-748EF3DBE7A1@iee.email>
@ 2022-05-10 22:45 ` Philip Oakley
  2022-05-11 22:24   ` Philip Oakley
  1 sibling, 1 reply; 14+ messages in thread
From: Philip Oakley @ 2022-05-10 22:45 UTC (permalink / raw)
  To: Jason Hatton, Junio C Hamano; +Cc: René Scharfe, git

On 07/05/2022 03:15, Jason Hatton wrote:
>> Philip Oakley <philipoakley@iee.email> writes:
>>
>>>> This may treat non-zero multiple of 4GiB as "not racy", but has
>>>> anybody double checked the concern Réne brought up earlier that a
>>>> 4GiB file that was added and then got rewritten to 2GiB within the
>>>> same second would suddenly start getting treated as not racy?
>>> This is the pre-existing problem, that ~1in 2^31 size changes might not
>>> get noticed for size change. The 0 byte / 4GiB change is an identical
>>> issue, as is changing from 3 bytes to 4GiB+3 bytes, etc., so that's no
>>> worse than before (well maybe twice as 'unlikely').
>> OK, it added one more case to 2^32-1 existing cases, I guess.
>>
>>>> The patch (the firnal version of it anyway) needs to be accompanied
>>>> by a handful of test additions to tickle corner cases like that.
>>> They'd be protected by the EXPENSIVE prerequisite I would assume.
>> Oh, absolutely.  Thanks for spelling that out.
> I have been testing out the patch a bit and have good and (mostly) bad news.
>
> What works using a munge value of 1.
>
> $ git add
> $ git status
>
> Racy seems to work.
>
> $ touch .git/index 4GiB # 4GiB is now racy
> $ git status # Git will rehash the racy file
> $ git status # Git cached the file. Second status is fast.
>
> What doesn't work.
>
> $ git checkout 4GiB
> $ fatal: packed object is corrupt!
>
> Using a munge value of 1<<31 causes even more problems. The file hash in the
> index for 4GiB files (git ls-files -s --debug) are set to the zero file hash.
>
> I looked up and down the code base and couldn't figure out how the munged
> value was leaking out of read-cache.c and breaking things. Most of the code
> I found tends to use stat and then convert that to a size_t, not using the
> munged unsigned int at all.
>
> Maybe someone else will have better luck. This seems over my head :(
>
> Thanks
> --
> Jason
>
Is there a problem that 1<<31, when on a 32bit long is MAX_NEG, rather 
than being MAX_POS? And the size would need to be positive to be an 
acceptable file size?
(The code is a bit of a mish-mash on the Windows LLP64 side, where long 
is only 32 bits).

Philip
Apologies for the terseness.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] Prevent git from rehashing 4GBi files
       [not found] <philipoakley@iee.email>
@ 2022-05-07 18:58 ` Jason D. Hatton
  0 siblings, 0 replies; 14+ messages in thread
From: Jason D. Hatton @ 2022-05-07 18:58 UTC (permalink / raw)
  To: jhatton; +Cc: git, gitster, l.s.r

Git cache stores file sizes using uint32_t. This causes any file
that is a multiple of 2^32 to have a cached file size of zero.
Zero is a special value used by racily clean. This causes git to
rehash every file that is a multiple of 2^32 every time git status
or git commit is run.

This patch mitigates the problem by making all files that are a
multiple of 2^32 appear to have a size of 1 instead of zero.

Signed-off-by: Jason D. Hatton <jhatton@globalfinishing.com>
---
 read-cache.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 4df97e185e..d80c80ef90 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -163,6 +163,22 @@ void rename_index_entry_at(struct index_state *istate, int nr, const char *new_n
 		add_index_entry(istate, new_entry, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE);
 }
 
+/*
+ * stat_data only stores file sizes as an unsigned int. Any file that is an
+ * exact multiple of 4GiB will get a cached file size of zero. Unfortunately,
+ * this is a special flag used by the racy update logic. Substitute a new file
+ * size if a non-zero sized file would would be cached as zero. 1U<<31 is used
+ * as the substitute because it is the furthest away from 0 and 4GiB.
+ */
+static inline unsigned int munge_st_size(off_t st_size) {
+	unsigned int sd_size = (unsigned int) st_size;
+
+	if (!sd_size && st_size)
+		return 1U<<31;
+	else
+		return sd_size;
+}
+
 void fill_stat_data(struct stat_data *sd, struct stat *st)
 {
 	sd->sd_ctime.sec = (unsigned int)st->st_ctime;
@@ -173,7 +189,7 @@ void fill_stat_data(struct stat_data *sd, struct stat *st)
 	sd->sd_ino = st->st_ino;
 	sd->sd_uid = st->st_uid;
 	sd->sd_gid = st->st_gid;
-	sd->sd_size = st->st_size;
+	sd->sd_size = munge_st_size(st->st_size);
 }
 
 int match_stat_data(const struct stat_data *sd, struct stat *st)
@@ -212,7 +228,7 @@ int match_stat_data(const struct stat_data *sd, struct stat *st)
 			changed |= INODE_CHANGED;
 #endif
 
-	if (sd->sd_size != (unsigned int) st->st_size)
+	if (sd->sd_size != munge_st_size(st->st_size))
 		changed |= DATA_CHANGED;
 
 	return changed;
-- 
2.36.0.3


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
       [not found] ` <1DFD3E42-3EF3-4420-8E01-748EF3DBE7A1@iee.email>
@ 2022-05-07 15:22   ` René Scharfe
  0 siblings, 0 replies; 14+ messages in thread
From: René Scharfe @ 2022-05-07 15:22 UTC (permalink / raw)
  To: Philip Oakley, Jason Hatton, Junio C Hamano; +Cc: git

Am 07.05.22 um 14:33 schrieb Philip Oakley:
>
>
> On 7 May 2022 03:15:00 BST, Jason Hatton <jhatton@globalfinishing.com> wrote:
>
>         Philip Oakley <philipoakley@iee.email> writes:
>
>                 This may treat non-zero multiple of 4GiB as "not racy", but has
>                 anybody double checked the concern Réne brought up earlier that a
>                 4GiB file that was added and then got rewritten to 2GiB within the
>                 same second would suddenly start getting treated as not racy?
>
>             This is the pre-existing problem, that ~1in 2^31 size changes might not
>             get noticed for size change. The 0 byte / 4GiB change is an identical
>             issue, as is changing from 3 bytes to 4GiB+3 bytes, etc., so that's no
>             worse than before (well maybe twice as 'unlikely').
>
>
>         OK, it added one more case to 2^32-1 existing cases, I guess.
>
>                 The patch (the firnal version of it anyway) needs to be accompanied
>                 by a handful of test additions to tickle corner cases like that.
>
>             They'd be protected by the EXPENSIVE prerequisite I would assume.
>
>
>         Oh, absolutely. Thanks for spelling that out.
>
>
>     I have been testing out the patch a bit and have good and (mostly) bad news.
>
>     What works using a munge value of 1.
>
>     $ git add
>     $ git status
>
>     Racy seems to work.
>
>     $ touch .git/index 4GiB # 4GiB is now racy
>     $ git status # Git will rehash the racy file
>     $ git status # Git cached the file. Second status is fast.
>
>     What doesn't work.
>
>     $ git checkout 4GiB
>     $ fatal: packed object is corrupt!
>
>     Using a munge value of 1<<31 causes even more problems. The file hash in the
>     index for 4GiB files (git ls-files -s --debug) are set to the zero file hash.
>
>     I looked up and down the code base and couldn't figure out how the munged
>     value was leaking out of read-cache.c and breaking things. Most of the code
>     I found tends to use stat and then convert that to a size_t, not using the
>     munged unsigned int at all.
>
>     Maybe someone else will have better luck. This seems over my head :(
>
>     Thanks
>     --
>     Jason
>
>
> Is this on Git for Windows or a 64 bit Linux?
> There are still some issues on GfW for 2GiB+ files (long Vs long long int).

Which would explain the zero file hash.  And make the platform unfit for
handling big files at all at this time.

FWIW, on MacOS I get this with the patch applied:

   $ git init --quiet /tmp/a
   $ cd /tmp/a
   $ : >size-0
   $ dd if=/dev/zero bs=1 oseek=4294967295 count=1 of=size-4294967296
   1+0 records in
   1+0 records out
   1 bytes transferred in 0.000365 secs (2740 bytes/sec)
   $ dd if=/dev/zero bs=1 oseek=4294967296 count=1 of=size-4294967297
   1+0 records in
   1+0 records out
   1 bytes transferred in 0.000293 secs (3413 bytes/sec)
   $ dd if=/dev/zero bs=1 oseek=6442450943 count=1 of=size-6442450944
   1+0 records in
   1+0 records out
   1 bytes transferred in 0.000266 secs (3759 bytes/sec)
   $ git add size-*
   $ git commit -m initial
   [master (root-commit) d9c2a0a] initial
    4 files changed, 0 insertions(+), 0 deletions(-)
    create mode 100644 size-0
    create mode 100644 size-4294967296
    create mode 100644 size-4294967297
    create mode 100644 size-6442450944

   $ time git checkout size-*
   Updated 0 paths from the index
   git checkout size-*  0.01s user 0.01s system 65% cpu 0.020 total

   $ git ls-files -s --debug | grep size
   100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	size-0
     size: 0	flags: 0
   100644 451971a31ea5a207a10b391df2d5949910133565 0	size-4294967296
     size: 2147483648	flags: 0
   100644 3eb7feb1413c757f0d8181deb28d1dab03d64846 0	size-4294967297
     size: 1	flags: 0
   100644 741285bddfa7863072c238f34e27144c2501832d 0	size-6442450944
     size: 2147483648	flags: 0

So checkout skips all of the files and their cached sizes have the
expected values.

René

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
@ 2022-05-07  2:15 Jason Hatton
       [not found] ` <1DFD3E42-3EF3-4420-8E01-748EF3DBE7A1@iee.email>
  2022-05-10 22:45 ` Philip Oakley
  0 siblings, 2 replies; 14+ messages in thread
From: Jason Hatton @ 2022-05-07  2:15 UTC (permalink / raw)
  To: Junio C Hamano, Philip Oakley; +Cc: René Scharfe, git

>Philip Oakley <philipoakley@iee.email> writes:
>
>>> This may treat non-zero multiple of 4GiB as "not racy", but has
>>> anybody double checked the concern Réne brought up earlier that a
>>> 4GiB file that was added and then got rewritten to 2GiB within the
>>> same second would suddenly start getting treated as not racy?
>> This is the pre-existing problem, that ~1in 2^31 size changes might not
>> get noticed for size change. The 0 byte / 4GiB change is an identical
>> issue, as is changing from 3 bytes to 4GiB+3 bytes, etc., so that's no
>> worse than before (well maybe twice as 'unlikely').
>
>OK, it added one more case to 2^32-1 existing cases, I guess.
>
>>> The patch (the firnal version of it anyway) needs to be accompanied
>>> by a handful of test additions to tickle corner cases like that.
>> They'd be protected by the EXPENSIVE prerequisite I would assume.
> 
> Oh, absolutely.  Thanks for spelling that out.

I have been testing out the patch a bit and have good and (mostly) bad news.

What works using a munge value of 1.

$ git add
$ git status

Racy seems to work.

$ touch .git/index 4GiB # 4GiB is now racy
$ git status # Git will rehash the racy file
$ git status # Git cached the file. Second status is fast.

What doesn't work.

$ git checkout 4GiB
$ fatal: packed object is corrupt!

Using a munge value of 1<<31 causes even more problems. The file hash in the
index for 4GiB files (git ls-files -s --debug) are set to the zero file hash.

I looked up and down the code base and couldn't figure out how the munged
value was leaking out of read-cache.c and breaking things. Most of the code
I found tends to use stat and then convert that to a size_t, not using the
munged unsigned int at all.

Maybe someone else will have better luck. This seems over my head :(

Thanks
--
Jason


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
  2022-05-06 21:17     ` Philip Oakley
@ 2022-05-06 21:23       ` Junio C Hamano
  0 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2022-05-06 21:23 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Jason Hatton, René Scharfe, git

Philip Oakley <philipoakley@iee.email> writes:

>> This may treat non-zero multiple of 4GiB as "not racy", but has
>> anybody double checked the concern Réne brought up earlier that a
>> 4GiB file that was added and then got rewritten to 2GiB within the
>> same second would suddenly start getting treated as not racy?
> This is the pre-existing problem, that ~1in 2^31 size changes might not
> get noticed for size change. The 0 byte / 4GiB change is an identical
> issue, as is changing from 3 bytes to 4GiB+3 bytes, etc., so that's no
> worse than before (well maybe twice as 'unlikely').

OK, it added one more case to 2^32-1 existing cases, I guess.

>> The patch (the firnal version of it anyway) needs to be accompanied
>> by a handful of test additions to tickle corner cases like that.
> They'd be protected by the EXPENSIVE prerequisite I would assume.

Oh, absolutely.  Thanks for spelling that out.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
  2022-05-06 16:36   ` Junio C Hamano
@ 2022-05-06 21:17     ` Philip Oakley
  2022-05-06 21:23       ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Philip Oakley @ 2022-05-06 21:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jason Hatton, René Scharfe, git

On 06/05/2022 17:36, Junio C Hamano wrote:
>>> + */
>>> +unsigned int munge_st_size(off_t st_size) {
>>> +	unsigned int sd_size = st_size;
>>> +
>>> +	if(!sd_size && st_size)
> Style.
Ah, the same line / braces choice (as per coding guidelines).
>>> +		return 0x80000000;
>>> +	else
>>> +		return sd_size;
>>> +}
> This may treat non-zero multiple of 4GiB as "not racy", but has
> anybody double checked the concern Réne brought up earlier that a
> 4GiB file that was added and then got rewritten to 2GiB within the
> same second would suddenly start getting treated as not racy?
This is the pre-existing problem, that ~1in 2^31 size changes might not
get noticed for size change. The 0 byte / 4GiB change is an identical
issue, as is changing from 3 bytes to 4GiB+3 bytes, etc., so that's no
worse than before (well maybe twice as 'unlikely').

>
> The patch (the firnal version of it anyway) needs to be accompanied
> by a handful of test additions to tickle corner cases like that.
They'd be protected by the EXPENSIVE prerequisite I would assume.
Any particular test t/txxx that they should be placed in (I'm not that
familiar with the test suit)
--
Philip

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
  2022-05-06 17:08 Jason Hatton
@ 2022-05-06 18:32 ` Junio C Hamano
  0 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2022-05-06 18:32 UTC (permalink / raw)
  To: Jason Hatton; +Cc: Philip Oakley, git, René Scharfe

Jason Hatton <jhatton@globalfinishing.com> writes:

>>Philip Oakley <philipoakley@iee.email> writes:
>>
>>> This "Munge" above isn't telling the reader 'why'/'what' is going on.
>>> The comment should in some way highlight that a zero size result is
>>> special, and that we have the roll over issue when the stored in 32 bits
>>> - the double duty of racy vs changed in the stat data heuristic.
>>> Synonyms of 'munge' ?
>
> mangle?
> hash?
>
>>>
>>>
>>>> + */
>>>> +unsigned int munge_st_size(off_t st_size) {
>>>> +    unsigned int sd_size = st_size;
>>>> +
>>>> +    if(!sd_size && st_size)
>>
>>Style.
>
> Something like 1<<31?

Sorry, missing SP between "if" and "(" was what stood out like a
sore thumb.

>>
>>>> +        return 0x80000000;

The .sd_size member is merely defined as "unsigned int" and so is
the return value from this helper.  They have no idea how big an
integer they are dealing with.  It is limited to 32-bit explicitly
only because create_from_disk() uses get_be32() on ondisk->size to
get the value to be assigned to the member.

So I agree with writing it as 31-bit shift for ease of reading, but
perhaps a comment to indicate where that size comes from would help
the readers while we are at it, perhaps?

		return 1U<<31; /* ondisk_cache_entry.size */

I dunno.

>>>> +    else
>>>> +        return sd_size;
>>>> +}
>>
>>This may treat non-zero multiple of 4GiB as "not racy", but has
>>anybody double checked the concern Réne brought up earlier that a
>>4GiB file that was added and then got rewritten to 2GiB within the
>>same second would suddenly start getting treated as not racy?
>>
>>The patch (the firnal version of it anyway) needs to be accompanied
>>by a handful of test additions to tickle corner cases like that.
>>
>>Thanks, all, for working on this.
>
> If the file size is changed by exactly 2GiB is a concern. This is an issue for
> files exactly a multiple of 4GiB. However, all files that are changed by a
> multiple of 4GiB are vulnerable.

So if you have a 4GiB file, "git add" it, then rewrite it with a
different contents to make it a 8GiB file within the same second,
would Git mistakenly think that there is no change, because the racy
git protection is gone with this change?  I think that was one of
the concerns (there may have been others I am forgetting).


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
@ 2022-05-06 17:08 Jason Hatton
  2022-05-06 18:32 ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Hatton @ 2022-05-06 17:08 UTC (permalink / raw)
  To: Junio C Hamano, Philip Oakley, git; +Cc: René Scharfe

>Philip Oakley <philipoakley@iee.email> writes:
>
>> This "Munge" above isn't telling the reader 'why'/'what' is going on.
>> The comment should in some way highlight that a zero size result is
>> special, and that we have the roll over issue when the stored in 32 bits
>> - the double duty of racy vs changed in the stat data heuristic.
>> Synonyms of 'munge' ?

mangle?
hash?

>>
>>
>>> + */
>>> +unsigned int munge_st_size(off_t st_size) {
>>> +    unsigned int sd_size = st_size;
>>> +
>>> +    if(!sd_size && st_size)
>
>Style.

Something like 1<<31?

>
>>> +        return 0x80000000;
>>> +    else
>>> +        return sd_size;
>>> +}
>
>This may treat non-zero multiple of 4GiB as "not racy", but has
>anybody double checked the concern Réne brought up earlier that a
>4GiB file that was added and then got rewritten to 2GiB within the
>same second would suddenly start getting treated as not racy?
>
>The patch (the firnal version of it anyway) needs to be accompanied
>by a handful of test additions to tickle corner cases like that.
>
>Thanks, all, for working on this.

If the file size is changed by exactly 2GiB is a concern. This is an issue for
files exactly a multiple of 4GiB. However, all files that are changed by a
multiple of 4GiB are vulnerable. Say 4GiB + 42 and 8GiB + 42 would appear the
same with the current version of git. I'm sure the true fix involves updating
the index file format with 64 bit files sizes and an explicit racy flag. I'm
hopeful the rehashing issue for 4GiB files can be mitigated until than.

I have a question about the coding style. Torsten indicated that there should
be an explicit type cast. The original code did not use an explicit type cast,
so I'm unsure what is going on. One of you experts may have to make the final
patch. I hope my proof of concept gets the idea across.

Thanks
--
Jason


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
  2022-05-06 10:22 ` Philip Oakley
@ 2022-05-06 16:36   ` Junio C Hamano
  2022-05-06 21:17     ` Philip Oakley
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2022-05-06 16:36 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Jason Hatton, René Scharfe, git

Philip Oakley <philipoakley@iee.email> writes:

> This "Munge" above isn't telling the reader 'why'/'what' is going on.
> The comment should in some way highlight that a zero size result is
> special, and that we have the roll over issue when the stored in 32 bits
> - the double duty of racy vs changed in the stat data heuristic.
> Synonyms of 'munge' ?
>
>
>> + */
>> +unsigned int munge_st_size(off_t st_size) {
>> +	unsigned int sd_size = st_size;
>> +
>> +	if(!sd_size && st_size)

Style.

>> +		return 0x80000000;
>> +	else
>> +		return sd_size;
>> +}

This may treat non-zero multiple of 4GiB as "not racy", but has
anybody double checked the concern Réne brought up earlier that a
4GiB file that was added and then got rewritten to 2GiB within the
same second would suddenly start getting treated as not racy?

The patch (the firnal version of it anyway) needs to be accompanied
by a handful of test additions to tickle corner cases like that.

Thanks, all, for working on this.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
  2022-05-06  0:26 Jason Hatton
  2022-05-06  4:37 ` Torsten Bögershausen
@ 2022-05-06 10:22 ` Philip Oakley
  2022-05-06 16:36   ` Junio C Hamano
  1 sibling, 1 reply; 14+ messages in thread
From: Philip Oakley @ 2022-05-06 10:22 UTC (permalink / raw)
  To: Jason Hatton, René Scharfe, git; +Cc: Junio C Hamano

On 06/05/2022 01:26, Jason Hatton wrote:
> Git cache stores file sizes using uint32_t. This causes any file
> that is a multiple of 2^32 to have a cached file size of zero.
> Zero is a special value used by racily clean. This causes git to
> rehash every file that is a multiple of 2^32 every time git status
> or git commit is run.
>
> This patch mitigates the problem by making all files that are a
> multiple of 2^32 appear to have a size of 1<<31 instead of zero.
>
> The value of 1<<31 is chosen to keep it as far away from zero
> as possible to help prevent things getting mixed up with unpatched
> versions of git.
>
> An example would be to have a 2^32 sized file in the index of
> patched git. Patched git would save the file as 2^31 in the cache.
> An unpatched git would very much see the file has changed in size
> and force it to rehash the file, which is safe. The file would
> have to grow or shrink by exactly 2^31 and retain all of its
> ctime, mtime, and other attributes for old git to not notice
> the change.
>
> This patch does not change the behavior of any file that is not
> an exact multiple of 2^32.
>
> Signed-off-by: Jason D. Hatton <jhatton@globalfinishing.com>
> ---
>  cache.h      |  1 +
>  read-cache.c | 16 ++++++++++++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index 4b666b2848..74e983227b 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -898,6 +898,7 @@ int ie_modified(struct index_state *, const struct cache_entry *, struct stat *,
>  #define HASH_SILENT 8
>  int index_fd(struct index_state *istate, struct object_id *oid, int fd, struct stat *st, enum object_type type, const char *path, unsigned flags);
>  int index_path(struct index_state *istate, struct object_id *oid, const char *path, struct stat *st, unsigned flags);
> +unsigned int munge_st_size(off_t st_size);
>  
>  /*
>   * Record to sd the data from st that we use to check whether a file
> diff --git a/read-cache.c b/read-cache.c
> index ea6150ea28..b0a1b505db 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -163,6 +163,18 @@ void rename_index_entry_at(struct index_state *istate, int nr, const char *new_n
>  		add_index_entry(istate, new_entry, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE);
>  }
>  
> +/*
> + * Munge st_size into an unsigned int.

This "Munge" above isn't telling the reader 'why'/'what' is going on.
The comment should in some way highlight that a zero size result is
special, and that we have the roll over issue when the stored in 32 bits
- the double duty of racy vs changed in the stat data heuristic.
Synonyms of 'munge' ?


> + */
> +unsigned int munge_st_size(off_t st_size) {
> +	unsigned int sd_size = st_size;
> +
> +	if(!sd_size && st_size)
> +		return 0x80000000;
> +	else
> +		return sd_size;
> +}
> +
>  void fill_stat_data(struct stat_data *sd, struct stat *st)
>  {
>  	sd->sd_ctime.sec = (unsigned int)st->st_ctime;
> @@ -173,7 +185,7 @@ void fill_stat_data(struct stat_data *sd, struct stat *st)
>  	sd->sd_ino = st->st_ino;
>  	sd->sd_uid = st->st_uid;
>  	sd->sd_gid = st->st_gid;
> -	sd->sd_size = st->st_size;
> +	sd->sd_size = munge_st_size(st->st_size);
>  }
>  
>  int match_stat_data(const struct stat_data *sd, struct stat *st)
> @@ -212,7 +224,7 @@ int match_stat_data(const struct stat_data *sd, struct stat *st)
>  			changed |= INODE_CHANGED;
>  #endif
>  
> -	if (sd->sd_size != (unsigned int) st->st_size)
> +	if (sd->sd_size != munge_st_size(st->st_size))
>  		changed |= DATA_CHANGED;
>  
>  	return changed;


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Prevent git from rehashing 4GBi files
  2022-05-06  0:26 Jason Hatton
@ 2022-05-06  4:37 ` Torsten Bögershausen
  2022-05-06 10:22 ` Philip Oakley
  1 sibling, 0 replies; 14+ messages in thread
From: Torsten Bögershausen @ 2022-05-06  4:37 UTC (permalink / raw)
  To: Jason Hatton; +Cc: Philip Oakley, René Scharfe, git, Junio C Hamano

On Fri, May 06, 2022 at 12:26:53AM +0000, Jason Hatton wrote:
> Git cache stores file sizes using uint32_t. This causes any file
> that is a multiple of 2^32 to have a cached file size of zero.
> Zero is a special value used by racily clean. This causes git to
> rehash every file that is a multiple of 2^32 every time git status
> or git commit is run.
>
> This patch mitigates the problem by making all files that are a
> multiple of 2^32 appear to have a size of 1<<31 instead of zero.
>
> The value of 1<<31 is chosen to keep it as far away from zero
> as possible to help prevent things getting mixed up with unpatched
> versions of git.
>
> An example would be to have a 2^32 sized file in the index of
> patched git. Patched git would save the file as 2^31 in the cache.
> An unpatched git would very much see the file has changed in size
> and force it to rehash the file, which is safe. The file would
> have to grow or shrink by exactly 2^31 and retain all of its
> ctime, mtime, and other attributes for old git to not notice
> the change.
>
> This patch does not change the behavior of any file that is not
> an exact multiple of 2^32.
>
> Signed-off-by: Jason D. Hatton <jhatton@globalfinishing.com>
> ---
>  cache.h      |  1 +
>  read-cache.c | 16 ++++++++++++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index 4b666b2848..74e983227b 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -898,6 +898,7 @@ int ie_modified(struct index_state *, const struct cache_entry *, struct stat *,
>  #define HASH_SILENT 8
>  int index_fd(struct index_state *istate, struct object_id *oid, int fd, struct stat *st, enum object_type type, const char *path, unsigned flags);
>  int index_path(struct index_state *istate, struct object_id *oid, const char *path, struct stat *st, unsigned flags);
> +unsigned int munge_st_size(off_t st_size);
>
>  /*
>   * Record to sd the data from st that we use to check whether a file
> diff --git a/read-cache.c b/read-cache.c
> index ea6150ea28..b0a1b505db 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -163,6 +163,18 @@ void rename_index_entry_at(struct index_state *istate, int nr, const char *new_n
>  		add_index_entry(istate, new_entry, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE);
>  }
>
> +/*
> + * Munge st_size into an unsigned int.
> + */
> +unsigned int munge_st_size(off_t st_size) {
> +	unsigned int sd_size = st_size;

Should this be written as
	unsigned int sd_size = (unsigned int)st_size;


> +
> +	if(!sd_size && st_size)
> +		return 0x80000000;
> +	else
> +		return sd_size;
> +}
> +
>  void fill_stat_data(struct stat_data *sd, struct stat *st)
>  {
>  	sd->sd_ctime.sec = (unsigned int)st->st_ctime;
> @@ -173,7 +185,7 @@ void fill_stat_data(struct stat_data *sd, struct stat *st)
>  	sd->sd_ino = st->st_ino;
>  	sd->sd_uid = st->st_uid;
>  	sd->sd_gid = st->st_gid;
> -	sd->sd_size = st->st_size;
> +	sd->sd_size = munge_st_size(st->st_size);
>  }
>
>  int match_stat_data(const struct stat_data *sd, struct stat *st)
> @@ -212,7 +224,7 @@ int match_stat_data(const struct stat_data *sd, struct stat *st)
>  			changed |= INODE_CHANGED;
>  #endif
>
> -	if (sd->sd_size != (unsigned int) st->st_size)
> +	if (sd->sd_size != munge_st_size(st->st_size))
>  		changed |= DATA_CHANGED;
>
>  	return changed;
> --
> 2.36.0
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH] Prevent git from rehashing 4GBi files
@ 2022-05-06  0:26 Jason Hatton
  2022-05-06  4:37 ` Torsten Bögershausen
  2022-05-06 10:22 ` Philip Oakley
  0 siblings, 2 replies; 14+ messages in thread
From: Jason Hatton @ 2022-05-06  0:26 UTC (permalink / raw)
  To: Philip Oakley, René Scharfe, git; +Cc: Junio C Hamano

Git cache stores file sizes using uint32_t. This causes any file
that is a multiple of 2^32 to have a cached file size of zero.
Zero is a special value used by racily clean. This causes git to
rehash every file that is a multiple of 2^32 every time git status
or git commit is run.

This patch mitigates the problem by making all files that are a
multiple of 2^32 appear to have a size of 1<<31 instead of zero.

The value of 1<<31 is chosen to keep it as far away from zero
as possible to help prevent things getting mixed up with unpatched
versions of git.

An example would be to have a 2^32 sized file in the index of
patched git. Patched git would save the file as 2^31 in the cache.
An unpatched git would very much see the file has changed in size
and force it to rehash the file, which is safe. The file would
have to grow or shrink by exactly 2^31 and retain all of its
ctime, mtime, and other attributes for old git to not notice
the change.

This patch does not change the behavior of any file that is not
an exact multiple of 2^32.

Signed-off-by: Jason D. Hatton <jhatton@globalfinishing.com>
---
 cache.h      |  1 +
 read-cache.c | 16 ++++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index 4b666b2848..74e983227b 100644
--- a/cache.h
+++ b/cache.h
@@ -898,6 +898,7 @@ int ie_modified(struct index_state *, const struct cache_entry *, struct stat *,
 #define HASH_SILENT 8
 int index_fd(struct index_state *istate, struct object_id *oid, int fd, struct stat *st, enum object_type type, const char *path, unsigned flags);
 int index_path(struct index_state *istate, struct object_id *oid, const char *path, struct stat *st, unsigned flags);
+unsigned int munge_st_size(off_t st_size);
 
 /*
  * Record to sd the data from st that we use to check whether a file
diff --git a/read-cache.c b/read-cache.c
index ea6150ea28..b0a1b505db 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -163,6 +163,18 @@ void rename_index_entry_at(struct index_state *istate, int nr, const char *new_n
 		add_index_entry(istate, new_entry, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE);
 }
 
+/*
+ * Munge st_size into an unsigned int.
+ */
+unsigned int munge_st_size(off_t st_size) {
+	unsigned int sd_size = st_size;
+
+	if(!sd_size && st_size)
+		return 0x80000000;
+	else
+		return sd_size;
+}
+
 void fill_stat_data(struct stat_data *sd, struct stat *st)
 {
 	sd->sd_ctime.sec = (unsigned int)st->st_ctime;
@@ -173,7 +185,7 @@ void fill_stat_data(struct stat_data *sd, struct stat *st)
 	sd->sd_ino = st->st_ino;
 	sd->sd_uid = st->st_uid;
 	sd->sd_gid = st->st_gid;
-	sd->sd_size = st->st_size;
+	sd->sd_size = munge_st_size(st->st_size);
 }
 
 int match_stat_data(const struct stat_data *sd, struct stat *st)
@@ -212,7 +224,7 @@ int match_stat_data(const struct stat_data *sd, struct stat *st)
 			changed |= INODE_CHANGED;
 #endif
 
-	if (sd->sd_size != (unsigned int) st->st_size)
+	if (sd->sd_size != munge_st_size(st->st_size))
 		changed |= DATA_CHANGED;
 
 	return changed;
-- 
2.36.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-05-11 22:24 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CY4PR16MB165501ED1B535592033C76F2AFC49@CY4PR16MB1655.namprd16.prod.outlook.com>
2022-05-07 18:10 ` [PATCH] Prevent git from rehashing 4GBi files Jason Hatton
     [not found] <philipoakley@iee.email>
2022-05-07 18:58 ` Jason D. Hatton
2022-05-07  2:15 Jason Hatton
     [not found] ` <1DFD3E42-3EF3-4420-8E01-748EF3DBE7A1@iee.email>
2022-05-07 15:22   ` René Scharfe
2022-05-10 22:45 ` Philip Oakley
2022-05-11 22:24   ` Philip Oakley
  -- strict thread matches above, loose matches on Subject: below --
2022-05-06 17:08 Jason Hatton
2022-05-06 18:32 ` Junio C Hamano
2022-05-06  0:26 Jason Hatton
2022-05-06  4:37 ` Torsten Bögershausen
2022-05-06 10:22 ` Philip Oakley
2022-05-06 16:36   ` Junio C Hamano
2022-05-06 21:17     ` Philip Oakley
2022-05-06 21:23       ` Junio C Hamano

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).