From: Philip Oakley <philipoakley@iee.email>
To: "Jason Hatton" <jhatton@globalfinishing.com>,
	"René Scharfe" <l.s.r@web.de>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Cc: Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] Prevent git from rehashing 4GBi files
Date: Fri, 6 May 2022 11:22:57 +0100
Message-ID: <f3f883cb-9141-6871-0f3d-3d3293a17f5f@iee.email>
In-Reply-To: <CY4PR16MB16552D74E064638BEC11ECB1AFC59@CY4PR16MB1655.namprd16.prod.outlook.com>

On 06/05/2022 01:26, Jason Hatton wrote:
> Git cache stores file sizes using uint32_t. This causes any file
> that is a multiple of 2^32 to have a cached file size of zero.
> Zero is a special value used by racily clean. This causes git to
> rehash every file that is a multiple of 2^32 every time git status
> or git commit is run.
>
> This patch mitigates the problem by making all files that are a
> multiple of 2^32 appear to have a size of 1<<31 instead of zero.
>
> The value of 1<<31 is chosen to keep it as far away from zero
> as possible to help prevent things getting mixed up with unpatched
> versions of git.
>
> An example would be to have a 2^32 sized file in the index of
> patched git. Patched git would save the file as 2^31 in the cache.
> An unpatched git would very much see the file has changed in size
> and force it to rehash the file, which is safe. The file would
> have to grow or shrink by exactly 2^31 and retain all of its
> ctime, mtime, and other attributes for old git to not notice
> the change.
>
> This patch does not change the behavior of any file that is not
> an exact multiple of 2^32.
>
> Signed-off-by: Jason D. Hatton <jhatton@globalfinishing.com>
> ---
>  cache.h      |  1 +
>  read-cache.c | 16 ++++++++++++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index 4b666b2848..74e983227b 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -898,6 +898,7 @@ int ie_modified(struct index_state *, const struct cache_entry *, struct stat *,
>  #define HASH_SILENT 8
>  int index_fd(struct index_state *istate, struct object_id *oid, int fd, struct stat *st, enum object_type type, const char *path, unsigned flags);
>  int index_path(struct index_state *istate, struct object_id *oid, const char *path, struct stat *st, unsigned flags);
> +unsigned int munge_st_size(off_t st_size);
>  
>  /*
>   * Record to sd the data from st that we use to check whether a file
> diff --git a/read-cache.c b/read-cache.c
> index ea6150ea28..b0a1b505db 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -163,6 +163,18 @@ void rename_index_entry_at(struct index_state *istate, int nr, const char *new_n
>  		add_index_entry(istate, new_entry, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE);
>  }
>  
> +/*
> + * Munge st_size into an unsigned int.

The word "Munge" above doesn't tell the reader what is going on, or why.
The comment should in some way highlight that a zero size result is
special, and that the size rolls over when it is stored in 32 bits.
That is, sd_size does double duty in the stat data heuristic: zero marks
a racily clean entry, while any other value is compared to detect a change.
Is there a better synonym for 'munge'?


> + */
> +unsigned int munge_st_size(off_t st_size) {
> +	unsigned int sd_size = st_size;
> +
> +	if(!sd_size && st_size)
> +		return 0x80000000;
> +	else
> +		return sd_size;
> +}
> +
>  void fill_stat_data(struct stat_data *sd, struct stat *st)
>  {
>  	sd->sd_ctime.sec = (unsigned int)st->st_ctime;
> @@ -173,7 +185,7 @@ void fill_stat_data(struct stat_data *sd, struct stat *st)
>  	sd->sd_ino = st->st_ino;
>  	sd->sd_uid = st->st_uid;
>  	sd->sd_gid = st->st_gid;
> -	sd->sd_size = st->st_size;
> +	sd->sd_size = munge_st_size(st->st_size);
>  }
>  
>  int match_stat_data(const struct stat_data *sd, struct stat *st)
> @@ -212,7 +224,7 @@ int match_stat_data(const struct stat_data *sd, struct stat *st)
>  			changed |= INODE_CHANGED;
>  #endif
>  
> -	if (sd->sd_size != (unsigned int) st->st_size)
> +	if (sd->sd_size != munge_st_size(st->st_size))
>  		changed |= DATA_CHANGED;
>  
>  	return changed;
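
For anyone who wants to see the roll-over outside of git, here is a
minimal standalone sketch. It is not code from the patch: it assumes
int64_t in place of off_t so that it behaves the same on every platform,
and the names truncate_size(), munge_size() and self_check() are made up
for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* What the unpatched code effectively stores: the low 32 bits of the size. */
uint32_t truncate_size(int64_t st_size)
{
	return (uint32_t)st_size;
}

/*
 * Same logic as the patch's munge_st_size(): a non-zero size whose low
 * 32 bits are all zero is recorded as 1<<31 rather than 0, because 0 is
 * the value used for racily clean entries.
 */
uint32_t munge_size(int64_t st_size)
{
	uint32_t sd_size = (uint32_t)st_size;

	if (!sd_size && st_size)
		return UINT32_C(0x80000000);
	return sd_size;
}

/* Hypothetical helper: checks the cases discussed in the commit message. */
void self_check(void)
{
	const int64_t four_gib = INT64_C(1) << 32;

	/* Plain truncation collides with the racily clean marker. */
	assert(truncate_size(four_gib) == 0);

	/* Every non-zero multiple of 2^32 maps to 1<<31 instead. */
	assert(munge_size(four_gib) == UINT32_C(0x80000000));
	assert(munge_size(3 * four_gib) == UINT32_C(0x80000000));

	/* An empty file still records 0; other sizes are unchanged. */
	assert(munge_size(0) == 0);
	assert(munge_size(four_gib + 1) == 1);
	assert(munge_size(12345) == 12345);
}
```

Calling self_check() from a small main() exercises all of the cases. A
file of exactly 2^31 bytes also records 0x80000000, which is the
remaining ambiguity the commit message describes.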
