git@vger.kernel.org mailing list mirror (one of many)
* [PATCH] packfile: enhance the mtime of packfile by idx file
@ 2021-07-10 19:01 Sun Chao via GitGitGadget
  2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason
  2021-07-14  1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget
  0 siblings, 2 replies; 34+ messages in thread
From: Sun Chao via GitGitGadget @ 2021-07-10 19:01 UTC (permalink / raw)
  To: git; +Cc: Sun Chao, Sun Chao

From: Sun Chao <16657101987@163.com>

Commit 33d4221c79 (write_sha1_file: freshen existing objects,
2014-10-15) avoids writing existing objects by freshening their
mtime (in particular that of the packfiles containing them) in
order to aid correct caching, so that processes like
find_lru_pack can make good decisions. However, this is
unfriendly to incremental backup jobs, and to services that
rely on the file system cache when large '.pack' files exist.

For example, after packing all objects, if we use 'write-tree'
to create the same commit with the same tree and the same
environment (e.g. GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we
can see that the '.pack' file's mtime changes, but the '.idx'
file's does not.

So let us freshen the mtime of a packfile by touching its
'.idx' file instead of the '.pack' file, and read the mtime
back from the '.idx' file as well. A large git repository may
contain large '.pack' files, but the '.idx' files are much
smaller, so this avoids the file system cache reloading the
large files and speeds up git commands.
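As a rough illustration of the suffix swap (hypothetical Python mirroring the derive_pack_filename() helper the patch factors out below; it is not a git API):

```python
def derive_pack_filename(pack_name: str, strip: str, suffix: str) -> str:
    """Replace the trailing '.<strip>' of pack_name with '.<suffix>',
    e.g. map a '.pack' path to the corresponding '.idx' path."""
    base, dot, end = pack_name.rpartition(".")
    if dot != "." or end != strip:
        raise ValueError(
            f"packfile name '{pack_name}' does not end with '.{strip}'")
    return f"{base}.{suffix}"

# The mtime of a pack would then be read from (and written to)
# derive_pack_filename(pack_name, "pack", "idx") instead of the
# '.pack' file itself.
```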

Signed-off-by: Sun Chao <16657101987@163.com>
---
    packfile: enhance the mtime of packfile by idx file
    
    Commit 33d4221c79 (write_sha1_file: freshen existing objects,
    2014-10-15) avoids writing existing objects by freshening their mtime
    (in particular that of the packfiles containing them) in order to aid
    correct caching, so that processes like find_lru_pack can make good
    decisions. However, this is unfriendly to incremental backup jobs, and
    to services that rely on the file system cache when large '.pack'
    files exist.
    
    For example, after packing all objects, if we use 'write-tree' to
    create the same commit with the same tree and the same environment
    (e.g. GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we can see that the
    '.pack' file's mtime changes, but the '.idx' file's does not.
    
    So let us freshen the mtime of a packfile by touching its '.idx' file
    instead of the '.pack' file, and read the mtime back from the '.idx'
    file as well. A large git repository may contain large '.pack' files,
    but the '.idx' files are much smaller, so this avoids the file system
    cache reloading the large files and speeds up git commands.
    
    Signed-off-by: Sun Chao <16657101987@163.com>

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1043%2Fsunchao9%2Fmaster-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1043/sunchao9/master-v1
Pull-Request: https://github.com/git/git/pull/1043

 builtin/index-pack.c                 | 19 +++----------------
 object-file.c                        |  7 ++++++-
 packfile.c                           | 19 ++++++++++++++++++-
 packfile.h                           |  7 +++++++
 t/t7701-repack-unpack-unreachable.sh |  2 +-
 5 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 3fbc5d70777..60bacc8ee7f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1437,19 +1437,6 @@ static void fix_unresolved_deltas(struct hashfile *f)
 	free(sorted_by_pos);
 }
 
-static const char *derive_filename(const char *pack_name, const char *strip,
-				   const char *suffix, struct strbuf *buf)
-{
-	size_t len;
-	if (!strip_suffix(pack_name, strip, &len) || !len ||
-	    pack_name[len - 1] != '.')
-		die(_("packfile name '%s' does not end with '.%s'"),
-		    pack_name, strip);
-	strbuf_add(buf, pack_name, len);
-	strbuf_addstr(buf, suffix);
-	return buf->buf;
-}
-
 static void write_special_file(const char *suffix, const char *msg,
 			       const char *pack_name, const unsigned char *hash,
 			       const char **report)
@@ -1460,7 +1447,7 @@ static void write_special_file(const char *suffix, const char *msg,
 	int msg_len = strlen(msg);
 
 	if (pack_name)
-		filename = derive_filename(pack_name, "pack", suffix, &name_buf);
+		filename = derive_pack_filename(pack_name, "pack", suffix, &name_buf);
 	else
 		filename = odb_pack_name(&name_buf, hash, suffix);
 
@@ -1855,13 +1842,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (from_stdin && hash_algo)
 		die(_("--object-format cannot be used with --stdin"));
 	if (!index_name && pack_name)
-		index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf);
+		index_name = derive_pack_filename(pack_name, "pack", "idx", &index_name_buf);
 
 	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
 	if (rev_index) {
 		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
 		if (index_name)
-			rev_index_name = derive_filename(index_name,
+			rev_index_name = derive_pack_filename(index_name,
 							 "idx", "rev",
 							 &rev_index_name_buf);
 	}
diff --git a/object-file.c b/object-file.c
index f233b440b22..068ef99d461 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1974,11 +1974,16 @@ static int freshen_loose_object(const struct object_id *oid)
 static int freshen_packed_object(const struct object_id *oid)
 {
 	struct pack_entry e;
+	struct strbuf name_buf = STRBUF_INIT;
+	const char *filename;
+
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
 	if (e.p->freshened)
 		return 1;
-	if (!freshen_file(e.p->pack_name))
+
+	filename = derive_pack_filename(e.p->pack_name, "pack", "idx", &name_buf);
+	if (!freshen_file(filename))
 		return 0;
 	e.p->freshened = 1;
 	return 1;
diff --git a/packfile.c b/packfile.c
index 755aa7aec5e..46f8fb22462 100644
--- a/packfile.c
+++ b/packfile.c
@@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1)
 	return odb_pack_name(&buf, sha1, "idx");
 }
 
+const char *derive_pack_filename(const char *pack_name, const char *strip,
+				const char *suffix, struct strbuf *buf)
+{
+	size_t len;
+	if (!strip_suffix(pack_name, strip, &len) || !len ||
+	    pack_name[len - 1] != '.')
+		die(_("packfile name '%s' does not end with '.%s'"),
+		    pack_name, strip);
+	strbuf_add(buf, pack_name, len);
+	strbuf_addstr(buf, suffix);
+	return buf->buf;
+}
+
 static unsigned int pack_used_ctr;
 static unsigned int pack_mmap_calls;
 static unsigned int peak_pack_open_windows;
@@ -693,6 +706,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	size_t alloc;
 	struct packed_git *p;
 
+	if (stat(path, &st) || !S_ISREG(st.st_mode)) {
+		return NULL;
+	}
+
 	/*
 	 * Make sure a corresponding .pack file exists and that
 	 * the index looks sane.
@@ -707,6 +724,7 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	alloc = st_add3(path_len, strlen(".promisor"), 1);
 	p = alloc_packed_git(alloc);
 	memcpy(p->pack_name, path, path_len);
+	p->mtime = st.st_mtime;
 
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".keep");
 	if (!access(p->pack_name, F_OK))
@@ -727,7 +745,6 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	 */
 	p->pack_size = st.st_size;
 	p->pack_local = local;
-	p->mtime = st.st_mtime;
 	if (path_len < the_hash_algo->hexsz ||
 	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
 		hashclr(p->hash);
diff --git a/packfile.h b/packfile.h
index 3ae117a8aef..ff702b22e6a 100644
--- a/packfile.h
+++ b/packfile.h
@@ -31,6 +31,13 @@ char *sha1_pack_name(const unsigned char *sha1);
  */
 char *sha1_pack_index_name(const unsigned char *sha1);
 
+/*
+ * Return the filename corresponding to "file_name", with its trailing
+ * "strip" suffix replaced by "suffix". "file_name" must end with "strip".
+ */
+const char *derive_pack_filename(const char *file_name, const char *strip,
+		const char *suffix, struct strbuf *buf);
+
 /*
  * Return the basename of the packfile, omitting any containing directory
  * (e.g., "pack-1234abcd[...].pack").
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index 937f89ee8c8..51c1afcbe2c 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -106,7 +106,7 @@ test_expect_success 'do not bother loosening old objects' '
 	git prune-packed &&
 	git cat-file -p $obj1 &&
 	git cat-file -p $obj2 &&
-	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.idx &&
 	git repack -A -d --unpack-unreachable=1.hour.ago &&
 	git cat-file -p $obj1 &&
 	test_must_fail git cat-file -p $obj2

base-commit: d486ca60a51c9cb1fe068803c3f540724e95e83a
-- 
gitgitgadget


* Re: [PATCH] packfile: enhance the mtime of packfile by idx file
  2021-07-10 19:01 [PATCH] packfile: enhance the mtime of packfile by idx file Sun Chao via GitGitGadget
@ 2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason
  2021-07-12 16:17   ` Sun Chao
  2021-07-14  1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget
  1 sibling, 1 reply; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-11 23:44 UTC (permalink / raw)
  To: Sun Chao via GitGitGadget; +Cc: git, Sun Chao


On Sat, Jul 10 2021, Sun Chao via GitGitGadget wrote:

> From: Sun Chao <16657101987@163.com>
>
> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
> 2014-10-15) avoids writing existing objects by freshening their
> mtime (in particular that of the packfiles containing them) in
> order to aid correct caching, so that processes like
> find_lru_pack can make good decisions. However, this is
> unfriendly to incremental backup jobs, and to services that
> rely on the file system cache when large '.pack' files exist.
>
> For example, after packing all objects, if we use 'write-tree'
> to create the same commit with the same tree and the same
> environment (e.g. GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we
> can see that the '.pack' file's mtime changes, but the '.idx'
> file's does not.
>
> So let us freshen the mtime of a packfile by touching its
> '.idx' file instead of the '.pack' file, and read the mtime
> back from the '.idx' file as well. A large git repository may
> contain large '.pack' files, but the '.idx' files are much
> smaller, so this avoids the file system cache reloading the
> large files and speeds up git commands.
>
> Signed-off-by: Sun Chao <16657101987@163.com>

Does this have the unstated trade-off that in a mixed-version
environment (say two git versions coordinating writes to an NFS share)
where one is old and thinks *.pack needs updating, and the other is new
and thinks *.idx is what should be checked, we're effectively back to
pre-33d4221c79 until both are upgraded?

I don't think it's a dealbreaker, just wondering if I've got that right
& if it's a trade-off you thought about; maybe we should check the
mtime of both. The stat() is cheap, it's the re-sync that matters for
you.

But just to run with that thought, wouldn't it be even more helpful to
you to have, say, a config setting to create a *.bump file next to the
*.{idx,pack}?

Then you'd have an empty file (the *.idx is smaller, but still not
empty), and as a patch it seems relatively simple, i.e. some core.* or
gc.* or pack.* setting changing what we touch/stat().


* Re: [PATCH] packfile: enhance the mtime of packfile by idx file
  2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason
@ 2021-07-12 16:17   ` Sun Chao
  0 siblings, 0 replies; 34+ messages in thread
From: Sun Chao @ 2021-07-12 16:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Sun Chao via GitGitGadget, git



> On Jul 12, 2021, at 07:44, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> 
> 
> On Sat, Jul 10 2021, Sun Chao via GitGitGadget wrote:
> 
>> From: Sun Chao <16657101987@163.com>
>> 
>> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
>> 2014-10-15) avoids writing existing objects by freshening their
>> mtime (in particular that of the packfiles containing them) in
>> order to aid correct caching, so that processes like
>> find_lru_pack can make good decisions. However, this is
>> unfriendly to incremental backup jobs, and to services that
>> rely on the file system cache when large '.pack' files exist.
>>
>> For example, after packing all objects, if we use 'write-tree'
>> to create the same commit with the same tree and the same
>> environment (e.g. GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we
>> can see that the '.pack' file's mtime changes, but the '.idx'
>> file's does not.
>>
>> So let us freshen the mtime of a packfile by touching its
>> '.idx' file instead of the '.pack' file, and read the mtime
>> back from the '.idx' file as well. A large git repository may
>> contain large '.pack' files, but the '.idx' files are much
>> smaller, so this avoids the file system cache reloading the
>> large files and speeds up git commands.
>> 
>> Signed-off-by: Sun Chao <16657101987@163.com>
> 
> Does this have the unstated trade-off that in a mixed-version
> environment (say two git versions coordinating writes to an NFS share)
> where one is old and thinks *.pack needs updating, and the other is new
> and thinks *.idx is what should be checked, we're effectively back to
> pre-33d4221c79 until both are upgraded?
> 
Thanks for your reply, I could not agree with you more.

> I don't think it's a dealbreaker, just wondering if I've got that right
> & if it's a trade-off you thought about; maybe we should check the
> mtime of both. The stat() is cheap, it's the re-sync that matters for
> you.
> 
> But just to run with that thought, wouldn't it be even more helpful to
> you to have, say, a config setting to create a *.bump file next to the
> *.{idx,pack}?
> 
> Then you'd have an empty file (the *.idx is smaller, but still not
> empty), and as a patch it seems relatively simple, i.e. some core.* or
> gc.* or pack.* setting changing what we touch/stat().

Yes, thanks. This is a good idea, let me try this way.




* [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-10 19:01 [PATCH] packfile: enhance the mtime of packfile by idx file Sun Chao via GitGitGadget
  2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason
@ 2021-07-14  1:28 ` Sun Chao via GitGitGadget
  2021-07-14  1:39   ` Ævar Arnfjörð Bjarmason
  2021-07-19 19:53   ` [PATCH v3] " Sun Chao via GitGitGadget
  1 sibling, 2 replies; 34+ messages in thread
From: Sun Chao via GitGitGadget @ 2021-07-14  1:28 UTC (permalink / raw)
  To: git; +Cc: Sun Chao, Sun Chao

From: Sun Chao <16657101987@163.com>

Commit 33d4221c79 (write_sha1_file: freshen existing objects,
2014-10-15) avoids writing existing objects by freshening their
mtime (in particular that of the packfiles containing them) in
order to aid correct caching, so that processes like
find_lru_pack can make good decisions. However, this is
unfriendly to incremental backup jobs, and to services that
rely on the file system cache when large '.pack' files exist.

For example, after packing all objects, if we use 'write-tree'
to create the same commit with the same tree and the same
environment (e.g. GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we
can see that the '.pack' file's mtime changes, but the '.idx'
file's does not.

So let us freshen the mtime of a packfile by touching another
file instead of the '.pack' file, e.g. an empty '.bump' file,
and read the mtime back from that file when we need it. A
large git repository may contain large '.pack' files, and
using smaller (or even empty) files for the mtime get/set
operations avoids the file system cache re-syncing the large
'.pack' files and speeds up most git commands.

Signed-off-by: Sun Chao <16657101987@163.com>
---
    packfile: freshen the mtime of packfile by configuration
    
    Commit 33d4221 (write_sha1_file: freshen existing objects, 2014-10-15)
    avoids writing existing objects by freshening their mtime (in
    particular that of the packfiles containing them) in order to aid
    correct caching, so that processes like find_lru_pack can make good
    decisions. However, this is unfriendly to incremental backup jobs, and
    to services that rely on the file system cache when large '.pack'
    files exist.
    
    For example, after packing all objects, if we use 'write-tree' to
    create the same commit with the same tree and the same environment
    (e.g. GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we can see that the
    '.pack' file's mtime changes, but the '.idx' file's does not.
    
    So let us freshen the mtime of a packfile by touching another file
    instead of the '.pack' file, e.g. an empty '.bump' file, and read the
    mtime back from that file when we need it. A large git repository may
    contain large '.pack' files, and using smaller (or even empty) files
    for the mtime get/set operations avoids the file system cache
    re-syncing the large '.pack' files and speeds up most git commands.
    
    Signed-off-by: Sun Chao <16657101987@163.com>

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1043%2Fsunchao9%2Fmaster-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1043/sunchao9/master-v2
Pull-Request: https://github.com/git/git/pull/1043

Range-diff vs v1:

 1:  f6adbb4e90f ! 1:  943e31e8587 packfile: enhance the mtime of packfile by idx file
     @@ Metadata
      Author: Sun Chao <16657101987@163.com>
      
       ## Commit message ##
     -    packfile: enhance the mtime of packfile by idx file
     +    packfile: freshen the mtime of packfile by configuration
      
          Commit 33d4221c79 (write_sha1_file: freshen existing objects,
          2014-10-15) avoid writing existing objects by freshen their
     @@ Commit message
          For example, after packed all objects, use 'write-tree' to
          create same commit with the same tree and same environments
          such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can
     -    notice the '.pack' file's mtime changed, but '.idx' file not.
     +    notice the '.pack' file's mtime changed, and '.idx' file not.
      
     -    So if we update the mtime of packfile by updating the '.idx'
     -    file instead of '.pack' file, when we check the mtime
     -    of packfile, get it from '.idx' file instead. Large git
     -    repository may contains large '.pack' files, but '.idx'
     -    files are smaller enough, this can avoid file system cache
     -    reload the large files again and speed up git commands.
     +    If we freshen the mtime of packfile by updating another
     +    file instead of '.pack' file e.g. a empty '.bump' file,
     +    when we need to check the mtime of packfile, get it from
     +    another file instead. Large git repository may contains
     +    large '.pack' files, and we can use smaller files even empty
     +    file to do the mtime get/set operation, this can avoid
     +    file system cache re-sync large '.pack' files again and
     +    then speed up most git commands.
      
          Signed-off-by: Sun Chao <16657101987@163.com>
      
     + ## Documentation/config/core.txt ##
     +@@ Documentation/config/core.txt: the largest projects.  You probably do not need to adjust this value.
     + +
     + Common unit suffixes of 'k', 'm', or 'g' are supported.
     + 
      ++core.packMtimeSuffix::
      ++	Normally we avoid writing an existing object by freshening the
      ++	mtime of the *.pack file that contains it, in order to aid
      ++	processes such as prune. Freshening a different file instead of
      ++	the *.pack file avoids the file system cache re-syncing the
      ++	large packfiles, and consequently makes git commands faster.
      +++
      ++The default is 'pack', which means the *.pack file itself is freshened.
      ++You can configure a different suffix to use; a file with that suffix
      ++will be created automatically. It is better not to use any known
      ++suffix such as 'idx', 'keep' or 'promisor'.
     ++
     + core.deltaBaseCacheLimit::
     + 	Maximum number of bytes per thread to reserve for caching base objects
     + 	that may be referenced by multiple deltified objects.  By storing the
     +
       ## builtin/index-pack.c ##
      @@ builtin/index-pack.c: static void fix_unresolved_deltas(struct hashfile *f)
       	free(sorted_by_pos);
     @@ builtin/index-pack.c: int cmd_index_pack(int argc, const char **argv, const char
       							 &rev_index_name_buf);
       	}
      
     + ## cache.h ##
     +@@ cache.h: extern size_t packed_git_limit;
     + extern size_t delta_base_cache_limit;
     + extern unsigned long big_file_threshold;
     + extern unsigned long pack_size_limit_cfg;
     ++extern const char *pack_mtime_suffix;
     + 
     + /*
     +  * Accessors for the core.sharedrepository config which lazy-load the value
     +
     + ## config.c ##
     +@@ config.c: static int git_default_core_config(const char *var, const char *value, void *cb)
     + 		return 0;
     + 	}
     + 
     ++	if (!strcmp(var, "core.packmtimesuffix")) {
     ++		return git_config_string(&pack_mtime_suffix, var, value);
     ++	}
     ++
     + 	if (!strcmp(var, "core.deltabasecachelimit")) {
     + 		delta_base_cache_limit = git_config_ulong(var, value);
     + 		return 0;
     +
     + ## environment.c ##
     +@@ environment.c: const char *git_hooks_path;
     + int zlib_compression_level = Z_BEST_SPEED;
     + int core_compression_level;
     + int pack_compression_level = Z_DEFAULT_COMPRESSION;
     ++const char *pack_mtime_suffix = "pack";
     + int fsync_object_files;
     + size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
     + size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
     +
       ## object-file.c ##
      @@ object-file.c: static int freshen_loose_object(const struct object_id *oid)
       static int freshen_packed_object(const struct object_id *oid)
       {
       	struct pack_entry e;
     ++	struct stat st;
      +	struct strbuf name_buf = STRBUF_INIT;
      +	const char *filename;
      +
     @@ object-file.c: static int freshen_loose_object(const struct object_id *oid)
       	if (e.p->freshened)
       		return 1;
      -	if (!freshen_file(e.p->pack_name))
     +-		return 0;
     ++
     ++	filename = e.p->pack_name;
     ++	if (!strcasecmp(pack_mtime_suffix, "pack")) {
     ++		if (!freshen_file(filename))
     ++			return 0;
     ++		e.p->freshened = 1;
     ++		return 1;
     ++	}
     ++
      ++	/* If we freshen a file other than the '.pack' file, we need to
      ++	 * make sure that file exists, creating it if necessary.
      ++	 */
     ++	filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf);
     ++	if (lstat(filename, &st) < 0) {
     ++		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
     ++		if (fd < 0) {
      ++			/* check again: another git process may have created it */
     ++			if (lstat(filename, &st) < 0)
     ++				die_errno("unable to create '%s'", filename);
     ++		} else {
     ++			close(fd);
     ++		}
     ++	} else {
     ++		if (!freshen_file(filename))
     ++			return 0;
     ++	}
      +
     -+	filename = derive_pack_filename(e.p->pack_name, "pack", "idx", &name_buf);
     -+	if (!freshen_file(filename))
     - 		return 0;
       	e.p->freshened = 1;
       	return 1;
     + }
      
       ## packfile.c ##
      @@ packfile.c: char *sha1_pack_index_name(const unsigned char *sha1)
     @@ packfile.c: char *sha1_pack_index_name(const unsigned char *sha1)
       static unsigned int pack_used_ctr;
       static unsigned int pack_mmap_calls;
       static unsigned int peak_pack_open_windows;
     -@@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
     - 	size_t alloc;
     - 	struct packed_git *p;
     - 
     -+	if (stat(path, &st) || !S_ISREG(st.st_mode)) {
     -+		return NULL;
     -+	}
     -+
     - 	/*
     - 	 * Make sure a corresponding .pack file exists and that
     - 	 * the index looks sane.
     -@@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
     - 	alloc = st_add3(path_len, strlen(".promisor"), 1);
     - 	p = alloc_packed_git(alloc);
     - 	memcpy(p->pack_name, path, path_len);
     -+	p->mtime = st.st_mtime;
     - 
     - 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".keep");
     - 	if (!access(p->pack_name, F_OK))
      @@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
       	 */
       	p->pack_size = st.st_size;
       	p->pack_local = local;
     --	p->mtime = st.st_mtime;
     ++
      ++	/* If a different file is used to freshen the mtime, prefer
      ++	 * its timestamp over the '.idx' file's.
      ++	 */
      ++	if (strcasecmp(pack_mtime_suffix, "pack")) {
     ++		struct strbuf name_buf = STRBUF_INIT;
     ++		const char *filename;
     ++
     ++		filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf);
     ++		stat(filename, &st);
     ++	}
     + 	p->mtime = st.st_mtime;
       	if (path_len < the_hash_algo->hexsz ||
       	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
     - 		hashclr(p->hash);
      
       ## packfile.h ##
      @@ packfile.h: char *sha1_pack_name(const unsigned char *sha1);
     @@ packfile.h: char *sha1_pack_name(const unsigned char *sha1);
      
       ## t/t7701-repack-unpack-unreachable.sh ##
      @@ t/t7701-repack-unpack-unreachable.sh: test_expect_success 'do not bother loosening old objects' '
     - 	git prune-packed &&
     - 	git cat-file -p $obj1 &&
     - 	git cat-file -p $obj2 &&
     --	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
     -+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.idx &&
     - 	git repack -A -d --unpack-unreachable=1.hour.ago &&
     - 	git cat-file -p $obj1 &&
       	test_must_fail git cat-file -p $obj2
     + '
     + 
     ++test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' '
     ++	obj1=$(echo three | git hash-object -w --stdin) &&
     ++	obj2=$(echo four | git hash-object -w --stdin) &&
     ++	pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
     ++	pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
     ++	git -c core.packmtimesuffix=bump prune-packed &&
     ++	git cat-file -p $obj1 &&
     ++	git cat-file -p $obj2 &&
     ++	touch .git/objects/pack/pack-$pack2.bump &&
     ++	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
     ++	git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago &&
     ++	git cat-file -p $obj1 &&
     ++	test_must_fail git cat-file -p $obj2
     ++'
     ++
     + test_expect_success 'keep packed objects found only in index' '
     + 	echo my-unique-content >file &&
     + 	git add file &&


 Documentation/config/core.txt        | 12 ++++++++++
 builtin/index-pack.c                 | 19 +++-------------
 cache.h                              |  1 +
 config.c                             |  4 ++++
 environment.c                        |  1 +
 object-file.c                        | 33 ++++++++++++++++++++++++++--
 packfile.c                           | 24 ++++++++++++++++++++
 packfile.h                           |  7 ++++++
 t/t7701-repack-unpack-unreachable.sh | 15 +++++++++++++
 9 files changed, 98 insertions(+), 18 deletions(-)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a1..9a256992d8c 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -398,6 +398,18 @@ the largest projects.  You probably do not need to adjust this value.
 +
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
+core.packMtimeSuffix::
+	Normally we avoid writing an existing object by freshening the
+	mtime of the *.pack file that contains it, in order to aid
+	processes such as prune. Freshening a different file instead of
+	the *.pack file avoids the file system cache re-syncing the
+	large packfiles, and consequently makes git commands faster.
++
+The default is 'pack', which means the *.pack file itself is freshened.
+You can configure a different suffix to use; a file with that suffix
+will be created automatically. It is better not to use any known
+suffix such as 'idx', 'keep' or 'promisor'.
+
 core.deltaBaseCacheLimit::
 	Maximum number of bytes per thread to reserve for caching base objects
 	that may be referenced by multiple deltified objects.  By storing the
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 3fbc5d70777..60bacc8ee7f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1437,19 +1437,6 @@ static void fix_unresolved_deltas(struct hashfile *f)
 	free(sorted_by_pos);
 }
 
-static const char *derive_filename(const char *pack_name, const char *strip,
-				   const char *suffix, struct strbuf *buf)
-{
-	size_t len;
-	if (!strip_suffix(pack_name, strip, &len) || !len ||
-	    pack_name[len - 1] != '.')
-		die(_("packfile name '%s' does not end with '.%s'"),
-		    pack_name, strip);
-	strbuf_add(buf, pack_name, len);
-	strbuf_addstr(buf, suffix);
-	return buf->buf;
-}
-
 static void write_special_file(const char *suffix, const char *msg,
 			       const char *pack_name, const unsigned char *hash,
 			       const char **report)
@@ -1460,7 +1447,7 @@ static void write_special_file(const char *suffix, const char *msg,
 	int msg_len = strlen(msg);
 
 	if (pack_name)
-		filename = derive_filename(pack_name, "pack", suffix, &name_buf);
+		filename = derive_pack_filename(pack_name, "pack", suffix, &name_buf);
 	else
 		filename = odb_pack_name(&name_buf, hash, suffix);
 
@@ -1855,13 +1842,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (from_stdin && hash_algo)
 		die(_("--object-format cannot be used with --stdin"));
 	if (!index_name && pack_name)
-		index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf);
+		index_name = derive_pack_filename(pack_name, "pack", "idx", &index_name_buf);
 
 	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
 	if (rev_index) {
 		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
 		if (index_name)
-			rev_index_name = derive_filename(index_name,
+			rev_index_name = derive_pack_filename(index_name,
 							 "idx", "rev",
 							 &rev_index_name_buf);
 	}
diff --git a/cache.h b/cache.h
index ba04ff8bd36..d95b3d127e0 100644
--- a/cache.h
+++ b/cache.h
@@ -956,6 +956,7 @@ extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
 extern unsigned long pack_size_limit_cfg;
+extern const char *pack_mtime_suffix;
 
 /*
  * Accessors for the core.sharedrepository config which lazy-load the value
diff --git a/config.c b/config.c
index f9c400ad306..27a6b619ed8 100644
--- a/config.c
+++ b/config.c
@@ -1431,6 +1431,10 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.packmtimesuffix")) {
+		return git_config_string(&pack_mtime_suffix, var, value);
+	}
+
 	if (!strcmp(var, "core.deltabasecachelimit")) {
 		delta_base_cache_limit = git_config_ulong(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 2f27008424a..07c21e1a934 100644
--- a/environment.c
+++ b/environment.c
@@ -43,6 +43,7 @@ const char *git_hooks_path;
 int zlib_compression_level = Z_BEST_SPEED;
 int core_compression_level;
 int pack_compression_level = Z_DEFAULT_COMPRESSION;
+const char *pack_mtime_suffix = "pack";
 int fsync_object_files;
 size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
diff --git a/object-file.c b/object-file.c
index f233b440b22..b3e77213c42 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1974,12 +1974,41 @@ static int freshen_loose_object(const struct object_id *oid)
 static int freshen_packed_object(const struct object_id *oid)
 {
 	struct pack_entry e;
+	struct stat st;
+	struct strbuf name_buf = STRBUF_INIT;
+	const char *filename;
+
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
 	if (e.p->freshened)
 		return 1;
-	if (!freshen_file(e.p->pack_name))
-		return 0;
+
+	filename = e.p->pack_name;
+	if (!strcasecmp(pack_mtime_suffix, "pack")) {
+		if (!freshen_file(filename))
+			return 0;
+		e.p->freshened = 1;
+		return 1;
+	}
+
+	/* If we want to freshen different file instead of .pack file, we need
+	 * to make sure the file exists and create it if needed.
+	 */
+	filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf);
+	if (lstat(filename, &st) < 0) {
+		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
+		if (fd < 0) {
+			// here we need to check it again because other git process may created it
+			if (lstat(filename, &st) < 0)
+				die_errno("unable to create '%s'", filename);
+		} else {
+			close(fd);
+		}
+	} else {
+		if (!freshen_file(filename))
+			return 0;
+	}
+
 	e.p->freshened = 1;
 	return 1;
 }
diff --git a/packfile.c b/packfile.c
index 755aa7aec5e..a607dda4e25 100644
--- a/packfile.c
+++ b/packfile.c
@@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1)
 	return odb_pack_name(&buf, sha1, "idx");
 }
 
+const char *derive_pack_filename(const char *pack_name, const char *strip,
+				const char *suffix, struct strbuf *buf)
+{
+	size_t len;
+	if (!strip_suffix(pack_name, strip, &len) || !len ||
+	    pack_name[len - 1] != '.')
+		die(_("packfile name '%s' does not end with '.%s'"),
+		    pack_name, strip);
+	strbuf_add(buf, pack_name, len);
+	strbuf_addstr(buf, suffix);
+	return buf->buf;
+}
+
 static unsigned int pack_used_ctr;
 static unsigned int pack_mmap_calls;
 static unsigned int peak_pack_open_windows;
@@ -727,6 +740,17 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	 */
 	p->pack_size = st.st_size;
 	p->pack_local = local;
+
+	/* If we have different file used to freshen the mtime, we should
+	 * use it at a higher priority.
+	 */
+	if (!!strcasecmp(pack_mtime_suffix, "pack")) {
+		struct strbuf name_buf = STRBUF_INIT;
+		const char *filename;
+
+		filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf);
+		stat(filename, &st);
+	}
 	p->mtime = st.st_mtime;
 	if (path_len < the_hash_algo->hexsz ||
 	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
diff --git a/packfile.h b/packfile.h
index 3ae117a8aef..ff702b22e6a 100644
--- a/packfile.h
+++ b/packfile.h
@@ -31,6 +31,13 @@ char *sha1_pack_name(const unsigned char *sha1);
  */
 char *sha1_pack_index_name(const unsigned char *sha1);
 
+/*
+ * Return the corresponding filename with given suffix from "file_name"
+ * which must has "strip" suffix.
+ */
+const char *derive_pack_filename(const char *file_name, const char *strip,
+		const char *suffix, struct strbuf *buf);
+
 /*
  * Return the basename of the packfile, omitting any containing directory
  * (e.g., "pack-1234abcd[...].pack").
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index 937f89ee8c8..15828a318c4 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -112,6 +112,21 @@ test_expect_success 'do not bother loosening old objects' '
 	test_must_fail git cat-file -p $obj2
 '
 
+test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' '
+	obj1=$(echo three | git hash-object -w --stdin) &&
+	obj2=$(echo four | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
+	git -c core.packmtimesuffix=bump prune-packed &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	touch .git/objects/pack/pack-$pack2.bump &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
+	git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	test_must_fail git cat-file -p $obj2
+'
+
 test_expect_success 'keep packed objects found only in index' '
 	echo my-unique-content >file &&
 	git add file &&

base-commit: d486ca60a51c9cb1fe068803c3f540724e95e83a
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14  1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget
@ 2021-07-14  1:39   ` Ævar Arnfjörð Bjarmason
  2021-07-14  2:52     ` Taylor Blau
  2021-07-14 16:11     ` Sun Chao
  2021-07-19 19:53   ` [PATCH v3] " Sun Chao via GitGitGadget
  1 sibling, 2 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-14  1:39 UTC (permalink / raw)
  To: Sun Chao via GitGitGadget; +Cc: git, Sun Chao


On Wed, Jul 14 2021, Sun Chao via GitGitGadget wrote:

> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index c04f62a54a1..9a256992d8c 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -398,6 +398,18 @@ the largest projects.  You probably do not need to adjust this value.
>  +
>  Common unit suffixes of 'k', 'm', or 'g' are supported.
>  
> +core.packMtimeSuffix::
> +	Normally we avoid writing existing object by freshening the mtime
> +	of the *.pack file which contains it in order to aid some processes
> +	such like prune. Use different file instead of *.pack file will
> +	avoid file system cache re-sync the large packfiles, and consequently
> +	make git commands faster.
> ++
> +The default is 'pack' which means the *.pack file will be freshened by
> +default. You can configure a different suffix to use, the file with the
> +suffix will be created automatically, it's better not using any known
> +suffix such like 'idx', 'keep', 'promisor'.
> +

Hrm, per my v1 feedback (and I'm not sure if my suggestion is even good
here, there's others more familiar with this area than I am), I was
thinking of something like a *.bump file written via:

    core.packUseBumpFiles=bool

Or something like that, anyway, the edge case in allowing the user to
pick arbitrary suffixes is that we'd get in-the-wild user arbitrary
configuration squatting on a relatively sensitive part of the object
store.

E.g. we recently added *.rev files to go with
*.{pack,idx,bitmap,keep,promisor} (and I'm probably forgetting some
suffix). What if before that a user had set:

    core.packMtimeSuffix=rev

In practice it's probably too obscure to worry about, but I think it's
still worth it to only strictly write things we decide to write into the
object store.
>  core.deltaBaseCacheLimit::
>  	Maximum number of bytes per thread to reserve for caching base objects
>  	that may be referenced by multiple deltified objects.  By storing the
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 3fbc5d70777..60bacc8ee7f 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1437,19 +1437,6 @@ static void fix_unresolved_deltas(struct hashfile *f)
>  	free(sorted_by_pos);
>  }
>  
> -static const char *derive_filename(const char *pack_name, const char *strip,
> -				   const char *suffix, struct strbuf *buf)
> -{
> -	size_t len;
> -	if (!strip_suffix(pack_name, strip, &len) || !len ||
> -	    pack_name[len - 1] != '.')
> -		die(_("packfile name '%s' does not end with '.%s'"),
> -		    pack_name, strip);
> -	strbuf_add(buf, pack_name, len);
> -	strbuf_addstr(buf, suffix);
> -	return buf->buf;
> -}


Would be more readable to split this series up into at least two
patches, starting with just moving this function as-is to the new
location (just renaming it & moving it), and then using it. I don't
think there's changes to it, but right now I'm just eyeballing the
diff. It's more obvious if it's split up.

> +	if (!strcmp(var, "core.packmtimesuffix")) {
> +		return git_config_string(&pack_mtime_suffix, var, value);

Can drop the {} braces here, per Documentation/CodingGuidelines

> +const char *pack_mtime_suffix = "pack";

I can see how having a configurable suffix made the implementation
easier, perhaps that's how it started?


>  int fsync_object_files;
>  size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
>  size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
> diff --git a/object-file.c b/object-file.c
> index f233b440b22..b3e77213c42 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1974,12 +1974,41 @@ static int freshen_loose_object(const struct object_id *oid)
>  static int freshen_packed_object(const struct object_id *oid)
>  {
>  	struct pack_entry e;
> +	struct stat st;
> +	struct strbuf name_buf = STRBUF_INIT;
> +	const char *filename;
> +
>  	if (!find_pack_entry(the_repository, oid, &e))
>  		return 0;
>  	if (e.p->freshened)
>  		return 1;
> -	if (!freshen_file(e.p->pack_name))
> -		return 0;
> +
> +	filename = e.p->pack_name;
> +	if (!strcasecmp(pack_mtime_suffix, "pack")) {
> +		if (!freshen_file(filename))
> +			return 0;
> +		e.p->freshened = 1;
> +		return 1;
> +	}
> +
> +	/* If we want to freshen different file instead of .pack file, we need
> +	 * to make sure the file exists and create it if needed.
> +	 */
> +	filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf);

You populate name_buf here, but don't strbuf_release(&name_buf) it at the end of this function.

> +	if (lstat(filename, &st) < 0) {
> +		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
> +		if (fd < 0) {
> +			// here we need to check it again because other git process may created it

/* */ comments, not //, if it's needed at all. Covered in CodingGuidelines

> +			if (lstat(filename, &st) < 0)
> +				die_errno("unable to create '%s'", filename);

If we can't create this specific file here shouldn't we just continue
silently at this point? Surely if this process is screwed we're just
about to die on something more important?

And lstat() can also return transitory errors that don't indicate
"unable to create", e.g. maybe we can out of memory the kernel is
willing to give us or something (just skimming the lstat manpage).
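One way to sidestep both the die() and the transient-error concern is to tolerate open(2) failures entirely and let the caller fall back to freshening the .pack file itself. A minimal sketch under that assumption (create_or_freshen() is a hypothetical helper, not git API; it drops O_EXCL so a racing creator is harmless):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <utime.h>

/* Hypothetical helper, not git API: make sure "filename" exists and
 * bump its mtime to now.  Returns 1 on success, 0 on failure so the
 * caller can quietly fall back to freshening the .pack file itself. */
static int create_or_freshen(const char *filename)
{
	/*
	 * O_CREAT without O_EXCL: if another process won the race and
	 * created the file first, we simply open the existing one, so
	 * there is no need to lstat() again or to die().
	 */
	int fd = open(filename, O_CREAT | O_WRONLY, 0664);
	if (fd < 0)
		return 0;
	close(fd);
	/* utime(path, NULL) sets atime and mtime to the current time. */
	return utime(filename, NULL) == 0;
}
```

With this shape the only failure mode is "could not freshen", which the existing code already knows how to report.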

> +		} else {
> +			close(fd);
> +		}
> +	} else {
> +		if (!freshen_file(filename))
> +			return 0;

Style/indentation: just do "} else if (!freshen..." ?

> +	}
> +
>  	e.p->freshened = 1;
>  	return 1;
>  }
> diff --git a/packfile.c b/packfile.c
> index 755aa7aec5e..a607dda4e25 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1)
>  	return odb_pack_name(&buf, sha1, "idx");
>  }
>  
> +const char *derive_pack_filename(const char *pack_name, const char *strip,
> +				const char *suffix, struct strbuf *buf)
> +{
> +	size_t len;
> +	if (!strip_suffix(pack_name, strip, &len) || !len ||
> +	    pack_name[len - 1] != '.')
> +		die(_("packfile name '%s' does not end with '.%s'"),
> +		    pack_name, strip);
> +	strbuf_add(buf, pack_name, len);
> +	strbuf_addstr(buf, suffix);
> +	return buf->buf;
> +}

Just have this return void?

>  static unsigned int pack_used_ctr;
>  static unsigned int pack_mmap_calls;
>  static unsigned int peak_pack_open_windows;
> @@ -727,6 +740,17 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
>  	 */
>  	p->pack_size = st.st_size;
>  	p->pack_local = local;
> +
> +	/* If we have different file used to freshen the mtime, we should
> +	 * use it at a higher priority.
> +	 */
> +	if (!!strcasecmp(pack_mtime_suffix, "pack")) {
> +		struct strbuf name_buf = STRBUF_INIT;
> +		const char *filename;
> +
> +		filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf);
> +		stat(filename, &st);

I.e. the "filename" here isn't needed, just call derive_pack_filename()
and use name_buf.buf to stat.

Also: We should check the stat return value here & report errno if
needed, no?

> +test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' '
> +	obj1=$(echo three | git hash-object -w --stdin) &&
> +	obj2=$(echo four | git hash-object -w --stdin) &&
> +	pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
> +	pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
> +	git -c core.packmtimesuffix=bump prune-packed &&
> +	git cat-file -p $obj1 &&
> +	git cat-file -p $obj2 &&
> +	touch .git/objects/pack/pack-$pack2.bump &&
> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
> +	git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago &&

On command-lines we can spell it camel-cased, e.g. -c
core.packMtimeSuffix[...].


* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14  1:39   ` Ævar Arnfjörð Bjarmason
@ 2021-07-14  2:52     ` Taylor Blau
  2021-07-14 16:46       ` Sun Chao
  2021-07-14 16:11     ` Sun Chao
  1 sibling, 1 reply; 34+ messages in thread
From: Taylor Blau @ 2021-07-14  2:52 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Sun Chao via GitGitGadget, git, Sun Chao

On Wed, Jul 14, 2021 at 03:39:18AM +0200, Ævar Arnfjörð Bjarmason wrote:
> Hrm, per my v1 feedback (and I'm not sure if my suggestion is even good
> here, there's others more familiar with this area than I am), I was
> thinking of something like a *.bump file written via:
>
>     core.packUseBumpFiles=bool
>
> Or something like that, anyway, the edge case in allowing the user to
> pick arbitrary suffixes is that we'd get in-the-wild user arbitrary
> configuration squatting on a relatively sensitive part of the object
> store.
>
> E.g. we recently added *.rev files to go with
> *.{pack,idx,bitmap,keep,promisor} (and I'm probably forgetting some
> suffix). What if before that a user had set:
>
>     core.packMtimeSuffix=rev

I think making the suffix configurable is probably a mistake. It seems
like an unnecessary detail to expose, but it also forces us to think
about cases like these where the configured suffix is already used for
some other purpose.

I don't think that a new ".bump" file is a bad idea, but it does seem
like we have a lot of files that represent a relatively little amount of
the state that a pack can be in. The ".promisor" and ".keep" files both
come to mind here. Some thoughts in this direction:

  - Combining *all* of the pack-related files (including the index,
    reverse-index, bitmap, and so on) into a single "pack-meta" file
    seems like a mistake for caching reasons.

  - But a meta file that contains just the small state (like promisor
    information and whether or not the pack is "kept") seems like it
    could be OK. On the other hand, being able to tweak the kept state
    by touching or deleting a file is convenient (and having to rewrite
    a meta file containing other information is much less so).

But a ".bump" file does seem like an awkward way to not rely on the
mtime of the pack itself. And I do think it runs into compatibility
issues like Ævar mentioned. Any version of Git that includes a
hypothetical .bump file (or something like it) needs to also update the
pack's mtime, too, so that old versions of Git can understand it. (Of
course, that could be configurable, but that seems far too obscure to
me).

Stepping back, I'm not sure I understand why freshening a pack is so
slow for you. freshen_file() just calls utime(2), and any sync back to
the disk shouldn't need to update the pack itself, just a couple of
fields in its inode. Maybe you could help explain further.

In any case, I couldn't find a spot in your patch that updates the
packed_git's 'mtime' field, which is used to (a) sort packs in the
linked list of packs, and (b) for determining the least-recently used
pack if it has individual windows mmapped.
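For reference, the freshening in question amounts to a single timestamp update. A minimal sketch of that behavior (freshen() is an illustrative stand-in for git's freshen_file(), not the real function): only the inode's timestamps are dirtied, so the file's data blocks stay cached.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <utime.h>

/* Sketch of what git's freshen_file() boils down to: one utime(2)
 * call.  Only the inode's timestamps are dirtied; the file's data
 * blocks (and st_size) are untouched, which is why a freshen alone
 * should not evict a pack's contents from the page cache. */
static int freshen(const char *path)
{
	return utime(path, NULL) == 0; /* NULL => set times to "now" */
}
```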

Thanks,
Taylor


* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14  1:39   ` Ævar Arnfjörð Bjarmason
  2021-07-14  2:52     ` Taylor Blau
@ 2021-07-14 16:11     ` Sun Chao
  1 sibling, 0 replies; 34+ messages in thread
From: Sun Chao @ 2021-07-14 16:11 UTC (permalink / raw)
  To: git; +Cc: Sun Chao via GitGitGadget



> On Jul 14, 2021, at 09:39, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> 
>> +The default is 'pack' which means the *.pack file will be freshened by
>> +default. You can configure a different suffix to use, the file with the
>> +suffix will be created automatically, it's better not using any known
>> +suffix such like 'idx', 'keep', 'promisor'.
>> +
> 
> Hrm, per my v1 feedback (and I'm not sure if my suggestion is even good
> here, there's others more familiar with this area than I am), I was
> thinking of something like a *.bump file written via:
> 
>    core.packUseBumpFiles=bool
> 
> Or something like that, anyway, the edge case in allowing the user to
> pick arbitrary suffixes is that we'd get in-the-wild user arbitrary
> configuration squatting on a relatively sensitive part of the object
> store.
> 
> E.g. we recently added *.rev files to go with
> *.{pack,idx,bitmap,keep,promisor} (and I'm probably forgetting some
> suffix). What if before that a user had set:
> 
>    core.packMtimeSuffix=rev
> 
> In practice it's probably too obscure to worry about, but I think it's
> still worth it to only strictly write things we decide to write into the
> object store.

Thanks, this makes sense. Allowing users to pick arbitrary suffixes may
cause unpredictable problems, e.g. a user who does not know about the
*.keep files setting core.packMtimeSuffix=keep, so I think it's right to
restrict the suffix here.

>> core.deltaBaseCacheLimit::
>> 	Maximum number of bytes per thread to reserve for caching base objects
>> 	that may be referenced by multiple deltified objects.  By storing the
>> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
>> index 3fbc5d70777..60bacc8ee7f 100644
>> --- a/builtin/index-pack.c
>> +++ b/builtin/index-pack.c
>> @@ -1437,19 +1437,6 @@ static void fix_unresolved_deltas(struct hashfile *f)
>> 	free(sorted_by_pos);
>> }
>> 
>> -static const char *derive_filename(const char *pack_name, const char *strip,
>> -				   const char *suffix, struct strbuf *buf)
>> -{
>> -	size_t len;
>> -	if (!strip_suffix(pack_name, strip, &len) || !len ||
>> -	    pack_name[len - 1] != '.')
>> -		die(_("packfile name '%s' does not end with '.%s'"),
>> -		    pack_name, strip);
>> -	strbuf_add(buf, pack_name, len);
>> -	strbuf_addstr(buf, suffix);
>> -	return buf->buf;
>> -}
> 
> 
> Would be more readable to split this series up into at least two
> patches, starting with just moving this function as-is to a the new
> location (just renaming it & moving it), and then using it. I don't
> think there's changes to it, but right now I'm just eyeballing the
> diff. It's more obvious if it's split up.

Thanks, I will do it.

> 
>> +	if (!strcmp(var, "core.packmtimesuffix")) {
>> +		return git_config_string(&pack_mtime_suffix, var, value);
> 
> Can drop the {} braces here, per Documentation/CodingGuidelines

Thanks, I will read the CodingGuidelines again and fix the issue.

> 
>> +const char *pack_mtime_suffix = "pack";
> 
> I can see how having a configurable suffix made the implementation
> easier, perhaps that's how it started?
> 

Yes, here is the default value.

> 
>> int fsync_object_files;
>> size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
>> size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
>> diff --git a/object-file.c b/object-file.c
>> index f233b440b22..b3e77213c42 100644
>> --- a/object-file.c
>> +++ b/object-file.c
>> @@ -1974,12 +1974,41 @@ static int freshen_loose_object(const struct object_id *oid)
>> static int freshen_packed_object(const struct object_id *oid)
>> {
>> 	struct pack_entry e;
>> +	struct stat st;
>> +	struct strbuf name_buf = STRBUF_INIT;
>> +	const char *filename;
>> +
>> 	if (!find_pack_entry(the_repository, oid, &e))
>> 		return 0;
>> 	if (e.p->freshened)
>> 		return 1;
>> -	if (!freshen_file(e.p->pack_name))
>> -		return 0;
>> +
>> +	filename = e.p->pack_name;
>> +	if (!strcasecmp(pack_mtime_suffix, "pack")) {
>> +		if (!freshen_file(filename))
>> +			return 0;
>> +		e.p->freshened = 1;
>> +		return 1;
>> +	}
>> +
>> +	/* If we want to freshen different file instead of .pack file, we need
>> +	 * to make sure the file exists and create it if needed.
>> +	 */
>> +	filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf);
> 
> You populate name_buf here, but don't strbuf_release(&name_buf) it at the end of this function.

Will fix it.

> 
>> +	if (lstat(filename, &st) < 0) {
>> +		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
>> +		if (fd < 0) {
>> +			// here we need to check it again because other git process may created it
> 
> /* */ comments, not //, if it's needed at all. Covered in CodingGuidelines

Will fix it.

> 
>> +			if (lstat(filename, &st) < 0)
>> +				die_errno("unable to create '%s'", filename);
> 
> If we can't create this specific file here shouldn't we just continue
> silently at this point? Surely if this process is screwed we're just
> about to die on something more important?

Yes. Because the *.pack file exists here, we can fall back to freshening
the *.pack file if the specific file cannot be created. And when we want
to get the mtime, we can likewise read it from the *.pack file if the
specific file does not exist.

> 
> And lstat() can also return transitory errors that don't indicate
> "unable to create", e.g. maybe we can out of memory the kernel is
> willing to give us or something (just skimming the lstat manpage).

Thanks, I will think about it.

> 
>> +		} else {
>> +			close(fd);
>> +		}
>> +	} else {
>> +		if (!freshen_file(filename))
>> +			return 0;
> 
> Style/indentatino: just do "} else if (!freshen..." ?

Will fix it.
> 
>> +	}
>> +
>> 	e.p->freshened = 1;
>> 	return 1;
>> }
>> diff --git a/packfile.c b/packfile.c
>> index 755aa7aec5e..a607dda4e25 100644
>> --- a/packfile.c
>> +++ b/packfile.c
>> @@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1)
>> 	return odb_pack_name(&buf, sha1, "idx");
>> }
>> 
>> +const char *derive_pack_filename(const char *pack_name, const char *strip,
>> +				const char *suffix, struct strbuf *buf)
>> +{
>> +	size_t len;
>> +	if (!strip_suffix(pack_name, strip, &len) || !len ||
>> +	    pack_name[len - 1] != '.')
>> +		die(_("packfile name '%s' does not end with '.%s'"),
>> +		    pack_name, strip);
>> +	strbuf_add(buf, pack_name, len);
>> +	strbuf_addstr(buf, suffix);
>> +	return buf->buf;
>> +}
> 
> Just have this return void?

I renamed the original 'derive_filename' to 'derive_pack_filename'. It
is used like this:

    filename = derive_filename(pack_name, "pack", suffix, &name_buf);

so returning buf->buf makes the strbuf more convenient to use, and I
decided to keep it that way.
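To make the suffix-swap semantics concrete outside of git's codebase, here is a standalone sketch in plain C (swap_suffix() is an illustrative stand-in for derive_pack_filename(), using malloc'd strings instead of a strbuf):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for derive_pack_filename(): given a name that
 * ends with ".strip", return a malloc'd copy with the suffix swapped
 * ("name.suffix"), or NULL if the name does not match.  The real git
 * function die()s on mismatch and writes into a strbuf instead. */
static char *swap_suffix(const char *name, const char *strip,
			 const char *suffix)
{
	size_t nlen = strlen(name), slen = strlen(strip);
	char *out;

	if (nlen < slen + 1 || strcmp(name + nlen - slen, strip) ||
	    name[nlen - slen - 1] != '.')
		return NULL;
	out = malloc(nlen - slen + strlen(suffix) + 1);
	if (!out)
		return NULL;
	memcpy(out, name, nlen - slen);    /* keep "name." prefix */
	strcpy(out + nlen - slen, suffix); /* append new suffix */
	return out;
}
```

For example, swap_suffix("pack-1234.pack", "pack", "bump") yields "pack-1234.bump", matching how the patch derives the freshen-target filename from the pack name.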

> 
>> static unsigned int pack_used_ctr;
>> static unsigned int pack_mmap_calls;
>> static unsigned int peak_pack_open_windows;
>> @@ -727,6 +740,17 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
>> 	 */
>> 	p->pack_size = st.st_size;
>> 	p->pack_local = local;
>> +
>> +	/* If we have different file used to freshen the mtime, we should
>> +	 * use it at a higher priority.
>> +	 */
>> +	if (!!strcasecmp(pack_mtime_suffix, "pack")) {
>> +		struct strbuf name_buf = STRBUF_INIT;
>> +		const char *filename;
>> +
>> +		filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf);
>> +		stat(filename, &st);
> 
> I.e. the "filename" here isn't needed, just call derive_pack_filename()
> and use name.buf.buf to stat.

Let me think about it; it's a good idea.

> 
> Also: We should check the stat return value here & report errno if
> needed, no?

Here I think we should fall back to using the mtime of the *.pack file
instead. If the specific file does not exist, there are several possible
reasons: (a) the *.pack file was just created and its mtime has not been
freshened again, (b) we could not create the specific file for some
reason, in which case we will freshen the *.pack file instead, or (c)
someone or some process just deleted it, and we should try to carry on.

I don't know if it's right to do this, but I think we should avoid dying
here.
> 
>> +test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' '
>> +	obj1=$(echo three | git hash-object -w --stdin) &&
>> +	obj2=$(echo four | git hash-object -w --stdin) &&
>> +	pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
>> +	pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
>> +	git -c core.packmtimesuffix=bump prune-packed &&
>> +	git cat-file -p $obj1 &&
>> +	git cat-file -p $obj2 &&
>> +	touch .git/objects/pack/pack-$pack2.bump &&
>> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
>> +	git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago &&
> 
> On command-lines we can spell it camel-cased, e.g. -c
> core.packMtimeSuffix[...].
> 

Thanks, will do it.


* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14  2:52     ` Taylor Blau
@ 2021-07-14 16:46       ` Sun Chao
  2021-07-14 17:04         ` Taylor Blau
  0 siblings, 1 reply; 34+ messages in thread
From: Sun Chao @ 2021-07-14 16:46 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, Sun Chao via GitGitGadget,
	git



> On Jul 14, 2021, at 10:52, Taylor Blau <me@ttaylorr.com> wrote:
> 
> On Wed, Jul 14, 2021 at 03:39:18AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> Hrm, per my v1 feedback (and I'm not sure if my suggestion is even good
>> here, there's others more familiar with this area than I am), I was
>> thinking of something like a *.bump file written via:
>> 
>>    core.packUseBumpFiles=bool
>> 
>> Or something like that, anyway, the edge case in allowing the user to
>> pick arbitrary suffixes is that we'd get in-the-wild user arbitrary
>> configuration squatting on a relatively sensitive part of the object
>> store.
>> 
>> E.g. we recently added *.rev files to go with
>> *.{pack,idx,bitmap,keep,promisor} (and I'm probably forgetting some
>> suffix). What if before that a user had set:
>> 
>>    core.packMtimeSuffix=rev
> 
> I think making the suffix configurable is probably a mistake. It seems
> like an unnecessary detail to expose, but it also forces us to think
> about cases like these where the configured suffix is already used for
> some other purpose.

Thanks, I agree with you and will fix it; as with the *.keep files, we
should not use a suffix configuration to create such files.

> 
> I don't think that a new ".bump" file is a bad idea, but it does seem
> like we have a lot of files that represent a relatively little amount of
> the state that a pack can be in. The ".promisor" and ".keep" files both
> come to mind here. Some thoughts in this direction:
> 
>  - Combining *all* of the pack-related files (including the index,
>    reverse-index, bitmap, and so on) into a single "pack-meta" file
>    seems like a mistake for caching reasons.
> 
>  - But a meta file that contains just the small state (like promisor
>    information and whether or not the pack is "kept") seems like it
>    could be OK. On the other hand, being able to tweak the kept state
>    by touching or deleting a file is convenient (and having to rewrite
>    a meta file containing other information is much less so).

Yes, reading and rewriting a meta file also means we need to lock and
unlock it, which makes the operations inconvenient.

> 
> But a ".bump" file does seem like an awkward way to not rely on the
> mtime of the pack itself. And I do think it runs into compatibility
> issues like Ævar mentioned. Any version of Git that includes a
> hypothetical .bump file (or something like it) needs to also update the
> pack's mtime, too, so that old versions of Git can understand it. (Of
> course, that could be configurable, but that seems far too obscure to
> me).

Here we will have a configuration whose default is backward compatible;
if a user decides to use the '.bump' file, it means they cannot use old
versions of Git, much like bumping the repository format version limits
them.

> 
> Stepping back, I'm not sure I understand why freshening a pack is so
> slow for you. freshen_file() just calls utime(2), and any sync back to
> the disk shouldn't need to update the pack itself, just a couple of
> fields in its inode. Maybe you could help explain further.
> 
> In any case, I couldn't find a spot in your patch that updates the
> packed_git's 'mtime' field, which is used to (a) sort packs in the
> linked list of packs, and (b) for determining the least-recently used
> pack if it has individual windows mmapped.

The reason we want to avoid freshening the mtime of the ".pack" file is
to improve the read speed of our Git servers.

We have some large repositories on our Git servers (some are bigger than
10GB), and we created '.keep' files for the large ".pack" files. We want
the big files to stay unchanged to speed up git upload-pack, because in
our understanding the file system cache reduces disk IO when a file has
not changed.

However, we find that the mtime of ".pack" files changes over time,
which makes the file system keep reloading the big files; that takes a
lot of IO time, slows down git upload-pack, and eventually exhausts the
disk IOPS.

> 
> Thanks,
> Taylor
> 




* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 16:46       ` Sun Chao
@ 2021-07-14 17:04         ` Taylor Blau
  2021-07-14 18:19           ` Ævar Arnfjörð Bjarmason
  2021-07-15 16:30           ` Sun Chao
  0 siblings, 2 replies; 34+ messages in thread
From: Taylor Blau @ 2021-07-14 17:04 UTC (permalink / raw)
  To: Sun Chao
  Cc: Taylor Blau, Ævar Arnfjörð Bjarmason,
	Sun Chao via GitGitGadget, git

On Thu, Jul 15, 2021 at 12:46:47AM +0800, Sun Chao wrote:
> > Stepping back, I'm not sure I understand why freshening a pack is so
> > slow for you. freshen_file() just calls utime(2), and any sync back to
> > the disk shouldn't need to update the pack itself, just a couple of
> > fields in its inode. Maybe you could help explain further.
> >
> > [ ... ]
>
> The reason why we want to avoid freshen the mtime of ".pack" file is to
> improve the reading speed of Git Servers.
>
> We have some large repositories in our Git Severs (some are bigger than 10GB),
> and we created '.keep' files for large ".pack" files, we want the big files
> unchanged to speed up git upload-pack, because in our mind the file system
> cache will reduce the disk IO if a file does not changed.
>
> However we find the mtime of ".pack" files changes over time which makes the
> file system always reload the big files, that takes a lot of IO time and result
> in lower speed of git upload-pack and even further the disk IOPS is exhausted.

That's surprising behavior to me. Are you saying that calling utime(2)
causes the *page* cache to be invalidated and that most reads are
cache-misses lowering overall IOPS?

If so, then I am quite surprised ;). The only state that should be
dirtied by calling utime(2) is the inode itself, so the blocks referred
to by the inode corresponding to a pack should be left in-tact.

If you're on Linux, you can try observing the behavior of evicting
inodes, blocks, or both from the disk cache by changing "2" in the
following:

    hyperfine 'git pack-objects --all --stdout --delta-base-offset >/dev/null'
      --prepare='sync; echo 2 | sudo tee /proc/sys/vm/drop_caches'

where "1" drops the page cache, "2" drops the inodes, and "3" evicts
both.

I wonder if you could share the results of running the above varying
the value of "1", "2", and "3", as well as swapping the `--prepare` for
`--warmup=3` to warm your caches (and give us an idea of what your
expected performance is probably like).
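To make the claim above concrete, here is a hedged sketch (assuming Linux with GNU coreutils; `stat -c` and `sha1sum` are GNU flavors) showing that freshening a file's mtime, roughly what freshen_file()/utime(2) does, changes only inode metadata while the data blocks stay byte-identical:

```shell
# Create a stand-in for a packfile and record its content hash and mtime.
f=$(mktemp)
printf 'pretend this is pack data\n' > "$f"
before_sum=$(sha1sum < "$f")
before_mtime=$(stat -c %Y "$f")

sleep 1
touch -m "$f"          # update mtime only, analogous to utime(2)

after_sum=$(sha1sum < "$f")
after_mtime=$(stat -c %Y "$f")

# Content is unchanged; only the timestamp moved.
[ "$before_sum" = "$after_sum" ] && [ "$after_mtime" -gt "$before_mtime" ] \
  && echo "only mtime changed"
rm -f "$f"
```

Whether the *page cache* stays warm after this is exactly what the drop_caches experiment above is meant to measure.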

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 17:04         ` Taylor Blau
@ 2021-07-14 18:19           ` Ævar Arnfjörð Bjarmason
  2021-07-14 19:11             ` Martin Fick
  2021-07-14 19:30             ` Taylor Blau
  2021-07-15 16:30           ` Sun Chao
  1 sibling, 2 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-14 18:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Sun Chao, Taylor Blau, Sun Chao via GitGitGadget, git


On Wed, Jul 14 2021, Taylor Blau wrote:

> On Thu, Jul 15, 2021 at 12:46:47AM +0800, Sun Chao wrote:
>> > Stepping back, I'm not sure I understand why freshening a pack is so
>> > slow for you. freshen_file() just calls utime(2), and any sync back to
>> > the disk shouldn't need to update the pack itself, just a couple of
>> > fields in its inode. Maybe you could help explain further.
>> >
>> > [ ... ]
>>
>> The reason we want to avoid freshening the mtime of ".pack" files is to
>> improve the read speed of our Git servers.
>>
>> We have some large repositories on our Git servers (some are bigger than 10GB),
>> and we created '.keep' files for the large ".pack" files. We want the big files
>> to stay unchanged to speed up git upload-pack, because in our understanding the
>> file system cache will reduce disk IO if a file has not changed.
>>
>> However, we find the mtime of the ".pack" files changes over time, which makes
>> the file system keep reloading the big files. That takes a lot of IO time,
>> slows down git upload-pack, and can even exhaust the disk IOPS.
>
> That's surprising behavior to me. Are you saying that calling utime(2)
> causes the *page* cache to be invalidated and that most reads are
> cache-misses lowering overall IOPS?
>
> If so, then I am quite surprised ;). The only state that should be
> dirtied by calling utime(2) is the inode itself, so the blocks referred
> to by the inode corresponding to a pack should be left intact.
>
> If you're on Linux, you can try observing the behavior of evicting
> inodes, blocks, or both from the disk cache by changing "2" in the
> following:
>
>     hyperfine 'git pack-objects --all --stdout --delta-base-offset >/dev/null'
>       --prepare='sync; echo 2 | sudo tee /proc/sys/vm/drop_caches'
>
> where "1" drops the page cache, "2" drops the inodes, and "3" evicts
> both.
>
> I wonder if you could share the results of running the above varying
> the value of "1", "2", and "3", as well as swapping the `--prepare` for
> `--warmup=3` to warm your caches (and give us an idea of what your
> expected performance is probably like).

I think you may be right narrowly, but wrong in this context :)

I.e. my understanding of this problem is that they have some incremental
backup job, e.g. rsync without --checksum (not that using --checksum would
help; chicken & egg issue)...

So by changing the mtime you cause the file to be re-synced.

Yes, Linux (or hopefully any modern OS) isn't so dumb as to evict your FS
cache because of such a metadata change, but that's beside the point.

If you have a backup job like that your FS cache will get evicted or be
subject to churn anyway, because you'll shortly be having to deal with
the "rsync" job that's noticed the changed mtime competing for caching
resources with "real" traffic.
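The mechanism behind that re-sync can be sketched in a few lines of shell. This is not rsync's actual code, just an emulation of its documented default "quick check" (without --checksum, a file is re-transferred when size or mtime differ), so a mtime-only freshen of a multi-GB pack forces a full re-copy:

```shell
# would_transfer: emulate rsync's default quick check (size + mtime).
# GNU stat assumed.
would_transfer() {
    [ "$(stat -c '%s %Y' "$1")" != "$(stat -c '%s %Y' "$2")" ]
}

src=$(mktemp)
dst=$(mktemp)
printf 'pack data\n' > "$src"
cp -p "$src" "$dst"                 # -p preserves mtime, like rsync -a

would_transfer "$src" "$dst" || echo "in sync: file skipped"
sleep 1
touch -m "$src"                     # freshen mtime only, content unchanged
would_transfer "$src" "$dst" && echo "mtime differs: full re-copy"
rm -f "$src" "$dst"
```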

Sun: Does that summarize the problem you're having?

<large digression ahead>

Sun, also: Note that in general doing backups of live git repositories
with rsync is a bad idea, and will lead to corruption.

The most common cause of such corruption is that a tool like "rsync"
will iterate recursively through say "objects" followed by "refs".

So by the time it gets to the latter (or is doing a deep iteration
within those dirs) git's state has changed in such a way as to yield an
rsync backup in a state that the repository was never in.

(As an aside, I've often wondered what it is about git exactly that makes
people who'd never think of doing the same thing with the FS part of an
RDBMS's data store think that implementing such an ad-hoc backup
solution for git would be a good idea, but I digress. Perhaps we need
scarier, BerkeleyDB-looking names in the .git directory :)

Even if you do FS snapshots of live git repositories you're likely to
get corruption, search this mailing list for references to fsync(),
e.g. [1].

In short, git has historically been (and still is) sloppy about fsync, and
relied on non-standard behavior such as "if I do N updates for N=1..100,
and fsync just update 100, then I can assume 1..99 are fsynced" (spoiler:
you can't assume that).

Our use of fsync is still broken in that sense today; git is not a safe
place to store your data in the POSIXLY pedantic sense (and no, I don't
just mean that core.fsyncObjectFiles is `false` by default, it only
covers a small part of this, e.g. we don't fsync dir entries even with
that).

On a real live filesystem this is usually not an issue, because if
you're not dealing with yanked power cords (and even then, journals
might save you), then even if you fsync a file but don't fsync the dir
entry it's in, the FS is usually forgiving about such cases.

I.e. if someone does a concurrent request for the could-be-outdated dir
entry they'll service the up-to-date one, even without that having been
fsync'd, because the VFS layer isn't going to the synced disk; it's
checking its current state and servicing your request from that.

But at least some FS snapshot implementations have a habit of exposing
the most pedantic interpretation possible of FS semantics, and one that
you wouldn't ever get on a live FS. I.e. you might be hooking into the
equivalent of the order in which things are written to disk, and end up
with a state that would never have been exposed to a running program
(there would be a 1=1 correspondence if we fsync'd properly, which we
don't).

The best way to get backups of git repositories you know are correct is
to use git's own transport mechanisms, i.e. fetch/pull the data, or
create bundles from it. This would be the case even if we fixed all our
fsync issues, because doing so wouldn't help you in the case of a
bit-flip, but an "index-pack" on the other end will spot such issues.
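A minimal sketch of that approach using git bundle (paths are placeholders; the bundle is produced by a single git process enumerating refs and objects together, avoiding the objects-then-refs race rsync has, and "verify" runs index-pack-style checks that catch bit-flips):

```shell
cd /path/to/repo.git                       # placeholder: the live repository

# Snapshot all refs plus the objects they need into one file.
git bundle create /backups/repo.bundle --all

# Verify integrity of the backup (checksums the contained pack).
git bundle verify /backups/repo.bundle

# Restoring is just a clone:
# git clone --mirror /backups/repo.bundle restored.git
```

As Martin notes downthread, this covers the object/ref data only, not config, reflogs, or alternates.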

1. https://lore.kernel.org/git/20200917112830.26606-2-avarab@gmail.com/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 18:19           ` Ævar Arnfjörð Bjarmason
@ 2021-07-14 19:11             ` Martin Fick
  2021-07-14 19:41               ` Ævar Arnfjörð Bjarmason
  2021-07-14 19:30             ` Taylor Blau
  1 sibling, 1 reply; 34+ messages in thread
From: Martin Fick @ 2021-07-14 19:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget,
	git

On Wednesday, July 14, 2021 8:19:15 PM MDT Ævar Arnfjörð Bjarmason wrote:
> The best way to get backups of git repositories you know are correct is
> to use git's own transport mechanisms, i.e. fetch/pull the data, or
> create bundles from it. 

I don't think this is a fair recommendation since, unfortunately, it cannot
be used to create a full backup. It can back up the version-controlled
data, but not the repository's meta-data, i.e. configs, reflogs,
alternate setups...

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 18:19           ` Ævar Arnfjörð Bjarmason
  2021-07-14 19:11             ` Martin Fick
@ 2021-07-14 19:30             ` Taylor Blau
  2021-07-14 19:32               ` Ævar Arnfjörð Bjarmason
  2021-07-14 21:40               ` Junio C Hamano
  1 sibling, 2 replies; 34+ messages in thread
From: Taylor Blau @ 2021-07-14 19:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Sun Chao, Sun Chao via GitGitGadget, git

On Wed, Jul 14, 2021 at 08:19:15PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> The reason we want to avoid freshening the mtime of ".pack" files is to
> >> improve the read speed of our Git servers.
> >
> > That's surprising behavior to me. Are you saying that calling utime(2)
> > causes the *page* cache to be invalidated and that most reads are
> > cache-misses lowering overall IOPS?
>
> I think you may be right narrowly, but wrong in this context :)
>
> I.e. my understanding of this problem is that they have some incremental
> backup job, e.g. rsync without --checksum (not that doing that would
> help, chicken & egg issue)..

Ah, thanks for explaining. That's helpful, and changes my thinking.

Ideally, Sun would be able to use --checksum (if they are using rsync)
or some equivalent (if they are not). In other words, this seems like a
problem that Git shouldn't be bending over backwards for.

But if that isn't possible, then I find introducing a new file to
redefine the pack's mtime just to accommodate a backup system that
doesn't know better to be a poor justification for adding this
complexity. Especially since we agree that rsync-ing live Git
repositories is a bad idea in the first place ;).

If it were me, I would probably stop here and avoid pursuing this
further. But an OK middle ground might be core.freshenPackfiles=<bool>
to indicate whether or not packs can be freshened, or the objects
contained within them should just be rewritten loose.

Sun could then set this configuration to "false", implying:

  - That they would have more random loose objects, leading to some
    redundant work by their backup system.
  - But they wouldn't have to resync their huge packfiles.

...and we wouldn't have to introduce any new formats/file types to do
it. To me, that seems like a net-positive outcome.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 19:30             ` Taylor Blau
@ 2021-07-14 19:32               ` Ævar Arnfjörð Bjarmason
  2021-07-14 19:52                 ` Taylor Blau
  2021-07-14 21:40               ` Junio C Hamano
  1 sibling, 1 reply; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-14 19:32 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Taylor Blau, Sun Chao, Sun Chao via GitGitGadget, git


On Wed, Jul 14 2021, Taylor Blau wrote:

> On Wed, Jul 14, 2021 at 08:19:15PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >> The reason we want to avoid freshening the mtime of ".pack" files is to
>> >> improve the read speed of our Git servers.
>> >
>> > That's surprising behavior to me. Are you saying that calling utime(2)
>> > causes the *page* cache to be invalidated and that most reads are
>> > cache-misses lowering overall IOPS?
>>
>> I think you may be right narrowly, but wrong in this context :)
>>
>> I.e. my understanding of this problem is that they have some incremental
>> backup job, e.g. rsync without --checksum (not that doing that would
>> help, chicken & egg issue)..
>
> Ah, thanks for explaining. That's helpful, and changes my thinking.
>
> Ideally, Sun would be able to use --checksum (if they are using rsync)
> or some equivalent (if they are not). In other words, this seems like a
> problem that Git shouldn't be bending over backwards for.

Even with my strong opinions about rsync being bad for this use-case, in
practice it does work for a lot of people, e.g. with nightly jobs
etc. Not everyone's repository is insanely busy, which is where I've
mainly seen this sort of corruption.

In that case making people use --checksum is borderline inhumane :)

> But if that isn't possible, then I find introducing a new file to
> redefine the pack's mtime just to accommodate a backup system that
> doesn't know better to be a poor justification for adding this
> complexity. Especially since we agree that rsync-ing live Git
> repositories is a bad idea in the first place ;).
>
> If it were me, I would probably stop here and avoid pursuing this
> further. But an OK middle ground might be core.freshenPackfiles=<bool>
> to indicate whether or not packs can be freshened, or the objects
> contained within them should just be rewritten loose.
>
> Sun could then set this configuration to "false", implying:
>
>   - That they would have more random loose objects, leading to some
>     redundant work by their backup system.
>   - But they wouldn't have to resync their huge packfiles.
>
> ...and we wouldn't have to introduce any new formats/file types to do
> it. To me, that seems like a net-positive outcome.

This approach is getting quite close to my core.checkCollisions patch,
to the point of perhaps being indistinguishable in practice:
https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/

I.e. if you're happy to re-write out duplicate objects then you're going
to be ignoring the collision check and don't need to do it. It's not the
same in that you might skip writing objects you know are reachable, and
with the collision check off and not-so-thin packs you will/might get
more redundancy than you asked for.

But in practice with modern clients mostly/entirely sending you just the
things you need in the common case it might be close enough.

I mean, it addresses the expiry race in that an
unreachable-becomes-reachable-again race would be "solved" by just
re-writing that data, hrm, but there are probably aspects of that race
I'm not considering.

Anyway, in the sense that for a lot of systems syncing file additions is
a lot cheaper than rewrites, it might get you what you want...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 19:11             ` Martin Fick
@ 2021-07-14 19:41               ` Ævar Arnfjörð Bjarmason
  2021-07-14 20:20                 ` Martin Fick
  2021-07-15  8:23                 ` Son Luong Ngoc
  0 siblings, 2 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-14 19:41 UTC (permalink / raw)
  To: Martin Fick
  Cc: Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget,
	git


On Wed, Jul 14 2021, Martin Fick wrote:

> On Wednesday, July 14, 2021 8:19:15 PM MDT Ævar Arnfjörð Bjarmason wrote:
>> The best way to get backups of git repositories you know are correct is
>> to use git's own transport mechanisms, i.e. fetch/pull the data, or
>> create bundles from it. 
>
> I don't think this is a fair recommendation since, unfortunately, it cannot
> be used to create a full backup. It can back up the version-controlled
> data, but not the repository's meta-data, i.e. configs, reflogs,
> alternate setups...

*nod*

FWIW at an ex-job I helped systems administrators who'd produced such a
broken backup-via-rsync create a hybrid version as an interim
solution. I.e. it would sync the objects via git transport, and do an
rsync on a whitelist (or blacklist), so it would pick up config, but
exclude objects.

"Hybrid" because it was in a state of needing to deal with manual
tweaking of config.

But usually someone who's needing to thoroughly solve this backup
problem will inevitably end up with wanting to drive everything that's
not in the object or refstore from some external system, i.e. have
config be generated from puppet, a database etc., ditto for alternates
etc.

But even if you can't get to that point (or don't want to) I'd say aim
for the hybrid system.

This isn't some purely theoretical concern b.t.w., the system using
rsync like this was producing repos that wouldn't fsck all the time, and
it wasn't such a busy site.

I suspect (but haven't tried) that for someone who can't easily change
their backup solution they'd get most of the benefits of git-native
transport by having their "rsync" sync refs, then objects, not the other
way around. Glob order dictates that most backup systems will do
objects, then refs (which will of course, at that point, refer to
nonexisting objects).

It's still not safe, you'll still be subject to races, but probably a
lot better in practice.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 19:32               ` Ævar Arnfjörð Bjarmason
@ 2021-07-14 19:52                 ` Taylor Blau
  0 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2021-07-14 19:52 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Sun Chao, Sun Chao via GitGitGadget, git

On Wed, Jul 14, 2021 at 09:32:26PM +0200, Ævar Arnfjörð Bjarmason wrote:
> > But if that isn't possible, then I find introducing a new file to
> > redefine the pack's mtime just to accommodate a backup system that
> > doesn't know better to be a poor justification for adding this
> > complexity. Especially since we agree that rsync-ing live Git
> > repositories is a bad idea in the first place ;).
> >
> > If it were me, I would probably stop here and avoid pursuing this
> > further. But an OK middle ground might be core.freshenPackfiles=<bool>
> > to indicate whether or not packs can be freshened, or the objects
> > contained within them should just be rewritten loose.
> >
> > Sun could then set this configuration to "false", implying:
> >
> >   - That they would have more random loose objects, leading to some
> >     redundant work by their backup system.
> >   - But they wouldn't have to resync their huge packfiles.
> >
> > ...and we wouldn't have to introduce any new formats/file types to do
> > it. To me, that seems like a net-positive outcome.
>
> This approach is getting quite close to my core.checkCollisions patch,
> to the point of perhaps being indistinguishable in practice:
> https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/

Hmm, I'm not sure if I understand. That collision check is only done
during index-pack, and reading builtin/index-pack.c:check_collision(),
it looks like we only do it for large blobs anyway.

> I.e. if you're happy to re-write out duplicate objects then you're going
> to be ignoring the collision check and don't need to do it. It's not the
> same in that you might skip writing objects you know are reachable, and
> with the collisions check off and not-so-thin packs you will/might get
> more redundancy than you asked for.

We may be talking about different things, but if users are concerned
about SHA-1 collisions, then they should still be able to build with
DC_SHA1=YesPlease to catch shattered-style collisions.

Anyway, I think we may be a little in the weeds for what we are trying
to accomplish here. I'm thinking something along the lines of the
following (sans documentation and tests, of course ;)).

--- >8 ---

diff --git a/object-file.c b/object-file.c
index f233b440b2..87c9238365 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1971,9 +1971,22 @@ static int freshen_loose_object(const struct object_id *oid)
 	return check_and_freshen(oid, 1);
 }

+static int can_freshen_packs = -1;
+static int get_can_freshen_packs(void)
+{
+	 if (can_freshen_packs < 0) {
+		if (git_config_get_bool("core.freshenpackfiles",
+					&can_freshen_packs))
+			can_freshen_packs = 1;
+	 }
+	 return can_freshen_packs;
+}
+
 static int freshen_packed_object(const struct object_id *oid)
 {
 	struct pack_entry e;
+	if (!get_can_freshen_packs())
+		return 0;
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
 	if (e.p->freshened)
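If something like the above landed, usage on the server side might look as follows (hedged: core.freshenPackfiles is only a name sketched in this thread, not a released git option; setting an unknown key is harmless since git config stores arbitrary keys):

```shell
# Opt the repository out of pack freshening (hypothetical knob from the
# patch sketch above; released git does not interpret this key).
git config core.freshenPackfiles false

# Read it back, canonicalized as a boolean.
git config --type=bool core.freshenPackfiles    # prints "false"
```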

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 19:41               ` Ævar Arnfjörð Bjarmason
@ 2021-07-14 20:20                 ` Martin Fick
  2021-07-20  6:32                   ` Ævar Arnfjörð Bjarmason
  2021-07-15  8:23                 ` Son Luong Ngoc
  1 sibling, 1 reply; 34+ messages in thread
From: Martin Fick @ 2021-07-14 20:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget,
	git

On Wednesday, July 14, 2021 9:41:42 PM MDT you wrote:
> On Wed, Jul 14 2021, Martin Fick wrote:
> > On Wednesday, July 14, 2021 8:19:15 PM MDT Ævar Arnfjörð Bjarmason wrote:
> >> The best way to get backups of git repositories you know are correct is
> >> to use git's own transport mechanisms, i.e. fetch/pull the data, or
> >> create bundles from it.
> > 
> > I don't think this is a fair recommendation since, unfortunately, it
> > cannot be used to create a full backup. It can back up the
> > version-controlled data, but not the repository's meta-data, i.e.
> > configs, reflogs, alternate setups...
> 
> *nod*
> 
> FWIW at an ex-job I helped systems administrators who'd produced such a
> broken backup-via-rsync create a hybrid version as an interim
> solution. I.e. it would sync the objects via git transport, and do an
> rsync on a whitelist (or blacklist), so it would pick up config, but
> exclude objects.
> 
> "Hybrid" because it was in a state of needing to deal with manual
> tweaking of config.
> 
> But usually someone who's needing to thoroughly solve this backup
> problem will inevitably end up with wanting to drive everything that's
> not in the object or refstore from some external system, i.e. have
> config be generated from puppet, a database etc., ditto for alternates
> etc.
> 
> But even if you can't get to that point (or don't want to) I'd say aim
> for the hybrid system.
> 
> This isn't some purely theoretical concern b.t.w., the system using
> rsync like this was producing repos that wouldn't fsck all the time, and
> it wasn't such a busy site.
> 
> I suspect (but haven't tried) that for someone who can't easily change
> their backup solution they'd get most of the benefits of git-native
> transport by having their "rsync" sync refs, then objects, not the other
> way around. Glob order dictates that most backup systems will do
> objects, then refs (which will of course, at that point, refer to
> nonexisting objects).
> 
> It's still not safe, you'll still be subject to races, but probably a
> lot better in practice.

It would be great if git provided a command to do a reliable incremental 
backup, maybe it could copy things in the order that you mention?

However, most people will want to use the backup system they have and not a 
special git tool. Maybe git fsck should gain a switch that would rewind any 
refs to an older point that is not broken (using reflogs)? That way, most
backups would just work and be rewound to the point at which the backup 
started?
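No such fsck switch exists today; a hedged sketch of the manual equivalent is to use the reflog to move a ref back to its previous value (assumes reflogs are enabled — core.logAllRefUpdates, which is off by default in bare repos):

```shell
# Rewind the current branch to its previous reflog entry, e.g. the value
# it had before a backup/restore left it pointing at missing objects.
branch=$(git symbolic-ref --short HEAD)
git update-ref -m "rewind to pre-backup state" \
    "refs/heads/$branch" "$branch@{1}"
```

A real `git fsck` mode would presumably walk the reflog back until it found the newest entry whose objects all pass verification, rather than blindly taking `@{1}`.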

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 19:30             ` Taylor Blau
  2021-07-14 19:32               ` Ævar Arnfjörð Bjarmason
@ 2021-07-14 21:40               ` Junio C Hamano
  1 sibling, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2021-07-14 21:40 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, Taylor Blau, Sun Chao,
	Sun Chao via GitGitGadget, git

Taylor Blau <me@ttaylorr.com> writes:

> But if that isn't possible, then I find introducing a new file to
> redefine the pack's mtime just to accommodate a backup system that
> doesn't know better to be a poor justification for adding this
> complexity. Especially since we agree that rsync-ing live Git
> repositories is a bad idea in the first place ;).
>
> If it were me, I would probably stop here and avoid pursuing this
> further.

;-).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 19:41               ` Ævar Arnfjörð Bjarmason
  2021-07-14 20:20                 ` Martin Fick
@ 2021-07-15  8:23                 ` Son Luong Ngoc
  2021-07-20  6:29                   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 34+ messages in thread
From: Son Luong Ngoc @ 2021-07-15  8:23 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Martin Fick, Taylor Blau, Sun Chao, Taylor Blau,
	Sun Chao via GitGitGadget, git

Hi folks,

On Wed, Jul 14, 2021 at 10:03 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> *nod*
>
> FWIW at an ex-job I helped systems administrators who'd produced such a
> broken backup-via-rsync create a hybrid version as an interim
> solution. I.e. it would sync the objects via git transport, and do an
> rsync on a whitelist (or blacklist), so it would pick up config, but
> exclude objects.
>
> "Hybrid" because it was in a state of needing to deal with manual
> tweaking of config.
>
> But usually someone who's needing to thoroughly solve this backup
> problem will inevitably end up with wanting to drive everything that's
> not in the object or refstore from some external system, i.e. have
> config be generated from puppet, a database etc., ditto for alternates
> etc.
>
> But even if you can't get to that point (or don't want to) I'd say aim
> for the hybrid system.

FWIW, we are running our repos on top of a somewhat flickery DRBD setup, and
we decided to use both

  `git clone --upload-pack 'git -c transfer.hiderefs="!refs" upload-pack' --mirror`

and

  `tar`

to create 2 separate snapshots for backup in parallel (full backups,
not incremental).

In case of (manual) recovery, we first rely on the git snapshot, and if
there are any missing objects/refs, we try to get them from the tarball.

>
> This isn't some purely theoretical concern b.t.w., the system using
> rsync like this was producing repos that wouldn't fsck all the time, and
> it wasn't such a busy site.
>
> I suspect (but haven't tried) that for someone who can't easily change
> their backup solution they'd get most of the benefits of git-native
> transport by having their "rsync" sync refs, then objects, not the other
> way around. Glob order dictates that most backup systems will do
> objects, then refs (which will of course, at that point, refer to
> nonexisting objects).
>
> It's still not safe, you'll still be subject to races, but probably a
> lot better in practice.

I would love to get some guidance in official documentation on what is the best
practice around handling git data on the server side.

Is git-clone + git-bundle the go-to solution?
Should tar/rsync be avoided entirely, or is there a trade-off?

Thanks,
Son Luong.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 17:04         ` Taylor Blau
  2021-07-14 18:19           ` Ævar Arnfjörð Bjarmason
@ 2021-07-15 16:30           ` Sun Chao
  2021-07-15 16:42             ` Taylor Blau
  1 sibling, 1 reply; 34+ messages in thread
From: Sun Chao @ 2021-07-15 16:30 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Taylor Blau, Ævar Arnfjörð Bjarmason,
	Sun Chao via GitGitGadget, git



> 2021年7月15日 01:04,Taylor Blau <ttaylorr@github.com> 写道:
> 
> [...]
>> 
>> However, we find the mtime of the ".pack" files changes over time, which makes
>> the file system keep reloading the big files. That takes a lot of IO time,
>> slows down git upload-pack, and can even exhaust the disk IOPS.
> 
> That's surprising behavior to me. Are you saying that calling utime(2)
> causes the *page* cache to be invalidated and that most reads are
> cache-misses lowering overall IOPS?
> 
> If so, then I am quite surprised ;). The only state that should be
> dirtied by calling utime(2) is the inode itself, so the blocks referred
> to by the inode corresponding to a pack should be left intact.
> 
> If you're on Linux, you can try observing the behavior of evicting
> inodes, blocks, or both from the disk cache by changing "2" in the
> following:
> 
>    hyperfine 'git pack-objects --all --stdout --delta-base-offset >/dev/null'
>      --prepare='sync; echo 2 | sudo tee /proc/sys/vm/drop_caches'
> 
> where "1" drops the page cache, "2" drops the inodes, and "3" evicts
> both.
> 
> I wonder if you could share the results of running the above varying
> the value of "1", "2", and "3", as well as swapping the `--prepare` for
> `--warmup=3` to warm your caches (and give us an idea of what your
> expected performance is probably like).
> 
> Thanks,
> Taylor

I'm sorry to reply so late; I work long hours during the day, and the company
network cannot send external mail, so I can only reply after going home late at night.

Thanks again for your reply. My explanation of 'why the mtime is so important'
left out some information and was not clear enough, so I will give the details here:

Servers:
- We maintain a number of servers, each mounting NFS disks that hold our git
  repositories; some of them are very large (> 10GB) and we cannot reduce their size now
- There are many objects and large files in the git history, which results in some
  large '.pack' files in the '.git/objects/pack' directories
- We created a '.keep' file for each large '.pack' file, hoping the disk cache can
  reduce the NFS IOPS and serve contents from cache

Clients:
- Many CI systems keep downloading the git repositories at a very high
  frequency, e.g. we have seen different CI systems make 600 download requests via
  'git fetch' in a short period of time.
- Some developers are doing 'git push' at the same time and creating Pull Requests
  afterwards (which then trigger the CI), so the git servers perform update tasks
  which may cause the mtime of a '.pack' file to be freshened.

So, in this case there will be many 'git-upload-pack' processes running on the
git servers, and they all need to load the big '.pack' files. 'git-upload-pack'
will be faster if the disk cache is warmed up, and the NFS server will not be
so busy.

However, we find the IOPS of the NFS server are always exhausted and
'git-upload-pack' runs for a very long time. We noticed the mtime of the
'.pack' files changes over time, and one of my colleagues who is familiar with
the file system told me it is the mtime change that invalidates the disk caches.

So we want the caches to stay valid for a long time, which can speed up the
'git-upload-pack' processes.

I don't know whether `/proc/sys/vm/drop_caches` can help or not, but thanks for
your tips; I will try them and see if there are any differences.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-15 16:30           ` Sun Chao
@ 2021-07-15 16:42             ` Taylor Blau
  2021-07-15 16:48               ` Sun Chao
  0 siblings, 1 reply; 34+ messages in thread
From: Taylor Blau @ 2021-07-15 16:42 UTC (permalink / raw)
  To: Sun Chao
  Cc: Taylor Blau, Ævar Arnfjörð Bjarmason,
	Sun Chao via GitGitGadget, git

On Fri, Jul 16, 2021 at 12:30:18AM +0800, Sun Chao wrote:
> I'm sorry to reply so late; I work long hours during the day, and the
> company network cannot send external mail, so I can only reply after
> going home late at night.

There's no need to apologize :-).

> Thanks again for your reply. My explanation of 'why the mtime is so
> important' left out some information and was not clear enough, so I
> will give the details here:

Let me see if I can summarize here. Basically:

  - You have a number of servers that have NFS mounts which hold large
    repositories with packs in excess of 10 GB in size.
  - You have a lot of clients that are fetching, and a smaller number of
    clients that are pushing, some of which happen to freshen the mtimes
    of the packs.

...and the mtimes being updated cause the disk cache to be invalidated?

It's the last part that is so surprising to me. Ævar and I discussed
earlier in the thread that their understanding was that you had a backup
system which had to resynchronize an unchanged file because its metadata
had changed.

But this is different than that. If I understand what you're saying
correctly, then you're saying that the disk caches themselves are
invalidated by changing the mtime.

That is highly surprising to me, since the block cache should only be
invalidated if the *blocks* change, not metadata in the inode. It would
be good to confirm that this is actually what's happening.
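
As a quick local-filesystem sanity check of that distinction (purely
illustrative; it says nothing about any particular NFS client's cache
policy), freshening a file the way git does only touches inode metadata:

```shell
#!/bin/sh
# Touch a file the way git freshens a pack: the mtime (inode metadata)
# changes, but the file's data blocks stay exactly the same.
f=$(mktemp)
printf 'pack data\n' >"$f"
before=$(stat -c %Y "$f" 2>/dev/null || stat -f %m "$f")  # GNU or BSD stat
sleep 1
touch "$f"            # roughly what git's freshen_file() does via utime()
after=$(stat -c %Y "$f" 2>/dev/null || stat -f %m "$f")
[ "$after" -gt "$before" ] && echo "mtime changed"
printf 'pack data\n' | cmp -s - "$f" && echo "data blocks unchanged"
rm -f "$f"
```

Whether an NFS client then throws away its cached data because of that
metadata change is exactly the question to answer.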

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-15 16:42             ` Taylor Blau
@ 2021-07-15 16:48               ` Sun Chao
  0 siblings, 0 replies; 34+ messages in thread
From: Sun Chao @ 2021-07-15 16:48 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, Sun Chao via GitGitGadget,
	git



> 2021年7月16日 00:42,Taylor Blau <me@ttaylorr.com> 写道:
> 
> On Fri, Jul 16, 2021 at 12:30:18AM +0800, Sun Chao wrote:
>> I'm sorry to reply so late; I work long hours during the day, and the
>> company network cannot send external mail, so I can only reply to you
>> when I get home late at night.
> 
> There's no need to apologize :-).
> 
>> Thanks for your reply again. My explanation of 'why the mtime is so
>> important' left out some information and was not clear enough, so I
>> will give the details here:
> 
> Let me see if I can summarize here. Basically:
> 
>  - You have a number of servers that have NFS mounts which hold large
>    repositories with packs in excess of 10 GB in size.
>  - You have a lot of clients that are fetching, and a smaller number of
>    clients that are pushing, some of which happen to freshen the mtimes
>    of the packs.
> 
> ...and the mtimes being updated cause the disk cache to be invalidated?
> 
> It's the last part that is so surprising to me. Ævar and I discussed
> earlier in the thread that their understanding was that you had a backup
> system which had to resynchronize an unchanged file because its metadata
> had changed.
> 
> But this is different than that. If I understand what you're saying
> correctly, then you're saying that the disk caches themselves are
> invalidated by changing the mtime.
> 
> That is highly surprising to me, since the block cache should only be
> invalidated if the *blocks* change, not metadata in the inode. It would
> be good to confirm that this is actually what's happening.
> 
> Thanks,
> Taylor

Oh, maybe I didn't understand the caching well enough; let me check it again.
Thanks for your and Ævar's answers, they are really helpful.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3] packfile: freshen the mtime of packfile by configuration
  2021-07-14  1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget
  2021-07-14  1:39   ` Ævar Arnfjörð Bjarmason
@ 2021-07-19 19:53   ` Sun Chao via GitGitGadget
  2021-07-19 20:51     ` Taylor Blau
  2021-08-15 17:08     ` [PATCH v4 0/2] " Sun Chao via GitGitGadget
  1 sibling, 2 replies; 34+ messages in thread
From: Sun Chao via GitGitGadget @ 2021-07-19 19:53 UTC (permalink / raw)
  To: git
  Cc: Ævar Arnfjörð Bjarmason, Taylor Blau, Taylor Blau,
	Martin Fick, Son Luong Ngoc, Sun Chao, Sun Chao

From: Sun Chao <16657101987@163.com>

Commit 33d4221c79 (write_sha1_file: freshen existing objects,
2014-10-15) avoids writing existing objects by freshening
their mtime (especially of the packfiles that contain them)
in order to aid correct caching, so that processes like
find_lru_pack can make good decisions. However, this is
unfriendly to incremental backup jobs or services that rely
on a cached file system when large '.pack' files exist.

For example, after packing all objects, if we use 'write-tree'
to create the same commit with the same tree and the same
environment (such as GIT_COMMITTER_DATE and GIT_AUTHOR_DATE),
we can see that the '.pack' file's mtime changed. Git servers
that mount the same NFS disk will re-sync the '.pack' files
to the cached file system, which slows down git commands.

So add core.freshenPackfiles to indicate whether or not packs
can be freshened; turning this option off on some servers can
speed up the execution of some commands on servers which use
an NFS disk instead of a local disk.

Signed-off-by: Sun Chao <16657101987@163.com>
---
    packfile: freshen the mtime of packfile by configuration
    
    Commit 33d4221c79 (write_sha1_file: freshen existing objects,
    2014-10-15) avoids writing existing objects by freshening their mtime
    (especially of the packfiles that contain them) in order to aid
    correct caching, so that processes like find_lru_pack can make good
    decisions. However, this is unfriendly to incremental backup jobs or
    services that rely on a cached file system when large '.pack' files
    exist.
    
    For example, after packing all objects, if we use 'write-tree' to
    create the same commit with the same tree and the same environment
    (such as GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we can see that the
    '.pack' file's mtime changed. Git servers that mount the same NFS
    disk will re-sync the '.pack' files to the cached file system, which
    slows down git commands.
    
    The IBM documentation describes the cached file system (CacheFS) for
    NFS clients at
    https://www.ibm.com/docs/en/aix/7.2?topic=performance-cache-file-system:
    
    3. To ensure that the cached directories and files are kept up to date, 
    CacheFS periodically checks the consistency of files stored in the cache.
    It does this by comparing the current modification time to the previous
    modification time.
    
    4. If the modification times are different, all data and attributes
    for the directory or file are purged from the cache, and new data and
    attributes are retrieved from the back file system.
    
    
    So add core.freshenPackfiles to indicate whether or not packs can be
    freshened; turning this option off can speed up the execution of some
    commands on servers which use an NFS disk instead of a local disk.
    
    Signed-off-by: Sun Chao 16657101987@163.com

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1043%2Fsunchao9%2Fmaster-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1043/sunchao9/master-v3
Pull-Request: https://github.com/git/git/pull/1043

Range-diff vs v2:

 1:  943e31e8587 ! 1:  16c68923bea packfile: freshen the mtime of packfile by configuration
     @@ Commit message
          mtime (especially the packfiles contains them) in order to
          aid the correct caching, and some process like find_lru_pack
          can make good decision. However, this is unfriendly to
     -    incremental backup jobs or services rely on file system
     -    cache when there are large '.pack' files exists.
     +    incremental backup jobs or services rely on cached file system
     +    when there are large '.pack' files exists.
      
          For example, after packed all objects, use 'write-tree' to
          create same commit with the same tree and same environments
          such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can
     -    notice the '.pack' file's mtime changed, and '.idx' file not.
     +    notice the '.pack' file's mtime changed. Git servers
     +    that mount the same NFS disk will re-sync the '.pack' files
     +    to cached file system which will slow the git commands.
      
     -    If we freshen the mtime of packfile by updating another
     -    file instead of '.pack' file e.g. a empty '.bump' file,
     -    when we need to check the mtime of packfile, get it from
     -    another file instead. Large git repository may contains
     -    large '.pack' files, and we can use smaller files even empty
     -    file to do the mtime get/set operation, this can avoid
     -    file system cache re-sync large '.pack' files again and
     -    then speed up most git commands.
     +    So if add core.freshenPackfiles to indicate whether or not
     +    packs can be freshened, turning off this option on some
     +    servers can speed up the execution of some commands on servers
     +    which use NFS disk instead of local disk.
      
          Signed-off-by: Sun Chao <16657101987@163.com>
      
     @@ Documentation/config/core.txt: the largest projects.  You probably do not need t
       +
       Common unit suffixes of 'k', 'm', or 'g' are supported.
       
     -+core.packMtimeSuffix::
     ++core.freshenPackFiles::
       +	Normally we avoid writing an existing object by freshening the mtime
       +	of the *.pack file which contains it, in order to aid some processes
     -+	such like prune. Use different file instead of *.pack file will
     -+	avoid file system cache re-sync the large packfiles, and consequently
     -+	make git commands faster.
      ++	such as prune. Turning off this option on some servers can speed
      ++	up the execution of some commands like 'git-upload-pack' (e.g. some
      ++	servers that mount the same NFS disk will re-sync the *.pack files
      ++	to the cached file system if the mtime changes).
      ++
     -+The default is 'pack' which means the *.pack file will be freshened by
     -+default. You can configure a different suffix to use, the file with the
     -+suffix will be created automatically, it's better not using any known
     -+suffix such like 'idx', 'keep', 'promisor'.
      ++The default is true, which means the *.pack file will be freshened if
      ++we want to write an existing object within it.
      +
       core.deltaBaseCacheLimit::
       	Maximum number of bytes per thread to reserve for caching base objects
       	that may be referenced by multiple deltified objects.  By storing the
      
     - ## builtin/index-pack.c ##
     -@@ builtin/index-pack.c: static void fix_unresolved_deltas(struct hashfile *f)
     - 	free(sorted_by_pos);
     - }
     - 
     --static const char *derive_filename(const char *pack_name, const char *strip,
     --				   const char *suffix, struct strbuf *buf)
     --{
     --	size_t len;
     --	if (!strip_suffix(pack_name, strip, &len) || !len ||
     --	    pack_name[len - 1] != '.')
     --		die(_("packfile name '%s' does not end with '.%s'"),
     --		    pack_name, strip);
     --	strbuf_add(buf, pack_name, len);
     --	strbuf_addstr(buf, suffix);
     --	return buf->buf;
     --}
     --
     - static void write_special_file(const char *suffix, const char *msg,
     - 			       const char *pack_name, const unsigned char *hash,
     - 			       const char **report)
     -@@ builtin/index-pack.c: static void write_special_file(const char *suffix, const char *msg,
     - 	int msg_len = strlen(msg);
     - 
     - 	if (pack_name)
     --		filename = derive_filename(pack_name, "pack", suffix, &name_buf);
     -+		filename = derive_pack_filename(pack_name, "pack", suffix, &name_buf);
     - 	else
     - 		filename = odb_pack_name(&name_buf, hash, suffix);
     - 
     -@@ builtin/index-pack.c: int cmd_index_pack(int argc, const char **argv, const char *prefix)
     - 	if (from_stdin && hash_algo)
     - 		die(_("--object-format cannot be used with --stdin"));
     - 	if (!index_name && pack_name)
     --		index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf);
     -+		index_name = derive_pack_filename(pack_name, "pack", "idx", &index_name_buf);
     - 
     - 	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
     - 	if (rev_index) {
     - 		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
     - 		if (index_name)
     --			rev_index_name = derive_filename(index_name,
     -+			rev_index_name = derive_pack_filename(index_name,
     - 							 "idx", "rev",
     - 							 &rev_index_name_buf);
     - 	}
     -
       ## cache.h ##
      @@ cache.h: extern size_t packed_git_limit;
       extern size_t delta_base_cache_limit;
       extern unsigned long big_file_threshold;
       extern unsigned long pack_size_limit_cfg;
     -+extern const char *pack_mtime_suffix;
     ++extern int core_freshen_packfiles;
       
       /*
        * Accessors for the core.sharedrepository config which lazy-load the value
     @@ config.c: static int git_default_core_config(const char *var, const char *value,
       		return 0;
       	}
       
     -+	if (!strcmp(var, "core.packmtimesuffix")) {
     -+		return git_config_string(&pack_mtime_suffix, var, value);
     ++	if (!strcmp(var, "core.freshenpackfiles")) {
     ++		core_freshen_packfiles = git_config_bool(var, value);
      +	}
      +
       	if (!strcmp(var, "core.deltabasecachelimit")) {
     @@ config.c: static int git_default_core_config(const char *var, const char *value,
       		return 0;
      
       ## environment.c ##
     -@@ environment.c: const char *git_hooks_path;
     - int zlib_compression_level = Z_BEST_SPEED;
     - int core_compression_level;
     - int pack_compression_level = Z_DEFAULT_COMPRESSION;
     -+const char *pack_mtime_suffix = "pack";
     - int fsync_object_files;
     - size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
     - size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
     +@@ environment.c: int core_sparse_checkout_cone;
     + int merge_log_config = -1;
     + int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
     + unsigned long pack_size_limit_cfg;
     ++int core_freshen_packfiles = 1;
     + enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET;
     + 
     + #ifndef PROTECT_HFS_DEFAULT
      
       ## object-file.c ##
      @@ object-file.c: static int freshen_loose_object(const struct object_id *oid)
       static int freshen_packed_object(const struct object_id *oid)
       {
       	struct pack_entry e;
     -+	struct stat st;
     -+	struct strbuf name_buf = STRBUF_INIT;
     -+	const char *filename;
     ++
     ++	if (!core_freshen_packfiles)
     ++		return 1;
      +
       	if (!find_pack_entry(the_repository, oid, &e))
       		return 0;
       	if (e.p->freshened)
     - 		return 1;
     --	if (!freshen_file(e.p->pack_name))
     --		return 0;
     -+
     -+	filename = e.p->pack_name;
     -+	if (!strcasecmp(pack_mtime_suffix, "pack")) {
     -+		if (!freshen_file(filename))
     -+			return 0;
     -+		e.p->freshened = 1;
     -+		return 1;
     -+	}
     -+
     -+	/* If we want to freshen different file instead of .pack file, we need
     -+	 * to make sure the file exists and create it if needed.
     -+	 */
     -+	filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf);
     -+	if (lstat(filename, &st) < 0) {
     -+		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
     -+		if (fd < 0) {
     -+			// here we need to check it again because other git process may created it
     -+			if (lstat(filename, &st) < 0)
     -+				die_errno("unable to create '%s'", filename);
     -+		} else {
     -+			close(fd);
     -+		}
     -+	} else {
     -+		if (!freshen_file(filename))
     -+			return 0;
     -+	}
     -+
     - 	e.p->freshened = 1;
     - 	return 1;
     - }
     -
     - ## packfile.c ##
     -@@ packfile.c: char *sha1_pack_index_name(const unsigned char *sha1)
     - 	return odb_pack_name(&buf, sha1, "idx");
     - }
     - 
     -+const char *derive_pack_filename(const char *pack_name, const char *strip,
     -+				const char *suffix, struct strbuf *buf)
     -+{
     -+	size_t len;
     -+	if (!strip_suffix(pack_name, strip, &len) || !len ||
     -+	    pack_name[len - 1] != '.')
     -+		die(_("packfile name '%s' does not end with '.%s'"),
     -+		    pack_name, strip);
     -+	strbuf_add(buf, pack_name, len);
     -+	strbuf_addstr(buf, suffix);
     -+	return buf->buf;
     -+}
     -+
     - static unsigned int pack_used_ctr;
     - static unsigned int pack_mmap_calls;
     - static unsigned int peak_pack_open_windows;
     -@@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
     - 	 */
     - 	p->pack_size = st.st_size;
     - 	p->pack_local = local;
     -+
     -+	/* If we have different file used to freshen the mtime, we should
     -+	 * use it at a higher priority.
     -+	 */
     -+	if (!!strcasecmp(pack_mtime_suffix, "pack")) {
     -+		struct strbuf name_buf = STRBUF_INIT;
     -+		const char *filename;
     -+
     -+		filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf);
     -+		stat(filename, &st);
     -+	}
     - 	p->mtime = st.st_mtime;
     - 	if (path_len < the_hash_algo->hexsz ||
     - 	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
     -
     - ## packfile.h ##
     -@@ packfile.h: char *sha1_pack_name(const unsigned char *sha1);
     -  */
     - char *sha1_pack_index_name(const unsigned char *sha1);
     - 
     -+/*
     -+ * Return the corresponding filename with given suffix from "file_name"
     -+ * which must has "strip" suffix.
     -+ */
     -+const char *derive_pack_filename(const char *file_name, const char *strip,
     -+		const char *suffix, struct strbuf *buf);
     -+
     - /*
     -  * Return the basename of the packfile, omitting any containing directory
     -  * (e.g., "pack-1234abcd[...].pack").
      
       ## t/t7701-repack-unpack-unreachable.sh ##
      @@ t/t7701-repack-unpack-unreachable.sh: test_expect_success 'do not bother loosening old objects' '
       	test_must_fail git cat-file -p $obj2
       '
       
     -+test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' '
      ++test_expect_success 'do not bother loosening old objects when packfile freshening is disabled' '
      +	obj1=$(echo three | git hash-object -w --stdin) &&
      +	obj2=$(echo four | git hash-object -w --stdin) &&
     -+	pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
     -+	pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
     -+	git -c core.packmtimesuffix=bump prune-packed &&
     ++	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
     ++	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
     ++	git -c core.freshenPackFiles=false prune-packed &&
      +	git cat-file -p $obj1 &&
      +	git cat-file -p $obj2 &&
     -+	touch .git/objects/pack/pack-$pack2.bump &&
     -+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
     -+	git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago &&
     ++	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
     ++	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
      +	git cat-file -p $obj1 &&
      +	test_must_fail git cat-file -p $obj2
      +'


 Documentation/config/core.txt        | 11 +++++++++++
 cache.h                              |  1 +
 config.c                             |  4 ++++
 environment.c                        |  1 +
 object-file.c                        |  4 ++++
 t/t7701-repack-unpack-unreachable.sh | 14 ++++++++++++++
 6 files changed, 35 insertions(+)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a1..1e7cf366628 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -398,6 +398,17 @@ the largest projects.  You probably do not need to adjust this value.
 +
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
+core.freshenPackFiles::
+	Normally we avoid writing an existing object by freshening the mtime
+	of the *.pack file which contains it, in order to aid some processes
+	such as prune. Turning off this option on some servers can speed
+	up the execution of some commands like 'git-upload-pack' (e.g. some
+	servers that mount the same NFS disk will re-sync the *.pack files
+	to the cached file system if the mtime changes).
++
+The default is true, which means the *.pack file will be freshened if
+we want to write an existing object within it.
+
 core.deltaBaseCacheLimit::
 	Maximum number of bytes per thread to reserve for caching base objects
 	that may be referenced by multiple deltified objects.  By storing the
diff --git a/cache.h b/cache.h
index ba04ff8bd36..46126c6977c 100644
--- a/cache.h
+++ b/cache.h
@@ -956,6 +956,7 @@ extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
 extern unsigned long pack_size_limit_cfg;
+extern int core_freshen_packfiles;
 
 /*
  * Accessors for the core.sharedrepository config which lazy-load the value
diff --git a/config.c b/config.c
index f9c400ad306..02dcc8a028e 100644
--- a/config.c
+++ b/config.c
@@ -1431,6 +1431,10 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.freshenpackfiles")) {
+		core_freshen_packfiles = git_config_bool(var, value);
+	}
+
 	if (!strcmp(var, "core.deltabasecachelimit")) {
 		delta_base_cache_limit = git_config_ulong(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 2f27008424a..397525609a8 100644
--- a/environment.c
+++ b/environment.c
@@ -73,6 +73,7 @@ int core_sparse_checkout_cone;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
+int core_freshen_packfiles = 1;
 enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET;
 
 #ifndef PROTECT_HFS_DEFAULT
diff --git a/object-file.c b/object-file.c
index f233b440b22..884c3e92c38 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1974,6 +1974,10 @@ static int freshen_loose_object(const struct object_id *oid)
 static int freshen_packed_object(const struct object_id *oid)
 {
 	struct pack_entry e;
+
+	if (!core_freshen_packfiles)
+		return 1;
+
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
 	if (e.p->freshened)
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index 937f89ee8c8..b6a0b6c9695 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -112,6 +112,20 @@ test_expect_success 'do not bother loosening old objects' '
 	test_must_fail git cat-file -p $obj2
 '
 
+test_expect_success 'do not bother loosening old objects when packfile freshening is disabled' '
+	obj1=$(echo three | git hash-object -w --stdin) &&
+	obj2=$(echo four | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
+	git -c core.freshenPackFiles=false prune-packed &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	test_must_fail git cat-file -p $obj2
+'
+
 test_expect_success 'keep packed objects found only in index' '
 	echo my-unique-content >file &&
 	git add file &&

base-commit: 75ae10bc75336db031ee58d13c5037b929235912
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration
  2021-07-19 19:53   ` [PATCH v3] " Sun Chao via GitGitGadget
@ 2021-07-19 20:51     ` Taylor Blau
  2021-07-20  0:07       ` Junio C Hamano
                         ` (2 more replies)
  2021-08-15 17:08     ` [PATCH v4 0/2] " Sun Chao via GitGitGadget
  1 sibling, 3 replies; 34+ messages in thread
From: Taylor Blau @ 2021-07-19 20:51 UTC (permalink / raw)
  To: Sun Chao via GitGitGadget
  Cc: git, Ævar Arnfjörð Bjarmason, Taylor Blau,
	Martin Fick, Son Luong Ngoc, Sun Chao, Jeff King

On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote:
> From: Sun Chao <16657101987@163.com>
>
> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
> 2014-10-15) avoids writing existing objects by freshening
> their mtime (especially of the packfiles that contain them)
> in order to aid correct caching, so that processes like
> find_lru_pack can make good decisions. However, this is
> unfriendly to incremental backup jobs or services that rely
> on a cached file system when large '.pack' files exist.
>
> For example, after packing all objects, if we use 'write-tree'
> to create the same commit with the same tree and the same
> environment (such as GIT_COMMITTER_DATE and GIT_AUTHOR_DATE),
> we can see that the '.pack' file's mtime changed. Git servers
> that mount the same NFS disk will re-sync the '.pack' files
> to the cached file system, which slows down git commands.
>
> So add core.freshenPackfiles to indicate whether or not packs
> can be freshened; turning this option off on some servers can
> speed up the execution of some commands on servers which use
> an NFS disk instead of a local disk.

Hmm. I'm still quite unconvinced that we should be taking this direction
without better motivation. We talked about your assumption that NFS
seems to be invalidating the block cache when updating the inodes that
point at those blocks, but I don't recall seeing further evidence.

Regardless, a couple of idle thoughts:

> +	if (!core_freshen_packfiles)
> +		return 1;

It is important to still freshen the object mtimes even when we cannot
update the pack mtimes. That's why we return 0 when "freshen_file"
returned 0: even if there was an error calling utime, we should still
freshen the object. This is important because it impacts when
unreachable objects are pruned.

So I would have assumed that if a user set "core.freshenPackfiles=false"
that they would still want their object mtimes updated, in which case
the only option we have is to write those objects out loose.

...and that happens by the caller of freshen_packed_object (either
write_object_file() or hash_object_file_literally()) then calling
write_loose_object() if freshen_packed_object() failed. So I would have
expected to see a "return 0" in the case that packfile freshening was
disabled.

But that leads us to an interesting problem: how many redundant objects
do we expect to see on the server? It may be a lot, in which case you
may end up having the same IO problems for a different reason. Peff
mentioned to me off-list that he suspected write-tree was overeager in
how many trees it would try to write out. I'm not sure.

> +test_expect_success 'do not bother loosening old objects when packfile freshening is disabled' '
> +	obj1=$(echo three | git hash-object -w --stdin) &&
> +	obj2=$(echo four | git hash-object -w --stdin) &&
> +	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
> +	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
> +	git -c core.freshenPackFiles=false prune-packed &&
> +	git cat-file -p $obj1 &&
> +	git cat-file -p $obj2 &&
> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
> +	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
> +	git cat-file -p $obj1 &&
> +	test_must_fail git cat-file -p $obj2
> +'

I had a little bit of a hard time following this test. AFAICT, it
proceeds as follows:

  - Write two packs, each containing a unique unreachable blob.
  - Call 'git prune-packed' with packfile freshening disabled, then
    check that the object survived.
  - Then repack while in a state where one of the pack's contents would
    be pruned.
  - Make sure that one object survives and the other doesn't.

This doesn't really seem to be testing the behavior of disabling
packfile freshening so much as it's testing prune-packed, and repack's
`--unpack-unreachable` option. I would probably have expected to see
something more along the lines of:

  - Write an unreachable object, pack it, and then remove the loose copy
    you wrote in the first place.
  - Then roll the pack's mtime to some fixed value in the past.
  - Try to write the same object again with packfile freshening
    disabled, and verify that:
    - the pack's mtime was unchanged,
    - the object exists loose again

But I'd really like to get some other opinions (especially from Peff,
who brought up the potential concerns with write-tree) as to whether or
not this is a direction worth pursuing.

My opinion is that it is not, and that the bizarre caching behavior you
are seeing is out of Git's control.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration
  2021-07-19 20:51     ` Taylor Blau
@ 2021-07-20  0:07       ` Junio C Hamano
  2021-07-20 15:07         ` Sun Chao
  2021-07-20  6:19       ` Ævar Arnfjörð Bjarmason
  2021-07-20 15:00       ` Sun Chao
  2 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2021-07-20  0:07 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Sun Chao via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, Martin Fick,
	Son Luong Ngoc, Sun Chao, Jeff King

Taylor Blau <me@ttaylorr.com> writes:

> Hmm. I'm still quite unconvinced that we should be taking this direction
> without better motivation. We talked about your assumption that NFS
> seems to be invalidating the block cache when updating the inodes that
> point at those blocks, but I don't recall seeing further evidence.

Me neither.  Not touching the pack and not updating the "most
recently used" time of individual objects smells like a recipe
for repository corruption.

> My opinion is that it is not, and that the bizarre caching behavior you
> are seeing is out of Git's control.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration
  2021-07-19 20:51     ` Taylor Blau
  2021-07-20  0:07       ` Junio C Hamano
@ 2021-07-20  6:19       ` Ævar Arnfjörð Bjarmason
  2021-07-20 15:34         ` Sun Chao
  2021-07-20 15:00       ` Sun Chao
  2 siblings, 1 reply; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-20  6:19 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Sun Chao via GitGitGadget, git, Martin Fick, Son Luong Ngoc,
	Sun Chao, Jeff King


On Mon, Jul 19 2021, Taylor Blau wrote:

> On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote:
>> From: Sun Chao <16657101987@163.com>
>>
>> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
>> 2014-10-15) avoids writing existing objects by freshening
>> their mtime (especially of the packfiles that contain them)
>> in order to aid correct caching, so that processes like
>> find_lru_pack can make good decisions. However, this is
>> unfriendly to incremental backup jobs or services that rely
>> on a cached file system when large '.pack' files exist.
>>
>> For example, after packing all objects, if we use 'write-tree'
>> to create the same commit with the same tree and the same
>> environment (such as GIT_COMMITTER_DATE and GIT_AUTHOR_DATE),
>> we can see that the '.pack' file's mtime changed. Git servers
>> that mount the same NFS disk will re-sync the '.pack' files
>> to the cached file system, which slows down git commands.
>>
>> So add core.freshenPackfiles to indicate whether or not packs
>> can be freshened; turning this option off on some servers can
>> speed up the execution of some commands on servers which use
>> an NFS disk instead of a local disk.
>
> Hmm. I'm still quite unconvinced that we should be taking this direction
> without better motivation. We talked about your assumption that NFS
> seems to be invalidating the block cache when updating the inodes that
> point at those blocks, but I don't recall seeing further evidence.

I don't know about Sun's setup, but what he's describing is consistent
with how NFS works, or can commonly be made to work.

See e.g. "lookupcache" in nfs(5) on Linux; a lot of people also use
some random vendor's proprietary NFS implementation, and commonly tweak
various options that put them anywhere between "I guess that's not too
crazy" and "are you kidding me?" on the scale of non-POSIX compliance.
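
For reference, the knobs in question look roughly like this on a Linux
client (option values here are illustrative, not a recommendation; see
nfs(5) for the semantics):

```shell
# Cache only positive lookups; cap the attribute cache lifetime at 3s.
mount -t nfs -o vers=3,lookupcache=positive,actimeo=3,noatime \
    <server>:/repositories /mnt/repositories
```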

> Regardless, a couple of idle thoughts:
>
>> +	if (!core_freshen_packfiles)
>> +		return 1;
>
> It is important to still freshen the object mtimes even when we cannot
> update the pack mtimes. That's why we return 0 when "freshen_file"
> returned 0: even if there was an error calling utime, we should still
> freshen the object. This is important because it impacts when
> unreachable objects are pruned.
>
> So I would have assumed that if a user set "core.freshenPackfiles=false"
> that they would still want their object mtimes updated, in which case
> the only option we have is to write those objects out loose.
>
> ...and that happens by the caller of freshen_packed_object (either
> write_object_file() or hash_object_file_literally()) then calling
> write_loose_object() if freshen_packed_object() failed. So I would have
> expected to see a "return 0" in the case that packfile freshening was
> disabled.
>
> But that leads us to an interesting problem: how many redundant objects
> do we expect to see on the server? It may be a lot, in which case you
> may end up having the same IO problems for a different reason. Peff
> mentioned to me off-list that he suspected write-tree was overeager in
> how many trees it would try to write out. I'm not sure.

In my experience with NFS the thing that kills you is anything that
needs to do iteration, i.e. recursive readdir() and the like; for
reading a lot of data, throughput was excellent. It's why I hacked up
that core.checkCollisions patch.

Jeff improved the situation I was mainly trying to fix with the
loose objects cache. I never got around to benchmarking the two in
production, and now that setup is at an ex-job...

>> +test_expect_success 'do not bother loosening old objects without freshen pack time' '
>> +	obj1=$(echo three | git hash-object -w --stdin) &&
>> +	obj2=$(echo four | git hash-object -w --stdin) &&
>> +	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>> +	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>> +	git -c core.freshenPackFiles=false prune-packed &&
>> +	git cat-file -p $obj1 &&
>> +	git cat-file -p $obj2 &&
>> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
>> +	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
>> +	git cat-file -p $obj1 &&
>> +	test_must_fail git cat-file -p $obj2
>> +'
>
> I had a little bit of a hard time following this test. AFAICT, it
> proceeds as follows:
>
>   - Write two packs, each containing a unique unreachable blob.
>   - Call 'git prune-packed' with packfile freshening disabled, then
>     check that the object survived.
>   - Then repack while in a state where one of the pack's contents would
>     be pruned.
>   - Make sure that one object survives and the other doesn't.
>
> This doesn't really seem to be testing the behavior of disabling
> packfile freshening so much as it's testing prune-packed, and repack's
> `--unpack-unreachable` option. I would probably have expected to see
> something more along the lines of:
>
>   - Write an unreachable object, pack it, and then remove the loose copy
>     you wrote in the first place.
>   - Then roll the pack's mtime to some fixed value in the past.
>   - Try to write the same object again with packfile freshening
>     disabled, and verify that:
>     - the pack's mtime was unchanged,
>     - the object exists loose again
>
> But I'd really like to get some other opinions (especially from Peff,
> who brought up the potential concerns with write-tree) as to whether or
> not this is a direction worth pursuing.
>
> My opinion is that it is not, and that the bizarre caching behavior you
> are seeing is out of Git's control.

Thanks for this, I found the test hard to follow too, but didn't have
time to really think about it, this makes sense.
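
For what it's worth, the flow you describe could be sketched outside
git's t/ framework roughly like this (untested; GNU stat/touch and the
made-up repo name "freshen-demo" are assumptions):

```shell
#!/bin/sh
# Sketch of the suggested test: re-writing an already-packed object
# should bump the pack's mtime rather than write a loose copy.
set -e
git init -q freshen-demo && cd freshen-demo

# Write an unreachable object, pack it, then drop the loose copy.
obj=$(echo stale | git hash-object -w --stdin)
pack=$(echo "$obj" | git pack-objects -q .git/objects/pack/pack)
git prune-packed

# Roll the pack's mtime to a fixed value in the past.
touch -d '2001-02-03 04:05:06' ".git/objects/pack/pack-$pack.pack"
before=$(stat -c %Y ".git/objects/pack/pack-$pack.pack")

# Re-write the same object; with freshening enabled (the default)
# the pack's mtime should move forward.
echo stale | git hash-object -w --stdin >/dev/null
after=$(stat -c %Y ".git/objects/pack/pack-$pack.pack")

test "$after" -gt "$before" && echo "pack was freshened"
```

With core.freshenPackFiles=false as proposed in the patch, the
expectation would flip: the mtime stays put and the object shows up
loose again.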

Back to the topic: I share your sentiment of trying to avoid complexity
in this area.

Sun: Have you considered --keep-unreachable to "git repack"? It's
orthogonal to what you're trying here, but I wonder if being more
aggressive about keeping objects, plus some improvements to perhaps
skip this "touch" entirely when that is in effect, wouldn't be more
viable & something e.g. Taylor would be more comfortable having as
part of git.git.
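
For reference, a minimal demo of that repack mode (the repo name and
blob content are made up; see git-repack(1) for --keep-unreachable):

```shell
#!/bin/sh
# Demo: repack while keeping unreachable objects in the new pack
# instead of loosening or dropping them.
set -e
git init -q keep-demo && cd keep-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m base

obj=$(echo orphan | git hash-object -w --stdin)   # unreachable blob

# With -a -d, --keep-unreachable carries unreachable objects along
# rather than exploding them loose, so there is nothing to "touch".
git repack -a -d -q --keep-unreachable

git cat-file -e "$obj" && echo "unreachable object survived"
```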

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-15  8:23                 ` Son Luong Ngoc
@ 2021-07-20  6:29                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-20  6:29 UTC (permalink / raw)
  To: Son Luong Ngoc
  Cc: Martin Fick, Taylor Blau, Sun Chao, Taylor Blau,
	Sun Chao via GitGitGadget, git


On Thu, Jul 15 2021, Son Luong Ngoc wrote:

> Hi folks,
>
> On Wed, Jul 14, 2021 at 10:03 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> *nod*
>>
>> FWIW at an ex-job I helped systems administrators who'd produced such a
>> broken backup-via-rsync create a hybrid version as an interim
>> solution. I.e. it would sync the objects via git transport, and do an
>> rsync on a whitelist (or blacklist), so pickup config, but exclude
>> objects.
>>
>> "Hybrid" because it was in a state of needing to deal with manual
>> tweaking of config.
>>
>> But usually someone who's needing to thoroughly solve this backup
>> problem will inevitably end up with wanting to drive everything that's
>> not in the object or refstore from some external system, i.e. have
>> config be generated from puppet, a database etc., ditto for alternates
>> etc.
>>
>> But even if you can't get to that point (or don't want to) I'd say aim
>> for the hybrid system.
>
> FWIW, we are running our repo on top of a somewhat flaky DRBD setup and
> we decided to use both
>
>   `git clone --upload-pack 'git -c transfer.hiderefs="!refs"
> upload-pack' --mirror`
>
> and
>
>   `tar`
>
> to create 2 separate snapshots for backup in parallel (full backup,
> not incremental).
>
> In case of recovery (manual), we first rely on the git snapshot, and if
> there are any missing objects/refs, we will try to get them from the
> tarball.

That sounds good, and similar to what I described with that "hybrid"
setup.

>>
>> This isn't some purely theoretical concern b.t.w., the system using
>> rsync like this was producing repos that wouldn't fsck all the time, and
>> it wasn't such a busy site.
>>
>> I suspect (but haven't tried) that for someone who can't easily change
>> their backup solution they'd get most of the benefits of git-native
>> transport by having their "rsync" sync refs, then objects, not the other
>> way around. Glob order dictates that most backup systems will do
>> objects, then refs (which will of course, at that point, refer to
>> nonexisting objects).
>>
>> It's still not safe, you'll still be subject to races, but probably a
>> lot better in practice.
>
> I would love to get some guidance in official documentation on what is the best
> practice around handling git data on the server side.
>
> Is git-clone + git-bundle the go-to solution?
> Should tar/rsync not be used completely or is there a trade-off?

I should have tempered some of those comments, it's perfectly fine in
general to use tar+rsync for "backing up" git repositories in certain
contexts. E.g. when I switch laptops or whatever it's what I do to grab
data.

The problem is when the data isn't at rest, i.e. in the context of an
active server.

There you start moving towards a scale where it goes from "sure, it's
fine" to "this is such a bad idea that nobody should pursue it".

If you're running a setup where you're starting to submit patches to
git.git you're probably at the far end of that spectrum.

Whether it's clone, push, fetch, bundle etc. doesn't really matter, the
important part is that you're using git's pack transport mechanism to
ferry updates around, which gives you guarantees rsync+tar can't,
particularly in the face of concurrently updated data.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 20:20                 ` Martin Fick
@ 2021-07-20  6:32                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-20  6:32 UTC (permalink / raw)
  To: Martin Fick
  Cc: Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget,
	git


On Wed, Jul 14 2021, Martin Fick wrote:

> On Wednesday, July 14, 2021 9:41:42 PM MDT you wrote:
>> On Wed, Jul 14 2021, Martin Fick wrote:
>> > On Wednesday, July 14, 2021 8:19:15 PM MDT Ævar Arnfjörð Bjarmason wrote:
>> >> The best way to get backups of git repositories you know are correct are
>> >> is to use git's own transport mechanisms, i.e. fetch/pull the data, or
>> >> create bundles from it.
>> > 
>> > I don't think this is a fair recommendation since unfortunately, this
>> > cannot be used to create a full backup. This can be used to back up the
>> > version controlled data, but not the repositories meta-data, i.e.
>> > configs, reflogs, alternate setups...
>> 
>> *nod*
>> 
>> FWIW at an ex-job I helped systems administrators who'd produced such a
>> broken backup-via-rsync create a hybrid version as an interim
>> solution. I.e. it would sync the objects via git transport, and do an
>> rsync on a whitelist (or blacklist), so pickup config, but exclude
>> objects.
>> 
>> "Hybrid" because it was in a state of needing to deal with manual
>> tweaking of config.
>> 
>> But usually someone who's needing to thoroughly solve this backup
>> problem will inevitably end up with wanting to drive everything that's
>> not in the object or refstore from some external system, i.e. have
>> config be generated from puppet, a database etc., ditto for alternates
>> etc.
>> 
>> But even if you can't get to that point (or don't want to) I'd say aim
>> for the hybrid system.
>> 
>> This isn't some purely theoretical concern b.t.w., the system using
>> rsync like this was producing repos that wouldn't fsck all the time, and
>> it wasn't such a busy site.
>> 
>> I suspect (but haven't tried) that for someone who can't easily change
>> their backup solution they'd get most of the benefits of git-native
>> transport by having their "rsync" sync refs, then objects, not the other
>> way around. Glob order dictates that most backup systems will do
>> objects, then refs (which will of course, at that point, refer to
>> nonexisting objects).
>> 
>> It's still not safe, you'll still be subject to races, but probably a
>> lot better in practice.
>
> It would be great if git provided a command to do a reliable incremental 
> backup, maybe it could copy things in the order that you mention?

I don't think we can or want to support this sort of thing, ever, for
the same reason that you probably won't convince MySQL, PostgreSQL etc.
that they should support "cp -r" as a mode for backing up their live
database services.

I mean, there is the topic of git being lazy about fsync() etc, but even
if all of that were 100% solved you'd still get bad things if you picked
an arbitrary time to snapshot a running git directory, e.g. your
"master" branch might have a "master.lock" because it was in the middle
of an update.

If you used "fetch/clone/bundle" etc. to get the data, no problem; but
if your snapshot happens at that moment you'd need to clean that lock
up manually, a situation which in practice wouldn't persist on a live
server, but would be frozen in place by a snapshot.

> However, most people will want to use the backup system they have and not a
> special git tool. Maybe git fsck should gain a switch that would rewind any
> refs to an older point that is not broken (using reflogs)? That way, most
> backups would just work and be rewound to the point at which the backup
> started?

I think the main problem in the wild is not the inability of using a
special tool, but one of education. Most people wouldn't think of "cp
-r" as a first approach to say backing up a live mysql server, they'd
use mysqldump and the like.

But for some reason git is considered "not a database" enough that those
same people would just use rsync/tar/whatever, and are then surprised
when their data is corrupt or in some weird or inconsistent state...

Anyway, see also my just-posted:
https://lore.kernel.org/git/878s21wl4z.fsf@evledraar.gmail.com/

I.e. I'm not saying "never use rsync", there are cases where that's
fine, but for a live "real" server I'd say solutions in that class
shouldn't be considered, and existing ones should be actively migrated
away from.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration
  2021-07-19 20:51     ` Taylor Blau
  2021-07-20  0:07       ` Junio C Hamano
  2021-07-20  6:19       ` Ævar Arnfjörð Bjarmason
@ 2021-07-20 15:00       ` Sun Chao
  2021-07-20 16:53         ` Taylor Blau
  2 siblings, 1 reply; 34+ messages in thread
From: Sun Chao @ 2021-07-20 15:00 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Sun Chao via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, Martin Fick,
	Son Luong Ngoc, Jeff King



> On Jul 20, 2021, at 04:51, Taylor Blau <me@ttaylorr.com> wrote:
> 
> On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote:
>> From: Sun Chao <16657101987@163.com>
>> 
>> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
>> 2014-10-15) avoids writing existing objects by freshening their
>> mtime (especially of the packfiles that contain them) in order
>> to aid correct caching, so that processes like find_lru_pack can
>> make good decisions. However, this is unfriendly to incremental
>> backup jobs or services relying on a cached file system when
>> large '.pack' files exist.
>>
>> For example, after packing all objects, using 'write-tree' to
>> create the same commit with the same tree and the same
>> environment variables such as GIT_COMMITTER_DATE and
>> GIT_AUTHOR_DATE, we can notice that the '.pack' file's mtime
>> changed. Git servers that mount the same NFS disk will re-sync
>> the '.pack' files to the cached file system, which slows down
>> git commands.
>>
>> So we add core.freshenPackfiles to indicate whether or not packs
>> can be freshened; turning this option off on some servers can
>> speed up the execution of some commands on servers which use an
>> NFS disk instead of a local disk.
> 
> Hmm. I'm still quite unconvinced that we should be taking this direction
> without better motivation. We talked about your assumption that NFS
> seems to be invalidating the block cache when updating the inodes that
> point at those blocks, but I don't recall seeing further evidence.

Yes, these days I'm asking our SREs for help to run tests in our
production environment (where we can get the real traffic reports of the
NFS server), such as:

- set up a repository with some large packfiles on an NFS disk
- keep running 'git pack-objects --all --stdout --delta-base-offset >/dev/null' on
multiple git servers that mount the same NFS disk as above
- 'touch' the packfiles from another server, and check (a) if the IOPS and IO traffic
of the NFS server change and (b) if the IO traffic on the network interfaces from
the git servers to the NFS server changes

I'd like to share the data when I receive the reports.
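
Concretely, the read-load generator in the second step could be as
simple as this sketch (the repository path and iteration count are
made up):

```shell
#!/bin/sh
# Hypothetical load probe: repeatedly stream a full pack to /dev/null
# from a repository on the NFS mount while tcpdump/nfsiostat run on
# the server side. Skips silently if the path does not exist.
repo=/mnt/repositories/big-repo.git   # hypothetical path
if cd "$repo" 2>/dev/null; then
    for i in $(seq 1 100); do
        git pack-objects --all --stdout --delta-base-offset \
            </dev/null >/dev/null
    done
fi
```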

Meanwhile I found this description of the cached file system for the NFS client:
   https://www.ibm.com/docs/en/aix/7.2?topic=performance-cache-file-system
It mentions that:

  3. To ensure that the cached directories and files are kept up to date,
     CacheFS periodically checks the consistency of files stored in the cache.
     It does this by comparing the current modification time to the previous
     modification time.
  4. If the modification times are different, all data and attributes
     for the directory or file are purged from the cache, and new data and
     attributes are retrieved from the back file system.

That sounds reasonable, but perhaps I should verify it against my test reports ;)
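
The rule quoted above is easy to model as a toy check (GNU stat is
assumed; the file name is made up):

```shell
#!/bin/sh
# Toy model of the CacheFS consistency check: remember a file's mtime
# when it is cached, compare later; any difference means the cached
# data would be purged and re-fetched from the back file system.
f=demo.pack                      # stand-in for a large packfile
echo data > "$f"
cached=$(stat -c %Y "$f")        # mtime recorded at cache time

touch -d '2030-01-01' "$f"       # another client freshens the pack

if [ "$(stat -c %Y "$f")" -ne "$cached" ]; then
    echo "mtime changed: purge cache, re-read from backing store"
fi
```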

> 
> Regardless, a couple of idle thoughts:
> 
>> +	if (!core_freshen_packfiles)
>> +		return 1;
> 
> It is important to still freshen the object mtimes even when we cannot
> update the pack mtimes. That's why we return 0 when "freshen_file"
> returned 0: even if there was an error calling utime, we should still
> freshen the object. This is important because it impacts when
> unreachable objects are pruned.
> 
> So I would have assumed that if a user set "core.freshenPackfiles=false"
> that they would still want their object mtimes updated, in which case
> the only option we have is to write those objects out loose.
> 
> ...and that happens by the caller of freshen_packed_object (either
> write_object_file() or hash_object_file_literally()) then calling
> write_loose_object() if freshen_packed_object() failed. So I would have
> expected to see a "return 0" in the case that packfile freshening was
> disabled.
> 
> But that leads us to an interesting problem: how many redundant objects
> do we expect to see on the server? It may be a lot, in which case you
> may end up having the same IO problems for a different reason. Peff
> mentioned to me off-list that he suspected write-tree was overeager in
> how many trees it would try to write out. I'm not sure.

You are right, I hadn't thought it through in detail: if we update the
mtime of neither the packfiles nor the loose files, 'prune' may delete
them by accident.

> 
>> +test_expect_success 'do not bother loosening old objects without freshen pack time' '
>> +	obj1=$(echo three | git hash-object -w --stdin) &&
>> +	obj2=$(echo four | git hash-object -w --stdin) &&
>> +	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>> +	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>> +	git -c core.freshenPackFiles=false prune-packed &&
>> +	git cat-file -p $obj1 &&
>> +	git cat-file -p $obj2 &&
>> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
>> +	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
>> +	git cat-file -p $obj1 &&
>> +	test_must_fail git cat-file -p $obj2
>> +'
> 
> I had a little bit of a hard time following this test. AFAICT, it
> proceeds as follows:
> 
>  - Write two packs, each containing a unique unreachable blob.
>  - Call 'git prune-packed' with packfile freshening disabled, then
>    check that the object survived.
>  - Then repack while in a state where one of the pack's contents would
>    be pruned.
>  - Make sure that one object survives and the other doesn't.
> 
> This doesn't really seem to be testing the behavior of disabling
> packfile freshening so much as it's testing prune-packed, and repack's
> `--unpack-unreachable` option. I would probably have expected to see
> something more along the lines of:
> 
>  - Write an unreachable object, pack it, and then remove the loose copy
>    you wrote in the first place.
>  - Then roll the pack's mtime to some fixed value in the past.
>  - Try to write the same object again with packfile freshening
>    disabled, and verify that:
>    - the pack's mtime was unchanged,
>    - the object exists loose again
> 
> But I'd really like to get some other opinions (especially from Peff,
> who brought up the potential concerns with write-tree) as to whether or
> not this is a direction worth pursuing.
> 
> My opinion is that it is not, and that the bizarre caching behavior you
> are seeing is out of Git's control.
OK, I will try this. And I will try to get the test reports from our SRE and
check how the mtime impacts the caches. 

> 
> Thanks,
> Taylor


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration
  2021-07-20  0:07       ` Junio C Hamano
@ 2021-07-20 15:07         ` Sun Chao
  0 siblings, 0 replies; 34+ messages in thread
From: Sun Chao @ 2021-07-20 15:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Sun Chao via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, Martin Fick,
	Son Luong Ngoc, Jeff King



> On Jul 20, 2021, at 08:07, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Taylor Blau <me@ttaylorr.com> writes:
> 
>> Hmm. I'm still quite unconvinced that we should be taking this direction
>> without better motivation. We talked about your assumption that NFS
>> seems to be invalidating the block cache when updating the inodes that
>> point at those blocks, but I don't recall seeing further evidence.
> 
> Me neither.  Not touching the pack and not updating the "most
> recently used" time of individual objects smells like a recipe
> for repository corruption.
> 
>> My opinion is that it is not, and that the bizarre caching behavior you
>> are seeing is out of Git's control.
> 
Thanks Junio, I will try to get a more detailed report on the NFS caches
and share it if it is valuable. Not touching the mtime of packfiles
really has potential problems, just as Taylor said.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration
  2021-07-20  6:19       ` Ævar Arnfjörð Bjarmason
@ 2021-07-20 15:34         ` Sun Chao
  0 siblings, 0 replies; 34+ messages in thread
From: Sun Chao @ 2021-07-20 15:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Sun Chao via GitGitGadget, git, Martin Fick,
	Son Luong Ngoc, Jeff King



> On Jul 20, 2021, at 14:19, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> 
> 
> On Mon, Jul 19 2021, Taylor Blau wrote:
> 
>> On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote:
>>> From: Sun Chao <16657101987@163.com>
>>> 
>>> Commit 33d4221c79 (write_sha1_file: freshen existing objects,
>>> 2014-10-15) avoids writing existing objects by freshening their
>>> mtime (especially of the packfiles that contain them) in order
>>> to aid correct caching, so that processes like find_lru_pack can
>>> make good decisions. However, this is unfriendly to incremental
>>> backup jobs or services relying on a cached file system when
>>> large '.pack' files exist.
>>>
>>> For example, after packing all objects, using 'write-tree' to
>>> create the same commit with the same tree and the same
>>> environment variables such as GIT_COMMITTER_DATE and
>>> GIT_AUTHOR_DATE, we can notice that the '.pack' file's mtime
>>> changed. Git servers that mount the same NFS disk will re-sync
>>> the '.pack' files to the cached file system, which slows down
>>> git commands.
>>>
>>> So we add core.freshenPackfiles to indicate whether or not packs
>>> can be freshened; turning this option off on some servers can
>>> speed up the execution of some commands on servers which use an
>>> NFS disk instead of a local disk.
>> 
>> Hmm. I'm still quite unconvinced that we should be taking this direction
>> without better motivation. We talked about your assumption that NFS
>> seems to be invalidating the block cache when updating the inodes that
>> point at those blocks, but I don't recall seeing further evidence.
> 
> I don't know about Sun's setup, but what he's describing is consistent
> with how NFS works, or can commonly be made to work.
> 
> See e.g. "lookupcache" in nfs(5) on Linux; a lot of people also use
> some random vendor's proprietary NFS implementation, and commonly tweak
> various options that put them anywhere between "I guess that's not too
> crazy" and "are you kidding me?" on the scale of non-POSIX compliance.
> 
>> Regardless, a couple of idle thoughts:
>> 
>>> +	if (!core_freshen_packfiles)
>>> +		return 1;
>> 
>> It is important to still freshen the object mtimes even when we cannot
>> update the pack mtimes. That's why we return 0 when "freshen_file"
>> returned 0: even if there was an error calling utime, we should still
>> freshen the object. This is important because it impacts when
>> unreachable objects are pruned.
>> 
>> So I would have assumed that if a user set "core.freshenPackfiles=false"
>> that they would still want their object mtimes updated, in which case
>> the only option we have is to write those objects out loose.
>> 
>> ...and that happens by the caller of freshen_packed_object (either
>> write_object_file() or hash_object_file_literally()) then calling
>> write_loose_object() if freshen_packed_object() failed. So I would have
>> expected to see a "return 0" in the case that packfile freshening was
>> disabled.
>> 
>> But that leads us to an interesting problem: how many redundant objects
>> do we expect to see on the server? It may be a lot, in which case you
>> may end up having the same IO problems for a different reason. Peff
>> mentioned to me off-list that he suspected write-tree was overeager in
>> how many trees it would try to write out. I'm not sure.
> 
> In my experience with NFS the thing that kills you is anything that
> needs to do iteration, i.e. recursive readdir() and the like; for
> reading a lot of data, throughput was excellent. It's why I hacked up
> that core.checkCollisions patch.
I have read your patch; it looks like a good idea to reduce expensive
operations like readdir(). In my production environment, the IOPS
pressure on the NFS server is what bothers me and makes git commands
slow.

> 
> Jeff improved the situation I was mainly trying to fix with the
> loose objects cache. I never got around to benchmarking the two in
> production, and now that setup is at an ex-job...
> 
>>> +test_expect_success 'do not bother loosening old objects without freshen pack time' '
>>> +	obj1=$(echo three | git hash-object -w --stdin) &&
>>> +	obj2=$(echo four | git hash-object -w --stdin) &&
>>> +	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>>> +	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
>>> +	git -c core.freshenPackFiles=false prune-packed &&
>>> +	git cat-file -p $obj1 &&
>>> +	git cat-file -p $obj2 &&
>>> +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
>>> +	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
>>> +	git cat-file -p $obj1 &&
>>> +	test_must_fail git cat-file -p $obj2
>>> +'
>> 
>> I had a little bit of a hard time following this test. AFAICT, it
>> proceeds as follows:
>> 
>>  - Write two packs, each containing a unique unreachable blob.
>>  - Call 'git prune-packed' with packfile freshening disabled, then
>>    check that the object survived.
>>  - Then repack while in a state where one of the pack's contents would
>>    be pruned.
>>  - Make sure that one object survives and the other doesn't.
>> 
>> This doesn't really seem to be testing the behavior of disabling
>> packfile freshening so much as it's testing prune-packed, and repack's
>> `--unpack-unreachable` option. I would probably have expected to see
>> something more along the lines of:
>> 
>>  - Write an unreachable object, pack it, and then remove the loose copy
>>    you wrote in the first place.
>>  - Then roll the pack's mtime to some fixed value in the past.
>>  - Try to write the same object again with packfile freshening
>>    disabled, and verify that:
>>    - the pack's mtime was unchanged,
>>    - the object exists loose again
>> 
>> But I'd really like to get some other opinions (especially from Peff,
>> who brought up the potential concerns with write-tree) as to whether or
>> not this is a direction worth pursuing.
>> 
>> My opinion is that it is not, and that the bizarre caching behavior you
>> are seeing is out of Git's control.
> 
> Thanks for this, I found the test hard to follow too, but didn't have
> time to really think about it, this makes sense.
> 
> Back to the topic: I share your sentiment of trying to avoid complexity
> in this area.
> 
> Sun: Have you considered --keep-unreachable to "git repack"? It's
> orthogonal to what you're trying here, but I wonder if being more
> aggressive about keeping objects, plus some improvements to perhaps
> skip this "touch" entirely when that is in effect, wouldn't be more
> viable & something e.g. Taylor would be more comfortable having as
> part of git.git.

Yes, I will try to create some more useful test cases with both
`--keep-unreachable` and `--unpack-unreachable` if I still believe I
need to do something with the mtime; in any case I need to get my test
reports first, thanks ;)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration
  2021-07-20 15:00       ` Sun Chao
@ 2021-07-20 16:53         ` Taylor Blau
  0 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2021-07-20 16:53 UTC (permalink / raw)
  To: Sun Chao
  Cc: Taylor Blau, Sun Chao via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, Martin Fick,
	Son Luong Ngoc, Jeff King

On Tue, Jul 20, 2021 at 11:00:18PM +0800, Sun Chao wrote:
> Meanwhile I find the description of the cached file system for NFS Client:
>    https://www.ibm.com/docs/en/aix/7.2?topic=performance-cache-file-system
> It is mentioned that:
>
>   3. To ensure that the cached directories and files are kept up to date,
>      CacheFS periodically checks the consistency of files stored in the cache.
>      It does this by comparing the current modification time to the previous
>      modification time.
>   4. If the modification times are different, all data and attributes
>      for the directory or file are purged from the cache, and new data and
>      attributes are retrieved from the back file system.
>
> It looks like reasonable, but perhaps I should check it from my test reports ;)

That seems reasonable, assuming that you have CacheFS mounted and that's
what you're interacting with (instead of talking to NFS directly).

It seems reasonable that CacheFS would also have some way to tune how
often the "purge cached blocks because of stale mtimes" would kick in,
and what the grace period for determining if an mtime is "stale" is. So
hopefully those values are (a) configurable and (b) you can find values
that result in acceptable performance.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 0/2] packfile: freshen the mtime of packfile by configuration
  2021-07-19 19:53   ` [PATCH v3] " Sun Chao via GitGitGadget
  2021-07-19 20:51     ` Taylor Blau
@ 2021-08-15 17:08     ` Sun Chao via GitGitGadget
  2021-08-15 17:08       ` [PATCH v4 1/2] packfile: rename `derive_filename()` to `derive_pack_filename()` Sun Chao via GitGitGadget
  2021-08-15 17:08       ` [PATCH v4 2/2] packfile: freshen the mtime of packfile by bump file Sun Chao via GitGitGadget
  1 sibling, 2 replies; 34+ messages in thread
From: Sun Chao via GitGitGadget @ 2021-08-15 17:08 UTC (permalink / raw)
  To: git; +Cc: Sun Chao

packfile: freshen the mtime of packfile by bump file

We've talked about the cache reload in earlier patches, and we stopped
because there was no further evidence that the NFS client will reload the
page caches when a file's mtime changes. So our team has done the
following experiments:

Step1: prepare git servers which mount the NFS disk and a big repo

We prepared 3 VMs named c1, s1 and s2; we also have an NFS server named n1.
s1 and s2 mount the NFS disk from n1 with:

    mount -t nfs -o vers=3,timeo=600,nolock,noatime,lookupcache=positive,\
    actimeo=3 <n1 ip addr>:/repositories /mnt/repositories


We set up git server services on s1 and s2, so we can clone repos from s1
with git commands. Then we created a repository under /mnt/repositories and
pushed large files to it, so the repository contains a large .pack file of
about 1.2 GB.

Step2: do the first git clone from the client after dropping caches on s1

First we drop the caches on s1:

    sync; echo 3 > /proc/sys/vm/drop_caches


Then we run a git clone in c1 to fetch the huge repository created in Step1,
while running these two commands on s1 at the same time:

    tcpdump -nn host <n1 ip addr> -w 1st_command.pcap
    nfsiostat 1 -p /mnt/repositories


and then inspect the results to see what happens.

Step3: do new git clones without dropping caches on s1

After Step2, we ran new git clone commands in c1 to clone the huge
repository several times, again running the same commands on s1:

    tcpdump -nn host <n1 ip addr> -w lots_of_command.pcap
    nfsiostat 1 -p /mnt/repositories


Step4: do new git clones with packfile mtime changed

After Step2 and Step3, we touched all the '.pack' files from s2, then ran a
new git clone in c1 to download the huge repository again, running the two
commands on s1 at the same time:

    tcpdump -nn host <n1 ip addr> -w mtime_changed_command.pcap
    nfsiostat 1 -p /mnt/repositories


Result:

We got pcap files of about 1.4 GB during Step2 and Step4; opening them with
wireshark, we can find lots of READ requests and responses. The 'nfsiostat'
output likewise shows relatively large values for 'ops/s' and 'KB/s' in its
'read' section for a while.

But in Step3 we got a 4 MB pcap file; opening it with wireshark, we can only
find GETATTR and FSSTAT requests and responses, and 'nfsiostat' always shows
0 for 'ops/s' and 'KB/s' in the 'read' part of its output.

We repeated Step1 through Step4 several times, and the result was the same each time.

So we can be sure that an NFS client will reload its page cache when another
NFS client changes the mtime of a large '.pack' file. For git servers that
keep large repositories on a filesystem like NFS, reloading large files whose
only change is their mtime puts heavy IOPS pressure on the NFS server, and it
also makes the git server slow, because IO becomes the bottleneck when many
clients request the same big repositories.

So I do think teams who manage git servers need a configuration option that
freshens the mtime of a packfile through another file, one that is small or
even empty. With its default value it stays backward compatible; but, as
Ævar mentioned earlier, some people may want to use it in a mixed-version
environment, so we should warn them in the documentation. Still, such a
configuration is a big help for teams whose servers mount NFS disks.

Sun Chao (2):
  packfile: rename `derive_filename()` to `derive_pack_filename()`
  packfile: freshen the mtime of packfile by bump file

 Documentation/config/core.txt   |  11 +++
 builtin/index-pack.c            |  19 +----
 cache.h                         |   1 +
 config.c                        |   5 ++
 environment.c                   |   1 +
 object-file.c                   |  30 +++++++-
 packfile.c                      |  25 ++++++-
 packfile.h                      |   7 ++
 t/t5326-pack-mtime-bumpfiles.sh | 118 ++++++++++++++++++++++++++++++++
 9 files changed, 198 insertions(+), 19 deletions(-)
 create mode 100755 t/t5326-pack-mtime-bumpfiles.sh


base-commit: 5d213e46bb7b880238ff5ea3914e940a50ae9369
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1043%2Fsunchao9%2Fmaster-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1043/sunchao9/master-v4
Pull-Request: https://github.com/git/git/pull/1043

Range-diff vs v3:

 -:  ----------- > 1:  81afc69d22c packfile: rename `derive_filename()` to `derive_pack_filename()`
 1:  16c68923bea ! 2:  7166f776154 packfile: freshen the mtime of packfile by configuration
     @@ Metadata
      Author: Sun Chao <16657101987@163.com>
      
       ## Commit message ##
     -    packfile: freshen the mtime of packfile by configuration
     +    packfile: freshen the mtime of packfile by bump file
      
          Commit 33d4221c79 (write_sha1_file: freshen existing objects,
          2014-10-15) avoid writing existing objects by freshen their
     @@ Commit message
          create same commit with the same tree and same environments
          such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can
          notice the '.pack' file's mtime changed. Git servers
     -    that mount the same NFS disk will re-sync the '.pack' files
     -    to cached file system which will slow the git commands.
     +    that use filesystems like NFS will reload the '.pack' files
     +    to file system page cache, which will slow the git commands.
      
     -    So if add core.freshenPackfiles to indicate whether or not
     -    packs can be freshened, turning off this option on some
     -    servers can speed up the execution of some commands on servers
     -    which use NFS disk instead of local disk.
     +    So if we freshen the mtime of packfile by updating a '.bump'
     +    file instead, when we check the mtime of packfile, get it from
     +    '.bump' file also. Large git repository may contains large
     +    '.pack' files, but '.bump' files can be empty. This will avoid
     +    file system page caches reload large files from NFS and then
     +    make git commands faster.
      
          Signed-off-by: Sun Chao <16657101987@163.com>
      
     @@ Documentation/config/core.txt: the largest projects.  You probably do not need t
       +
       Common unit suffixes of 'k', 'm', or 'g' are supported.
       
     -+core.freshenPackFiles::
     ++core.packMtimeToBumpFiles::
      +	Normally we avoid writing existing object by freshening the mtime
      +	of the *.pack file which contains it in order to aid some processes
     -+	such like prune. Turning off this option on some servers can speed
     -+	up the execution of some commands like 'git-upload-pack'(e.g. some
     -+	servers that mount the same NFS disk will re-sync the *.pack files
     -+	to cached file system if the mtime cahnges).
     ++	such like prune. Use a *.bump file instead of *.pack file will
     ++	avoid file system cache re-sync the large packfiles on filesystems
     ++	like NFS, and consequently make git commands faster.
      ++
     -+The default is true which means the *.pack file will be freshened if we
     -+want to write a existing object whthin it.
     ++The default is 'false' which means the *.pack file will be freshened by
     ++default. If set to 'true', the file with the '.bump' suffix will be
     ++created automatically, and it's mtime will be freshened instead.
      +
       core.deltaBaseCacheLimit::
       	Maximum number of bytes per thread to reserve for caching base objects
       	that may be referenced by multiple deltified objects.  By storing the
      
       ## cache.h ##
     -@@ cache.h: extern size_t packed_git_limit;
     +@@ cache.h: extern const char *git_hooks_path;
     + extern int zlib_compression_level;
     + extern int core_compression_level;
     + extern int pack_compression_level;
     ++extern int pack_mtime_to_bumpfiles;
     + extern size_t packed_git_window_size;
     + extern size_t packed_git_limit;
       extern size_t delta_base_cache_limit;
     - extern unsigned long big_file_threshold;
     - extern unsigned long pack_size_limit_cfg;
     -+extern int core_freshen_packfiles;
     - 
     - /*
     -  * Accessors for the core.sharedrepository config which lazy-load the value
      
       ## config.c ##
      @@ config.c: static int git_default_core_config(const char *var, const char *value, void *cb)
       		return 0;
       	}
       
     -+	if (!strcmp(var, "core.freshenpackfiles")) {
     -+		core_freshen_packfiles = git_config_bool(var, value);
     ++	if (!strcmp(var, "core.packmtimetobumpfiles")) {
     ++		pack_mtime_to_bumpfiles = git_config_bool(var, value);
     ++		return 0;
      +	}
      +
       	if (!strcmp(var, "core.deltabasecachelimit")) {
     @@ config.c: static int git_default_core_config(const char *var, const char *value,
       		return 0;
      
       ## environment.c ##
     -@@ environment.c: int core_sparse_checkout_cone;
     - int merge_log_config = -1;
     - int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
     - unsigned long pack_size_limit_cfg;
     -+int core_freshen_packfiles = 1;
     - enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET;
     - 
     - #ifndef PROTECT_HFS_DEFAULT
     +@@ environment.c: const char *git_hooks_path;
     + int zlib_compression_level = Z_BEST_SPEED;
     + int core_compression_level;
     + int pack_compression_level = Z_DEFAULT_COMPRESSION;
     ++int pack_mtime_to_bumpfiles;
     + int fsync_object_files;
     + size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
     + size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
      
       ## object-file.c ##
      @@ object-file.c: static int freshen_loose_object(const struct object_id *oid)
       static int freshen_packed_object(const struct object_id *oid)
       {
       	struct pack_entry e;
     -+
     -+	if (!core_freshen_packfiles)
     -+		return 1;
     ++	struct stat st;
     ++	struct strbuf name_buf = STRBUF_INIT;
     ++	const char *filename;
      +
       	if (!find_pack_entry(the_repository, oid, &e))
       		return 0;
       	if (e.p->freshened)
     + 		return 1;
     +-	if (!freshen_file(e.p->pack_name))
     +-		return 0;
     ++
     ++	filename = e.p->pack_name;
     ++	if (!pack_mtime_to_bumpfiles) {
     ++		if (!freshen_file(filename))
     ++			return 0;
     ++		e.p->freshened = 1;
     ++		return 1;
     ++	}
     ++
     ++	filename = derive_pack_filename(filename, "pack", "bump", &name_buf);
     ++	if (lstat(filename, &st) < 0) {
     ++		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
     ++		if (fd < 0) {
     ++			// here we need to check it again because other git process may created it
     ++			if (lstat(filename, &st) < 0)
     ++				die_errno("unable to create '%s'", filename);
     ++		} else {
     ++			close(fd);
     ++		}
     ++	} else {
     ++		if (!freshen_file(filename))
     ++			return 0;
     ++	}
     ++
     + 	e.p->freshened = 1;
     + 	return 1;
     + }
      
     - ## t/t7701-repack-unpack-unreachable.sh ##
     -@@ t/t7701-repack-unpack-unreachable.sh: test_expect_success 'do not bother loosening old objects' '
     - 	test_must_fail git cat-file -p $obj2
     - '
     + ## packfile.c ##
     +@@ packfile.c: void close_object_store(struct raw_object_store *o)
       
     -+test_expect_success 'do not bother loosening old objects without freshen pack time' '
     -+	obj1=$(echo three | git hash-object -w --stdin) &&
     -+	obj2=$(echo four | git hash-object -w --stdin) &&
     -+	pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
     -+	pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) &&
     -+	git -c core.freshenPackFiles=false prune-packed &&
     + void unlink_pack_path(const char *pack_name, int force_delete)
     + {
     +-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
     ++	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".bump"};
     + 	int i;
     + 	struct strbuf buf = STRBUF_INIT;
     + 	size_t plen;
     +@@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
     + 	p->pack_size = st.st_size;
     + 	p->pack_local = local;
     + 	p->mtime = st.st_mtime;
     ++
     ++	if (pack_mtime_to_bumpfiles) {
     ++		struct strbuf name_buf = STRBUF_INIT;
     ++		const char *filename;
     ++
     ++		filename = derive_pack_filename(path, "idx", "bump", &name_buf);
     ++		if (!stat(filename, &st)) {
     ++			p->mtime = st.st_mtime;
     ++		}
     ++	}
     + 	if (path_len < the_hash_algo->hexsz ||
     + 	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
     + 		hashclr(p->hash);
     +
     + ## t/t5326-pack-mtime-bumpfiles.sh (new) ##
     +@@
     ++#!/bin/sh
     ++
     ++test_description='packfile mtime use bump files'
     ++. ./test-lib.sh
     ++
     ++if stat -c %Y . >/dev/null 2>&1; then
     ++    get_modified_time() { stat -c %Y "$1" 2>/dev/null; }
     ++elif stat -f %m . >/dev/null 2>&1; then
     ++    get_modified_time() { stat -f %m "$1" 2>/dev/null; }
     ++elif date -r . +%s >/dev/null 2>&1; then
     ++    get_modified_time() { date -r "$1" +%s 2>/dev/null; }
     ++else
     ++    echo 'get_modified_time() is unsupported' >&2
     ++    get_modified_time() { printf '%s' 0; }
     ++fi
     ++
     ++test_expect_success 'freshen existing packfile without core.packMtimeToBumpFiles' '
     ++	obj1=$(echo one | git hash-object -w --stdin) &&
     ++	obj2=$(echo two | git hash-object -w --stdin) &&
     ++	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
     ++	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
     ++	test-tool chmtime =-60 .git/objects/pack/pack-$pack1.* &&
     ++	test-tool chmtime =-60 .git/objects/pack/pack-$pack2.* &&
     ++	pack1_mtime=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
     ++	pack2_mtime=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
     ++	(echo one | git hash-object -w --stdin) &&
     ++	! test_path_exists .git/objects/pack/pack-$pack1.bump &&
     ++	! test_path_exists .git/objects/pack/pack-$pack2.bump &&
     ++	pack1_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
     ++	pack2_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
     ++	echo "$pack1_mtime : $pack1_mtime_new" &&
     ++	test ! "$pack1_mtime" = "$pack1_mtime_new" &&
     ++	test "$pack2_mtime" = "$pack2_mtime_new"
     ++
     ++'
     ++
     ++test_expect_success 'freshen existing packfile with core.packMtimeToBumpFiles' '
     ++
     ++	rm -rf .git/objects && git init &&
     ++	obj1=$(echo one | git hash-object -w --stdin) &&
     ++	obj2=$(echo two | git hash-object -w --stdin) &&
     ++	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
     ++	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
     ++	test-tool chmtime =-60 .git/objects/pack/pack-$pack1.* &&
     ++	test-tool chmtime =-60 .git/objects/pack/pack-$pack2.* &&
     ++	pack1_mtime=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
     ++	pack2_mtime=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
     ++	(echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin) &&
     ++	test_path_exists .git/objects/pack/pack-$pack1.bump &&
     ++	! test_path_exists .git/objects/pack/pack-$pack2.bump &&
     ++	pack1_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
     ++	pack2_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
     ++	test "$pack1_mtime" = "$pack1_mtime_new" &&
     ++	test "$pack2_mtime" = "$pack2_mtime_new"
     ++
     ++'
     ++
     ++test_expect_success 'repack prune unreachable objects without core.packMtimeToBumpFiles' '
     ++
     ++	rm -rf .git/objects && git init &&
     ++	obj1=$(echo one | git hash-object -w --stdin) &&
     ++	obj2=$(echo two | git hash-object -w --stdin) &&
     ++	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
     ++	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
     ++	echo one | git hash-object -w --stdin &&
     ++	echo two | git hash-object -w --stdin &&
     ++	! test_path_exists .git/objects/pack/pack-$pack1.bump &&
     ++	! test_path_exists .git/objects/pack/pack-$pack2.bump &&
     ++	git prune-packed &&
      +	git cat-file -p $obj1 &&
      +	git cat-file -p $obj2 &&
      +	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
     -+	git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago &&
     ++	git repack -A -d --unpack-unreachable=1.hour.ago &&
     ++	git cat-file -p $obj1 &&
     ++	test_must_fail git cat-file -p $obj2
     ++
     ++'
     ++
     ++test_expect_success 'repack prune unreachable objects with core.packMtimeToBumpFiles and bump files' '
     ++
     ++	rm -rf .git/objects && git init &&
     ++	obj1=$(echo one | git hash-object -w --stdin) &&
     ++	obj2=$(echo two | git hash-object -w --stdin) &&
     ++	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
     ++	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
     ++	echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
     ++	echo two | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
     ++	test_path_exists .git/objects/pack/pack-$pack1.bump &&
     ++	test_path_exists .git/objects/pack/pack-$pack2.bump &&
     ++	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
     ++	git -c core.packMtimeToBumpFiles=true repack -A -d --unpack-unreachable=1.hour.ago &&
     ++	git cat-file -p $obj1 &&
     ++	git cat-file -p $obj2
     ++
     ++'
     ++
     ++test_expect_success 'repack prune unreachable objects with core.packMtimeToBumpFiles and old bump files' '
     ++
     ++	rm -rf .git/objects && git init &&
     ++	obj1=$(echo one | git hash-object -w --stdin) &&
     ++	obj2=$(echo two | git hash-object -w --stdin) &&
     ++	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
     ++	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
     ++	echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
     ++	echo two | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
     ++	test_path_exists .git/objects/pack/pack-$pack1.bump &&
     ++	test_path_exists .git/objects/pack/pack-$pack2.bump &&
     ++	git prune-packed &&
     ++	git cat-file -p $obj1 &&
     ++	git cat-file -p $obj2 &&
     ++	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
     ++	git -c core.packMtimeToBumpFiles=true repack -A -d --unpack-unreachable=1.hour.ago &&
      +	git cat-file -p $obj1 &&
      +	test_must_fail git cat-file -p $obj2
     ++
      +'
      +
     - test_expect_success 'keep packed objects found only in index' '
     - 	echo my-unique-content >file &&
     - 	git add file &&
     ++test_done

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 1/2] packfile: rename `derive_filename()` to `derive_pack_filename()`
  2021-08-15 17:08     ` [PATCH v4 0/2] " Sun Chao via GitGitGadget
@ 2021-08-15 17:08       ` Sun Chao via GitGitGadget
  2021-08-15 17:08       ` [PATCH v4 2/2] packfile: freshen the mtime of packfile by bump file Sun Chao via GitGitGadget
  1 sibling, 0 replies; 34+ messages in thread
From: Sun Chao via GitGitGadget @ 2021-08-15 17:08 UTC (permalink / raw)
  To: git; +Cc: Sun Chao, Sun Chao

From: Sun Chao <16657101987@163.com>

In order to allow other functions to derive a new file name with a
different suffix from a `.pack` file, move `derive_filename()` from
`builtin/index-pack.c` to `packfile.c` under the new name
`derive_pack_filename()`, and export it from `packfile.h`.
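
The helper's behavior can be sketched standalone like this (a heap-allocating
stand-in for git's strbuf-based helper; the name `derive_suffix` and the
NULL-on-error convention are illustrative, the real helper die()s instead):

```c
#include <stdlib.h>
#include <string.h>

/*
 * Sketch of what derive_pack_filename() does: strip a known suffix
 * ("pack") and append a new one ("idx", "rev", "bump", ...), keeping
 * the dot, e.g. "pack-123.pack" + "bump" -> "pack-123.bump".
 */
static char *derive_suffix(const char *name, const char *strip,
			   const char *suffix)
{
	size_t len = strlen(name), striplen = strlen(strip);
	char *out;

	/* the name must end with ".<strip>" */
	if (len <= striplen + 1 ||
	    name[len - striplen - 1] != '.' ||
	    strcmp(name + len - striplen, strip))
		return NULL;

	out = malloc(len - striplen + strlen(suffix) + 1);
	if (!out)
		return NULL;
	memcpy(out, name, len - striplen);	/* copy base plus "." */
	strcpy(out + len - striplen, suffix);
	return out;
}
```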

Signed-off-by: Sun Chao <16657101987@163.com>
---
 builtin/index-pack.c | 19 +++----------------
 packfile.c           | 13 +++++++++++++
 packfile.h           |  7 +++++++
 3 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 8336466865c..3c83789ccef 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1435,19 +1435,6 @@ static void fix_unresolved_deltas(struct hashfile *f)
 	free(sorted_by_pos);
 }
 
-static const char *derive_filename(const char *pack_name, const char *strip,
-				   const char *suffix, struct strbuf *buf)
-{
-	size_t len;
-	if (!strip_suffix(pack_name, strip, &len) || !len ||
-	    pack_name[len - 1] != '.')
-		die(_("packfile name '%s' does not end with '.%s'"),
-		    pack_name, strip);
-	strbuf_add(buf, pack_name, len);
-	strbuf_addstr(buf, suffix);
-	return buf->buf;
-}
-
 static void write_special_file(const char *suffix, const char *msg,
 			       const char *pack_name, const unsigned char *hash,
 			       const char **report)
@@ -1458,7 +1445,7 @@ static void write_special_file(const char *suffix, const char *msg,
 	int msg_len = strlen(msg);
 
 	if (pack_name)
-		filename = derive_filename(pack_name, "pack", suffix, &name_buf);
+		filename = derive_pack_filename(pack_name, "pack", suffix, &name_buf);
 	else
 		filename = odb_pack_name(&name_buf, hash, suffix);
 
@@ -1853,13 +1840,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (from_stdin && hash_algo)
 		die(_("--object-format cannot be used with --stdin"));
 	if (!index_name && pack_name)
-		index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf);
+		index_name = derive_pack_filename(pack_name, "pack", "idx", &index_name_buf);
 
 	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
 	if (rev_index) {
 		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
 		if (index_name)
-			rev_index_name = derive_filename(index_name,
+			rev_index_name = derive_pack_filename(index_name,
 							 "idx", "rev",
 							 &rev_index_name_buf);
 	}
diff --git a/packfile.c b/packfile.c
index 9ef6d982928..315c3da259a 100644
--- a/packfile.c
+++ b/packfile.c
@@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1)
 	return odb_pack_name(&buf, sha1, "idx");
 }
 
+const char *derive_pack_filename(const char *pack_name, const char *strip,
+				const char *suffix, struct strbuf *buf)
+{
+	size_t len;
+	if (!strip_suffix(pack_name, strip, &len) || !len ||
+	    pack_name[len - 1] != '.')
+		die(_("packfile name '%s' does not end with '.%s'"),
+		    pack_name, strip);
+	strbuf_add(buf, pack_name, len);
+	strbuf_addstr(buf, suffix);
+	return buf->buf;
+}
+
 static unsigned int pack_used_ctr;
 static unsigned int pack_mmap_calls;
 static unsigned int peak_pack_open_windows;
diff --git a/packfile.h b/packfile.h
index 3ae117a8aef..ff702b22e6a 100644
--- a/packfile.h
+++ b/packfile.h
@@ -31,6 +31,13 @@ char *sha1_pack_name(const unsigned char *sha1);
  */
 char *sha1_pack_index_name(const unsigned char *sha1);
 
+/*
+ * Return the corresponding filename with given suffix from "file_name"
+ * which must has "strip" suffix.
+ */
+const char *derive_pack_filename(const char *file_name, const char *strip,
+		const char *suffix, struct strbuf *buf);
+
 /*
  * Return the basename of the packfile, omitting any containing directory
  * (e.g., "pack-1234abcd[...].pack").
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 2/2] packfile: freshen the mtime of packfile by bump file
  2021-08-15 17:08     ` [PATCH v4 0/2] " Sun Chao via GitGitGadget
  2021-08-15 17:08       ` [PATCH v4 1/2] packfile: rename `derive_filename()` to `derive_pack_filename()` Sun Chao via GitGitGadget
@ 2021-08-15 17:08       ` Sun Chao via GitGitGadget
  1 sibling, 0 replies; 34+ messages in thread
From: Sun Chao via GitGitGadget @ 2021-08-15 17:08 UTC (permalink / raw)
  To: git; +Cc: Sun Chao, Sun Chao

From: Sun Chao <16657101987@163.com>

Commit 33d4221c79 (write_sha1_file: freshen existing objects,
2014-10-15) avoids writing existing objects by freshening their
mtime (in particular that of the packfiles containing them) in
order to aid correct caching, so that processes like
find_lru_pack can make good decisions. However, this is
unfriendly to incremental backup jobs, and to services that rely
on the file system cache, when large '.pack' files exist.

For example, after packing all objects, if we use 'write-tree'
to create the same commit with the same tree and the same
environment variables such as GIT_COMMITTER_DATE and
GIT_AUTHOR_DATE, we can see that the '.pack' file's mtime
changes. Git servers that use filesystems like NFS will then
reload the '.pack' files into the file system page cache, which
slows down git commands.

So instead we freshen the mtime of the packfile by updating a
'.bump' file, and when we check the mtime of the packfile we
read it from the '.bump' file as well. A large git repository
may contain large '.pack' files, but the '.bump' files can be
empty. This avoids having the file system page cache reload
large files from NFS, and thus makes git commands faster.

Signed-off-by: Sun Chao <16657101987@163.com>
---
 Documentation/config/core.txt   |  11 +++
 cache.h                         |   1 +
 config.c                        |   5 ++
 environment.c                   |   1 +
 object-file.c                   |  30 +++++++-
 packfile.c                      |  12 +++-
 t/t5326-pack-mtime-bumpfiles.sh | 118 ++++++++++++++++++++++++++++++++
 7 files changed, 175 insertions(+), 3 deletions(-)
 create mode 100755 t/t5326-pack-mtime-bumpfiles.sh

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a1..963d1b54e7e 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -398,6 +398,17 @@ the largest projects.  You probably do not need to adjust this value.
 +
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
+core.packMtimeToBumpFiles::
+	Normally we avoid writing existing object by freshening the mtime
+	of the *.pack file which contains it in order to aid some processes
+	such like prune. Use a *.bump file instead of *.pack file will
+	avoid file system cache re-sync the large packfiles on filesystems
+	like NFS, and consequently make git commands faster.
++
+The default is 'false' which means the *.pack file will be freshened by
+default. If set to 'true', the file with the '.bump' suffix will be
+created automatically, and it's mtime will be freshened instead.
+
 core.deltaBaseCacheLimit::
 	Maximum number of bytes per thread to reserve for caching base objects
 	that may be referenced by multiple deltified objects.  By storing the
diff --git a/cache.h b/cache.h
index bd4869beee4..a563cbacfa2 100644
--- a/cache.h
+++ b/cache.h
@@ -960,6 +960,7 @@ extern const char *git_hooks_path;
 extern int zlib_compression_level;
 extern int core_compression_level;
 extern int pack_compression_level;
+extern int pack_mtime_to_bumpfiles;
 extern size_t packed_git_window_size;
 extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
diff --git a/config.c b/config.c
index f33abeab851..10ccf7c5581 100644
--- a/config.c
+++ b/config.c
@@ -1431,6 +1431,11 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.packmtimetobumpfiles")) {
+		pack_mtime_to_bumpfiles = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.deltabasecachelimit")) {
 		delta_base_cache_limit = git_config_ulong(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index d6b22ede7ea..5fa26cb3758 100644
--- a/environment.c
+++ b/environment.c
@@ -43,6 +43,7 @@ const char *git_hooks_path;
 int zlib_compression_level = Z_BEST_SPEED;
 int core_compression_level;
 int pack_compression_level = Z_DEFAULT_COMPRESSION;
+int pack_mtime_to_bumpfiles;
 int fsync_object_files;
 size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
diff --git a/object-file.c b/object-file.c
index a8be8994814..434073c17f1 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1994,12 +1994,38 @@ static int freshen_loose_object(const struct object_id *oid)
 static int freshen_packed_object(const struct object_id *oid)
 {
 	struct pack_entry e;
+	struct stat st;
+	struct strbuf name_buf = STRBUF_INIT;
+	const char *filename;
+
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
 	if (e.p->freshened)
 		return 1;
-	if (!freshen_file(e.p->pack_name))
-		return 0;
+
+	filename = e.p->pack_name;
+	if (!pack_mtime_to_bumpfiles) {
+		if (!freshen_file(filename))
+			return 0;
+		e.p->freshened = 1;
+		return 1;
+	}
+
+	filename = derive_pack_filename(filename, "pack", "bump", &name_buf);
+	if (lstat(filename, &st) < 0) {
+		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
+		if (fd < 0) {
+			// here we need to check it again because other git process may created it
+			if (lstat(filename, &st) < 0)
+				die_errno("unable to create '%s'", filename);
+		} else {
+			close(fd);
+		}
+	} else {
+		if (!freshen_file(filename))
+			return 0;
+	}
+
 	e.p->freshened = 1;
 	return 1;
 }
diff --git a/packfile.c b/packfile.c
index 315c3da259a..f5cee440601 100644
--- a/packfile.c
+++ b/packfile.c
@@ -374,7 +374,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".bump"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -741,6 +741,16 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	p->pack_size = st.st_size;
 	p->pack_local = local;
 	p->mtime = st.st_mtime;
+
+	if (pack_mtime_to_bumpfiles) {
+		struct strbuf name_buf = STRBUF_INIT;
+		const char *filename;
+
+		filename = derive_pack_filename(path, "idx", "bump", &name_buf);
+		if (!stat(filename, &st)) {
+			p->mtime = st.st_mtime;
+		}
+	}
 	if (path_len < the_hash_algo->hexsz ||
 	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
 		hashclr(p->hash);
diff --git a/t/t5326-pack-mtime-bumpfiles.sh b/t/t5326-pack-mtime-bumpfiles.sh
new file mode 100755
index 00000000000..d6d9e6dc446
--- /dev/null
+++ b/t/t5326-pack-mtime-bumpfiles.sh
@@ -0,0 +1,118 @@
+#!/bin/sh
+
+test_description='packfile mtime use bump files'
+. ./test-lib.sh
+
+if stat -c %Y . >/dev/null 2>&1; then
+    get_modified_time() { stat -c %Y "$1" 2>/dev/null; }
+elif stat -f %m . >/dev/null 2>&1; then
+    get_modified_time() { stat -f %m "$1" 2>/dev/null; }
+elif date -r . +%s >/dev/null 2>&1; then
+    get_modified_time() { date -r "$1" +%s 2>/dev/null; }
+else
+    echo 'get_modified_time() is unsupported' >&2
+    get_modified_time() { printf '%s' 0; }
+fi
+
+test_expect_success 'freshen existing packfile without core.packMtimeToBumpFiles' '
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	test-tool chmtime =-60 .git/objects/pack/pack-$pack1.* &&
+	test-tool chmtime =-60 .git/objects/pack/pack-$pack2.* &&
+	pack1_mtime=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
+	pack2_mtime=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
+	(echo one | git hash-object -w --stdin) &&
+	! test_path_exists .git/objects/pack/pack-$pack1.bump &&
+	! test_path_exists .git/objects/pack/pack-$pack2.bump &&
+	pack1_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
+	pack2_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
+	echo "$pack1_mtime : $pack1_mtime_new" &&
+	test ! "$pack1_mtime" = "$pack1_mtime_new" &&
+	test "$pack2_mtime" = "$pack2_mtime_new"
+
+'
+
+test_expect_success 'freshen existing packfile with core.packMtimeToBumpFiles' '
+
+	rm -rf .git/objects && git init &&
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	test-tool chmtime =-60 .git/objects/pack/pack-$pack1.* &&
+	test-tool chmtime =-60 .git/objects/pack/pack-$pack2.* &&
+	pack1_mtime=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
+	pack2_mtime=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
+	(echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin) &&
+	test_path_exists .git/objects/pack/pack-$pack1.bump &&
+	! test_path_exists .git/objects/pack/pack-$pack2.bump &&
+	pack1_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
+	pack2_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
+	test "$pack1_mtime" = "$pack1_mtime_new" &&
+	test "$pack2_mtime" = "$pack2_mtime_new"
+
+'
+
+test_expect_success 'repack prune unreachable objects without core.packMtimeToBumpFiles' '
+
+	rm -rf .git/objects && git init &&
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	echo one | git hash-object -w --stdin &&
+	echo two | git hash-object -w --stdin &&
+	test_path_is_missing .git/objects/pack/pack-$pack1.bump &&
+	test_path_is_missing .git/objects/pack/pack-$pack2.bump &&
+	git prune-packed &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	git repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	test_must_fail git cat-file -p $obj2
+
+'
+
+test_expect_success 'repack prune unreachable objects with core.packMtimeToBumpFiles and bump files' '
+
+	rm -rf .git/objects && git init &&
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
+	echo two | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
+	test_path_exists .git/objects/pack/pack-$pack1.bump &&
+	test_path_exists .git/objects/pack/pack-$pack2.bump &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	git -c core.packMtimeToBumpFiles=true repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2
+
+'
+
+test_expect_success 'repack prune unreachable objects with core.packMtimeToBumpFiles and old bump files' '
+
+	rm -rf .git/objects && git init &&
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
+	echo two | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
+	test_path_exists .git/objects/pack/pack-$pack1.bump &&
+	test_path_exists .git/objects/pack/pack-$pack2.bump &&
+	git prune-packed &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
+	git -c core.packMtimeToBumpFiles=true repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	test_must_fail git cat-file -p $obj2
+
+'
+
+test_done
-- 
gitgitgadget
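[Archive note: the quoted hunk above opens mid-way through the patch's `get_modified_time` fallback definition; only its `else` branch is visible. For illustration only, a portable helper of this shape could be written as follows. This is a hypothetical sketch, not the code from the patch: the `stat` probing and format strings are assumptions based on common GNU and BSD conventions.]

```shell
# Illustrative sketch, NOT taken from the patch: probe which stat(1)
# flavor works, then define get_modified_time to print a file's mtime
# in seconds since the epoch.
if stat -c %Y . >/dev/null 2>&1
then
	# GNU coreutils stat: %Y prints mtime as epoch seconds
	get_modified_time() { stat -c %Y "$1"; }
elif stat -f %m . >/dev/null 2>&1
then
	# BSD/macOS stat: %m prints the same value
	get_modified_time() { stat -f %m "$1"; }
else
	echo 'get_modified_time() is unsupported' >&2
	get_modified_time() { printf '%s' 0; }
fi
```

The tests only compare the helper's output for equality, so any monotonic per-file timestamp representation would do; epoch seconds keep it comparable with `date +%s`.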

Thread overview: 34+ messages
2021-07-10 19:01 [PATCH] packfile: enhance the mtime of packfile by idx file Sun Chao via GitGitGadget
2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason
2021-07-12 16:17   ` Sun Chao
2021-07-14  1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget
2021-07-14  1:39   ` Ævar Arnfjörð Bjarmason
2021-07-14  2:52     ` Taylor Blau
2021-07-14 16:46       ` Sun Chao
2021-07-14 17:04         ` Taylor Blau
2021-07-14 18:19           ` Ævar Arnfjörð Bjarmason
2021-07-14 19:11             ` Martin Fick
2021-07-14 19:41               ` Ævar Arnfjörð Bjarmason
2021-07-14 20:20                 ` Martin Fick
2021-07-20  6:32                   ` Ævar Arnfjörð Bjarmason
2021-07-15  8:23                 ` Son Luong Ngoc
2021-07-20  6:29                   ` Ævar Arnfjörð Bjarmason
2021-07-14 19:30             ` Taylor Blau
2021-07-14 19:32               ` Ævar Arnfjörð Bjarmason
2021-07-14 19:52                 ` Taylor Blau
2021-07-14 21:40               ` Junio C Hamano
2021-07-15 16:30           ` Sun Chao
2021-07-15 16:42             ` Taylor Blau
2021-07-15 16:48               ` Sun Chao
2021-07-14 16:11     ` Sun Chao
2021-07-19 19:53   ` [PATCH v3] " Sun Chao via GitGitGadget
2021-07-19 20:51     ` Taylor Blau
2021-07-20  0:07       ` Junio C Hamano
2021-07-20 15:07         ` Sun Chao
2021-07-20  6:19       ` Ævar Arnfjörð Bjarmason
2021-07-20 15:34         ` Sun Chao
2021-07-20 15:00       ` Sun Chao
2021-07-20 16:53         ` Taylor Blau
2021-08-15 17:08     ` [PATCH v4 0/2] " Sun Chao via GitGitGadget
2021-08-15 17:08       ` [PATCH v4 1/2] packfile: rename `derive_filename()` to `derive_pack_filename()` Sun Chao via GitGitGadget
2021-08-15 17:08       ` [PATCH v4 2/2] packfile: freshen the mtime of packfile by bump file Sun Chao via GitGitGadget
